/g/ - Technology


Thread archived.
You cannot reply anymore.




File: scraper.png (1.62 MB, 1892x2142)
Web Scraping General

Clever edition

QOTD: What's easier? Parsing HTML or reverse engineering private APIs?

FAQ: rentry co/t6237g7x

> Captcha services
https://2captcha.com/
https://www.capsolver.com/
https://anti-captcha.com/

> Proxies
https://hproxy.com/ (no blacklist) (recommended, owned by friend of /wsg/)
https://infiniteproxies.com/ (no blacklist)
https://www.thunderproxies.com/
http://proxies.fo/ (not recommended)

> Network analysis
https://mitmproxy.org/
https://portswigger.net/burp

> Scraping tools
https://beautiful-soup-4.readthedocs.io/en/latest/
https://www.selenium.dev/documentation/
https://playwright.dev/docs/codegen
https://github.com/lwthiker/curl-impersonate
https://github.com/yifeikong/curl_cffi

Official Telegram: @scrapists
Last thread: >>101177514
>>
Does anyone have any resources or advice for reverse engineering angular web apps? A site I used to scrape has been (((upgraded))) to an SPA with infinite scroll and it's giving me grief.
The site is now so JS-heavy I have to use Playwright, and I keep running out of memory. Each time I load a new page they send 2 MB of data to the client, and it's never freed, so memory grows until my scraper crashes. Reloading the page takes me back to the beginning with no way to jump to the last page scraped.
The site is running in production mode and I can't get ng.probe() or calls to angular.element(...).scope() to work.
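One stopgap while you reverse the API: since the data arrives as XHR payloads anyway, you can capture the responses instead of parsing the DOM, and recycle the browser context every so often so the heap can't balloon. A sketch using Playwright's sync Python API; the "/api/" filter and the `{"items": [...]}` response shape are invented, and recycling still restarts the scroll, so it's a band-aid, not a fix:

```python
def pick_items(url: str, payload: dict) -> list:
    """Keep only the data-bearing responses (hypothetical shape)."""
    return payload.get("items", []) if "/api/" in url else []

def capture(url, scrolls=200, recycle_every=50):
    # imported lazily so pick_items stays testable without playwright installed
    from playwright.sync_api import sync_playwright
    rows = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = None
        for i in range(scrolls):
            if i % recycle_every == 0:
                if page:
                    page.context.close()  # frees the leaked JS heap
                page = browser.new_context().new_page()
                page.on("response", lambda r: rows.extend(
                    pick_items(r.url, r.json())
                    if "json" in r.headers.get("content-type", "") else []))
                page.goto(url)
            page.mouse.wheel(0, 10_000)   # drive the infinite scroll
            page.wait_for_timeout(500)
        browser.close()
    return rows
```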
>>
>>101208716
That means the data you're trying to scrape is almost certainly being requested via some sort of undocumented API. You might be able to figure out which endpoints the JS on the site is hitting by opening the network tab in inspect element.
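Once the network tab gives up the endpoint, paging through it directly usually beats driving a browser. A sketch with requests; the /api/posts path, the "page" parameter, and the "next"/"posts" fields are all guesses to adapt to whatever you actually see:

```python
import requests

def fetch_page(session, base, page):
    """Hit the (hypothetical) JSON endpoint the SPA was calling."""
    r = session.get(f"{base}/api/posts", params={"page": page},
                    headers={"X-Requested-With": "XMLHttpRequest"})
    r.raise_for_status()
    return r.json()

def next_page_exists(payload):
    """Many paginated APIs expose a next link or cursor; adapt to yours."""
    return bool(payload.get("next"))

def scrape_all(base):
    s, page, out = requests.Session(), 1, []
    while True:
        payload = fetch_page(s, base, page)
        out.extend(payload.get("posts", []))
        if not next_page_exists(payload):
            return out
        page += 1
```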
>>
File: scraper.png (443 KB, 1080x606)
>>101208751
kek I was hoping this wasn't going to be the answer. They do have a private API but it's one of the most horrible, byzantine monstrosities I've seen. It will take a while to reverse engineer
>>
>>101208893
What's wrong with it specifically? Most of them are easier to deal with than parsing HTML
>>
How do I scrape twitter? Rate limits, account blocks, it seems impossible now, unless you invest in proxies and maybe human captcha solvers.
>>
>>101209154
Browser with a userscript. The userscript copies the post data and then you send it to a 127.0.0.1:8081 backend that collects it.

Bonus: You are using a real browser
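The 127.0.0.1:8081 collector side of this can be pure stdlib. A sketch; the payload shape is whatever your userscript decides to POST:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

POSTS = []

def store(raw: bytes) -> dict:
    """Decode one userscript payload and stash it; returns the parsed post."""
    post = json.loads(raw)
    POSTS.append(post)
    return post

class Collector(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        store(self.rfile.read(length))
        self.send_response(204)  # no body needed back to the userscript
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8081), Collector).serve_forever()
```

The userscript side is just a `fetch("http://127.0.0.1:8081", {method: "POST", body: JSON.stringify(post)})` per post it sees.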
>>
>>101208241
Why does it feel so good to scroop mobile apis?
>>
>>101209203
I don't think that helps. The rate limiting or viewing limit applies to normal human users too. They do it to get people to pay for twitter. The account banning only started when I tried to use multiple accounts in rotation.
>>
>>101209424

For me it's probably the cozy setup of an Android app with a system certificate and mitmproxy. Maybe it's that what you need is easier to get from the mobile API than from the desktop site.


>want to make a telegram bot
>use it to send request to planet fitness
>to different planet fitness gyms to see live update of occupancy

Why? Just so I can see if it's too crowded for my liking.

For some fucking reason, setting "user-agent" to anything Mozilla gets a 403 status code. I then used the one Insomnia defaults to and it works. Fucking weird; maybe if I tried the mobile API I wouldn't have to worry about the user-agent.
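A quick way to pin down header-based blocks like this is to vary only the User-Agent and compare status codes. A sketch; the candidate strings are examples, not known-good values for that site:

```python
import requests

CANDIDATES = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "insomnia/8.0.0",           # the client default that worked for anon
    "python-requests/2.31.0",
]

def ua_status(url, ua):
    """Return the HTTP status for one User-Agent, nothing else varied."""
    return requests.get(url, headers={"User-Agent": ua}, timeout=10).status_code

def survivors(results):
    """Keep the UAs that didn't trip the WAF (no 403/429)."""
    return [ua for ua, code in results if code not in (403, 429)]
```

Usage would be `survivors([(ua, ua_status(url, ua)) for ua in CANDIDATES])`.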
>>
>>101209424
Best of both worlds in the OP image.

>>101209614
Any other scraping wins?
>>
>>101209154
Just try scraping nitter instances. Of course, be prudent about it. Don't be that one asshole that forces a nitter instance owner to implement (((captchas))) and (((ratelimits)))
>>
>>101208241
I am a WAFag making money 429ing you.
Thank you for providing me with gainful employment.
Ask away if you want to know about WAFaggotry.
>pic semi related
>>
>>101209727

> Thank you for providing me with gainful employment.
Works out for both of us, since we'll just get around it anyways

> Ask away if you want to know about WAFaggotry.
Yeah, do you know why this one site keeps 403 and 429ing me? I'm using a diff IP (residential) and TLS fingerprint (chrome and firefox, headers spoofed too) on each attempt but keep getting 403'd and 429'd. Site's behind CF, program I'm writing basically just logs into a bunch of different accounts and scrapes different info from parts of the website. For some reason, one login attempt won't get me blocked but a bunch will
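One thing worth ruling out here: if the TLS fingerprint says Chrome but the headers say Firefox, that mismatch is itself a signal. A sketch using curl_cffi (linked in the OP); the exact impersonate tag varies by curl_cffi release, and the consistency check is a deliberately simplistic illustration:

```python
def fingerprint_matches(impersonate: str, user_agent: str) -> bool:
    """A Chrome JA3 with a Firefox UA (or vice versa) is a red flag in itself."""
    family = "firefox" if "Firefox" in user_agent else "chrome"
    return impersonate.startswith(family)

def fetch(url, impersonate="chrome110"):
    # imported lazily so the helper above is testable without curl_cffi installed
    from curl_cffi import requests as creq
    return creq.get(url, impersonate=impersonate)
```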
>>
>>101209727
Does CF actually send any data to detect if you're a bot on those "Wait a moment" pages or is it basically just detecting if you have javascript enabled?

If we get bypasses up for those pages we'll be unstoppable
>>
>>101209678

semi win is downloading my orders in JSON format from AliExpress so it can automatically import what I bought into a self-hosted inventory management app called "inventree" https://github.com/inventree

there is still some stuff I need to change to fully automate it, but I'm in no hurry.

While this one may not count I'll add it: Thingiverse web scraping. It does have an open API, but at the time I didn't see any wrappers for it, so I just coded one myself. A lot of 3D models I bookmark end up 404ing on me. I got tired of it, so I coded a scraper to download them and push them into my self-hosted git called "gitea" in case I want them later.
https://github.com/go-gitea/gitea

another scraping win was what I mentioned earlier about the Yeezy sneakers. When my friend asked me, I had 2 days to code a bot to try to snag the sneakers, so I spent 2 days with no sleep. I got the bot to successfully make a purchase, but the closer I got to the deadline the hackier the code got. I didn't expect those line queues Foot Locker put on their site, and my bot got blacklisted pretty quick.. lmao. But hey, I had 2 days and it was my first time trying something like that.
>>
>>101209815
>I got the bot to successfully make a purchase

Sorry I meant on some random clearance item, cheapest thing they had.
>>
>>101209815
> another scraping win was what I mentioned earlier about the yeezy sneakers
KEK you could probably make $100K+ sneakerbotting

Stores have a fuckton of protection against it though and most of the data that would help you carry it out is probably extremely private and well-kept in secrecy
>>
>>101209727
Scraping images from Y3p search using curl-impersonate. Sometimes it blocks me and returns a JavaScript challenge, but only on images. Why is that?
>>
>>101209781
>403 and 429ing
100% chance they wrote a retarded manual rule for that case, combining login count and request count
Your description smells of manually added pajeetwork
>client gets big mad
>scraper using fuckall resources but dashboard shows big bad
>client taking it personally
>orders pajeet support to do something
>yes sir will do the needful
>ends up with an unpredictable rule that in most cases kicks more legitimate users
I can't tell you the number of times I've had to explain to people that many scrapers either have no discernible effect on their operating costs or are even desirable. They are getting mad at you. They want you to get off their lawn.
>>101209807
>Does CF actually
Surprisingly never worked for CF. Afaik they track your user inputs and create scores based on that, like keys and pointer location and movement. Sounds like a job for ML.
>>101209886
WAF set up in the CDN only. Many such cases.
>>
>>101209711
Isn't nitter dead?
>>
>>101210031
> 100% chance they wrote a retarded manual rule for that case, combining login count and request count
> Your description smells of manually added pajeetwork
Any idea how to bypass it? Am getting paid for this and the info is just useful anyways
>>
>>101210088
https://nitter.poast.org/
Not sure how much easier it is to scrape than actual Twitter, but it's definitely easier than scraping Twitter directly
>>
>>101210031
> Surprisingly never worked for CF
Have you ever worked for Akamai?
>>
>>101210284
>starts with requiring JS and "verifying your browser"
Christ.
>>
how do you guys make money off scraping
>>
>>101210567
For fuck's sake stop asking this and read the FAQ
>>
>>101210278
Observe what it does and then do something else
>RTFM on CF custom rules
>think like a poo
99/100 times they're on a 1 or 5 minute sliding window for timeouts. Play with that.
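Probing a sliding window like that is easier with a pacer that stays just under a guessed limit. A sketch; the 20 requests / 60 s numbers are placeholders to tune against the real rule:

```python
import time
from collections import deque

class SlidingWindowPacer:
    def __init__(self, limit=20, window=60.0, clock=time.monotonic):
        self.limit, self.window, self.clock = limit, window, clock
        self.sent = deque()  # timestamps of requests inside the window

    def wait_time(self, now=None):
        """Seconds to sleep so the next request stays inside the window."""
        now = self.clock() if now is None else now
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()          # expire old timestamps
        if len(self.sent) < self.limit:
            return 0.0
        return self.window - (now - self.sent[0])

    def record(self, now=None):
        self.sent.append(self.clock() if now is None else now)
```

Call `time.sleep(pacer.wait_time())` then `pacer.record()` before each request; bisect the limit/window values until the 429s stop.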
>>
>>101209727
How to be like you good sar
>>
>>101208964
Part of it is retarded design (e.g. a mixture of snake and camel case for JSON keys, invalid characters in URLs that most HTTP libraries try to escape for you, unintuitive organisation of resources) and part of it is this weird concept of pushing and popping state to the server.
I'm scraping test results from the site, so a basic example is fetching the results for the candidates in a test.
First you fetch all the tests available, but to do anything with a test you need its category. There's no endpoint to get all the categories, so you have to collect the unique category ids from the tests and fetch the categories one by one. Then you use the test and the categories to fetch a list of all the sitting ids for a test; if you send an invalid category and test combo, the server returns the sitting ids for the default data set without any error message.
Then you let the server know you want the data for these sittings, so you push the category, test, and sitting ids back to it. Then you send them again, telling it to prepare the candidate responses for each sitting in the test. Then you pop them off, at which point the server finally sends you a token you can use to get your data. If you request data for any other category+test+sitting combo before you pop off the last lot, the server sends you the default data set without any errors.
And that's the "simple" case. There are other kinds of assessments where you have to request each question and each candidate response for each test individually in a similar manner.
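For what it's worth, that flow sketches out roughly like this. Every endpoint name below is invented; the point is the ordering and the push/prepare/pop constraint:

```python
import requests

def category_ids(tests):
    """Tests embed category ids; there's no categories endpoint, so dedupe."""
    return sorted({t["categoryId"] for t in tests})

def fetch_results(base, test):
    s = requests.Session()
    tests = s.get(f"{base}/tests").json()
    cats = {cid: s.get(f"{base}/categories/{cid}").json()
            for cid in category_ids(tests)}
    sittings = s.get(f"{base}/sittings",
                     params={"test": test["id"],
                             "category": test["categoryId"]}).json()
    # push state, prepare the responses, then pop to receive the access token
    s.post(f"{base}/state/push", json={"test": test["id"],
                                       "category": test["categoryId"],
                                       "sittings": sittings})
    s.post(f"{base}/prepare", json={"sittings": sittings})
    token = s.post(f"{base}/state/pop").json()["token"]
    return s.get(f"{base}/results", params={"token": token}).json()
```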
>>
>>101210868
Tried basically everything. You think they're just banning my proxies? I'm using iproyals.

Also, I was thinking it was a global limit until I tried going to the form I was abusing with the bot runing and it worked fine. I was using an extremely outdated version of Chrome at the time too, so it's not that.

Can't be that I'm speeding through the login form too fast because it usually 429s and 403s at the very first request on failed attempts, and it can't be that I'm inputting the same thing every time because thing thing I put in is different. I'm at a loss
>>
File: 1617693129717.jpg (116 KB, 851x1175)
How to bypass cloudflare protection of 4chan's archives?
archived.moe/desuarchive?
>>
Absolute state. Git gud retards
>>
I have been scraping a porn site for the last 5 years and I won't stop!
>>
>>101213544
Manually copy the cookies.
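Concretely (a sketch, assuming requests): lift cf_clearance out of devtools and replay it with the exact User-Agent of the browser that earned it, since the clearance is typically bound to the UA and IP that solved the challenge. The cookie value below is obviously fake:

```python
import requests

def cf_session(cookies: dict, user_agent: str) -> requests.Session:
    """Session preloaded with browser cookies and a matching User-Agent."""
    s = requests.Session()
    s.headers["User-Agent"] = user_agent  # must match the solving browser
    for name, value in cookies.items():
        s.cookies.set(name, value)
    return s

session = cf_session(
    {"cf_clearance": "FAKEVALUE"},
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
)
```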
>>
>>101213916
Which one?
The csrf_token?
>>
>>101213544
What are you scraping from 4chan archives?
>>
>>101214036
Speaking of, does anyone know any archives that don't have the search limitations that archived.moe has?
>>
File: 1615371265300.jpg (128 KB, 1080x1350)
>>101214036
>What are you scraping from 4chan archives?
Scraping links from the H voice thread on /h/, and RJ codes.
>>
is reCaptchav3 literally invincible?

I'm trying to figure out a way to download the files in the 'Documentos' tab here but it just isn't happening, 2captcha just refunds my credits claiming they can't do it. (times out after 600s)

https://www.contratosdegalicia.gal/licitacion?N=822456&OR=49&ID=801&S=C&lang=gl
>>
>>101215108
Use capsolver. V3 invisible is literally cheaper than their other recaptcha types anyway
>>
>>101213830
Share?
>>
I want to scrape LinkedIn using Python, beautifulsoup4, and probably Selenium? I would like to gather information on tech companies and their employees, specifically salespeople or people who make business-to-business transactions (manager types). How should I go about this to ensure my scraper works in any env and is reliable in the future? I've already got a prototype running, but it's brittle because I used XPaths instead of something more reliable
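On the brittleness point: prefer stable hooks (data-* attributes, aria labels) over positional XPaths, and fail loudly when the markup drifts so you notice breakage immediately instead of collecting garbage. A sketch with beautifulsoup4; the selectors are illustrative, not LinkedIn's actual markup:

```python
from bs4 import BeautifulSoup

def extract_people(html):
    """Pull profile cards by attribute hook, not by DOM position."""
    soup = BeautifulSoup(html, "html.parser")
    people = []
    for card in soup.select("[data-member-id]"):  # attribute, not position
        name = card.select_one(".name")
        title = card.select_one(".title")
        if name is None:                          # markup drifted: surface it
            raise ValueError("profile card layout changed")
        people.append({"id": card["data-member-id"],
                       "name": name.get_text(strip=True),
                       "title": title.get_text(strip=True) if title else None})
    return people
```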


