/g/ - Technology


Thread archived.
You cannot reply anymore.




File: scraper.png (1.62 MB, 1892x2142)
Web Scraping General

Clever edition

QOTD: What's easier? Parsing HTML or reverse engineering private APIs?

FAQ: rentry co/t6237g7x

> Captcha services
https://2captcha.com/
https://www.capsolver.com/
https://anti-captcha.com/

> Proxies
https://hproxy.com/ (no blacklist) (recommended, owned by friend of /wsg/)
https://infiniteproxies.com/ (no blacklist)
https://www.thunderproxies.com/
http://proxies.fo/ (not recommended)

> Network analysis
https://mitmproxy.org/
https://portswigger.net/burp

> Scraping tools
https://beautiful-soup-4.readthedocs.io/en/latest/
https://www.selenium.dev/documentation/
https://playwright.dev/docs/codegen
https://github.com/lwthiker/curl-impersonate
https://github.com/yifeikong/curl_cffi

Official Telegram: @scrapists
Last thread: >>101177514
>>
Does anyone have any resources or advice for reverse engineering angular web apps? A site I used to scrape has been (((upgraded))) to an SPA with infinite scroll and it's giving me grief.
The site is now so JS-heavy I have to use Playwright, and I keep running out of memory. Each time I load a new page they send 2 MB of data to the client, and it's never freed, so memory grows until my scraper crashes. Reloading the page takes me back to the beginning with no way to jump to the last page scraped.
The site is running in production mode and I can't get ng.probe() or calls to angular.element(...).scope() to work.
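One stopgap while you reverse the API: since the data arrives as XHR payloads anyway, you can capture the responses instead of parsing the DOM, and recycle the browser context every so often so the heap can't balloon. A sketch using Playwright's sync Python API; the "/api/" filter and the `{"items": [...]}` response shape are invented, and recycling still restarts the scroll, so it's a band-aid, not a fix:

```python
def pick_items(url: str, payload: dict) -> list:
    """Keep only the data-bearing responses (hypothetical shape)."""
    return payload.get("items", []) if "/api/" in url else []

def capture(url, scrolls=200, recycle_every=50):
    # imported lazily so pick_items stays testable without playwright installed
    from playwright.sync_api import sync_playwright
    rows = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = None
        for i in range(scrolls):
            if i % recycle_every == 0:
                if page:
                    page.context.close()  # frees the leaked JS heap
                page = browser.new_context().new_page()
                page.on("response", lambda r: rows.extend(
                    pick_items(r.url, r.json())
                    if "json" in r.headers.get("content-type", "") else []))
                page.goto(url)
            page.mouse.wheel(0, 10_000)   # drive the infinite scroll
            page.wait_for_timeout(500)
        browser.close()
    return rows
```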
>>
>>101208716
That means the data you're trying to scrape is almost certainly being requested via some sort of undocumented API. You might be able to figure out which endpoints the JS on the site is hitting by opening the network tab in inspect element.
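Once the network tab gives up the endpoint, paging through it directly usually beats driving a browser. A sketch with requests; the /api/posts path, the "page" parameter, and the "next"/"posts" fields are all guesses to adapt to whatever you actually see:

```python
import requests

def fetch_page(session, base, page):
    """Hit the (hypothetical) JSON endpoint the SPA was calling."""
    r = session.get(f"{base}/api/posts", params={"page": page},
                    headers={"X-Requested-With": "XMLHttpRequest"})
    r.raise_for_status()
    return r.json()

def next_page_exists(payload):
    """Many paginated APIs expose a next link or cursor; adapt to yours."""
    return bool(payload.get("next"))

def scrape_all(base):
    s, page, out = requests.Session(), 1, []
    while True:
        payload = fetch_page(s, base, page)
        out.extend(payload.get("posts", []))
        if not next_page_exists(payload):
            return out
        page += 1
```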
>>
File: scraper.png (443 KB, 1080x606)
>>101208751
kek I was hoping this wasn't going to be the answer. They do have a private API but it's one of the most horrible, byzantine monstrosities I've seen. It will take a while to reverse engineer
>>
>>101208893
What's wrong with it specifically? Most of them are easier to deal with than parsing HTML
>>
How do I scrape twitter? Rate limits, account blocks, it seems impossible now, unless you invest in proxies and maybe human captcha solvers.
>>
>>101209154
Browser with a userscript. The userscript copies the post data and then you send it to a 127.0.0.1:8081 backend that collects it.

Bonus: You are using a real browser
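The 127.0.0.1:8081 collector side of this can be pure stdlib. A sketch; the payload shape is whatever your userscript decides to POST:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

POSTS = []

def store(raw: bytes) -> dict:
    """Decode one userscript payload and stash it; returns the parsed post."""
    post = json.loads(raw)
    POSTS.append(post)
    return post

class Collector(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        store(self.rfile.read(length))
        self.send_response(204)  # no body needed back to the userscript
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8081), Collector).serve_forever()
```

The userscript side is just a `fetch("http://127.0.0.1:8081", {method: "POST", body: JSON.stringify(post)})` per post it sees.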
>>
>>101208241
Why does it feel so good to scroop mobile apis?
>>
>>101209203
I don't think that helps. The rate limiting or viewing limit applies to normal human users too. They do it to get people to pay for twitter. The account banning only started when I tried to use multiple accounts in rotation.
>>
>>101209424

For me it's probably the cozy setup of an Android app with a system certificate and mitmproxy. Maybe it's that what you need is easier to get from the mobile API than from the desktop site.


>want to make a telegram bot
>use it to send request to planet fitness
>to different planet fitness gyms to see live update of occupancy

Why? Just so I can see if it's too crowded for my liking.

For some fucking reason, setting "user-agent" to anything Mozilla gets a 403 status code. I then used the one Insomnia defaults to and it works. Fucking weird; maybe if I tried the mobile API I wouldn't have to worry about the user-agent.
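A quick way to pin down header-based blocks like this is to vary only the User-Agent and compare status codes. A sketch; the candidate strings are examples, not known-good values for that site:

```python
import requests

CANDIDATES = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "insomnia/8.0.0",           # the client default that worked for anon
    "python-requests/2.31.0",
]

def ua_status(url, ua):
    """Return the HTTP status for one User-Agent, nothing else varied."""
    return requests.get(url, headers={"User-Agent": ua}, timeout=10).status_code

def survivors(results):
    """Keep the UAs that didn't trip the WAF (no 403/429)."""
    return [ua for ua, code in results if code not in (403, 429)]
```

Usage would be `survivors([(ua, ua_status(url, ua)) for ua in CANDIDATES])`.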
>>
>>101209424
Best of both worlds in the OP image.

>>101209614
Any other scraping wins?
>>
>>101209154
Just try scraping nitter instances. Of course, be prudent about it. Don't be that one asshole that forces a nitter instance owner to implement (((captchas))) and (((ratelimits)))
>>
>>101208241
I am a WAFag making money 429ing you.
Thank you for providing me with gainful employment.
Ask away if you want to know about WAFaggotry.
>pic semi related
>>
>>101209727

> Thank you for providing me with gainful employment.
Works out for both of us, since we'll just get around it anyways

> Ask away if you want to know about WAFaggotry.
Yeah, do you know why this one site keeps 403 and 429ing me? I'm using a diff IP (residential) and TLS fingerprint (chrome and firefox, headers spoofed too) on each attempt but keep getting 403'd and 429'd. Site's behind CF, program I'm writing basically just logs into a bunch of different accounts and scrapes different info from parts of the website. For some reason, one login attempt won't get me blocked but a bunch will
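One thing worth ruling out here: if the TLS fingerprint says Chrome but the headers say Firefox, that mismatch is itself a signal. A sketch using curl_cffi (linked in the OP); the exact impersonate tag varies by curl_cffi release, and the consistency check is a deliberately simplistic illustration:

```python
def fingerprint_matches(impersonate: str, user_agent: str) -> bool:
    """A Chrome JA3 with a Firefox UA (or vice versa) is a red flag in itself."""
    family = "firefox" if "Firefox" in user_agent else "chrome"
    return impersonate.startswith(family)

def fetch(url, impersonate="chrome110"):
    # imported lazily so the helper above is testable without curl_cffi installed
    from curl_cffi import requests as creq
    return creq.get(url, impersonate=impersonate)
```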
>>
>>101209727
Does CF actually send any data to detect if you're a bot on those "Wait a moment" pages or is it basically just detecting if you have javascript enabled?

If we get bypasses up for those pages we'll be unstoppable
>>
>>101209678

semi win is downloading my orders in JSON format from AliExpress so it can automatically import what I bought into a self-hosted inventory management app called "inventree" https://github.com/inventree

there is still some stuff I need to change to fully automate it, but I'm in no hurry.

While this one may not count I'll add it: Thingiverse web scraping. It does have an open API, but at the time I didn't see any wrappers for it, so I just coded one myself. A lot of 3D models I bookmark end up 404ing on me. I got tired of it, so I coded a scraper to download them and push them into my self-hosted git called "gitea" in case I want them later.
https://github.com/go-gitea/gitea

another scraping win was what I mentioned earlier about the Yeezy sneakers. When my friend asked me, I had 2 days to code a bot to try to snag the sneakers, so I spent 2 days with no sleep. I got the bot to successfully make a purchase, but the closer I got to the deadline the hackier the code got. I didn't expect those line queues Foot Locker put on their site, and my bot got blacklisted pretty quick.. lmao. But hey, I had 2 days and it was my first time trying something like that.
>>
>>101209815
>I got the bot to successfully make a purchase

Sorry I meant on some random clearance item, cheapest thing they had.
>>
>>101209815
> another scraping win was what I mentioned earlier about the yeezy sneakers
KEK you could probably make $100K+ sneakerbotting

Stores have a fuckton of protection against it though and most of the data that would help you carry it out is probably extremely private and well-kept in secrecy
>>
>>101209727
Scraping images from Y3p search using curl-impersonate. Sometimes it blocks me and returns a JavaScript challenge, but only on images. Why is that?
>>
>>101209781
>403 and 429ing
100% chance they wrote a retarded manual rule for that case, combining login count and request count
Your description smells of manually added pajeetwork
>client gets big mad
>scraper using fuckall resources but dashboard shows big bad
>client taking it personally
>orders pajeet support to do something
>yes sir will do the needful
>ends up with an unpredictable rule that in most cases kicks more legitimate users
I can't tell you the number of times I've had to explain to people that many scrapers either have no discernible effect on their operating costs or are even desirable. They are getting mad at you. They want you to get off their lawn.
>>101209807
>Does CF actually
Surprisingly never worked for CF. Afaik they track your user inputs and create scores based on that, like keys and pointer location and movement. Sounds like a job for ML.
>>101209886
WAF set up in the CDN only. Many such cases.
>>
>>101209711
Isn't nitter dead?
>>
>>101210031
> 100% chance they wrote a retarded manual rule for that case, combining login count and request count
> Your description smells of manually added pajeetwork
Any idea how to bypass it? Am getting paid for this and the info is just useful anyways
>>
>>101210088
https://nitter.poast.org/
Not sure how much easier it is to scrape than actual Twitter, but it's definitely easier than scraping Twitter directly
>>
>>101210031
> Surprisingly never worked for CF
Have you ever worked for Akamai?
>>
>>101210284
>starts with requiring JS and "verifying your browser"
Christ.
>>
how do you guys make money off scraping
>>
>>101210567
For fuck's sake stop asking this and read the FAQ
>>
>>101210278
Observe what it does and then do something else
>RTFM on CF custom rules
>think like a poo
99/100 times they're on a 1 or 5 minute sliding window for timeouts. Play with that.
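Probing a sliding window like that is easier with a pacer that stays just under a guessed limit. A sketch; the 20 requests / 60 s numbers are placeholders to tune against the real rule:

```python
import time
from collections import deque

class SlidingWindowPacer:
    def __init__(self, limit=20, window=60.0, clock=time.monotonic):
        self.limit, self.window, self.clock = limit, window, clock
        self.sent = deque()  # timestamps of requests inside the window

    def wait_time(self, now=None):
        """Seconds to sleep so the next request stays inside the window."""
        now = self.clock() if now is None else now
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()          # expire old timestamps
        if len(self.sent) < self.limit:
            return 0.0
        return self.window - (now - self.sent[0])

    def record(self, now=None):
        self.sent.append(self.clock() if now is None else now)
```

Call `time.sleep(pacer.wait_time())` then `pacer.record()` before each request; bisect the limit/window values until the 429s stop.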
>>
>>101209727
How to be like you good sar
>>
>>101208964
Part of it is retarded design (e.g. a mixture of snake and camel case for JSON keys, invalid characters in URLs that most HTTP libraries try to escape for you, unintuitive organisation of resources) and part of it is this weird concept of pushing and popping state to the server.
I'm scraping test results from the site, so a basic example is fetching the results for the candidates in a test.
First you fetch all the tests available, but to do anything with a test you need its category. There's no endpoint to get all the categories, so you have to collect the unique category ids from the tests and fetch the categories one by one. Then you use the test and the categories to fetch a list of all the sitting ids for a test; if you send an invalid category and test combo, the server returns the sitting ids for the default data set without any error message.
Then you let the server know you want the data for these sittings, so you push the category, test, and sitting ids back to it. Then you send them again, telling it to prepare the candidate responses for each sitting in the test. Then you pop them off, at which point the server finally sends you a token you can use to get your data. If you request data for any other category+test+sitting combo before you pop off the last lot, the server sends you the default data set without any errors.
And that's the "simple" case. There are other kinds of assessments where you have to request each question and each candidate response for each test individually in a similar manner.
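For what it's worth, that flow sketches out roughly like this. Every endpoint name below is invented; the point is the ordering and the push/prepare/pop constraint:

```python
import requests

def category_ids(tests):
    """Tests embed category ids; there's no categories endpoint, so dedupe."""
    return sorted({t["categoryId"] for t in tests})

def fetch_results(base, test):
    s = requests.Session()
    tests = s.get(f"{base}/tests").json()
    cats = {cid: s.get(f"{base}/categories/{cid}").json()
            for cid in category_ids(tests)}
    sittings = s.get(f"{base}/sittings",
                     params={"test": test["id"],
                             "category": test["categoryId"]}).json()
    # push state, prepare the responses, then pop to receive the access token
    s.post(f"{base}/state/push", json={"test": test["id"],
                                       "category": test["categoryId"],
                                       "sittings": sittings})
    s.post(f"{base}/prepare", json={"sittings": sittings})
    token = s.post(f"{base}/state/pop").json()["token"]
    return s.get(f"{base}/results", params={"token": token}).json()
```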
>>
>>101210868
Tried basically everything. You think they're just banning my proxies? I'm using iproyals.

Also, I was thinking it was a global limit until I tried going to the form I was abusing with the bot runing and it worked fine. I was using an extremely outdated version of Chrome at the time too, so it's not that.

Can't be that I'm speeding through the login form too fast because it usually 429s and 403s at the very first request on failed attempts, and it can't be that I'm inputting the same thing every time because thing thing I put in is different. I'm at a loss
>>
File: 1617693129717.jpg (116 KB, 851x1175)
How to bypass cloudflare protection of 4chan's archives?
archived.moe/desuarchive?
>>
Absolute state. Git gud retards
>>
I have been scraping a porn site for the last 5 years and I won't stop!
>>
>>101213544
Manually copy the cookies.
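Concretely (a sketch, assuming requests): lift cf_clearance out of devtools and replay it with the exact User-Agent of the browser that earned it, since the clearance is typically bound to the UA and IP that solved the challenge. The cookie value below is obviously fake:

```python
import requests

def cf_session(cookies: dict, user_agent: str) -> requests.Session:
    """Session preloaded with browser cookies and a matching User-Agent."""
    s = requests.Session()
    s.headers["User-Agent"] = user_agent  # must match the solving browser
    for name, value in cookies.items():
        s.cookies.set(name, value)
    return s

session = cf_session(
    {"cf_clearance": "FAKEVALUE"},
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
)
```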
>>
>>101213916
Which one?
The csrf_token?
>>
>>101213544
What are you scraping from 4chan archives?
>>
>>101214036
Speaking of, does anyone know any archives that don't have the search limitations that archived.moe has?
>>
File: 1615371265300.jpg (128 KB, 1080x1350)
>>101214036
>What are you scraping from 4chan archives?
Scraping links from the H voice thread on /h/, and RJ codes.
>>
is reCaptchav3 literally invincible?

I'm trying to figure out a way to download the files in the 'Documentos' tab here but it just isn't happening, 2captcha just refunds my credits claiming they can't do it. (times out after 600s)

https://www.contratosdegalicia.gal/licitacion?N=822456&OR=49&ID=801&S=C&lang=gl
>>
>>101215108
Use capsolver. V3 invisible is literally cheaper than their other recaptcha types anyway
>>
>>101213830
Share?
>>
I want to scrape LinkedIn using Python, beautifulsoup4, and probably Selenium? I would like to gather information on tech companies and their employees, specifically salespeople or people who make business-to-business transactions (manager types). How should I go about this to ensure my scraper works in any env and is reliable in the future? I've already got a prototype running, but it's brittle because I used XPaths instead of something more reliable
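On the brittleness point: prefer stable hooks (data-* attributes, aria labels) over positional XPaths, and fail loudly when the markup drifts so you notice breakage immediately instead of collecting garbage. A sketch with beautifulsoup4; the selectors are illustrative, not LinkedIn's actual markup:

```python
from bs4 import BeautifulSoup

def extract_people(html):
    """Pull profile cards by attribute hook, not by DOM position."""
    soup = BeautifulSoup(html, "html.parser")
    people = []
    for card in soup.select("[data-member-id]"):  # attribute, not position
        name = card.select_one(".name")
        title = card.select_one(".title")
        if name is None:                          # markup drifted: surface it
            raise ValueError("profile card layout changed")
        people.append({"id": card["data-member-id"],
                       "name": name.get_text(strip=True),
                       "title": title.get_text(strip=True) if title else None})
    return people
```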


