/g/ - /wsg/ - Web Scraping General - Technology

Anonymous

/wsg/ - Web Scraping General 06/30/24(Sun)17:47:17 No.101221231

File: scraper.png (1.62 MB, 1892x2142)

Web Scraping General

Cloudflare inny edition

QOTD: What are the most common custom cloudflare rules and what's the easiest way to bypass them?

FAQ: rentry co/t6237g7x

> Captcha services
https://2captcha.com/
https://www.capsolver.com/
https://anti-captcha.com/

> Proxies
https://hproxy.com/ (no blacklist) (recommended, owned by friend of /wsg/)
https://infiniteproxies.com/ (no blacklist)
https://www.thunderproxies.com/
http://proxies.fo/ (not recommended)

> Network analysis
https://mitmproxy.org/
https://portswigger.net/burp

> Scraping tools
https://beautiful-soup-4.readthedocs.io/en/latest/
https://www.selenium.dev/documentation/
https://playwright.dev/docs/codegen
https://github.com/lwthiker/curl-impersonate
https://github.com/yifeikong/curl_cffi

Official Telegram: @scrapists
Last thread: >>101208241

Anonymous
06/30/24(Sun)17:48:08 No.101221238

>>101221231
bump

Anonymous
06/30/24(Sun)19:12:47 No.101222088

just finished reading the rentry. i like how we avoid talking about the most obvious way to make money from scraping to prevent skid influx. firm handshake gentleman.

Anonymous
06/30/24(Sun)19:43:51 No.101222463

>>101222088
> firm handshake gentleman.
> implying web scrapers are employable in the regular job market

Anonymous
06/30/24(Sun)23:42:12 No.101224422

'mp

Anonymous
06/30/24(Sun)23:44:38 No.101224435

>>101221231
Reposting from the coomtech blogpost thread
>Currently trying to reverse engineer thot hub video API (first time reverse engineering)
>This way I can access private videos without having to send a friend request
>Have narrowed the mp4 url down to the following pattern:
```
Https://thothub.ch/get_file/<small number based on recency of upload>/<some hex>/<file number rounded down to the nearest 10000>/<file_number>/<file_number.mp4>/?rnd=<some number based on current time I think>
```
Can anyone help me figure it out? I know when I have access to the video, my client is calling remote_control.php with some auth, so that's what I'm trying to bypass when I don't have access.

Anonymous
06/30/24(Sun)23:47:27 No.101224456

Reposting bc these generals are always dead anyways

I want to scrape linkedin using python, beautifulsoup4, and probably selenium? I would like to gather information on tech companies, and their employees - specifically sales people or people who make business to business transactions (manager types). How should I go about this to ensure my scraper works on any env and is reliable in the future? I've already got a prototype running but it's broken as I used xpaths instead of something more reliable

Anonymous
06/30/24(Sun)23:49:04 No.101224470

>>101224435
Are any of those numbers generated client side? If so, you should try reversing with the debugger in inspect element

Anonymous
06/30/24(Sun)23:50:15 No.101224482

>>101224456
Which part in specific are you having trouble with?

Anonymous
06/30/24(Sun)23:58:57 No.101224527

>>101224435
>>101224470 best answer I think.

What's your sample size for known working links?

Anonymous
07/01/24(Mon)00:04:23 No.101224568

>>101224482
Sort of planning out the most effective way to scrape the data I'd like to find. In addition I also am having trouble making my scraper reliable.

For example, I want to find information about sales people in the tech industry for specific companies. How can I navigate linkedin to find these sorts of people? What job titles do they have? How can I search for these on linkedin's platform?

I guess planning it out is where I'm having trouble.

When I say reliable I mean I started this project a couple days ago then got a little busy with work. When I came back to run my scraper again it didn't work. Xpaths obviously change so I'm thinking of using ids instead.

Anonymous
07/01/24(Mon)00:10:46 No.101224625

>>101224470
The file number is the same as in the webpage url, so it's a given. The small number is somewhat easy to reverse engineer since it's based on the file number. I think the time number might not be an issue. I'm noob so this will be my first time using the debugger and I have no idea what I should use it to do. At least I know JavaScript.
>>101224527
Sample size: practically unlimited

Anonymous
07/01/24(Mon)00:24:26 No.101224739

>>101224625
>Sample size: practically unlimited

Alright. First advice is get a few thousand and run them through gpt to see if it can find an obvious pattern. It will probably fail because gpt is ass. Sometimes it's good at recognizing a pattern where we are not, but it's mostly ass.

Second. You can use the browser's dev tools: open up the "Network" tab and see what is being sent and received, and in what format, to and from your browser. There's probably a lot of useful data there.

Third advice is create a bot that automatically creates accounts, adds people, and scrapes those urls from the successfully added people. Which will be more future proof because eventually the retards that admin the site won't allow you to access files directly without proper authentication. Although that may never happen

Anonymous
07/01/24(Mon)00:27:34 No.101224765

>>101224568
You should not use direct XPaths. If you don't give a shit about speed at scale you should search the entire page for headers or content text that precedes or follows the content you're scraping. Large sites like Twitter and LinkedIn change layouts and XPaths all the time. It will break over and over and over again. As for the planning idk what to tell you, not exactly sure what you're trying to achieve
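
Rough sketch of the text-anchor idea, stdlib only so it runs anywhere (the same approach works in BeautifulSoup). Everything here is made up for illustration: the `TextCollector` class, the `text_after` helper, the "Headquarters" label and both sample layouts. It flattens the page into visible text chunks and returns whatever comes right after a known label, so it doesn't care which tags or classes wrap the value.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect visible text chunks in document order, ignoring tag structure."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip:
            self.chunks.append(text)


def text_after(html, label):
    """Return the first text chunk that follows `label`, or None if absent."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    for i, chunk in enumerate(parser.chunks[:-1]):
        if chunk.lower() == label.lower():
            return parser.chunks[i + 1]
    return None


# Two hypothetical layouts of the same page: different tags and classes,
# same label text. An absolute XPath would break between them.
layout_a = '<div class="x1"><h2>Headquarters</h2><p>Springfield</p></div>'
layout_b = '<section><span class="lbl">Headquarters</span><div><b>Springfield</b></div></section>'

print(text_after(layout_a, "Headquarters"))
print(text_after(layout_b, "Headquarters"))
```

Tradeoff: it's slower than a direct selector and it breaks if the site renames the label itself, but labels tend to change far less often than markup does.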

Anonymous
07/01/24(Mon)00:37:11 No.101224843

>>101224739
>First
Nice meme
>Third
Unfortunately, in order to send friend requests, you have to upload 3 videos. This wouldn't be a problem, except that it seems whoever owns thothub regularly runs lengthy server maintenance that prevents new uploads for days at a time. They are also really picky about what kind of videos you're allowed to upload, and will ban you if you upload low quality porn or just the wrong type of porn.
>Second
That's what I'm trying to figure out how to do, as other anons have suggested. I'm a total noob, and this is my first time even opening up the JavaScript debugger panel. Someone in the other thread said porn devs are notoriously retarded, and you can download motherless vids by changing 2 variables, so this gives me hope.

Anonymous
07/01/24(Mon)01:01:31 No.101224995

>>101224843
>you have to upload 3 videos.
Annoying but possible. Table it for now.
>first time even opening up the JavaScript debugger
Read the requests, look at the data. See if there's any data or requests you can duplicate to get the data you want back from their server. It's tedious. You're going to want to learn how to duplicate those requests via get, post, whatever. If they use json. Etc. There's going to be a lot of python libraries that trivialize this.
>Nice meme
Eh, worth a try. Maybe it will notice a Unix time code pattern or something who knows. As I said it's ass

Anonymous
07/01/24(Mon)01:23:11 No.101225092

>>101224995
I found a promising lead: if I pause the debugger at Script First Statement, there is a userId variable that isn't present when I'm logged out. The userId of the video's uploader seems to be public. Now I just need to figure out how to change the source code in the debugger.

Anonymous
07/01/24(Mon)01:29:53 No.101225130

>>101221231
how do I scrape all of artofproblemsolving.com? it requires an account and is extremely anti scraping, and requires a custom scraper with js features enabled
how do I go about building one? and how do I not get rate and account limited?

Anonymous
07/01/24(Mon)02:51:08 No.101225589

>>101225130
Alright what part of the process do you need help with?

Anonymous
07/01/24(Mon)02:54:06 No.101225613

bump

Anonymous
07/01/24(Mon)02:54:54 No.101225623

>>101225589
I am a total newbie to webscraping, so a starting point for resources to make a custom scraper that loads js and works with artofproblemsolving.com would be nice

Anonymous
07/01/24(Mon)03:05:50 No.101225708

>>101225623
You can use selenium/playwright (playwright recommended), though from the looks of it you can do it with pure requests if you just use curl_cffi to bypass CF

Not sure how you'll be able to get around ratelimits though, given the site's behind a WAF. You might need to invest in some residential proxies. For JS, if request data is being sent with variables generated by client-side JS, you might be able to reverse it with the JS debugger in inspect element.
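
On the ratelimit side, the boring first step is just not hammering the site. Minimal sketch, stdlib only; the `Throttle` class, `backoff_delays` helper and all the numbers are made up for illustration, and none of this gets you past a WAF. It only spaces out your own requests and gives you growing wait times to use after an error response.

```python
import time


def backoff_delays(base=1.0, factor=2.0, retries=5, cap=60.0):
    """Seconds to wait before each retry, growing geometrically up to `cap`."""
    return [min(cap, base * factor ** i) for i in range(retries)]


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the logic can be tested without real sleeping
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(min_interval=0.2)  # hypothetical pacing, tune per site
    for page in range(3):
        throttle.wait()                    # call this right before each request
        print("would fetch page", page)
    print(backoff_delays(retries=4))       # waits to use after consecutive failures
```

Call `throttle.wait()` before every request and walk through `backoff_delays()` whenever the server starts answering with errors; what interval the site actually tolerates is something you have to find out yourself.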

Anonymous
07/01/24(Mon)07:16:34 No.101227226

'mp

Anonymous
07/01/24(Mon)08:03:13 No.101227584

File: file.png (30 KB, 1276x276)

>>101224435
https://thothub.mx/static/js/main.js
try the unminified js
><some hex>
md5, seems to either come from filename or is random
>rounded down to the nearest 10000
to nearest 1000
>?rnd=
unix timestamp, not required
> remote_control.php
get_file redirects here, there's a file parameter, cv,cv2,cv3,cv4 seem to be md5, cv3 is the same for different urls ive checked