/g/ - /wsg/ - Web Scraping General - Technology

Anonymous

/wsg/ - Web Scraping General 06/24/24(Mon)17:22:37 No.101135838

File: scraper.png (1.62 MB, 1892x2142)

Web Scraping General

Reverse engineering edition pt 3

QOTD: What music do you listen to while scraping?

FAQ: rentry co/t6237g7x

> Captcha services
https://2captcha.com/
https://www.capsolver.com/
https://anti-captcha.com/

> Proxies
https://hproxy.com/ (no blacklist) (recommended, owned by friend of /wsg/)
https://infiniteproxies.com/ (no blacklist)
https://www.thunderproxies.com/
http://proxies.fo/ (not recommended)

> Network analysis
https://mitmproxy.org/
https://portswigger.net/burp

> Scraping tools
https://beautiful-soup-4.readthedocs.io/en/latest/
https://www.selenium.dev/documentation/
https://playwright.dev/docs/codegen
https://github.com/lwthiker/curl-impersonate
https://github.com/yifeikong/curl_cffi

Official Telegram: @scrapists
Last thread: >>101090758

Anonymous
06/24/24(Mon)17:32:23 No.101135976

My penis exhales in cloudflare's nostrils

Anonymous
06/24/24(Mon)17:36:36 No.101136030

mcbroken.com

what is your excuse for not making such a scraping project?

Anonymous
06/24/24(Mon)17:51:02 No.101136210

>>101136030
Holy based

Anonymous
06/24/24(Mon)17:57:19 No.101136302

>scrape the website
>miss some data because of random edge cases
>don't know exactly which items have those edge cases without checking all millions of them separately
>the owner adds Turnstile
>use anti-captcha to bypass it and scrape the whole website again
>still miss some data because of random edge cases
>the owner gets really mad, adds a global rate limit, and makes the whole website unusable
>can't scrape it anymore in a reasonable time
owari da.

Anonymous
06/24/24(Mon)18:04:30 No.101136422

>>101136302
next time, store the urls you scraped properly in memcached or something, so you don't have to revisit them when you refactor and run your script again
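
Rough sketch of that idea, assuming you're fine with SQLite from the standard library instead of memcached (memcached loses everything on restart, while a file-backed store survives a rerun). `SeenStore`, the table layout and the status strings are made up for illustration, not taken from anyone's scraper.

```python
import sqlite3


class SeenStore:
    """Persistent record of scraped URLs so a rerun can skip finished work."""

    def __init__(self, path: str = "seen_urls.db") -> None:
        # Pass ":memory:" for a throwaway store, or a file path to persist.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen ("
            "url TEXT PRIMARY KEY, status TEXT NOT NULL)"
        )
        self.conn.commit()

    def mark(self, url: str, status: str = "ok") -> None:
        # INSERT OR REPLACE lets a later successful retry overwrite a failure.
        self.conn.execute(
            "INSERT OR REPLACE INTO seen (url, status) VALUES (?, ?)",
            (url, status),
        )
        self.conn.commit()

    def is_done(self, url: str) -> bool:
        row = self.conn.execute(
            "SELECT 1 FROM seen WHERE url = ? AND status = 'ok'", (url,)
        ).fetchone()
        return row is not None

    def pending(self, urls: list[str]) -> list[str]:
        # Everything not yet scraped successfully, in the original order.
        return [u for u in urls if not self.is_done(u)]
```

Storing a status next to each URL also gives you a list of the pages that hit an edge case, so the next run only has to revisit those instead of the whole site.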

Anonymous
06/24/24(Mon)18:38:40 No.101136873

what do you usually scrape?

Anonymous
06/24/24(Mon)18:39:53 No.101136885

>>101136873
football stuff.
it pays a lot.

Anonymous
06/24/24(Mon)18:45:06 No.101136946

>>101135838
I usually use scraping to auto download porn. I rarely find paid work. Probably because I'm still a noob. I've been having trouble downloading from xhamster. I looked at the network tab, but I can't figure out how the videos are being requested by the API. Only the ads show up. Can anyone help me understand what's going on?

Anonymous
06/24/24(Mon)18:48:35 No.101136983

>>101136946
yt-dlp doesn't work for most porn sites?

Anonymous
06/24/24(Mon)18:50:27 No.101137008

>>101136983
Damn never heard of it haha. Will report back

Anonymous
06/24/24(Mon)18:56:43 No.101137093

>>101135838
https://www.youtube.com/watch?v=E7rwRqnkymw

Anonymous
06/24/24(Mon)19:05:43 No.101137203

>>101136030
>Just reverse engineer an app bro and get sued too!

Anonymous
06/24/24(Mon)19:23:13 No.101137442

>>101136946
Just grab the page itself and extract the slab of json within the tag 'initials-script'
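
A minimal sketch of pulling JSON out of a script tag, using only the standard library. It assumes the tag is selected by `id` and that its body is a JS assignment with one JSON object in it; the sample HTML and its keys are invented, so check the real page source before relying on any of this.

```python
import json
from html.parser import HTMLParser


class ScriptGrabber(HTMLParser):
    """Collects the text inside <script id="..."> for one target id."""

    def __init__(self, script_id: str) -> None:
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_embedded_json(html: str, script_id: str = "initials-script"):
    grabber = ScriptGrabber(script_id)
    grabber.feed(html)
    grabber.close()
    text = "".join(grabber.chunks)
    start = text.find("{")  # assumption: the first "{" opens the JSON object
    if start == -1:
        raise ValueError(f"no JSON object found in script #{script_id}")
    # raw_decode parses one JSON value and ignores trailing JS such as ";".
    obj, _end = json.JSONDecoder().raw_decode(text[start:])
    return obj


# Invented sample page; real key names will differ.
SAMPLE_HTML = (
    '<html><body><script id="initials-script">'
    'window.initials={"video":{"title":"demo","id":123}};'
    "</script></body></html>"
)
data = extract_embedded_json(SAMPLE_HTML)
```

From there it's ordinary dict access on `data` to find whatever field holds the media URLs.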

Anonymous
06/24/24(Mon)20:19:36 No.101138113

I'm building a job application bot to apply to jobs using playwright and beautiful soup. The code is pretty decent; it's gotten to the point where it can handle most of the custom questions HR throws my way. The downside is that I have a mountain of if statements looking for keywords in each of the form fields. Is there a better way to handle this? I'm playing around with several design patterns to reduce the size of the if blocks, but I'm at a loss. For reference:
```
def validate_custom(self, field: Tag, applicant: Applicant):
    label_field = field.find_parent("label")

    if not label_field or not isinstance(label_field, Tag):
        raise Exception("unknown field", field)

    input_id, input_name, correct_field = "", "", ""
    upper_text = label_field.next.upper() if isinstance(label_field.next, NavigableString) else label_field.text.upper()

    if SingleWordKeys.LINKEDIN.value in upper_text:
        input_id, input_name, correct_field = self._matcher.match_linkedin(field, label_field, applicant)
    elif SingleWordKeys.GITHUB.value in upper_text:
        input_id, input_name, correct_field = self._matcher.match_github(field, label_field, applicant)
    elif SingleWordKeys.SALARY.value in upper_text:
        input_id, input_name, correct_field = self._matcher.match_salary(field, label_field)
    elif any(string in upper_text for string in self._start_date_keys):
        input_id, input_name, correct_field = self._matcher.match_earliest_start(field, label_field)
    elif any(string in upper_text for string in self._citizenship_keys):
        input_id, input_name, correct_field = self._matcher.match_citizenship(field, label_field)
    return label_field, input_id, input_name, correct_field
```

Obviously I cut it down to prove a point but this is starting to balloon and I know there's a better way to handle this

Anonymous
06/24/24(Mon)20:35:36 No.101138294

>>101137203
> imagine caring about the law
> not using a bunch of proxies and being impossible to find (and sue)

>>101138113
You could have a dict that's like
listkeys = {
    SingleWordKeys.LINKEDIN.value: self._matcher.match_linkedin,
    SingleWordKeys.GITHUB.value: self._matcher.match_github,
    ...
}
Then just use something like for k, v in listkeys.items(), call v when k is in upper_text, and handle the logic that way
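
A self-contained sketch of that dispatch-table idea. It assumes every handler can share one signature (label text and applicant in, answer out), which the matchers quoted above don't do yet since only some of them take `applicant`. `Rule`, `RULES`, the handler names and the keyword strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Rule:
    keywords: tuple[str, ...]            # any of these in the label triggers the rule
    handler: Callable[[str, dict], str]  # (label_text, applicant) -> answer


def answer_linkedin(label: str, applicant: dict) -> str:
    return applicant["linkedin"]


def answer_github(label: str, applicant: dict) -> str:
    return applicant["github"]


def answer_start_date(label: str, applicant: dict) -> str:
    return applicant["start_date"]


# Order matters: the first rule with a matching keyword wins,
# just like the first true branch in an if/elif chain.
RULES = [
    Rule(("LINKEDIN",), answer_linkedin),
    Rule(("GITHUB",), answer_github),
    Rule(("START DATE", "EARLIEST START", "WHEN CAN YOU START"), answer_start_date),
]


def dispatch(label_text: str, applicant: dict) -> Optional[str]:
    upper = label_text.upper()
    for rule in RULES:
        if any(key in upper for key in rule.keywords):
            return rule.handler(label_text, applicant)
    return None  # no rule matched; the caller decides how to handle unknowns


applicant = {
    "linkedin": "https://linkedin.com/in/anon",
    "github": "https://github.com/anon",
    "start_date": "2024-08-01",
}
```

A new question type then becomes one more `Rule` entry instead of another elif branch, and multi-keyword cases like the start date keys fit the same table as the single-word ones.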

Anonymous
06/24/24(Mon)21:06:11 No.101138548

>>101137442
>>101136983
Figured it out. There's an amphtml link that takes me to a video that has the full source.

Anonymous
06/24/24(Mon)21:34:47 No.101138857

>>101137442
>>101138548
Ok, xvideos is a little harder. They prevent downloads from certain channels and I can't seem to find or access the video source. Every YouTube scraping tutorial teaches fuck all about requests

Anonymous
06/24/24(Mon)21:39:16 No.101138900

>>101138857
Xvideos is using HLS, if you use the network tab on chrome you can see it requesting the m3u8 file (and you should be able to use this to get the whole stream and convert it to something like mp4)
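
For anyone who hasn't looked inside one: an m3u8 is a text file where lines starting with `#` are tags and every other non-blank line is a URI, usually relative to the playlist URL. A small sketch of listing those URIs; the sample playlist and the example.com URL are invented.

```python
from urllib.parse import urljoin


def parse_m3u8(playlist_text: str, playlist_url: str) -> list[str]:
    """Return the URIs listed in an HLS playlist as absolute URLs, in order."""
    uris = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Skip blank lines and tag/comment lines such as #EXTINF.
        if not line or line.startswith("#"):
            continue
        uris.append(urljoin(playlist_url, line))
    return uris


# Invented media playlist with two segments.
SAMPLE_PLAYLIST = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
seg-1.ts
#EXTINF:10.0,
seg-2.ts
#EXT-X-ENDLIST
"""

segments = parse_m3u8(SAMPLE_PLAYLIST, "https://cdn.example.com/video/hls.m3u8")
```

If the file you grabbed is a master playlist, the URIs point at per-quality playlists rather than segments, so you'd fetch one of those and parse it the same way.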

Anonymous
06/24/24(Mon)21:46:29 No.101138961

>>101138857
You'd have to give an example link because xvideos looks to be as simple as xhamster.

Anonymous
06/24/24(Mon)21:52:18 No.101139024

>>101138900
Checked, but I don't know how to go from the hls.m3u8 to mp4.

Anonymous
06/24/24(Mon)21:54:28 No.101139042

>>101138900
>>101139024
Stackoverflow is giving answers of VLC and ffmpeg. Is this where those memes come from?
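
The ffmpeg route usually comes down to one command, `ffmpeg -i <playlist url> -c copy out.mp4`, which downloads the segments and remuxes them without re-encoding. A hedged sketch wrapping it from Python; the function names and the example URL are hypothetical, and it assumes the playlist can be fetched without extra headers or cookies.

```python
import shutil
import subprocess


def build_ffmpeg_cmd(m3u8_url: str, out_path: str) -> list[str]:
    # -i reads the playlist; -c copy copies the streams as-is (no re-encode).
    return ["ffmpeg", "-i", m3u8_url, "-c", "copy", out_path]


def download_hls(m3u8_url: str, out_path: str) -> None:
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(build_ffmpeg_cmd(m3u8_url, out_path), check=True)


cmd = build_ffmpeg_cmd("https://cdn.example.com/video/hls.m3u8", "out.mp4")
```

Passing the command as a list rather than a shell string avoids quoting problems with the `&` and `?` characters that tend to show up in playlist URLs.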

Anonymous
06/24/24(Mon)21:55:43 No.101139055

>>101138294
yeah, not a bad idea. Thanks anon

Anonymous
06/24/24(Mon)21:56:34 No.101139065

>>101139055
Thanks and checked again

Anonymous
06/24/24(Mon)22:17:28 No.101139313

>>101139065
>>101139055
>>101139042
>>101139024
>>101138857
>>101138900
Thanks based anon, I figured it out.

Anonymous
06/25/24(Tue)01:22:36 No.101140810

'mp

Anonymous
06/25/24(Tue)04:37:17 No.101142206

>>101136885
You mean betting

Anonymous
06/25/24(Tue)08:59:28 No.101144339

>>101135838
I am trying to scrape a website with hproxy.com residential proxies but all the connections are getting rejected by the proxy server. It seems that the server for some reason rejects all requests for .gov domains, since they work just fine on regular websites. Am I correct? Is there any service that offers residential proxies and will not BTFO my scraper for that?

Anonymous
06/25/24(Tue)08:59:56 No.101144344

I have a few questions. I'm new to coding, so what I ask may overlap or come from ignorance, but here they are if anyone is willing to help.

First, my understanding is that scraping is just the act of using an application to better understand how finalized data functions. It doesn't give a concise answer as to how each piece of data was created to provide functionality (basically, it doesn't show you each line of code).
From my understanding it can give an idea of how something was created based on interactions, but you can still end up not understanding how it works. For example, if you were trying to create something similar and were looking for insight, you could still be left with questions. Is this wrong?

Second, an encryption/decryption step-by-step example (please correct me if wrong):
Every website has backend code which is executed when invoked. Next, they use HTML to structure the code and the content on the website, and then an API which acts as "housing" for the frontend and backend requests.
The providers of the service do not want third-party access to their material, so they use a protocol like TLS to encrypt this info (I have questions about the algorithms and how they function, but they seem too case-specific to include here).
Once it's encrypted, I assume this is where something like mitmproxy comes in. I understand that mitmproxy intercepts requests, but can it understand them, i.e. is it able to decrypt the traffic? Does mitmproxy log anything that cannot be found in the browser's devtools?
As far as I know, that is the extent of the encryption/decryption steps for the specific example I used. Is this correct?

Thanks in advance anon

Anonymous
06/25/24(Tue)09:02:49 No.101144388

>>101144339
DM the owner, @screenshot on telegram.

Anonymous
06/25/24(Tue)09:03:50 No.101144395

>>101144344
Please, PLEASE read the FAQ

Anonymous
06/25/24(Tue)09:09:48 No.101144452

>>101144395
didn't see faq my bad