/g/ - Technology






File: scraper.png (1.62 MB, 1892x2142)
Web Scraping General

Reverse engineering edition pt 3

QOTD: What music do you listen to while scraping?

FAQ: rentry co/t6237g7x

> Captcha services
https://2captcha.com/
https://www.capsolver.com/
https://anti-captcha.com/

> Proxies
https://hproxy.com/ (no blacklist) (recommended, owned by friend of /wsg/)
https://infiniteproxies.com/ (no blacklist)
https://www.thunderproxies.com/
http://proxies.fo/ (not recommended)

> Network analysis
https://mitmproxy.org/
https://portswigger.net/burp

> Scraping tools
https://beautiful-soup-4.readthedocs.io/en/latest/
https://www.selenium.dev/documentation/
https://playwright.dev/docs/codegen
https://github.com/lwthiker/curl-impersonate
https://github.com/yifeikong/curl_cffi

Official Telegram: @scrapists
Last thread: >>101090758
>>
My penis exhales in cloudflare's nostrils
>>
mcbroken.com

what is your excuse for not making such a scraping project?
>>
>>101136030
Holy based
>>
>scrape the website
>miss some data because of random edge cases
>don't know exactly what items have those edge cases without checking all millions of them separately
>the fag owner adds Turnstile
>use anti-captcha to bypass it and scrape the whole website again
>still miss some data because of random edge cases
>the fag owner gets really mad and literally adds a global rate limit and makes the whole website unusable.
>can't scrape it anymore in a reasonable time
owari da.
>>
>>101136302
next time, store the urls you scraped properly in memcached or something, so you don't have to revisit them when you refactor and run your script again
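A minimal sketch of that idea, swapping memcached for the standard-library sqlite3 module so it runs without a server. The class and function names (`SeenUrls`, `pending`) are made up for illustration.

```python
import sqlite3


class SeenUrls:
    """Persistent set of URLs that were already scraped (hypothetical helper)."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to survive restarts.
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")
        self.db.commit()

    def mark(self, url):
        # INSERT OR IGNORE keeps marking idempotent if a URL is seen twice.
        self.db.execute("INSERT OR IGNORE INTO seen (url) VALUES (?)", (url,))
        self.db.commit()

    def __contains__(self, url):
        row = self.db.execute("SELECT 1 FROM seen WHERE url = ?", (url,)).fetchone()
        return row is not None


def pending(urls, seen):
    """Return only the URLs that still need to be fetched, in order."""
    return [u for u in urls if u not in seen]
```

Call `seen.mark(url)` only after the item was parsed and stored successfully, so a rerun after a refactor skips finished pages and retries the ones that failed.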
>>
what do you usually scrape?
>>
>>101136873
football stuff.
it pays a lot.
>>
>>101135838
I usually use scraping to auto download porn. I rarely find paid work. Probably because I'm still a noob. I've been having trouble downloading from xhamster. I looked at the network tab, but I can't figure out how the videos are being requested by the API. Only the ads show up. Can anyone help me understand what's going on?
>>
>>101136946
yt-dlp doesn't work for most porn sites?
>>
>>101136983
Damn never heard of it haha. Will report back
>>
>>101135838
https://www.youtube.com/watch?v=E7rwRqnkymw
>>
>>101136030
>Just reverse engineer an app bro and get sued too!
>>
>>101136946
Just grab the page itself and extract the slab of json within the tag 'initials-script'
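A rough sketch of that approach using only the standard library. It assumes 'initials-script' is the `id` attribute of a `<script>` element and that its body is a JavaScript assignment wrapping a single JSON object; both are assumptions, and `extract_script_json` is a made-up name.

```python
import json
import re


def extract_script_json(html, script_id):
    """Return the first JSON object found inside <script id=script_id>."""
    pattern = re.compile(
        r'<script[^>]*\bid=["\']' + re.escape(script_id) + r'["\'][^>]*>(.*?)</script>',
        re.DOTALL | re.IGNORECASE,
    )
    match = pattern.search(html)
    if match is None:
        raise ValueError(f"no <script id={script_id!r}> found")
    body = match.group(1)
    start = body.find("{")
    if start == -1:
        raise ValueError("script body contains no JSON object")
    # raw_decode parses one JSON value and ignores trailing JS such as ';'.
    data, _ = json.JSONDecoder().raw_decode(body[start:])
    return data
```

From there it is a matter of walking the returned dict to find whichever keys hold the media URLs; those key names vary by site and are not shown here.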
>>
I'm building a job application bot to apply to jobs using playwright and beautiful soup. The code is pretty decent, it's gotten to a point where it can handle most of the custom questions HR throws my way. The downside is that I have a mountain of if statements looking for keywords in each of the form fields. Is there a better way to handle this? I'm playing around with several design patterns to reduce the size of the if blocks but I'm at a loss. For reference:
```
# SingleWordKeys, Applicant and self._matcher are defined elsewhere in the project.
from bs4 import NavigableString, Tag


def validate_custom(self, field: Tag, applicant: Applicant):
    # The <label> wrapping the input carries the question text.
    label_field = field.find_parent("label")

    if not label_field or not isinstance(label_field, Tag):
        raise Exception("unknown field", field)

    input_id, input_name, correct_field = "", "", ""
    # Prefer the label's first text node; fall back to all text inside the label.
    upper_text = label_field.next.upper() if isinstance(label_field.next, NavigableString) else label_field.text.upper()

    # Pick a matcher based on keywords found in the label text.
    if SingleWordKeys.LINKEDIN.value in upper_text:
        input_id, input_name, correct_field = self._matcher.match_linkedin(field, label_field, applicant)
    elif SingleWordKeys.GITHUB.value in upper_text:
        input_id, input_name, correct_field = self._matcher.match_github(field, label_field, applicant)
    elif SingleWordKeys.SALARY.value in upper_text:
        input_id, input_name, correct_field = self._matcher.match_salary(field, label_field)
    elif any(string in upper_text for string in self._start_date_keys):
        input_id, input_name, correct_field = self._matcher.match_earliest_start(field, label_field)
    elif any(string in upper_text for string in self._citizenship_keys):
        input_id, input_name, correct_field = self._matcher.match_citizenship(field, label_field)
    return label_field, input_id, input_name, correct_field
```

Obviously I cut it down to prove a point but this is starting to balloon and I know there's a better way to handle this
>>
>>101137203
> imagine caring about the law
> not using a bunch of proxies and being impossible to find (and sue)

>>101138113
You could have a dict that's like
listkeys = {
    SingleWordKeys.LINKEDIN.value: self._matcher.match_linkedin,
    SingleWordKeys.GITHUB.value: self._matcher.match_github,
    ...
}

Then just use something like for k, v in listkeys.items() and handle the logic that way
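A self-contained sketch of that lookup-table idea. The handler functions and keyword strings below are hypothetical stand-ins, not the matcher from the post being replied to, and an ordered list of (keywords, handler) pairs is used instead of a plain dict so that multi-keyword rules like the start date one fit too.

```python
def match_linkedin(label):
    # Stand-in handler; a real one would locate and fill the input.
    return "linkedin"


def match_github(label):
    return "github"


def match_start_date(label):
    return "start_date"


# Ordered rules: the first rule with a keyword found in the label wins.
RULES = [
    (("LINKEDIN",), match_linkedin),
    (("GITHUB",), match_github),
    (("START DATE", "AVAILABLE TO START"), match_start_date),
]


def dispatch(label_text):
    """Run the first handler whose keywords appear in the label, else None."""
    upper = label_text.upper()
    for keywords, handler in RULES:
        if any(keyword in upper for keyword in keywords):
            return handler(label_text)
    return None
```

Adding a new question type then means adding one row to `RULES` rather than another elif branch. Handlers that need extra arguments (such as the applicant) can be wrapped with `functools.partial` or a lambda so every entry keeps the same call signature.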
>>
>>101137442
>>101136983
Figured it out. There's an amphtml link that takes me to a video that has the full source.
>>
>>101137442
>>101138548
Ok, xvideos is a little harder. They prevent downloads from certain channels and I can't seem to find or access the video source. Every YouTube scraping tutorial teaches fuck all about requests
>>
>>101138857
Xvideos is using HLS, if you use the network tab on chrome you can see it requesting the m3u8 file (and you should be able to use this to get the whole stream and convert it to something like mp4)
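As a rough illustration of what that m3u8 file contains: an HLS playlist is a text file where lines starting with `#` are tags and the remaining non-empty lines are URIs, often relative to the playlist's own URL. A minimal sketch that resolves them, with `segment_urls` being a made-up helper name:

```python
from urllib.parse import urljoin


def segment_urls(playlist_text, playlist_url):
    """Return absolute URLs for every URI line in an HLS playlist."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Skip blank lines and tag/comment lines such as #EXTINF.
        if not line or line.startswith("#"):
            continue
        urls.append(urljoin(playlist_url, line))
    return urls
```

On a master playlist the same function returns the per-quality variant playlists rather than media segments, so it may need to be applied twice. Encrypted streams (`#EXT-X-KEY`) are not handled here.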
>>
>>101138857
You'd have to give an example link because xvideos looks to be as simple as xhamster.
>>
>>101138900
Checked, but I don't know how to go from the hls.m3u8 to mp4.
>>
>>101138900
>>101139024
Stackoverflow is giving answers of VLC and ffmpeg. Is this where those memes come from?
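For the ffmpeg route those answers point to, a minimal sketch: ffmpeg can read an m3u8 URL directly as input, and `-c copy` remuxes the stream into the output container without re-encoding. The function names are made up for illustration, and this assumes ffmpeg is installed and the stream needs no extra headers or cookies.

```python
import shutil
import subprocess


def ffmpeg_remux_command(playlist_url, output_path):
    """Build the ffmpeg argument list for copying an HLS stream into a file."""
    return ["ffmpeg", "-i", playlist_url, "-c", "copy", output_path]


def remux(playlist_url, output_path):
    """Run ffmpeg; raises if ffmpeg is missing or exits with an error."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(ffmpeg_remux_command(playlist_url, output_path), check=True)
```

Usage would be along the lines of `remux("https://example.com/hls.m3u8", "out.mp4")`, where the URL is the playlist seen in the network tab.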
>>
>>101138294
yeah, not a bad idea. Thanks anon
>>
>>101139055
Thanks and checked again
>>
>>101139065
>>101139055
>>101139042
>>101139024
>>101138857
>>101138900
Thanks based anon, I figured it out.
>>
'mp
>>
>>101136885
You mean betting
>>
>>101135838
I am trying to scrape a website with hproxy.com residential proxies but all the connections are getting rejected by the proxy server. It seems that the server for some reason rejects all requests for .gov domains, since they work just fine on regular websites. Am I correct? Is there any service that offers residential proxies and will not BTFO my scraper for that?
>>
I have a few questions, I'm new to coding so what I ask may have overlap or lead to confusion due to ignorance but nonetheless here they are if anyone is willing to help.

First, my understanding is that scraping is just the act of using an application to better understand how finalized data functions; it doesn't provide a concise answer as to how each piece of data was created to provide functionality (it doesn't tell you each string of code, basically),
from my understanding it can be used to give an idea of how it was created based on interactions but it is still possible to not understand how it works. Which for example if you were trying to create something similar and were looking for insight you could still be left with questions (is this wrong?)

Second, encryption/decryption step by step example (if wrong please correct):
Every website has its backend code which is executed when invoked, next they use HTML to structure the code and the content on the website, they then use an API which acts as "housing" for the front and backend requests.
The providers of the service do not want 3rd party access to their material so they use a protocol like TLS to encrypt this info (I have questions about the algorithms and how they function but they seem too case specific to include in this),
once encrypted I assume this is where something like mitmproxy comes in (I understand mitmproxy intercepts requests, but is it able to understand the requests based on being able to translate the encryption?) (does mitmproxy log anything that cannot be found in the devtools of the website?).
As far as I know this seems to be the extent of the steps for encryption/decryption in the specific example I used. Is this correct?

Thanks in advance anon
>>
>>101144339
DM the owner, @screenshot on telegram.
>>
>>101144344
Please, PLEASE read the FAQ
>>
>>101144395
didn't see faq my bad


