/g/ - Technology

File: file.png (1.85 MB, 1892x2142)
I can't stop scraping.
>>
>>101582470
Luminati is now called BrightData.
Old meme. Good service, irregardless.
https://brightdata.com/luminati
>>
>>101582498
>irregardless
try scraping a dictionary you dumb bitch
>>
>>101582879
it's a word thoughbeit
>>
You know what's funny, I do this for a living
I'm right now lying in my bed in a home I bought from my job webscraping
And if anyone ever tried to bring up this image in an interview, they'd most likely get turned down for anything other than a junior position
Like, just the idea that these are choices you have is already so naive, let alone thinking HTML scraping is in any way superior to an API
Do me a favor, try scraping pinnacle.com using only HTML, no APIs, since it's a choice
You can't, right? Like, the HTML is just a template filled by *gasp* API requests?!?!
So you end up using Selenium, running an entire browser to do the same API request you were just mocking, while also doing a shitton more requests, rendering the page, and overall spending about 100x more resources than doing that one single request
Dunning-Kruger isn't real, but this image is the best representation of it I've seen: being extremely smug about a topic you know absolutely nothing about.
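
A rough sketch of the "one single request" approach, assuming you already found the JSON endpoint in the browser's Network tab. The URL is a placeholder, not a real pinnacle.com endpoint, and `fetch_json` is a made-up helper name. Real sites often also want cookies, tokens, or extra headers.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the one seen in the Network tab.
API_URL = "https://example.com/api/v1/odds?sport=soccer"


def fetch_json(url, opener=urllib.request.urlopen, timeout=10):
    """Request the JSON endpoint directly instead of rendering the page."""
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/json",
            # Some backends reject the default Python user agent.
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
        },
    )
    # `opener` is injectable so the function can be tested offline.
    with opener(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage would be something like `data = fetch_json(API_URL)`: one request and one JSON parse, with no browser involved.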
>>
>>101582470
>data is 10 minutes behind
?
>>
>>101585324
Top is a person using a documented API provided by the company, with limited access and conditions they have to follow.

Below still uses the API, but none of it is documented and corporate is actively trying to stop them from doing so.
>>
Is there a way to scrape a page after it has loaded fully without rendering everything with selenium?
Vinted.com comes to mind for a site that takes ages to load.
>>
File: bdd0kylwwsk81.jpg (82 KB, 1080x606)
>>101582470
>Found the general
I'm trying to scrape listcrawler to run analytics on WHOOOERS. So I click on a posted ad and from there want to go to that bitch's review page.
>Inspect
> <a> tag with href is nowhere to be found
However, when I log in and inspect that same element, the a tag I want is there somehow. I know I could easily do this by having Selenium chromedriver click on the element, but I'm trying to git gud as well as run analytics. Can anyone explain what's going on? And perhaps how to scrape the data without logging in or using Selenium?
>>
>>101586158
"Eager" page load strategy (return control to script as soon as DOM is ready without waiting for all images, styles, scripts) + explicit WebdriverWaits on the relevant elements
>>
>>101585324
This >>101585511
Why would I cream my pants over parsing HTML in particular? The image is a literal meme. Of course touching JSON APIs isn't evil, that'd be completely retarded. There are options for interacting with and parsing data from sites with different tradeoffs for data/interactivity completeness, durability against external changes, reliability, implementation complexity, and resource intensity.
That is all to say that engineering is full of options and tradeoffs, and I really think everyone in these threads already knows what you are pointing out. Modern "scraping" in a world where Javascript-heavy and mobile-only software companies want immense control over how external parties interact with their products involves everything from reverse engineering a private unofficial API to replay crafted requests, to parsing HTML, to opening thin browsers that will solve captchas for you and intercepting JSON as it comes on the wire. They are all just paths to solutions.
People processing data from Facebook, Reddit, LinkedIn, or Twitter in July of 2024, services that constantly break their APIs, gate clients behind complex Javascript puzzles and captchas, and generally make life difficult for anyone who has not signed certain contracts, don't choose full Selenium-type scraping because they have a fetish for XML trees and high RAM usage.
>>
Playwright is such a cocktease. Literally never works more than once. Come back next week and it doesn't work.
>>
>>101585324
tism
>>
How can I make a living with web scraping?
>>
Literally everyone should have a home server
>>
>>101585324
so (You) can't reverse engineer APIs to scrape shit? you must be fucking retarded
how much are you getting paid? maybe I should do this for a living too...
>>
you scrapers all pay to get around captchas right? proxy IPs alone surely can't be enough, sometimes you'll get a Cloudflare or Google reCAPTCHA and there is nothing you can do then. Tell me how to beat you and destroy your scraping attempts. What would make it impossible for you to scrape? I already have tons of ideas, you will never beat me. You might as well surrender and tell me
>>
>>101589088
>There's nothing you can do
That's where you're wrong. Captchas are solved by AI or outsourced to Indians to solve for fractions of a cent in real time
>>
File: 1721855040876788.gif (507 KB, 287x373)
>>101589088
You cannot have an unscrapeable site in the same way you cannot have a perfectly secure computer or an unspammable email. All you can do in both cases is raise the base required effort floor and (time/financial) cost required for an "adversary".
Where they are unavoidable by other means, captchas are promptly sent out to advanced AI (All Indian) networks for a fee. The point of captchas, as well as phone verification and proof-of-work is not that no bot will ever be able to get past them, but that the increased financial cost and complexity deters attempts in theory.
If someone with technical capability (for example, the autists around these threads,) is willing to pay decent money and infinite time to deal with YOU, it's over. They will show up to your server with patched headless browsers that look and behave exactly like your users, pierce Cloudflare in less time than you can blink and stomp Datadome with exploits, then take what they want. If you require accounts, they will pay thirdworlders for temporary access to their number farm and create a thousand.
These things can be more than decent for keeping out low tech bots and spiders looking across the entire web. Not because they are even necessarily good at their job, but because it increases the required effort and cost. If someone is conducting that sort of large-scoped operation, the gain from stopping to deal with you in particular isn't worth it with easier and "cheaper" equivalent targets available.
>>
>>101589963
>>101589559
I could require a google login and suddenly there is no way a scraper will be able to get around it except by making a new google account every time I figure out that he's a scraper. Suddenly I outsourced this problem to a multi billion dollar company who defends me for free. You just got owned
>>
File: 1722058450263.jpg (943 KB, 1040x1040)
>>101589088
I'm a no-coder, so take this with a pinch of salt, but couldn't you create some sort of proof-of-work that must be solved before your server will send the data? It seems like that would wreck the scraper. It would make honest visitors to your site have a worse experience, but waiting 5 seconds isn't THAT bad

Again, I've never coded a day in my life, but this solution came out of my head.
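
For reference, the idea looks roughly like this. It is a hashcash-style sketch, not a specific product's scheme: the function names, SHA-256, and the leading-zero-bits rule are assumptions picked for illustration. A real deployment would also need to expire challenges so solved ones can't be replayed.

```python
import hashlib
import os


def make_challenge():
    """Server side: a random challenge the client must work on."""
    return os.urandom(16).hex()


def leading_zero_bits(digest):
    """Count leading zero bits in a bytes digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge, nonce, difficulty):
    """Server side: a single hash, so checking is cheap."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge, difficulty):
    """Client side: brute force, about 2**difficulty hashes on average."""
    nonce = 0
    while not verify(challenge, nonce, difficulty):
        nonce += 1
    return nonce
```

Each extra bit of `difficulty` doubles the expected client work while the server still checks with one hash. The catch is that the cost lands on every visitor, not just the scraper.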
>>
>>101590189
My guy, those are sold for like a quarter each max.
>>
>>101590498
I'm not talking about the google captcha, I'm talking about requiring the indian/AI captcha solver to log into a google account first before he even gets a captcha. If you think this too is so cheap then show me a site that touts this as a feature and sells it. Because discord works like this, they require an account to view any content whatsoever and this is precisely why discord content is hidden from every search engine and crawler.
>>
>>101590228
nowadays devices are pretty strong, so the issue is: how much work are you going to make the guy do? 3 sec of 100% CPU on an Intel 2700K? that's not going to increase the cost of some random Indian captcha solver service by much of anything, but it's actually quite turbo annoying for regular users if I'm requiring them to max out their hardware for a few seconds. It's almost entirely pointless in fact, you might as well not do any work and just do a proof of time (aka forcing the user to wait a fixed 5 seconds)
>>
>read up on their API
>create the required account and key
>do everything as documented like a good little cuck
>get shittier results from the API than I'd get from a simple HTML scraper
I fucking hate opensubtitles.
>>
>>101591067
>I'm not talking about the google captcha
I know.
I'm talking about Google accounts.
It's really not that hard to buy them, log in in the browser, then get your captcha and send it off for solving.
>>
>>101591130
as expected, you were wrong. A gmail account costs $1.50. Have fun paying this constantly every time I manage to block one
https://useviral.com/buy-gmail-account
>>
>>101590189
I refuse to use any site that requires me to log in with a Google or Facebook account.
>>
>>101590189
>I could require a google login
Good job, your site is now totally dead
>>
>>101591219
the principle is the same for any account wall, it doesn't have to be an external provider although it makes it easier because I don't have to pay for SMS. Discord requires an account too you know? And plenty of people use Discord
>>
>>101582470
Whenever I visit a site I find useful I
wget2 -r -k -NP --max-threads=6
it and back it up to both my backup drives.

Is this retarded?
>>
>>101586469
sir, data is dynamically loaded with javascript
use a headless browser
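
A quick way to check before reaching for a browser: fetch the raw HTML and see whether the link is in it at all. If it is missing, it is either injected by Javascript or only rendered for logged-in sessions. This is a stdlib sketch; `LinkCollector` and `links_in_raw_html` are made-up names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags in server-sent HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_in_raw_html(html):
    """Return every <a href> found in the markup, with no JS executed."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs
```

Feed it the response body you get without cookies, then again with your logged-in session cookie. If the link only shows up in the second one, the server is gating it on the session and no amount of rendering will help.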
>>
>>101591253
People are much less likely to use a site that requires you to register for basic shit too.
>>
>>101589088
>>101589963
>>101590228
>>101585324
Transgender take
>>
Does anyone scrape civitai? They have an API. Is it better to scrape using the API or through the website itself?
>>
How do I stop people like you? I've got a pretty fancy fail2ban setup with some clever heuristics in nginx, but I'd like to know if there might be more effective ways.

t. Running a smol homeserver, fuck scrapers.
>>
>>101585324
giga nigga trvthnvke
>>
>>101591809
>smol server
Your choices are to either throttle-fuck anyone not on some kind of whitelist, pay out the ass for extortionate third party protection or to get fucked.
By the time automated scraper traffic starts looking any different from normal traffic your server's already down.
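
The "throttle anyone not on a whitelist" option, as a minimal token bucket sketch. The class name and the rate and burst parameters are illustrative. On a real nginx box this would more likely be `limit_req` than application code.

```python
import time


class Throttle:
    """Per-IP token bucket; allowlisted IPs skip the limit."""

    def __init__(self, rate, burst, allowlist=(), clock=time.monotonic):
        self.rate = rate            # tokens refilled per second
        self.burst = burst          # bucket capacity
        self.allowlist = set(allowlist)
        self.clock = clock          # injectable for testing
        self.buckets = {}           # ip -> (tokens, last_seen); never pruned here

    def allow(self, ip):
        """Return True if this request may proceed."""
        if ip in self.allowlist:
            return True
        now = self.clock()
        tokens, last = self.buckets.get(ip, (self.burst, now))
        # Refill for the time elapsed, capped at the bucket size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[ip] = (tokens, now)
            return False
        self.buckets[ip] = (tokens - 1, now)
        return True
```

Per-IP limits like this do little against scrapers rotating through proxy pools, which is the point the post is making.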


