I can't stop scraping.
>>101582470
Luminati is now called BrightData. Old meme. Good service, irregardless.
https://brightdata.com/luminati

>>101582498
>irregardless
try scraping a dictionary you dumb bitch

>>101582879
it's a word thoughbeit

You know what's funny, I do this for a living. I'm right now lying in my bed in a home I bought from my job webscraping. And if anyone ever tried to bring up this image in an interview, they'd most likely get turned down for anything other than a junior position. Like, just the idea that these are choices you have is already so naive, let alone thinking HTML scraping is in any way superior to an API. Do me a favor, try scraping pinnacle.com using only HTML, no APIs, since it's a choice. You can't, right? Like, the HTML is just a template filled by *gasp* API requests?!?! So you end up using selenium, running an entire browser to do the same API request you were just mocking, while also doing a shitton more requests, rendering the page and overall spending about 100x more resources than doing that one single request. Dunning-Kruger isn't real, but this image is the best representation of it I've seen: being extremely smug about a topic you know absolutely nothing about.
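
A minimal sketch of the "one single request" point, assuming the page fills its template from a JSON endpoint spotted in the browser's network tab. `API_URL`, the query parameters, and the `items` key are hypothetical placeholders, not anything pinnacle.com or any other site in this thread is known to expose.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint: stands in for whatever JSON URL the page's own
# JavaScript calls. It is not a real API of any site named in this thread.
API_URL = "https://example.com/api/v1/odds"


def build_request(url: str, params: dict) -> urllib.request.Request:
    """Build the same kind of GET request the page's JavaScript would send."""
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(
        f"{url}?{query}",
        headers={"Accept": "application/json", "User-Agent": "Mozilla/5.0"},
    )


def parse_rows(raw: bytes) -> list:
    """Decode the JSON body; the 'items' key is an assumed payload shape."""
    return json.loads(raw.decode("utf-8")).get("items", [])


def fetch_rows(params: dict) -> list:
    """One HTTP round trip: no browser, no rendering, no extra asset requests."""
    request = build_request(API_URL, params)
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_rows(response.read())
```

Calling `fetch_rows({"sport": "tennis"})` would do the whole job in one request, provided the real endpoint accepts unauthenticated calls, which is not a given.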
>>101582470
>data is 10 minutes behind?

>>101585324
Top is a person using a documented API provided by the company, which has limited access and conditions they have to follow
Below still uses the API, but none of it is documented and corporate is actively trying to stop them from doing so

Is there a way to scrape a page after it has loaded fully without rendering everything with selenium? Vinted.com comes to mind for a site that takes ages to load.
>>101582470
>Found the general
I'm trying to scrape listcrawler to run analytics on the ads. So I click on a posted ad and from there want to go to the poster's review page.
>Inspect
><a> tag with href is nowhere to be found
However, when I log in and inspect that same element, the <a> tag I want is there somehow. I know I could easily do this by having Selenium chromedriver click on the element, but I'm trying to git gud as well as run analytics. Can anyone explain what's going on? And perhaps how to scrape the data without logging in or using Selenium?

>>101586158
"Eager" page load strategy (return control to the script as soon as the DOM is ready, without waiting for all images, styles, and scripts) + explicit WebDriverWait calls on the relevant elements

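
A rough sketch of that combination with the Selenium 4 Python bindings. The function name, the headless flag, and the default timeout are assumptions for illustration. The URL and CSS selector are whatever the caller passes in.

```python
def fetch_when_ready(url: str, css_selector: str, timeout: float = 15.0) -> str:
    """Return the text of the first element matching css_selector."""
    # Imported inside the function so this file still loads on a machine
    # without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    # "eager": driver.get() returns once the DOM is ready, without waiting
    # for images, stylesheets, and subframes to finish loading.
    options.page_load_strategy = "eager"
    options.add_argument("--headless=new")  # assumption: no visible window needed

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Explicit wait: block only until the one element we care about exists.
        element = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return element.text
    finally:
        driver.quit()
```

This still runs a full browser, so it only shortens the wait. It does not avoid rendering altogether.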
>>101585324
This >>101585511
Why would I cream my pants over parsing HTML in particular? The image is a literal meme. Of course touching JSON APIs isn't evil; thinking so would be completely absurd. There are options for interacting with and parsing data from sites with different tradeoffs for data/interactivity completeness, durability against external changes, reliability, implementation complexity, and resource intensity.
That is all to say that engineering is full of options and tradeoffs, and I really think everyone in these threads already knows what you are pointing out. Modern "scraping" in a world where Javascript-heavy and mobile-only software companies want immense control over how external parties interact with their products involves everything from reverse engineering a private unofficial API to replay crafted requests, to parsing HTML, to opening thin browsers that will solve captchas for you and intercepting JSON as it comes on the wire. They are all just paths to solutions.
People processing data from Facebook, Reddit, Linkedin, or Twitter in July of 2024, services that constantly break their APIs, gate clients behind complex Javascript puzzles and captchas, and generally make life difficult for anyone who does not sign certain contracts, don't choose full Selenium-type scraping because they have a fetish for XML trees and high RAM usage.

Playwright is such a cocktease. Literally never works more than once. Come back next week and it doesn't work.
>>101585324
tism

How can I make a living with web scraping?
Literally everyone should have a home server
>>101585324
so (You) can't reverse engineer APIs to scrape shit? you must be fucking stupid
how much are you getting paid? maybe I should do this for a living too...

you scrapers all pay to get around captchas right? proxy IPs alone surely can't be enough, sometimes you'll get a cloudflare or google recaptcha and there is nothing you can do then. Tell me how to beat you and destroy your scraping attempts. What would make it impossible for you to scrape? I already have tons of ideas, you will never beat me. You might as well surrender and tell me
>>101589088
>There's nothing you can do
That's where you're wrong
Captchas are solved by AI or outsourced to Indians to solve for fractions of a cent in real time

>>101589088
You cannot have an unscrapeable site in the same way you cannot have a perfectly secure computer or an unspammable email. All you can do in both cases is raise the minimum effort and the (time/financial) cost required of an "adversary".
Where they are unavoidable by other means, captchas are promptly sent out to advanced AI (All Indian) networks for a fee. The point of captchas, as well as phone verification and proof-of-work, is not that no bot will ever be able to get past them, but that the increased financial cost and complexity deters attempts in theory.
If someone with technical capability (for example, the obsessives around these threads) is willing to spend decent money and unlimited time to deal with YOU, it's over. They will show up to your server with patched headless browsers that look and behave exactly like your users, pierce Cloudflare in less time than you can blink and stomp Datadome with exploits, then take what they want. If you require accounts, they will pay a number farm for temporary access and create a thousand.
These protections can be more than decent for keeping out low tech bots and spiders looking across the entire web. Not because they are even necessarily good at their job, but because they increase the required effort and cost. If someone is conducting that sort of large-scoped operation, the gain from stopping to deal with you in particular isn't worth it with easier and "cheaper" equivalent targets available.

>>101589963
>>101589559
I could require a google login and suddenly there is no way a scraper will be able to get around it except by making a new google account every time I figure out that he's a scraper. Suddenly I've outsourced this problem to a multi-billion dollar company who defends me for free. You just got owned

>>101589088
I'm a no-coding functional idiot, so take this with a pinch of salt, but couldn't you create some sort of proof-of-work that must be solved before your server will send the data? It seems like that would wreck the scraper. It would also make honest visitors to your site have a worse experience, but waiting 5 seconds isn't THAT bad
Again, I've never coded a day in my life, but this solution came out of my head.

>>101590498
My guy, those are sold for like a quarter each max.

>>101590498
I'm not talking about the google captcha, I'm talking about requiring the indian/AI captcha solver to log into a google account first before he even gets a captcha. If you think this too is so cheap then show me a site that touts this as a feature and sells it. Because discord works like this, they require an account to view any content whatsoever and this is precisely why discord content is hidden from every search engine and crawler.

>>101590228
nowadays the devices are pretty strong so the issue is: how much work are you going to make the guy do? 3 sec of 100% cpu on an intel 2700k? that's going to increase the cost of some random indian captcha solver service by almost nothing, but it's actually quite turbo annoying for regular users if I'm requiring them to max out their hardware for a few seconds. It's almost entirely pointless in fact, you might as well not do any work and just do a proof of time (aka forcing the user to wait a fixed 5 seconds)

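
For reference, a toy sketch of the two ideas being compared: a hashcash-style proof-of-work and a plain fixed delay. The choice of SHA-256, the "leading zero bits" difficulty measure, and every function name here are assumptions for illustration, not a scheme anyone in the thread described in detail.

```python
import hashlib


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash, cheap no matter how hard the puzzle was."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute force, about 2**difficulty hashes on average."""
    nonce = 0
    while not verify(challenge, nonce, difficulty):
        nonce += 1
    return nonce


def fixed_delay_ok(issued_at: float, now: float, min_wait: float = 5.0) -> bool:
    """The 'proof of time' alternative: only check that enough time has passed."""
    return now - issued_at >= min_wait
```

Each extra bit of `difficulty` doubles the expected client work, for scrapers and regular visitors alike, which is the tradeoff the post above is complaining about.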
>read up on their API
>create the required account and key
>do everything as documented like a good little cuck
>get shittier results from the API than I'd get from a simple HTML scraper
I fucking hate opensubtitles.

>>101591067
>I'm not talking about the google captcha
I know.
I'm talking about Google accounts.
It's really not that hard to buy them, log in in the browser, then get your captcha and send it off for solving.

>>101591130
as expected, you were wrong. A gmail account costs $1.50. Have fun paying this constantly every time I manage to block one
https://useviral.com/buy-gmail-account

>>101590189
I refuse to use any site that requires me to log in with a Google or Facebook account.

>>101590189
>I could require a google login
Good job, your site is now totally dead

>>101591219
the principle is the same for any account wall, it doesn't have to be an external provider although it makes it easier because I don't have to pay for SMS. Discord requires an account too you know? And plenty of people use Discord

>>101582470
Whenever I visit a site I find useful I wget2 -r -k -NP --max-threads=6 it and back it up to both my backup drives.
Is this stupid?

>>101586469
sir, data is dynamically loaded with javascript
use a headless browser

>>101591253
People are much less likely to use a site that requires you to register for basic shit too.

>>101589088>>101589963>>101590228>>101585324Transgender take
Does anyone scrape civitai? They have an API. Is it better to scrape using the API or through the website itself?
How do I stop people like you? I've got a pretty fancy fail2ban setup with some clever heuristics in nginx, but I'd like to know if there might be more effective ways. t. Running a smol homeserver, fuck scrapers.
>>101585324giga nigga trvthnvke
>>101591809
>smol server
Your choices are to either throttle-fuck anyone not on some kind of whitelist, pay out the ass for extortionate third party protection or to get fucked. By the time automated scraper traffic starts looking any different from normal traffic your server's already down.

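
The first option, throttling everyone who is not whitelisted, could look roughly like the per-IP token bucket below. This is a sketch under assumptions: the class names, the rate and capacity values, and keeping state in an in-memory dict are all made up for illustration, and timestamps are passed in explicitly so the logic is easy to test.

```python
class TokenBucket:
    """Allows short bursts up to `capacity`, then `rate` requests per second."""

    def __init__(self, rate: float, capacity: float, now: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so the first burst goes through
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to the time elapsed since the last request.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class Throttle:
    """One bucket per client IP; whitelisted IPs skip the limit entirely."""

    def __init__(self, rate: float, capacity: float, whitelist: set):
        self.rate = rate
        self.capacity = capacity
        self.whitelist = whitelist
        self.buckets = {}

    def allow(self, ip: str, now: float) -> bool:
        if ip in self.whitelist:
            return True
        bucket = self.buckets.get(ip)
        if bucket is None:
            bucket = self.buckets[ip] = TokenBucket(self.rate, self.capacity, now)
        return bucket.allow(now)
```

A limiter keyed on IP does nothing against a scraper rotating through a proxy pool, which is the same cost argument made earlier in the thread.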