/g/ - I can't stop scraping. - Technology



Anonymous
07/26/24(Fri)13:21:28 No.101582470

File: file.png (1.85 MB, 1892x2142)

I can't stop scraping.

Anonymous
07/26/24(Fri)13:23:54 No.101582498

>>101582470
Luminati is now called BrightData.
Old meme. Good service, irregardless.
https://brightdata.com/luminati

Anonymous
07/26/24(Fri)13:50:18 No.101582879

>>101582498
>irregardless
try scraping a dictionary you dumb bitch

Anonymous
07/26/24(Fri)15:05:42 No.101583851

>>101582879
it's a word thoughbeit

Anonymous
07/26/24(Fri)16:54:41 No.101585324

You know what's funny, I do this for a living
I'm right now lying in my bed in a home I bought from my job webscraping
And if anyone ever brought up this image in an interview, they'd most likely get turned down for anything other than a junior position
Like, just the idea that these are choices you have is already so naive, let alone thinking HTML scraping is in any way superior to an API
Do me a favor, try scraping pinnacle.com using only HTML, no APIs, since it's a choice
You can't, right? Like, the HTML is just a template filled by *gasp* API requests?!?!
So you end up using Selenium, running an entire browser to do the same API request you were just mocking, while also doing a shitton more requests, rendering the page, and overall spending about 100x more resources than doing that one single request
Dunning-Kruger isn't real, but this image is the best representation of it I've seen: being extremely smug about a topic you know absolutely nothing about.
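
For illustration, the "one single request" approach might look roughly like the sketch below. The endpoint URL, the `events` key and the `id`/`price` fields are made up for the example; the real ones would have to be read out of the browser's network tab, and a real site may also want cookies or extra headers.

```python
import json
import urllib.request

# Hypothetical endpoint standing in for whatever request the page's own
# JavaScript makes; it is NOT a real URL for any site in this thread.
API_URL = "https://example.com/api/v1/events?sport=soccer"


def build_request(url: str) -> urllib.request.Request:
    """Build a plain GET request that asks for JSON."""
    headers = {
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    }
    return urllib.request.Request(url, headers=headers)


def parse_events(payload: bytes) -> list[dict]:
    """Keep only the fields we care about (field names are assumptions)."""
    data = json.loads(payload)
    return [{"id": e["id"], "price": e["price"]} for e in data.get("events", [])]


def fetch_events(url: str = API_URL) -> list[dict]:
    """One HTTP round trip, no browser and no rendering."""
    with urllib.request.urlopen(build_request(url), timeout=10) as resp:
        return parse_events(resp.read())
```

Splitting the parsing from the fetching means the parser can be exercised on a saved response without hitting the site at all.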

Anonymous
07/26/24(Fri)16:56:48 No.101585365

>>101582470
>data is 10 minutes behind
?

Anonymous
07/26/24(Fri)17:08:18 No.101585511

>>101585324
Top is a person using a documented api provided by the company which has limited access and conditions they have to follow

Below still uses the api but none of it is documented and corporate is actively trying to stop them from doing so

Anonymous
07/26/24(Fri)18:00:17 No.101586158

Is there a way to scrape a page after it has loaded fully without rendering everything with selenium?
Vinted.com comes to mind for a site that takes ages to load.

Anonymous
07/26/24(Fri)18:20:35 No.101586469

File: bdd0kylwwsk81.jpg (82 KB, 1080x606)

>>101582470
>Found the general
I'm trying to scrape listcrawler to run analytics on WHOOOERS. So I click on a posted ad and from there want to go to that bitch's review page.
>Inspect
> <a> tag with href is nowhere to be found
However, when I log in and inspect that same element, the <a> tag I want is there somehow. I know I could easily do this by having Selenium chromedriver click on the element, but I'm trying to git gud as well as run analytics. Can anyone explain what's going on? And perhaps how to scrape the data without logging in or using Selenium?

Anonymous
07/26/24(Fri)19:13:31 No.101587138

>>101586158
"Eager" page load strategy (returns control to the script as soon as the DOM is ready, without waiting for all images, styles, and scripts) + an explicit WebDriverWait on each of the relevant elements
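
A rough sketch of that combination with Selenium 4 for Python, assuming Chrome and a matching driver are available. The function name, the 10-second timeout and the headless flag are choices made for the example, not anything from the thread.

```python
def scrape_when_ready(url: str, css_selector: str, timeout: float = 10.0) -> list[str]:
    """Return the text of every element matching css_selector once it exists."""
    # Imported lazily so this file still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    # "eager": driver.get() returns once the DOM is ready, without waiting
    # for images, stylesheets and subframes to finish loading.
    options.page_load_strategy = "eager"
    options.add_argument("--headless=new")  # assumption: no visible window needed

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Explicit wait: poll until the elements we actually need are present,
        # or raise TimeoutException after `timeout` seconds.
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector))
        )
        return [el.text for el in elements]
    finally:
        driver.quit()
```

Something like `scrape_when_ready("https://example.com/catalog", "div.item")` would be the call; both arguments there are placeholders.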

Anonymous
07/26/24(Fri)19:43:43 No.101587489

>>101585324
This >>101585511
Why would I cream my pants over parsing HTML in particular? The image is a literal meme. Of course touching JSON APIs isn't evil, that'd be completely retarded. There are options for interacting with and parsing data from sites with different tradeoffs for data/interactivity completeness, durability against external changes, reliability, implementation complexity, and resource intensity.
That is all to say that engineering is full of options and tradeoffs, and I really think everyone in these threads already knows what you are pointing out. Modern "scraping" in a world where JavaScript-heavy and mobile-only software companies want immense control over how external parties interact with their products involves everything from reverse engineering a private unofficial API to replay crafted requests, to parsing HTML, to opening thin browsers that will solve captchas for you and intercepting JSON as it comes on the wire. They are all just paths to solutions.
People processing data from Facebook, Reddit, LinkedIn, or Twitter in July of 2024, services that constantly break their APIs, gate clients behind complex JavaScript puzzles and captchas, and generally want to make your life difficult unless you sign certain contracts, don't choose full Selenium-type scraping because they have a fetish for XML trees and high RAM usage.

Anonymous
07/26/24(Fri)19:47:33 No.101587513

Playwright is such a cocktease. Literally never works more than once. Come back next week and it doesn't work.

Anonymous
07/26/24(Fri)21:02:52 No.101588219

>>101585324
tism

Anonymous
07/26/24(Fri)21:10:35 No.101588288

How can I make a living with web scraping?

Anonymous
07/26/24(Fri)21:15:21 No.101588330

Literally everyone should have a home server

Anonymous
07/26/24(Fri)21:16:05 No.101588336

>>101585324
so (You) can't reverse engineer APIs to scrape shit? you must be fucking retarded
how much are you getting paid? maybe I should do this for a living too...

Anonymous
07/26/24(Fri)22:44:53 No.101589088

you scrapers all pay to get around captchas, right? proxy IPs alone surely can't be enough, sometimes you'll get a Cloudflare or Google reCAPTCHA challenge and there is nothing you can do then. Tell me how to beat you and destroy your scraping attempts. What would make it impossible for you to scrape? I already have tons of ideas, you will never beat me. You might as well surrender and tell me

Anonymous
07/26/24(Fri)23:42:39 No.101589559

>>101589088
>There's nothing you can do
That's where you're wrong. Captchas are solved by AI or outsourced to Indians to solve for fractions of a cent in real time

Anonymous
07/27/24(Sat)00:46:46 No.101589963

File: 1721855040876788.gif (507 KB, 287x373)

>>101589088
You cannot have an unscrapeable site in the same way you cannot have a perfectly secure computer or an unspammable email. All you can do in each case is raise the floor of effort and (time/financial) cost required of an "adversary".
Where they are unavoidable by other means, captchas are promptly sent out to advanced AI (All Indian) networks for a fee. The point of captchas, as well as phone verification and proof-of-work is not that no bot will ever be able to get past them, but that the increased financial cost and complexity deters attempts in theory.
If someone with technical capability (for example, the autists around these threads,) is willing to pay decent money and infinite time to deal with YOU, it's over. They will show up to your server with patched headless browsers that look and behave exactly like your users, pierce Cloudflare in less time than you can blink and stomp Datadome with exploits, then take what they want. If you require accounts, they will pay thirdworlders for temporary access to their number farm and create a thousand.
These things can be more than decent for keeping out low tech bots and spiders looking across the entire web. Not because they are even necessarily good at their job, but because it increases the required effort and cost. If someone is conducting that sort of large-scoped operation, the gain from stopping to deal with you in particular isn't worth it with easier and "cheaper" equivalent targets available.

Anonymous
07/27/24(Sat)01:25:15 No.101590189

>>101589963
>>101589559
I could require a google login and suddenly there is no way a scraper will be able to get around it except by making a new google account every time I figure out that he's a scraper. Suddenly I outsourced this problem to a multi billion dollar company who defends me for free. You just got owned

Anonymous
07/27/24(Sat)01:34:15 No.101590228

File: 1722058450263.jpg (943 KB, 1040x1040)

>>101589088
I'm a no-coding functional retard, so take this with a pinch of salt, but couldn't you create some sort of proof-of-work that must be solved before your server will send the data? It seems like that would wreck the scraper. It would give honest visitors to your site a worse experience too, but waiting 5 seconds isn't THAT bad

Again, I've never coded a day in my life, but this solution came out of my head.
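
For reference, a hashcash-style version of that idea can be sketched in a few lines. This is only an illustration under assumptions: SHA-256, a `challenge:nonce` string format and the difficulty value are all picked for the example, and a real deployment would also need to expire challenges and tie them to a session.

```python
import hashlib
import os


def make_challenge() -> str:
    """Server side: hand the client a random, single-use challenge string."""
    return os.urandom(16).hex()


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash, so checking a solution is cheap."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute force, about 2**difficulty_bits hashes on average."""
    nonce = 0
    while not verify(challenge, nonce, difficulty_bits):
        nonce += 1
    return nonce
```

The asymmetry is the whole point: the server does one hash to check, while the client does thousands or millions to solve, and each extra difficulty bit doubles the client's expected work.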

Anonymous
07/27/24(Sat)02:15:25 No.101590498

>>101590189
My guy, those are sold for like a quarter each max.

Anonymous
07/27/24(Sat)03:46:16 No.101591067

>>101590498
I'm not talking about the google captcha, I'm talking about requiring the indian/AI captcha solver to log into a google account first before he even gets a captcha. If you think this too is so cheap then show me a site that touts this as a feature and sells it. Because discord works like this, they require an account to view any content whatsoever and this is precisely why discord content is hidden from every search engine and crawler.

Anonymous
07/27/24(Sat)03:48:29 No.101591078

>>101590228
nowadays devices are pretty strong, so the issue is: how much work are you going to make the guy do? 3 sec of 100% CPU on an Intel 2700K? That's barely going to increase the cost of some random Indian captcha solver service, but it's actually quite turbo annoying for regular users if I'm requiring them to max out their hardware for a few seconds. It's almost entirely pointless in fact, you might as well not do any work and just do a proof of time (aka forcing the user to wait a fixed 5 seconds)

Anonymous
07/27/24(Sat)03:52:55 No.101591102

>read up on their API
>create the required account and key
>do everything as documented like a good little cuck
>get shittier results from the API than I'd get from a simple HTML scraper
I fucking hate opensubtitles.

Anonymous
07/27/24(Sat)03:58:56 No.101591130

>>101591067
>I'm not talking about the google captcha
I know.
I'm talking about Google accounts.
It's really not that hard to buy them, log in in the browser, then get your captcha and send it off for solving.

Anonymous
07/27/24(Sat)04:10:14 No.101591194

>>101591130
as expected, you were wrong. A Gmail account costs $1.50. Have fun paying this constantly every time I manage to block one
https://useviral.com/buy-gmail-account

Anonymous
07/27/24(Sat)04:14:05 No.101591219

>>101590189
I refuse to use any site that requires me to log in with a Google or Facebook account.

Anonymous
07/27/24(Sat)04:20:42 No.101591249

>>101590189
>I could require a google login
Good job, your site is now totally dead

Anonymous
07/27/24(Sat)04:21:58 No.101591253

>>101591219
the principle is the same for any account wall, it doesn't have to be an external provider, although that makes it easier because I don't have to pay for SMS. Discord requires an account too you know? And plenty of people use Discord

Anonymous
07/27/24(Sat)04:24:26 No.101591265

>>101582470
Whenever I visit a site I find useful I
wget2 -r -k -NP --max-threads=6
it and back it up to both my backup drives.

Is this retarded?

Anonymous
07/27/24(Sat)04:25:20 No.101591273

>>101586469
sir, data is dynamically loaded with javascript
use a headless browser

Anonymous
07/27/24(Sat)05:13:46 No.101591599

>>101591253
People are much less likely to use a site that requires you to register for basic shit too.

Anonymous
07/27/24(Sat)05:18:09 No.101591627

>>101589088
>>101589963
>>101590228
>>101585324
Transgender take

Anonymous
07/27/24(Sat)05:46:00 No.101591784

does anyone scrape civitai, they have an API. Is it better to scrape using api or through the website itself?

Anonymous
07/27/24(Sat)05:49:20 No.101591809

How do I stop people like you? I've got a pretty fancy fail2ban setup with some clever heuristics in nginx, but I'd like to know if there might be more effective ways.

t. Running a smol homeserver, fuck scrapers.

Anonymous
07/27/24(Sat)06:31:42 No.101592084

>>101585324
giga nigga trvthnvke

Anonymous
07/27/24(Sat)06:48:04 No.101592185

>>101591809
>smol server
Your choices are to either throttle-fuck anyone not on some kind of whitelist, pay out the ass for extortionate third party protection or to get fucked.
By the time automated scraper traffic starts looking any different from normal traffic your server's already down.
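
The "throttle anyone not on a whitelist" option could be sketched as a per-IP token bucket. This is a toy, in-process illustration; the class name, the rate and burst numbers and the allowlist entries are all assumptions, and on a real nginx box the equivalent would more likely be done in the server config than in application code.

```python
import time


class Throttle:
    """Per-IP token bucket with an allowlist that is never throttled."""

    def __init__(self, rate, burst, allowlist=(), clock=time.monotonic):
        self.rate = rate            # tokens refilled per second
        self.burst = burst          # maximum tokens an IP can bank
        self.allowlist = set(allowlist)
        self.clock = clock          # injectable so it can be tested
        self._buckets = {}          # ip -> (tokens, time of last update)

    def allow(self, ip: str) -> bool:
        """Return True if this request should be served."""
        if ip in self.allowlist:
            return True
        now = self.clock()
        tokens, last = self._buckets.get(ip, (self.burst, now))
        # Refill for the time elapsed since the last request, capped at burst.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self._buckets[ip] = (tokens, now)
            return False
        self._buckets[ip] = (tokens - 1, now)
        return True
```

With `Throttle(rate=1.0, burst=2)` an unknown IP gets two requests immediately and then one per second, which is also roughly the point made above: it only helps against traffic that is already distinguishable by source.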
