/g/ - /wsg/ - Web Scraping General - Technology

Anonymous

/wsg/ - Web Scraping General 07/22/24(Mon)17:51:56 No.101525886

File: scraper.png (1.62 MB, 1892x2142)

Web Scraping General

Revival edition

QOTD: What do you do when curl-impersonate doesn't work?

FAQ: rentry co/t6237g7x

> Captcha services
https://2captcha.com/
https://www.capsolver.com/
https://anti-captcha.com/

> Proxies
https://hproxy.com/ (no blacklist) (recommended, owned by friend of /wsg/)
https://infiniteproxies.com/ (no blacklist)
https://www.thunderproxies.com/
http://proxies.fo/ (not recommended)

> Network analysis
https://mitmproxy.org/
https://portswigger.net/burp

> Scraping tools
https://beautiful-soup-4.readthedocs.io/en/latest/
https://www.selenium.dev/documentation/
https://playwright.dev/docs/codegen
https://github.com/lwthiker/curl-impersonate
https://github.com/yifeikong/curl_cffi

Official Telegram: @scrapists
Last thread: >>101398720

Anonymous
07/22/24(Mon)17:56:09 No.101525926

She scrape on my back until my end crashes

Anonymous
07/22/24(Mon)19:24:15 No.101526846

'mp

Anonymous
07/22/24(Mon)20:23:09 No.101527447

File: 1720874133187562.jpg (152 KB, 1080x1247)

>>101525886
how to profit off scraping API?

Anonymous
07/22/24(Mon)20:24:24 No.101527459

>>101527447
Sell scraped gpt4 keys to /aicg/ coombrains for 50 bucks

Anonymous
07/22/24(Mon)22:10:45 No.101528533

>>101527459
how to scrape aws keys?

Anonymous
07/22/24(Mon)22:15:40 No.101528705

>>101528533
> Make regex for aws keys
> scrape github
> ???
> Profit

Anonymous
07/23/24(Tue)02:14:25 No.101530523

>>101528705
but how do (you) scrape github?

Anonymous
07/23/24(Tue)02:17:34 No.101530544

>>101527447
- scrape data from multiple sites that sell something
- make a website that can compare the prices on those sites
- put ads or affiliate links everywhere / make deals with those marketplaces
- you could also easily make your shit better than what the original sites do, an example would be a website that uses scraped house listings and displays them on a map BUT you also give the user ability to filter based on distance to the nearest school/hospital/lake/whatever. or you scrape car listings but you combine that with some user data / reviews from all over the place. like give the user a warning that this model is prone to this and that part breaking, shit like that

at least from what I've noticed there's plenty of things like this I could do locally since they just dont exist here.
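A rough sketch of the two ideas above (cheapest offer per item across sites, and filtering listings by distance to the nearest school/hospital/whatever). The record fields (`item`, `price`, `site`, `lat`, `lon`) and both helper names are made up for illustration; real scraped data will need its own normalisation first.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


def cheapest_per_item(offers):
    """Keep the lowest-priced offer for each item across all scraped sites."""
    best = {}
    for offer in offers:
        key = offer["item"]  # hypothetical normalised item identifier
        if key not in best or offer["price"] < best[key]["price"]:
            best[key] = offer
    return best


def listings_near(listings, points, max_km):
    """Return listings with at least one point of interest within max_km."""
    return [
        listing for listing in listings
        if any(
            haversine_km(listing["lat"], listing["lon"], p["lat"], p["lon"]) <= max_km
            for p in points
        )
    ]
```

The brute-force distance check is fine for a few thousand listings; past that you'd probably want a spatial index instead.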

Anonymous
07/23/24(Tue)04:05:00 No.101531318

>>101530523
Figure it out we can't spoonfeed you. Here's a hint though: copying files from repos is trivially easy since github is literally intended for people to do this anyways

Anonymous
07/23/24(Tue)06:05:37 No.101532160

What happened to the discord guild that i got removed from ?

Anonymous
07/23/24(Tue)06:36:41 No.101532395

scraping keys off github, is that even a thing? i thought github had good security about notifying you whenever your credentials were made public.
https://tiborhercz.com/what-happens-when-you-leak-aws-credentials-and-how-aws-minimizes-the-damage/
i remember reading this experiment awhile back, but i misremembered. seems like it's up to the specific services to alert you themselves. cool that AWS does that, shitty that github doesnt.

>>101532160
it got banned cause the owner started talking about hacking webcams and was going to stream the hijinks

Anonymous
07/23/24(Tue)07:17:25 No.101532657

>>101532395
> scraping keys off github, is that even a thing? i thought github had good security about notifying you whenever your credentials were made public.
You can scrape keys for companies that aren't corpos (like snusbase)

You can also scrape other sources (like pastebin)

Anonymous
07/23/24(Tue)07:59:51 No.101532977

>>101525886
to the guy who scraped aops aka artofproblemsolving.com: thank you! you're goated
a very newbie question: how do I convert the jsonl and bson files to a human-readable format? they're too large to open in any text editor
I would like to host an offline version of the forum by grabbing all the attachments and avatars from the links (will have to use chatgpt to write a script for that) and making it indexable.

Anonymous
07/23/24(Tue)08:09:27 No.101533067

>>101532977
>jsonl
works with jq
>bson
convert it to json then use jq
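If you'd rather poke at the jsonl from a script than from jq, here's a minimal sketch that streams the file line by line, so size doesn't matter. The helper names are made up. For the bson, one option (assuming the files are MongoDB-style dumps, which is a guess) is running `bsondump` from the MongoDB Database Tools on them first; it writes one JSON document per line by default, so the same reader should work on its output.

```python
import json
from itertools import islice


def iter_jsonl(path):
    """Yield one parsed record per non-empty line without loading the whole file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def preview(path, n=3):
    """Return the first n records, pretty-printed, as a single string."""
    return "\n".join(
        json.dumps(rec, indent=2, ensure_ascii=False)
        for rec in islice(iter_jsonl(path), n)
    )
```

`print(preview("posts.jsonl"))` (file name hypothetical) shows you what fields exist before you write anything heavier.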

Anonymous
07/23/24(Tue)08:09:36 No.101533068

>>101532977
NoSQL, or watch the one tutorial where the guy talks about web scraping with HTML and then goes through JSON files.

Anonymous
07/23/24(Tue)08:51:30 No.101533418

>>101525886
Why do sites like reddit, which i am trying to scrape, use javascript to load more content? I cant use curl + html parser + xml library because i would only scrape content that loads upon loading of the site, to load more content, the only way i've found so far is to scroll down using a web scraping engine. Problem is web scraping engines mostly ship for scripting languages such as python, javascript, ruby and whatnot, and not for C/C++ or Rust. So my solution is calling python scripts which use the web scraping engine from a C++ program, using C++ to write the data to .csv files and call the subsequent scripts, then using a C++ video editing library to add a background of subway surfers/minecraft parkour or whatever. Problem is this solution is kinda clunky imo, i would like to use 1 language for the whole project. Any ideas?

Anonymous
07/23/24(Tue)09:17:52 No.101533695

>>101533068
Who?

Anonymous
07/23/24(Tue)09:23:29 No.101533752

File: 1698484258553133.png (25 KB, 542x219)

>>101533418

Anonymous
07/23/24(Tue)09:52:49 No.101534025

>>101533418
> I cant use curl + html parser + xml library because i would only scrape content that loads upon loading of the site, to load more content, the only way i've found so far is to scroll down using a web scraping engine
Holy newfag. Literally just inspect network traffic and find the private API calls that are grabbing that data

Anonymous
07/23/24(Tue)10:56:47 No.101534573

>>101534025
screenshot them for reddit
ill wait

Anonymous
07/23/24(Tue)12:42:38 No.101535957

>>101533752
ill check that out
>>101534025
ermm chuddy i tried that and those were obfuscated or something

Anonymous
07/23/24(Tue)13:20:41 No.101536498

>>101404729
Jeets fight by slapping their opponent to death, like muslims without a flip flop in their hand. Even a cripple could fight and win against a jeet.

Anonymous
07/23/24(Tue)14:38:47 No.101537456

>>101533752
Hey I'm this anon >>101535957, i just checked out your method of using curl + jq to sort the .json. Holy shit i didn't know you could put /hot.json in the URL to get the hot posts as JSON, my setup was way overcomplicated before. Thanks again.
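For anyone else landing here, the /hot.json trick can be sketched without a browser engine roughly like this. It assumes the usual listing shape (`data.children[].data` plus a `data.after` cursor for the next page). The function names and the User-Agent string are made up, and unauthenticated requests may get rate limited or blocked, so treat it as a starting point only.

```python
import json
import urllib.request


def listing_url(subreddit, sort="hot", after=None, limit=25):
    """Build the .json listing URL; `after` is the cursor from the previous page."""
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json?limit={limit}"
    if after:
        url += f"&after={after}"
    return url


def parse_listing(payload):
    """Extract (posts, next_after) from a decoded listing response."""
    data = payload["data"]
    posts = [
        {
            "title": child["data"].get("title"),
            "score": child["data"].get("score"),
            "permalink": child["data"].get("permalink"),
        }
        for child in data["children"]
    ]
    return posts, data.get("after")


def fetch_listing(subreddit, sort="hot", after=None, user_agent="wsg-example/0.1"):
    """Fetch one page of a listing; performs a real network request."""
    req = urllib.request.Request(
        listing_url(subreddit, sort, after),
        headers={"User-Agent": user_agent},  # placeholder UA string
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return parse_listing(json.load(resp))
```

To get "load more" behaviour, call `fetch_listing` again with the returned cursor as `after` until it comes back as `None`, sleeping between requests.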

Anonymous
07/23/24(Tue)17:34:53 No.101540276

>>101525886
test

Anonymous
07/23/24(Tue)19:22:24 No.101541956

>>101540276
You're not banned, congratulations.

Anonymous
07/23/24(Tue)19:47:49 No.101542331

>>101535957
> i tried that and those were obfuscated or something
Reverse engineer it then, it's so easy

Anonymous
07/23/24(Tue)22:23:34 No.101544225

'mp

Anonymous
07/23/24(Tue)22:33:15 No.101544337

Thad API Publisher