Web Scraping General
Revival edition

QOTD: What do you do when curl-impersonate doesn't work?

FAQ: rentry co/t6237g7x

> Captcha services
https://2captcha.com/
https://www.capsolver.com/
https://anti-captcha.com/

> Proxies
https://hproxy.com/ (no blacklist) (recommended, owned by friend of /wsg/)
https://infiniteproxies.com/ (no blacklist)
https://www.thunderproxies.com/
http://proxies.fo/ (not recommended)

> Network analysis
https://mitmproxy.org/
https://portswigger.net/burp

> Scraping tools
https://beautiful-soup-4.readthedocs.io/en/latest/
https://www.selenium.dev/documentation/
https://playwright.dev/docs/codegen
https://github.com/lwthiker/curl-impersonate
https://github.com/yifeikong/curl_cffi

Official Telegram: @scrapists
Last thread: >>101398720
She scrape on my back until my end crashes
'mp
>>101525886
how to profit off scraping API?
>>101527447Sell scraped gpt4 keys to /aicg/ coombrains for 50 bucks
>>101527459how to scrape aws keys?
>>101528533> Make regex for aws keys> scrape github> ???> Profit
>>101528705but how do (you) scrape github?
>>101527447
- scrape data from multiple sites that sell something
- make a website that can compare the prices on those sites
- put ads or affiliate links everywhere / make deals with those marketplaces
- you could also easily make your shit better than what the original sites do. an example would be a website that uses scraped house listings and displays them on a map BUT you also give the user the ability to filter based on distance to the nearest school/hospital/lake/whatever. or you scrape car listings but you combine that with some user data / reviews from all over the place, like give the user a warning that this model is prone to this and that part breaking, shit like that
at least from what I've noticed there's plenty of things like this I could do locally since they just don't exist here.
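For the "filter listings by distance to the nearest school/hospital" idea above, here's a rough sketch of how the filter could look. Everything in it is an assumption for illustration: the `lat`/`lon` keys, the `listings_near` helper, and the use of the haversine great-circle formula with a mean Earth radius of 6371 km. Real listings would need geocoding first, which isn't shown.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, good enough for city-scale filtering


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def listings_near(listings, points_of_interest, max_km):
    """Keep listings whose nearest point of interest is within max_km.

    Both arguments are lists of dicts with hypothetical "lat" and "lon" keys.
    points_of_interest must not be empty, otherwise min() raises ValueError.
    """
    nearby = []
    for listing in listings:
        nearest = min(
            haversine_km(listing["lat"], listing["lon"], poi["lat"], poi["lon"])
            for poi in points_of_interest
        )
        if nearest <= max_km:
            # Copy the listing and attach the distance so the UI can display it.
            nearby.append({**listing, "nearest_km": nearest})
    return nearby
```

This is a brute-force loop over every listing and every point of interest, which is fine for a few thousand rows. For anything bigger you'd probably want a spatial index instead.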
>>101530523Figure it out we can't spoonfeed you. Here's a hint though: copying files from repos is trivially easy since github is literally intended for people to do this anyways
What happened to the discord guild that i got removed from ?
scraping keys off github, is that even a thing? i thought github had good security about notifying you whenever your credentials were made public.
https://tiborhercz.com/what-happens-when-you-leak-aws-credentials-and-how-aws-minimizes-the-damage/
i remember reading this experiment a while back, but i misremembered. seems like it's up to the specific services to alert you themselves. cool that AWS does that, shitty that github doesn't.
>>101532160
it got banned cause the owner started talking about hacking webcams and was going to stream the hijinks
>>101532395> scraping keys off github, is that even a thing? i thought github had good security about notifying you whenever your credentials were made public.You can scrape keys for companies that aren't corpos (like snusbase)You can also scrape other sources (like pastebin)
>>101525886
to the guy who scraped aops aka artofproblemsolving.com: Thank you! you're goated
a very newbie question: how do I convert the jsonl and bson files to a human readable format? they're too large to be opened by any text editor
I would like to host an offline version of the forum by grabbing all the attachments and avatars off the links (will have to use chatgpt to write a script for that) and making it indexable.
>>101532977
>jsonl
works with jq
>bson
convert it to json then use jq
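If you'd rather peek at the dump without jq, here's a minimal sketch that streams a JSONL file one line at a time, so the file size doesn't matter. The `preview_jsonl` name and the `limit` parameter are made up for this example. BSON isn't covered: you'd need a BSON decoder first (for example the `bson` module that ships with PyMongo), and that step is left out here.

```python
import json
from itertools import islice
from pathlib import Path


def preview_jsonl(path, limit=5):
    """Return the first `limit` records of a JSONL file as pretty-printed strings."""
    records = []
    with Path(path).open("r", encoding="utf-8") as handle:
        # Iterating the file object reads one line at a time, so the whole
        # dump never has to fit in memory. Blank lines are skipped.
        non_empty = (line for line in handle if line.strip())
        for line in islice(non_empty, limit):
            record = json.loads(line)
            records.append(json.dumps(record, indent=2, ensure_ascii=False))
    return records


if __name__ == "__main__":
    import sys

    # Usage: python preview.py dump.jsonl
    for pretty in preview_jsonl(sys.argv[1]):
        print(pretty)
```

Same idea works for converting the whole thing: loop over every line instead of stopping at `limit` and write each record wherever you need it.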
>>101532977
NoSQL, or watch that one tutorial where the guy talks about web scraping with HTML and then with JSON files.
>>101525886Why do dumbass judeo1984sites like reddit, which i am trying to scrape, use retarded javascript to load more content? I cant use curl + html parser + xml library because i would only scrape content that loads upon loading of the site, to load more content, the only way i've found so far is to scroll down using a web scraping engine. Problem is web scraping engines only ship for pajeetware scripting non-scalable toy languages such as python, javascript, ruby and whatnot, and not for aryankino such as C/C++ or Rust. So my solution is calling python scripts which use the web scraping engine manually from a C++ file, and using C++ to write the data to .csvs and call other subsequent scripts, then use a C++ video editing library to add a background of subway surfers/minecraft parkour or whatever goyim kids use to get their dopamine hits. Problem is this solution is kinda retarded imo, i would like to use 1 (serious programming) language for the whole project. Any ideas?
>>101533068
Who?
>>101533418
>>101533418> I cant use curl + html parser + xml library because i would only scrape content that loads upon loading of the site, to load more content, the only way i've found so far is to scroll down using a web scraping engineHoly newfag. Literally just inspect network traffic and find the private API calls that are grabbing that data
>>101534025
screenshot them for reddit i'll wait
>>101533752
i'll check that out
>>101534025
ermm chuddy i tried that and those were obfuscated or something
>>101404729Jeets fight by slapping their opponent to death, like muslims without a flip flop in their hand. Even a cripple could fight and win against a jeet.
>>101533752Hey I'm this anon >>101535957, i just checked out your method of using curl + bash jq to sort the .json. Holy shit i didn't know you could put /hot.json in the URL to get the top posts, my shit was retarded before. Thanks again, goy.
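For reference, the /hot.json trick mentioned above can be sketched in Python instead of curl + jq. Treat this as a sketch under assumptions, not the anon's actual method: the helper names, the example user agent, and the `limit` query parameter are chosen for illustration, and the parsing assumes the usual listing layout (`data` -> `children` -> `data`). Reddit may rate-limit or block unauthenticated requests.

```python
import json
import urllib.request


def build_listing_url(subreddit, sort="hot", limit=25):
    """Build the JSON listing URL for a subreddit (appends e.g. /hot.json)."""
    return f"https://www.reddit.com/r/{subreddit}/{sort}.json?limit={limit}"


def fetch_listing(url, user_agent="wsg-example/0.1"):
    """Download and decode one listing page. The user agent is a placeholder."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def extract_posts(listing):
    """Pull title, score and permalink out of a decoded listing dict."""
    posts = []
    for child in listing.get("data", {}).get("children", []):
        data = child.get("data", {})
        posts.append(
            {
                "title": data.get("title"),
                "score": data.get("score"),
                "permalink": data.get("permalink"),
            }
        )
    return posts


if __name__ == "__main__":
    # "programming" is just an example subreddit name.
    listing = fetch_listing(build_listing_url("programming", limit=5))
    for post in extract_posts(listing):
        print(post["score"], post["title"])
```

Keeping the parsing in its own function means you can run it against a saved response while you work out which fields you want, without hitting the site every time.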
>>101525886
test
>>101540276
You're not banned, congratulations.
>>101535957> i tried that and those were obfuscated or somethingNigga reverse engineer it then, it's so easy
Thad API Publisher