/g/ - Technology


Thread archived.
You cannot reply anymore.




File: scraper.png (1.62 MB, 1892x2142)
Web Scraping General

Revival edition

QOTD: What do you do when curl-impersonate doesn't work?

FAQ: rentry co/t6237g7x

> Captcha services
https://2captcha.com/
https://www.capsolver.com/
https://anti-captcha.com/

> Proxies
https://hproxy.com/ (no blacklist) (recommended, owned by friend of /wsg/)
https://infiniteproxies.com/ (no blacklist)
https://www.thunderproxies.com/
http://proxies.fo/ (not recommended)

> Network analysis
https://mitmproxy.org/
https://portswigger.net/burp

> Scraping tools
https://beautiful-soup-4.readthedocs.io/en/latest/
https://www.selenium.dev/documentation/
https://playwright.dev/docs/codegen
https://github.com/lwthiker/curl-impersonate
https://github.com/yifeikong/curl_cffi

Official Telegram: @scrapists
Last thread: >>101398720
>>
She scrape on my back until my end crashes
>>
'mp
>>
File: 1720874133187562.jpg (152 KB, 1080x1247)
>>101525886
how to profit off scraping API?
>>
>>101527447
Sell scraped gpt4 keys to /aicg/ coombrains for 50 bucks
>>
>>101527459
how to scrape aws keys?
>>
>>101528533
> Make regex for aws keys
> scrape github
> ???
> Profit
>>
>>101528705
but how do (you) scrape github?
>>
>>101527447
- scrape data from multiple sites that sell something
- make a website that can compare the prices on those sites
- put ads or affiliate links everywhere / make deals with those marketplaces
- you could also easily make your shit better than what the original sites do, an example would be a website that uses scraped house listings and displays them on a map BUT you also give the user ability to filter based on distance to the nearest school/hospital/lake/whatever. or you scrape car listings but you combine that with some user data / reviews from all over the place. like give the user a warning that this model is prone to this and that part breaking, shit like that

at least from what I've noticed there's plenty of things like this I could do locally since they just dont exist here.
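
A rough sketch of the price-comparison step from the list above, assuming each scraper already yields (site, product, price) records. The `Offer` class, the site names, and the crude name normalization are all made up for illustration; real listings need much better product matching than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    site: str      # hypothetical marketplace name
    product: str   # product name as scraped
    price: float   # price in one common currency (assumed)


def normalize(name: str) -> str:
    """Crude product key: lowercase with collapsed whitespace."""
    return " ".join(name.lower().split())


def cheapest_by_product(offers):
    """Return {normalized product name: cheapest Offer} across all sites."""
    best = {}
    for offer in offers:
        key = normalize(offer.product)
        if key not in best or offer.price < best[key].price:
            best[key] = offer
    return best


if __name__ == "__main__":
    # Invented sample data standing in for scraper output.
    scraped = [
        Offer("shop-a", "USB-C  Cable", 9.99),
        Offer("shop-b", "usb-c cable", 7.49),
        Offer("shop-a", "Mouse", 20.0),
    ]
    for key, offer in sorted(cheapest_by_product(scraped).items()):
        print(f"{key}: {offer.price:.2f} at {offer.site}")
```

The affiliate link or ad would then hang off the winning `Offer.site` for each product.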
>>
>>101530523
Figure it out we can't spoonfeed you. Here's a hint though: copying files from repos is trivially easy since github is literally intended for people to do this anyways
>>
What happened to the Discord guild that I got removed from?
>>
scraping keys off github, is that even a thing? i thought github had good security about notifying you whenever your credentials were made public.
https://tiborhercz.com/what-happens-when-you-leak-aws-credentials-and-how-aws-minimizes-the-damage/
i remember reading this experiment awhile back, but i misremembered. seems like it's up to the specific services to alert you themselves. cool that AWS does that, shitty that github doesnt.

>>101532160
it got banned cause the owner started talking about hacking webcams and was going to stream the hijinks
>>
>>101532395
> scraping keys off github, is that even a thing? i thought github had good security about notifying you whenever your credentials were made public.
You can scrape keys for companies that aren't corpos (like snusbase)

You can also scrape other sources (like pastebin)
>>
>>101525886
to the guy who scraped aops aka artofproblemsolving.com : Thank you! you're goated
a very newbie question: how do I convert the jsonl and bson files to a human-readable format? they're too large to open in any text editor
I would like to host an offline version of the forum by grabbing all the attachments and avatars off the links (will have to use chatgpt to write a script for that) and making it indexable.
>>
>>101532977
>jsonl
works with jq
>bson
convert it to json first (bsondump from the MongoDB Database Tools can do that), then use jq
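
If you'd rather stay in Python than use jq, here is a minimal sketch for peeking at a huge .jsonl without opening the whole thing. It assumes one JSON object per line and UTF-8 encoding; the function names are made up, and BSON is not handled here.

```python
import json
import sys
from itertools import islice


def iter_jsonl(path):
    """Yield one parsed record per non-empty line, without loading the whole file."""
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def peek(path, n=5):
    """Return the first n records as pretty-printed JSON strings."""
    return [
        json.dumps(record, indent=2, ensure_ascii=False)
        for record in islice(iter_jsonl(path), n)
    ]


if __name__ == "__main__":
    # Usage: python peek_jsonl.py dump.jsonl
    if len(sys.argv) > 1:
        for text in peek(sys.argv[1]):
            print(text)
```

Because `iter_jsonl` is a generator, memory use stays roughly at one line at a time, so the file size does not matter much.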
>>
>>101532977
NoSQL, or watch that one tutorial where the guy talks about web scraping with HTML and then working through JSON files.
>>
>>101525886
Why do dumbass judeo1984sites like reddit, which i am trying to scrape, use retarded javascript to load more content? I cant use curl + html parser + xml library because i would only scrape content that loads upon loading of the site, to load more content, the only way i've found so far is to scroll down using a web scraping engine. Problem is web scraping engines only ship for pajeetware scripting non-scalable toy languages such as python, javascript, ruby and whatnot, and not for aryankino such as C/C++ or Rust. So my solution is calling python scripts which use the web scraping engine manually from a C++ file, and using C++ to write the data to .csvs and call other subsequent scripts, then use a C++ video editing library to add a background of subway surfers/minecraft parkour or whatever goyim kids use to get their dopamine hits. Problem is this solution is kinda retarded imo, i would like to use 1 (serious programming) language for the whole project. Any ideas?
>>
>>101533068
Who?
>>
File: 1698484258553133.png (25 KB, 542x219)
>>101533418
>>
>>101533418
> I cant use curl + html parser + xml library because i would only scrape content that loads upon loading of the site, to load more content, the only way i've found so far is to scroll down using a web scraping engine
Holy newfag. Literally just inspect network traffic and find the private API calls that are grabbing that data
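
Once the network tab shows which JSON endpoint the "load more" button hits, paging through it usually looks something like the sketch below. The response shape (`items`, `next_cursor`), the `cursor` query parameter, and the User-Agent string are assumptions for illustration; the real names come from whatever you see in the captured requests.

```python
import json
import urllib.parse
import urllib.request


def paginate(fetch_page, max_pages=10):
    """Collect items from a cursor-paginated JSON endpoint.

    fetch_page(cursor) must return a dict shaped like
    {"items": [...], "next_cursor": <str or None>} (hypothetical shape).
    """
    items, cursor = [], None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        items.extend(page.get("items", []))
        cursor = page.get("next_cursor")
        if not cursor:
            break  # no further pages advertised
    return items


def make_http_fetcher(base_url, user_agent="example-scraper/0.1"):
    """Build a fetch_page function for a hypothetical endpoint taking ?cursor=..."""
    def fetch_page(cursor):
        url = base_url
        if cursor:
            url += "?" + urllib.parse.urlencode({"cursor": cursor})
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.load(response)
    return fetch_page
```

Keeping the HTTP part behind `fetch_page` means the loop can be tested with canned pages before pointing it at a live site.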
>>
>>101534025
screenshot them for reddit
ill wait
>>
>>101533752
ill check that out
>>101534025
ermm chuddy i tried that and those were obfuscated or something
>>
>>101404729
Jeets fight by slapping their opponent to death, like muslims without a flip flop in their hand. Even a cripple could fight and win against a jeet.
>>
>>101533752
Hey I'm this anon >>101535957, i just checked out your method of using curl + bash jq to sort the .json. Holy shit i didn't know you could put /hot.json in the URL to get the top posts, my shit was retarded before. Thanks again, goy.
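
For anyone else trying the /hot.json trick, a minimal Python sketch of the same idea without curl + jq. The field names (`data.children[].data.title`, `score`, `permalink`) and the `limit` parameter are how I remember the listing JSON looking, so treat them as assumptions and check a real response; the User-Agent string is made up.

```python
import json
import urllib.request


def listing_url(subreddit, sort="hot", limit=25):
    """Build the .json listing URL for a subreddit."""
    return f"https://www.reddit.com/r/{subreddit}/{sort}.json?limit={limit}"


def extract_posts(listing):
    """Pull title, score and full URL out of a parsed listing dict."""
    posts = []
    for child in listing.get("data", {}).get("children", []):
        data = child.get("data", {})
        posts.append({
            "title": data.get("title"),
            "score": data.get("score"),
            "url": "https://www.reddit.com" + data.get("permalink", ""),
        })
    return posts


def fetch_listing(subreddit, sort="hot", limit=25, user_agent="example-scraper/0.1"):
    """Download and parse one listing page (performs a network request)."""
    request = urllib.request.Request(
        listing_url(subreddit, sort, limit),
        headers={"User-Agent": user_agent},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Something like `extract_posts(fetch_listing("programming"))` would then give a list of dicts ready to dump to CSV.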
>>
>>101525886
test
>>
>>101540276
You're not banned, congratulations.
>>
>>101535957
> i tried that and those were obfuscated or something
Nigga reverse engineer it then, it's so easy
>>
'mp
>>


