/g/ - Technology






File: scraper.png (1.62 MB, 1892x2142)
Web Scraping General

FAQ: https://rentry.org/scrapists

> Captcha services
https://2captcha.com/
https://www.capsolver.com/
https://anti-captcha.com/

> Proxies
https://hproxy.com/ (no blacklist) (recommended, owned by friend of /wsg/)
https://infiniteproxies.com/ (no blacklist)
https://www.thunderproxies.com/
http://proxies.fo/ (not recommended)

> Network analysis
https://mitmproxy.org/
https://portswigger.net/burp

> Scraping tools
https://beautiful-soup-4.readthedocs.io/en/latest/
https://www.selenium.dev/documentation/
https://playwright.dev/docs/codegen
https://github.com/lwthiker/curl-impersonate
https://github.com/yifeikong/curl_cffi

> Cool projects by members of our community
doubledouble.top / lucida.to - Free music scraped from spotify
nekohouse.su - Kemonoparty for fanbox/fantia/subscribestar
tv.weboasis.app - Falcon, a goy invite-only pirate streaming service that scrapes video streams from multiple sources

Official Telegram: @scrapists
Last thread: >>102861683
>>
how do i scrape
>>
>>103238069
with ai
>>
how do i into scraping twitter accounts? i wouldn't mind paying for a tool if i had to
>>
>>103239300
gallery-dl
>>
>>103239312
can you get the tweets themselves with that? thought it was just the media
>>
The telegram channel doesn't exist, where are you now?
>>
>>103239607
You can with postprocessors
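
For reference, a rough sketch of what that could look like. It builds a gallery-dl config that turns on the twitter extractor's `text-tweets` option and attaches a `metadata` postprocessor on the `post` event, so each tweet gets its own JSON file. The cookie path, the `{tweet_id}.json` filename pattern and the `twitter.json` name are placeholders I picked, not anything from this thread, so check them against the gallery-dl docs for your version.

```python
import json


def build_config(cookies_path: str) -> dict:
    """Build a gallery-dl config dict that saves tweet metadata as JSON.

    cookies_path is an assumed location of an exported cookies.txt file.
    """
    return {
        "extractor": {
            "twitter": {
                # Log in with exported browser cookies instead of a password.
                "cookies": cookies_path,
                # Also emit text-only tweets, not just ones with media.
                "text-tweets": True,
                "postprocessors": [
                    {
                        "name": "metadata",
                        # Run once per tweet rather than once per media file.
                        "event": "post",
                        # Assumed naming scheme: one JSON file per tweet.
                        "filename": "{tweet_id}.json",
                    }
                ],
            }
        }
    }


config = build_config("cookies.txt")
config_json = json.dumps(config, indent=4)
print(config_json)
```

Save the printed JSON to a file (say `twitter.json`) and pass it with `gallery-dl --config twitter.json <profile url>`.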
>>
>>103239878
thanks anon, time to blow up my ssd
>>
>>103237164
How can I pass google's shitware block when using selenium for authenticating on jew sites
>>
>>103237164
Sup g

I want to scrape an archive of a 4chan sister site to get a dataset. Using chatgpt I've successfully made a working scraper for the official archive, but for some fucking reason it breaks when I try to adapt it to the unofficial one, despite the general principle being the same.
Please help
pastebin com 12RzFEXE
>>
>>103240986
Another pastebin with archive examples bc 4chan likes to be annoying
pastebin com Dg9Yzn8N
>>
>>103240240
https://www.reddit.com/r/DataHoarder/comments/yy8o9w/for_everyone_using_gallerydl_to_backup_twitter/
>>
>>103239312
NTA but do you have any tips for not getting logged out for using gallery-dl a lot?
>>
>>103241782
Cookies work better than username and password. I have like 10 accounts I use to scrape, I've found that scraping more than 1K posts sequentially gives you error 401 and even if you log in again and give it new cookies you still have to wait
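
A minimal sketch of how the work could be split so no single account pulls more than about 1K posts in a row. The 1000 figure is only the number observed above, not a documented limit, and `plan_batches` plus the cookie file names are made up for illustration.

```python
from itertools import cycle


def plan_batches(post_ids, cookie_files, batch_size=1000):
    """Split post_ids into batches and assign them to accounts round-robin.

    Returns a list of (cookie_file, batch) pairs. batch_size defaults to the
    ~1K threshold reported in the thread, which is an observation, not a spec.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if not cookie_files:
        raise ValueError("need at least one cookie file")
    accounts = cycle(cookie_files)
    plan = []
    for start in range(0, len(post_ids), batch_size):
        plan.append((next(accounts), post_ids[start:start + batch_size]))
    return plan


# Hypothetical example: 2500 posts spread over two accounts.
plan = plan_batches(list(range(2500)), ["account_a.txt", "account_b.txt"])
for cookie_file, batch in plan:
    print(cookie_file, len(batch))
```

Each batch can then be run as its own job with that account's cookies, with a pause before the same account comes around again.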
>>
>>103241796
Alternatively you could use a timeout between each request
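
Something like this, as a rough sketch: pick a random wait before every request after the first so the timing isn't perfectly regular. The 2 to 10 second bounds are placeholders, and `jittered_delays` / `run_with_delays` are names made up for the example. No guarantee this alone keeps an account from getting flagged.

```python
import random
import time


def jittered_delays(count, low=2.0, high=10.0, seed=None):
    """Return `count` random delays in seconds, each between low and high."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(count)]


def run_with_delays(tasks, delays, sleep=time.sleep):
    """Call each task in order, sleeping delays[i-1] before task i (i >= 1).

    `sleep` is injectable so the pacing can be checked without really waiting.
    """
    results = []
    for i, task in enumerate(tasks):
        if i > 0:
            sleep(delays[i - 1])
        results.append(task())
    return results
```

Usage would be along the lines of `run_with_delays(tasks, jittered_delays(len(tasks) - 1))`, where each task is a zero-argument callable that performs one request.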
>>
>>103241796
>>103241806
The 1K limit explains a lot, last time I tried to download a lot I set timers between requests and downloads varying from 5 seconds to a minute, used cookies as well, still got logged out.
>>
>>103237164
>his business is profitable
how?
>>
>>103237164
The virgin scraper vs The chad api reverse engineerer.
>>
Should I use Puppeteer, Playwright, Selenium, or something else?
>>
>>103243798
Playwright is the best
>>
>>103237164
I'm trying to scrape kemono.party, however some artists put their stuff into encrypted archives. Kemono.party usually has the password included when you click on the archive, but jdownloader doesn't seem to download it. What other tool could I use?
>>
Oh shit, neat that this is a thread here, was coming to /g/ to ask for help with some stuff

I'm trying to scrape a bunch of stuff from Twitter: I follow some niche history and archeology topics, and there's a bunch of researchers, as well as people who post artistic reconstructions, on the platform that I want to back stuff up from.

I've tried to look into tips and info about using Gallery-dl and WFdownloader for this, and while I have a lot of good leads on what to do or avoid, it's still a bit much for me. There are still unanswered questions I can't get solid info on, and I've run into some issues just trying to start scraping stuff, even imperfectly, most notably that I'm getting warnings from Twitter that it sees "suspicious activity" on my account, and I'm worried I'll get suspended if I scrape more

I'll try to dump the actual questions I have tomorrow in the thread, but I'm also busy with a lot of other shit, so:

If anybody is willing to work with me on it one-on-one, run the crawling on their end, and just send me the scraped data, I'd be open to paying $150 to $300 USD, depending on how much they can get, and I'm open to potentially paying even more. My email is in the email field

>>103241796
See my offer to pay you to help me above

>>103239300
>>103239312
>>103239607
>>103239878
>>103240240
>>103241317
>>103241796
>>103241806
>>103242825
Here's some stuff I've been looking at in case it's useful to any of you: https://pastebin.com/ij0y04Gd
Not sure if twint or stweet still work, seen mixed things
>>
>>103244987
I guess the email field doesn't work anymore, so: saintseiyasource@gmail.com
>>
File: 1578758861650.jpg (114 KB, 960x540)
>>103237164
do you actually get banned now on twitter for using gallery-dl?
i have a simple script for random intervals between each request
>>
>>103237164
how do i load proxies to my .zshrc env PATH correctly. Anybody got good documentation i could read up on ? please help
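
For what it's worth, proxies don't go on PATH (that's only the list of directories searched for executables). The usual convention is separate environment variables such as `http_proxy` and `https_proxy`, which curl and Python's urllib pick up. A minimal sketch to check what Python actually sees after you reload your shell; the proxy address in the comments is a placeholder, not a real endpoint.

```python
import urllib.request

# In ~/.zshrc you would typically have lines like these (shell, not Python):
#   export http_proxy="http://127.0.0.1:8080"
#   export https_proxy="http://127.0.0.1:8080"
# The address above is a placeholder; use your provider's host, port, and
# credentials.


def proxies_from_env():
    """Return the proxy mapping urllib would use, e.g. {'https': 'http://...'}."""
    return urllib.request.getproxies()


if __name__ == "__main__":
    # Prints an empty dict if no proxy variables are visible to this process.
    print(proxies_from_env())
```

If the dict comes back empty after `source ~/.zshrc`, the variables probably weren't exported or the script is running from a different shell.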
>>
bump
>>
>>103245255
i scraped 16k tweets last night and wasn't banned. took a while though because i got ratelimited over and over
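
If you're scripting it yourself, here is a minimal sketch of backing off when ratelimited instead of hammering the endpoint: wait 1s, 2s, 4s and so on between retries, capped at a maximum. `RateLimited` and `fetch_with_backoff` are hypothetical names; your `fetch` callable would need to raise `RateLimited` itself when it sees a ratelimit response (typically HTTP 429).

```python
import time


class RateLimited(Exception):
    """Hypothetical error a fetch callable raises when it gets ratelimited."""


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each ratelimit.

    Re-raises RateLimited if the final attempt is still ratelimited.
    `sleep` is injectable so the schedule can be checked without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            # 1s, 2s, 4s, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

The numbers are placeholders; if the server tells you when the limit resets, waiting until then is better than guessing.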
>>
File: 1703471535636611.jpg (118 KB, 1024x1023)
>>103237164
Hello scrapefrens,

What SMS receiving services do you use to set up accounts?

I've tried a couple but had bad experiences with numbers being extremely unreliable. Preferably looking for something that is reasonably priced and accepts crypto but I'll settle for anything that works.
>>
>>103244987
Post an example account you're trying to scrape to test
>>
>>103248942
https://onlinesim.io/
>>
Make a new telegram channel, I have things to discuss with you
>>
>>103249533
onlinesim was actually exactly the one I had in mind when I was writing "extremely unreliable". I tried using them a while ago and most of the numbers I got just straight up didn't work.
>>
>>103250421
who?


