Web Scraping General
FAQ: https://rentry.org/scrapists
> Captcha services
https://2captcha.com/
https://www.capsolver.com/
https://anti-captcha.com/
> Proxies
https://hproxy.com/ (no blacklist) (recommended, owned by friend of /wsg/)
https://infiniteproxies.com/ (no blacklist)
https://www.thunderproxies.com/
http://proxies.fo/ (not recommended)
> Network analysis
https://mitmproxy.org/
https://portswigger.net/burp
> Scraping tools
https://beautiful-soup-4.readthedocs.io/en/latest/
https://www.selenium.dev/documentation/
https://playwright.dev/docs/codegen
https://github.com/lwthiker/curl-impersonate
https://github.com/yifeikong/curl_cffi
> Cool projects by members of our community
doubledouble.top / lucida.to - Free music scraped from Spotify
nekohouse.su - Kemonoparty for fanbox/fantia/subscribestar
tv.weboasis.app - Falcon, an invite-only pirate streaming service that scrapes video streams from multiple sources
Official Telegram: @scrapists
Last thread: >>102861683
how do i scrap
>>103238069 with ai
how do i into scraping twitter accounts? i wouldn't mind paying for a tool if i had to
>>103239300 gallery-dl
>>103239312 can you get the tweets themselves with that? thought it was just the media
The telegram channel doesn't exist, where are you now?
>>103239607 You can with postprocessors
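For anyone following along, here is a minimal sketch of one way to get tweet metadata out of gallery-dl. It only builds the command line and does not run anything. It assumes the `--write-metadata` and `--cookies` options, which to my understanding write a JSON file (tweet text included) next to each downloaded file and load a cookies.txt export. The profile URL and file name are made up. Text-only tweets may need extra extractor and postprocessor settings, so check the gallery-dl docs before relying on this.

```python
import shlex


def build_gallery_dl_command(profile_url, cookies_file=None):
    """Return an argv list for gallery-dl that also saves metadata as JSON."""
    # --write-metadata: write a metadata JSON file alongside each download.
    cmd = ["gallery-dl", "--write-metadata"]
    if cookies_file:
        # --cookies: load session cookies from a cookies.txt style file.
        cmd += ["--cookies", cookies_file]
    cmd.append(profile_url)
    return cmd


if __name__ == "__main__":
    # Hypothetical account and cookie file, for illustration only.
    argv = build_gallery_dl_command("https://twitter.com/example_user", "cookies.txt")
    print(shlex.join(argv))
```

Pass the resulting list to `subprocess.run(argv, check=True)` if you want to run it from a script.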
>>103239878 thanks anon, time to blow up my ssd
>>103237164 How can I get past Google's shitware block when using Selenium to authenticate on sites?
>>103237164 Sup /g/, I want to scrape an archive of one of 4chan's sister sites to get a dataset. Using ChatGPT I've successfully made a working scraper for the official archive, but for some fucking reason it breaks when I try to adapt it to the unofficial one, despite the general principle being the same. Please help: pastebin com 12RzFEXE
>>103240986 Another pastebin with archive examples bc 4chan likes to be annoying: pastebin com Dg9Yzn8N
>>103240240 https://www.reddit.com/r/DataHoarder/comments/yy8o9w/for_everyone_using_gallerydl_to_backup_twitter/
>>103239312 NTA but do you have any tips for not getting logged out for using gallery-dl a lot?
>>103241782 Cookies work better than username and password. I have like 10 accounts I use to scrape. I've found that scraping more than 1K posts sequentially gives you error 401, and even if you log in again and give it new cookies you still have to wait
>>103241796 Alternatively you could use a timeout between each request
>>103241796 >>103241806 The 1K limit explains a lot. Last time I tried to download a lot I set timers between requests and downloads varying from 5 seconds to a minute, used cookies as well, still got logged out.
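The pacing idea from the posts above can be sketched roughly like this: a random pause between requests, plus a long cooldown after every batch of about 1K posts. The 5 to 60 second range and the 1K batch size come from the thread. The one-hour cooldown and the `fetch` function in the usage note are hypothetical placeholders. Nothing here is guaranteed to prevent a 401 or a logout, as the post above shows.

```python
import random
import time


def paced(items, min_delay=5.0, max_delay=60.0, batch_size=1000,
          cooldown=3600.0, sleep=time.sleep):
    """Yield items with a random pause between them and a long pause per batch."""
    for index, item in enumerate(items):
        if index > 0:
            if index % batch_size == 0:
                # A full batch has been handed out: wait longer before continuing.
                sleep(cooldown)
            else:
                # Ordinary gap between two requests, randomized to look less regular.
                sleep(random.uniform(min_delay, max_delay))
        yield item
```

Usage would look like `for url in paced(urls): fetch(url)`. The `sleep` parameter exists only so the pauses can be swapped out when testing.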
>>103237164
>his business is profitable
how?
>>103237164 The virgin scraper vs. the chad API reverse engineer.
Should I use Puppeteer, Playwright, Selenium, or something else?
>>103243798 Playwright is the best
>>103237164 I'm trying to scrape kemono.party, however some artists put their stuff into encrypted archives. Kemono.party usually has the password included when you click on the archive, but JDownloader doesn't seem to download it. What other tool could I use?
Oh shit, neat that this is a thread here, was coming to /g/ to ask for help with some stuff
I'm trying to scrape a bunch of stuff from Twitter: I follow some niche history and archeology topics, and there's a bunch of researchers as well as people who do artistic reconstructions on the platform I want to back stuff up from.
I've tried to look into tips and info about using gallery-dl and WFDownloader for this, and while I have a lot of good leads on what to do or avoid, it's still a bit much for me. There are still unanswered questions I can't get solid info on, and I've run into some issues just trying to start scraping stuff, even imperfectly, most notably that I'm getting warnings from Twitter that it sees "suspicious activity" on my account, and I'm worried I'll get suspended if I scrape more
I'll try to dump the actual questions I have tomorrow in the thread, but I'm also busy with a lot of other shit, so:
If anybody is willing to work with me on it one-on-one, and will run the crawling on their end and can just send me the scraped data, I'd be open to paying $150 to $300 USD, depending on how much they can get, and I'm open to paying potentially even more. My email is in the email field
>>103241796 See my offer to pay you to help me above
>>103239300 >>103239312 >>103239607 >>103239878 >>103240240 >>103241317 >>103241796 >>103241806 >>103242825
Here's some stuff I've been looking at in case it's useful to any of you: https://pastebin.com/ij0y04Gd
Not sure if twint or stweet still work, seen mixed things
>>103244987 I guess the email field doesn't work anymore, so: saintseiyasource@gmail.com
>>103237164 do you actually get banned now on twitter for using gallery-dl? i have a simple script for random intervals between each request
>>103237164 how do i load proxies to my .zshrc env PATH correctly? Anybody got good documentation i could read up on? please help
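A note on the proxy question: `PATH` only controls where the shell looks for executables. By convention, proxies are set through the `http_proxy`, `https_proxy`, `all_proxy` and `no_proxy` environment variables, which you would `export` in `~/.zshrc`. Not every tool honors them, so check the docs of whatever you run. Below is a small sketch showing that Python's standard library picks them up. The proxy address is a made-up placeholder.

```python
import os
import urllib.request
from unittest import mock

# Equivalent ~/.zshrc lines (hypothetical proxy address):
#   export http_proxy="http://127.0.0.1:8080"
#   export https_proxy="http://127.0.0.1:8080"


def proxies_from_env(env):
    """Return the proxy mapping urllib detects when the given variables are set."""
    # patch.dict sets the variables temporarily and restores os.environ afterwards.
    with mock.patch.dict(os.environ, env):
        return urllib.request.getproxies()


if __name__ == "__main__":
    example = {
        "http_proxy": "http://127.0.0.1:8080",
        "https_proxy": "http://127.0.0.1:8080",
    }
    print(proxies_from_env(example))
```

After editing `~/.zshrc`, run `source ~/.zshrc` or open a new terminal so the variables actually reach the programs you start.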
bump
>>103245255 i scraped 16k tweets last night and wasn't banned. took a while though because i got ratelimited over and over
>>103237164 Hello scrapefrens, what SMS receiving services do you use to set up accounts? I've tried a couple but had bad experiences with numbers being extremely unreliable. Preferably looking for something that is reasonably priced and accepts crypto, but I'll settle for anything that works.
>>103244987 Post an example account you're trying to scrape to test
>>103248942 https://onlinesim.io/
Make a new telegram channel, I have things to discuss with you
>>103249533 onlinesim was actually exactly the one I had in mind when I was writing "extremely unreliable". I tried using them a while ago and most of the numbers I got just straight up didn't work.
>>103250421 who?