/g/ - /wsg/ - Web Scraping General - Technology

File: scraper.png (1.62 MB, 1892x2142)

/wsg/ - Web Scraping General Anonymous 11/19/24(Tue)12:33:36 No.103237164 Archived

Web Scraping General

FAQ: https://rentry.org/scrapists

> Captcha services
https://2captcha.com/
https://www.capsolver.com/
https://anti-captcha.com/

> Proxies
https://hproxy.com/ (no blacklist) (recommended, owned by friend of /wsg/)
https://infiniteproxies.com/ (no blacklist)
https://www.thunderproxies.com/
http://proxies.fo/ (not recommended)

> Network analysis
https://mitmproxy.org/
https://portswigger.net/burp

> Scraping tools
https://beautiful-soup-4.readthedocs.io/en/latest/
https://www.selenium.dev/documentation/
https://playwright.dev/docs/codegen
https://github.com/lwthiker/curl-impersonate
https://github.com/yifeikong/curl_cffi

> Cool projects by members of our community
doubledouble.top / lucida.to - Free music scraped from spotify
nekohouse.su - Kemonoparty for fanbox/fantia/subscribestar
tv.weboasis.app - Falcon, a goy invite-only pirate streaming service that scrapes video streams from multiple sources

Official Telegram: @scrapists
Last thread: >>102861683

Anonymous 11/19/24(Tue)14:00:51 No.103238069

how do i scrape

Anonymous 11/19/24(Tue)14:06:28 No.103238126

>>103238069
with ai

Anonymous 11/19/24(Tue)16:10:34 No.103239300

how do i into scraping twitter accounts? i wouldn't mind paying for a tool if i had to

Anonymous 11/19/24(Tue)16:11:29 No.103239312

>>103239300
gallery-dl

Anonymous 11/19/24(Tue)16:38:43 No.103239607

>>103239312
can you get the tweets themselves with that? thought it was just the media

Anonymous 11/19/24(Tue)17:01:33 No.103239813

The telegram channel doesn't exist, where are you now?

Anonymous 11/19/24(Tue)17:08:36 No.103239878

>>103239607
You can with postprocessors
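
A rough sketch of what that can look like, assuming gallery-dl's documented `metadata` post processor and the `text-tweets` option of its Twitter extractor. The config path and the filename pattern are placeholders, not anything from this thread, so check the configuration docs for your installed version before relying on it.

```python
import json

# Assumed config: option names follow gallery-dl's documented settings,
# but verify them against the docs for your version.
config = {
    "extractor": {
        "twitter": {
            # Also hand text-only tweets (no media) to post processors.
            "text-tweets": True,
            "postprocessors": [
                {
                    "name": "metadata",             # write tweet metadata to disk
                    "event": "post",                # run once per tweet, not per file
                    "filename": "{tweet_id}.json",  # placeholder naming pattern
                }
            ],
        }
    }
}


def write_config(path="gallery-dl.conf"):
    """Write the config as JSON; the default path is a placeholder."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=4)
    return path
```

Calling `write_config()` produces a file that can be passed to gallery-dl as an additional config file (its `-c`/`--config` option).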

Anonymous 11/19/24(Tue)17:41:05 No.103240240

>>103239878
thanks anon, time to blow up my ssd

Anonymous 11/19/24(Tue)18:20:33 No.103240654

File: Screenshot from 2024-11-2(...).png (21 KB, 561x389)

>>103237164
How can I pass google's shitware block when using selenium for authenticating on jew sites

Anonymous 11/19/24(Tue)18:53:31 No.103240986

>>103237164
Sup g

I want to scrape an archive of one of 4chan's sister sites to get a dataset. Using ChatGPT I've made a working scraper for the official archive, but for some fucking reason it breaks when I try to adapt it to the unofficial one, despite the general principle being the same.
Please help
pastebin com 12RzFEXE

Anonymous 11/19/24(Tue)18:57:25 No.103241024

>>103240986
Another pastebin with archive examples bc 4chan likes to be annoying
pastebin com Dg9Yzn8N

Anonymous 11/19/24(Tue)19:23:19 No.103241317

>>103240240
https://www.reddit.com/r/DataHoarder/comments/yy8o9w/for_everyone_using_gallerydl_to_backup_twitter/

Anonymous 11/19/24(Tue)20:24:03 No.103241782

>>103239312
NTA but do you have any tips for not getting logged out when using gallery-dl a lot?

Anonymous 11/19/24(Tue)20:26:23 No.103241796

>>103241782
Cookies work better than username and password. I have about 10 accounts I use to scrape. I've found that scraping more than 1K posts sequentially gives you a 401 error, and even if you log in again and give it new cookies you still have to wait.

Anonymous 11/19/24(Tue)20:27:24 No.103241806

>>103241796
Alternatively you could add a delay between requests
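
A minimal sketch of that idea: a random pause after every request, plus a longer wait whenever the response is a 401 or 429. The `fetch` callable, the function name, and the delay values are all hypothetical placeholders, and nothing here is specific to gallery-dl or Twitter.

```python
import random
import time


def fetch_all(urls, fetch, min_delay=5.0, max_delay=60.0,
              max_retries=3, sleep=time.sleep):
    """Fetch each URL with a random pause, backing off on 401/429.

    `fetch` is a hypothetical callable returning (status_code, body).
    `sleep` is injectable so the pacing can be tested without waiting.
    """
    results = {}
    for url in urls:
        for attempt in range(max_retries + 1):
            status, body = fetch(url)
            if status not in (401, 429):
                results[url] = (status, body)
                break
            # Rate limited or logged out: wait longer on each retry.
            sleep(max_delay * 2 ** attempt)
        else:
            # Still blocked after all retries; record the last status.
            results[url] = (status, None)
        # Random pause so requests are not evenly spaced.
        sleep(random.uniform(min_delay, max_delay))
    return results
```

Whether any delay is long enough depends on the site; this only shows the pacing logic, not a guaranteed way to avoid being logged out.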

Anonymous 11/19/24(Tue)22:41:30 No.103242825

>>103241796
>>103241806
The 1K limit explains a lot. Last time I tried to download a lot, I set timers between requests and downloads varying from 5 seconds to a minute and used cookies as well, and I still got logged out.

Anonymous 11/20/24(Wed)01:08:50 No.103243746

>>103237164
>his business is profitable
how?

Anonymous 11/20/24(Wed)01:17:13 No.103243789

>>103237164
The virgin scraper vs The chad api reverse engineerer.

Anonymous 11/20/24(Wed)01:18:39 No.103243798

Should I use Puppeteer, Playwright, Selenium, or something else?

Anonymous 11/20/24(Wed)01:24:04 No.103243820

>>103243798
Playwright is the best

Anonymous 11/20/24(Wed)05:00:45 No.103244974

>>103237164
I'm trying to scrape kemono.party, but some artists put their stuff into encrypted archives. Kemono.party usually shows the password when you click on the archive, but JDownloader doesn't seem to download it. What other tool could I use?

Anonymous 11/20/24(Wed)05:02:46 No.103244987

Oh shit, neat that this is a thread here, was coming to /g/ to ask for help with some stuff

I'm trying to scrape a bunch of stuff from Twitter: I follow some niche history and archeology topics, and there's a bunch of researchers, as well as people who post artistic reconstructions, that I want to back stuff up from.

I've tried to look into tips and info about using gallery-dl and WFDownloader for this, and while I have a lot of good leads on what to do or avoid, it's still a bit much for me. There are still unanswered questions I can't get solid info on, and I've run into some issues just trying to start scraping stuff, even imperfectly. Most notably, I'm getting warnings from Twitter that it sees "suspicious activity" on my account, and I'm worried I'll get suspended if I scrape more.

I'll try to dump the actual questions I have tomorrow in the thread, but I'm also busy with a lot of other shit, so:

If anybody is willing to work with me on it one-on-one, run the crawling on their end, and just send me the scraped data, I'd be open to paying $150 to $300 USD depending on how much they can get, and potentially even more. My email is in the email field.

>>103241796
See my offer to pay you to help me above

>>103239300
>>103239312
>>103239607
>>103239878
>>103240240
>>103241317
>>103241796
>>103241806
>>103242825
Here's some stuff I've been looking at in case it's useful to any of you: https://pastebin.com/ij0y04Gd
Not sure if twint or stweet still work, seen mixed things

Anonymous 11/20/24(Wed)05:03:54 No.103244995

>>103244987
I guess the email field doesn't work anymore, so: saintseiyasource@gmail.com

Anonymous 11/20/24(Wed)05:53:37 No.103245255

File: 1578758861650.jpg (114 KB, 960x540)

>>103237164
do you actually get banned now on twitter for using gallery-dl?
i have a simple script for random intervals between each request

Anonymous 11/20/24(Wed)06:52:25 No.103245567

>>103237164
how do i load proxies to my .zshrc env PATH correctly? Anybody got good documentation i could read up on? please help

Anonymous 11/20/24(Wed)10:07:26 No.103246862

bump

Anonymous 11/20/24(Wed)11:39:29 No.103247587

>>103245255
i scraped 16k tweets last night and wasn't banned. took a while though because i got ratelimited over and over

Anonymous 11/20/24(Wed)13:54:26 No.103248942

File: 1703471535636611.jpg (118 KB, 1024x1023)

>>103237164
Hello scrapefrens,

What SMS receiving services do you use to set up accounts?

I've tried a couple but had bad experiences with numbers being extremely unreliable. Preferably looking for something that is reasonably priced and accepts crypto but I'll settle for anything that works.

Anonymous 11/20/24(Wed)14:47:25 No.103249519

>>103244987
Post an example account you're trying to scrape to test

Anonymous 11/20/24(Wed)14:48:26 No.103249533

>>103248942
https://onlinesim.io/

Anonymous 11/20/24(Wed)16:05:18 No.103250421

Make a new telegram channel, I have things to discuss with you

Anonymous 11/20/24(Wed)18:41:52 No.103252066

>>103249533
onlinesim was actually exactly the one I had in mind when I was writing "extremely unreliable". I tried using them a while ago and most of the numbers I got just straight up didn't work.

Anonymous 11/20/24(Wed)21:17:46 No.103253320

>>103250421
who?