/g/ - Technology


Thread archived.
You cannot reply anymore.




File: scraper.png (1.62 MB, 1892x2142)
Web Scraping General

Whitehat cuck edition continued

QOTD: What are some good sources for scraping AI training data from?

> Captcha services
https://2captcha.com/
https://www.capsolver.com/
https://anti-captcha.com/

> Proxies
https://infiniteproxies.com/ (no blacklist)
https://www.thunderproxies.com/
http://proxies.fo/

> Network analysis
https://mitmproxy.org/
https://portswigger.net/burp

> Scraping tools
https://beautiful-soup-4.readthedocs.io/en/latest/
https://www.selenium.dev/documentation/
https://playwright.dev/docs/codegen
https://github.com/lwthiker/curl-impersonate
https://github.com/yifeikong/curl_cffi

Official Discord: discord.gg/9EKk3psXMr
Last thread: >>100150524
>>
bump
>>
sage
>>
>>100167855
Poster had to show his drivers license and a DNA and semen sample and pay $100/m just to gain access to a read-only API he could have just scraped (even though that would have gone against the website's TOS)
>>
At the end of the day yt-dlp is really the solution to pretty much everything
>>
>>100141630
Aren't there like 10B possible phone numbers?

>>100143919
> Indirectly by training ML models on data
On this, what are some good sources for pulling data for training ML models?

>>100150865
Join cybercrime TG groups and look for people spreading drainer links, they should know about Twitter scraping
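
On the "10B possible phone numbers" question earlier in this post, a quick sanity check of the arithmetic, assuming that means plain 10-digit numbers. The second figure is only a rough upper bound for North American numbering, under the assumption that the area code and exchange each follow the NXX pattern (first digit 2-9) and that reserved codes are ignored:

```python
# Assumption: "phone numbers" means 10-digit numbers.
DIGITS = 10
all_ten_digit = 10 ** DIGITS  # every digit string 0000000000..9999999999

# Assumption: NANP-style layout NXX-NXX-XXXX, where N is 2-9 and X is 0-9.
# Reserved codes (e.g. N11) are NOT excluded, so this is an upper bound.
nxx = 8 * 10 * 10                      # 800 combinations per NXX block
nanp_upper_bound = nxx * nxx * 10 ** 4 # area code * exchange * line number

print(f"{all_ten_digit:,}")     # 10,000,000,000
print(f"{nanp_upper_bound:,}")  # 6,400,000,000
```

So "like 10B" holds for raw 10-digit strings, and the count of well-formed numbers under the NXX assumption is closer to 6.4B.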
>>
File: file.png (8 KB, 444x87)
>>100167898
was waiting for them to fix comments not downloading before I started scraping channels again but the zfs pool I was going to use to store the videos fuckin died
>>
>>100168026
youtube sucks ass, who cares about video comments
>>
File: *#($.jpg (101 KB, 1024x683)
where's the euro greek anon that runs the discord with a data scraping channel
show yourself
>>
>>100167925
>On this, what are some good sources for pulling data for training ML models?
HuggingFace, Kaggle, Roboflow, or I scrape it myself, which is way more rewarding since the best data is always gatekept
>>
bump
>>
Does anyone here know a castle bypass or am I gonna have to pay some jeet in the sneaker botting coms?
>>
>>100168034
Imagine scraping comments and using it to train a YouTube comment bot
>>
bump
>>
Having an issue with Selenium IDE (the web browser extension) throwing a fit over a 2D array:

Command: execute script
Target: return [["val1", "val2", "val3"], ["2d", "3d", "4d"]]
Value: A1
It gives me the error "Invalid or unexpected token".

Has anyone tried using 2D arrays in their little web app before? I can get it to work fine in the normal Selenium WebDriver, but the IDE is a bit of a pain.
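
For reference, a minimal sketch of the WebDriver side that reportedly works, in Python. It does not explain the IDE error; it only shows the same nested array going through `execute_script`. The helper names `build_script`, `fetch_rows`, and `run_with_firefox` are made up for illustration, and the Firefox driver is an assumption (any WebDriver should behave the same way here):

```python
import json

# The same 2D array as in the IDE command above.
ROWS = [["val1", "val2", "val3"], ["2d", "3d", "4d"]]


def build_script(rows):
    """Serialize the nested list as a JS array literal and return it."""
    # json.dumps output is valid JavaScript for plain lists of strings.
    return "return " + json.dumps(rows) + ";"


def fetch_rows(driver, rows=ROWS):
    """Run the script in the browser; WebDriver maps JS arrays to Python lists."""
    return driver.execute_script(build_script(rows))


def run_with_firefox():
    """Hypothetical entry point: requires selenium and a Firefox driver installed."""
    from selenium import webdriver  # imported lazily so the helpers work without it

    driver = webdriver.Firefox()
    try:
        driver.get("about:blank")
        return fetch_rows(driver)
    finally:
        driver.quit()
```

Calling `run_with_firefox()` should hand back the nested list unchanged, which at least confirms the array literal itself is fine outside the IDE.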
>>
what's web scraping?
>>
>>100171769
never mind I got it working.
>>
>>100170754
You'd need a shitload of proxies though
>>
Anyone know where I can scrape unobfuscated browser JS from?

Planning on training a GPT to deobfuscate obfuscated JS


