/g/ - Technology


Thread archived.




File: scraper.png (1.62 MB, 1892x2142)
Web Scraping General

AI datamine general

QOTD: If you had to fine-tune an LLM, what would you optimize it for and where would you scrape training data from?

> Captcha services
https://2captcha.com/
https://www.capsolver.com/
https://anti-captcha.com/

> Proxies
https://infiniteproxies.com/ (no blacklist)
https://www.thunderproxies.com/
http://proxies.fo/

> Network analysis
https://mitmproxy.org/
https://portswigger.net/burp

> Scraping tools
https://beautiful-soup-4.readthedocs.io/en/latest/
https://www.selenium.dev/documentation/
https://playwright.dev/docs/codegen
https://github.com/lwthiker/curl-impersonate
https://github.com/yifeikong/curl_cffi

Official Discord: discord.gg/9EKk3psXMr
Last thread: >>100166675
>>
>>100192529
>QOTD: If you had to fine-tune an LLM, what would you optimize it for and where would you scrape training data from?
I'd finetune it on a bunch of shitcoin candlestick charts to make myself rich
>>
File: 1711577564007128.jpg (555 KB, 2000x1333)
>>100192529
I wrote my own ChatGPT scraper + GUI in C++
>>
is he one of us?
>>
don't you have any other images for the OP?
>>
>>100193742
No
>>
File: 1711248980632817.jpg (126 KB, 768x1024)
>~$80/3months for 20 datacenter proxies
Before I renew, am I overpaying or is this about expected?
>>100192529
>QOTD
Finetuning... probably Wikia/Fandom and F-List to get whatever to understand RP concepts and characters better.
For pretraining though, I will state the same thing I did in /lmg/ (when I was bigger into AI and not totally focused on a scraping project) and say that Anna's Archive is untapped if you are willing to put the work into cleaning. Almost as big* as Commoncrawl but that and Books3 have been implemented into datasets over and over and over again. If you give any credence to the "slop problem" of collecting more garbage and GPT output in web scrapes as time goes on, it's a potential alternative.
>>100193742
On the off chance the guy is here, NIG respond to my emails. If shit falls through completely I want to know if I can pick up the data.
>>
>>100193742
> spy.pet
You the actual spy.pet admin?
>>
>>100198235
> am I overpaying
Yes, especially for DC proxies
>>
>>100198458
Noted. I'll take a look at services in the OP as well as Proxyrack. I DON'T need huge amounts of data at all (I do downloads as needed over a 'vad VPN rig), I just need something stable and IPv4-compatible to have my browser instances connect to.
>>
If people want I'll do an explanation image of the vad/Nordlord (VPN + Docker + HAproxy) and post scripts.
I doubt this is really anything "new", but variations of this have served my projects well as a way to get low-cost, unbanned, unlimited bandwidth proxies when stability doesn't matter all that much
>>
File: bobo-peeking-behind-door.png (392 KB, 1550x1404)
I found this library on npm the other day. It's for getting data from the Twitter API that you usually can't get without paying.
https://github.com/Rishikant181/Rettiwt-API
I didn't really look into how it works, but it does seem to work, at least the bits I tried. For example, with the free Twitter API tier you can only get user details for your own account, but with this library you can get details for any account. It doesn't seem to include everything though: the user details are missing the "website" property, so if a user has a website URL on their profile this library doesn't return it, but it seems to return most data
>>
>>100198235
>>~$80/3months for 20 datacenter proxies
>Before I renew, am I overpaying or is this about expected?
That does sound a bit expensive, but it depends on how much you're using it. On Azure you can create a function app and you get 1 million free executions per month. I'm fairly certain that if you create another function app you get another million, but you'd need to check that. You could just create 20 function apps if you wanted. I scraped data from eBay using 10 function apps in various regions and it prevented the IP rate limiting that was happening on my local PC
>>
>>100199227
How much did you pay for Azure?
>>
Why are ISP proxies so good? I'm literally doing click fraud with AdSense and they don't bat an eye and just keep paying (I have 0 legitimate traffic lol). Is it because they are set up and configured differently from your average botnet residential proxies? TCP fingerprint? Anything else I should add to my proxy test routine?
>connect your proxy in anti detect browser that takes care of all spoofing (like webRTC and spoofing proxy ip geo location)
>go to https://proxy.incolumitas.com/proxy_detect.html (the latency test can be ignored)
>go to https://browserleaks.com/ip check ISP and usage type. Check TCP/IP fingerprint.
>do UDP/TCP port scan on proxy ip for common ports. all of them open = bad
>check IP in blacklist (ipqualityscore.com too strict IMO, more something like https://www.ipvoid.com/ip-blacklist-check/ or a google search)
>DNS check (?)
>>
>>100199056
I'm interested
>>
File: mobiproxy.jpg (63 KB, 445x586)
>>100200016
You know something I was wondering?

I was talking with a friend and apparently his friend bought this device that could get him 10 IPs from his ISP simultaneously and he could rotate it at will. (pic related)

He lives in Vietnam apparently and idk their infrastructure, but I was wondering if I could split up the coaxial connection at my house to get me multiple simultaneous connections I could use and rotate at will.

Does anyone here have any experience with this subject?
>>
>>100200016
How much have you made? If it were that easy everyone would be doing it
>>
>>100199953
It was free for what I was doing because I was under the 1 million executions per month. That was in total across the 10 function apps though; I'm not sure if I would be charged more if I, say, used 800k executions on every function app so the total was 8 million executions for the month. You might have to pay for storage too, because function apps log things to storage occasionally and you need a storage account to create a function app, but storage accounts are free too and you'd probably only be charged a few cents a month for storage, like 10 cents or something. So I probably did pay a few cents for that, but it was basically zero. The pricing is here if you want to work out exactly how much you'd pay
https://azure.microsoft.com/en-au/pricing/details/functions/
Or you can use the pricing calculator
https://azure.microsoft.com/en-au/pricing/calculator/?service=functions
>>
You can proxy with Tor too. Obviously it's pretty slow though, and it doesn't work on a lot of websites, but it might be useful in some instances if you're getting IP restriction problems using other methods.

What you need to do is go to the Tor download page, then download the "Tor expert bundle" for your computer
https://www.torproject.org/download/tor/

Then put the expert bundle files in a folder somewhere on your PC. Then in a command line go to that folder and run tor.exe which should start the service, you'll see it say "bootstrapping" for a while.

Then you can use the proxy like in this node.js sample that's using playwright. Or you can just do direct http GET requests for the HTML if the website will allow that

import { chromium } from "@playwright/test";

(async () => {
  // route the browser through the local Tor SOCKS proxy
  const browser = await chromium.launch({
    proxy: {
      server: "socks5://127.0.0.1:9050"
    },
    headless: false
  });
  const page = await browser.newPage();
  await page.goto("https://api.ipify.org/?format=json");
  const html = await page.content();
  console.log(html);
  await browser.close();
})();


In some cases you might want to restrict the Tor service to only use exit nodes from a certain country. I can't remember exactly how to do that, it's in a settings file you need to edit somewhere, but you can probably find out pretty easy online
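The country restriction mentioned above lives in Tor's torrc config file. A sketch for US-only exits (two-letter country codes in braces; StrictNodes makes the restriction hard rather than best-effort):

```
# torrc
ExitNodes {us}
StrictNodes 1
```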
>>
>>100200225
Yeah, Azure free trial says it gives you a million function executions per month 'Always' and not just for 12 months. I think I'll give it a try, I've never used Azure before
>>
>>100200225
...another note, each function app will have an IP address which is for that data centre. So if you need multiple different IP addresses I'm pretty sure you need to create a function app in different regions. Like one in the U.S. and one in the UK and one in Australia and so on and you should get a different IP for each one. There are other ways to get IP addresses on Azure but because I'm familiar with function apps and because they're so cheap I decided to just do it this way
>>
>>100200300
> You can proxy with Tor too
Maybe in like 2008. I hope you're ready for captcha hell and blocks on most websites given all exit nodes are public and all of them have really bad fraud scores
>>
>>100200309
Usually I'd set up CI/CD for Azure Functions when I'm working on a project. But if you're doing something small like making a proxy you can do all your code in the browser. When you create the function app you can then create a function, and there's a code editor and stuff in the portal you can use. Works well for small things. This page here covers most of what you need I think
https://learn.microsoft.com/en-us/azure/azure-functions/functions-create-function-app-portal?pivots=programming-language-javascript

If you create a node.js function the only code you need really is this

module.exports = async function (context, req) {
  // fetch the URL passed as a query parameter and relay the body back
  const response = await fetch(req.query.url);

  context.res = {
    body: await response.text()
  };
};


and then you just make a request to that function from your PC and pass a URL as a query parameter
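Calling it from your PC is then one GET with the target page as a query parameter. A hedged Python sketch of building that request URL (the hostname, function name, and key below are placeholders, not real endpoints):

```python
from urllib.parse import urlencode

def build_invoke_url(app_host, func_name, target_url, code=""):
    """Build the invocation URL for a hypothetical proxy function app."""
    params = {"url": target_url}
    if code:  # function-level auth key, if the function isn't anonymous
        params["code"] = code
    return f"https://{app_host}/api/{func_name}?{urlencode(params)}"

# Placeholder hostname for illustration:
invoke = build_invoke_url("myscraper.azurewebsites.net", "proxy",
                          "https://example.com/page")
```

From there `urllib.request.urlopen(invoke)` (or requests) returns whatever HTML the function fetched.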
>>
>>100200338
Yeah most websites don't like it. Some are ok though. Ebay works with Tor without any captchas or anything last time I checked like last year some time. Not disabling javascript in Tor helps too
>>
>>100198235
I want to have sex with Miku I want to have sex with Miku she's so cute and hot I want to breed her over and over again I want to cum inside Miku I want to cum inside Miku I want to coom inside her so bad
>>
>>100200094
Never heard of people doing that with ISP internet connections, just with mobile. There are a few providers which sell premade modem/stick farms, for example https://xproxy.io/ (idk if they are good).

>>100200133
100$ so far. will test it more before scaling. I'm making sure to remain in the 5% CTR range. my automated browser does random tasks daily on the internet.
>>
>>100201185
How's this done with mobile? Do you stick a SIM card into the box or something?

Using a fuckton of proxy bandwidth a month on bruteforcing accounts, wanna try to cut down on costs
>>
>>100201185
So what is it that you do? Click ads on your own website through browser automation using proxies?
>>
>>100192529
im building a giant stalker scraper that will analyze normies' behaviour and relationships over the internet. i got all this data in the database and a scraper that scrapes all links in my link table every 5 minutes.
what should i do with all this data? should i make a discord webhook that sends messages when someone is online? should i make a graph with relationship maps?
>>
>>100201345
What do you have to gain from this?
>>
>>100201185
Have you thought about doing this with Spotify? I've heard it's way easier and really easy to scale
>>
>>100201522
That's what he's asking faggot
>>
>>100201522
i want to be my own nsa. the nsa does surveillance on american citizens.
i want to surveil them and gather data for my own sake.
this way i can see when a friend stabs me in the back, or see what server he is playing on even though steam shows him as 'offline'
>>
>>100202039
>i want to be my own nsa
that sounds cool
>or can see what server he is playing on even though steam shows him as 'offline'
that does not sound cool
>>
>>100202183
i use this to monitor gangstalking behavior in my rust servers. i have seen a guy constantly changing his steam name with the same steamid targeting me across different servers.
this is not cool and it's really annoying. i just want to map out these anomalies and get the upper hand. imagine you analyse a whole year of when someone goes online and offline: you can predict his behavior. coupled with other data i can predict every move someone makes
>>
>>100202243
Holy schizo
>>
bump
>>
File: 1698576304027976.png (171 KB, 1492x723)
>>100200030
here ya go
setup: https://litter.catbox.moe/09xrrp.zip
based on: https://github.com/bernardko/mullvad-proxy
>>
>>100198451
no, i just think xe is one of us
>>
>>100202707
just wait until you have 佳哥玩游戏 on your ass in the middle of the night, following you to every new server
>>
How should I bulk download images from archiveofsins.com? I have the image URLs ready, would curl-impersonate work?
>>
>>100206384
I piped them into
curl --verbose -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36'
which works for all but five links
>>
>>100206384
yeah, seems to work, searx tells me to tell you to xargs to make it go through the urls
Don't know how Cloudflare will react if you hit it hard; consider proxying
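If xargs feels clunky, the same loop is a few lines of Python shelling out to curl (a sketch: swap the binary name for a curl-impersonate wrapper if you use that, and the output naming scheme here is made up):

```python
import subprocess
import time

UA = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36")

def build_cmd(url, out_path, binary="curl"):
    # argv for one download; -f fails on HTTP errors, -L follows redirects
    return [binary, "-sfL", "-H", f"User-Agent: {UA}", "-o", out_path, url]

def download_all(urls, delay=1.0, binary="curl"):
    failed = []
    for i, url in enumerate(urls):
        out = f"{i:05d}_{url.rsplit('/', 1)[-1]}"  # keep original filename
        if subprocess.run(build_cmd(url, out, binary)).returncode != 0:
            failed.append(url)  # collect for a retry pass
        time.sleep(delay)  # be gentle so Cloudflare doesn't notice
    return failed
```

The failed list catches the handful of links that 403 so you can retry them behind a proxy.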
>>
>>100204683
> NordVPN seems to have seriously tightened up and now leaked accounts are promptly security locked
Surfshark and IPVanish logs are still a thing. Surfshark requires proxies to crack (and curl-impersonate) though IPVanish can actually be done without proxies. It allows you to login with your credentials directly on OpenVPN so you could make a script to check through your combolist without any proxies by just testing logins via OpenVPN.
>>
>>100204683
Don't VPN proxies get sent through captcha hell and blocked on a bunch of stuff? Plus, there are maybe 5 VPN locations max per provider, so if you need a bunch of US proxies, good luck
>>
>>100208521
My setup gets through Cloudflare and Google firewalls easy, obviously it's shared and datacenter which needs to be taken into account but I've never really had complaints about quality.
Mullvad has quite a few locations and specific servers (looking at my stats page for the Slimvad thing, it's at about ~430 total relays) but this is indeed another variable to take into account with any given provider. If you need a bunch of IPs in one very specific country guaranteed you should probably just buy them from a proxy store outright.
>>100208438
Nice, noted.
>>
>>100208925
One other way shittier option is to just run masscan on all the ports that HTTP/SOCKS4/SOCKS5 proxies are usually hosted on then just have a script to check all the results to see if you can get really shitty exposed DC proxies (this is actually how "free proxies" on breachforums/cracked.to are found)

At this point I'm actually thinking, what if I just wrote some malware that turned a host machine into a proxy? If one could get a bunch of virus downloads in the US, it should be extremely easy to build a huge network of high quality US resi IPs
>>
>>100209015
> malware that turned a host machine into a proxy
On this, it could be both more scaleable and even higher quality to do this with mobile devices. If you can write a malicious android app, most phones have way higher uptime than computers, and normalfags tend to use phones over computers nowadays. They also have LTE IPs when on cell data
>>
>>100209015
>One other way shittier option is to just run masscan on all the ports that HTTP/SOCKS4/SOCKS5 proxies are usually hosted on then just have a script to check all the results to see if you can get really shitty exposed DC proxies
Yeah, but these are EXTREMELY shitty, you have to assume that port will stop connecting at literally any time and is being raped by literally everyone.
You could probably put in some elbow grease and roundrobin the most stable ones above a certain speed that pass botchecks though, but >effort
> what if I just wrote some malware that turned a host machine into a proxy?
I'm pretty sure the ruskis and chinks actually already do this a lot to make profitable botnets.
>>
>>100209125
> Yeah, but these are EXTREMELY shitty
Obviously, this is literally the black people option for niggers that can't afford 1GB DC proxies

> I'm pretty sure the ruskis and chinks actually already do this a lot to make profitable botnets.
If I were to do this, what would be some good spreading methods? Google ads + fake download links come to mind.

>>100209066
I really wanna do this but how would I get Google Play downloads? I imagine it would be extremely easy in practice but you would have to spend a fortune on advertising
>>
bump
>>
>>100199227
>>100199953
>>100200225
Someone explain this. Never used Azure but I've abused Github actions to run stuff on their VMs before and I want to abuse this too
>>
>>100211995
Based abuser
>>
'mp
>>
>>100211995
Azure Functions are very similar to AWS Lambdas; they're a "serverless" service. In a regular server like express.js you write logic to start a server and then define your API by creating a routing file with a bunch of routes and a function to handle each route. In Azure Functions you don't write any server startup code, and you use one Azure Function per API endpoint. If you're doing a simple scraping job you often only need one. Each Azure Function is literally just like a single function you'd have in your code; you can import libraries and call other functions and stuff of course.

But they're meant for smaller jobs: an Azure Function will only run for a max of 10 minutes. They're designed to be called a lot, though, and will automatically scale out so multiple clones of the function run in parallel if needed. They're pretty easy to set up, easier than a regular server on a VM or in Docker or something. They can have bindings to other Azure services too, so if you upload something to blob storage or to a queue service it will trigger a function to run and do something.

Github actions are a bit different: they're usually a Docker container or similar that's run as part of a CI/CD pipeline, so they're not really designed for running arbitrary jobs like Azure Functions are
>>
>>100216281
What lang do I need to write these functions in? Does it have to be anything in specific?

Also, how "free" are these Azure functions? Are they totally free? Like you can just sign up and start using it? Or does it require a credit card?
>>
>>100201185
>100$ so far. will test it more before scaling. I'm making sure to remain in the 5% CTR range. my automated browser does random tasks daily on the internet.

You are most likely gonna get b& when you try to cash out. Google tries to jew both sides the users and the advertisers
>>
bump
>>
File: 1683137158130733.jpg (53 KB, 719x601)
I feel like a massive failure never having profited a penny in my free time. what does /wsg/ have in store for me?
>>
>Was a virgin API cuck because I thought it'd be faster and nicer to just use the API
>API was badly documented, output only 200 results before rate limiting, can't access everything
>Stopped giving a fuck and use selenium driverless
Mfw I get everything I want and more.
>>
>>100220354
What do you use it on?
>>
>>100219220
Same. I will lurk patiently and share if I cook up something.
>>
>>100220729
Civitai
>>
'mp
>>
>>100220354
Just wait until this guy discovers BeautifulSoup and curl_cffi
>>
>>100223858
>BeautifulSoup
lmao is this 2016 or smt?
>>
>>100223970
What's wrong with BeautifulSoup?
>>
>>100219220
Write a username swapper and swap OG usernames. Ez paper
>>
'mp
>>
'mp
>>
>>100219220
>>100220802
Create private APIs for services that don't have first party APIs and sell access
>>
'mp
>>
>>100228791
Is it legal?
>>
>>100225930
For what purpose?
>>
>>100230369
For the most part yes. Companies may try to sue you but you can always just accept crypto and have good opsec
>>
>>100228791
this is an interesting suggestion. do you have examples of services doing such scheme?
>>
>>100232079
Think some redditfags did it while the whole drama surrounding their first-party API was going on
>>
>>100192529
I just use grab-site and sometimes pywb
>>
>>100228791
Like what? Desuarchive?
>>
>>100202243
We're going to get you. You will never be able to predict our next move.
>>
>>100233965
That might be a good idea.
>>
>>100225187
Too slow for parsing.
>>
>>100234892
What is better?
>>
>>100234947
selectolax
>>
File: AeKE05J.png (120 KB, 469x378)
120 KB
120 KB PNG
>>100199147
Thanks for sharing this

>So far, the following operations are supported:
>Getting the details of a tweet
>Liking/favoriting a tweet
>Retweeting/reposting a tweet
>Searching for the list of tweets that match a given filter
>Tweeting/posting a new tweet
>Replying to a tweet
If these features work then I am definitely using this. Managing twitter accounts is a pita and their API is priced to exclude single entities
>>
>>100221276
What are you using it to do?
>>
imagine being so low iq that you can't even comprehend a well documented api

thanks for the captchas, browser verifications and rate limits baboons
>>
>>100234947
Regex
>>
>>100235932
What did the passive aggressive anon mean by this?
>>
>>100236006
Based, I too love type 3 grammar
>>
>>100235932
>thanks for the captchas, browser verifications and rate limits baboons
Cry me a river faggot, if you can't bypass that you don't belong here :^)
>>
'mp
>>
>>100235932
Imagine being so low IQ you need a well documented read-only API with data 10 minutes behind and kike ratelimits
>>
File: images.jpg (7 KB, 224x225)
Anyone got scrapers or ideas for grabbing follower numbers from social media accounts? I know selenium can be used but I was thinking of doing it without that to keep the script light
>>
>>100241874
Any interesting website to scrape is behind two captchas and Cloudflare. Good luck bypassing that without selenium
>>
>>100241947
> Any interesting website to scrap is behind two captchas
Use capsolver

> and cloudflare. Good luck bypassing that without selenium
Curl_cffi + residential proxies never fail
>>
>>100192529
How do I scrape stuff if I have no autism, knowledge, or time to waste?
>>
>>100242200
Hire a pajeet from fiverr
>>
File: shark-keyboard.gif (2.3 MB, 480x390)
>>100242014
>Curl_cffi
Thanks fren. I'm also now looking into hrequests
>>
File: 1683922784706646.png (315 KB, 450x553)
I'm this >>100198235 anon, I dug around in the archive and found >>99854872 in an old thread. Currently trying Swiftproxy, I'm using more data than expected (~500MB, need to triple check that my code that disables images/styles/installs Ublock/swaps to 'vad for image/binary downloads is working correctly) but I'm still extremely satisfied, especially since they are residential and aren't locked to any timeframe.
>>100241874
Depends heavily on site, not sure what else to say.
Other than Twitter and Instagram, snscrape might work with no further effort.
Twitter and Instagram change API a lot, former definitely drops user information reliably without sign-in though. Instagram is a fucking bitch, bring an army of high quality IPs in any case.
>>100242200
Let's talk, I'm very busy and I don't know what you want done but also up for a challenge.
water3227@cock.li
>>
>>100242372
From AI to scraping, lmao, I'm sliding on this pipeline too and got into crypto before ai
>>
>>100242406
I've been into scraping for three or four years, art archival autism, some minor experiments with other things, etc.
AI stuff is great, I was with /aicg/ and then /lmg/ a lot dreaming of robowaifu. Didn't lose interest, still occasionally hop in to see what's going on, my focus is just elsewhere. I want to build more with LLMs and push voice synth but there's a shitton to grasp if you want to break new ground and I can only just barely keep up computationally with anything going on using my RTX 3060, Mixtral-intended PC. There's some easier theoretical applications for the existing stuff like imageboard/forum/whatever moderation assistance, enhanced search, tagging suggestions and recommendations, but as for the bolder stuff... someday.
I have a feeling the former interest will help datasetting for the latter though.
>>
>>100242692
> AI stuff is great, I was with /aicg/ and then /lmg/ a lot dreaming of robowaifu
I've been thinking about scraping Github for a bunch of browser Javascript and training an LLM with the inputs as the obfuscated versions of this JS and the expected outputs as the unobfuscated versions, with the goal of making an LLM that can deobfuscate JS.

Do you think this would work? If so, do you have any recommendations for making this work?
>>
is it just me or is elon making it harder and harder to scrape from twitter
>>100192587
why arent you already doing that
>>
File: 4change rebrand_.jpg (208 KB, 1920x960)
>>100245054
Twitter changed their api to a higher tier. It's mostly aimed at service providers now
>>
>>100243912
Given good base model and enough examples, I don’t see why not
Mixtral has excellent abilities when it comes to understanding code, but it’s MoE and I don’t know if training those has progressed from begging the French for documentation. Also severely overweight for the problem. Maybe you could wrangle Codellama or Llama3/8B…
There is a guide for finetuning in general, may be helpful: https://rentry.org/llm-training
>>
What are some good tools to get the "real" content from a website/HTML? I run puppeteer to scrape websites and send the result to an AI to get a summary of the content, but the HTML is filled with bullshit like ads, headers, etc. I just want the main content. Any good libraries or something that can effectively extract the relevant text from a website?
>>
>>100245966
Fix your script. You should be scraping main content div blocks only
>>
>>100246341
I am, but a lot of websites do all kinds of weird stuff for the HTML content.
>>
>>100246445
Learn to parse mate and put a proper delay
>>
File: looks-at-anon.jpg (156 KB, 500x333)
156 KB
156 KB JPG
>>100246445
Learn to spot the div classes that are related to what you want to grab
Read the thread because other scraper tools have already been mentioned
Try sharing your code so you can get more detailed advice
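As a concrete version of the advice above: once you've spotted the right div class, a stdlib-only extractor can walk the tree and keep text only inside that container while skipping script/style/nav noise. A sketch (the `article-body` class name is an invented example; match whatever the target site uses):

```python
from html.parser import HTMLParser

SKIP = {"script", "style", "nav", "header", "footer", "aside"}
VOID = {"br", "img", "hr", "meta", "link", "input", "source", "wbr"}

class MainTextExtractor(HTMLParser):
    """Collect text only inside elements whose class list has target_class."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0     # tag nesting depth inside the target container
        self.skipping = 0  # nesting depth inside SKIP tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return  # void elements never close; keep depth counting honest
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.target_class in classes:
            self.depth = 1  # entered the container
        if tag in SKIP:
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.depth:
            self.depth -= 1
        if tag in SKIP and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        if self.depth and not self.skipping and data.strip():
            self.chunks.append(data.strip())

def main_text(html, target_class="article-body"):
    parser = MainTextExtractor(target_class)
    parser.feed(html)
    return " ".join(parser.chunks)
```

Feed it the rendered HTML from puppeteer and pass the text on to the summarizer.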
>>
Reminder to parse __NEXT_DATA__ if it's present, way easier than parsing HTML shit.
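Concretely: Next.js pages embed their full page props as JSON in a `<script id="__NEXT_DATA__">` tag, so one regex plus `json.loads` gets you structured data with no HTML parsing (sketch):

```python
import json
import re

# loose match: attribute order on the tag varies between Next.js versions
NEXT_DATA_RE = re.compile(
    r'<script[^>]+id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)

def next_data(html):
    """Return the parsed __NEXT_DATA__ payload, or None if the page has none."""
    m = NEXT_DATA_RE.search(html)
    return json.loads(m.group(1)) if m else None
```

The interesting fields usually live under `props.pageProps`, but inspect a real page to confirm.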
>>
>>100192529
>discord.gg/9EKk3psXMr
why is there a discord?? just make a irc (rizon)/xmpp(yourdata.forsale) channel/room
>>
a curl_cffi just flew over my house
>>
bump
>>
When a search engine displays a paragraph or so under a result as a preview, that data is scraped, right? Is there a way to get a whole plain-text version of the webpage from that? I frequently run into issues where the info is formatted poorly, or there's a paywall, or the site doesn't even match the text preview at all, all of which could be remedied by getting even a few more sentences of the preview
>>
>>100254596
that usually happens when you search a question like
>who discovered america?
the paragraph is the exact copy of a part of the page that google deems appropriate
there is no way to obtain more text without visiting the page
>>
>>100254596
They're usually stored in meta tags in the HTML, inside <head> as <meta name="description">
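Pulling that tag out of fetched HTML is a few lines with the stdlib parser (sketch):

```python
from html.parser import HTMLParser

class MetaDescription(HTMLParser):
    """Grab the content of <meta name="description" content="...">."""
    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "description":
            self.description = a.get("content")

def meta_description(html):
    parser = MetaDescription()
    parser.feed(html)
    return parser.description
```

Note this only recovers the one or two sentences the site chose to expose; the rest of the preview text still requires fetching the page itself.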
>>
bump
>>
Be careful with extensions, it could affect your detectability in the future.
https://github.com/z0ccc/extension-detector
>>
File: high seas explorer.jpg (114 KB, 1024x525)
114 KB
114 KB JPG
>>100257975
You'd still want to have extensions if you want to protect your identity, because it helps for your footprint to appear to be normie-tier. What you want protection from is what personal data is being collected about you and how unique your digital footprint may be.
>See how trackers view your browser
https://coveryourtracks.eff.org/
>>
What would be the use case for web scraping? I understand downloading videos off of sites like youtube because they have fucking garbage policies.
What exactly do I scrape? I'm not scraping 4chan posts or porn for that matter. I don't care what you niggas say here. I don't give a shit about porn. No bait and a real question btw.
>>
>>100258271
>>100228791
>>
...never mind about swiftproxy, at least for the moment, the connection issues are really bad right now and shit is more likely to timeout than return data because of it.
i'd HOPE this isn't a pajeet scam where they start yanking my chain after a certain amount of money goes in. That would have been only ~$10 monero but still
Mildly pissed because I had shit to complete by the end of the month, but waiting on support response before buying shit from elsewhere
>>
>>100228791
That doesn't sound very legal desu
>>
>>100258940
Go back to your hugbox
>>
>>100234449
Would anyone actually pay for access to a desuarchive API?

Might do this
>>
>>100259346
>Would anyone actually pay for access to a desuarchive API?
I highly doubt it. When building online resources for profit it's not a good idea to go after low-hanging fruit. You need to put your thinking cap on and branch outside of your comfort zone
>>
bump!
>>
>all these people asking what to scrape/what the point is
why is this board so unimaginative?
>>
>>100261605
Zoomers
>>
>>100259346
Use it? Yes. Pay for it? No.
>>
>>100263097
I'll make a free desuarchive API if someone gives me a Linux box and domain to host it on
>>
>>100263111
+ possibly proxies if they're necessary
>>
>>100263111
A VPS is a few bucks per month + use a free domain
>>
>>100263180
You know anywhere I can get a free domain from without requiring a credit card?
>>
>>100263201
freedns
>>
Webmasters are generous enough to share their data via API. All you have to do is read the documentation and abide by the rate limits. You don't need these blunt tools.
>>
>>100265259
Not every website has an API, dumdum. And not every API is free. Furthermore, not all of the data that anons are using to build their own datastores come from one place
>>
File: 174093465.gif (2.67 MB, 480x268)
2.67 MB
2.67 MB GIF
>>100263180
>A VPS is a few bucks per month
That's for a low amount of bandwidth. If/when his project becomes popular he will most likely shut it down almost right away because he can no longer justify the investment because his provider would have increased his monthly bill. That's what happens to all of these do-gooder anons that try to build free resources without planning it out.
>>100263201
Even if you manage to keep cost low you still have a cost as well as the time sink for building a resource where people just call you a faggot and make fun of you for working for free.
>>
>>100265647
If it becomes popular you're supposed to make money from it so these costs shouldn't be an issue. Like catbox.
>>
File: OP_TRP_UPSIDE.jpg (307 KB, 1920x1080)
307 KB
307 KB JPG
>>100266543
Which take us full circle to how this little thread chain started. Welcome to the conversation
>>
>>100266614
Everything has value, even something like desuarchive would be useful to train ML models
>>
File: YbPqhqd.png (67 KB, 332x247)
67 KB
67 KB PNG
>>100267422
>Everything has value
>so work for free training ML models
Are you retarded? Reddit got paid $60M for using their data to help train AI. Why the fuck would you build the means to do that with 4chan data for free?
>>
>>100267697
>Grab 4chan data
>Sell it yourself
Think a bit
>>
>>100267697
>>100267755
> Implying anyone would pay for 4chan archives to train the next TayAI
>>
>>100269404
>Implying Claude Opus wasn't trained on 4chan data when it can even name /aicg/ tripfags
>>
>>100269446
> Implying they paid for that data instead of just scraping it themselves
>>
>>100270450
>Implying dev time is free
>>
>>100267755
>Grab 4chan data
Work
>>Sell it yourself
Sales and marketing = more work
>host it yourself
scaling expenses and work
>Think a bit
Did you? There are already models trained off 4chan data. How do you think the bots here work?
>>
>>100270748
You're on /wsg/, scraping is a hobby you enjoy not work
>>
Does anyone actually script captcha solving at significant scale over just paying clickworkers?
I think people have talked about it in these threads before. I love Anticaptcha and I'm not keen on spending what has to be an extraordinary amount of time on true automatic solving (weaving your way around the splintered current main providers, plus neverending bullshit like the challenge refusing to send you audio based on your profile), but I'd be open to anything that makes the task even cheaper.
>>
>>100272327
I've actually seen recaptcha bypasses publicly available

Generally they just mimic the requests that happen when you click on the checkbox, and retry the request after rotating IPs if one of those secondary checks pops up. Requires good sticky residential proxies, but it works

They usually look a bit like this:
https://github.com/xcscxr/Recaptcha-v3-bypass/blob/main/recaptcha-v3.py
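Stripped down, it's just two requests: fetch the anchor iframe, pull the session token out of the HTML, then replay the reload POST. A rough sketch of what that repo does (the regexes and the reason=Q parameter are taken from scripts like it; no promises Google hasn't changed the endpoints):

```python
import re

def extract_token(html):
    """Pull the per-session token out of the anchor iframe HTML."""
    return re.search(r'recaptcha-token.*?="(.*?)"', html).group(1)

def get_v3_token(anchor_url):
    """Mimic the two requests the v3 widget makes and return a token.

    anchor_url is the full .../recaptcha/api2/anchor?ar=1&k=<sitekey>&co=...&v=...
    URL you see in the site's network tab; it differs per site.
    """
    import requests  # third-party, assumed installed

    s = requests.Session()
    token = extract_token(s.get(anchor_url).text)
    base, _, query = anchor_url.partition("?")
    params = dict(p.split("=", 1) for p in query.split("&"))
    # the reload endpoint reuses the sitekey/version/origin from the anchor URL
    resp = s.post(
        base.replace("anchor", "reload") + "?k=" + params["k"],
        data={"v": params["v"], "reason": "Q", "c": token,
              "k": params["k"], "co": params["co"]},
    ).text
    return re.search(r'"rresp","(.*?)"', resp).group(1)
```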
>>
>>100192529
Is scraping fun

Someone told me to get into this because I'm learning web dev. I learned JavaScript, but I don't really know what to code and I don't want to lose it; I want to find a purpose.
>>
>>100263111
https://desuarchive.org/_/api/chan/post/?board=g&num=100263111
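Same endpoint works from a script, stdlib only. A sketch (board/num here are just this post's; other FoolFuuka archives use the same URL scheme):

```python
import json
import urllib.request

def foolfuuka_post_url(archive, board, num):
    """Build the FoolFuuka single-post API URL (same scheme desuarchive uses)."""
    return f"https://{archive}/_/api/chan/post/?board={board}&num={num}"

def fetch_post(archive, board, num):
    """Fetch one post as a dict (fields like comment/timestamp/media)."""
    req = urllib.request.Request(
        foolfuuka_post_url(archive, board, num),
        headers={"User-Agent": "Mozilla/5.0"},  # some archives 403 the default UA
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# e.g. fetch_post("desuarchive.org", "g", 100263111)["comment"]
```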
>>
>>100275303
Is there any way to get a list of posts? Maybe even all posts on the site?
>>
>>100275739
https://4plebs.tech/foolfuuka/
>>
>>100275769
https://archive.org/details/desuarchive_db_201909
>>
File: .png (121 KB, 881x697)
121 KB
121 KB PNG
>>
>>100192529
Out of sheer curiosity: if I were to have multiple IP addresses on a local machine to webscrape, how would I implement that? Multiple VMs?
>>
>>100272327
Here is the hard truth: if you have good ISP IPs you won't get any captchas, so you won't need an anticaptcha
>>
>>100192592
Dude looks like a harkonnen
>>
>>100278660
kek
>>
>>100192529
>>100267755
>>100269404
>>100269446
you know there are dumps of 4chan archives shared freely every year by the bibanon group, right? there's little incentive to even try to make money off 4chan data unless you're providing a service, and even then
>>
>>100192529
>consumer
missed opportunity to call it a consoomer
>>
>>100192529
does anyone scrape grocery prices? that's probably the easiest way to save money with this hobby. You could even think of it as making money
>>
>>100279780
Are you talking about desuarchive?
>>
>>100279983
4plebs is the most consistent dumper, and has the best search options available, but yes.

I have my own archive set up at home and need to learn how Elasticsearch works with a MySQL database in order to leverage all this data. I LOVE searching taboo concepts with it. Lots of stuff that gets ignored or deleted from the log is actually very interesting
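The plumbing I mean is basically: SELECT rows out of MySQL, stream them into Elasticsearch's bulk API. A sketch (the column names are guesses at the FoolFuuka schema, and the pymysql/elasticsearch client calls are commented out since those libraries are assumptions):

```python
def rows_to_actions(rows, index="posts"):
    """Turn MySQL row dicts into elasticsearch bulk actions.

    The row keys (num, comment, timestamp) are assumptions about the
    archive's schema; adjust to whatever your tables actually hold.
    """
    for row in rows:
        yield {
            "_index": index,
            "_id": row["num"],  # post number makes a natural unique id
            "_source": {
                "comment": row["comment"],
                "timestamp": row["timestamp"],
            },
        }

# With the real clients installed (both third-party, assumed):
#   from elasticsearch import Elasticsearch, helpers
#   import pymysql
#   db = pymysql.connect(host="localhost", db="archive",
#                        cursorclass=pymysql.cursors.DictCursor, ...)
#   cur = db.cursor()
#   cur.execute("SELECT num, comment, timestamp FROM g")
#   helpers.bulk(Elasticsearch("http://localhost:9200"), rows_to_actions(cur))
```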
>>
>>100279780
.moe and all of his sites are black holes with no dumps, walled behind Cloudflare
the images especially are nowhere else
lmao
>>
>>100279943
There are websites that do local grocery store pricing. I guess anons can see if there aren't any for their city and then do that. You'd still need to work out a way to monetize it and have things in place to keep that data current
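Keeping the data current is the easy half; a sketch of the storage side with SQLite (store/item names are made up, and the actual per-store scraping is the hard part and not shown). Keeping price history instead of overwriting lets you tell a real discount from a fake one:

```python
import sqlite3
import time

def record_price(db, store, item, price, ts=None):
    """Append one observed price with a timestamp."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS prices (store TEXT, item TEXT, price REAL, ts REAL)"
    )
    db.execute(
        "INSERT INTO prices VALUES (?, ?, ?, ?)",
        (store, item, price, ts if ts is not None else time.time()),
    )

def cheapest(db, item):
    """Latest known price per store for an item, cheapest first."""
    return db.execute(
        """SELECT store, price FROM prices p
           WHERE item = ? AND ts = (SELECT MAX(ts) FROM prices
                                    WHERE store = p.store AND item = p.item)
           ORDER BY price""",
        (item,),
    ).fetchall()
```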
>>
>>100280113
check out this absolute monster archive by bepis
https://ultra.gondola.pics/info

>>100280285
with this type of project, just keeping the data for yourself is a way to save money. That's kinda what I was thinking
>>
>>100280344
>a way to save money
How?
>>
>>100276468
Need to read more into it, but they're referencing the GDPR (EU law) and also seem to assume that personal data collection (LinkedIn- and Facebook-type projects) is the only type of scraping that exists.
>>
>>100280285
And how will you get your local grocery store pricing, genius? Are you going to tour all the stores every day lol?
>>
>>100280344
website's going to go down for an unspecified amount of time as I'm importing data through April 2024, plus cripplechan (why the fuck is there still a 4chan filter?) + crystal cafe + lolcow farm

i'm not too sure what other altchans are worth my time to host archives of
>>
>>100284417
I wonder how many GB are needed to store all that shit
>>
>>100285003
For just the text & search data it's around 1-2TB. If I wanted to store full images for every post I have, it's roughly 160TB of space needed
Thumbnails would also be a considerable fraction, maybe another 15-20TB

That could be pushed down by deduping with image fingerprinting, but that has a lot of caveats that people wouldn't be happy with

Buying enough drives just for a single copy of the data would cost around $2600, but things like RAIDZ, tape backups, the server itself, future expansion etc. would drive that number up a shitload. I think my original estimate of $15k is wrong but it would still be a lot more than I'm willing to pay out of pocket for something that costs me money to host
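For reference, the napkin math as a script (the $/TB comes straight from the $2600/160TB figure; the redundancy multiplier for RAIDZ + backups is a guess):

```python
def storage_cost(full_tb=160, thumbs_tb=20, text_tb=2,
                 usd_per_tb=16.25, redundancy=1.0):
    """Rough drive cost for the archive; every input is a thread estimate.

    usd_per_tb=16.25 is just $2600 / 160 TB; redundancy=1.0 means a single
    bare copy with no parity, no backups, no server.
    """
    total_tb = full_tb + thumbs_tb + text_tb
    return total_tb, total_tb * usd_per_tb * redundancy

tb, usd = storage_cost()                       # single bare copy
tb3, usd3 = storage_cost(redundancy=3.0)       # parity + one backup, say
```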
>>
>>100285210
So mainly glowniggers are willing to spend all that cash
>>
>>100282352
They have APIs, dipshit
>>
>>100286754
Good luck using that shit for free dumbo
>>
>>100287063
Learn how to follow conversations. What has been said already is to learn how to monetize your efforts before building these systems that you want to offer to the public. And btw I think some places do offer some level of API access for free, but you need to present your plan for how that benefits them i.e. affiliate marketing or the like.

If you want to build some random scraper for free because "coding is kool d00d" then I suggest sticking to small open source datasets and mom and pop blogs
>>
>>100287258
Imagine needing an API instead of being a scrapechad
>>
What's the best way to use MITMproxy for Android apps? It seems like each app has to be patched individually for it to work on the newest Android versions. Would it be better to use a VM with an older version?
>>
>>100198235
I've been using smartproxy. Commercial proxies work OK, although latency is a bit high (1-2s delay)
>>
>>100288375
Does smartproxy have any blacklist? For my scraping usecase I need to reach out to a lot of mail providers and most proxy providers seem to blacklist mail providers to prevent abuse
>>
>>100288426
https://smartproxy.com/faq/general/do-you-have-any-blocked-sites

pic related costs me 11$/mo
>>
>>100288439
FUCK

They've blocked literally everything I want to access

Know anything with similar prices that allow cracking/mail services?
>>
>>100288474
one that comes to mind is webshare.io; they're really, really cheap, although I'm pretty sure they block the same subset of services.
>>
>>100288489
Pretty sure they do but I'm also pretty sure you can bypass the block by just contacting the IP directly/having the DNS lookup done without the proxy
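Something like this, stdlib only (the proxy address, IP and hostname in the usage line are placeholders). Caveat: a proxy that sniffs the TLS SNI will still see the domain; this only beats filters applied to the CONNECT line:

```python
import socket
import ssl

def connect_request(target_ip, port=443):
    """The CONNECT line the proxy sees: a raw IP, not a blacklistable domain."""
    return f"CONNECT {target_ip}:{port} HTTP/1.1\r\nHost: {target_ip}:{port}\r\n\r\n"

def fetch_via_ip(proxy, target_ip, hostname, path="/"):
    """Tunnel to target_ip through an HTTP proxy, then do TLS using the real
    hostname (still needed for SNI and certificate verification)."""
    sock = socket.create_connection(proxy)
    sock.sendall(connect_request(target_ip).encode())
    status = sock.recv(4096).split(b"\r\n")[0]
    if b"200" not in status:
        raise ConnectionError(status.decode())
    tls = ssl.create_default_context().wrap_socket(sock, server_hostname=hostname)
    tls.sendall(
        f"GET {path} HTTP/1.1\r\nHost: {hostname}\r\nConnection: close\r\n\r\n".encode()
    )
    chunks = []
    while data := tls.recv(65536):
        chunks.append(data)
    return b"".join(chunks)

# fetch_via_ip(("proxy.example", 8080), "93.184.216.34", "example.com")  # placeholders
```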
>>
>>100288578
nah, you need to connect to webshare.io, they give you a port number to select the IP
>>
Does anyone know how to get gallery-dl to download text posts of 4chan archives?
>>
how do i deal with hcaptcha?
>>
>>100291763
this. I don't want to pay. I'm broke.
>>
>>100219220
traditionally web scraping hasn't made a lot of money and is dominated by non-Americans because foundationally you're profiting off of someone else's work and that scares big money away. except these days they call it "AI Training" and yeet their credibility into the wind because Elon Musk said funny word on xitter

point is most people don't make money off any of this and if so it's only a little. it's mostly a hobby for hoarders and voyeurs
>>
>>100288474
At this point, just buy a SIM card and use it as a proxy
>>
>>100292379
A lot of services are using APIs to sell you something (affiliates) or display ads with the scraped content (anime, manga, porn, whatever). If you think they're not making any money from this you're just retarded
>>
>>100290383
Just go to an archive? Or copy and paste.
>>
>>100293238
anon, the only ones making big money in internet advertising are the big tech companies that hold a monopoly on their part of the industry. Nickel-and-dime crooks trying to make a quick buck on some porn redirect scam are paying Venezuelans $10 a week to do their scraping
>>
>>100293807
If it's that easy just scale up
>>
>>100292379
are some 500 trumps a month unachievable margins?
I'm comfortable with keeping operations to a minimum. From the replies I gather I'd need a good idea more than good-enough scale
>>
>>100292379
when cloudflare and google have completely rewritten history, untainted human data will be a scarce resource and valuable


