/g/ - Technology


Thread archived.




File: scraper.png (1.62 MB, 1892x2142)
Web Scraping General

AI datamine general

QOTD: If you had to fine-tune an LLM, what would you optimize it for and where would you scrape training data from?

> Captcha services
https://2captcha.com/
https://www.capsolver.com/
https://anti-captcha.com/

> Proxies
https://infiniteproxies.com/ (no blacklist)
https://www.thunderproxies.com/
http://proxies.fo/

> Network analysis
https://mitmproxy.org/
https://portswigger.net/burp

> Scraping tools
https://beautiful-soup-4.readthedocs.io/en/latest/
https://www.selenium.dev/documentation/
https://playwright.dev/docs/codegen
https://github.com/lwthiker/curl-impersonate
https://github.com/yifeikong/curl_cffi

Official Discord: discord.gg/9EKk3psXMr
Last thread: >>100166675
>>
>>100192529
>QOTD: If you had to fine-tune an LLM, what would you optimize it for and where would you scrape training data from?
I'd finetune it on a bunch of shitcoin candlestick charts to make myself rich
>>
File: 1711577564007128.jpg (555 KB, 2000x1333)
>>100192529
I wrote my own ChatGPT scraper + GUI in C++
>>
is he one of us?
>>
don't you have any other images for the OP?
>>
>>100193742
No
>>
File: 1711248980632817.jpg (126 KB, 768x1024)
>~$80/3months for 20 datacenter proxies
Before I renew, am I overpaying or is this about expected?
>>100192529
>QOTD
Finetuning... probably Wikia/Fandom and F-List to get whatever to understand RP concepts and characters better.
For pretraining though, I will state the same thing I did in /lmg/ (when I was bigger into AI and not totally focused on a scraping project) and say that Anna's Archive is untapped if you are willing to put the work into cleaning. Almost as big* as Commoncrawl but that and Books3 have been implemented into datasets over and over and over again. If you give any credence to the "slop problem" of collecting more garbage and GPT output in web scrapes as time goes on, it's a potential alternative.
>>100193742
On the off chance the guy is here, NIG respond to my emails. If shit falls through completely I want to know if I can pick up the data.
>>
>>100193742
> spy.pet
You the actual spy.pet admin?
>>
>>100198235
> am I overpaying
Yes, especially for DC proxies
>>
>>100198458
Noted. I'll take a look at services in the OP as well as Proxyrack. I DON'T need huge amounts of data at all (I do downloads as needed over a 'vad VPN rig), I just need something stable and IPv4-compatible to have my browser instances connect to.
>>
If people want I'll do an explanation image of the vad/Nordlord (VPN + Docker + HAproxy) and post scripts.
I doubt this is really anything "new", but variations of this have served my projects well as a way to get low-cost, unbanned, unlimited bandwidth proxies when stability doesn't matter all that much
>>
File: bobo-peeking-behind-door.png (392 KB, 1550x1404)
I found this library on npm the other day. It's for getting data from the Twitter API that you usually can't get without paying.
https://github.com/Rishikant181/Rettiwt-API
I didn't really look into how it works, but it does seem to work, at least the bits I tried. For example, with the free Twitter API tier you can only get user details for your own account, but with this library you can get details for any account. It doesn't seem to include everything though: the user details are missing the "website" property, so if a user has a website URL on their profile this library doesn't return it, but it seems to return most data
>>
>>100198235
>>~$80/3months for 20 datacenter proxies
>Before I renew, am I overpaying or is this about expected?
That does sound a bit expensive, but it depends on how much you're using it. On Azure you can create a function app and you get 1 million free executions per month. I'm fairly certain that if you create another function app you get another million, but you'd need to check that. You could just create 20 function apps if you wanted. I scraped data from eBay using 10 function apps in various regions and it prevented the IP rate limiting that was happening on my local PC
>>
>>100199227
How much did you pay for Azure?
>>
Why are ISP proxies so good? I'm literally doing click fraud with AdSense and they don't bat an eye and just keep paying (I have 0 legitimate traffic lol). Is it because they are set up and configured differently from your average botnet residential proxies? TCP fingerprint? Anything else I should add to my proxy test routine?
>connect your proxy in anti detect browser that takes care of all spoofing (like webRTC and spoofing proxy ip geo location)
>go to https://proxy.incolumitas.com/proxy_detect.html (the latency test can be ignored)
>go to https://browserleaks.com/ip check ISP and usage type. Check TCP/IP fingerprint.
>do UDP/TCP port scan on proxy ip for common ports. all of them open = bad
>check IP in blacklist (ipqualityscore.com too strict IMO, more something like https://www.ipvoid.com/ip-blacklist-check/ or a google search)
>DNS check (?)
>>
>>100199056
I'm interested
>>
File: mobiproxy.jpg (63 KB, 445x586)
>>100200016
You know something I was wondering?

I was talking with a friend and apparently his friend bought this device that could get him 10 IPs from his ISP simultaneously and he could rotate it at will. (pic related)

He lives in Vietnam apparently and idk their infrastructure, but I was wondering if I could split up the coaxial connection at my house to get me multiple simultaneous connections I could use and rotate at will.

Does anyone here have any experience with this subject?
>>
>>100200016
How much have you made? If it were that easy everyone would be doing it
>>
>>100199953
It was free for what I was doing because I was under the 1 million executions per month. That was in total across the 10 function apps though; I'm not sure if I would be charged more if I, say, used 800k executions on every function app so the total was 8 million executions for the month. You might have to pay for storage too, because function apps log things to storage occasionally and you need a storage account to create a function app, but storage accounts are free too and you'd probably only be charged a few cents a month for storage, like 10 cents or something. So I probably did pay a few cents for that, but it was basically zero. The pricing is here if you want to work out exactly how much you'd pay
https://azure.microsoft.com/en-au/pricing/details/functions/
Or you can use the pricing calculator
https://azure.microsoft.com/en-au/pricing/calculator/?service=functions
>>
You can proxy with Tor too. Obviously it's pretty slow though, and it doesn't work on a lot of websites, but it might be useful in some instances if you're getting IP restriction problems using other methods.

What you need to do is go to the Tor download page, then download the "Tor expert bundle" for your computer
https://www.torproject.org/download/tor/

Then put the expert bundle files in a folder somewhere on your PC. Then in a command line go to that folder and run tor.exe which should start the service, you'll see it say "bootstrapping" for a while.

Then you can use the proxy like in this node.js sample that's using playwright. Or you can just do direct http GET requests for the HTML if the website will allow that

import { chromium } from "@playwright/test";

(async () => {
  // route the browser through the local Tor SOCKS proxy
  const browser = await chromium.launch({
    proxy: {
      server: "socks5://127.0.0.1:9050"
    },
    headless: false
  });
  const page = await browser.newPage();
  await page.goto("https://api.ipify.org/?format=json");
  const html = await page.content();
  console.log(html);
  await browser.close();
})();


In some cases you might want to restrict the Tor service to only use exit nodes from a certain country. I can't remember exactly how to do that, it's in a settings file you need to edit somewhere, but you can probably find out pretty easy online
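The country restriction mentioned above lives in Tor's torrc config file. A sketch for US-only exits (two-letter country codes in braces; StrictNodes makes the restriction hard rather than best-effort):

```
# torrc
ExitNodes {us}
StrictNodes 1
```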
>>
>>100200225
Yeah, Azure free trial says it gives you a million function executions per month 'Always' and not just for 12 months. I think I'll give it a try, I've never used Azure before
>>
>>100200225
...another note, each function app will have an IP address which is for that data centre. So if you need multiple different IP addresses I'm pretty sure you need to create a function app in different regions. Like one in the U.S. and one in the UK and one in Australia and so on and you should get a different IP for each one. There are other ways to get IP addresses on Azure but because I'm familiar with function apps and because they're so cheap I decided to just do it this way
>>
>>100200300
> You can proxy with Tor too
Maybe in like 2008. I hope you're ready for captcha hell and blocks on most websites given all exit nodes are public and all of them have really bad fraud scores
>>
>>100200309
Usually I'd set up CI/CD for Azure Functions when I'm working on a project. But if you're doing something small like making a proxy you can do all your code in the browser. When you create the function app you can then create a function, and there's a code editor and stuff in the portal you can use. Works well for small things. This page here covers most of what you need I think
https://learn.microsoft.com/en-us/azure/azure-functions/functions-create-function-app-portal?pivots=programming-language-javascript

If you create a node.js function the only code you need really is this

module.exports = async function (context, req) {
  // fetch the URL passed as a query parameter and relay the body back
  const response = await fetch(req.query.url);

  context.res = {
    body: await response.text()
  };
};


and then you just make a request to that function from your PC and pass a URL as a query parameter
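Calling it from your PC is then one GET with the target page as a query parameter. A hedged Python sketch of building that request URL (the hostname, function name, and key below are placeholders, not real endpoints):

```python
from urllib.parse import urlencode

def build_invoke_url(app_host, func_name, target_url, code=""):
    """Build the invocation URL for a hypothetical proxy function app."""
    params = {"url": target_url}
    if code:  # function-level auth key, if the function isn't anonymous
        params["code"] = code
    return f"https://{app_host}/api/{func_name}?{urlencode(params)}"

# Placeholder hostname for illustration:
invoke = build_invoke_url("myscraper.azurewebsites.net", "proxy",
                          "https://example.com/page")
```

From there `urllib.request.urlopen(invoke)` (or requests) returns whatever HTML the function fetched.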
>>
>>100200338
Yeah most websites don't like it. Some are ok though. Ebay works with Tor without any captchas or anything last time I checked like last year some time. Not disabling javascript in Tor helps too
>>
>>100198235
I want to have sex with Miku I want to have sex with Miku she's so cute and hot I want to breed her over and over again I want to cum inside Miku I want to cum inside Miku I want to coom inside her so bad
>>
>>100200094
Never heard of people doing that with ISP internet connections, just with mobile. There are a few providers which sell premade modem/stick farms, for example https://xproxy.io/ (idk if they are good).

>>100200133
100$ so far. will test it more before scaling. I'm making sure to remain in the 5% CTR range. my automated browser does random tasks daily on the internet.
>>
>>100201185
How's this done with mobile? Do you stick a SIM card into the box or something?

Using a fuckton of proxy bandwidth a month on bruteforcing accounts, wanna try to cut down on costs
>>
>>100201185
So what is it that you do? Click ads on your own website through browser automation using proxies?
>>
>>100192529
im building a giant stalker scraper that will analyze normies' behaviour and relationships over the internet. i got all this data in the database and a scraper that scrapes all links in my link table every 5 minutes.
what should i do with all this data? should i make a discord webhook that sends messages when someone is online? should i make a graph with relationship maps?
>>
>>100201345
What do you have to gain from this?
>>
>>100201185
Have you thought about doing this with Spotify? I've heard it's way easier and really easy to scale
>>
>>100201522
That's what he's asking faggot
>>
>>100201522
i want to be my own nsa. the nsa does surveillance on american citizens.
i want to surveil them and gather data for my own sake.
this way i can see when a friend stabs me in the back, or see what server he is playing on even though steam shows him as 'offline'
>>
>>100202039
>i want to be my own nsa
that sounds cool
>or can see what server he is playing on even though steam shows him as 'offline'
that does not sound cool
>>
>>100202183
i use this to monitor gangstalking behavior in my rust servers. i have seen a guy constantly changing his steam name with the same steamid targeting me across different servers.
this is not cool and it's really annoying. i just want to map out these anomalies and get the upper hand. imagine you analyse a whole year of when someone goes online and offline: you can predict his behavior. coupled with other data i can predict every move someone makes
>>
>>100202243
Holy schizo
>>
bump
>>
File: 1698576304027976.png (171 KB, 1492x723)
>>100200030
here ya go
setup: https://litter.catbox.moe/09xrrp.zip
based on: https://github.com/bernardko/mullvad-proxy
>>
>>100198451
no, i just think xe is one of us
>>
>>100202707
just wait until you have 佳哥玩游戏 on your ass in the middle of the night, following you to every new server
>>
How should I bulk download images from archiveofsins.com? I have the image URLs ready, would curl-impersonate work?
>>
>>100206384
I piped them into
curl --verbose -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36'
which works for all but five links
>>
>>100206384
yeah, seems to work, searx tells me to tell you to xargs to make it go through the urls
Don't know how Cloudflare will react if you hit it hard; consider proxying
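If xargs feels clunky, the same loop is a few lines of Python shelling out to curl (a sketch: swap the binary name for a curl-impersonate wrapper if you use that, and the output naming scheme here is made up):

```python
import subprocess
import time

UA = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36")

def build_cmd(url, out_path, binary="curl"):
    # argv for one download; -f fails on HTTP errors, -L follows redirects
    return [binary, "-sfL", "-H", f"User-Agent: {UA}", "-o", out_path, url]

def download_all(urls, delay=1.0, binary="curl"):
    failed = []
    for i, url in enumerate(urls):
        out = f"{i:05d}_{url.rsplit('/', 1)[-1]}"  # keep original filename
        if subprocess.run(build_cmd(url, out, binary)).returncode != 0:
            failed.append(url)  # collect for a retry pass
        time.sleep(delay)  # be gentle so Cloudflare doesn't notice
    return failed
```

The failed list catches the handful of links that 403 so you can retry them behind a proxy.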
>>
>>100204683
> NordVPN seems to have seriously tightened up and now leaked accounts are promptly security locked
Surfshark and IPVanish logs are still a thing. Surfshark requires proxies to crack (and curl-impersonate) though IPVanish can actually be done without proxies. It allows you to login with your credentials directly on OpenVPN so you could make a script to check through your combolist without any proxies by just testing logins via OpenVPN.
>>
>>100204683
Don't VPN proxies get sent through captcha hell and blocked on a bunch of stuff? Plus, there are maybe 5 VPN locations max per provider, so if you need a bunch of US proxies, good luck
>>
>>100208521
My setup gets through Cloudflare and Google firewalls easy, obviously it's shared and datacenter which needs to be taken into account but I've never really had complaints about quality.
Mullvad has quite a few locations and specific servers (looking at my stats page for the Slimvad thing, it's at about ~430 total relays) but this is indeed another variable to take into account with any given provider. If you need a bunch of IPs in one very specific country guaranteed you should probably just buy them from a proxy store outright.
>>100208438
Nice, noted.
>>
>>100208925
One other way shittier option is to just run masscan on all the ports that HTTP/SOCKS4/SOCKS5 proxies are usually hosted on then just have a script to check all the results to see if you can get really shitty exposed DC proxies (this is actually how "free proxies" on breachforums/cracked.to are found)

At this point I'm actually thinking, what if I just wrote some malware that turned a host machine into a proxy? If one could get a bunch of virus downloads in the US, it should be extremely easy to build a huge network of high quality US resi IPs
>>
>>100209015
> malware that turned a host machine into a proxy
On this, it could be both more scaleable and even higher quality to do this with mobile devices. If you can write a malicious android app, most phones have way higher uptime than computers, and normalfags tend to use phones over computers nowadays. They also have LTE IPs when on cell data
>>
>>100209015
>One other way shittier option is to just run masscan on all the ports that HTTP/SOCKS4/SOCKS5 proxies are usually hosted on then just have a script to check all the results to see if you can get really shitty exposed DC proxies
Yeah, but these are EXTREMELY shitty, you have to assume that port will stop connecting at literally any time and is being raped by literally everyone.
You could probably put in some elbow grease and roundrobin the most stable ones above a certain speed that pass botchecks though, but >effort
> what if I just wrote some malware that turned a host machine into a proxy?
I'm pretty sure the ruskis and chinks actually already do this a lot to make profitable botnets.
>>
>>100209125
> Yeah, but these are EXTREMELY shitty
Obviously, this is literally the black people option for niggers that can't afford 1GB DC proxies

> I'm pretty sure the ruskis and chinks actually already do this a lot to make profitable botnets.
If I were to do this, what would be some good spreading methods? Google ads + fake download links come to mind.

>>100209066
I really wanna do this but how would I get Google Play downloads? I imagine it would be extremely easy in practice but you would have to spend a fortune on advertising
>>
bump
>>
>>100199227
>>100199953
>>100200225
Someone explain this. Never used Azure but I've abused Github actions to run stuff on their VMs before and I want to abuse this too
>>
>>100211995
Based abuser
>>
'mp
>>
>>100211995
Azure Functions are very similar to AWS Lambdas; they're a "serverless" service. In a regular server like express.js you write logic to start a server and then define your API by creating a routing file with a bunch of routes and a function to handle each route. In Azure Functions you don't write any server startup code, and you use one Azure Function per API endpoint. If you're doing a simple scraping job you often only need one. Each Azure Function is literally just like a single function you'd have in your code; you can import libraries and call other functions and stuff of course.

But they're meant for smaller jobs: an Azure Function will only run for a max of 10 minutes. They're designed to be called a lot, though, and will automatically scale out so multiple clones of the function run in parallel if needed. They're pretty easy to set up, easier than a regular server on a VM or in Docker or something. They can have bindings to other Azure services too, so if you upload something to blob storage or to a queue service it will trigger a function to run and do something.

Github actions are a bit different: they're usually a Docker container or similar that's run as part of a CI/CD pipeline, so they're not really designed for running arbitrary jobs like Azure Functions are
>>
>>100216281
What lang do I need to write these functions in? Does it have to be anything in specific?

Also, how "free" are these Azure functions? Are they totally free? Like you can just sign up and start using it? Or does it require a credit card?
>>
>>100201185
>100$ so far. will test it more before scaling. I'm making sure to remain in the 5% CTR range. my automated browser does random tasks daily on the internet.

You are most likely gonna get b& when you try to cash out. Google tries to jew both sides the users and the advertisers
>>
bump
>>
File: 1683137158130733.jpg (53 KB, 719x601)
I feel like a massive failure never having profited a penny in my free time. what does /wsg/ have in store for me?
>>
>Was a virgin API cuck because I thought it'd be faster and nicer to just use the API
>API was badly documented, output only 200 results before rate limiting, can't access everything
>Stopped giving a fuck and use selenium driverless
Mfw I get everything I want and more.
>>
>>100220354
What do you use it on?
>>
>>100219220
Same. I will lurk patiently and share if I cook up something.
>>
>>100220729
Civitai
>>
'mp
>>
>>100220354
Just wait until this guy discovers BeautifulSoup and curl_cffi
>>
>>100223858
>BeautifulSoup
lmao is this 2016 or smt?
>>
>>100223970
What's wrong with BeautifulSoup?
>>
>>100219220
Write a username swapper and swap OG usernames. Ez paper
>>
'mp
>>
'mp
>>
>>100219220
>>100220802
Create private APIs for services that don't have first party APIs and sell access
>>
'mp
>>
>>100228791
Is it legal?
>>
>>100225930
For what purpose?
>>
>>100230369
For the most part yes. Companies may try to sue you but you can always just accept crypto and have good opsec
>>
>>100228791
this is an interesting suggestion. do you have examples of services doing such scheme?
>>
>>100232079
Think some redditfags did it while the whole drama surrounding their first-party API was going on
>>
>>100192529
I just use grab-site and sometimes pywb
>>
>>100228791
Like what? Desuarchive?
>>
>>100202243
We're going to get you. You will never be able to predict our next move.
>>
>>100233965
That might be a good idea.
>>
>>100225187
Too slow for parsing.
>>
>>100234892
What is better?
>>
>>100234947
selectolax
>>
File: AeKE05J.png (120 KB, 469x378)
120 KB
120 KB PNG
>>100199147
Thanks for sharing this

>So far, the following operations are supported:
>Getting the details of a tweet
>Liking/favoriting a tweet
>Retweeting/reposting a tweet
>Searching for the list of tweets that match a given filter
>Tweeting/posting a new tweet
>Replying to a tweet
If these features work then I am definitely using this. Managing twitter accounts is a pita and their API is priced to exclude single entities
>>
>>100221276
What are you using it to do?
>>
imagine being so low iq that you can't even comprehend a well documented api

thanks for the captchas, browser verifications and rate limits baboons
>>
>>100234947
Regex
>>
>>100235932
What did the passive aggressive anon mean by this?
>>
>>100236006
Based, I too love type 3 grammar
>>
>>100235932
>thanks for the captchas, browser verifications and rate limits baboons
Cry me a river faggot, if you can't bypass that you don't belong here :^)
>>
'mp
>>
>>100235932
Imagine being so low IQ you need a well documented read-only API with data 10 minutes behind and kike ratelimits
>>
File: images.jpg (7 KB, 224x225)
Anyone got scrapers or ideas for grabbing follower numbers from social media accounts? I know selenium can be used but I was thinking of doing it without that to keep the script light
>>
>>100241874
Any interesting website to scrape is behind two captchas and Cloudflare. Good luck bypassing that without selenium
>>
>>100241947
> Any interesting website to scrap is behind two captchas
Use capsolver

> and cloudflare. Good luck bypassing that without selenium
Curl_cffi + residential proxies never fail
>>
>>100192529
How do I scrape stuff if I have no autism, knowledge, or time to waste?
>>
>>100242200
Hire a pajeet from fiverr
>>
File: shark-keyboard.gif (2.3 MB, 480x390)
>>100242014
>Curl_cffi
Thanks fren. I'm also now looking into hrequests
>>
File: 1683922784706646.png (315 KB, 450x553)
I'm this >>100198235 anon, I dug around in the archive and found >>99854872 in an old thread. Currently trying Swiftproxy, I'm using more data than expected (~500MB, need to triple check that my code that disables images/styles/installs Ublock/swaps to 'vad for image/binary downloads is working correctly) but I'm still extremely satisfied, especially since they are residential and aren't locked to any timeframe.
>>100241874
Depends heavily on site, not sure what else to say.
Other than Twitter and Instagram, snscrape might work with no further effort.
Twitter and Instagram change API a lot, former definitely drops user information reliably without sign-in though. Instagram is a fucking bitch, bring an army of high quality IPs in any case.
>>100242200
Let's talk, I'm very busy and I don't know what you want done but also up for a challenge.
water3227@cock.li
>>
>>100242372
From AI to scraping, lmao, I'm sliding on this pipeline too and got into crypto before ai
>>
>>100242406
I've been into scraping for three or four years, art archival autism, some minor experiments with other things, etc.
AI stuff is great, I was with /aicg/ and then /lmg/ a lot dreaming of robowaifu. Didn't lose interest, still occasionally hop in to see what's going on, my focus is just elsewhere. I want to build more with LLMs and push voice synth but there's a shitton to grasp if you want to break new ground and I can only just barely keep up computationally with anything going on using my RTX 3060, Mixtral-intended PC. There's some easier theoretical applications for the existing stuff like imageboard/forum/whatever moderation assistance, enhanced search, tagging suggestions and recommendations, but as for the bolder stuff... someday.
I have a feeling the former interest will help datasetting for the latter though.
>>
>>100242692
> AI stuff is great, I was with /aicg/ and then /lmg/ a lot dreaming of robowaifu
I've been thinking about scraping Github for a bunch of browser Javascript and training an LLM with the inputs as the obfuscated versions of this JS and the expected outputs as the unobfuscated versions, with the goal of making an LLM that can deobfuscate JS.

Do you think this would work? If so, do you have any recommendations for making this work?
>>
is it just me or is elon making it harder and harder to scrape from twitter
>>100192587
why arent you already doing that
>>
File: 4change rebrand_.jpg (208 KB, 1920x960)
>>100245054
Twitter changed their api to a higher tier. It's mostly aimed at service providers now
>>
>>100243912
Given good base model and enough examples, I don’t see why not
Mixtral has excellent abilities when it comes to understanding code, but it’s MoE and I don’t know if training those has progressed from begging the French for documentation. Also severely overweight for the problem. Maybe you could wrangle Codellama or Llama3/8B…
There is a guide for finetuning in general, may be helpful: https://rentry.org/llm-training
>>
What are some good tools to get the "real" content from a website/HTML? I run puppeteer to scrape websites and send the result to an AI to get a summary of the content, but the HTML is filled with bullshit like ads, headers, etc. I just want the main content. Any good libraries or something that can effectively extract the relevant text from a website?
>>
>>100245966
Fix your script. You should be scraping main content div blocks only
>>
>>100246341
I am, but a lot of websites do all kinds of weird stuff for the HTML content.
>>
>>100246445
Learn to parse mate and put a proper delay
>>
File: looks-at-anon.jpg (156 KB, 500x333)
156 KB
156 KB JPG
>>100246445
Learn to spot the div classes that are related to what you want to grab
Read the thread because other scraper tools have already been mentioned
Try sharing your code so you can get more detailed advice
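As a concrete version of the advice above: once you've spotted the right div class, a stdlib-only extractor can walk the tree and keep text only inside that container while skipping script/style/nav noise. A sketch (the `article-body` class name is an invented example; match whatever the target site uses):

```python
from html.parser import HTMLParser

SKIP = {"script", "style", "nav", "header", "footer", "aside"}
VOID = {"br", "img", "hr", "meta", "link", "input", "source", "wbr"}

class MainTextExtractor(HTMLParser):
    """Collect text only inside elements whose class list has target_class."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0     # tag nesting depth inside the target container
        self.skipping = 0  # nesting depth inside SKIP tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return  # void elements never close; keep depth counting honest
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.target_class in classes:
            self.depth = 1  # entered the container
        if tag in SKIP:
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.depth:
            self.depth -= 1
        if tag in SKIP and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        if self.depth and not self.skipping and data.strip():
            self.chunks.append(data.strip())

def main_text(html, target_class="article-body"):
    parser = MainTextExtractor(target_class)
    parser.feed(html)
    return " ".join(parser.chunks)
```

Feed it the rendered HTML from puppeteer and pass the text on to the summarizer.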
>>
Reminder to parse __NEXT_DATA__ if it's present, way easier than parsing HTML shit.
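Concretely: Next.js pages embed their full page props as JSON in a `<script id="__NEXT_DATA__">` tag, so one regex plus `json.loads` gets you structured data with no HTML parsing (sketch):

```python
import json
import re

# loose match: attribute order on the tag varies between Next.js versions
NEXT_DATA_RE = re.compile(
    r'<script[^>]+id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)

def next_data(html):
    """Return the parsed __NEXT_DATA__ payload, or None if the page has none."""
    m = NEXT_DATA_RE.search(html)
    return json.loads(m.group(1)) if m else None
```

The interesting fields usually live under `props.pageProps`, but inspect a real page to confirm.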
>>
>>100192529
>discord.gg/9EKk3psXMr
why is there a discord?? just make a irc (rizon)/xmpp(yourdata.forsale) channel/room
>>
a curl_cffi just flew over my house
>>
bump
>>
When a search engine displays a paragraph or so under a result as a preview, that data is scraped, right? Is there a way to get a whole plain-text version of the webpage from that? I frequently run into issues where the info is formatted poorly, or there's a paywall, or the site doesn't even match the text preview at all, all of which could be remedied by getting even a few more sentences of the preview
>>
>>100254596
that usually happens when you search a question like
>who discovered america?
the paragraph is the exact copy of a part of the page that google deems appropriate
there is no way to obtain more text without visiting the page
>>
>>100254596
They're usually stored in meta tags in the HTML, inside <head> as <meta name="description">
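Pulling that tag out of fetched HTML is a few lines with the stdlib parser (sketch):

```python
from html.parser import HTMLParser

class MetaDescription(HTMLParser):
    """Grab the content of <meta name="description" content="...">."""
    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "description":
            self.description = a.get("content")

def meta_description(html):
    parser = MetaDescription()
    parser.feed(html)
    return parser.description
```

Note this only recovers the one or two sentences the site chose to expose; the rest of the preview text still requires fetching the page itself.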
>>
bump
>>
Be careful with extensions, it could affect your detectability in the future.
https://github.com/z0ccc/extension-detector
>>
File: high seas explorer.jpg (114 KB, 1024x525)
114 KB
114 KB JPG
>>100257975
You'd still want to have extensions if you want to protect your identity, because it helps for your footprint to appear to be normie-tier. What you want protection from is what personal data is being collected about you and how unique your digital footprint may be.
>See how trackers view your browser
https://coveryourtracks.eff.org/
>>
What would be the use case for web scraping? I understand downloading videos off of sites like youtube because they have fucking garbage policies.
What exactly do I scrape? I'm not scraping 4chan posts or porn for that matter. I don't care what you niggas say here. I don't give a shit about porn. No bait and a real question btw.
>>
>>100258271
>>100228791
>>
...never mind about swiftproxy, at least for the moment, the connection issues are really bad right now and shit is more likely to timeout than return data because of it.
i'd HOPE this isn't a pajeet scam where they start yanking my chain after a certain amount of money goes in. That would have been only ~$10 monero but still
Mildly pissed because I had shit to complete by the end of the month, but waiting on support response before buying shit from elsewhere
>>
>>100228791
That doesn't sound very legal desu
>>
>>100258940
Go back to your hugbox
>>
>>100234449
Would anyone actually pay for access to a desuarchive API?

Might do this
>>
>>100259346
>Would anyone actually pay for access to a desuarchive API?
I highly doubt it. When building online resources for profit it's not a good idea to go after low-hanging fruit. You need to put your thinking cap on and branch outside of your comfort zone
>>
bump!
>>
>all these people asking what to scrape/what the point is
why is this board so unimaginative?
>>
>>100261605
Zoomers
>>
>>100259346
Use it? Yes. Pay for it? No.
>>
>>100263097
I'll make a free desuarchive API if someone gives me a Linux box and domain to host it on
>>
>>100263111
+ possibly proxies if they're necessary
>>
>>100263111
A VPS is a few bucks per month + use a free domain
>>
>>100263180
You know anywhere I can get a free domain from without requiring a credit card?
>>
>>100263201
freedns
>>
Webmasters are generous enough to share their data via API. All you have to do is read the documentation and abide by the rate limits. You don't need these blunt tools.
>>
>>100265259
Not every website has an API, dumdum. And not every API is free. Furthermore, not all of the data that anons are using to build their own datastores come from one place
>>
File: 174093465.gif (2.67 MB, 480x268)
2.67 MB
2.67 MB GIF
>>100263180
>A VPS is a few bucks per month
That's for a low amount of bandwidth. If/when his project becomes popular he will most likely shut it down almost right away because he can no longer justify the investment because his provider would have increased his monthly bill. That's what happens to all of these do-gooder anons that try to build free resources without planning it out.
>>100263201
Even if you manage to keep cost low you still have a cost as well as the time sink for building a resource where people just call you a faggot and make fun of you for working for free.
>>
>>100265647
If it becomes popular you're supposed to make money from it so these costs shouldn't be an issue. Like catbox.
>>
File: OP_TRP_UPSIDE.jpg (307 KB, 1920x1080)
307 KB
307 KB JPG
>>100266543
Which take us full circle to how this little thread chain started. Welcome to the conversation
>>
>>100266614
Everything has value, even something like desuarchive would be useful to train ML models
>>
File: YbPqhqd.png (67 KB, 332x247)
67 KB
67 KB PNG
>>100267422
>Everything has value
>so work for free training ML models
Are you retarded? Reddit got paid $60M for using their data to help train AI. Why the fuck would you build the means to do that with 4chan data for free?
>>
>>100267697
>Grab 4chan data
>Sell it yourself
Think a bit
>>
>>100267697
>>100267755
> Implying anyone would pay for 4chan archives to train the next TayAI
>>
>>100269404
>Implying Claude Opus wasn't trained on 4chan data when it can even name /aicg/ tripfags
>>
>>100269446
> Implying they paid for that data instead of just scraping it themselves
>>
>>100270450
>Implying dev time is free
>>
>>100267755
>Grab 4chan data
Work
>>Sell it yourself
Sales and marketing = more work
>host it yourself
scaling expenses and work
>Think a bit
Did you? There are already models trained off 4chan data. How do you think the bots here work?
>>
>>100270748
You're on /wsg/, scraping is a hobby you enjoy not work
>>
Does anyone actually script captcha solving at significant scale over just paying clickworkers?
I think people have talked about it in these threads before. I love Anticaptcha and I'm not keen on spending what has to be an extraordinary amount of time on true automatic solving (weaving your way around the splintered current main providers, plus neverending bullshit like the challenge refusing to send you audio based on your profile), but I'd be open to anything that makes the task even cheaper.
>>
>>100272327
I've actually seen recaptcha bypasses publicly available

Generally they just mimic the requests that happen when you click on the checkbox, and retry the request after rotating IPs if one of those secondary checks pops up. Requires good sticky residential proxies, but it works

They usually look a bit like this:
https://github.com/xcscxr/Recaptcha-v3-bypass/blob/main/recaptcha-v3.py
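Stripped down, it's just two requests: fetch the anchor iframe, pull the session token out of the HTML, then replay the reload POST. A rough sketch of what that repo does (the regexes and the reason=Q parameter are taken from scripts like it; no promises Google hasn't changed the endpoints):

```python
import re

def extract_token(html):
    """Pull the per-session token out of the anchor iframe HTML."""
    return re.search(r'recaptcha-token.*?="(.*?)"', html).group(1)

def get_v3_token(anchor_url):
    """Mimic the two requests the v3 widget makes and return a token.

    anchor_url is the full .../recaptcha/api2/anchor?ar=1&k=<sitekey>&co=...&v=...
    URL you see in the site's network tab; it differs per site.
    """
    import requests  # third-party, assumed installed

    s = requests.Session()
    token = extract_token(s.get(anchor_url).text)
    base, _, query = anchor_url.partition("?")
    params = dict(p.split("=", 1) for p in query.split("&"))
    # the reload endpoint reuses the sitekey/version/origin from the anchor URL
    resp = s.post(
        base.replace("anchor", "reload") + "?k=" + params["k"],
        data={"v": params["v"], "reason": "Q", "c": token,
              "k": params["k"], "co": params["co"]},
    ).text
    return re.search(r'"rresp","(.*?)"', resp).group(1)
```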
>>
>>100192529
Is scraping fun

Someone told me to get into this because I'm learning web dev. I learned JavaScript, but I don't really know what to code and I don't want to lose it; I want to find a purpose.
>>
>>100263111
https://desuarchive.org/_/api/chan/post/?board=g&num=100263111
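Same endpoint works from a script, stdlib only. A sketch (board/num here are just this post's; other FoolFuuka archives use the same URL scheme):

```python
import json
import urllib.request

def foolfuuka_post_url(archive, board, num):
    """Build the FoolFuuka single-post API URL (same scheme desuarchive uses)."""
    return f"https://{archive}/_/api/chan/post/?board={board}&num={num}"

def fetch_post(archive, board, num):
    """Fetch one post as a dict (fields like comment/timestamp/media)."""
    req = urllib.request.Request(
        foolfuuka_post_url(archive, board, num),
        headers={"User-Agent": "Mozilla/5.0"},  # some archives 403 the default UA
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# e.g. fetch_post("desuarchive.org", "g", 100263111)["comment"]
```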
>>
>>100275303
Is there any way to get a list of posts? Maybe even all posts on the site?
>>
>>100275739
https://4plebs.tech/foolfuuka/
>>
>>100275769
https://archive.org/details/desuarchive_db_201909
>>
File: .png (121 KB, 881x697)
121 KB
121 KB PNG
>>
>>100192529
Out of sheer curiosity: if I were to have multiple IP addresses on a local machine to webscrape, how would I implement that? Multiple VMs?
>>
>>100272327
Here is the hard truth: if you have good ISP IPs you won't get any captchas, so you won't need an anticaptcha
>>
>>100192592
Dude looks like a harkonnen
>>
>>100278660
kek
>>
>>100192529
>>100267755
>>100269404
>>100269446
you know there are dumps of 4chan archives shared freely every year by the bibanon group, right? there's little incentive to even try to make money off 4chan data unless you're providing a service, and even then
>>
>>100192529
>consumer
missed opportunity to call it a consoomer
>>
>>100192529
does anyone scrape grocery prices? that's probably the easiest way to save money with this hobby. You could even think of it as making money
>>
>>100279780
Are you talking about desuarchive?
>>
>>100279983
4plebs is the most consistent dumper, and has the best search options available, but yes.

I have my own archive set up at home and need to learn how Elasticsearch works with a MySQL database in order to leverage all this data. I LOVE searching taboo concepts with it. Lots of stuff that gets ignored or deleted from the log is actually very interesting
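The plumbing I mean is basically: SELECT rows out of MySQL, stream them into Elasticsearch's bulk API. A sketch (the column names are guesses at the FoolFuuka schema, and the pymysql/elasticsearch client calls are commented out since those libraries are assumptions):

```python
def rows_to_actions(rows, index="posts"):
    """Turn MySQL row dicts into elasticsearch bulk actions.

    The row keys (num, comment, timestamp) are assumptions about the
    archive's schema; adjust to whatever your tables actually hold.
    """
    for row in rows:
        yield {
            "_index": index,
            "_id": row["num"],  # post number makes a natural unique id
            "_source": {
                "comment": row["comment"],
                "timestamp": row["timestamp"],
            },
        }

# With the real clients installed (both third-party, assumed):
#   from elasticsearch import Elasticsearch, helpers
#   import pymysql
#   db = pymysql.connect(host="localhost", db="archive",
#                        cursorclass=pymysql.cursors.DictCursor, ...)
#   cur = db.cursor()
#   cur.execute("SELECT num, comment, timestamp FROM g")
#   helpers.bulk(Elasticsearch("http://localhost:9200"), rows_to_actions(cur))
```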
>>
>>100279780
.moe and all of his sites are black holes with no dumps, walled behind Cloudflare
the images especially are nowhere else
lmao
>>
>>100279943
There are websites that do local grocery store pricing. I guess anons can see if there aren't any for their city and then do that. You'd still need to work out a way to monetize it and have things in place to keep that data current
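Keeping the data current is the easy half; a sketch of the storage side with SQLite (store/item names are made up, and the actual per-store scraping is the hard part and not shown). Keeping price history instead of overwriting lets you tell a real discount from a fake one:

```python
import sqlite3
import time

def record_price(db, store, item, price, ts=None):
    """Append one observed price with a timestamp."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS prices (store TEXT, item TEXT, price REAL, ts REAL)"
    )
    db.execute(
        "INSERT INTO prices VALUES (?, ?, ?, ?)",
        (store, item, price, ts if ts is not None else time.time()),
    )

def cheapest(db, item):
    """Latest known price per store for an item, cheapest first."""
    return db.execute(
        """SELECT store, price FROM prices p
           WHERE item = ? AND ts = (SELECT MAX(ts) FROM prices
                                    WHERE store = p.store AND item = p.item)
           ORDER BY price""",
        (item,),
    ).fetchall()
```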
>>
>>100280113
check out this absolute monster archive by bepis
https://ultra.gondola.pics/info

>>100280285
with this type of project, just keeping the data for yourself is a way to save money. That's kinda what I was thinking
>>
>>100280344
>a way to save money
How?
>>
>>100276468
Need to read more into it, but they're referencing the GDPR (EU law) and also seem to assume that personal data collection (LinkedIn- and Facebook-type projects) is the only type of scraping that exists.
>>
>>100280285
And how will you get your local grocery store pricing, genius? Are you going to tour all the stores every day lol?
>>
>>100280344
website's going to go down for an unspecified amount of time as I'm importing data through April 2024, plus cripplechan (why the fuck is there still a 4chan filter?) + crystal cafe + lolcow farm

i'm not too sure what other altchans are worth my time to host archives of
>>
>>100284417
I wonder how many GB are needed to store all that shit
>>
>>100285003
For just the text & search data it's around 1-2TB. If I wanted to store full images for every post I have, it's roughly 160TB of space needed
Thumbnails would also be a considerable fraction, maybe another 15-20TB

That could be pushed down by deduping with image fingerprinting, but that has a lot of caveats that people wouldn't be happy with

Buying enough drives just for a single copy of the data would cost around $2600, but things like RAIDZ, tape backups, the server itself, future expansion etc. would drive that number up a shitload. I think my original estimate of $15k is wrong but it would still be a lot more than I'm willing to pay out of pocket for something that costs me money to host
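For reference, the napkin math as a script (the $/TB comes straight from the $2600/160TB figure; the redundancy multiplier for RAIDZ + backups is a guess):

```python
def storage_cost(full_tb=160, thumbs_tb=20, text_tb=2,
                 usd_per_tb=16.25, redundancy=1.0):
    """Rough drive cost for the archive; every input is a thread estimate.

    usd_per_tb=16.25 is just $2600 / 160 TB; redundancy=1.0 means a single
    bare copy with no parity, no backups, no server.
    """
    total_tb = full_tb + thumbs_tb + text_tb
    return total_tb, total_tb * usd_per_tb * redundancy

tb, usd = storage_cost()                       # single bare copy
tb3, usd3 = storage_cost(redundancy=3.0)       # parity + one backup, say
```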
>>
>>100285210
So mainly glowniggers are willing to spend all that cash
>>
>>100282352
They have APIs, dipshit
>>
>>100286754
Good luck using that shit for free dumbo
>>
>>100287063
Learn how to follow conversations. What has been said already is to learn how to monetize your efforts before building these systems that you want to offer to the public. And btw I think some places do offer some level of API access for free, but you need to present your plan for how that benefits them i.e. affiliate marketing or the like.

If you want to build some random scraper for free because "coding is kool d00d" then I suggest sticking to small open source datasets and mom and pop blogs
>>
>>100287258
Imagine needing an API instead of being a scrapechad
>>
What's the best way to use MITMproxy for Android apps? It seems like each app has to be patched individually for it to work on the newest Android versions. Would it be better to use a VM with an older version?
>>
>>100198235
I've been using smartproxy. Commercial proxies work OK, although latency is a bit high (1-2s delay)
>>
>>100288375
Does smartproxy have any blacklist? For my scraping usecase I need to reach out to a lot of mail providers and most proxy providers seem to blacklist mail providers to prevent abuse
>>
>>100288426
https://smartproxy.com/faq/general/do-you-have-any-blocked-sites

pic related costs me 11$/mo
>>
>>100288439
FUCK

They've blocked literally everything I want to access

Know anything with similar prices that allow cracking/mail services?
>>
>>100288474
one that comes to mind is webshare.io; they're really, really cheap, although I'm pretty sure they block the same subset of services.
>>
>>100288489
Pretty sure they do but I'm also pretty sure you can bypass the block by just contacting the IP directly/having the DNS lookup done without the proxy
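Something like this, stdlib only (the proxy address, IP and hostname in the usage line are placeholders). Caveat: a proxy that sniffs the TLS SNI will still see the domain; this only beats filters applied to the CONNECT line:

```python
import socket
import ssl

def connect_request(target_ip, port=443):
    """The CONNECT line the proxy sees: a raw IP, not a blacklistable domain."""
    return f"CONNECT {target_ip}:{port} HTTP/1.1\r\nHost: {target_ip}:{port}\r\n\r\n"

def fetch_via_ip(proxy, target_ip, hostname, path="/"):
    """Tunnel to target_ip through an HTTP proxy, then do TLS using the real
    hostname (still needed for SNI and certificate verification)."""
    sock = socket.create_connection(proxy)
    sock.sendall(connect_request(target_ip).encode())
    status = sock.recv(4096).split(b"\r\n")[0]
    if b"200" not in status:
        raise ConnectionError(status.decode())
    tls = ssl.create_default_context().wrap_socket(sock, server_hostname=hostname)
    tls.sendall(
        f"GET {path} HTTP/1.1\r\nHost: {hostname}\r\nConnection: close\r\n\r\n".encode()
    )
    chunks = []
    while data := tls.recv(65536):
        chunks.append(data)
    return b"".join(chunks)

# fetch_via_ip(("proxy.example", 8080), "93.184.216.34", "example.com")  # placeholders
```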
>>
>>100288578
nah, you need to connect to webshare.io, they give you a port number to select the IP
>>
Does anyone know how to get gallery-dl to download text posts of 4chan archives?
>>
how do i deal with hcaptcha?
>>
>>100291763
this. I don't want to pay. I'm broke.
>>
>>100219220
traditionally web scraping hasn't made a lot of money and is dominated by non-Americans because foundationally you're profiting off of someone else's work and that scares big money away. except these days they call it "AI Training" and yeet their credibility into the wind because Elon Musk said funny word on xitter

point is most people don't make money off any of this and if so it's only a little. it's mostly a hobby for hoarders and voyeurs
>>
>>100288474
At this point, just buy a SIM card and use it as a proxy
>>
>>100292379
A lot of services are using APIs to sell you something (affiliates) or display ads with the scraped content (anime, manga, porn, whatever). If you think they're not making any money from this you're just retarded
>>
>>100290383
Just go to an archive? Or copy and paste.
>>
>>100293238
anon, the only ones making big money in internet advertising are the big tech companies that hold a monopoly on their part of the industry. Nickel-and-dime crooks trying to make a quick buck on some porn redirect scam are paying Venezuelans $10 a week to do their scraping
>>
>>100293807
If it's that easy just scale up
>>
>>100292379
are some 500 trumps a month unachievable margins?
I'm comfortable with keeping operations to a minimum. From the replies I gather I'd need a good idea more than good-enough scale
>>
>>100292379
when cloudflare and google have completely rewritten history, untainted human data will be a scarce resource and valuable


