/g/ - Technology


File: 1704606978300432.png (44 KB, 320x256)
I need something that can
>queue multiple HTTP requests
>execute them asynchronously (perhaps using a thread pool)
>abort and retry requests on timeout
>return the downloaded HTTP responses synchronously
Preferably Python, but any language is fine.
>>
https://www.python-httpx.org/

Would this work?
>>
>>101200279
aiohttp, asyncio (builtin since 3.4).
use a standard queue, fill it up and consume it in some looping coroutine. use the timeout options and write a basic try/except to retry. store the results somewhere, like a list or a dict.

you can use asyncio.run to drive the entire event loop to completion, which gets you the results synchronously; then you can continue with your code after it's done.
i don't know why you'd want to return the http response synchronously though.

if you're lazy and old, use scrapy. you'll be using twisted under the hood.
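a minimal sketch of that approach, assuming aiohttp is installed (pip install aiohttp); the URLs, worker count, timeout and retry limit are all placeholders:
[code]
import asyncio
import aiohttp

URLS = ["https://example.com/a", "https://example.com/b"]  # placeholders
WORKERS = 4
RETRIES = 3

async def worker(queue, session, results):
    while True:
        url = await queue.get()
        try:
            for attempt in range(RETRIES):
                try:
                    async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                        results[url] = await resp.read()
                        break
                except asyncio.TimeoutError:
                    if attempt == RETRIES - 1:
                        results[url] = None  # gave up after RETRIES timeouts
        finally:
            queue.task_done()

async def main():
    queue = asyncio.Queue()
    for url in URLS:
        queue.put_nowait(url)
    results = {}
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(worker(queue, session, results)) for _ in range(WORKERS)]
        await queue.join()   # block until every queued url is consumed
        for t in tasks:
            t.cancel()       # the workers loop forever, so cancel them
    return results

# asyncio.run drives the loop to completion, so the caller gets
# the downloaded responses back synchronously.
results = asyncio.run(main())
[/code]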
>>
>>101200294
Seems more like a modern replacement for the requests module.

>>101200319
I know how to do it. I'm just too lazy.
>>
>>101200362
well, the code for it is so small that making a whole library for it would be a hassle. like, just write it yourself.
>>
>>101200279
use batch and aria2
>>
>>101200319

Go back, r*ddit typer
>>
All scraping libraries suck desu. I gave up and wrote my own shit in golang, but I had more advanced requirements than you.
>>
>>101200279
You basically need bash
>>
File: 1719674488072.jpg (87 KB, 506x640)
>>101200362
I'd recommend you try out some estrogen, it really improved my coding performance
>>
>>101200279
asyncio and aiohttp should do the trick easily
>>
>>101200279
What you're looking for is httpx, check it out.
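a minimal sketch with httpx (pip install httpx); note the transport-level retries option only re-tries failed connections, so read timeouts would still need a manual retry loop. URLs are placeholders:
[code]
import asyncio
import httpx

URLS = ["https://example.com/a", "https://example.com/b"]  # placeholders

async def fetch_all(urls):
    transport = httpx.AsyncHTTPTransport(retries=2)  # retries connect failures
    async with httpx.AsyncClient(transport=transport, timeout=10.0) as client:
        # gather fires all the requests concurrently on one event loop
        responses = await asyncio.gather(*(client.get(u) for u in urls),
                                         return_exceptions=True)
    return dict(zip(urls, responses))

results = asyncio.run(fetch_all(URLS))  # blocks until everything is done
[/code]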
>>
If you need to process a lot of requests efficiently, just use Go with the standard net/http lib and you're good.

If you like touching tips with your friends, you can use Python: async functions for the requests and asyncio.gather to run them concurrently. You can use tenacity for the retry stuff.
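a sketch of the gather + tenacity combo; using httpx as the client is my assumption, the post doesn't name one. Assumes httpx and tenacity are installed:
[code]
import asyncio
import httpx
from tenacity import retry, stop_after_attempt, retry_if_exception_type

@retry(stop=stop_after_attempt(3),
       retry=retry_if_exception_type(httpx.TimeoutException))
async def fetch(client, url):
    resp = await client.get(url, timeout=5.0)  # abort after 5s; tenacity retries
    return resp.content

async def main(urls):
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*(fetch(client, u) for u in urls))

results = asyncio.run(main(["https://example.com/a", "https://example.com/b"]))
[/code]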
>>
>>101203848
Sounds like he's doing I/O-bound tasks, so Go is unnecessary. In fact, it's the perfect use case for Python.
>>
Just use a Celery queue, my man
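a minimal Celery sketch, assuming a Redis broker/backend on localhost and the requests library; every name here is a placeholder. Run a worker with "celery -A tasks worker":
[code]
import requests
from celery import Celery

app = Celery("tasks",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task(bind=True, max_retries=3, default_retry_delay=2)
def fetch(self, url):
    try:
        return requests.get(url, timeout=10).text
    except requests.Timeout as exc:
        raise self.retry(exc=exc)  # re-queue the task on timeout

# queue everything, then block on .get() for synchronous results:
# results = [fetch.delay(u).get() for u in urls]
[/code]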
>>
>>101200279
You can do all of that with just bs4, concurrent.futures/threading and requests (see the sketch below). If you're trying to get JS-rendered content and stuff like that, you'll probably need a headless browser.

I've built scrapers in Python and C, but honestly the best way I've found to do it to date is in JS (I know, not my favorite either), automating browsers with stuff like Playwright.
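a minimal sketch of the concurrent.futures + requests part, assuming requests is installed; URLs, pool size and retry count are placeholders:
[code]
from concurrent.futures import ThreadPoolExecutor
import requests

URLS = ["https://example.com/a", "https://example.com/b"]  # placeholders

def fetch(url, retries=3):
    for attempt in range(retries):
        try:
            return requests.get(url, timeout=10)  # abort slow requests
        except requests.Timeout:
            if attempt == retries - 1:
                raise

with ThreadPoolExecutor(max_workers=8) as pool:
    # map queues every url on the thread pool and returns the
    # responses synchronously, in order, once they're all done
    responses = list(pool.map(fetch, URLS))
[/code]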
>>
I do this in C++ using libcurl. I don't really use an HTML parser either; I just notice the patterns, then run a SIMD substring search and it finds what I need 1000x faster while not using retarded-tier amounts of RAM per page.
>>
>(((asynchronously)))
Kys
>>
/wsg/ is this way retard >>101208241
>>
>>101200362
>I know how to do it. I'm just too lazy.
feed those instructions to ChatGPT then.


