File: scraper.png (1.62 MB, 1892x2142)

/wsg/ - Web Scraping General Anonymous 06/25/24(Tue)17:18:19 No.101151145 Archived

Web Scraping General

AI datamine edition

QOTD: How do you store the data you've scraped after scraping it?

FAQ: rentry co/t6237g7x

> Captcha services
https://2captcha.com/
https://www.capsolver.com/
https://anti-captcha.com/

> Proxies
https://hproxy.com/ (no blacklist) (recommended, owned by friend of /wsg/)
https://infiniteproxies.com/ (no blacklist)
https://www.thunderproxies.com/
http://proxies.fo/ (not recommended)

> Network analysis
https://mitmproxy.org/
https://portswigger.net/burp

> Scraping tools
https://beautiful-soup-4.readthedocs.io/en/latest/
https://www.selenium.dev/documentation/
https://playwright.dev/docs/codegen
https://github.com/lwthiker/curl-impersonate
https://github.com/yifeikong/curl_cffi

Official Telegram: @scrapists
Last thread: >>101135838

Anonymous 06/25/24(Tue)18:45:48 No.101152203

'mp

Anonymous 06/25/24(Tue)21:14:50 No.101153749

'mp

Anonymous 06/25/24(Tue)22:11:23 No.101154280

>>101151145
can anyone help me scrape tiktok using python, and turn it into a web app:

name
their followers
viral videos

trending keywords

Not sure how to start or how to store it in a database, so please help me pick the best tech stack. Their research API is only open to students/researchers in the US or EU, and I don't fall into those categories.

I have basic Python knowledge and basic HTML5, CSS, WordPress, JavaScript and PHP 7 experience, plus some HeidiSQL.

Anyone want to help me out with this?

somebody 06/25/24(Tue)22:13:11 No.101154296

>>101154280
.

Anonymous 06/25/24(Tue)22:24:09 No.101154380

>>101154296
?

Anonymous 06/25/24(Tue)22:39:40 No.101154499

>>101154280
Open chrome devtools > go to the network tab > go to someone's tiktok page

Easy enough to scrape or nah?

somebody 06/25/24(Tue)22:41:02 No.101154513

>>101154499
I want to automate it and put it in a database for later use
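
For the "put it in a database" part, a minimal sketch using the sqlite3 module that ships with Python. The table layout, column names and helper names below are made up for illustration; nothing here is specific to TikTok, and how you obtain the values is a separate problem.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS profiles (
               username   TEXT PRIMARY KEY,
               followers  INTEGER,
               scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def save_profile(conn, username, followers):
    """Insert a profile, replacing the old row if the username already exists."""
    conn.execute(
        "INSERT OR REPLACE INTO profiles (username, followers) VALUES (?, ?)",
        (username, followers),
    )
    conn.commit()


if __name__ == "__main__":
    conn = open_db()                        # in-memory DB; pass a filename to keep it
    save_profile(conn, "example_user", 100)
    save_profile(conn, "example_user", 150)  # re-scrape overwrites the old value
    print(conn.execute("SELECT username, followers FROM profiles").fetchall())
```

Passing a filename such as "scrape.db" to open_db keeps the data between runs. SQLite is enough to start with; moving to MySQL or Postgres later mostly means swapping the connection and some SQL details.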

Anonymous 06/25/24(Tue)22:45:18 No.101154544

>>101154513
Which part are you having trouble with?

Anonymous 06/25/24(Tue)22:46:19 No.101154556

>>101154280
> they need someone from the US EU as a student if I use their research API and I don't fall into
How do they validate this? You might be able to get access to their API regardless

somebody 06/25/24(Tue)23:04:33 No.101154702

>>101154544
I'm new to this, it's my first time scraping.

I've tested scraping an ecommerce test site and Amazon with Python, since there you can see the div/th/table elements in the HTML. Other than that, I don't know how it would be possible for TikTok trending videos.

somebody 06/25/24(Tue)23:06:33 No.101154714

>>101154499
I don't see any API information in the network tab for a TikTok page. I can't even find the name, follower count or number of viral videos, only garble. I've only scraped HTML and string text using basic Beautiful Soup.
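
When the visible HTML looks like garble, the data is often shipped as a JSON blob inside a script tag and rendered by JavaScript afterwards. Whether that holds for TikTok, and under which tag id, is not confirmed anywhere in this thread, so treat the following as a generic sketch: the id "page-data" and the field names are hypothetical, and you would need to check view-source for the real ones.

```python
import json
import re


def extract_embedded_json(html, script_id):
    """Return the parsed JSON inside <script id="script_id">, or None if absent."""
    pattern = re.compile(
        r'<script[^>]*\bid="' + re.escape(script_id) + r'"[^>]*>(.*?)</script>',
        re.DOTALL,
    )
    match = pattern.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


# Made-up page used only to show the idea; a real page will differ.
SAMPLE_HTML = """
<html><body>
<script id="page-data" type="application/json">
{"user": {"nickname": "example", "followerCount": 1234}}
</script>
</body></html>
"""

if __name__ == "__main__":
    data = extract_embedded_json(SAMPLE_HTML, "page-data")
    print(data["user"]["nickname"], data["user"]["followerCount"])
```

If no such blob exists, the numbers are probably fetched by a later XHR request, in which case filtering the network tab by Fetch/XHR and searching the responses for a known value (e.g. the follower count) is the usual next step.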

Anonymous 06/25/24(Tue)23:17:08 No.101154787

>>101154714
Alright

You said tiktok has a research API that they only allow US/EU students to access right?

What info do they require for you to get an API key? You might be able to get one regardless

Anonymous 06/25/24(Tue)23:25:46 No.101154841

>>101154702
You have to bypass the captcha and that's pretty complicated, personally I managed to bypass cloudflare's using undetected-chromedriver, ymmv

somebody 06/25/24(Tue)23:28:27 No.101154855

>>101154787

anyone here that is a student and can use it for their project/thesis sample?

https://developers.tiktok.com/products/research-api/

Applicants must fulfill the following criteria to qualify for access:

Be located in an eligible region and be affiliated with an eligible organization:

Non-profit academic institutions in the US, EEA, UK or Switzerland; or

Not-for-profit research institution, organization, association, or body in the EU. We are currently beta testing this service with select researchers in the US, UK, Switzerland, Norway, Iceland and Liechtenstein.

Have demonstrable academic experience and expertise in the research area specified in the application

Be independent from commercial interests and be able to conduct research on a not-for-profit basis pursuant to a public-interest mission

Disclose the funding of the research

Provide a clearly defined research proposal and show the access requested is needed for, and proportionate to, the purpose of that research

Commit to fulfilling data security and confidentiality requirements (including taking steps to protect personal data)

Be able to provide evidence that the research went through an ethical research review

Be prepared to uphold the requirements set out in the TikTok Research Tools Terms of Service

Anonymous 06/26/24(Wed)00:29:45 No.101155299

>>101154855
How are you expected to prove that you're a student? Best case scenario you could prob send in a sloppily photoshopped student ID

Anonymous 06/26/24(Wed)04:01:10 No.101156929

'mp

Anonymous 06/26/24(Wed)07:16:17 No.101158373

'mp

somebody 06/26/24(Wed)07:44:38 No.101158659

>>101154280
'up

Anonymous 06/26/24(Wed)12:10:03 No.101161523

>>101151145
anyone use bash with curl for scraping?

Anonymous 06/26/24(Wed)14:27:37 No.101163348

Keep up the good threads.

Anonymous 06/26/24(Wed)15:15:47 No.101164129

I'm not a scraper, but I needed to download a particular file through Firefox rather than curl, wget etc. Apparently you can't download files with Selenium?

Anonymous 06/26/24(Wed)15:18:50 No.101164180

>>101164129
Why can't you use curl/wget?

Anonymous 06/26/24(Wed)15:20:07 No.101164202

>>101164180
because I need to use a PAC proxy config and as far as I know they don't support it

Anonymous 06/26/24(Wed)17:06:02 No.101165753

if anyone cares, this will download a list of gelbooru tabs. Don't remove the sleep pls don't be cunts

import requests
from bs4 import BeautifulSoup
import os
import time

FILE = '.txt'
DEST_FOLDER = ...

def scrape_source(url):
    # Fetch the web page
    response = requests.get(url)
    response.raise_for_status()  # Ensure the request was successful

    soup = BeautifulSoup(response.content, 'html.parser')
    images = soup.find_all('img', id='image')

    for img in images:
        img_src = img['src']
        if 'sample' in img_src:
            elems = soup.find_all('section', class_='note-container')
            extension = elems[0]['data-file-ext']
            first_part = img_src.split('//samples/')[0]
            test = "/".join(img_src.split('/')[-3:])
            test = test.replace("sample_", "").split(".")[-2]
            full_url = f'{first_part}/images/{test}{extension}'
            return full_url
        else:
            return img_src

def download_file(url, dest_folder):
    # Create the destination folder if it doesn't exist yet
    os.makedirs(dest_folder, exist_ok=True)
    try:
        response = requests.get(url, stream=True, timeout=30)
        response.raise_for_status()  # Check for HTTP errors

        filename = os.path.join(dest_folder, url.split("/")[-1])

        with open(filename, 'wb') as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
    except requests.exceptions.RequestException as e:
        # Report the failure instead of silently skipping the file
        print(f'failed to download {url}: {e}')


def download_from_file(file_path, dest_folder='.'):
    with open(file_path, 'r') as file:
        urls = file.readlines()

    for url in urls:
        url = url.strip()  # Remove any leading/trailing whitespace
        if url:  # Check if URL is not empty
            img_url = scrape_source(url)
            if img_url:  # scrape_source returns None when no image is found
                download_file(img_url, dest_folder)
            print('sleeping...')
            time.sleep(0.5)


if __name__ == "__main__":
    download_from_file(FILE, DEST_FOLDER)

Anonymous 06/26/24(Wed)17:08:22 No.101165786

File: ubel my beloved.jpg (136 KB, 853x533)

>>101165753
Works pretty well.
You can remove for img in images and just get images[0] because there's just 1, I was testing and forgot to change it
Not sure if it works for videos, probably not but it shouldn't be too difficult to change if you want
To get all tabs as a text file you can use a browser extension

Anonymous 06/26/24(Wed)17:14:17 No.101165866

File: Untitled.jpg (8 KB, 225x225)

>>101154280
why do you want to do this anon

Anonymous 06/26/24(Wed)17:17:34 No.101165916

>>101154499
It’s probably pretty hard to scrape but I just checked on github and there’s plenty of scrapers. Whether they work or not is something different kek

Anonymous 06/26/24(Wed)17:21:17 No.101165980

>>101164129
You can’t? I don’t know about your particular file but I’ve downloaded more than a TB of data with selenium

Anonymous 06/26/24(Wed)17:28:25 No.101166083

>>101163348
kek

Anonymous 06/26/24(Wed)17:28:27 No.101166085

>>101154280
look at yt-dlp
https://github.com/yt-dlp/yt-dlp/blob/master/yt_dlp/extractor/tiktok.py
probably want 'nickname'

Anonymous 06/26/24(Wed)18:12:18 No.101166627

I'm gonna scrape neopets.com for codestone prices

Anonymous 06/26/24(Wed)22:17:05 No.101168975

So I've been using Playwright to auto-apply to jobs, and Greenhouse will often demand a two-factor code sent to my email to make sure I'm not a bot. How would I beat this? I don't want to give my bot access to my email

sage 06/26/24(Wed)22:19:31 No.101168999

>>101154280
If u have basic python skills, you should be able to do this.

Anonymous 06/26/24(Wed)22:42:17 No.101169170

>>101165753
Gelbooru has an API. But I've never used it so I don't know if it's cucked or not.
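
A rough sketch of what going through the API instead of the HTML could look like. The endpoint and parameter names (page=dapi, s=post, q=index, json=1, id) are written from memory of the site's API help page and are unverified here, so treat all of them as assumptions and check the docs first; the API may also require a key. post_api_url is a made-up helper name.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the site's own API documentation.
API_BASE = "https://gelbooru.com/index.php"


def post_api_url(post_id):
    """Build the (assumed) JSON API URL for a single post id."""
    params = {
        "page": "dapi",
        "s": "post",
        "q": "index",
        "json": 1,
        "id": post_id,
    }
    return f"{API_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    # Fetching is left out on purpose; this only shows the URL being built.
    print(post_api_url(123))
```

If the response does carry a direct file URL, the sample-to-full-size rewriting in the script above would no longer be needed. Keep the sleep between requests either way.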

Anonymous 06/26/24(Wed)23:53:14 No.101169741

>>101165980
It works fine when I click a download link on a page, but when it's a direct download URL that Firefox can't open as a page, I just get an empty file and the .get call never returns.
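
One thing that may help with the hanging .get on direct download URLs: driver.get waits for a page load that never happens when the response is a file. A sketch, assuming Selenium 4 with geckodriver installed, that sets Firefox's download preferences and sets the page load strategy to "none" so .get returns straight away. The function names, the folder and the MIME list are illustrative, and whether this fixes your particular file is untested.

```python
def firefox_download_prefs(download_dir, mime_types):
    """Firefox preferences that save matching files to download_dir without a prompt."""
    return {
        "browser.download.folderList": 2,  # 2 = use the custom directory below
        "browser.download.dir": download_dir,
        "browser.helperApps.neverAsk.saveToDisk": ",".join(mime_types),
    }


def make_driver(download_dir, mime_types):
    """Start Firefox with the prefs above; selenium is imported only when called."""
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    for name, value in firefox_download_prefs(download_dir, mime_types).items():
        options.set_preference(name, value)
    # Do not wait for a page load: a direct download never "finishes loading".
    options.page_load_strategy = "none"
    return webdriver.Firefox(options=options)


if __name__ == "__main__":
    prefs = firefox_download_prefs("/tmp/dl", ["application/octet-stream"])
    for name, value in prefs.items():
        print(name, "=", value)
```

After driver.get(url) you would still have to wait for the file yourself, for example by polling the download folder until the temporary .part file is gone, before calling driver.quit().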

Anonymous 06/27/24(Thu)00:03:38 No.101169816

>>101164202
https://gist.github.com/mpcabd/b09688a0f5ec183afc68

Anonymous 06/27/24(Thu)03:43:31 No.101171246

'mp