/g/ - Technology


Thread archived.
You cannot reply anymore.




File: scraper.png (1.62 MB, 1892x2142)
Web Scraping General

AI datamine edition

QOTD: How do you store the data you've scraped after scraping it?

FAQ: rentry co/t6237g7x

> Captcha services
https://2captcha.com/
https://www.capsolver.com/
https://anti-captcha.com/

> Proxies
https://hproxy.com/ (no blacklist) (recommended, owned by friend of /wsg/)
https://infiniteproxies.com/ (no blacklist)
https://www.thunderproxies.com/
http://proxies.fo/ (not recommended)

> Network analysis
https://mitmproxy.org/
https://portswigger.net/burp

> Scraping tools
https://beautiful-soup-4.readthedocs.io/en/latest/
https://www.selenium.dev/documentation/
https://playwright.dev/docs/codegen
https://github.com/lwthiker/curl-impersonate
https://github.com/yifeikong/curl_cffi

Official Telegram: @scrapists
Last thread: >>101135838
>>
bump
>>
bump
>>
>>101151145
can anyone help me scrape tiktok using python, and turn it into a web app:

name
their followers
viral videos

trending keywords

not sure how to start or how to store it in a database, please tell me the best tech stack to use. they need someone from the US EU as a student if I use their research API and I don't fall into those categories

I have basic python knowledge and basic html5 css wordpress javascript php7 with heidisql experience.

Anyone want to help me out with this?
>>
>>101154280
.
>>
>>101154296
?
>>
>>101154280
Open chrome devtools > go to the network tab > go to someone's tiktok page

Easy enough to scrape or nah?
>>
>>101154499
I want to automate it and put it in a database for later use
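
For the "put it in a database" part, a rough sketch using sqlite3 from the Python standard library. The table name, the columns and the helper names (`open_db`, `save_profile`) are made up for illustration, and the values would come from whatever scraper ends up working. Not the only way to do it, just the least setup:

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS profiles (
               name       TEXT PRIMARY KEY,
               followers  INTEGER NOT NULL,
               scraped_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def save_profile(conn, name, followers):
    """Insert a profile, replacing the old row if the name was already stored."""
    conn.execute(
        "INSERT OR REPLACE INTO profiles (name, followers) VALUES (?, ?)",
        (name, followers),
    )
    conn.commit()


conn = open_db()  # pass a filename such as "scrape.db" to keep the data on disk
save_profile(conn, "example_user", 100)
save_profile(conn, "example_user", 150)  # re-scrape overwrites the old count
rows = conn.execute("SELECT name, followers FROM profiles").fetchall()
```

The `?` placeholders matter: scraped strings should never be pasted into the SQL text directly.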
>>
>>101154513
Which part are you having trouble with?
>>
>>101154280
> they need someone from the US EU as a student if I use their research API and I don't fall into
How do they validate this? You might be able to get access to their API regardless
>>
>>101154544
I'm new to this, it's my first time scraping.

I've tested scraping an ecommerce test site and amazon using python, since you can see the div/th/table elements in the HTML. Other than that I don't know how it would be possible for tiktok trending videos
>>
>>101154499
I don't see any API calls in the network tab for a tiktok page. I can't even find the follower count, number of viral videos or name, only garbled data. I've only scraped html and plain text using basic beautiful soup.
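
When the numbers aren't in the visible HTML, they are often sitting in a JSON blob inside a script tag in the page source. A minimal sketch of pulling such a blob out with only the standard library. The script id `page-data`, the key `followerCount` and the sample HTML are invented for the example, so check the real page source for the actual id and structure before relying on any of it:

```python
import json
from html.parser import HTMLParser


class ScriptJSONExtractor(HTMLParser):
    """Collect the text inside <script id="..."> for one given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_embedded_json(html, script_id):
    """Return the parsed JSON from the matching script tag, or None if absent."""
    parser = ScriptJSONExtractor(script_id)
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks).strip()
    return json.loads(raw) if raw else None


# Made-up page, only here to show the shape of the technique
SAMPLE_HTML = (
    '<html><body><script id="page-data" type="application/json">'
    '{"user": {"nickname": "example", "followerCount": 1234}}'
    "</script></body></html>"
)
data = extract_embedded_json(SAMPLE_HTML, "page-data")
```

From there `data["user"]["nickname"]` is a normal dict lookup, and the same function works on HTML fetched with requests.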
>>
>>101154714
Alright

You said tiktok has a research API that they only allow US/EU students to access right?

What info do they require for you to get an API key? You might be able to get one regardless
>>
>>101154702
You have to bypass the captcha and that's pretty complicated, personally I managed to bypass cloudflare's using undetected-chromedriver, ymmv
>>
>>101154787

anyone here that is a student and can use it for their project/thesis sample?

https://developers.tiktok.com/products/research-api/

Applicants must fulfill the following criteria to qualify for access:

Be located in an eligible region and be affiliated with an eligible organization:

Non-profit academic institutions in the US, EEA, UK or Switzerland; or

Not-for-profit research institution, organization, association, or body in the EU. We are currently beta testing this service with select researchers in the US, UK, Switzerland, Norway, Iceland and Liechtenstein.

Have demonstrable academic experience and expertise in the research area specified in the application

Be independent from commercial interests and be able to conduct research on a not-for-profit basis pursuant to a public-interest mission

Disclose the funding of the research

Provide a clearly defined research proposal and show the access requested is needed for, and proportionate to, the purpose of that research

Commit to fulfilling data security and confidentiality requirements (including taking steps to protect personal data)

Be able to provide evidence that the research went through an ethical research review

Be prepared to uphold the requirements set out in the TikTok Research Tools Terms of Service
>>
>>101154855
How are you expected to prove that you're a student? Best case scenario you could prob send in a sloppily photoshopped student ID
>>
bump
>>
bump
>>
>>101154280
bump
>>
>>101151145
anyone use bash with curl for scraping?
>>
Keep up the good threads.
>>
I'm not a scraper, but needed to download a particular file through firefox and not curl, wget etc. Apparently you can't download files with selenium?
>>
>>101164129
Why can't you use curl/wget?
>>
>>101164180
because i need to use pac proxy config and as far as i know they don't support it
>>
if anyone cares, this will download the images from a list of gelbooru tabs (a text file with one post URL per line). Don't remove the sleep pls, don't be cunts
import requests
from bs4 import BeautifulSoup
import os
import time

FILE = '.txt'
DEST_FOLDER = ...

def scrape_source(url):
    # Fetch the post page
    response = requests.get(url)
    response.raise_for_status()  # Ensure the request was successful

    soup = BeautifulSoup(response.content, 'html.parser')
    images = soup.find_all('img', id='image')

    for img in images:
        img_src = img['src']
        if 'sample' in img_src:
            # The page shows a downscaled sample: rebuild the full-size URL
            elems = soup.find_all('section', class_='note-container')
            extension = elems[0]['data-file-ext']
            first_part = img_src.split('//samples/')[0]
            test = "/".join(img_src.split('/')[-3:])
            test = test.replace("sample_", "").split(".")[-2]
            full_url = f'{first_part}/images/{test}{extension}'
            return full_url
        else:
            return img_src
    return None  # No <img id="image"> on the page

def download_file(url, dest_folder):
    os.makedirs(dest_folder, exist_ok=True)
    try:
        response = requests.get(url, stream=True)
        response.raise_for_status()  # Check for HTTP errors

        filename = os.path.join(dest_folder, url.split("/")[-1])

        # Write the file in chunks so large images don't sit in memory
        with open(filename, 'wb') as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
    except requests.exceptions.RequestException as e:
        print(f'failed to download {url}: {e}')


def download_from_file(file_path, dest_folder='.'):
    with open(file_path, 'r') as file:
        urls = file.readlines()

    for url in urls:
        url = url.strip()  # Remove any leading/trailing whitespace
        if url:  # Check if URL is not empty
            img_url = scrape_source(url)
            if img_url:  # Skip pages where no image was found
                download_file(img_url, dest_folder)
            print('sleeping...')
            time.sleep(0.5)


if __name__ == "__main__":
    download_from_file(FILE, DEST_FOLDER)
>>
File: ubel my beloved.jpg (136 KB, 853x533)
>>101165753
Works pretty well.
You can remove for img in images and just get images[0] because there's just 1, I was testing and forgot to change it
Not sure if it works for videos, probably not but it shouldn't be too difficult to change if you want
To get all tabs as a text file you can use a browser extension
>>
File: Untitled.jpg (8 KB, 225x225)
>>101154280
why do you want to do this anon
>>
>>101154499
It’s probably pretty hard to scrape but I just checked on github and there’s plenty of scrapers. Whether they work or not is something different kek
>>
>>101164129
You can’t? I don’t know about your particular file but I’ve downloaded more than a TB of data with selenium
>>
>>101163348
kek
>>
>>101154280
look at yt-dlp
https://github.com/yt-dlp/yt-dlp/blob/master/yt_dlp/extractor/tiktok.py
probably want 'nickname'
>>
I'm gonna scrape neopets.com for codestone prices
>>
So I've been using playwright to auto apply to jobs and often times greenhouse will demand I two factor auth my email to make sure I'm not a bot. How would I beat this? I don't want to give my bot access to my email
>>
>>101154280
If u have basic python skills, you should be able to do this.
>>
>>101165753
Gelbooru has an API. But I've never used it so I don't know if it's cucked or not.
>>
>>101165980
it works fine when i click on a download link on a page, but when it's a direct download that can't be opened in firefox it just returns an empty file and the .get call never returns.
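
One thing that might help here is setting Firefox preferences so the browser saves the file without asking, and pointing it at the PAC file through the same mechanism. This is only a sketch: the helper names `firefox_download_prefs` and `make_driver`, the download directory, the PAC URL and the MIME type are all placeholders, and whether these prefs fix this particular file is untested. The preference names themselves are regular Firefox about:config entries.

```python
def firefox_download_prefs(download_dir, pac_url=None,
                           mime_types=("application/octet-stream",)):
    """Build a dict of Firefox preferences for unattended downloads."""
    prefs = {
        "browser.download.folderList": 2,  # 2 = use the custom directory below
        "browser.download.dir": download_dir,
        # MIME types to save straight to disk instead of opening a dialog
        "browser.helperApps.neverAsk.saveToDisk": ",".join(mime_types),
    }
    if pac_url:
        prefs["network.proxy.type"] = 2  # 2 = proxy auto-config (PAC) URL
        prefs["network.proxy.autoconfig_url"] = pac_url
    return prefs


def make_driver(prefs):
    """Start Firefox with the given preferences (needs selenium + geckodriver)."""
    # Imported here so the prefs helper above works without selenium installed
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    for name, value in prefs.items():
        options.set_preference(name, value)
    return webdriver.Firefox(options=options)


# Placeholder values, swap in the real directory and PAC location
prefs = firefox_download_prefs("/tmp/downloads",
                               pac_url="http://proxy.example/proxy.pac")
```

For the .get that never comes back, `driver.set_page_load_timeout(seconds)` followed by catching selenium's `TimeoutException` at least stops the script from hanging while the download continues in the browser.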
>>
>>101164202
https://gist.github.com/mpcabd/b09688a0f5ec183afc68
>>
bump


