/wsr/ - Worksafe Requests






File: 1722325110782011.jpg (255 KB, 872x964)
I have several thousand books, movies, anime, manga, games, etc. that I need to organize into spreadsheets. How in the everloving fuck do I automate this, even partially?
I have a shitload of .txt files with lists in them and links to miscellaneous database websites and need to scrape a few specific things from each one (like genres, who made it, when it was made, etc.) and map that to a spreadsheet. Some things need data from multiple websites because a lot of databases suck for anything that isn't mainstream. I don't mind cleaning it up after, I just need a way to get the information in a readable format with as little editing as I can get away with.
If there's any retard-friendly software, programming languages, etc. I can learn and use to make this even a little easier, please direct me to them. I have no clue how to code, but I'm assuming learning how and automating this stuff would be much faster than doing everything manually at this point. There's gotta be a better way.
>>
>>1510821
>I have a shitload of .txt files with lists in them and links to miscellaneous database websites
What format are they in: plain text, CSV, HTML tables, or what? Regardless, there are ways to extract data from them; a lot of languages have purpose-built modules for that kind of work:

https://metacpan.org/pod/HTML::TableExtract
https://pypi.org/project/beautifulsoup4/
>>
>>1510832
It's unfortunate, but you'll probably need one specific extractor for each data source, since they'll have it all in different formats (XML, JSON, etc.) and different schemas as well.
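To make the "one extractor per source" idea concrete, here is a minimal sketch using only the Python standard library. The two sample payloads and their field names (`name`, `released`, `title`, `year`) are made up for illustration; real sites will use their own schemas, so each extractor would need to be adapted.

```python
import json
import xml.etree.ElementTree as ET

# Invented sample payloads: the same entry as two hypothetical sites might serve it.
JSON_SAMPLE = '{"name": "Example Title", "released": "1999-05-01"}'
XML_SAMPLE = "<entry><title>Example Title</title><year>1999</year></entry>"

def from_json(text):
    """Extractor for the hypothetical JSON source."""
    data = json.loads(text)
    # This source stores a full date; keep only the year.
    return {"title": data["name"], "year": int(data["released"][:4])}

def from_xml(text):
    """Extractor for the hypothetical XML source."""
    root = ET.fromstring(text)
    return {"title": root.findtext("title"), "year": int(root.findtext("year"))}

# Both extractors return the same shape, so later steps don't care about the source.
print(from_json(JSON_SAMPLE))
print(from_xml(XML_SAMPLE))
```

The point is that the per-source mess stays inside each small function, and everything downstream only ever sees one common dict layout.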
>>
>>1510821
Won't be too difficult to script
>>
>>1510832
They're all in plain text, one per line. Thanks for the links, this gives me a good place to start. HTML and Python.
>>1510833
I figured as much. Sad news, but that's fine by me. Any speedup I can get is valuable.
>>1511093
Thanks, that's good to know. Do you have any pointers on what I should learn or focus on?
>>
>>1511149
One per line? The names? The database websites?
>>
>>1511155
Sorry, I meant the titles. They're either like
Title
Title
Title
or they're links to specific entries on the database website like
https://www.url.com/etc/title
https://www.url.com/etc/title
https://www.url.com/etc/title
where title is either insert-title-here or some numeric ID code like on imdb. They aren't mixed in any of the files afaict though, it's either all plain titles or all plain links. None are hyperlinked or wrapped in any kind of code.
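For what it's worth, the "all titles or all links" assumption is easy to check with a script before doing anything else. A rough sketch in Python follows; the function names `is_link`, `classify`, and `last_segment` are just made up here, and it assumes every link starts with http:// or https://.

```python
from urllib.parse import urlparse

def is_link(line):
    """True when the line looks like a bare http(s) URL."""
    return line.startswith(("http://", "https://"))

def classify(lines):
    """Return 'titles', 'links', 'mixed', or 'empty' for one file's lines."""
    entries = [ln.strip() for ln in lines if ln.strip()]
    if not entries:
        return "empty"
    flags = {is_link(entry) for entry in entries}
    if len(flags) == 2:
        return "mixed"
    return "links" if flags.pop() else "titles"

def last_segment(url):
    """Return the final path piece of a URL: the slug or the numeric ID."""
    return urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]

print(classify(["Title\n", "Title\n", "Title\n"]))
print(last_segment("https://www.url.com/etc/title"))
```

For a real file, something like `classify(open(path, encoding="utf-8"))` should work, since an open text file yields its lines one at a time.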
>>
>>1511160
can you just post an example?
>>
>>1511209
I don't really understand what you mean. Do you mean a list with specific titles? One of the .txt files? Sorry, I'm a little retarded and not familiar with this stuff. I figured that since it's all either plaintext or plain links with nothing else it would be fairly straightforward.
>>
>>1511649
People are still confused about the specifics of the files and the format of the Excel document you want. >>1511160 is kinda confusing, and for the spreadsheet, do you want the manga, books, etc. to be in one spreadsheet or split into separate pages?

That kinda thing
>>
>>1510821
>If there's any retard-friendly [...] programming languages
python
https://docs.python.org/3/tutorial/index.html
>>
>>1511805
I see, sorry about that. To (hopefully) explain a little better:
I have various .txt files, each grouped into folders. Each folder covers one type of media; e.g., a folder named "Movies" contains only lists of movies. These files are in plaintext and list one title per line. For example, one file has the following list:
La Gloire de Mon Père
Manon des Sources
Night Train to Lisbon
It contains exclusively plaintext. A different file in the same folder has this:
https://www.imdb.com/title/tt0050083/
https://www.imdb.com/title/tt0053125/
https://www.imdb.com/title/tt0080455/
It contains exclusively imdb links. These are plaintext too and are not hyperlinked. The only exception to the former list is when either dates (e.g., Suspiria (1977)) or names (e.g., Italo Calvino - Marcovaldo) are included.
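As a rough sketch of how those line shapes could be told apart in Python: the regular expressions and the `parse_line` name below are only an illustration, and it assumes " - " always separates a name from a title, which would misfire on any title that itself contains " - ".

```python
import re

# "Title (1977)": a title followed by a four-digit year in parentheses.
YEAR_RE = re.compile(r"^(?P<title>.+?)\s*\((?P<year>\d{4})\)$")
# IMDb title links carry an ID such as tt0050083.
IMDB_RE = re.compile(r"imdb\.com/title/(tt\d+)")

def parse_line(line):
    """Turn one line from a list file into a small dict of known fields."""
    line = line.strip()
    match = IMDB_RE.search(line)
    if match:
        return {"imdb_id": match.group(1)}
    match = YEAR_RE.match(line)
    if match:
        return {"title": match.group("title"), "year": int(match.group("year"))}
    if " - " in line:
        # Assumed layout: "Name - Title".
        creator, title = line.split(" - ", 1)
        return {"creator": creator, "title": title}
    return {"title": line}

for sample in ["Suspiria (1977)", "Italo Calvino - Marcovaldo",
               "https://www.imdb.com/title/tt0050083/"]:
    print(parse_line(sample))
```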
My goal is to take these lists and extract the data I want from relevant database sites into LibreOffice Calc spreadsheets. For some things such as manga, I need to be able to extract from multiple sites per entry, like https://mangaupdates.com for basic info and https://ja.wikipedia.com or https://manba.co.jp/ for the publication period/any missing info.
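The "multiple sites per entry" part could probably be handled by fetching each site separately and then merging the results, with the first site that has a value winning. A minimal sketch, with invented field names and invented sample values (the scraping itself is not shown):

```python
def merge_records(*records):
    """Combine dicts for one entry; earlier records take priority per field."""
    merged = {}
    for record in records:
        for field, value in record.items():
            # Skip blanks so a later source can fill the gap.
            if value not in (None, "") and field not in merged:
                merged[field] = value
    return merged

# Hypothetical results for the same manga from two different sites.
primary = {"title": "Example Manga", "genre": "Drama", "date": ""}
fallback = {"title": "Example Manga (alt)", "date": "1990"}

print(merge_records(primary, fallback))
```

Ordering the arguments by how much each site is trusted keeps the rule simple: the basic-info site goes first and the others only fill in what it lacks.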
Currently, I have one spreadsheet per medium (books, movies, etc.) with one sheet each. The data I want differs by medium, but in general it's as follows: Type, Title, Director (or Author for books/manga/etc.), Studio (or Publisher), Date, Genre, whether it's original or an adaptation, and whether it has any sequels/prequels/etc. I'm not picky about this though and figured I would need to alter some things to work better, like concatenating all my files together and then converting them to either all links or all titles.
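Since LibreOffice Calc opens CSV files directly, one plausible route is to have a script write CSV with those columns. A sketch using Python's built-in `csv` module follows; the shortened column names and the sample row are placeholders, not a fixed layout.

```python
import csv
import io

# Column names adapted from the list above; rename to taste.
COLUMNS = ["Type", "Title", "Creator", "Publisher", "Date", "Genre",
           "Original", "Related"]

def write_rows(rows, out):
    """Write dict rows as CSV; missing fields become empty cells."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, restval="",
                            extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)

# Demo with an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_rows([{"Type": "Movie", "Title": "Suspiria", "Date": "1977"}], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write an actual file, `open("movies.csv", "w", newline="", encoding="utf-8")` can be passed as `out` (the filename is just an example); the `newline=""` argument is what the `csv` documentation recommends for files.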
If this still doesn't make much sense or I'm missing something, please let me know and I'll try to explain more.
>>
>>1511908
Thanks, will read through this.


