gm
stock image set up to 5m, now a mix of creative and editorial
downloaded top 5m liked images from leonardo set, ~2.5TB
downloading images from playground-liked set, ~3m of ~14m done, ~2TB
downloading _k size (~3MP on avg, resolution like 1536x2048) public domain flickr images, ~2.69M done, ~2TB, current counts by license:
All Rights Reserved - 171026196
Attribution-NonCommercial-ShareAlike License - 5402856
Attribution-NonCommercial License - 2829057
Attribution-NonCommercial-NoDerivs License - 4855351
Attribution License - 3915955
Attribution-ShareAlike License - 2157905
Attribution-NoDerivs License - 1062517
No known copyright restrictions - 174526
United States Government Work - 104012
Public Domain Dedication (CC0) - 1416751
Public Domain Mark - 3876924
mainly all rights reserved atm because i started with the explore section, then photos from those users; photos from groups are in progress. there's a search endpoint that i'll start on soon, i can get everything on the site from that
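quick sketch to put numbers on that skew, using the counts from the list above. the grouping of which licenses count as "public domain" is an assumption for illustration, not something stated in the update:

```python
# counts copied from the license list above
LICENSE_COUNTS = {
    "All Rights Reserved": 171026196,
    "Attribution-NonCommercial-ShareAlike License": 5402856,
    "Attribution-NonCommercial License": 2829057,
    "Attribution-NonCommercial-NoDerivs License": 4855351,
    "Attribution License": 3915955,
    "Attribution-ShareAlike License": 2157905,
    "Attribution-NoDerivs License": 1062517,
    "No known copyright restrictions": 174526,
    "United States Government Work": 104012,
    "Public Domain Dedication (CC0)": 1416751,
    "Public Domain Mark": 3876924,
}

# ASSUMPTION: these four are treated as the "public domain" bucket
PUBLIC_DOMAIN = {
    "No known copyright restrictions",
    "United States Government Work",
    "Public Domain Dedication (CC0)",
    "Public Domain Mark",
}


def share(count, total):
    """Return count as a percentage of total."""
    return 100 * count / total


total = sum(LICENSE_COUNTS.values())
pd_total = sum(LICENSE_COUNTS[name] for name in PUBLIC_DOMAIN)

print(f"total records:       {total:,}")
print(f"all rights reserved: {share(LICENSE_COUNTS['All Rights Reserved'], total):.1f}%")
print(f"public domain:       {pd_total:,} ({share(pd_total, total):.1f}%)")
```

under that grouping, all rights reserved is roughly 87% of what's been crawled and the public domain bucket is under 3%.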
honestly though there's no issue using all rights reserved imo, everything related to training, recaptioning etc is fair use. ironically the problem licenses here are attribution, non-commercial, share-alike and no-derivatives
i've been looking at different caption models, florence still seems the best choice. others like moondream make stuff up, then you've got the ones that write a novel. florence can also do other tasks which i'll find useful later on, and it's tiny so it runs fast
will be running a test on an A100 80GB later to measure captions/hour throughput, then estimate how many gpu hours i need
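the estimate itself is just division once the throughput is measured. a minimal sketch: the function names are made up for illustration, and the timing numbers in the example are placeholders, NOT a real A100 measurement:

```python
def captions_per_hour(num_captions, elapsed_seconds):
    """Convert a timed test run into a captions/hour rate."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_captions * 3600 / elapsed_seconds


def gpu_hours_needed(num_images, rate_per_hour, num_gpus=1):
    """Return (total gpu-hours, wall-clock hours across num_gpus)."""
    if rate_per_hour <= 0 or num_gpus <= 0:
        raise ValueError("rate_per_hour and num_gpus must be positive")
    total = num_images / rate_per_hour
    return total, total / num_gpus


# PLACEHOLDER test run: 1000 captions in 36 seconds (invented numbers)
rate = captions_per_hour(1000, 36.0)

# 5m images, e.g. the size of the leonardo set mentioned above
gpu_hours, wall_hours = gpu_hours_needed(5_000_000, rate, num_gpus=4)
print(f"{rate:,.0f} captions/hour, {gpu_hours:.1f} gpu-hours, "
      f"{wall_hours:.1f} hours on 4 gpus")
```

this assumes throughput scales linearly with gpu count and stays flat across the whole set, which won't hold exactly once image sizes and caption lengths vary.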
also experimenting with n-gram and noun group stuff
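for the n-gram side, a rough sketch of counting n-grams over captions with just the standard library. the tokenizer and the example captions are assumptions for illustration; the noun group part would need a POS tagger and isn't shown:

```python
import re
from collections import Counter


def tokenize(caption):
    """Lowercase a caption and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", caption.lower())


def ngrams(tokens, n):
    """Return all contiguous n-token tuples, in order."""
    return list(zip(*(tokens[i:] for i in range(n))))


def count_ngrams(captions, n):
    """Count n-grams across an iterable of caption strings."""
    counts = Counter()
    for caption in captions:
        counts.update(ngrams(tokenize(caption), n))
    return counts


# made-up captions, just to show the shape of the output
captions = ["a red car on a road", "a red car at night"]
bigrams = count_ngrams(captions, 2)
for gram, count in bigrams.most_common(2):
    print(" ".join(gram), count)
```

counting per caption (rather than over one joined string) keeps n-grams from spanning two unrelated captions.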
and finally working on expanding my infrastructure with download nodes, bigmongo can't cope with all this load desu
thanks for reading my update blog