>>106681424
They ran out of money or whatnot. It's quite telling to look at the ancestry chart at https://www.neta.art/blog/neta_lumina/ alongside their FAQ response.
>Why did we choose to continue training from AES-1-e100 instead of RAW’-e7?
>We believe that adopting a training curriculum that alternates between large and small datasets (a “large-small-large-small” strategy) yields better overall model performance.
>Switching to AES-1 let us complete the final “large-small” stages and ship the best trade-off we could (stability, looks, short-prompt performance).
>If you are interested in training with more abundant GPU resources, RAW’-e7 checkpoint is a strong raw starting point.
I also really take issue with two things: 1) the Lumina folks never open-sourced their captioning software, and 2) Neta Lumina didn't even try to replicate that captioning structure, instead rolling their own scheme that drops the structure the base model was trained on. This is what their captions look like:
{
  "gemini_caption_v10": {
    "master_player_detailed_caption_en": "",
    "compress_nl_en": "",
    "Tag_mix_sentence_en": "",
    "Medium_caption_en": "",
    "short_summary": "",
    "designer_caption_en": "",
    "structured_summary_en": "",
    "midjourney_style_summary_en": "",
    "chinese_translation": "",
    "midjourney_style_summary_zh": "",
    "designer_caption_ja": ""
  },
  "wd_tagger": "",
  "wd_tagger_metadata": {
    "character": [""],
    "series": [""],
    "artist": [""],
    "rating_tag": [""],
    "quality_tag": [""]
  }
}
Like yeah, no shit half the training time went to the model trying to reconcile your new structured captioning with the captions the base model was trained on. The minimal change would've been adding a ja version of the long, medium, short, and tag captions, then working booru tagging into all languages inside those four prompts. Instead, you throw another four captions at the EN side, leave the Chinese and Japanese sides inadequate, and bolt booru tags on unnaturally.
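For illustration, a rough sketch of what that minimal-change layout could look like. Every key name below is a hypothetical placeholder of mine, just mirroring the base model's long/medium/short/tag split across en/zh/ja; it's not anything Lumina or Neta actually shipped:
{
  "caption_long_en": "",
  "caption_long_zh": "",
  "caption_long_ja": "",
  "caption_medium_en": "",
  "caption_medium_zh": "",
  "caption_medium_ja": "",
  "caption_short_en": "",
  "caption_short_zh": "",
  "caption_short_ja": "",
  "caption_tags_en": "",
  "caption_tags_zh": "",
  "caption_tags_ja": ""
}
Same four slots the base model already knows, just extended to zh/ja, with the booru vocabulary woven into each prompt instead of sitting in a separate wd_tagger field.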