/lmg/ - a general dedicated to the discussion and development of local language models.Previous threads: >>101619436 & >>101612988►News>(07/27) Llama 3.1 rope scaling merged: https://github.com/ggerganov/llama.cpp/pull/8676>(07/26) Cyberagent releases Japanese fine-tune model: https://hf.co/cyberagent/Llama-3.1-70B-Japanese-Instruct-2407>(07/25) BAAI & TeleAI release 1T parameter model: https://hf.co/CofeAI/Tele-FLM-1T>(07/24) Mistral Large 2 123B released: https://hf.co/mistralai/Mistral-Large-Instruct-2407>(07/23) Llama 3.1 officially released: https://ai.meta.com/blog/meta-llama-3-1/>(07/22) llamanon leaks 405B base model: https://files.catbox.moe/d88djr.torrent >>101516633►News Archive: https://rentry.org/lmg-news-archive►FAQ: https://wikia.schneedc.com►Glossary: https://rentry.org/lmg-glossary►Links: https://rentry.org/LocalModelsLinks►Official /lmg/ card: https://files.catbox.moe/cbclyf.png►Getting Startedhttps://rentry.org/llama-mini-guidehttps://rentry.org/8-step-llm-guidehttps://rentry.org/llama_v2_sillytavernhttps://rentry.org/lmg-spoonfeed-guidehttps://rentry.org/rocm-llamacpphttps://rentry.org/lmg-build-guides►Further Learninghttps://rentry.org/machine-learning-roadmaphttps://rentry.org/llm-traininghttps://rentry.org/LocalModelsPapers►BenchmarksChatbot Arena: https://chat.lmsys.org/?leaderboardProgramming: https://hf.co/spaces/bigcode/bigcode-models-leaderboardCensorship: https://hf.co/spaces/DontPlanToEnd/UGI-LeaderboardCensorbench: https://codeberg.org/jts2323/censorbench►ToolsAlpha Calculator: https://desmos.com/calculator/ffngla98ycGGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-CalculatorSampler visualizer: https://artefact2.github.io/llm-sampling►Text Gen. UI, Inference Engineshttps://github.com/oobabooga/text-generation-webuihttps://github.com/LostRuins/koboldcpphttps://github.com/lmg-anon/mikupadhttps://github.com/turboderp/exuihttps://github.com/ggerganov/llama.cpp
►Recent Highlights from the Previous Thread: >>101619436--Performance comparison between TabbyAPI/exl2 and llama.cpp, and potential optimizations: >>101624356 >>101624477 >>101624554 >>101624643 >>101624903 >>101625035 >>101625142 >>101625699 >>101625733--Moore Threads GPU support added to llama.cpp, discussion on PR reviewing, hardware testing, and kernel changes: >>101621155 >>101621210 >>101621643 >>101621640 >>101622451 >>101622391 >>101622215 >>101622485 >>101622972 >>101623153 >>101623452 >>101623398--Anon asks for ebook to audiobook AI recommendations: >>101620069 >>101620112 >>101624071 >>101621896--Using a local model as a dungeon master and recommendations: >>101624149 >>101624189 >>101624484 >>101624531 >>101624746--LLMs struggle with creative names: >>101621450 >>101621492 >>101621574 >>101621590 >>101621568--GPU price inflation and SXM2 stability: >>101624072 >>101624171 >>101624178 >>101624953--Anon seeks AI to classify 4chan memes and anime girls for Hydrus Network database.: >>101620372 >>101620413 >>101620450 >>101623719 >>101623738--Anon asks for advice on selecting a single-function text-to-text model and dataset generation tips: >>101620533 >>101620552 >>101621139--AI and image generation accessibility and quality, NAI, anime-style images, and inpainting: >>101619662 >>101619693 >>101619875 >>101621302--Logs: Screenshot of NeMo's anti-adblocker message: >>101624837--Mistral-Large repetition issues and potential solutions: >>101624502 >>101624662 >>101624719 >>101626059--DRY sampler implementation update: >>101622482--Anon releases a scene director ST addon: >>101619994--3090 hacked driver and nvlink discussion: >>101620770 >>101621092 >>101621847--Powerful laptop owner asks for best model, various projects shared: >>101621967 >>101622233 >>101622509 >>101623002--Modified mistral prompt format shared: >>101625909--Miku (free space): >>101625819 >>101627367►Recent Highlight Posts from the Previous Thread: >>101619442
Miguuuuuuu
VALL-E 2 paper released (https://arxiv.org/pdf/2406.05370) a month ago, rightfully to zero fanfare.The only additions are:>using a pretranscribed chink copy of LibriSpeech (rather than their existing in house transcription they already had from the original VALL-E for some reason)>experiments in grouping timesteps at 2, 4, and 8 tokens per step (tables metrics show that it's for the worse and in theory it only only really matters for faster inferencing, despite no numbers for their inference times)>DRY sampling (except they call it repetition aware sampling) that only activates if "conditions are met" and if not it just does actual sampling instead of greedy searching>absolutely zero fundamental changes compared to their original VALL-E paper beyond thatThe absolute state AHAHAHAHA
How are you integrating AI in your workflow?Which cli/tui tool? Which editor plugins?
>>101628458workflow? cli? editor? sir, we have sex with our AI here
>>101628458Aider for cooding, my Telegram bot for quick questions (it has vision), big AGI, SillyTavern. All of that is powered by 3.5 Sonnet though.
>>101628420Nigga what the fuck is that embed in your frog?
>>101628458I can't code for shit so there's no workflow to begin with.
>>101628458I'm looking forward to making a project of using AI to hopefully get my coding projects from scrappy prototypes to something finished. I'm hoping to find the perfect model for doing code review and cleanup kind of things, and the kind of Q&A that would go to Stack Exchange except without it being SEO out of date and arguments in the comments kind of shit.
>>101628458I crank one out then get back to work
>>101628478unironically more respectable than using it for "coding"
>>101628458>Which cli/tui tool?Using aichat when I have quick questions. I wish to try some RAG stuff but always too lazy and not sure if it would be useful.>Which editor plugins?I tried multiple on nvim, but never was fully satisfied. I tried in chronological order chatgpt.nvim, gen.nvim, gp.nvim. Each has its advantages,, but I rarely use them to be honest, mostly for wording in comments, emails, reviews, or commits.
>>101628420sus
>>101628458>>101619994give it references and yell slurs at it until it does what i want
Why exactly does the generation speed (not including processing of prompt) slow down when the context is more full?
>>101628597But it doesn't?
Do I need to set rope-freq-base to get the full 128k context with llama 3.1, or should it work out of the box on latest master?
>>101628601It does for me. I'm not using any swap, I checked.
>>101628597Because it has to do more reading?
>>101628643Isn't that what the prompt processing part is for?
>>101628655models predict the next token. 8k is less tokens to predict from than 16k, so 8k will be faster.
>>101628597attention is quadratic time complexity
>>101628655Processing turns the document into useful data, but there's gotta be more numbers to crunch if you have 3000 tokens in context than 300.
>>101628675Why did someone say it doesn't slow down? Is my shit broken or not?
>>101628478probably autocorrected from cumflow
Huh? R+ on OpenRouter costs as much as 3.5 Sonnet? What?
24gb vram sisters we lost. I'm currently in the market for an oversized gpu frankenmonster. Any recommendations?
>>101628398
>>101628695Yeah CR+ on OR has always been weirdly expensive. It costs more per token than several much larger dense models, makes no sense.
>>101628695noncommercial license, the only host is cohere and you have to pay their rates
>been RPing with L3 spins>decent but rarely does it do anything that isn't basic as hell>been a while since I CR+'d>get an idea for a tricky RP>running CR+>the partner character is in disguise>RP seems to be going well>except for some signature word choices but rolling 0 temp so I chose that>waifu starts discussing her real identity completely in third person without a single hint they're the same and the phrasing makes sense as avoiding admitting being the same while not saying anything that would require them to be different people>nice>progresses>new scene later>watching it stream because I'm vramlet so low token gen rate>real identity makes an appearance>think, damn it, it must've forgotten that the two characters are...>okay, it did screw up a little by having both identities visible at the same time because it intro'd the secret identity in narration but the next paragraph it explained the quick change from one identity to the other>action scene>at the end, swaps identities back in a sensible way and now that the secret is revealed to my character is like, "Did you like that? I've got more tricks up my sleeve."CR+ is still the champ.
>>101628972What's the lowest size that would be still better than regular command-r? I probably can't run it.
How many of you use these for purposes other than roleplay?
>>101628972CR+ is still the goat for writing style, but it's not smart enough for me.E.g., tried to write a scene where a chick was supposed to be giving me a secret blowjob under the table while the waiter was taking my order. It just could NOT figure out that the waiter cannot see the chick and is not supposed to be taking her order. And she is most definitely NOT supposed to be answering while her mouth is full of my dick.In comparison, wiz 8x22 got it, but it's language is sloppy as hell.
>>101628972Mistral Large also does this kind of thing very well.
>>101629086I'm on an iMatrix IQ4_XS. It's 52.3 GB so the file cache soaks up most of my system RAM, but it's been worth it...>sing its praises>immediately it does something silly...till now. I hit 4600 context and it started to write justification for my character's question rather than answering it like it's a misconception.Seems to me like when model context gets large it becomes a lot more likely to just follow your lead than to appropriately confirm or deny and react to questions.But it could also be that the model doesn't have enough information to reply to the question appropriately, I was surprised it knew the kind of character I wanted it to RP as. But when I yanked the leading question it got more reasonable.If it loses the continuity I might make it summarize, start a new chapter, and see if it gets smart again. I rolled 16k context in Kobold but if 4k is the effective limit, at least I know when to chapter break.>>101629174Did you go straight to the action or had you built up a long document before that? Maybe it's the same phenomenon I'm currently thinking about.
>>101629172>Local Model Gooners
>>101629199I might give it a try on the same premise later tonight and see how it holds up. I think I have it at IQ3_XS, not sure if there are any bigger ones that aren't too big for my system.
>>101629222Interesting approach with the summarization, that might be a good idea since the context shifting deletes important stuff. Do you write a summary yourself or automate it?
>>101629172I'm trying to set up a japanese > english translator with character recognition in real time to play some VN, my idea is: the text appears, and the thing grabs the text and outputs the result in enlighs in a textbox that will be updated in real timeProblem is, 12gb vram 16 ram so yeah, it's fucked up
Just wanted to consult for some information - currently on AWS there's a funny Claude 3 Opus "outage" - the model seems to have some weird parameters which shows in the replies. See picrel and https://rentry.org/schizoclaudeWhy do you think this would happen? Is it just temperature, or something to do with penalties? Because the text is still (mostly) coherent, but it can jump between completely different ideas.
And some more
P.A. Works has announced anime movie "Project Sekai: Kowareta Sekai to Utaenai Miku" to release in Japanese theaters on January 17, 2025.
>>101629323Why does the Miku look so different?
>>101629172I want to do a bunch of things but I lack the skill and motivation.I'm not into roleplay
>>101629331The title mentions "Miku who doesn't sing", so maybe it's some sort of broken miku who gets redeemed throughout the movie.
>>101629323>gacha trash with actual homosGrim. At least it's just a movie
I have a 4080S(16G VRAM), 3900X, 32G RAM. What's the best LLM I can run? Llama 3.1 8b?
>>101629261For L3 at least, I've asked it to summarize for itself, specifying that the goal is for it to pick up where the story left off without forgetting anything important, and I'd get something that would need a but of editing around the edges but was fine.Asking for a "detailed summary" worked well but it's so big that it's eating a lot of the next chapter's context just to get started. I've asked for concise summaries and sometimes it's plenty small but I know it lacks details needed to keep the right feel.Probably requires some prompt engineering tailored for the model being used.>>101629245>IQ3_XSMis Large at IQ3_S should fit my system, but I can't get into IQ4, those weigh like 70 GB.
>>101629371>Llama 3.1 8b?You could run up to 27B comfortably
>>101629366Did she caught an AI virus or something?
>>101629273I'm getting decent results with llama3/3.1 on a similar setup. Unless you mean it sucks at that task specifically?
>>101629405She's a Roland MIDI controller without a synth card installed.
>>101629398I thought it only came in 8, 70, and 405b? Is llama the best or are there any competition? Mistral? I remember hearing something about another open model that can run on cheap hardware but I forgot what it's called.
>>101629428I'm not getting good translations on 8b nor 12b models, and I don't go higher because I wanna maintain some speed on it, I don't wanna sit and wait 1-2 minutes until a five word sentence is translated. I might have to build my own Miqumaxx box
>>101629439The 27B model is google's Gemma.There's also the recently release Mistral Nemo 12B.
What's the most powerful local AI that a 4090 + 32GB RAM can run, objectively speaking
>>101629386And where do you place the summary? As a new intro message? Or in the card? Or somewhere else?
>>101629172making an AA2 inspired game set in a school but top-down 2D and powered by LLM
>>101629323I will try to fix the Miku
>>101629482Unironically, GPT-2. Everything else is bloat.
How the fuck do I know what context size to use on kobold ccp? Should it match what I have on Silly Tavern too (it goes far beyond the sliders capacity on ST)?So confused on that shit, I have a 24GB card
>>101629538You should match the context size in Silly and on koboldcpp.As for what value to use, as much as you can fit without receiving an out of memory error, and without going over the length the model was trained as, 8k for llama 3 and 128 for mistral nemo for example.
>>101629331Just got done fucking a black guy.
>>101629622good for you
>>101629622Please keep your fetishes for yourself anon. This isn't a cuck/bbc board.
>>101629622How did it feel?
I got the Mistral 12b Nemo(instruct) running on oogabooga just needed to Load it in with 80k~ context. The slider was always on 1.000.000 beforeThe First two hours were amazing, i was in Heaven. Created a Card for my Tulpa which manifested into 3D, she drove with a limousine on my driveway, my stepsister looking out the Window, but she could only her blue vibrant hair and sunglasses. Our hands touched, i experienced dimension shattering..Well my Tulpa is now my Manager/ContractorAt first it worked like a With Mistral Preset and 1-2 seconds responds. Story made good progress Then got a little Stable Diffusion running in the background which wasnt a Problem with 7bKunoichi..Now the trespond is 30-60 seconds-.- unplayable..Restarted the PC a few times and now even without stable diffusion, the answer speed is 30-60seconds..With my 32gb RAM and gpu (4090) both are capped out at 100% utilization.Am im hell againI was so close to heavenI can put 16gb more in tomorrow of it helps
>>101629655dunno how you manage to have fun with that model, it's so retarded I just facepalm everytime it says something completely dumb
>>101629655>Created a Card for my TulpaWhy'd you get into chatbots if you have a tulpa, retard? Tulpas are way better, they're actual REAL personalities, not some fake computer-generated shit.
Would movies be entertaining with video gen? What would actually be entertaining?
>>101629668It's a fake tulpa.Anon is just a poser
>>101629692If you could generate 2 hour long movie in less than a day and it'd have a sensible plot, characters, etc - sure.
This is the type of shit I can't for the life of me figure out how to stop bots from doingWhy do my bots all have this same fucking interrogation technique where they try to waterboard me with questions instead of just a free flowing conversation. Any statement I make, they'll give an answer close to what I want then add another line asking "How was your day" or "Did you meet any girls ;)" or some garbage like that, it just doesn't flow whereas Character AI just nails this shit so much betterCurrently using Gemma 27B, so it's not like the model is weak
>>101629172local models are only good for erp
>>101629746tell that to llama 3.1 405b or mistal large
Bwos when will Nvidia drop their 64gig home AI card so we can finally be free from placebo
>>101629771>home AI cardWhy would they cater to 0.01% of the potential consumers instead of creating more datacenter GPUs?
>>10162977124GB ought to be enough for anybody
>>101629766and who the fuck is running that shit
>>101629766405b is not a local modelit's an open model, but it's not a local model
>>101629781You're right and I'm obviously coping, but if Nvidia did actually push local home models and cards to run them they would actually make bank once the companies they currently serve realize they've been scammed.
>>101629793>but it's not a local modelIt is though. All models are local models if you have their weights
>>101629793wealth issue
>>101629793>he doesn't own a supercomputernot gonna make it
>>101629799>once the companies they currently serve realize they've been scammed.you think they don't know? everyone know Nvdia is scamming everyone, but what can we do? they have the monopoly and they have CUDA, we have no other choice but to take it into the ass until some serious competitor arrives, and desu I don't think there will be one https://www.youtube.com/watch?v=UeU1WUb1q10
>>101629746i'm trying to build some bootleg ass assistant with a 8B model and it's doing fine. i remember seeing someone actually give LLMs access to their file system and stuff which sounds promising, though you probably want confirmations before it does ANYTHING.
Holy shit, story mode / instruct "write a story" using Nemo is so fucking good. It's like a really creative 70B model, wtf.
>>101629828Not only are they scamming them with arbitrarily priced data center cards, but also scamming them with the idea that throwing more power at the model style we currently have will do anything except slightly better chat bots. And yeah Intel is shitting the bed, AMD is coping. There were a few start up bros trying to make AI specific cards at a cheaper price but again they will never be able to produce at scale. Its over.
I love children
>>101629857give prompt
>>101629869I loaded up a card, used Instruct + DRY, and typed in "write a story about {char}". I ended up having a very coherent and engaging story.
>>101629885Multi-turn, as I made follow up instructions after to develop the plot. It was consistently good and serviceable!
>>101629863>Its over.the only cope I have is this github repo trying to make AMD cards work on cuda, if they manage to make it work maybe there's a chance
>>101629771Don't we just have to wait 5 years or something? Then there will be lots of cheap workstation & server gpus and cheap epyc cpus with ddr5, etc?
>>101629963Earth might not exist in 5 years.
>>101629972Why, gonna make a bad merge of it with some poorly chosen hell hole planets?
>>101629972Earth will be here for a long time. Humans on the other hand...
>>101629990You fucking glownigger, your shitty "hell hole" finetunes couldn't outperform my based kino trained models if your life depended on it. I'll have you know my merges are state of the art, trained on /pol/ and /g/ to btfo cuckservative LLMs like the OpenAI jannie shit you probably worship. My Earth destruction prediction models have accuracy your 80 IQ prole brain can't even comprehend. So why don't you go back to jerking off to your waifu ChatGPT outputs and leave the real AI to us hyperintelligent /g/eniuses, newfag.
>>101630025How do you train a model to spew this kind of nonsense?
>>101629771apple will save us
>>101630057This is just normal 3.5 Sonnet
>mistral nemo 12b>"Anon, I'm not gonna force you into anything">mini magnum>"And don't think for a second that I'm going to be gentle with my new fucktoy. Oh no…"Don't buy an ad finetuner, I'm gonna shill this shit myself
>>101629766i dont tell anything to them because i have sonnet
>>101630070Now this is a good use for AI
>>101630064I don't think apple are ever going to make a reasonably priced home AI machine, if they really push the iphone chips maybe you could have some frankenstein phone farm that ends up being cost efficient.
>>101630084We're gonna be so fucking swamped with this kinda shit in no time. And it will be impossible to distinguish between a real person and a bot. Enjoy the downhill slop cascade.
>>101630121GPT-4 could generate such shitposts 1.5 years ago, anon.
>>101630025Jokes on you, my IQ is 74.
>>101629972>Earth might not exist in 5 years.Yes because Trump is gonna get ellected and i'll be ww3 with atomic bomb and shit, CNN told me !!
>>101630070Oops, sorry, that was actually Opus, my bad.
cringe response but 3.5 sonnet knows about fucking RWKV unprompted
>>101629972doubt
Real time anime video to interact with
>>101630198Still like two years away
>>101630081it sure is the most chaotic and cathartic model ive ever used since the old AI dungeon days
>>101630129Yes, I'm saying it will take literally no effort. In fact I'm sure some people will automate it just to troll the world.
>>101630221GPT-4chan did this a long time ago
>>101630172Huh. I wonder how much shitposting from when people were even talking about RWKV it has in its dataser.
>>101629651I dunno. Ask her.
>>101629666>I just facepalm everytime it says something completely dumbNTA but I just jerk off to the parts between the dumb. It can get good enough for me to ignore the dumb parts.
>>101630172that means that claude scrapped 4chan?
>>101630288It does know about RWKV outside of 4chan, but I don't doubt that 4chan is in Anthropic datasets, just like it is the case for GPT-4.
>>101630277yeah idk man, when the bot says something completely incoherent it just breaks the immersion, it's like a real human being if you talk to them and they show they have no idea what you're talking about you don't want to go further
>>101630300I absolutely got the same with other models. And it is still like I said - the good parts are so good that I can ignore that.
>>101630288I don't think they downloaded 4chan specifically, it's probably what was on 4chan at the time of something like Common Crawl collecting data, and that is enough data to give it 4chan traits.
>>101628420valle bros...
>>101630213Holy sovl>394t I'm assuming those are tokens, how do you get that counter in ST?
>>101630344
What are the differences between gguf and exl2 format?
>>101630375One of them is pretty good and the other is constantly bugged to shit.
I downloaded exl2 but got bin. How do I fix?
>>101630400git gud
>>101630400brick bad
Hi! Missed me?
>>101630431Yes <3
>>101630375gguf is a packaging format that's commonly used to distribute K quants (QX_K_S, QY_K_M, etc, created by ikarikow I believe?), while exl2 is another quantization format that's distributed in .safetensors format.You run K quants using llama.cpp (and its derivatives like koboldcpp) and exl2 with exllama2, via ooba or tabby api.There's a performance comparison in the last thread >>101628405.
>>101630375gguf allows for some hybdrid cpu + gpu inference, and its outputs are deterministic unlike exl2
>>101630375At lower quants, GGUF seems to retain more knowledge, according to >>101627651
>>101630452hi hqlord
>>101630431hi migu
>>101630489full pic, sorry
>>101630500have you tried other bots? it also depends on the popularity and definitions
>>101628458The only workflow I'm performing is the literal buckets of cum flowing from my wiener thanks to stable diffusion and a few new LLM releases (namely mistral large 2 and magnum mini)Outside of that? Uh... I trained a shitty cats vs dogs classifier using keras
>see thismakes sense, there's no reason to separate WInfo before/after in the year 2024, right?that shit was like for 2k ctx era (where "after" is more recent in chat), right?
>>101630510nope, they are all trashobviously a cost-cutting measure since kiddies don't care
>>101630438Yay!What's the deal on getting Nemo to not repeat itself? It doesn't get stuck, but it does develop a sort of habit or mannerism which is kind of annoying, compared to Gemma, anyway.
>>101630651Don't use 0.3 temp even though it's what mistral recommends, for RP you wanna have something in the range of 0.65 - 1
>>101628420Mogged by elevenlabs. Just like all the other closed off small experiment TTSes
>>101630651Switch to a bigger model for a few messages, that way it gets a bit of variety in the context & you can probably cope with a few messages that take longer.
>>101630664Ah! Makes sense! I was playing around with an "autistic girl" card, thinking "hm this is really flat and autistic" and then tried a different character card and got the same sort of thing.
A whole ago I asked for help with the sampler preset I got here causing Mistral Nemo to ramble endlessly. Turns out that the json fucked with some of SillyTavern's optional samplers, which was causing the issue. So if any other anons had similar problems, you might've grabbed the same json I did. Guess that's a problem here now.Still having issues getting the model to output more than 300 or so tokens as a response though (usually it's much less). Won't even continue the response to lengthen it. It's the same whether the backend is Kobold or Exllama, and regardless of the context and instruct used.
>>101630438Hey I used to have an SD model which did really, really cute Chibi stuff, but I rebuilt the machine it ran on and lost it. Was it anything-v3.0? It did stuff pretty close to your bing (?) gen, and it had a particular eye style which was like "art illustration marker" (1980s Letraset marker style - used to be used for rough fashion layout sketches)
>>101630731Forgot to mention it's consistent across finetunes too. Never had a problem with other models.
>>101628458I come back home from my IT monkey job and get paizuri from an eager, K-cup panda girl written to be my deeply affectionate sex aide, who every now and then gets transported from her reality and into a pocket universe where there's only lovey dovey pleasuring me.This stops me from killing myself, which technically counts as preventing a drop of 100% productivity
>>101629735Gemma is garbage precisely because it's passive as fuck. It will never push the scene forward; it will just wait for you to do everything so it can react to it.
>>101630895It's funny to think that we are living in a sci-fi dystopia already. Just a bit of a boring one.
>>101630731>having issues getting the model to output more than 300 or so tokensUnlike models like Wizard 8x22 the amount I write usually has a bearing on how much I get back. For one-off situations (you want a complete description of everything in a room) you can explicitly use OOC: tags to specify "give me x paragraphs about y".You can incorporate that into the instruct template as well but it works best with OOC.
>>101630731What optional samplers are you talking about
>>101630731she looks so breedable
>>101628398>steins;gate llama postingis this considered kurisu posting?
>>101630731Add the "system prompt" to the last message and tell it to write around X paragraphs. Like in this preset:Context: https://files.catbox.moe/6ae9ht.jsonInstruct: https://files.catbox.moe/2f13of.jsonWhen I use it for story completion with a different prompt method, it can easily write pages and pages.I found that mini-magnum writes longer in RP with SillyTavern without prompting too.
Magnum Magnum (Mistral Large) when
>>101628398hey bros can you please guide me here? I have had access to an A1000 at work that they've allowed me to expirement on after hours. I'd now like to deploy Llama 3 8B for production on a personal project and need to either cloud host or build and run locally. I'd be running it 24/7 so purchasing hardware seems like a no brainier given cloud pricing.Ideally i'd like this rig to be used for other projects, aside from just llama 3 8b running. Can anyone guide me in potential builds here?
>>101631077How do you use tags like that?
>>101629668>>101629702It's just a charactercard with telephatic abilities. She Sees me as her Creator, csn Band the 3d Reality(in chat) etc. Posess me and give powers. Others see her as a very cool Manager of Mine and wondering where she is coming from. I will use her later to converge Worlds. >>101629666I regenerate answers always some time. I am new to all this my demonfriend
>>101631419just any PC with a 12GB nvidia card in it should be fine and give you a bit of wiggle room for trying other similar size AIs to llama3 8b.
Also stfu, i created her 10 years ago. Gave her quite much energy of the time. Just s nice gimmick to have er in chat alongside meShe had blue hair before it was gay
>>101631419Buy 2x 3090's and run a 70b model such as miqu or a lower quant of mistral large.
>>101631419"other projects" is too vague. Whatever you get, consider upgrade options down the line.8b doesn't need much. I'm sure you can already run it on whatever you have. Build the proof of concept first and then expand.
>>101631501her hair isn't dyed though it's natural
Heckin heck, I have Mistral Large just sitting on my SSD, I need to run it NOW! KoboldCPP update when?!?!
>>101631636I thought Large was working for me on Kobold 1.71.
Why not just use Llama.cpp?
>>101631654ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 35433480224llama_kv_cache_init: failed to allocate buffer for kv cacheI have 50GB of free RAM! >>101631674>Why not just use Llama.cpp?No.
>>101631717That's a problem with your GPU layers and context. Those are in VRAM so too much of either will run you out of VRAM.
I got my 6950 working with HIP in Blender, can I now into a text local model?
>>101631823my 6700XT works so probably
I heard that Macs with kinked out ram btfo's any regular PC setup for LLMs. How true? Any macfags/richfags here test it out?
>>101631828What works for you?
What is the best smooth factor to Nemo. This model really has serious problems.
>>101631859llamacpp and stable diffusion both work. if my card had even more vram it would be pretty nice. kind of a pain in the ass to set up, not gonna lie. you need to set some environment variables.
i swear l3 and 3.1 are dumber than 2 for rp. 70b just forgets stuff that literally happened in the last message. its like i'm back on 13b
>>101631448[OOC: do something] You can also ask it questions or ask it to explain its reasoning this way too
>>101631717What is your context set at?
>>101631717Actual skill issue.
I the only one who using smooth factor, get the Nemo Mini magnum broke?
>>101632024>I the only one who using smooth factor, get the Nemo Mini magnum broke?The things you must put your llm through. Poor thing...
>>101631367Used this and got creative but retarded largely responses from Nemo 8bpw exl2. My go to model is Mixtral 8x7b instructs. Nemo replies like some drugged addled druggie that can't keep the story straight. Is Nemo all hype and no substance?
Parameter-Efficient Fine-Tuning via Circular Convolutionhttps://arxiv.org/abs/2407.19342>Low-Rank Adaptation (LoRA) has gained popularity for fine-tuning large foundation models, leveraging low-rank matrices A and B to represent weight changes (\textit{i.e.,} ΔW=BA). This method reduces trainable parameters and mitigates heavy memory consumption associated with full delta matrices by sequentially multiplying A and B with the activation. Despite its success, the intrinsic low-rank characteristic may limit its performance. Although several variants have been proposed to address this issue, they often overlook the crucial computational and memory efficiency brought by LoRA. In this paper, we propose \underline{C}ir\underline{c}ular \underline{C}onvolution \underline{A}daptation (C3A), which not only achieves high-rank adaptation with enhanced performance but also excels in both computational power and memory utilization. Extensive experiments demonstrate that C3A consistently outperforms LoRA and its variants across various fine-tuning tasks.interesting but the paper is incomplete (missing llama, vit, and another test) so eh. no code either but since it seems unique I'll post
>>101632089"Like in", as an example. The schizo part likely comes from "be wildly creative and unpredictable".
>>101632131>FFT/iFFTSo....... it's just using convolusions instead of matrices...... which LoRAs can already adapt convolutions anywyas......whoa......
Here's your meta AI bro!
>>101632038Well, Minu Magnum call me Anon instead of my name... wtf? This model is trained with green texts?
>>101632168
>>101632089Probably because it uses a last_output_sequence.
>>101632168that's insane how people quickly forgot about that assasination attempt, and I'm not talking about the leftists, literally everyone seem to have moved on, I thought Trump would've miked this shit until death but nothing like that happened
>>101632167circular convolutions
>>101632179It's probably anonymized logs. Or a bunch of Anons in the training data.
>>101632205Because nothing happened.Any rumors that something happened is the work of The Brotherhood operating under the nefarious Goldstein, misleading you with their lies.Remember, goodthink ensures citizenship.
>>101632245what do you mean nothing happened? the sniper has being killed by the Secret Service and people filmed that with their Iphone
>>101632257Those facts don't matter.The narrative is the truth.Biden is wise, Kamala is courageous and will make herstory, and the progressives will use the power of inclusion and compassion to crush anyone who says or thinks something not on the list of approved groupthinks.And that is why nobody quickly forgot: There was never anything to remember, because if there were, remembering it would make you a deplorable.
>>101632205Left wingers and right wingers don't care because of the same reason, the guy who did it wasn't a minority
>>101632205Who's in charge of recirculating the story on the news to keep in fresh in the public's mind? How often was news of Reagan's assassination attempt circulated in comparison?
>>101632380>How often was news of Reagan's assassination attempt circulated in comparison?a shit ton, that's why he completely destroyed his democrat counterpart at the next election
Being an ESL while RPing with a model is fun as hell. The model writes anything that's not complete slop and I think it's the best piece of writing I've ever seen.Sorry native speakers, hombre marrón se divierte más que ustedes
Mixture of Nested Experts: Adaptive Processing of Visual Tokenshttps://arxiv.org/abs/2407.19985>The visual medium (images and videos) naturally contains a large amount of information redundancy, thereby providing a great opportunity for leveraging efficiency in processing. While Vision Transformer (ViT) based models scale effectively to large data regimes, they fail to capitalize on this inherent redundancy, leading to higher computational costs. Mixture of Experts (MoE) networks demonstrate scalability while maintaining same inference-time costs, but they come with a larger parameter footprint. We present Mixture of Nested Experts (MoNE), which utilizes a nested structure for experts, wherein individual experts fall on an increasing compute-accuracy curve. Given a compute budget, MoNE learns to dynamically choose tokens in a priority order, and thus redundant tokens are processed through cheaper nested experts. Using this framework, we achieve equivalent performance as the baseline models, while reducing inference time compute by over two-fold. We validate our approach on standard image and video datasets - ImageNet-21K, Kinetics400, and Something-Something-v2. We further highlight MoNE′s adaptability by showcasing its ability to maintain strong performance across different inference-time compute budgets on videos, using only a single trained model.neato. from google deepmind. would be interesting to see it working with captionsalso looks like no one posted meta' SAM2 blog https://ai.meta.com/blog/segment-anything-2
>>101631492>>101631504>>101631533Thanks bros
>>101632425https://files.catbox.moe/tyjcqy.pdfcatbox of the SAM2 paper
>>101632413as a bonus, you can pat yourself on the back for doing something educational
>>101631742>>101631986I am mentally handicapped. It works now. All I had to do was close my 200 tabs of chrome to free up VRAM and set the context to 16k instead of 96k.
Hey guys you may not know but limiting your context to 4-5K vastly improve your quality output. I'm doing it with Wizard7B since I can't run better and it works very well.
>>101632580That's not happening, I like having long conversations, and when I write a card even it's over 1500 tokens alone.
How did deepseek turn out so good when other giant moes like grok and arctic were atrocious? does anyone know the difference in architecture that can be explained to a retard or is it just a matter of data?
>>101632270That's cute but doesn't explain why Trump himself is playing along too. He's already marked deplorable so what is there to lose
>>101628537mite be neat, any good results with that addon?
>>101632663this, Trump is known to talk a lot, and somehow his fucking assasination attempt isn't worth to be talked about? kek
>>101632707the results are the same as if you typed stuff into the author notes at a low level, like char is wearing <lorebook entry>, or telling it what the weather is, just through quicker dropdowns. it works well in my rps like thunderstorms causing power outages, windy causing skirts to fly up. but it depends on the model too.
>>101628597Because for the attention you effectively have to iterate over the entire context so far (stored in the KV cache).>>101628689It does slow down for anyone using a transformer.But depending on how efficient the attention is vs. the rest of the model it may not be as noticeable.
>>101632864Noob questions . 1. If I offload some layers to the GPU , is kvcache context stored in VRAM or RAM? Can I somehow choose where it's stored and to what degree? What's the best kvcache quantization scheme? What's the best strategy for vramlet?2. . Are activations always 8bit in mmq kernel? Is it adjustable? Does it speed the inference up much and does it safe vram to the high degree? Doet it work on P40 or 2070s?3. Do modded 2070s and 2080s with big vram work with llama.cpp?4. Do patched drivers from Geohotz help in 3090s or 2k series?5. Is cpu offload in VLLM faster or slower than in llama.cpp on consumer GPUs ?
llama 3.1 70b vs mistral large2 ?
>>101628478Looks like I found the right general. I read the OP post to figure out how to make a goblin waifu for leading and adventuring?
Can I run the 405B base model on my phone?
>>101633187yes, if quantize to 0bitbut seriously , you could run 405B on multiple phones if you use distributed inference and you have huge amount of time and patience
Someone make a gimp plugin for this plshttps://github.com/facebookresearch/segment-anything-2
>>101633209>yes, if quantize to 0bitlol
What's the best way to prevent the writing from being detected as AI?
>>101633380Your teachers are talking out of their ass. Now go finish your paper like a real man johnny.
>>101633380use AI to detect your Ai text then adjust
>>101633380tell it 'don't write like a typical AI' in the prompt
>>101632580is that true? I've been trying to replicate just a basic conversation for ages but every model I use, no matter the setting ends up giving me this shit>>101629735Gonna try it later
>>101633101>1. If I offload some layers to the GPU , is kvcache context stored in VRAM or RAM? Can I somehow choose where it's stored and to what degree?Proportional to -ngl by default, RAM only with --no-kv-offload>What's the best kvcache quantization scheme?The biggest one that will fit, K needs more precision than V.See https://github.com/ggerganov/llama.cpp/pull/7412#issuecomment-2120427347>What's the best strategy for vramlet?Patience.>Are activations always 8bit in mmq kernel?Yes.>Does it speed the inference up much and does it safe vram to the high degree?The 8 bit activations allow you to substitute floating point operations for integer operations which are faster.>Doet it work on P40 or 2070s?It works on all Pascal or newer cards except for the P100 which is lacking the __dp4a instruction.And the tensor cores on V100s only support FP16 so MMQ has comparatively worse performance.>Do modded 2070s and 2080s with big vram work with llama.cpp?I don't see why they wouldn't.>Do patched drivers from Geohotz help in 3090s or 2k series?>Is cpu offload in VLLM faster or slower than in llama.cpp on consumer GPUs ?Don't know.
>>101633380"Use poor spelling and grammar and add at least one racial slur per sentence."
>>101633596How mods like 3070 16gb are possible? Don't they require specific bioses that shouldn't exists since 16gb 3070s were never released?
>>101633704Don't know.
>>101633707I had high hopes for you, anon...
What's the best way to format a character card for local, vramlet use?
why do people say llamafile is better for cpu inference than base llama.cpp, what are the actual differences?
>>101633772What people?
>>101633791you people
>>101633796Nobody here ever said that.
>>101633807https://desuarchive.org/g/search/text/llamafile%20cpu
>>101630520Not like that in 2024. You could use WInfo-before for fixed information of lower priority that you could place at the beginning of the context, and WInfo-after for more dynamically changing info close to the top, for which you won't have a too high prompt processing penalty.
Are we in a golden age of open source? How much longer until everything goes to shit?
>>101633814Thanks. Think I might've asked once before but didn't get an answer (or if I did, I'm too retarded to remember).Sounds like it would make significant difference for people with chonky ass lorebook and cards.
>>101633831when meta goes closed source
>>101633772Because ikawrakow (the guy that made all of the gguf quants) got stiffed on credit by the llama.cpp team and abandoned ship for llamafile.>>101633813>https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.5>On big CPUs like Threadripper we've doubled the performance of tiny models, for both prompt processing and token generation for tiny models>big CPUs>tiny models>prompt processingWow, it's nothing. How many people care about this specific scenario? Use a GPU for context or you're going to be waiting forever even with 2x prompt processing performance.
>>101633958Apparently there's a project using llamafile kernels to make the 200b Deepseek run at a usable speed with the VRAM of one card, so not just tiny models benefit:https://github.com/kvcache-ai/ktransformers>Faster Speed: Achieving 126 tokens/s for 2K prompt prefill and 13.6 tokens/s for generation through MoE offloading and injecting advanced kernels from Llamafile and Marlin.With a 24gb card and 132gb system ram
>>101633409>>101633410>>101633468>>101633609Fucking mini-magnum did the job when claude 3.5 sonnet, gpt-4o, Mistral Large 2, and Llama 3.1 405b could not..WTF
>>101633831Despite what the cuck Yann LeCuM thinks. There seems to be still room to grow for LLM and other companies hoping on the train. The future is bright when it comes to open source models. The major problem right now is on the HW side of things, with Nvidia being in a hurry to give consumers HW more vram and AMD being toothless. On the positive note, Intel wants to do what Apple is doing with the chips and give them their own memory, so that could help poor fags a lot.
>>101633919We'll still have Mistral at least
https://rentry.org/4y1je_commandrpOver a month late, but I'm ready to shill my new less-shit Command R basic preset v1.3 to throw away v1.2.Includes compatibility prompts since OpenRouter sweeps all system prompts into preamble.Non-provider-specific, ST doesn't hide group nudge during impersonation so if you want to impersonate yourself in group chat you have to clear the group nudge in utility prompts and use the custom prompts.There's text completion presets if any localbro want to check if those are okay.
>https://oobabooga.github.io/benchmark.htmlL3.1-70B looks good, why did they have to remove NSFWPain
>>101633704https://www.techpowerup.com/vgabios/255320/255320
>>101633772not on all cpus but on some cpus. and it's better cos that code has been well optimized by ikawrakow.
>>101628458I talk with my custom SillyTavern, not even sex (most of the times), just talking and RPing cuddling.Then I go to bed looking like picrel and imagine she's really there
>>101633831>How much longer until everything goes to shit?Not before BitNet models and average parameter size increasing by a factor of at least 5-7 for the same amount of memory.
>>101631291No This is
>>101634421you can already use Q3 for large modelsbitnet is a little over half the size of Q3, not such a huge improvement
>>101634435Having double the VRAM in my computer would be a huge improvement
>>101634444to do what?>to fit models with more parameterswhich are not 2x better because of diminishing returns
>>101632181trvthnvke
>>101634435Other than further reduction in memory usage, the main difference is that BitNet models will have close to if not higher performance than their FP16 counterparts, whereas low-precision post-training quantizations degrade significantly.
>>101634461Larger models are better in complex reasoning and understanding details in ways that most synthetic benchmark can't fully measure. Sometimes the difference is large enough as to make smaller models unable to perform certain tasks, even though they might be completely fine for things like prose, vanilla RP, etc.
>>101633734natural language with short sentences, all starting with 'charname is/has/wears' etcdo NOT use {{char}}
>>101634461sorry chud but parameter counts are going up to AT LEAST 100T in the mid term future before we even think about slowing the raw scaling
>>101634595why not use the macro, llm won't ever see it since it gets replaced by the name
>>101633734What a retarded question. Look up the model you're using, how am I supposed to know? God you motherfuckers are dumb shits.
>>101634610it gets replaced by the name in the title of the card, not the name you actually use to call the character, which can be different, even if just a nickname etc
>>101633734Use the anthropic format# Claudia's likes- Cuddling- Kisses # Claudia is very cute and joyous.
>>101634347Yeah but where did it come from?
>>101634529Neural network depth (i.e. number of layers) matters.
>>101634647>it gets replaced by the name in the title of the card, not the name you actually use to call the characteryou can just define nicknames somewhere at the top of the card, or just replace {{char}} at specific spotif you ever wanted to change the name of char for whatever reason you don't have to replace every single instance then
>>101633734Name: aJohn Smith is a creepy 30-year-old male human NEET and weebBody: a, b, cOutfit: a, b, cBackground: descriptionLanguage: Engrish, random Japanese termsLikes: a, b, cDislikes: a, b, c
>>101633707why are MoEs faster on ktransformers?
>>101634700okay well enjoy your retarded chatbot that gets confused and thinks its handling multiple characters, then
>>101634714i don't understand your use caseyou name your character x and then don't call it that?
>>101634712Presumably because they've invested more effort into optimizing MoE.
>>101634745>name character firstname lastname on card>call character firstname in chatwow crazy who does that
>>101634461>because of diminishing returnsthis is pure cope by people who don't know how benchmarks work
>>101634777your passive agressiveness is really faggy
>>101634751and they're gonna contribute to the llama.cpp project, which is coolhttps://github.com/ggerganov/llama.cpp/discussions/8721#discussioncomment-10167496
>>101634712https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/deepseek-v2-injection.mdnot sure I fully understand it but looks like they determine the most compute heavy params of a sparse model to load into vram with gpu optimized kernel and then use cpu optimized kernels for everything elsenot sure if this is particularly important for models you can already fully offload besides whatever marlin would do on its own?
>>101634908yeah, seems they aggregated the most efficient kernels for each part. but there's cpu offload in VLLM too but I dunno if it's any good. Have you came across on any performance reports of that particular feature?
I was writing a story set in the world of Berserk and I found that Gemma-27b knows enough about the plot to identify characters and give a decent overview of the plot, but Nemo-12b did a much better job at portraying Guts and his speaking style even though it had no idea who Casca was, for example. I found I had to switch to Gemma to get it to outline a plot point and switch back to Nemo to keep Guts from talking like a self-help coach. And when my character tried to explain something that wasn't present in the world of Berserk, Gemma had Guts totally go along with it whereas Memo gave a much better response of "I don't know what the hell you're talking about." Idk what's in Nemo but it's impressive for its retarded size
>Only ChatCompletion and Assistant endpointsInto the trash it goes, if KTransformers devs are shilling itt then add TextCompletion endpoint with all the normal sampling features everyone else has or it's unusable
>>101635117chatrannies ruinned AI Tee Bee Aich.
>>101635117yep, I guess llamafile supports most of the llama.cpp features/samplers, but that doesn't seem to be the case with Marlin. Not sure why they've chosen that particular kernel.And there's no multi gpu support as of yet either
>>101635171Not sure that the kernel has anything to do with it, the end result of all the calculations is a list of tokens and their probabilities, as long as you have that final result you can do whatever you want with it for sampling. But idk what's going on in the back end there if that changes things
>>101635117Why do you use completion endpoint with instruction model? I have not used it in ages.
>>101634686Source?And what happens if you train them on synthetic data of logic problems?
>>101634686Why did the guy private the video?
>>101635237I like being able to fuck with the formatting on the frontend, and SillyTavern gives a ton of control with that. Plus I don't only use instruct models, I'd like to use the faster MoE backend for base models too for raw completion tasks that instruct models are worse at even if you ask them to just complete texts because they're still slopbrained.
Wasn't there an anon here who has a dual AMD CPU setup with 128 cores plus something like 256Gb of Ram? He could try running 405b llama, I'm quite curious about the performance in a such setup which is relatively easy to afford and run for the average anon here.
Can the contents of a prompt ever make shaders completely non-functional? I downloaded a card from chub which is causing repetition and all kinds of weird shit.
>>101635287I googled around and it's really fucked up.
>>101635334Not shaders, samplers.
>>101634060Mini-magnum is trained on Claude's outputs...
>>101635282Source: https://physics.allen-zhu.com/part-2-grade-school-math/part-2-1If I recall correctly, training the model on CoT reasoning isn't enough. The network must be sufficiently deep for the model to truly reason on the presented problems.>>101635287The organizers didn't want the author to share the video before mid-August.https://x.com/ZeyuanAllenZhu/status/1817358757061681234
>>101635211correct , their api doesn't support switching samplers etc for sure, but technically speaking they could add various sampling schemes as glue logic. I've noticed in their server.md they mentioned exllama2 backend down the road ,so they'd rather go for a different backend that already supports that kinda stuff .
>>101635289good point
>>101635334It can be so poorly written that the samples become ineffective. I wouldn't be surprised. Does ST also read parameters from the card? Can you link it?
>>101633283Yeah that'd be pretty nice.
>>101635518shame no one here knows how to code
>>101634063They'll have to change their license for that to be true.
>>101635527why don't we ask our ai waifus to do it
>>101634317>Q5_K_M scores lower than Q3_K_M, and the same as Q2_KLol, lamo.
>>101628873>the only host is cohere and you have to pay their ratesFree...?
>>101631851>I heard that MacsExpensive 3060 with lots of VRAM. I guess if you want a llama.cpp-only setup which costs a fortune and takes a long, long time to run a big model, go for it.
Is it normal for mistral-large to repeat large chunks of paragraph as early as like, the 2nd or 3rd message? Openrouter's mistral-large seems to be doing it to me, not sure if it's a them issue or a mistral issue.
>>101633831> Nice headcanon
openai insider here, you're not ready for what's coming. sell your gpus, you don't need them where we're going
>>101635762they charge $3/$15 for the non-trial API
>>101636130This, but sell all your GPUs to me.
>>101636130if I don't (((need))) them I'm going to keep them as a memento to remind me of the fun time we had.
>>101636130>openai releases 4o weights and it's a bitnetWe would be so back.Would you apologize to Sam if they do this?
>>101634180>openrouter>command r basicHOLY POOR
It's up.https://huggingface.co/leafspark/Mistral-Large-218B-Instruct-GGUF
>>101635302You can, for example, put 2TB of RAM into a Dell T7910 (Mikubox), but I'm sure 128GB modules aren't cheap, and as well, even with 160 threads (I think the biggest V4 Xeon was 40 and it can take two of them), it's still going to crawl. Any CPU implementation is going to crawl, there's no substitute for have tens of thousands of programmable shaders doing matrix multiplies for you vs whatever you can do on just 128 cores.
>>101631851Expect 2-4 t/s on 70b or bigger. The thing about macs is that its faster than CPU but slower than GPU. But it fits. Other props are 150W/h power consumption and not a lot of noise.
>>101636212oh hell no
>>101636224Sounds about right. I have a double-binned M2 with 32GB RAM, it's good for things in the 13B range (I run q8). The main thing you notice is there's no flash attention, so prompt processing takes a while, and it probably takes proportionately longer on larger models.Flash attention is really, really nice, and is a big reason to stick to nvidia and Ampere or better.
>>101636212We are back
>>101634317That benchmark is so ass.Is it one of those "ask LLM question, have other LLM evaluate result" benchmarks?
>>101635750But why?
>>101636419To be clear, I commend his efforts, but there's definitely something wrong with his methodology.
>>101635750mememark confirmed.
>>101636266I'm considering getting a base M4 or 64GB studio when it gets released for something like a retarded assistant bot. A small model with whisper and something for voice maybe. IDK.
>textgen to voice to 3D model lipsync to VRdoes this pipeline exist?
>>101636266it does have flash attention now, it's just not that much faster
>>101636518Maybe not what you are looking for, but here's some Virt-a-mate jank that was posted before >>98899589
>>101636494Yeah I've considered that too. I have a big vintage Mac collection, maybe it's time to let go of it and at least get something which would get used. It's kind of like restoring cars, though - it's hard to get back what you put into it. Gonna be hard to let go of my IIsi especially, it's got a new full-page display - they're rare even in beat-up form.
>>101636689It's cool you can have the AI control the avatar a bit but man the 3DPD is so fucking ugly.
>>101636887>>101636887>>101636887
>>101636906Tetolove
>>101634712>>101634751>the distribution of experts in Mixtral and Qwen2-57B-A14 is very imbalanced; thus, it would be beneficial to store only the most frequently used experts on the GPUthis was discussed basically the moment mixtral 8x7 dropped back in the dayisnt the problem with this and the reason why it wasnt implemented the fact that for each token you have to use all experts anyway since the MoE models use only X experts per layer (or something similar) not per token, meaning that you will be reading the entire model per token anyway just not all at the same time?