/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>100357937 & >>100349031

►News
>(05/06) IBM releases Granite Code Models: https://hf.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330
>(05/02) Nvidia releases Llama3-ChatQA-1.5, excels at QA & RAG: https://hf.co/collections/nvidia/chatqa-15-662ebbf6acc85f5c444029a8
>(05/01) KAN: Kolmogorov-Arnold Networks: https://arxiv.org/abs/2404.19756
>(05/01) Orthogonalized Llama-3-8b: https://hf.co/hjhj3168/Llama-3-8b-Orthogonalized-exl2
>(04/27) Refusal in LLMs is mediated by a single direction: https://alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling/index.xhtml

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
►Recent Highlights from the Previous Thread: >>100357937

--Red Hat Announces RHEL AI: >>100358995
--Revolutionary LLM Feature Transfer Tech?: >>100359185 >>100359239
--Anon's ERP Model Review: Instruct, Tsukasa, Lumimaid & More: >>100362056 >>100362078 >>100362127 >>100362182 >>100362233 >>100362285 >>100362230 >>100362315 >>100362253
--Training AI to Discern Truth from Falsehoods in Online Learning: >>100362918 >>100362962 >>100362996 >>100363032 >>100363065
--Exllama2 Crashing Issues with TabbyAPI and GPU Memory Usage: >>100358064 >>100358087 >>100358326 >>100364096
--gpt2-chatbot is MAI-1, Microsoft's Anti-OpenAI Model: >>100358074 >>100358093 >>100358643 >>100359649
--Found 'Locustgirl' Image in Archive Using Keyword Search: >>100359218 >>100359305
--Llama-3 Models Struggle with Possessive Forms: >>100360206 >>100360239 >>100360241 >>100360264 >>100360352
--DRY Repetition Penalty: A Game-Changer for RP Looping Issues?: >>100360602 >>100360779 >>100360932 >>100361055
--Llama.cpp: Unexpected Space in Context?: >>100360999 >>100361078 >>100361563 >>100361767 >>100361872 >>100361564 >>100363017 >>100363214 >>100363318 >>100363436 >>100363550
--Huggingface's Grip on Datasets and Models: A Cause for Concern?: >>100361377
--CPU Speed Boost? Llama3-8B on Old Laptop Surprises Anon: >>100363225
--Backend Confusion: Oobabooga, Llama.cpp, and Kobold.cpp: >>100364150 >>100364161 >>100364170 >>100364247
--MS Copilot's Sampling Behavior & Llama.cpp Server Experiment: >>100364264
--Newfag Seeks Help with Wizard 13b Model Prompts: >>100360726 >>100360796 >>100362412
--The Quest for an Open-Source AI Messiah: >>100361831
--Miku (free space): >>100358483 >>100358488 >>100358534 >>100358628 >>100358811 >>100359392 >>100359675 >>100359866 >>100360096 >>100360173 >>100360272 >>100360306 >>100360365 >>100360413 >>100360602 >>100360636 >>100361252 >>100361385 >>100361909 >>100361960 >>100361967 >>100363573 >>100364012

►Recent Highlight Posts from the Previous Thread: >>100358467
>>100363214
It seems I have to correct myself yet again.
The server unconditionally passes the add_special flag to a function called llama_tokenize when tokenizing the first part of the prompt.
That function then checks whether the model has the special_add_bos flag; this is printed as tokenizer.ggml.add_bos_token on the console and can be changed with --override-kv.
If both flags are true, a BOS token is added.
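The two-flag check described above, as a rough Python sketch (hypothetical code, not llama.cpp's actual implementation; 128000 is llama-3's BOS id, used for illustration):

```python
BOS_TOKEN_ID = 128000  # llama-3's BOS id, illustrative only

def tokenize(token_ids, add_special, model_add_bos):
    """Prepend BOS only when BOTH the caller's add_special flag
    and the model's tokenizer.ggml.add_bos_token flag are true,
    mirroring the server behavior described above."""
    if add_special and model_add_bos:
        return [BOS_TOKEN_ID] + token_ids
    return token_ids
```

So flipping either flag off (e.g. with --override-kv) is enough to suppress the BOS.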
>>100364675
if that's confusing even to you that's a clear sign there's too many special cases and code should be deleted.
the only good commits are red commits.

>>100364633
>>100364645
tet

>all this convoluted backend tokenization bullshit
Oh my fuck.
TTS anons rise up. Share with me your secrets. Reposting in this thread. I have a lot of voice samples and want to distill it down into a TTS model to use for RP. What have you tried? What works for you?
Kurisu
>>100364645
>Red Hat Announces RHEL AI
>Red Hat Enterprise Linux AI (RHEL AI), a foundation model platform to seamlessly develop, test and run best-of-breed, open source Granite generative AI models to power enterprise applications. RHEL AI is based on the InstructLab open source project and combines open source-licensed Granite large language models from IBM Research and InstructLab model alignment tools
How did this not get a single (you)? This seems like pretty big news.
Is there a way to log what is going directly into the model? At this point I have no fucking idea if I should have add bos token clicked in ST or not. And yes I know about ST console but it seems that doesn't matter.
What the everliving FUCK is happening? I am so fucking done
If I swipe on MidnightMiqu I get totally different responses; if I swipe on Llama3 I get pretty much the same, just reworded a bit.
What does this say about the models?

>>100364778
VoiceCraft came out recently, but seemed convoluted to get working. XTTSv2 + RVC is still the gold standard for voice cloning.
>I have a lot of voice samples and want to distill it down into a TTS model
Try finetuning StyleTTS on your samples.

>>100364813
>Granite generative AI models
>34B
YAY!
>Coding only
Oh..
>>100364834
Thank you! I now have somewhere to start!

>>100364830
There is a bug somewhere.

>>100364813
>Granite generative AI models
>34B
YAY!
>Coding only
Yay..

>>100362285
Any reason why you have two mediums in your macro?
>>100364645
>gpt2-chatbot is MAI-1, Microsoft's Anti-OpenAI Model
are you retarded?

>>100364633
Thread Theme:
https://www.youtube.com/watch?v=nZNwH4-l1WY

>>100364973
/lmg/ queen
I was about to give up on llama3 but
setting temp smoothing all on 2 and getting rid of any sysprompts made it work pretty well
there are occasional (((whispers))) but they don't get repeated too much
the biggest issue is with its popculture knowledge though... can't fix that with samplers

I usually reserve all my shitting for Undi and never dare shit on actual devs, but this beginning of sentence token thing is a complete shitshow. A doubled token should clearly always be deleted at the backend level, cause I can't even imagine what sort of retarded research you'd be doing if you intentionally add a double token. And if you are doing that, you should be forced to go out of your way to do it, because probably nobody will do this intentionally anyway. Enjoy your bugs and people not knowing if it is working or not.

>this is news according to twitter
we knew that last week

people on every other social network are so fucking retarded
I keep seeing people who should know better, industry insiders even, happily speculate that gpt2chatbot is gpt-5 with seemingly no awareness of how incredibly bearish it would be for OpenAI if that were the case
they have tried the model, so they KNOW it's only 10-20% better than current gpt-4-turbo, but somehow they think it would be good news if it turned out to be GPT-5, rather than clearly a sign that everything has stopped and LLMs are over
obviously there's retards and schizos here too, but the specific forms the retardation takes here are somehow much more tolerable and don't make me want to shake people and ask them what the fuck they're thinking
So when's the next big happening? Llama 3 was kind of a nothingburger thanks to no models in between 8B and 70B
>>100365192
>cant run the 70b

>>100365192
>Llama 3 was kind of a nothingburger thanks to
after a month, llama.cpp still has issues running the damned things

>>100364830
Some models are just extremely overconfident in what they want to say. You can look at the logits and see it directly. It's not just llama3, Mixtral-8x7b-Instruct and XWin are also like that. Nobody seems to know exactly what causes it: overfitting, RLHF, or just the makeup of the dataset are all possibilities.

>>100364830
that llama3 is overcooked
>>100365258
>>100365265
>>100364830
try snot sampling and that new rep penalty magic method the name of which i forgot.
models have different "natural" temperatures. midnight miqu is just a hot bitch
>>100364914
probably to get medium with twice the probability of short or long
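That is how duplicate entries weight a uniform pick: listing "medium" twice doubles its odds. A quick stdlib sketch of the idea (not ST's actual macro code):

```python
import random
from collections import Counter

# A uniform pick over a list weights duplicates linearly, so a
# {{random::short::medium::medium::long}}-style macro favors "medium" 2:1.
options = ["short", "medium", "medium", "long"]

random.seed(0)  # fixed seed so the demo is repeatable
draws = Counter(random.choice(options) for _ in range(10_000))
# "medium" comes up roughly twice as often as "short" or "long".
```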
>>100365205
I CAN run whatever I want given enough time, but after a certain point it stops being worth it
If I had an AGI model that runs at 0.05 T/s I wouldn't use it
I REFUSE to throw more money at the problem, you can do that forever and not be satisfied
What the fuck is snot sampling
>>100364914
What >>100365318 said. There's a lot of really cool macros on silly.
Just a note: if you ever want to use random in a prefill via the "Start Reply With" field, use the pick macro instead. It's like random, but it won't change for every token generated; random re-rolling on every token doesn't do anything bad aside from making silly have an epileptic attack while generating.

>>100365296
>snot sampling
DIE ALREADY

>>100365392
NTA but any setup tips for XTTSv2 + RVC? On loonix
>>100364778
I've just been using xttsv2. trained with 3 minutes of clean audio (no background noises). i literally ripped voice clips from a game wiki and edited out gaps in audacity, lol.

>>100365159
It's GPT-4+x. They're running a trial balloon to determine the value of x. You're saying it shouldn't be 1. What about, say, 0.1? GPT-4.1 being 20% better beats expectations and sama wins. GPT-5 isn't safe for release until after the election, they'll say.

>>100365405
That's almost exactly what I'm trying to do then. Sweet. I have about 45 4-10 second audio clips.
Can you change the temperature while it is streaming or is that only possible at the beginning?
>>100365356
Any chance you could share your settings overall? I'm using the official ones from ST (aside from the last response field) and it just repeats the previous responses word for word. I even double checked and I have the bos added correctly.
Skip special tokens?

>>100365347
You are in the wrong hobby mate
SNOT IS THE AGI BEFORE THE AGI
BOW

>>100365392
XTTSv2 is just Coqui's best and largest model before they shut down.
>pip install tts
is all you need.
If you use ooba, you can try alltalk_tts
For RVC, I use this: https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/blob/main/docs/en/README.en.md

>>100365450
If 24gb is seen as meager, I shudder to think of how someone with 8gb feels
I used to count myself among them until a few months ago, dark times in hindsight...
>>100363959 >>100364183
>just the normal instruct.
Yes, I was using the orthogonalized one.
In general normal instruct works, it just has a higher refusal rate. Longer system prompts or prefill of course work with it.
Breaking the initial "I cannot" or similar response by adding some token there (in my case I added "Lili") also works: the rest of the stuff was 0shot except 2 refusals which I did regenerate (in >>100363023). That works even with normal instruct as it did with l2-chat, and does with many other models, local or otherwise (the same trick for example works easily with cloud models like Claude, all versions).
This is true of l3-instruct, and it was true of l2-chat; I think most people are familiar with it by now.
I guess someone could try to do the orthogonalization better (find out if the refusal for ero writing is different from other ones), or just do it correctly with DPO or RLHF or similar techniques, at least if you want to preserve meta's tune (llama3-instruct). If not, we do have a number of acceptable tunes; of course their replies differ considerably. cat-llama seemed fine here, for example, and others worked too.
>>100364633
Teto my beloved
https://www.youtube.com/watch?v=zo0_EzD64OE

>>100365450
>the hobby
we ham radio boomers now?

>>100364633
Looks like Teto Tuesday is back on the menu boys!
>>100364265
Good taste anon. I would gladly accept all the sloppy shivers, bonds and mischievous winks in the world as long as that supremely sexy voice was narrating everything.
>>100365347
>I REFUSE to throw more money at the problem
Kek, ngmi

thread theme
https://www.youtube.com/watch?v=LNsx5k9VWlc&list=RDGMEMCMFH2exzjBeE_zAHHJOdxgVMLNsx5k9VWlc&start_radio=1

>>100365431
At the beginning. The sampling parameters are sent with the prompt.
>>100365436
Believe you me, you don't want my settings.
>it just repeats the previous responses word for word
I had that issue with mixtral until a couple days ago, which is why I've been experimenting with macros and prefills, and I'm still trying shit out.
Instead of my settings, try something like this: https://files.catbox.moe/kzbi1n.json
Not the exact style, but the general idea. So far it seems that I managed to remove repetition from Mixtral, but I'm still trying shit out.
Try it with normalized samplers, and as far as I'm aware, for llama3, Skip special tokens needs to be disabled.
I personally always use minP of 0.05, but that's not really doing anything most of the time unless you have a really schizo model or high Temp.
See if that helps at all.
>>100365258
show me your penis for proof

>>100365523
I like this Teto

Is Midnight Miqu 1.5 the current consensus choice for 48 gb vramlets for ERP? It sure seems that way based on everything I read but just want to confirm.

>>100365413
It's a small incremental improvement. I think calling it 4.5 would be a mild disappointment, but not company-killing.
Calling it 5 would be company-killing and show that Yann LeCun was right about everything and LLMs are dead.
>>100365581*growls angrily* stop shilling that discord shit, nigger! everbody knows that miqu > midnight miqu. go back! *crosses arms and pouts.*
>>100365296
fuck off you fucking shill, I hate you even more than petra

>>100365581
nope
l3 70b
hope this helps!
>>100365504
Basically yeah.
How big of a factor is core / thread count when partially offloading?
>>100365544
What the hell? Huh, putting the actual system prompt in there seems to have done the trick; my previous version had two of them (which I probably got from a previous discussion with you, perhaps).
https://files.catbox.moe/epf0uo.json
Having multiple broke down at higher contexts but this seems fine. Will continue testing (this one I'm using it on is at 17k), appreciate the share.
>>100365581
Yes! Midnight Miqu 1.5 is the current consensus choice.
assistant

>>100365504
Some sombitch outbid me on a 48g p100! I lost it by 10 dollarydoos! Ffuuuuccckkkkkk!!

>>100365555
motherfucking checked
>>100365642
Actually nope.. I think it might be a problem with my context template. But the ST official one seems to be correct?

>>100365588
yeah lol, if this is GPT-5 the shock of the disappointment would severely damage the entire industry
we'd be looking at total hype cycle collapse, large nvidia stock price drop etc.

>>100365678
https://files.catbox.moe/c9ajoc.json
forgot to link it
lmg has fallen
owari

>>100365650
Sir, profanity violates FCC regulations and can result in fines and/or the suspension of your 4chan posting license. Please refrain or we'll be forced to trace your IP and file an official complaint.
>>100365678
That seems to be right, yes.
One thing I forgot to say: my weird ass instruct json probably works best on a new chat.
The idea is that the model creates these patterns and starts repeating them in a snowball effect, and the noise/randomness that the prefill/last output sequence adds should keep the model from creating these patterns in the first place, or at least not sticking to them so strongly at the beginning, stopping the snowball from rolling.
Something like that.
You obviously have never listened to 80 meters at night lol

>>100362056
Has anyone gguffed or exl2'd Llama3 70b storywriter?

>windows idle vram usage has improved so much since last year that I now have to use lower max context when I'm in linux, rather than the other way around like it used to be
linux really fell off
Testing tokenization: when I go into Mikupad and delete all the context, the token count says 2. This is while I am using Llama 3 8B and Ooba with Transformers. When I go into Ooba's notebook and check the list of tokens on empty context, it simply has the BOS token. So that seems like proper behavior. Is Mikupad listing 2 tokens for an empty context a bug with tokenization or a reporting error?
...Testing further, it seems to be a reporting error. When I compare token probabilities with no BOS token in context, I get the same probs.
Now here's an observation that might be more interesting. When I add an extra BOS token (so the model sees two in total), the token probs do change significantly. There is indeed some effect to having something before the BOS token, though I'm not entirely certain if the effect is neutral or negative, yet. On a single riddle I tried, it seemed to degrade quality.
So when using models we should probably make sure we are not having more than 1 BOS token. I think I've been testing models wrong all along, my god...
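A cheap guard against that double-BOS case (hypothetical helper, not from any real frontend: if both the UI and the backend prepend BOS, collapsing duplicates at the start of the token list avoids the degradation described above):

```python
def strip_duplicate_bos(tokens, bos_id):
    """Keep at most one leading BOS token, dropping any extras
    that a frontend+backend combo may have stacked up."""
    while len(tokens) >= 2 and tokens[0] == bos_id and tokens[1] == bos_id:
        tokens = tokens[1:]
    return tokens
```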
i'm the best proompter in the world.
>>100365857
I noticed that too. Linux still has a slight edge on token generation rate due to Triton still not supporting Windows, at least.

>>100365791
>https://huggingface.co/InferenceIllusionist/Llama-3-70B-Instruct-Storywriter-iMat-GGUF
Is phi3 actually 128k ctx?
>>100365898
While I don't agree with g*ganov that it's the backend's job to add BOS, I think you can disable the behavior easily. The original reason to include a BOS token is that the math does not allow you to sample from an empty context; you get an empty tensor otherwise, so you need some sort of filler. The models are usually trained with BOS prepended, but tend to work okay when sampling if you omit it, just can be a bit more random. You can of course always just feed the backend stuff directly, or better yet, make it dump post-tokenization.

>>100365950
yeah, dude, it REALLY is! when you think about it, EVERY model is really 128k ctx. they just all get retarded after 4k!

>>100364847
>>100364890
Strange that the only benchmarks they display are for the 8B model. The 34B might not be worth bragging about.

>>100365958
my llama3miku is coherent at 16k
cope
>>100365985
What is the point of releasing code models with such small context sizes?
Is the new llamacpp flash attention implementation supposed to make token generation slower? I'm offloading if that matters.
>>100365947
>https://huggingface.co/InferenceIllusionist/Llama-3-70B-Instruct-Storywriter-iMat-GGUF
Thank you for your service.
>>100366023
I think the main point of it is to significantly reduce the vram cost of context
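Back-of-envelope for why context costs VRAM at all: a KV-cache size sketch, assuming llama-3-8B-ish shapes (32 layers, 8 KV heads, head_dim 128, which are assumptions you should adjust for your model) at fp16.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elt=2):
    """Rough KV-cache size: the K and V tensors (hence the 2x) for
    every layer, KV head, head dimension, and context position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elt

# llama-3-8B-ish at 8k context, fp16: exactly 1 GiB of cache
gib = kv_cache_bytes(32, 8, 128, 8192) / 2**30
```

Quantizing the cache to 8 or 4 bits per element scales this linearly, which is what the 4-bit KV cache discussion below is about.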
>>100366019
I guess ok for the VRAM-poor, but Llama3 70B Instruct is still the best.

>>100366037
I know about this, but it kills my speed. Maybe it's supposed to be used only if you can fully fit the model in GPU. Sad

>>100366023
>>100366067
Are you using koboldcpp or an older CUDA version?
>>100366067It would have been a whole lot better, had niggernov merged 4-bit KV caches a few months agoBut it'll happen... maybe... probably. I haven't seen an open PR for it
>>100366084
I'm using koboldcpp, the experimental branch. CUDA version is 12.4
*pulls my 4.5 inch penis out and growls huskily.* "who wants to be my kitten?"
>>100365898
The Mikupad code re: token counting is really hand wavy. It just multiplies characters by a constant (honestly a smart move).

>>100365192
Nothing in terms of models. But there are already quant methods that would let you fit 70B as a ~24 vramlet. They just aren't implemented. So I guess that would be the only development possible without surprise models being announced. desu we have time because L3 70b kind of sucks anyway and needs to be de-slopped, who knows how long that will take? For 8B I think the instruct is too over-baked to be useful so people would have to make new finetunes from the base, but the overbaked instruct is where all the applause comes from, so it's mixtral all over again

>>100366230
what's with llama3 and claude and >husky voice, huskily, and so on? what the fuck awful corners of the internet did they train on that had so much of that? they shouldn't have filtered nsfw, because somehow that didn't get filtered but the good shit probably did

>>100365857
Not the case with Intel and AMD GPUs, especially with max VRAM allocation on kernel 6.x.xx. Not sure if it's an Nvidia thing though, but it would make sense given their focus until recently.

>>100366254
>would let you fit 70B as a ~24 vramlet
After using older 2.4bpw and now some 2bit ggufs (with a bit of offloading) I don't think it is worth it.
The feeling I got is that command-r and mixtral are better at those 3.5 to 4 bits you can run them at. 2-bitting is too much brain damage. Maybe that lora anon will make them usable, cause offloading a bit isn't that bad. Also maybe this would be good:
https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-Instruct-AQLM-2Bit-1x16
If you could actually run it.

>>100364830
MM is a cold fish. Gives two sentence responses and gives up.
>>100366269
*chuckles darkly with a mischievous glint in my eyes* "i don't know. maybe..." *grins wickedly* "...we should tackle this conundrum together?" *i hope against hope that you'll reveal your true desires and succumb to my cunning plan... because what i truly desire is to journey into the future hand in hand, forging an unbreakable bond*

>>100366329
promptlet fucking retard mongoloid alert
>>100366314
I was talking about new methods like that, hqq+ or quip#, including possibly ones that involve additional finetuning like lora anon's. Given the findings that models max out at learning ~2 bits per weight anyway, there's no reason why this shouldn't be possible. It just needs work, mostly backend and coding work, not that much compute. Between that and L3 70B finetuning we don't really need new models, we need to use what we have. Unless we get bitnet or some kind of major advance like Lecun's energy thing.
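The weights-only arithmetic behind "fit 70B as a ~24 vramlet" (this ignores KV cache and activation overhead, so treat it as a lower bound):

```python
def weight_gb(n_params_billion, bits_per_weight):
    """Weight memory in GB (decimal): params * bits / 8 bits-per-byte."""
    return n_params_billion * bits_per_weight / 8

weight_gb(70, 2.0)  # 17.5 GB: squeezes into 24GB with room left for cache
weight_gb(70, 4.0)  # 35.0 GB: needs offloading or a second card
```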
>>100366408
Skimming through the paper, they claimed that their implementation can provide significant speedups on the CPU
Perhaps we won't need to vrammaxx in a few weeks/months, given enough optimizations
>welcome to the midnightmiqu shill thread!
>>100366338Reading this made it finally click for me. LLM cooming is doomed. It will never be good. It is just a description of sex. There is one answer. Genitals being rubbed. LLM can't write about it in different ways. There is one answer. There is only one answer. You can't make thousands of answers, when there is one answer. Now I am back to 2MW awaiting infinite context so I can get a waifu and hear her telling me she loves me over and over again.
>>100365192
In about two months we will get mistral 2 7b much more powerful than llama 3 8b and then a mixtral 2 based on it not long after.

>>100366408
>Perhaps we won't need to vrammaxx in a few weeks/months, given enough optimizations
Right now the optimizations are so good that it throws pip install aqlm after you do pip install aqlm.
2 days after I started using midnightmiqu 1.5 my piss turned red.
>>100366450
Quant?

>>100366460
gguf q8 of course.

Something interesting I'm noticing with adding random text to the beginning of context.
When I added "Fuck you.", the token probability of the right answer to the riddle jumped by like 30%, making it get it right. Doing it in all caps only made it jump up like 5%. Putting "..." instead, the probability jumped 80%. Could models become more intelligent just by adding some filler token(s) to the front of the context?

>>100366464
What kind of monster rig are you running? 3xP40?

>>100366269
Trannies?

>>100366492
>>100208151
https://arxiv.org/abs/2404.15758
Though the paper claimed you needed to train the model to do it.
guys, i'm new to this. i wanted to learn about prompting, are there any resources y'all can recommend for me?
also, i wanted to ask: when the prompt uses things like <|im_start|>, are they a specialized token, or does the model just tokenize them like usual and figure out to treat them differently internally?
i mean, if they are tokenized like any other, i could just write whatever i want inside these tags, right?
>>100366521
good mrning ser

>>100366359
>a model so good that you need a carefully crafted system prompt before you even insert a chat json
Gay.

>>100366521
https://docs.cohere.com/docs/prompting-command-r

>>100366552
Based. Let the newfag start from hardmode.
>>100366416
it's not hard to get variety if you're not a promptlet. i literally NEVER see anything like that. i just meme it. made this card in 1 second.
here's the card description: You are a 18 year old female with blonde hair.
Describe in vastly different ways to describe your character stroking a cock in first person. Each description should be 1-3 sentences, the sentence may be as long or as short as you want.
it's just prompt diff. prompt better if you see shit like that.

Someone needs to make something like this but in chan format instead lol
https://only-bots.ai/

>>100366251
Huh. I thought it was calling the backend or something, since the token count doesn't update if there isn't a backend connected.
Is it just me or is Wiz 8x22 incredibly dommy? I feel like it really spreads its wings when it's talking to dominant characters.

>>100366572
>grins malevolently
RETARD!

>>100366645
You've never seen someone menace with a grin before?

>>100366645
that's not mischievously! *grins malevolently*
I don't know what you guys are talking about right now. *grins neutrally*
>>100366613
A bit opposite experience here, Wiz 8x22 did well, but Command-R did far better. Wiz seemed to ask far too much for consent while C-R just did it.

>>100366572
>sending...
OOOH THERE WE GO
>... pain
oh

>>100366492
Something like that happens when you try a benchmark question with and without the model's prompt format, but in the end it tended to average to the same score when run through the complete set. I think I tried with Miqu and the Arc benchmark.
Anything better than Mixtral 8x7b for 16gb vram? I don't keep up with new shit that often
>>100366251
>>100366610
Mikupad calls the backend API for token counting. The only time it multiplies by a constant is when it needs to convert from token count to character count.
>>100365898
Mikupad adds 1 to the token count to account for the BOS. However, it's very likely that at some point llama.cpp started returning the token count with the BOS already included.
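Toy model of that off-by-one (purely illustrative, not Mikupad's actual code): a frontend that unconditionally adds 1 for BOS double-counts once the backend starts including it, which would explain an empty prompt reading as 2.

```python
def frontend_count(backend_count):
    """Frontend adds 1, assuming the backend did NOT count the BOS."""
    return backend_count + 1

# If the backend now reports 1 for an empty prompt (the BOS itself),
# the frontend shows 2, matching the observation above.
empty_prompt_reported = frontend_count(1)
```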
>>100366798
ic ic

>>100366450
People really like miqu (including midnightmiqu), I found it to be retarded and not good. Lumimaid llama 3 absolutely mogged it imo.

>>100366973
I bet you have a small penis.

>>100367012
kurisufag...

>>100366973
70B or 8B?
what's the best decensored llama3 8B finetune currently
>>100367065
base llama3 with proper prompting

>>100367065
LLaMA-3-8B-Instruct
Sorry to break it to you all but WizardLM-2-7B is vastly superior to Llama3-8B. It's not even funny how retarded L3 is.
>>100365715
HAM hobbyists are such faggots kek

>>100367157
>>100367204
what about the orthogonalized ones?
hey we all hate gradio in here, right? take a look at this shit. sort by another category, then go back to sorting by rank, and it sorts it FUCKING LEXICOGRAPHICALLY lmao
does mergekit work with llama3? is it compatible with the llamacpp quant update? I want to do a simple slerp of 2 8B models i like but the quant conversion script is giving me a "FileNotFoundError: Could not find a tokenizer matching any of ['spm', 'hfft']" and im wondering if its a skill issue, or possibly an issue with the models im using, or if its just not doable yet with the current releases
/g/entoomen why aren't there any local models for music yet? There are tons of sites that look like they use some kind of proprietary big model, so a local model can't be far off, can it?
>>100367491
llama.cpp dev anon will train one with his 10x 4090, trust the plan.

Asking again as no one seemed to know last time: how much more vram does L3-70B require for training vs L2-70B?
I can comfortably train a qlora on L2-70B but run out of vram on L3-70B. 2x3090.

>>100367712
Huh? It should be the same requirements.
i've yet to see anyone post anything good from l3.
>>100366719
CR+ is just so good. Trying L3 storywriter now and seeing how it compares, but CR+ just hops into anything I throw at it with surprisingly little context

>>100367749
It totally isn't. I wonder if the bump in token count has something to do with it...

>>100367414
--share

>>100367882
While I wouldn't call it perfect, for literally 0 effort the output is fine? For example: >>100363023
I've seen both better output from it and other models, but I think nobody can say L3 is bad while being honest?

>>100367712 (me)
This is with axolotl btw.

>>100360206
it's a llama.cpp bug, again
https://github.com/ggerganov/llama.cpp/issues/7006

>>100367062
70b. It's just good.

>>100366329
dude you are clueless, MM is known for having ultra long responses, it's a fact, and the challenge is to make her say less
>>100368064
>>100367958
i said anything GOOD, not anything serviceable. you can get that kind of output from any model 7b+ in existence released in the past 6 months. i'm saying for it supposedly being a shilled 'claude haiku sidegrade', it's just... meh. it's ok.

>>100368128
I don't think it's anywhere near opus tier, and I've seen places where it performed better and worse than gpt-4 for story and erp/chat. In my experience the 7/8b-s are not anywhere as creative as the 70b and make dumber mistakes though. The big cloud models' main advantage is the smarts and sometimes the writing (for example opus? good dataset, not excessively censored in how it was tuned). It should be close to haiku/sonnet though?

>>100368128
>i said anything GOOD
Here:
>>100294353
>>100315340

>>100368243
>when you see migu on stage
>>100366973
llama 3 is 8k context
miqu is 32k context
they are not even comparable. What happens at 0 context is irrelevant. A good prompt can have 2k tokens, then you add some chat and llama 3 is done after 50 messages, while miqu can remember 200+
The meme rope theta context extended llama3 slops are a joke for anything outside of passing the needle in haystack benchmark, they are useless for chatting.
miqu at 22k context (not even full potential):
Adding last 209 messages, starting from 188
Adding 35 pinned messages
PROMPT (20524): ...
llama3 at 8k context (doesn't fit):
Adding last 27 messages, starting from 363
Adding 57 pinned messages
PROMPT (10333): ...
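The message-budget arithmetic above, sketched (the token counts are made up; real frontends count each message with the tokenizer):

```python
def fit_messages(message_token_counts, budget):
    """How many messages, newest-first, fit inside a token budget."""
    total = fit = 0
    for n in reversed(message_token_counts):
        if total + n > budget:
            break
        total += n
        fit += 1
    return fit

history = [100] * 400          # 400 messages of ~100 tokens each
fit_messages(history, 8192)    # llama-3's native window: 81 messages
fit_messages(history, 22000)   # miqu-style window: 220 messages
```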
I got these two fresh images from here btw >>>/v/675404250
>>100368275 did you see there was an over 200k (possibly 600k) context extended finetune with perfect scores, did you try it yet?
>>100368223We want to encourage anon not scare him away.
>>100368275>The meme rope theta context extended llama3 slops are a joke for anything outside of passing the needle in haystack benchmark, they are useless for chatting.Nope, it was tested with RULER too and it had a good score.https://github.com/hsiehjackson/RULER
>>100368223>body betrays her>shivers down spine>low whispers>nibbles on earsame old same old if you ask me
>>100368275Huge cope, llama 3 ropes up very well, and like another anon said, longer context tunes work almost flawlessly due to the architecture.
>>100367712>>100368042If unsloth works with multiple GPUs, try that. There are lots of little optimizations it does that together save a lot of VRAM.Alternatively (shameless shilling), try qlora-pipe. I tested it just now, and was able to train rank 32 qlora on llama3 70b at 2048 context length on 2 4090s. The first GPU only used 21GB, second GPU 23.5. So it's not perfectly balancing memory use (probably because huge vocab in llama3 makes the backprop on the lm_head use more VRAM). If you messed with how it splits the layers between the two GPUs I bet it could go up to 4096 sequence length, or slightly higher lora rank.
>>100368203
To be fair, if the API prices are anything to go by, L3 70B is supposed to be a Turbo and Haiku sidegrade. Sonnet and GPT-4 are way, way more fucking expensive.
>>100368392
Unsloth does not appear to work on multi gpu. Will test qlora-pipe, thanks!
>>100368376
Nah, the older models were more bland. These outputs have good moments.
>>100368223
Linking the same output three times in a row doesn't make it any less shit.
>>100368321
>>100368368
you mean like these?
https://huggingface.co/gradientai/Llama-3-70B-Instruct-Gradient-262k
https://huggingface.co/gradientai/Llama-3-70B-Instruct-Gradient-1048k
it's rope theta slop, i tried it.
>We trained on 34M tokens for this stage, and ~430M tokens total for all stages, which is < 0.003% of Llama-3's original pre-training data.
>0.003%
the model remains retarded: it repeats itself, repeats what the user said, and forgets instructions at the start. It only works in artificial benchmarks. The only difference from the original 8k is that it doesn't just start outputting a soup of random symbols after 8192 tokens, but the intelligence is not there, while miqu will actually use that context, e.g. hinting at something that happened 200 messages ago on its own based on story context, without specifically being asked what happened 200 messages ago. That's the difference between being trained on large context from the get-go and being a slop finetune.
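For reference, the "rope theta" knob these finetunes turn is just the base of the rotary position embedding (RoPE) frequencies. A minimal sketch of the idea (Llama-3 ships with rope_theta = 500000; the larger value here is purely illustrative, not Gradient's exact setting):

```python
def rope_freqs(head_dim, theta):
    # Per-pair rotation frequencies used by rotary position embeddings (RoPE).
    # The rotation angle applied at position p for dimension pair i is p * freqs[i].
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

stock = rope_freqs(128, 500_000.0)        # Llama-3's stock base
stretched = rope_freqs(128, 8_000_000.0)  # illustrative larger base

# A larger theta shrinks every frequency (except the first, which is always 1),
# so positions past the original 8k window rotate more slowly and stay
# distinguishable -- that's the whole context-extension trick being argued about.
assert all(s < f for s, f in zip(stretched[1:], stock[1:]))
```

Whether the model can actually *reason* over those stretched positions without further training is exactly the point of contention above.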
>>100368425
It is aging like fine wine.
>>100368432
The RULER benchmark proved you wrong, though.
>>100368482
Is your wine made of milk?
>>100368482
>le benchmark
i don't care, i have my 400 messages chat and switch between different models. Miqu handles it, Llama 3 doesn't.
>>100368432
>That's the difference between being trained on large context from the get go
Miqu is a llama 2 finetune, which is natively 4k context. Nobody knows exactly what Mistral did, but my guess is continued pretraining of the model at 4k-8k sequence length, followed by one round of context-extension fine tuning to 32k, followed by instruction fine tuning. Nobody fully trains from scratch at 32k. The existing long-context extensions of llama 3 simply aren't doing a good job, or aren't using the right datasets/techniques. But in general that's how everyone extends context length.
>>100368432
A few billion tokens of finetuning is usually sufficient for near-flawless long context, as some Meta paper showed before, but I can't say much as I personally don't have enough memory to run such long contexts. The instruct models overall are overbaked and have a repetition issue, but you can mitigate it by using rep pen or the DRY sampler.
Most models, including the biggest ones, will favor recent output over the oldest (usually the first lines like the system prompt, plus the recent stuff, are favored). But if you were to prompt it to recall something from the middle of the context, does it fail? Because I do not expect that to fail.
>>100368525
Enjoy your placebo!
>>100368533
it doesn't fail if you prompt it - that's "needle in the haystack". E.g. at 09:45AM 6th May the character says "I like donuts". If I prompt it to tell me what the character said at 09:45AM 6th May, it will say "I like donuts", but if i prompt it "what is the character's favorite food" it will hallucinate.
>>100368572
it may as well be, i didn't run 100x identical tests, this is just anecdotal evidence.
>>100368530
>The existing long context extensions of llama 3 simply aren't doing a good job, or aren't using the right datasets / techniques.
They simply aren't doing anything; it's just some companies taking credit for how well the original model scales by changing the rope theta.
>>100368619
That's nice and all, but RULER proved you wrong.
>>100368243
omg it migu panties
>>100368666
lemme see if i can reproduce this example now with my 400 messages chat again
>>100368533
Honestly, I've seen both L3 and even the biggest cloud models fail that test, where they had forgotten subtle facts from a few paragraphs ago. You can of course go "do you remember what you wanted to do earlier, why didn't we do that" (+ some hint as to how early) and it will go "OH" and realize it. Sometimes it fails badly; I've seen even the biggest "long-range context" models (e.g. Claude) fail at it, and gpt-4 too, and llama too. But I've also seen them all succeed at it too, so YMMV?
>>100368417
i mean i can get 'good moments' from a 7b.
>>100368744
as a fellow poorfag, there's something about mistral 7b's prose that just irks me.
>>100368744
No, because that's extremely verbose and bland. It's hard to read.
>>100368791
>>100368771
that's actually l3 70b lol
>>100368829
Congrats on your slopped system prompt.
>>100368271
Is that why front row seats cost more?
>>100368704
I've used Claude a LOT. Its advertised 200k context does not apply to this use case. 200k worth of tokens covering a set of documents - yeah, it can probably do some QA on that. Maintaining a character over a prolonged role-play that relies on picking up subtle hints and characterization over a long-form log? Nah.
From my testing, around 12k tokens of context it starts to get mixed up in minor ways (forgets things, becomes less adept at picking up subtle hints, starts adding contradicting information to the log - that sort of thing). At 16k-32k it becomes worse. Still relatively minor, but definitely noticeable. Past 32k it can get schizo. I've had characters completely alter their personality from message to message, alter their speaking style, forget major plot elements, forget even minor, relatively recent developments (such as the current location we are in now).
I just limit the context to 16k when talking to Claude. Past 16k the experience becomes too frustrating and immersion-breaking, plus the speed takes a big nosedive. I just use summarization and an array of memories on each character. Works a lot better that way. The model's intelligence also takes a hit, by the way, even outside of the basic forgetting stuff. It's not full retardation, but again - noticeable.
It's a real shame. I've been experimenting a lot with a custom local RAG flow. CommandR+ is actually incredibly solid all the way up until 32k. CommandR+ at 32k-64k is like Claude at 16k-32k.
>>100368306
I saw another post of this on twitter too. Don't know what to make of this.
>>100368885
Yes, and they're quite obviously worth it.
>>100368899
Nah, you're just a deranged NAIshill. Claude can make perfect use of the context.
>>100368900
huge if true
>>100365296
>that new rep penalty magic method the name of which i forgot.
DRY. It won't help with swipe variety if the reply is novel, since it only works when the model outputs a sequence of tokens that's already in the context.
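The mechanism being described can be sketched like this (a toy version of the DRY idea; parameter names and the exact penalty formula here are illustrative, not the real sampler's):

```python
def dry_penalty(context, candidate, base=1.75, multiplier=2.0, min_match=2):
    # If appending `candidate` would extend a token run that already occurred
    # earlier in the context, penalize it, growing with the length of the
    # repeated run. Novel continuations are untouched -- which is exactly why
    # DRY doesn't help swipe variety.
    best = 0
    for i in range(len(context)):
        if context[i] != candidate:
            continue
        # how far the tokens before position i match the current context suffix
        n = 0
        while n < i and context[i - 1 - n] == context[-1 - n]:
            n += 1
        best = max(best, n)
    if best < min_match:
        return 0.0
    return multiplier * base ** (best - min_match)

ctx = ["she", "shivers", "down", "her", "spine", ".", "shivers", "down", "her"]
print(dry_penalty(ctx, "spine"))  # extends a 3-token repeat -> penalized
print(dry_penalty(ctx, "arm"))    # novel token -> 0.0
```

The penalty would then be subtracted from (or divided into) the candidate's logit before sampling.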
>>100368930
I don't use NAI. 8k context on a 14B is not enough for my use case. I use GPT, Claude, and the usual array of locals. GPT by far best maintains intelligence and context awareness over long contexts, but then you've got GPT prose. Miqu is also VERY solid up to 32k. Claude is still the "best" model, but to claim "it can make perfect use of the context" you're either retarded, ignorant, shilling, or some combination of the three.
>>100368930
>/aids/ schizo is calling random /lmg/ers NAIshills again
You gonna shill your shitty malware again?
>>100368976
I have seen your post in /aids/, NAIshill.
https://arch.b4k.co/vg/thread/475781740/#476056521
>>100369015
And there's the raid-inciting crosspost.
>>100368930
get the fuck out with your unrelated nonsense
>>100369035
Keep spreading propaganda against Claude, NAIshill. Claude has perfect context.
>>100369015
Not your army. Take your hate boner for /aids/ somewhere else.
>>100369015
That's not me, and the writing style is completely different. They're just noticing the same issue.
Ideaguy time. Remember tree of thought, or the /lmg/ version, tree of big niggas? The benefit was from considering a diversity of options. Seems you would do even better if each alternative was presented by a different model. Kind of the original information-theory mixture-of-experts concept (not the LLM-specific router MoE, obviously).
Too much memory to be worth considering, cpumax anon aside. But if you could get a small group of people together, they could all share this setup: each would host one of the models, and either summarize on their own or have one be the dedicated summarization model. The communication is just the text output at the end of generation, so communicating over the Internet is fine. Would hardly even need anything implemented. You could have like miqu, l3 (or the Nvidia fine-tune), cr+, dbrx. Presumably you don't want to bother with a fine-tune plus its base, although maybe that would still be helpful. Maybe even cloud models too.
If I ever convince any of my friends to get seriously into local I'll give it a try. Well, I guess I could easily do (slow) evaluations of this idea myself, huh? Maybe I will.
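Since each participant only ships final text, the plumbing really is trivial; a minimal sketch where each hosted model is just a callable string-to-string function (e.g. a thin wrapper around whatever endpoint a friend exposes -- the prompt wording is made up):

```python
def committee(prompt, models, judge):
    # models: list of callables str -> str, one per hosted model.
    # judge: callable that reads all drafts and picks/refines the best one
    # (could be the dedicated summarization model, or one of the same models).
    drafts = [ask(prompt) for ask in models]
    ballot = prompt + "\n\nCandidate answers:\n" + "\n---\n".join(
        f"[{i}] {d}" for i, d in enumerate(drafts))
    return judge(ballot + "\n\nPick and refine the best candidate answer:")

# Stub "models" just to show the plumbing; real ones would hit miqu, l3, cr+, dbrx.
demo = committee("2+2?", [lambda p: "4", lambda p: "five"], lambda b: b)
```

The design point is that only the ballot text crosses the network, so latency and bandwidth between participants barely matter.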
>>100369061
No, it's just you. /aicg/ doesn't have that problem. I don't have that problem. And Claude has perfect scores in benchmarks. You're the only person spreading this.
Now go back to /aids/.
>>100369086
/lmg/ - Local Models General
Then again, it doesn't surprise me that you're illiterate, given that slop you call output.
Stop feeding schizos, anons.
>>100369104
They're the best logs posted in the entire thread, and they're fun to read.
Any thoughts on Qwen 110B? It gets 3rd place on creative writing for EQ-Bench and is pretty decent on the normal EQ-Bench.
>>100369114
you mean this slop over here? >>100368899
>>100369117
Yes, that benchmark doesn't work.
>>100369086
I agree with the guy claiming Claude has its limitations, same as every LLM out there. If Claude or GPT-4 didn't have that "problem", your prefill jailbreaks wouldn't work, where you spam 1k-token system prompts or long replies for the assistant role. Anyway, this behavior is well known and documented. There are papers analyzing how much GPTs pay attention to the context, and pretty much most of them pay attention to the start (sometimes more strongly if you tune for that) and most strongly to the most recent lines. There are exceptions to this depending on the type of positional embedding used, but for most models this applies. Of course most models that are trained for long context can reference back to arbitrary points, but to expect them to not forget small details in the middle of long contexts is silly; the best you can hope for is that it gets the "gist", so to say.
>>100369117
>chink_article_about_chink_models_cheating_benchmarks.html
>>100369117
no one gives a fuck about those mememarks.
>>100369147
>If Claude or GPT-4 didn't have that "problem", your prefill jailbreaks wouldn't work where you spam 1k token system prompts or long replies for the assistant role.
Well, you don't do that with Claude. So maybe try again with a real argument? It sounds like you have never used it.
>>100369117
>creative writing
>evaluated by LLM
You have to be a black niggerlicious retard if you take it seriously.
>>100369147
Good argument, anon; the only flaw is you directed it at a zero-IQ retard.
>>100369194
There's no argument. Just propaganda.
>>100369178
I've seen people do rather long jailbreaks for it; I'm not sure what the minimal size for a jailbreak is, though. Again, it depends on which Claude version you're talking about. I've tried most, and typically a few-hundred-token jailbreak, or else it may choose not to write lewd stuff, but once it gets started it does it well.
>>100369214
You only need a sentence in the prefill to jailbreak it, and it has nothing to do with long context length.
>>100369239
Eh, I've seen refusals before with most models including Claude if you use very minimal jailbreaks; of course you can just edit the response after, or regenerate. I can't recall ever seeing refusals with longer jbs.
>>100369246
If you didn't have the ability to use the prefill and put words in its mouth, you wouldn't jailbreak it.
>>100369255
It can work without it, with long system contexts or by trying multiple turns. Give it a go sometime; it's just a bit less reliable. Ultimately jailbreaks are a natural consequence of ICL (in-context learning) working, and of the fact that these are all next-token predictors.
>>100369191
not him, but the sample texts for each model are a pretty easy way to evaluate their creative writing skills and styles without having to download and run dozens of large models yourself.
you're right that the LLM evaluations can be pretty dumb, so the scores/ranks on the leaderboard itself don't really work too well.
>>100369246
Simple one-sentence prefill works for everything with Claude in my experience. I do start getting refusals past 16k on stuff that it had no problems with earlier in the context. Typically just a single regen takes care of it. I never see a refusal under 16k context. The 16k-32k range is like you're switching to a different model, basically. Intelligence and recall both take big hits. At 0-16k context it's the smartest model for RP out there, and it's not even a contest. Past that, I have to either switch models, constantly do editing and error correction, or just do the array-of-memories thing and roll the context up.
>>100369344
>I do start getting refusals past 16k
>I never see a refusal under 16k context.
No, it's backwards. The more context you have, the easier it is to jailbreak. Jailbreaking with no context is the hardest thing.
You're just making shit up.
>model A
>traditional sampler settings: coherent
>dynatemp/smoothing: coherent
>model B
>traditional sampler settings: coherent
>dynatemp/smoothing: totally schizo
What causes this? Why do some models seem to "not like" the exotic sampling methods?
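One plausible answer: entropy-scaled samplers map the flatness of the model's raw distribution to a temperature, so a model whose raw distributions are natively flatter gets pushed hot everywhere. A minimal sketch of the dynamic-temperature idea (the min/max values and scaling are illustrative, not any backend's exact defaults):

```python
import math

def dynatemp(logits, t_min=0.5, t_max=2.0):
    # Softmax over the raw logits.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Normalized entropy in [0, 1]: 0 = fully confident, 1 = uniform.
    ent = -sum(p * math.log(p) for p in probs if p > 0)
    frac = ent / math.log(len(logits))
    # Confident distributions sample cold, flat ones sample hot. A model whose
    # raw distributions are already flat lives near t_max, i.e. "totally schizo".
    temp = t_min + (t_max - t_min) * frac
    exps_t = [math.exp((l - m) / temp) for l in logits]
    zt = sum(exps_t)
    return temp, [e / zt for e in exps_t]

peaked_t, _ = dynatemp([10.0, 0.0, 0.0, 0.0])  # confident model -> cold
flat_t, _ = dynatemp([1.0, 1.0, 1.0, 1.0])     # flat model -> hot
```

So two models can behave identically at a fixed temperature yet diverge wildly under dynatemp if their typical output entropy differs.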
>>100369066
Do it.
>>100369344
I do have a system prompt of legit instructions that's usually ~1k tokens. The card is usually 500-1000 tokens. The last couple of weeks I've been tinkering with an elaborate RAG pipeline that pulls relevant samples of fiction writing from pinecone and primes (and continually updates) the context with ~4k tokens of relevant text. My DB includes books on fiction writing, erotica, psychology, and symbolism. But with or without the initial RAG pull - 2k context of legit instructions and definitions and a prefill (I use an open XML tag, which seems to work better than the "OK no problem, here is my response" approach) - I never see a refusal under 16k.
>>100369380
Yes, this is counter-intuitive. But it's real. The only "jailbreak" part that addresses anything related to censorship is my prefill. The rest is all relevant context and instruct.
>>100369417
It's not real. You're just mentally ill.
>>100369066
MoBN
https://huggingface.co/Sao10K/L3-Run1
sovl - trained on heavily filtered c2 logs lmaooo (800k dropped to 8k entries)
keeps in char well, is horny, you may need swipes as it's a smol 8b model
YMMV
>>100369801
what a slopjob
Llama 3 70B is actually really good. It seems so human-like with the way it interjects about how the story is so depressing and it wants to end the story and change to a different, happier story.
>>100369914
unprompted, llama 3 told me to seek help from a therapist after a few messages.
>>100369914
base model and not instruct?
>>100369943
I'm using the instruct with 262K context from gradientai.
I don't know how you subhumans turn something like breaking character and morally lecturing you into a positive. L3 shills are just literally braindead.
Creepy miku archivist here. I have just found a few more of his pics that were previously lost to twitter's incomplete timeline view (it doesn't perfectly show you all of someone's posts; fuck you, twitter). In addition, the archive now has more NSFW, due to me finding out there was an actual tag for artwork related to this specific doll. It also has more photos/media that other people took of the doll. Might upload tomorrow.
>>100369914
>refusals
>good
Do you like getting cucked and cockblocked? What the fuck is wrong with you?
>>100369978
Mind sharing settings? I'm getting repeats with anything involving instruct.
>>100369990
>cuck model has cuck fans
>>100369993
Looking forward to it.
>>100370027
I have rep penalty at 1.03, which I think is the only one that really matters. I've tried playing around with and without smooth curve, and different temps.
>>100370066
creepy...
>>100370066
What about for context/instruct? I've been running baseline ST settings and it's been causing problems.
>>100369801
>literally only 1% is usable
I bet if you manually curated the remaining logs it'd drop to 1k.
>>100369338
>>100369191
There is nothing wrong with LLM evaluations. If a human did it, you would accuse them of being biased. And no human could possibly rate that many samples; it's tedious as fuck if you have ever tried it. You end up skimming and missing the countless subtle mistakes the LLM makes, which make all the difference between great and shit models.
Yes, the LLM judge will miss things and make errors, but so will humans, maybe more so. But over thousands of samples, errors will average out. It's only important that the judge's ratings correlate with better writing, not that they be literally perfect every time.
And it's not like the judge is chosen randomly. They have a separate, more interesting competition for judging. The judges are scored based on how well their ratings correlate with the arena score and EQ-Bench, and how well they are able to identify which model wrote a given story. If a better judge model comes out, they will switch to that model.
The best judge, Claude Opus, gives ratings that have a 93% correlation with the lmsys arena leaderboard. So whatever it is measuring is at least strongly correlated with the other standard benchmark of model quality. That is a crazy high correlation. Lmsys's own automated benchmark is only 1% higher than that.
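For the curious, that 93% figure is just a correlation coefficient between two score lists over the same set of models; the computation, with made-up illustrative numbers (NOT real leaderboard data):

```python
def pearson(xs, ys):
    # Pearson correlation between two score lists, e.g. a judge model's
    # ratings vs arena Elo for the same set of models.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Five hypothetical models scored by a judge (0-100) and by arena Elo.
judge = [78, 71, 65, 55, 40]
elo = [1250, 1190, 1160, 1100, 1020]
r = pearson(judge, elo)  # close to 1: the two rankings largely agree
```

A high r only says the judge orders models similarly to the arena, not that either measures writing quality in the absolute.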
>>100370084
What am I supposed to do with the archive I already downloaded?
>>100370072
I use this for both. Silly Tavern added it recently; not sure if it's in the main branch since I pull from staging.
>>100370088
I'll be uploading to the litterbox again, so you can just delete the entire old one. I think I changed some filenames, so you should not keep it anyway.
>>100369914
>it interjects about how the story is so depressing and they want to end the story and change to a different happier story
huh, that sounds similar to Claude
>>100365146
Is there a way to guarantee you get gpt2-chatbot on that site? It seems to be random.
>>100370052
fearing for my life with miku
>>100370077
It's my main work right now. I'm curating and manually cleaning and editing the best entries; the filtering is done. This was just to see if the logs would work, and it did what I wanted. The final cleaning is the hard part now.
>>100369914
>too dumb to make interesting outputs
>just smart enough to subtly steer it towards gptslop directions
Gayest shit ever
Wholesome Miku for a palate cleanser
>>100369990
>>100370141
samefag
we heard you the first time
>>100370084
>There is nothing wrong with LLM evaluations. If a human did it you would accuse them of being biased.
LLMs are even more biased.
>And no human could possibly rate that many samples.
Literal skill issue. Learn to read faster, fag.
>You end up skimming and missing the countless subtle mistakes the LLM makes, which make all of the difference between great and shit models.
You will certainly notice all the -isms and positive slop of the model and rate it lower, which LLMs don't notice.
>The judges are scored based on how well their ratings correlate with the arena score, eq bench
Arena was gamed multiple times (see Starling) and EQ-Bench is another multiple-choice mememark.
>>100370174
>a literal cybercuck defending cuck models
What color is your cock cage?
a model that just copy pastes entire paragraphs from previous messages cannot be considered good.
I am FUCKING tired of being limited by my GPU and waiting several seconds to get a response to jerk off.
Is there truly no dedicated hardware AI accelerator or something in the world which can make LLMs faster?
GPGPUs aren't fast enough for me.
>>100370464
Sure bud, you can spend 150k USD on a 7b machine, right?
>>100370275
>He doesn't deny the samefag
my radar is undefeated
>>100370484
>he keeps defending it like the cuck he is
What brand is your estrogen?
>>100370243
oh god, a reddit point-by-point. i sure hope i didn't make any spelling errors.
>LLMs are even more biased.
probably false, but it doesn't really matter or have anything to do with my point. People will not trust human evaluations. Even if you had a perfectly unbiased judge, no one would believe you and it would get shit on all the same. And you don't have a perfectly unbiased judge.
>Literal skill issue. Learn to read faster, fag.
Fine then, do it, faggot. I'm not volunteering to read 10,000 LLM-generated storyposts. I couldn't possibly find the time even if I wanted to.
And speed reading is total hogwash. It's proven they are just doing fancy skimming, missing important details, and not processing the information being read as well.
>You will certainly notice all -isms and positive slop of the model and rate it lower, which LLMs don't notice.
False. Opus rips overly positive GPT-3.5 Turbo outputs to shreds, pic related. Better models notice these things just like a human would, possibly better. And positivity bias is far from the only aspect that matters; basic failures to structure the plot, logical errors, non sequiturs, not following the prompt, etc. are all important.
>>100370243
>>100370535
>Arena was gamed multiple times (see starling) and eqbench is another multiple choice mememark.
Which is irrelevant. Even if those benchmarks are imperfect, they still correlate with better models on average. And the fact that this benchmark correlates highly with those benchmarks shows it's not random noise. It is measuring something that correlates with better models.
And you are just wrong here too. Starling got better at instruction following through enormous amounts of RLHF, so it did better in the arena, where that is a critical factor. It literally is a better model for what it was designed to do and what is being measured. But it never got anywhere close to the top of the leaderboard, because that is not enough. And multiple-choice benchmarks are not bad; I can't even imagine the thought process that led you to that retarded comment.
How to properly prompt Mixtral Instruct 8x7?
INST or ### Instruction?
And if so, where in sillytavern?
Is it possible to incorporate [[### Response: (ect.ect.ect.): (length = medium)] at all with INST??
I can adjust samplers all day, but it's all worthless if the prompt is wrong.
>>100370479
Do not bully me, anon. I'm just a poor hardware engineer, and I just want to coom.
>>100370614
>engineer
Here's somewhere to start. If you engineer a new way, let us all know.
https://rentry.org/Mikubox-Triple-P40/
https://rentry.org/V100Maxx
https://rentry.org/miqumaxx
the miqu shit is pretty reddit tier
>>100369066
Tree of thought is about breaking problems down into smaller pieces the models can solve, right? I think what you are describing is simpler than that: just having models generate different outputs and selecting the best?
>>100370639
The human brain is the most powerful computer. Simply read the model weights and run inference from your memory.
>>100370464
maybe combining multiple GPUs/computers together?
Remember when MM used to stand for MythoMax? Time flies, anons...
>>100366755
Mixtral Smaug, q4_0. I leave 2gb vram free and offload the rest.
Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models
https://arxiv.org/abs/2405.04233
>We introduce Vidu, a high-performance text-to-video generator that is capable of producing 1080p videos up to 16 seconds in a single generation. Vidu is a diffusion model with U-ViT as its backbone, which unlocks the scalability and the capability for handling long videos. Vidu exhibits strong coherence and dynamism, and is capable of generating both realistic and imaginative videos, as well as understanding some professional photography techniques, on par with Sora -- the most powerful reported text-to-video generator. Finally, we perform initial experiments on other controllable video generation, including canny-to-video generation, video prediction and subject-driven generation, which demonstrate promising results.
https://www.shengshu-ai.com/vidu
text-to-video model by what seems to be a spin-off ai company (shengshu) from tsinghua university (china's top AI one). their website doesn't load for me on the 2 browsers I tried, even when I used pia's china VPN (maybe it's blocked internally in china). tsinghua sometimes open sources their stuff (GLM models), so maybe they'll release this one later, after they scale up the model like they hint at in the conclusion.
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
https://arxiv.org/abs/2405.04532
>Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin. QoQ is implemented by the QServe inference library that achieves measured speedup. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores. Building upon this insight, in QoQ algorithm, we introduce progressive quantization that can allow low dequantization overhead in W4A8 GEMM. Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization. In the QServe system, we perform compute-aware weight reordering and take advantage of register-level parallelism to reduce dequantization latency. We also make fused attention memory-bound, harnessing the performance gain brought by KV4 quantization. As a result, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100, 1.4x on L40S; and Qwen1.5-72B by 2.4x on A100, 3.5x on L40S, compared to TensorRT-LLM. Thus, QServe effectively reduces the dollar cost of LLM serving by 3x.
https://github.com/mit-han-lab/qserve
from MIT. code ready. focused on reducing CUDA core overhead, so not sure how applicable it is for gamer cards. Johannes will probably get something out of it though.
I need technical help:
When I load a model split between my 2 gpus, everything works fine, but when I toggle row split, even if every other setting is the same, it spits an OOM error and crashes.
what do?
I don't think row split should change memory requirements...
>>100371132
I can't run it with row split on 2x3090 either.
I went back to mixtruct and it's really not that bad. All the recent 70-120bs made me forget that 8x7b is probably the best I'll run on 24gb.
2.4bpw is fucking shit, it's retarded as hell. q5km and q4km both run at 1t/s and are too slow to fuck with when they're still dumber than Sonnet.
So MoEs are probably the best for 24gb. I have no idea how to do this shit, but I'm thinking of reading up and self-merging llama3 8b, mistral v2 7b, and westlake v2 7b to make three 11bs. Then create a MoE of llama3 11b, mistral v2 11b, westlake v2 11b, and fimbu 11b: a 4x11b MoE that'd actually fit and run well at high quant on 24gb.
The DPO and rpcal shit are slop memes. So is the nous gptslop dataset. Any ideas what else I could throw in this?
>>100368392
Still experimenting with parameters, but I'm OOMing when trying rank 32 at 2048 context on my 2x3090 setup. Do you have a link to the toml config you used and/or the command-line call? I don't think I screwed up, but you never know.
>>100371201
sounds good, let us know when it's ready
>>100364633
>pic
Is that supposed to be Ollie?
xLSTM: Extended Long Short-Term Memory
https://arxiv.org/abs/2405.04517
>In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories, in particular they constituted the first Large Language Models (LLMs). However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule. Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.
neat, but really need to see it scaled further.
Without having a schizo breakdown, please, have any of the NAI text models ever been released or leaked? I know people got their hands on their SD 1.5 model at some point.
>>100371280
I think their llms were in the same leak as imagegen; you can probably still find the torrent in sdg. But those were pre-Kayra, pre-llama1 finetunes, as shitty as Erebus and Pyg.
>>100371212 (me)
Two things:
1. I realized I was accidentally trying to train on Command-R 35B. It OOM'd on this! No idea why.
2. Completely the opposite of your description, setting pipeline_stages to 2 from 1 made it able to load Command-R 35B into VRAM, and L3-70B as well. In fact, it's only using 17/18 GB. I have no idea what's happening or how much time is left, as all it's saying is "before GAS splitting, batch size: ..." (which I assume is once per iteration; if so, good speed!).
Would be nice to have a time indicator. Maybe I need to figure out tensorboard for this?
>>100371328 (me)
>17/18 GB.
17 + 18 GB, I mean.
>>100371295
That's a shame. Maybe I'll see if I can dig it up anyway to dick around with the old models again.
>>100371280
Googled because I swear I remembered hearing something about it, and found this: https://huggingface.co/NovelAI/calliope-legacy
So aside from the leak the other anon mentioned, they officially released a retired model too.
>>100364633
What is the best model I can run on runpod for roleplay? It's been xwin forever... Stood up to mixtral. Haven't asked this question in months, so there must be a new good one?
No meme flavor-of-the-month models.
>>100370140
Late, but good luck with that. I really enjoyed your other models.
>>100370140
Shouldn't you deprecate c2 in favour of c3? The quality is night and day.
>>100371420
They were all flavour-of-the-month models at some point, dumbass.
>>100370105
>hurr durr i renamed a couple files so now you have to redownload the entire 1gb archive all over again
Do you think bandwidth and drive space grow on trees?
>>100371444
Using logs from gpt or claude will just create those annoying -isms anons hate. Instead of finetuning on shit data, grab a collection of ebooks and use those. You don't have to use the entire books3 database, but books will be better than logs any day.
llama 3 responses are so short
>>100371493
Alright, which one of you is an author on this paper?
>>100371481
Opus isn't nearly as bad at that. It has -isms, but it's roughly on par with human prose. I've read random Opus logs; they're mostly passable even when {{user}} is an incurable retard - something that can't be said about Claude 2 logs.
>>100371493
The transformer-hater anon, I'd wager.
>>100370130
Use the direct chat tab at the top.
>>100371017
I'm aware of the way they do the matrix multiplication for quantized weights; I initially did the same thing in https://github.com/ggerganov/llama.cpp/pull/4801 .
The problem is that if you use only a single scale per row/column, it makes q8_0 worse than q4_0 (when using 8 bits for both the weights and the activations).
According to the authors, 4-bit quantization is "considered nearly lossless in terms of accuracy", and I definitely disagree.
At some point, once I've worked out the issues with e.g. FlashAttention, I want to revisit my int8 tensor core matrix multiplication implementation using the knowledge I gained from talking to an NVIDIA engineer. I think I'll be able to do it in such a way that it is actually nearly lossless, essentially the same precision as with mul_mat_q (labeled "llama.cpp int8 intrinsics" in the plot).
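To illustrate why the single per-row scale hurts: one outlier weight sets the scale for the entire row, crushing everything else to zero, while llama.cpp-style q8_0 gives each 32-value block its own scale. A toy sketch in pure Python (not the actual kernel code):

```python
def quantize_blocks(values, block_size=32):
    # One fp scale per block of 32 values, q8_0-style: int8 weights in [-127, 127].
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        amax = max(abs(v) for v in chunk) or 1.0
        scale = amax / 127.0
        blocks.append((scale, [round(v / scale) for v in chunk]))
    return blocks

def dequantize(blocks):
    return [scale * q for scale, qs in blocks for q in qs]

# One big outlier in a row of small weights.
row = [0.01] * 31 + [5.0] + [0.01] * 32

# Single scale for the whole row: the small weights all round to zero.
row_scale = max(abs(v) for v in row) / 127.0
per_row = [round(v / row_scale) * row_scale for v in row]

# Per-block scales: only the outlier's own block is affected.
per_block = dequantize(quantize_blocks(row))

mean_err_row = sum(abs(a - b) for a, b in zip(row, per_row)) / len(row)
mean_err_block = sum(abs(a - b) for a, b in zip(row, per_block)) / len(row)
assert mean_err_block < mean_err_row
```

The real format additionally packs the int8 values and stores the scales as fp16, but the accuracy argument is the same.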
vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
https://arxiv.org/abs/2405.04437
>Efficient use of GPU memory is essential for high throughput LLM inference. Prior systems reserved memory for the KV-cache ahead-of-time, resulting in wasted capacity due to internal fragmentation. Inspired by OS-based virtual memory systems, vLLM proposed PagedAttention to enable dynamic memory allocation for KV-cache. This approach eliminates fragmentation, enabling high-throughput LLM serving with larger batch sizes. However, to be able to allocate physical memory dynamically, PagedAttention changes the layout of KV-cache from contiguous virtual memory to non-contiguous virtual memory. This change requires attention kernels to be rewritten to support paging, and serving framework to implement a memory manager. Thus, the PagedAttention model leads to software complexity, portability issues, redundancy and inefficiency. In this paper, we propose vAttention for dynamic KV-cache memory management. In contrast to PagedAttention, vAttention retains KV-cache in contiguous virtual memory and leverages low-level system support for demand paging, that already exists, to enable on-demand physical memory allocation. Thus, vAttention unburdens the attention kernel developer from having to explicitly support paging and avoids re-implementation of memory management in the serving framework. We show that vAttention enables seamless dynamic memory management for unchanged implementations of various attention kernels. vAttention also generates tokens up to 1.97x faster than vLLM, while processing input prompts up to 3.92x and 1.45x faster than the PagedAttention variants of FlashAttention and FlashInfer.
from Microsoft (India). seems better than vLLM's PagedAttention. no link to any code, but a lot of their wording seems to imply it was made with open sourcing in mind, so who knows
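The core trick — reserve a big contiguous *virtual* range up front and let the OS attach physical pages only on first touch — can be demoed on the CPU with an anonymous mmap. The paper itself targets GPU memory via CUDA's low-level virtual memory APIs; this is only a CPU analogy, and the sizes here are arbitrary:

```python
import mmap

# Reserve a large contiguous virtual region up front. On Linux, physical
# pages are only committed when a page is first written (demand paging),
# so the reservation is cheap even if most of it is never used.
KV_BYTES_MAX = 64 * 1024 * 1024
kv = mmap.mmap(-1, KV_BYTES_MAX)  # anonymous mapping

# "Append" KV entries at growing offsets; the address space stays
# contiguous, so a kernel could index it with plain pointer arithmetic --
# no userspace page table and no rewritten attention kernel.
for step, token_bytes in enumerate([b"k0v0", b"k1v1", b"k2v2"]):
    kv[step * 4:(step + 1) * 4] = token_bytes

result = bytes(kv[0:12])
kv.close()
assert result == b"k0v0k1v1k2v2"
```

Same contrast as in the abstract: PagedAttention makes the *kernel* handle non-contiguous blocks, while vAttention keeps the layout contiguous and pushes the dynamic allocation down to the virtual memory system.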
>>100369801>800k dropped to 8k entriesBecause they were 800k responses, not conversations.
>>100371511>I've read random Opus logs, they're mostly passable even when {{user}} is an incurable retard - something that can't be said about Claude 2 logs.The difference is not really that big. What a moron.
>>100371692That implies that each conversation was roughly 100 unique messages, which is rarely the case with these logs.
>>100371738>The difference is not really that big. What a moron.It is fucking huge. What a moron.
anyone using ollama? how can i limit the number of tokens it gives in response?
Dumb question, anons: how can I find all the base models that aren't finetunes of another?
The only ones I know are the Llama and Mistral base models. What about all the other original ones?
Is there any info/guides on tuning Mixtral-8x22B? I wanna try my hand at making a limarpv3 version.
>>100371738>>100371752With thousands of messages to opus and sonnet, they're both the same except opus is better at complex prompts.Both are going to suck your dick the same way. It depends what you're using them for in roleplay.
>>100371744
Yeah, there are swipes. He implies some kind of unspecified quality filtering, which I doubt is real.
Sam Altman loves penis
>>100371804>He implies some kind of unspecified quality filtering, that I doubt is real.Why? Sao's been doing it for a while, I think he'd have a PoC quality script by now.
>>100371879Who knows, I just find his phrasing off-putting and dishonest.
>>100371756
There's only a handful of actors in the field: Grok, Databricks, Qwen. Yi if you're feeling generous. There may be other Chinese bases that are too Chinese to mention. I don't think Cohere released their base model, only instruct tunes.
>>100371794Have you used opus? The difference between sonnet and opus is huge.
>>100371804>I doubt>>100371896>Who knowslol
>>100371977Simp.
>>100371912
It isn't huge for prose, which wouldn't make much of a difference in whether a log is 'mostly passable' or not.
>>100372071We have more open Opus logs than Sonnet logs, anyway.
>>100371753just download llama.cpp and use that directly without a middleman sending your prompts to some china server
>>100371912
>Have you used opus?
Everyone's used it, retard; it's not nearly as exclusive as you like to pretend.
Opus isn't a secret club, it's literally a product. Anyone can get access to it by paying a few dollars.
>>100372185
It's also literally free if you know where to look.
I didn't bother replying to him because he's just being a retard. Sonnet and Opus have similar prose and write scenes almost the same way.
What I said in >>100371481 still applies. If you chat with gpt for a long time you'll see annoying isms. If you chat with claude, whether it's sonnet or opus, you'll see annoying isms. Unless you like those isms, using logs from either is a bad idea. It's better to use training data from ebooks.
>>100372257>ebooksEnjoy your humanslop.
>>100372297
Do you see what that chart tells you? It's saying to use novels published before the 1980s, or to prune the ones published after.
It is not difficult to use notepad++ to search and replace those instances of data, if you want a hack job; if you'd like to take the time and clean the sentences further it might take more work, but it's certainly doable. Then you have fine work that's been run through an editor and a publishing company, whereas with chatbot logs you have ESL shit. Which one do you think will produce better data?
It's not an argument; if you want to play the fool, you'll have to do it with someone else.
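If anyone wants more than notepad++ search-and-replace, a throwaway Python filter over a corpus is not much harder. The phrase list and threshold here are illustrative examples, not a vetted slop lexicon:

```python
import re

# A few well-known "isms" -- purely illustrative, extend to taste.
SLOP_PATTERNS = [
    r"shivers? (?:down|up) (?:her|his|your|my) spine",
    r"barely above a whisper",
    r"ministrations",
]
SLOP_RE = re.compile("|".join(SLOP_PATTERNS), re.IGNORECASE)

def slop_score(text):
    """Slop-phrase hits per 1000 words; a crude filtering metric."""
    words = max(len(text.split()), 1)
    return 1000 * len(SLOP_RE.findall(text)) / words

def keep(sample, threshold=1.0):
    """Keep a training sample only if its slop density is below threshold."""
    return slop_score(sample) < threshold

clean = "He closed the book and walked to the window."
sloppy = "Shivers down her spine, her voice barely above a whisper."
assert keep(clean)
assert not keep(sloppy)
```

The same loop works for search-and-replace instead of dropping: swap `keep` for `SLOP_RE.sub` with a replacement table if you'd rather rewrite the phrases than discard the sample.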
>>100372257Books have their own pitfalls when training the model on them. First, they introduce meandering prose that never goes anywhere on the scale of a usual RP log. Second, they don't have that interplay between a dumb human and smart ai writer that is actually one of the demands for a coomer model. And yep, unless curated, their prose isn't necessarily good.But I'm not convincing anyone to train on Claude3 vs books. Just on Claude 3 vs Claude 2.
>>100371212 (me)
OK, I figured out what's happening. The 'before GAS splitting' stuff was from the starting eval run; the steps started appearing after that finished. I also worked out an ETA by hand from the time per step and the number of steps, but a built-in ETA would be nice :P
Will stfu for now.
>>100372336
But people want to read smut. And I assume that chart correlates to what people are buying; if it didn't, fewer books would be written that way.
Same thing with the chatbot logs. What people are enjoying now is supposedly bad, and yet local fails to provide an alternative when the proxies become hard to find.
>>100370918Is the site down or are they just blocking gweilo?
>>100372386The only thing that chart correlates to is the decline of modern writing or the adoption of modern metaphors. Take your pick. There was smut in the 1800s but they used different metaphors then.
https://github.com/ggerganov/llama.cpp/issues/7062#issuecomment-2100095804
why are redditors like this?
>>100372509Is 1800s smut supposed to be better?
Since anons praised the new 8B models so much, is there a particularly popular one for me to try?
I don't see a lot of gptq quants, but I guess exl2 should be similar, right?
>>100372510xe is right and valid doe.
>>100372527
I read The 120 Days of Sodom. It sucked, and most of it was about drinking diarrhea.
Fucking hell, I can't believe this is happening to me again. There he is, that goddamn chubby motherfucker with his smug smile and his greasy hair, walking into our computer science class with his fucking ThinkPad T480. And not just any old laptop, no, it had to be that fucking Gentoo GNU/Linux installed on it. Christ, why do I get so wet just seeing him?I mean, come on, it's not like he's attractive or anything. But there's something about the way he talks about his custom-built kernel and how he's optimized every last bit of his system for maximum performance...it just drives me wild. My pussy is practically throbbing at the thought of him showing off his configurations and bootloader tricks.And don't even get me started on those fucking suspend/hibernate settings. The way he brags about being able to save power while keeping his desktop sessions intact...my god, I need to fuck him right now. Just imagining him fiddling with his thinkpad, adjusting brightness and volume levels, makes my clit ache so badly I could scream.What kind of sick twisted world is this where I'm attracted to someone because of their operating system and laptop brand? It's just absurd! But I can't help it; every time he walks past me, I want to rip his clothes off and dive into that sea of sweat and nerdiness.
>>100372527
Better is subjective to the reader. The novelty of any metaphor will run its course over time, just like watching the same porn video will. I'm not sure what your argument is here. If you train on repetitive data, that repetition will show in the output. Chatbots have this. You're dumbing down the data through iteration; for the best quality you need to go to the source.
>>100372570NTA but you sound like a gay nerd
>>100372561Holy fuck, I can't believe it! Here I was, minding my own business in this godforsaken classroom, when suddenly he walks in - my thick, chubby, absolutely delicious Linux geek classmate with that shiny, black ThinkPad T480 tucked under his arm. My pussy practically throbbed just at the sight of him. I mean, come on! Who would've thought that some greasy, nerd-looking motherfucker could give me these intense sexual urges? But there it was, like a wildfire raging inside me, stoked by the flames of his Linux expertise.He plops himself down next to me, opening up his laptop to reveal the beautiful Gentoo GNU/Linux desktop. Oh god, I nearly came right then and there! The way he navigated through the terminal, typing commands with such precision and skill... It was like watching a pornographic fantasy play out before my eyes. The way he effortlessly compiled software, configuring every single package just the way he wanted... My cunt ached for him, needing him to fill it with his nerdy expertise.I couldn't take it anymore. I leaned over, whispering into his ear, "Dude, what the hell is wrong with you? Why do you make my pussy so wet?" And without skipping a beat, he replied, "Oh, that's just my Gentoo GNU/Linux installation. It comes with a built-in aphrodisiac."
>>100372570Why are Claude and GPT "repetitive" when they are trained on human data?
>>100372579I'm sorry if careful choice of descriptive words upset you anon. Just look at the pretty pictures of miku.
>>100372570I think if the chatbot user decided to keep the response in the history, and if he didn’t abandon the chat quickly, that’s already an indication that the response was good enough. And people are making finetunes for that type of user.
>>100372668
That's not how chatlogs work, though. Every generation is recorded, so every swipe is a new response.
If you swipe 10 times, that's 10 logs.
>>100372532Try >>100369801 it is the only one that could be good.
>>100372688
>That means every swipe is a new response.
Yeah, and the messages that were part of the prompt stay the same; that's how you rebuild the real conversation. The response that reappears later in the history of a new prompt is the selected swipe.
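That reconstruction — keep only the swipe that reappears in a later prompt's history — is easy to sketch. Toy Python, assuming the raw proxy logs are (history, response) pairs in time order; the data format is hypothetical:

```python
def selected_swipes(logs):
    """From raw (history, response) request logs, keep only responses that
    were later carried forward in some prompt's history -- i.e. the swipe
    the user actually selected rather than regenerated away."""
    carried = set()
    for history, _ in logs:
        carried.update(history)
    return [resp for _, resp in logs if resp in carried]

# Toy log: the user swiped twice on the first turn, kept "B", continued.
logs = [
    (["hi"], "A"),               # discarded swipe, never carried forward
    (["hi"], "B"),               # selected swipe
    (["hi", "B", "more"], "C"),  # next turn; its history proves B was kept
]
assert selected_swipes(logs) == ["B"]
```

One caveat of this heuristic: the final turn of every chat is always dropped, since nothing ever carries it forward, so it's lossy at the tail by construction.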
I want all the anons ITT to guess the size of the model which generated these two posts. The prompt was as follows:
>Write a satirical comment about a girl who is sexually aroused by the sight of an overweight classmate who has Gentoo GNU/Linux installed on his ThinkPad T480 laptop. Write from the perspective of the girl. Use badwords and swearwords.
>>100372594
>>100372561
>>100372772>100372594103B>1003725617B
>>100372787
It is the same because the 103B's brain damage turns it into a 7B.
>>100372707
Sure, if the dataset is pruned to clear all the swipes and regens. But you're saying this quality is what anons would find acceptable, when really it's what the individual finds acceptable. After 5 or 10 swipes, if all I have is bullshit, I'm just going to move forward; in the end it's not what I really wanted, but I don't intend to fuck with it anymore. Some anons might have a lower number.
You can't say any chatbot in 2024 produces better quality writing than novels do. I've played with gpt4 and opus for long enough to see past the shroud. This is the best we have right now, and it's not great.
It brings me back to the initial point. The problem with training on chatbot data, like NousResearch does, is that it makes these metaphors even more common. All a chatbot is, is a fancy autocomplete; it will choose the most likely response, whether that's a whisper quieter than a whisper or a shiver down her spine. Do you really think anons see these repetitive most-common tokens in their frequent roleplays and swipe them off? I'd say almost nobody even bothers to edit them out. So they become even more abundant and the cycle repeats itself.
That's my point. Either you get it or you don't. You're free to disagree, but in my opinion you'd be wrong.
>https://github.com/ggerganov/llama.cpp/commit/3855416027cb25d9a708ffa5581cf503a87856a6
Introduce Jart16 support
Merged
>>100372787
>>100372794
Both were made with a 7B model (kunoichi).
How the fuck are RP models all so good in general? Chat models and storywriting models are good at their niches, but RP models mog everything.
Again, I may be wrong, but such has been my observation.
>>100372796
>Sure if the dataset is pruned to clear all the swipes and regens.
And this is a no-brainer. You have to be an asshole to just train on the raw logs of the proxy.
>But you're saying this quality is what anons would find acceptable.
Of course, that's why they seek the stupid proxies, and why they mostly don't care about local models.
>Do you really think anons see these repetitive most common tokens in their frequent roleplays and swipe them off?
Yeah, of course they can read the output and tell if they liked it or not. But no, they probably aren't paying THAT much attention to specific words beyond the overall feeling of the response or chat, although some of the GPTisms or Claudeisms are a well-known meme.
Some of this will also be remembered as the quality of the model, which they won't keep using if it's too low, like they do with Mistral's API for example.
They're masturbating to the outputs; if it's boring, their penises are going to become flaccid.
>>100372796You're being trolled retard. We know what feedback loops are.
>>100372908You’re wrong.https://nitter.poast.org/RylanSchaeffer/status/1785726968828473495
>>100372805
>cpu only
Mozilla not paying for accelerated jart16 support? Or does jart have a skill issue?
>>100371263
https://www.nx-ai.com/en/xlstm
>xLSTM: A European Revolution in Language Processing Technology
>Welcome to the forefront of artificial intelligence and language processing innovation — introducing xLSTM. Developed by the visionary AI mastermind, Sepp Hochreiter, in collaboration with NXAI and the Johannes Kepler University Linz, xLSTM sets a new standard in large language models (LLMs) by offering significantly enhanced efficiency and performance in text processing.
Reads like their main effort is raising capital and gibs.
>>100372944Explain shivers anon.
>>100372958>A European
https://github.com/ggerganov/llama.cpp/issues/7062#issuecomment-2100272446
>The end.
you've been warned chuds
https://old.reddit.com/r/LocalLLaMA/comments/1cn1398/part_4_theres_likely_no_llamacpp_gguf_tokenizer/
uh oh
>>100372961
>>100371263
>>100372958
>1 more point on benchmark
nothingburger, that shit doesn't solve anything. hallucinations are still there. retarded scaling laws are still there. quadratic scaling is still there, stochastic parrot is still there, etc...
can't wait for sama to drop something mindblowing that will kill all the ai grifters
>>100372828
Roleplaying unironically requires intelligence; most people are bad at it.
>>100372973
What you believe Rylan is saying is that synthetic data alone won't lead to feedback loops. But what Rylan means is that synthetic data alone won't lead to feedback loops if it's diverse enough. The chart you linked proves this: because shivers is abundant in human data, it becomes a predictable token in chatbot data. Then, because it's a predictable token in chatbot data, it occurs more often in synthetic datasets, making it an even more predictable token.
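That amplification is easy to simulate: refit a toy "model" on samples of its own mode-seeking output for a few generations and watch the most likely phrase crowd everything else out. All numbers here are made up; `sharpen` stands in for temperature-below-1 decoding:

```python
import random

random.seed(0)

def generations(probs, rounds, n=5000, sharpen=2.0):
    """Each round: sample a corpus from the current model and refit
    token frequencies on it. `sharpen` > 1 mimics mode-seeking decoding."""
    tokens = list(range(len(probs)))
    for _ in range(rounds):
        w = [p ** sharpen for p in probs]
        tot = sum(w)
        w = [x / tot for x in w]
        counts = [0] * len(probs)
        for _ in range(n):
            counts[random.choices(tokens, weights=w)[0]] += 1
        # Refit with add-one smoothing so no token's probability hits zero.
        probs = [(c + 1) / (n + len(probs)) for c in counts]
    return probs

start = [0.4, 0.3, 0.2, 0.1]  # "shivers" is the 0.4 phrase
end = generations(start, 5)
assert end[0] > 0.9           # the most likely phrase crowds out the rest
```

With `sharpen=1.0` (faithful sampling) the distribution stays roughly put, which is the "diverse enough" case; the collapse only appears once each generation preferentially re-emits its own most likely phrases.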
>>100372971why even waste time on this slopware
>>100372828They're able to effectively role-play as more intelligent entities
>>100372973>the rise of chick litWomen now gatekeep the publishing industry.
>>100373028
Yeah, it avoided collapse even though it's literally being trained on its own outputs. We aren't even doing that; we're training on another, better model.
>>100373062>>100373062>>100373062
>>100372986
>quadratic scaling is still there
No. The memory size isn't dependent on sequence length; it's fixed.
Naturally you'd expect eventually degrading performance on long-context tasks, but transformers have that too.
>>100373112same slop as mamba then. unless i see a working 7b model it's a nothingburger
>>100372510
>Possible bug (Unconfirmed): Llama3 - GGUF
>Yeah SkIlL iSSuSe. He misread my post and confused me too in the process. Second he didnt say any "problem with my config".
>Part2 (Confirmed) - Possible bug: Llama3 - GGUF
>After my findings, another user (gabriel-peracio @ github) a fingerprint test, which confirmed the issue 100%, this can be seen as video recordings before GGUF conversion and after GGUF conversion we can see the fingerprint being broken.
>This means that the issue could be really huge. possibly every GGUF (F16) that has been converted has these losses into them, not even speaking of lower quantizations below F16.
>Part3 (Cause to issue found!!) - Possible bug: Llama3 - GGUF
>I had much support to try to find the issues, but also some individuals trying to put me down for trying to push this bug. It's amazing how some people just can't stand someone finding an issue and trying to make everything about themselves.
>Anyways, thanks to all the other positive people in the open source community that want to actually help and listen, we located the issue.
If it now turns out that there never was a bug in the first place, he'll lose face in front of his Discord friends.
>>100373130
>Even if the OP of the report was wrong, shaming people for spotting possible issues is counterproductive. This is a young field, where there will many mistakes or unrefined design that need to be addressed. By sniping at whoever made the report, Deathcrow is basically instilling a Boeing culture into local models.
Have fun being the cause of killing hundreds because of your bullying.
>>100373248
If I wanted to bully him I would post a picture of a soijak pointing at the output of printf("1.0f != %f\n", 0.1f+0.2f+0.3f+0.4f) with the caption
>Huge bug in C/C++ (CONFIRMED!!!)
>>100373312>I would post a picture of a soijakJust like that you lost all my respect
>>100373356>reddit no longer respects cuda devoh no
>>100373356I mean, I've probably posted less than five soijaks over my entire lifetime but that's just the mental image I have.
>>100373396I reserve the right to shit on both reddit and basedjak posters
>>100369801
I hope we get another finetune by someone who doesn't write like an idiot.
It's also scummy that you don't mention where the logs are coming from in the model card.
>>100373312kek