/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>101584411 & >>101578323
►News
>(07/24) Mistral Large 2 123B released: https://hf.co/mistralai/Mistral-Large-Instruct-2407
>(07/23) Llama 3.1 officially released: https://ai.meta.com/blog/meta-llama-3-1/
>(07/22) llamanon leaks 405B base model: https://files.catbox.moe/d88djr.torrent >>101516633
>(07/18) Improved DeepSeek-V2-Chat 236B: https://hf.co/deepseek-ai/DeepSeek-V2-Chat-0628
>(07/18) Mistral NeMo 12B base & instruct with 128k context: https://mistral.ai/news/mistral-nemo/
►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling
►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
►Recent Highlights from the Previous Thread: >>101584411
--TTS improvements and output issues: >>101586575 >>101586607 >>101586659
--Mistral nemo configuration and settings advice: >>101585456 >>101585527 >>101585596 >>101585669 >>101585834 >>101585868 >>101585572 >>101586019
--Sillytavern single sentence replies issue: >>101587180 >>101587200 >>101587246 >>101587225 >>101587275 >>101587269 >>101587353 >>101587401 >>101587413
--Recommendation for voice data TTS finetuning: >>101585560 >>101586101 >>101586163 >>101587016 >>101588184
--Nemo generates quadrupeds well but writes differently than chatgpt: >>101587732
--Logical flaws in GPT-4 and Claude, Command R Plus gets it right: >>101584587 >>101584617
--GitHub repo for bulk downloading cards for ST: >>101585689 >>101586342
--Anon asks for Command-R Plus alternatives: >>101585536 >>101585556 >>101586438 >>101586483 >>101586596 >>101586657
--largestral iQ2_M outperforms Nemo in retarded quant, but is slower than 1t/s: >>101585893 >>101585921 >>101585940 >>101585998 >>101586017 >>101585939 >>101585985
--Nemo repetition issues and DRY sampler settings recommendations: >>101587028 >>101587049 >>101587511 >>101587535 >>101587576 >>101587545
--MoEs for roleplaying? Try it and find out: >>101584540
--Mistral Nemo sampler settings cause rambling output: >>101585928 >>101585955 >>101586019 >>101586038 >>101586062
--Where do ST or other UIs cull example dialogue in the context window?: >>101584746 >>101584777
--RULER repo measures effective context length, Llama3.1 performs well: >>101586297 >>101586352 >>101586384 >>101587005 >>101587027
--IQ4_XS vs Q3_K_M model quants and accuracy discussion: >>101585131 >>101585176 >>101585200 >>101585383 >>101585434 >>101588262
--IQ1_S performance and characteristics discussion: >>101588056 >>101588068 >>101588140 >>101588159 >>101588129
--Miku (free space): >>101587473 >>101588754 >>101588896
►Recent Highlight Posts from the Previous Thread: >>101584415
post (You)r largestral presets
>>101589142 i got a little chub seeing my repeated (You)s in this AI generated recap. thank you, botkind.
I am once again asking for mini-magnum presets.
>>101589160 I didn't actually try it: >>>/vg/487568316
gib nemo presets
>>101589210 >>101589219
just use the ones i linked from that anon >>101585456
in fact fuck it ill re-copypaste it again. Here, since so many people seem to be using nemo with wrong formatting then complaining:
Mistral context template: https://files.catbox.moe/6yyt8d.json
Mistral instruct template: https://files.catbox.moe/rfj5l8.json
Mistral sampler settings: https://files.catbox.moe/tbsgip.json
Should be night and day for people who have it set up wrong. Make sure whatever backend you are using has DRY sampling.
So, what was the point in MistralAI sabotaging their 8x22B with the shitty official -Instruct version and the botched release? Is this a psyop by their Partners at Microsoft trying to make MoE models look bad?
>>101589231 Nemo doesn't use spaces around INST.
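To make the claim concrete, here is a minimal sketch of the difference; the tekken layout below is my reading of the template, so verify against the official tokenizer's chat template before relying on it:
[code]
# Hedged sketch: Nemo's "tekken" template reportedly drops the spaces
# around [INST]/[/INST] that older Mistral templates used.
def nemo_prompt(turns):
    out = "<s>"
    for user, assistant in turns:
        out += f"[INST]{user}[/INST]{assistant}</s>"
    return out

print(nemo_prompt([("Hi.", "Hello!")]))
# -> <s>[INST]Hi.[/INST]Hello!</s>
# older Mistral templates instead render: <s>[INST] Hi. [/INST] Hello!</s>
[/code]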
How're you guys feeling? As the dust settles down, it really feels like we've never been more back. Back to back releases, putting local about on par with cloud in performance/cost, and it's still not over, we're going to get more next week. We are not even 3 years into the timeline since the ChatGPT hype began.
>>101589262 I dunno, i've been using it with magnum just fine.
>>101589244 Maybe they didn't have time, and without the release of 405B, they didn't feel the need to release their best stuff.
so mini-magnum is the best cooming model for vramlets now?
>>101589231 >dry sampling
Does Koboldcpp have this (I don't see it) or am I fucked?
The people that are using 4 3090s... Where are they putting them?
Aah, 30t/s... This is the good life. Thank you Arthur.
>good model release
>people saying low quants are fine, others saying there's night and day differences (probably broken quants)
>prompt/template issues left and right
Every time... I guess I'll wait 2MWs then...
>>101589289 That or just Nemo-Instruct.
>>101589265 You can see this as something good, we are on par with the big boys after all. But you can also see this as pure doom. The big boys have barely moved ever since the release of GPT4.
>>101589307 I'm the night and day difference anon and I should clarify my quants are definitely not broken, I do them all myself. q4km was still *fine*. better than 70bs or CR+ still, just kind of dry, generic, a little less sovl, a little more awkward - but q5ks was sharp as a tack and much more coherent, pulled in more little details, had more of those creative little turns of phrase that let you know it's really paying attention. lower quants are still usable and the model will still be good, it's not like they're totally fucked or anything, it's just that the second I bumped up the quant it felt like the model gained a real human touch that was lacking before
>>101589307 >people saying low quants are fine, others saying there's night and day differences (probably broken quants)
more like
>people saying low quants are fine (poorfags who can only run low quants at 3t/s), others saying there's night and day differences (people who can actually run these models properly)
>>101589370 I test through online demos (mainly lmsys) to compare the quants I downloaded against their "intended" performance. Otherwise I would not be able to say with full confidence that a model like 8x22B cannot do trivia like DBRX can.
where's the dry sampler settings on ST?
>>101589356 Did you use imatrix? The quants I'm using are all imatrix calibrated. Also they're the IQ format, which I think were supposed to be more knowledge-retaining compared to K quants, but I'm not certain.
Cohere gathered another $500m from investors. CR++ will be a beast of a model.
>>101589142 good bot
>>101589491 There, I am on the staging branch.
>>101589536 I really wonder how businesses are using these products to make money.
>>101589550 speculative capital, one of these might be the next big breakthrough
>>101589265 >We are not even 3 years into the timeline since the ChatGPT hype began.
>ChatGPT initial release: November 30, 2022; 19 months ago
nvidia-smi is not displaying all of my GPUs, but neofetch is. how do i fix this? i cant run any AI applications due to an error about cuda devices not being found
>>101589653
>>101589642 It hasn't even been 2 years? Wtf
>>101589653 Change your environment variables, I guess.
>>101589550 If performance improvements plateau and you have ~5 years of scaffolding/agent development with no valid use cases, you might have a point. It's only been 19 months since ChatGPT released. Doomers just really want to see LLMs go the way of 3D TVs for some reason.
>>101589688 how do i do that?
man, that mini magnum finetune of Nemo 12B is actually starting to replace claude for me, which is nuts considering claude has got to be at least 50 times bigger
>Claude 3.5 Sonnet and Llama 3 405B stomping GPT-4o
>Llama 3 405B is way fucking cheaper than GPT-4o
>It's only a matter of time before a cheaper and more capable model than GPT-4o-Mini comes out and kicks them out of the cost-performance pareto front entirely
Is he really just banking on Strawberry?
>>101589762 >It's only a matter of time before a cheaper and more capable model than GPT-4o-Mini comes out and kicks them out of the cost-performance pareto front entirely
Claude 3.5 Haiku, probably. the original haiku beats the shit out of 3.5 turbo, which was the sota small cheap model at the time
>>101589715 Type "export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5"
>update tavern
>even with all my settings and shit in order, the gen quality is fucked UP bad
>wtf could possibly be-
>mfw i forgot to enable instruct mode
>>101589265 I do wonder how many OG AI Dungeon era people stuck around to witness this. I joined around the late GPT-2 times, now running IQ4 largestral. I don't see myself ever ending the ride.
>>101585978 Same, Nemo might be retarded and repetitive at times, but it has some surprising creativity if you push it
>>101589907 MOOOOOOOOOOOOOOOOODSSSSSSSSSSSSSS
>>101589907 Ew
>>101589539 thanks, i'll take a look
Here comes the pedo tranny thirdie again.
>>101589653 did you enable Above 4G decoding in bios? also check dmesg for errors from the nvidia driver.
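If the BIOS and dmesg both look clean, a quick sanity check of what CUDA itself can enumerate might help narrow it down (a minimal sketch, assuming a working PyTorch install):
[code]
# Purely diagnostic: compare what CUDA sees against what nvidia-smi reports.
import torch

print("CUDA available:", torch.cuda.is_available())
print("Devices visible:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
[/code]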
>>101589872 I used to be so happy with my loli imouto scenarios on AI Dungeon. I used to think running LLMs locally would be impossible because Pygmalion 6B used all my RAM and was as slow as a snail. Now I'm here, running NeMo, still enjoying my loli imouto scenarios, but without fear of suddenly being cucked. Feels good.
>>101589872 I joined back in December 2019. I remember the humble days of Clover where the AI was too fucking stoned to even remember your character's name, much less what was happening. It was absolute dogshit and now here we are.
>>101589265 Imagine Terry's reaction to the LLM tech, writing llama.cpp but in HolyC to replace his text oracle perhaps.
>>101589290 get sillytavern staging, and ((pull))
>why does anyone use response tokens over 256? 512 is hellish
>>101589762 He just needs to reignite the AGI hype by adding smell to the multimodal model. Or maybe he can tease Sora again
jesus man, Nemo is INSANELY horny. My OCs are a bajillion times more frisky with Nemo than any other model i've ever used. On one end i'm overwhelmed, yet it manages to blend that spice with their personalities perfectly. It doesn't skip a beat. I almost want to say i wanna tone down the horny, but it's not like that breaks story flow or makes ERP more difficult or anything, I'm personally just not horny right now kek
>>101589971 The realism of this surprised me for a bit until I realized the popsicle is constantly changing shape...
>>101590044 arthur's personal coomtune strikes again
>>101590054 Why did he do it?
>>101589231 Is such a simple prompt best? No one uses those crazy ones they were using before?
>>101589265 We're so back. Zucc and Yann are false prophets, Silicon Valley are false prophets. Viva la France
>>101590073 yeah, it's never really mattered that much, was always placebo. Which makes the Agent 47 crackhead prompt situation even funnier.
>>101589292 Just get two A6000s or something if you want to be more compact.
>>101590109 Interesting. So it's more down to the card itself and what examples you give it to emulate?
nemo is schizo...
>>101590170 A bad card can break any model, doesn't matter. It's why W++ for example is memed on so hard; there's no exact science, it's just basic logic of garbage in, garbage out.
>>101589262 So I should change that so there are no spaces on the INST ones? What about the \n after </s>?
>>101590172 You're using a temp too high. Mistral says in the model card that it likes low temperatures, they say 0.3, though I find up to 0.4-0.5 is usually fine
>>101590229 NTA but I use simple sampling and for RP Nemo handles 0.7-0.8 just fine. Occasional schizo moments at 0.8. Starts getting really dry at 0.7 and lower. 0.3 is probably to prevent hallucination when using it for normie shit.
I'm swiping this popular character card and the responses from mini-magnum and Claude Opus are identical. Claude walked so nemo could run.
anyone running an exl2 mistral quant? I get gibberish with a 4.0bpw turboderp quant.
I just downloaded 3 more IQ models below IQ2_M to see if any would be able to answer one of my challenging trivia questions as perfectly as IQ2_M did. Turns out IQ2_M is the cutoff for this particular question. IQ2_S gets the question partially right. About half of the points I would say. IQ2_XS and below basically just get it increasingly wrong, until IQ1_S which nearly went schizo-tier. Guess I'll just live with 1-2 t/s.
>>101590287 3.5bpw is working perfectly fine even at 4-bit cache.
>>101585837 do two gpus work faster or slower than a single one, if the model can fit in one? does vLLM split by row or by column? does it do tensor parallel? does nvlink on 3090s help by a lot? does the performance of 2 gpus differ much from 4? BTW, did you try cpu offloading in vLLM?
>>101590287 yeah, turbo's 3.5bpw + 4-bit cache is running fine for me on ooba. i don't know if it's necessary, but i updated transformers from source, like the mistral-large readme said.
>>101590374 It's 2024. Why is VRAM still hard to obtain? It's literally just soldering more memory chips onto the board. Why? Now you have people running two servers in parallel just to serve a model.
>>101590109 How do you tell it not to act for the user, then? I always have that issue.
>>101590383 something specific causes that, i forget what, i started getting it tonight actually. someone will chime in to inform us kek
>>101590383 using
>write {{char}}'s next reply
in the sys prompt usually fixes this for me
so how much money do I have to spend to run 405b at home?
>>101590319 Largestral? Does 3.5bpw fit in 48GB vram? How much context?
>>101590374 simple answer
>greedy Nvidia encrypts vbios
>>101589265 (((Openai))) is $5B in the red this year
>kek
>>101590419 Just run largestral instead. Better for most users' purposes. 3x 3090s+
Ok, I tried mini-magnum-12b, the nemo finetune, at exl2 8bpw, but like some time ago, my nemo is broken with exllama: it doesn't follow the silly tavern template and writes a lot of text filled with nonsense. I'll try llama.cpp later. Any advice? I'm using the settings from this anon >>101585456
>>101589136 Thread Theme: https://www.youtube.com/watch?v=7yJRsFFRoQY
Don't mind me, just a stranger blowing through this town...
>>101590536 God, I hope you don't write like that to the poor llm. Are you sure you're using the proper template? Have you updated ST and exl2 since the last time you tried?
>>101590319 >>101590346 thanks. it seems like something with my samplers broke it. I neutralized the samplers in sillytavern and it started working.
why are some people here using small quants of a 12B model? even if your GPU is only 8GB you can run Q6 at a very good speed with some offloading
>>101590531 >3x 3090s+
I've only built one PC in the past, and don't know of any standard motherboards that support that many GPUs. My first thought was something like picrel, basically a mining rig. Without NVLink it's gonna be pretty bad, as far as I understand. How did you, or anybody you know, do it?
>>101590711 That's basically the idea. https://www.amazon.com/Kingwin-Professional-Cryptocurrency-Convection-Performance/dp/B07H44XZPW/ref=sr_1_1?sr=8-1
>>101590711 open air build like a mining "case", riser cables, any motherboard with 4 pcie slots. does not have to be x16, x8 or whatever, even x1 is enough. Just get 4 of them.
>>101590576 Yes, I did an upgrade a moment ago. Do I have to set a value for alpha?
>>101590576 >Are you sure you're using the proper template?
I'm using the one which was shared in the last thread.
>>101590711 This guy did one with 7x 4090s. You can see what his concerns were. He goes pretty in-depth. https://www.mov-axbx.com/wopr/wopr_concept.html
>>101590720 >>101590720 >>101590754 I just had an idea, and I'm sure somebody else has had it in the past as well. For dense models running across multiple GPUs without NVLink, performance gets worse and worse the more cards you add, because they have to wait for each other to finish before computing the next hidden layer state. But what if you take a MoE model, for example DeepSeekV2 236B, and split the different smaller experts across the gpus, so that they don't have to exchange information? Is this thinking flawed?
>>101590536 Enable "Add BOS Token" in ST
>>101590774 That's not how MoEs work.
but how do they work then.
>And finally, we have the Arch Linux package updates. Oh boy, I can barely contain my excitement! You have a whopping 106 packages begging to be updated. I mean, who doesn't love a good update cycle? It's like playing a game of "spot the broken dependency"! Good luck with that.
i love when it sasses me
>>101590786 (me)
>Mixtral is a sparse mixture-of-experts network. It is a decoder-only model where the feedforward block picks from a set of 8 distinct groups of parameters. At every layer, for every token, a router network chooses two of these groups (the "experts") to process the token and combine their output additively. This technique increases the number of parameters of a model while controlling cost and latency, as the model only uses a fraction of the total set of parameters per token.
I don't see how my thinking is flawed, someone educate me. just have 2 parameter groups on each gpu and the supervisor on the last one.
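A toy top-2 router in the spirit of the quote above shows the catch: which two experts fire is decided per token, per layer, so pinning one expert group per GPU still means shuffling activations between GPUs constantly. Illustrative numpy sketch, not Mixtral's actual code:
[code]
# Toy MoE routing: 8 "experts", top-2 chosen independently for every token.
import numpy as np

rng = np.random.default_rng(0)
n_experts, d, n_tokens = 8, 16, 6
W_router = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # toy FFNs

x = rng.standard_normal((n_tokens, d))
logits = x @ W_router
top2 = np.argsort(logits, axis=1)[:, -2:]   # 2 experts per token

for t in range(n_tokens):
    e0, e1 = top2[t]
    w = np.exp(logits[t, [e0, e1]])
    w /= w.sum()                            # softmax over the chosen two
    y = w[0] * (x[t] @ experts[e0]) + w[1] * (x[t] @ experts[e1])  # combined output
    print(f"token {t}: experts {sorted((int(e0), int(e1)))}")
# every token hits a different pair, so a fixed expert->GPU mapping still
# needs activations routed to (and gathered back from) the other GPUs
[/code]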
>>101590711 If you wanna stay on standard architecture and don't wanna invest in workstation CPUs, then the MSI MEG X570 Godlike mainboard is a great choice with 4 slots for GPUs. I wanted to build a bigger PC with 4 3090 cards, but now I'd rather wait for the 5090 announcement next year.
So is there a reason why Llama 3.1 that I downloaded from the official repository doesn't come with any config.json, and every single piece of documentation I've found that can supposedly convert them to HF format doesn't work?
>>101590804 llamacpp anon, we need you. he's wrong and I know it but can't explain why.
>>101590732 >>101590745
If i'm reading the setup files correctly (https://files.catbox.moe/tbsgip.json specifically): it sets the temperature to 1, when the mistral guys recommended 0.3 or 0.4. Change it to 0.3 and try again.
The second thing is repetition penalty. Disable it by setting it to 1.
If that makes it work better, then play around with the temperature. If it still doesn't work as you expect, post a screenshot of the output to see what you're talking about. "writes a lot of text filled with nonsense" is not that useful.
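If anyone wonders why the temperature setting matters so much here, a minimal illustration of what it does to the sampling distribution (the logits are made up):
[code]
# Same logits, two temperatures: T=1.0 keeps real mass on runner-up tokens,
# while T=0.3 (Mistral's recommendation) sharpens hard toward the top token.
import numpy as np

logits = np.array([4.0, 3.0, 2.0, 0.5])

def softmax_t(z, T):
    e = np.exp((z - z.max()) / T)
    return e / e.sum()

for T in (1.0, 0.3):
    print(T, np.round(softmax_t(logits, T), 3))
# 1.0 -> roughly [0.652 0.240 0.088 0.020]
# 0.3 -> roughly [0.964 0.034 0.001 0.000]
[/code]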
>>101590819 What did you download? The original repos on meta's hf all have config.json files.
>>101590307 There was some post-quant tuning that enhances the quality of iq2 quants, but I don't remember where that was. Prolly the only way to run huge llms on 24gb with no major loss.
>>101590819 By official do you mean the repos on this account https://huggingface.co/meta-llama or a different site where they host their models? The config.json files are definitely in the huggingface repos. You should download them from there.
>>101590711
how much T/S do yall get with 4x 3090's on largestral at what quant
>>101590774 only if you split by column and not by row. if you split horizontally it doesn't slow down, since that's tensor parallel, so you run in parallel. but you need good interconnection.
I'm new to using SillyTavern. Is there a way to prompt the kind of response the AI generates to guide it in a certain direction without having to just rewrite the response entirely by hand? Like if I give it an open ended question and I want all its responses to be either positive or negative.
>>101590939 Try including something like "Only answer positively/negatively" in the author's notes. Depth = 0 if you want it constantly reminded of it for every message.
>>101590946 Thanks, I'll give that a try and see if it helps.
>>101590939 I simply use group chat for a char and my OC, while posing as a narrator in user responses. Much more convenient from a chat-editing perspective than having the author's note open. The narrator just gives out barks for both characters, and then I mute the narrator barks so that it doesn't try to act as narrator itself.
>>101590778 >Add BOS Token
Is enabled.
>>101590843 >sets the temperature to 0.3
>Disable rep pen
I did this too. I tried setting the temp to lower values and to values above 1.0, and this is the result.
>>101590983 That's a great way to utilize the group chat. Makes me wonder what other things can be done with it.
Where can I find/which gguf version of mini-magnum-12b should I use?
>>101591073 https://huggingface.co/starble-dev/mini-magnum-12b-v1.1-GGUF
>>101591073 the one that fits
>>101591140 Thanks anon.
>prema trying to do team orders in fshitter
>>101590410 Doesn't seem to help, sadly.
>>101589231 Ok so I got koboldcpp, the staging version of sillytavern, imported these three, and made my persona a basic [{{user}} is a guy that has this color hair, this color eyes and this color skin]. Is there anything else I need to do to make this work? I got some random cards off chub but I dunno what makes a card good or retarded.
Can using a smaller context size result in model retardation (within that context), or is it enough that I match the koboldcpp and sillytavern settings? I don't have the VRAM to run the full 128k of nemo.
>>101591291 no, the opposite: using bigger always degrades it at some point
>>101584777 >>101584746 Any ideas on where ED gets culled?
>>101591301 Okay, thanks. So should I go for smaller context in favor of higher quants as well? Currently using Q6_K_L with 8k, but I guess it may be worth it to go lower quant.
>>101591314 8k is generally good with most recent models; above that is when it gets iffy, especially above 32k, so if you're enjoying what you have, just don't break stuff for no reason
>ZeroWw 'SILLY' version. The original model has been quantized (fq8 version) and a percentage of its tensors have been modified adding some noise.
>Full colab: https://colab.research.google.com/drive/1a7seagBzu5l3k3FL4SFk0YJocl7nsDJw?usp=sharing
>Fast colab: https://colab.research.google.com/drive/1SDD7ox21di_82Y9v68AUoy0PhkxwBVvN?usp=sharing
>Original reddit post: https://www.reddit.com/r/LocalLLaMA/comments/1ec0s8p/i_made_a_silly_test/
>I created a program to randomize the weights of a model. The program has 2 parameters: the percentage of weights to modify and the percentage of the original value to randomly apply to each weight.
>At the end I check the resulting GGUF file for binary differences. In this example I set it to modify 100% of the weights of Mistral 7b Instruct v0.3 by a maximum of 15% deviation.
>Since the deviation is calculated on the F32 weights, when quantized to Q8_0 this changes. So, in the end I got a file that compared to the original has:
>Bytes Difference percentage: 73.04%
>Average value divergence: 2.98%
>The cool thing is that chatting with the model I see no apparent difference and the model still works nicely as the original.
>Since I am running everything on CPU, I could not run perplexity scores or anything computing intensive.
>As a small test, I asked the model a few questions (like the history of the roman empire) and then fact checked its answer using a big model. No errors were detected.
>Update: all procedure tested and created on COLAB.
>https://huggingface.co/NeverSleep/Lumimaid-v0.2-8B/discussions/4#66a47badee3de8c56e1e0872
Oh boy here we go again...
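For reference, a minimal numpy sketch of the procedure as that post describes it; the two parameters and their names are my paraphrase, not ZeroWw's actual code:
[code]
# "Randomize the weights": touch a given percentage of weights, nudging each
# by up to a given fraction of its original value (e.g. 100% of weights,
# max 15% deviation, as in the quoted test).
import numpy as np

def add_silly_noise(w, pct_weights=1.0, max_dev=0.15, seed=0):
    rng = np.random.default_rng(seed)
    mask = rng.random(w.shape) < pct_weights          # which weights to touch
    noise = rng.uniform(-max_dev, max_dev, w.shape)   # up to +/-15% deviation
    return np.where(mask, w * (1.0 + noise), w)
[/code]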
>>101590850 >>101590878 I downloaded it with download.sh and the signed URL that was emailed to me by Meta. https://github.com/meta-llama/llama-models
I'm looking for cool instruction templates. anybody got one focused on the assistant directly creating an adventure experience for the user, rather than playing the role of a specific bot?
>>101591364 could someone summarize this with their favorite model?
>>101591471 basically add random noise for no reason and: "The cool thing is that chatting with the model I see no apparent difference and the model still works nicely as the original."
>>101591471 weights actually don't matter. just scramble them and you're fine, which was expected considering that frankenmerges also still output readable content despite having unrelated layers stitched together. the 'consciousness' of a model is unrelated to this sort of thing
>>101590987 >>101591140 I tried two models, in both gguf and exl2, and it still has this level of retardation. I think I'll just return to Gemma 2.
New models that work well without COT meme magic yet?
so how big is the leap of quality between 8b smut and 405b smut
>nemo keeps writing for me
HELP
>>101589872 i member talktotransformer being my first interaction with textual AI, then we got aidungeon and its retarded ceo, then i found out about piggy and the rest is history
nemo shill, i need your help. since nemo wasn't trained to have a system prompt at the top, where should i put my 20 lines of meticulously crafted roleplay rules?
been out of the loop for quite some time. what's currently a good model for a 16GB VRAM card?
>>101591883 If you're in silly, either Assistant last message prefix or author's note. But expect possible degradation both ways. I guess the only way to do it correctly is to add it before your every message, and then edit it out after each reply, which is absolute autism.
I just tried Mistral-Large-Instruct-2407.IQ1_S.gguf from legraphista, but like other very low-precision quants it has issues with using the right tokens sometimes. I think this problem could be solved if the embed tensor was quantized to something better than Q2_K precision. Then, the model might still be dumb compared to the original due to compressed knowledge, but would at least pick the right embeddings.
>>101591941 >either Assistant last message prefix or author's note
ty, i'll try that
>>101591968 We know Robert, we know, keep fighting the good fight! https://huggingface.co/ZeroWw
>LLMs optimization (model quantization and back-end optimizations) so that LLMs can run on computers of people with both kidneys.
https://huggingface.co/RobertSinclair
>>101589231 >>101585456 Any tips for making the bot not write as me? Also I assume you mean this setting, right? It definitely feels very rambly at 1024 reply tokens, but that's probably because my persona is so barebones. Going down to 350 seemed better, although I have to reset my settings and test more, because I got a lot of situations where the bot would end posts with a bunch of newlines or symbol spam.
>Based on comments from @mradermacher...
>His quant are okay if he do it before me, you can use them, he's thrusty.
>>101591305 I tried in Faraday (Backyard) and it seems that ED is being cut down from the beginning rather than the end, which goes in line with how regular message history is culled. I put lore facts in example dialogue and asked about things from the start and end sections; the bot failed to answer properly about the former.
>>101592015 1000 tokens is an incredibly long reply regardless of which model you're using. if you're wanting to simulate a conversation, I don't understand why you'd even give the model the option of writing that much
>>101592040 Thrusting into the popcorn
>>101592087 Robert Sinclair has a point. BitNet models are also configured like that (see picrel). https://arxiv.org/pdf/2310.11453
>>101592100 So he has a point because a meme supports what he says? If anything that goes against him even more. Anyways the new gimmick is random noise now, get with the times! >>101591364
>>101590745 Ok, after some tests I think in my case the problem is indeed the template. I was using the same template from the thread, the one marked in the recap, so it's not a mistake on my end. What's weirder is that with the template I use for gemma 2, the bot is suddenly at least able to follow the text formatting. Sadly it still feels a bit unstable: some cards work better with temp 1, others with 0.4. Is this really the state of Nemo?
>>101592100 There's no claim there that noise improves model outputs, although some time back there have been suggestions that adding noise to embeddings during training may reduce overfitting: https://arxiv.org/abs/2310.05914
Where will AI be in 10 years?
I wonder if those preferring Gemma all happen to be ESL and perhaps Gemma deciphers ESL better as a result of diversity training, just a thought.
/aicg/bro here. Quick question. Who is the "Gojo" of /lmg/? (shitpost bogeyman schizo)
>>101592161 petra/petrus
>>101592163 thanks, i was just bored in our general since we're in a bad doom, ill check the archives. have fun with your chatboots
>>101592161 Isn't your entire general like that?
>>101592153 If your billion dollar ai can't decipher ESL then what's the point?
Anon where KCPP guessed too many layers: can you share your GPU vram, model(s) (including image gen models if used), blasbatchsize, and the amount of context you were trying to use?
It has multiple things in place to prevent that from happening, so if it still under-guessed on your system I want to be able to reproduce the setup, because that would imply you somehow broke through the entire 1.5GB buffer zone we put in place as a safeguard.
Either you have a ton of background stuff running, or you're using a model that is way more vram hungry in unexpected ways than the stuff I tested with.
To clarify: in the current version the auto layer guessing is only accurate for default settings. If you modify for example blasbatchsize, that is not yet accounted for.
Hi all, Drummer here...
>>101592180 HENKYYYY PENGKYYY!!!
>>101592180 What are you doing here? You're too innocent for this website! :koboldpeek:
>>101592180 Kekaroo, your dox got posted earlier faggot
my hero just spoke in /lmg/. AMA.
>>101591786 I can't make it stop either on one specific card I'm doing, where it's an adventure/story rather than a one-on-one chat. IDK if this makes it harder, but it probably doesn't make it easier. I put in the system prompt to write for every character except {{user}}, and put in the jailbreak / depth 0 author's note never to speak for {{user}}. May have helped but didn't totally solve it. Possibly also made more difficult because I am simultaneously trying to make it stop ending replies by asking what my next action is, which I was able to reduce significantly but not eliminate. Partway through I tried cranking the temperature way down and that absolutely didn't fix the issue. Maybe if I tried again with my prompts set up better it would. Nothing solved it completely, but right now the level of swiping / editing is low enough that I'm okay with things.
>>101592274 >I can't make it stop either on one specific card I'm doing where it's an adventure/story rather than a one-on-one chat.
Which isn't to say I *have* been able to get it to stop on other cards, just that I've only been working on this one.
>>101592180 Keep up the great work, Henky! Tell your assistant, Concedo, he did a good job too. :koboldlaugh:
>>101592247 Ooooh, someone's being an edgy boy. :koboldpeek: You think you're so tough spouting that *f-word* behind the screen, huh?
>>101592153 I sometimes think if I were ESL I'd like LLMs a lot more. Like, if I'm reading a foreign language I can't tell if the writing is good or bad. I can just (at most) tell what information it says. And if the same expressions get used over and over I'm not annoyed, I'm pleased to see familiar expressions.
>>101592040 Suddenly Lumimaid makes a lot more sense.
>>101592323 I am an ESL. That is not how it works.
>>101591917 An 8.0bpw exl2 of Mistral NeMo 12B with cache_mode q8 and 32000 tokens of context fits in 15.2 GB of VRAM.
>>101589160 t=1.0
Is it better to have 2x 3090 or 1x 3090 + 2x P40 if I'm trying to run 70b models faster?
>>101592475 2x 90
>>101592475 3x 3090 if you can, but 4x 3090 would be even better
>>101592040 I mean, I knew he was belgian, but didn't know it was that bad.
>>101592348 Don't lie, I bet it's even stronger for u foreign cunts because your languages have like 1/5 as many words as English. Repetition is a way of life for you, while for English speakers developing a sense for how often to re-use the same word is a major early part of developing good writing style. Small children are very repetitive, older ones go too far trying to add variety, then they tone it down and get better. (Or sometimes not. There are published authors who go to unintentionally humorous lengths to avoid re-using basic words like "said.")
>>101592040 kek
>>101592546 >doesn't speak any foreign language
>don't lie to me, i bet-
ack
>>101592338 >>101592506 Now I see why he never tests his own shit. Even if it was broken, how could he tell?
>>101592564 Knew you were the kek poster.
>>101592546
>>101589653 I have never run into this problem myself but I suspect it's a driver issue.
>>101590419 With a few hundred bucks you can buy 512 GiB RAM, which is enough to run it at 8.5 bits per weight. But then you can expect something like 0.2-0.5 t/s.
>>101590774 >>101590781 >>101590786 >>101590804 The problem with the proposed parallelization scheme is the synchronization overhead. You need to exchange (part of) the activations between GPUs and write back the results, which introduces non-negligible latency, especially on fast GPUs without NVLink. This is not much different from what --split-mode row already does, and there are considerable performance issues (though the multi GPU optimization is also poor).
>But what if, you take a MOE model, for example DeepSeekV2 236B, and split the different smaller experts across the gpus, so that they don't have to exchange information. Is this thinking flawed?
Which experts are selected is effectively random and determined by the routing layer if I remember correctly. But in order to do that, the results have to first be collected on a single GPU. So you're not really saving any I/O.
>>101592475 2x 3090 if your target quant fits into 48 GiB VRAM, 1x 3090 + 2x P40 otherwise.
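The 512 GiB figure checks out on the back of an envelope (a rough sketch that ignores KV cache and runtime overhead):
[code]
# ~405e9 weights at ~8.5 bits per weight
params = 405e9
bits_per_weight = 8.5
print(params * bits_per_weight / 8 / 2**30)  # ~401 GiB for the weights alone
[/code]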
Mistral Large 2 is now my main model for cooms. No more mischievous glints, she says in a husky voice, a smirk playing on her lips, eyes sparkling with mischief. There's a playful glint as she addresses the power dynamic, playfully smirking as she offers her ministrations. An audible pop and rivulets of—admit it, pet—the ball is in your court. It has none of that slop and even as a 48GB VRAMlet using a baby 2.75BPW exl2, it can fit 12k context @15t/s.
>>101592681 lock em in a hot room and sell me the fumes
>>101592496 Pretty much this. Although I'm starting to feel like a VRAMlet with 4.
>4x 3090s is now considered "VRAMlet"
>as if 1 wasn't pricey enough
no i will not dump retarded amounts of money onto a single-purpose machine i'd only use sparingly even if the models are appealing
>>101591941 Couldn't it be put in the context template?
>>101592681 LL and 3L tag teaming S
>>101592871 Also... isn't that the point of the "System same as user" option in ST, for this exact purpose? So you can fill in the system prompt and it treats the system prompt as the user message as well?
>>101592870 I mean, people spend more money on dumber hobbies. It really depends on how far you want to go. I started out running 4-bit pygmalion 6B on a Ryzen 2400G with 8 gigs of RAM and no GPU, before there was really any integration with anything, so I was basically using the 'chat mode' in the console. Then someone introduced me to koboldcpp, so I was running Llama 13B models on my gaming PC with a 1660 Super and 16 gigs of system ram. I didn't just up and drop 5 grand on building a server out of the blue. It was a gradual progression.
>>101592870 The more you buy the more you save
https://github.com/ggerganov/llama.cpp/pull/8676
Llama 3.1 rope scaling finally merged
Llama.cpp master branch has merged the fix for L3.1's issues with context beyond 8192; it should be working properly now. https://github.com/ggerganov/llama.cpp/commit/b5e95468b1676e1e5c9d80d1eeeb26f542a38f42
>>101592681 It's not brain damaged at 2.75 bpw?
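For the curious, a sketch of the frequency rescaling the PR implements, following the reference values from the Llama 3.1 config (factor 8, low/high freq factors 1 and 4, original 8192 context); treat it as an illustration of the idea rather than the exact kernel code:
[code]
import math

def llama31_scale_freq(freq, factor=8.0, low=1.0, high=4.0, orig_ctx=8192):
    wavelen = 2 * math.pi / freq
    if wavelen < orig_ctx / high:   # high-frequency dims: left untouched
        return freq
    if wavelen > orig_ctx / low:    # low-frequency dims: fully rescaled
        return freq / factor
    # in between: smooth interpolation between scaled and unscaled
    smooth = (orig_ctx / wavelen - low) / (high - low)
    return (1 - smooth) * freq / factor + smooth * freq
[/code]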
>>101592904 The more you buy, the more seeing shivers down the spine hurts.
>>101592681 Is it better than a 5bpw 70B? How much better? It's tempting to sell my 3060 and buy a second 3090
>>101593061 lmao so true
>>101589756 >>101590284 Calm down with the shilling.
My model ratings from recent tests for RP, run on 48gb vram:
1 - Mistral Large (Mistral-Large-Instruct-2407-123B-exl2, 3.0 quant). Just very good at natural language.
2 - Midnight Miqu - it's a slopmerge for RP and does its job.
3 - Llama 3.1 (4.5 quant) - It's clearly not designed to be a chatbot; replies are accurate but very robotic. Beat Mistral Large on knowledge checks and coding though.
4 - Nemo 12b - I don't know why this was even recommended to compete with the others.
waste of time - commandr
>>101592161 mikushitters and some guy named "petra"
I think this is the best place to ask: is there a way/program to make an LLM identify and tag several (thousand) images? doesn't have to be anything advanced, just tagging whatever it sees would already be a great help.
>>101593186 yeah, I'm pretty sure moondream 2 (small and good model) has a python script implementation, just make a loop and iterate over the folder you want to classify
>>101593186 the ponyfucker said he did some LLaVA work feeding it booru tags and asking it to describe the image to get a caption. He is kinda a retarded schizo and it isn't clear that was a better way of training than just using booru tags though
>>101593206 https://huggingface.co/vikhyatk/moondream2
here's the repository, the script is there
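Putting the two posts above together, the loop would look roughly like this; the encode_image / answer_question calls follow the usage shown in the moondream2 README, while the folder name, prompt, and sidecar-file naming are my own assumptions:
[code]
# Tag every .jpg in a folder with moondream2 and write a sidecar .txt per image.
from pathlib import Path
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vikhyatk/moondream2"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

for path in Path("images").glob("*.jpg"):
    enc = model.encode_image(Image.open(path))
    tags = model.answer_question(enc, "Describe this image as comma-separated tags.", tokenizer)
    path.with_suffix(".txt").write_text(tags)
[/code]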
>>101592986 No. The only error it makes is a misplaced punctuation mark once every 500 tokens or so, which is not much to complain about.
>>101593085 Despite my limited experience, I would say yes. Before Largestral, I would use Llama 3 70B finetunes for coom (New Dawn, Euryale). They were good, but had too much slop. With Largestral, no more spine shivers or any other GPT/Claude-isms. It's like I cured my model of its autism.
>>101592964 >>101592986 Again some problem with the llama.cpp tokenizer. Sane people should use the transformers tokenizer.
>>101593268 that literally has nothing to do with tokenization at all, it's about rope context scaling
>>101593153 >waste of time - commandr
Stopped reading right there
>>101593292 at the bottom of the message? Fucking retard
I still haven't found good settings for nemo. I don't like how moldable it is, or rather how superfocused it is on context patterns instead of instructions. For example, a different model (like llama-3) would give you lengthy responses naturally (unless you tell it not to), no matter how long your messages are. Nemo however will mimic your responses, and if you aren't putting much text in your messages, it won't do it as well.
>>101592383 that's an extremely specific answer, thanks a ton
>>101593219 >>101593206 Thank you, I'll take a look into it.
>>101593213 A shame how people tend to gatekeep these small things. I don't really blame him though, it's his work I suppose.
>>101593303 he's mistral nemo, please understand, they put their system prompts at the bottom
>>101589265 I remember in December 2022 doomers saying local GPT-3 (DaVinci) was "maybe 10 years away". I always knew these things were bloated as fuck.
doomer here, i'm going to make a prediction and say that agi is maybe 100 years away. 1000 years for coomable agi that fits into 10gb vram.
>>101593153 >Nemo 12b, I don't know why this was even recommended
Because of the allure of huge context length that was previously out of reach for people without much VRAM.
>to compete with the others
Assume people saying that were trolling or retarded.
>>101593374 Summer Dragon still hasn't been surpassed though, so...
>>101593392 Back then 175B seemed impossibly huge. I can't believe I'm running models close to that size on a simple $3k rig at home now.
Is it just me or does Llama.cpp take longer to compile than it did a few weeks/months ago?
Okay so... the base Mistral-Nemo model is much better at larger context sizes; the difference in understanding is massive. What causes this?
>>101593463 What are you saying? You're getting better results with base than instruct with large chat histories?
what does flash attention do?
>>101593547 https://arxiv.org/abs/2205.14135
>>101593452 It does now take longer with CUDA; make sure you instruct the build system to run multiple jobs in parallel, for example with -j 8.
>>101593547 Calculate a temporary matrix in small parts in fast but small memory instead of calculating and writing the entire matrix to large but slow memory. This requires more calculations, but on modern hardware the speed of calculation has been increasing much more than the speed of memory.
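A toy single-head version of that idea (process K/V in tiles with a running softmax, so the full score matrix is never materialized) looks like this; an educational numpy sketch, the real kernels do this per block in on-chip SRAM:
[code]
import numpy as np

def tiled_attention(Q, K, V, tile=64):
    # Online-softmax attention: per query row, keep a running max and a
    # running denominator so each K/V tile can be processed then discarded.
    n, d = Q.shape
    out = np.zeros_like(V, dtype=np.float64)
    row_max = np.full(n, -np.inf)
    row_sum = np.zeros(n)
    for s in range(0, K.shape[0], tile):
        Kt, Vt = K[s:s+tile], V[s:s+tile]
        S = (Q @ Kt.T) / np.sqrt(d)            # scores for this tile only
        new_max = np.maximum(row_max, S.max(axis=1))
        scale = np.exp(row_max - new_max)      # rescale earlier partial sums
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ Vt
        row_max = new_max
    return out / row_sum[:, None]

# sanity check against naive attention that materializes the full matrix
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 128, 32))
S = (Q @ K.T) / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
naive = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), naive)
[/code]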
>>101593513 Yeah. At larger contexts, instruct becomes dumb for me, skipping over events and getting completely lost in the plot, while the base model does not seem to have the same problem.
>>101593452 It's super annoying. I used to rebuild it every day before using it; now I only do it every other week or if I need compatibility with a new model.
>>101593463 You tested the base model? That's interesting. I suspect >>101399248. People's multiturn fine tuning data are constructed naively.
Largestral 2 is basically a non-dry and 10-15% smarter version of Wizard 2 8x22. At this point, there is no scenario that i test for that doesn't work very well with the model. Outside of external tool use and multimodality, is there anything else that a new model can really give when it comes to RP? I don't think so, only speed.
>>101593677 my brain looks like that (i use crack)
>>101593677 What quants do you run of both models?
I'm still using C-R+. Nothing has changed.
>>101593699 q4
>>101593690 based expert roleplayer
Is it possible to use nemo 12b on koboldcpp? Docs say GGUF only, but has someone already converted it?
>>101592087 He has a point in that having those tensors at a higher precision than the rest of the model makes the output better, yes, but that's something that most (all?) quants already do. The whole meme began when he claimed that having those layers at full precision gave better results than having them at q8 or whatever, which was demonstrably false. His whole "testing" was all vibes-based and non-reproducible.
>>101593836 https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF
https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF
https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF
https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF
https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF
>>101593865 thx anon
>>101593836 not really, you gotta either use a fork of koboldcpp or wait for the retard to implement the tekken token bs
>>101593865 nigger
>>101592180 *cums on you*
>>101593939 >you gotta either use a fork of koboldcpp or wait for the retard to implement the tekken token bs
>2 days ago
>https://github.com/LostRuins/koboldcpp/releases/tag/v1.71
>Merged fixes and improvements from upstream, including Mistral Nemo support.
You might be a little behind. I don't blame you, I've been using llama-server directly for months now, there's no reason to use kcpp really, so I get it.
>>101593939 >not really, you gotta either use a fork of koboldcpp or wait for the retard to implement the tekken token bs
are you mentally deficient?
>Merged fixes and improvements from upstream, including Mistral Nemo support.
https://github.com/LostRuins/koboldcpp/releases/tag/v1.71
>>101593677 What's crazy about AI videos is that within the bizarre surrealistic nonsense, each moment is still copacetic with the previous moment and the next moment. Truly nightmare fuel.
idc dont use koboldcpp
Just tested out 3.1 70B at IQ3_M (on latest llamacpp build). It's a bit faster than Largestral was at IQ2_M. Also does OK at the trivia question I threw at it, but it doesn't seem to be able to do the Castlevania question unlike full precision. Maybe if I go just a bit higher in quant.
>>101594001 >I was just pretending to be tarded
>>101593986 >there's no reason to use kcpp really, so I get it.
Actually, just to correct myself, there is one reason. They still have support for multi-modal, I believe, whereas upstream nuked it pending a refactor.
>>101594013 How charitable to assume he was just pretending.
>>101593725 Same but C-R