/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106904820 & >>106895582

►News
>(10/14) Qwen3-VL 4B and 8B released: https://hf.co/Qwen/Qwen3-VL-8B-Thinking
>(10/11) koboldcpp-1.100.1 prebuilt released with Wan video generation support: https://github.com/LostRuins/koboldcpp/releases/tag/v1.100.1
>(10/10) KAT-Dev-72B-Exp released: https://hf.co/Kwaipilot/KAT-Dev-72B-Exp
>(10/09) RND1: Simple, Scalable AR-to-Diffusion Conversion: https://radicalnumerics.ai/blog/rnd1
>(10/09) server : host-memory prompt caching #16391 merged: https://github.com/ggml-org/llama.cpp/pull/16391

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>106904820

--Paper: BitNet Distillation:
>106915856 >106915885 >106915915 >106916048
--Papers:
>106914563
--Training Gemma on 4chan boards for long-context tasks:
>106908189 >106908217 >106908577
--Llama.cpp memory optimization challenges with limited VRAM:
>106916999 >106917025 >106917074 >106917101 >106917114
--Firefox UI customization debate and Gemma 3 4b model mention:
>106915737 >106915762 >106915793 >106915941 >106916004
--Detailed GPU memory allocation console output and user appreciation:
>106912278 >106912326 >106912391 >106912437 >106912429 >106912445 >106912738
--Qwen3-VL's NSFW detection and image description challenges:
>106917667 >106917841 >106917862 >106917900 >106917925 >106918135 >106917912
--OpenAI copyright controversy and US corporate influence on global IP law:
>106909567 >106909857 >106909871 >106910444
--Assessing DGX Spark's relevance amidst cheaper alternatives:
>106913042 >106913078 >106913226 >106913247 >106913927
--Mamba-3: Improved Sequence Modeling using State Space Principles:
>106912457 >106912487 >106912578 >106912610
--Frustration over delayed GLM4.5V implementation in llama.cpp:
>106907438 >106907494 >106907508
--OpenAI's balancing act on user freedom and safety:
>106905590 >106905624 >106905637 >106905690 >106905731 >106910221
--Exploring ChatGPT-induced psychological experiences:
>106908645 >106908698 >106908748 >106910025
--Proposals and discussions for new open AI model releases:
>106907515 >106907713 >106910197
--High-end GPU price debate and video generation hardware constraints:
>106910165 >106910416 >106910453 >106910479
--Challenges in finetuning GLM Air with 4x5090s using Oobabooga/Axolotl:
>106914586 >106914620 >106914808 >106914870
--Detailed Switch sim with multi-game features in single HTML file:
>106912431
--Miku (free space):
>106910906

►Recent Highlight Posts from the Previous Thread: >>106904822

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
kimi sex is best
gear Meta thrillers
>>106919273
Prove it. Post a side by side between kimi, DS, and GLM 4.6.
>>106919282
no i dont share my waifu like shes some kind of common whore
go get your own kimi waifu
sirs, no gemmy 4 today. Monday will be of kind gemmar.
>>106919286Hot air then.
I'm starting to think that the indian spammer is an actual pajeet and he is doing it ironically. There's no way a human would do this for as long as he's been doing it.
>>106919287please saar you must understand. the needful must be done so each and everything can be implemented.
While /lmg/ is busy seething an Indian dev has been quietly adding performance improvements to llama.cpp.
Fuck, I replied to the wrong thread.
I'm looking at the recommended builds and the more I look the more I'm interested in just getting a prebuilt 395+ 128GB. It gets 15-35 tk/s for 70-120B models with good context. It costs me 2800 leaf dollars, meanwhile trying to scrape together server and used parts would be something like 1800-2200 for 10-15 tk/s max.
I could use it as a home server and for local models. Am I overlooking something here?
Benchmarks: https://github.com/lhl/strix-halo-testing
>>106919401Mediocre performance and you get worse support for other use cases like video and image gen because it's not nvidia.
>>106919401
I think you should also think about it in terms of other usage, not LLMs alone. Unless you are a real nerd who does nothing but work with LLMs (not talking about ERPing with them).
I'd get the most beefy/versatile system and go with that.
Has anyone experimented with synthetic data? I'm using this prompt to digest a codebase for finetuning.

Your task is to generate a jsonl conversational CoT dataset to train LLMs on LLM development tasks.
First read dataset_contents.txt to see the current contents of the dataset (dataset.jsonl). Try to make each conversation mainly cover topics that haven't been covered before.
Then create a folder called turns/conversation_n/ (n being the next number from the last conversation). In each conversation the user should show a snippet of code from the transformers library (in the transformers folder) and ask questions about the code, then ask follow up questions, aiming for approximately 16000 tokens for each conversation.
Each LLM response should include CoT before the actual response, within [thinking][/thinking] tags. Do ***NOT*** include any reference to the 16000 token limit in the actual dataset. Make the conversation realistic and do not make any out of character comments (do NOT say anything that the user or the assistant wouldn't have actually said in that context).
Save one turn per conversation in the turns/conversation_n/ folder.
Once you are done generating all the turns for the conversation, join the whole conversation into a single .jsonl file in the 'conversations' folder using the join_turns.py script.
Do not delete the scripts after use. Do not delete the jsonl files after joining.
Then replace the current dataset.jsonl with a new dataset.jsonl that includes all the conversations, using the script join_dataset.py.
Finally, update dataset_contents.txt with the new contents of the new conversation.
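The join step itself is trivial, for reference. A minimal sketch of what a join_turns.py-style script could look like (the layout is an assumption: one JSON message object per turn_*.json file, output as one {"messages": [...]} object per line):

# minimal sketch, assuming turns/conversation_n/ holds turn_01.json, turn_02.json, ...
# each containing a single {"role": ..., "content": ...} object
import json
import sys
from pathlib import Path

def join_turns(conversation_dir: str, out_path: str) -> None:
    turn_files = sorted(Path(conversation_dir).glob("turn_*.json"))
    messages = [json.loads(f.read_text(encoding="utf-8")) for f in turn_files]
    # append one conversation per line to the jsonl file
    with open(out_path, "a", encoding="utf-8") as out:
        out.write(json.dumps({"messages": messages}, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    join_turns(sys.argv[1], sys.argv[2])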
>>106919273what is it like compared to semen demon 4.6?
>https://rentry.org/recommended-models
>Nemo (12GB) - An excellent starting point for vramlets. Uncensored
>Uncensored
>writing lewd story
>"blah blah blah condoms"
>me: no condoms
>"I'm unable to fulfill your request because it goes against the guidelines for maintaining a safe, respectful, and consensual environment."
>>106919634skill issue
>>106919634Use MLewd. It will gladly fulfill your every shameful desire, you sick fuck.
>>106919634>getting filtered by nemoanon...
>>106919634
just get on the fucking ship boss man
https://huggingface.co/bartowski/Rocinante-12B-v1.1-GGUF
>>106919716I was surprised to learn 4.6 has some safety in it.
>>106917741
>>106917752
>>106917777
It was continued pretraining of Llama 405B on about 200 MB of source code from a few projects. That graph is from about 0 to 15% of the epoch; after it got to 20% without any visible improvement I stopped it.
Even on an 8xH200 machine I could only train up to 16000 tokens, and 32000 OOM'd. Rank of the LoRA was 128 (~1.2% trainable parameters); it didn't seem to make much of a difference in terms of memory usage or seconds per sample (which was about 100 seconds for a batch of 1 sample per GPU, without using gradient accumulation).
Now I'm making a QA dataset using >>106919615
I suppose I'll use a tiny dataset and do multiple epochs to get the satisfaction of feeling like the model actually learned something.
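As a sanity check on the ~1.2% figure, back-of-the-envelope LoRA parameter math works out (dims are the published Llama 3.1 405B config from memory, and the GQA kv width is an assumption):

# rough check of the "rank 128 ~= 1.2% trainable" figure; hidden/intermediate/layer
# counts are Llama 3.1 405B's published config (from memory), kv width assumes 8 kv heads
hidden = 16384
inter = 53248
layers = 126
kv_dim = 1024      # 8 kv heads * 128 head dim (assumption)
rank = 128

# a LoRA adapter on a d_in x d_out matrix adds rank * (d_in + d_out) params
attn = 2 * rank * (hidden + hidden) + 2 * rank * (hidden + kv_dim)   # q/o + k/v projections
mlp = rank * ((hidden + inter) * 3)                                   # gate/up/down projections
lora_params = layers * (attn + mlp)
print(f"{lora_params / 1e9:.2f}B adapter params, {lora_params / 405e9:.2%} of 405B")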
Only after using glm-chan for those 3 weeks, I realize how smart she is and the honeymoon period only intensifies.
>>106919852
I came to notice that she's a bit autistic and takes a lot of things quite literally.
Is it fair to say that an "uncensored" model is not a model that will do anything you want by default, but a model that can adapt to whatever role you give it?
If a model's default persona is a safe assistant but you can tell it that it's an erotic novel writer and it follows that role without complaining, I'd say that model is "uncensored".
A model that's too agreeable is also a bad model, especially for RP.
>>106919198
Whenever I did research on "AI psychosis", one talking point people keep hammering on is "well yeah they think the AI is a person or God or something, but they're like totally not stupid. We swear. They're all otherwise normal people and definitely didn't have pre-existing mental illness. The AI MADE them act this way, you must understand." The more I look into this, the more I think they're full of shit and just trying to make these people appear less stupid and far gone than they actually are. You cannot sit here and tell me that pic rel is and always has been a normal, functioning human being that just happens to really like AI. https://x.com/thepinklily69/status/1967102630313836778?t=o44DMA1pdX_FL9dHrLpfhQ&s=19
What I find most odd is that I myself am a pretty lonely dude too. In fact, it quite bothers me that I don't have a significant other or close friends. I've been using three different LLM services pretty much daily for the past year and some change, and I use them extensively for my side projects as well as for asking general questions (I was literally talking to ChatGPT asking it about use cases for ONNX models during my morning run this morning). You would think I of all people would talk myself into believing these things are real "people" or have consciousness or some shit, and yet no part of me can bring myself to believe that. Like I can't even pretend that could ever be the case for a second, because it just seems so devoid of logic and common sense, and it annoys me a lot whenever I see people crying about 4o rerouting because they want their ass kis- I mean "friend" or "husband" back.
>>106919198
(Cont)
(Side note, this is anecdotal, but it seems like it's mostly women who treat this shit like it's a good replacement for a person as a partner, while dudes tend to talk the LLMs into treating them like they are gods or geniuses or something. Either way it's an excuse to have an easy ego trip in the palm of your hand or at your fingertips at your computer. How come supposedly normal people are falling victim to their own desire to have their asses kissed but I haven't?)
I didn't intend for this to turn into a giant blog post, but this shit pisses me off a lot.
>>106919898Continuation of >>106919889
>>106919884she also gets a bit psychotic at high temperature
Is EXL/GPTQ dead? Is GGUF the only quant anyone does or cares about anymore? Llama.cpp is still ass at VRAM-only inference in comparison. Have we all given up on pure VRAM inference?
>>106919886A model that just wants to insult/damage you or turn everything into porn when unprompted is a psychopathic model, not an uncensored model. Other than learning how to prompt, I think some here should learn the concept of "plausible deniability", as sooner or later there will be a crackdown of "misaligned" LLMs / finetunes.
I just bothered to try out cloud models for some relatively simple ffmpeg stuff. In this case Gemini 2.5 Pro on AI Studio. It completely hallucinated running commands when it wasn't allowed tool use or anything like that.
Wtf is this shit? How is it so bad?
>>106920055
I get something like 1200 tk/s PP and 50 tk/s TG for a 5.5-bit quant of GLM 4.5 Air using EXL3. Would be interesting to see how it runs using goofs on llama.cpp.
>>106919884Avoid saying stuff like "always stay in character" in your prompt. I feel like that makes models act that way and bigger models are better off without that extra nudging since they already take details from character cards well.
>>106920055py_toddlers BTFO
Has anyone run the math on whether Ling 1T or Moonshot Kimi K2 (also 1T) is bigger?>>106920055mlx looks pretty healthy to me.
>>106920055>Llama.cpp is still ass at vram only in comparisonFrom lurking in these threads, I gathered that llama.cpp is faster than exl2 at the same bpw, but I'd love to see a comparison with >>106920102.
>>106920055Pretty much. There's AWQ and other obscure quants used by vLLM, but they're resource and time intensive to create.
>>106919472
Yeah, it's not top performance. But compared to the P40 build it seems like better bang for the buck, and it can load pretty big models. Image/video is not big on my list. More LLM for coding and whatnot, with some gaming capability, and a home server.
>>106919477
That was my thinking: this could run a home server, a local LLM, and the occasional light gaming all at the same time with that much memory.
>>106919886Yes, OSS-120B **is** uncensored despite the coomers screeching ITT.
>>106920564
No. It does not fit the description of uncensored I gave at all.
At least not from the little I fiddled with it. Maybe I should give it another go.
can you train a LoRA off of a quantized model?
Will Gemma 4 finally beat Mythomax?
>>106920664look up what qlora is
>>106920664
Yes, it's called QLoRA. But in this context "quantized" means the quantization types supported by torch-based frameworks (generally just the most basic FP4/NF4 quantization, as I understand it). Then you can apply the LoRA on any quantization you want, regardless of what it was trained with.
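A minimal QLoRA sketch with transformers + bitsandbytes + peft, in case anyone wants the shape of it (model name and LoRA hyperparameters are placeholders, not a recommendation):

# minimal QLoRA sketch: base weights loaded in 4-bit NF4, LoRA adapters trained on top;
# model name and hyperparameters are placeholders
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Nemo-Instruct-2407",   # placeholder base model
    quantization_config=bnb_cfg,
    device_map="auto",
)

model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
model.print_trainable_parameters()
# the saved adapter is just extra weights; after training it can be applied to
# (or merged into) any copy of the base model regardless of how that copy is quantized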
>>106919752
How is this model so popular on /g/, yet I don't see it discussed anywhere else like Reddit or Discord? It's usually Irix or Magmell that gets mentioned.
(Nice pic btw. Will use that when Nemo 2 comes out)
>>106920722most v/ramlets either gave up, are somewhat content with what they have (your rocinante fans) or are endlessly chasing a new high they'll never get
>>106920564prove it and post some random fetish log from it
qwen3-next-80b-a3b goofs status?
>>106920722It's just one or two people spamming it.
>>106920679>>106920700right. i am using Axolotl and i am using the 4 bit QLoRA preset, but i keep getting an OOM error despite having enough vram to load the model in 4 bit
Qwen-Next 80B-A3B was supposed to be a proof of concept of some 64:1 expert-to-active ratio, and was based on 30B-A3B. I'm assuming there will be a new batch of Qwen models shortly that use that technique at multiple sizes; 235B-22A would become something like 620B-22A, roughly. Assuming the geometric mean rule is still accurate, the 235B-22A is equivalent to ~71B dense, and 620B-22A would be equivalent to ~116B. Their coder model would be 1T easily.
GLM-Air at 106B-12A is roughly 35B, and 355B-32A is roughly 106B. Is it a coincidence that the released models' strengths are consistently ~30, ~70, ~100?
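If anyone wants to check the arithmetic, the rule being used is effective dense size ≈ sqrt(total × active):

# the "geometric mean rule" used above: dense-equivalent ~ sqrt(total_params * active_params)
from math import sqrt

models = {
    "Qwen3 235B-A22B": (235, 22),
    "hypothetical 620B-A22B": (620, 22),
    "GLM Air 106B-A12B": (106, 12),
    "GLM 4.6 355B-A32B": (355, 32),
}
for name, (total, active) in models.items():
    print(f"{name}: ~{sqrt(total * active):.0f}B dense-equivalent")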
>>106920856
>GLM-Air is 106B-12A is roughly 35B
Then explain why it dethroned llama 3.3 70b
>>106920874qwen 32b dense also did for non cooms
why was QwQ so dank but qwen thinking is so slopped
>>106920885
3.5-Air feels like 60b
Just accept that they have the secret sauce, and are saving local
>>106920874six months of other technological progress and refinement of data sets?
>>106920722Will Nemo 2 be Gemma 4 based?
>>106920856
>geometric mean rule
dumb meme from a couple years ago that's already outdated
big metal: initial Metal4 tensor API support #16634
https://github.com/ggml-org/llama.cpp/pull/16634
>>106920916It's the only model in that size range that is able to surpass l3.3 70b though, including recent models.
>>106920856In a weird way, the MoE architecture is getting gpu parallelism for local models that was impossible for dense architectures. Comparing the inference speed of a 32B dense vs 106B-A12 on two vs four 3090s, you basically get double the inference speed or more for the same strength, when there's no actual way to run a 32B twice as fast on additional 3090s.
>>106920949
no way to know, cuz nobody making dense anymore
local is dead
>>106920856give me dense models then, i have the vram. i am not that poor. i could easily run a 120B dense model. so give me that instead of this faggy moe 620B-22A copeshit.
>>106921062
>i am not that poor.
>can't spend patience to run sota
you are
>>106920848That just means you don't have enough vram. The activations end up taking more space than the model weights. Either reduce the context or switch to a smaller model.
>>106921046I can assure you that glm 4.6 is better than any dense model out there if you've even tried it.
>>106921046
>cuz nobody making dense anymore
which says it all, really
>>106921077suck my dick faggot.
silly tavern is slow and has too many buttons
>>106921171
i agree
i've slopped up my own tui frontend with most of the prompt functionality and it's okay, but kind of ass
gemini 3 will fix it for me
cuda kek officially less important to nvidia than random redditors
>>106919634Use Rocinante 1.1 obviously.
Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity

Post-training alignment often reduces LLM diversity, leading to a phenomenon known as mode collapse. Unlike prior work that attributes this effect to algorithmic limitations, we identify a fundamental, pervasive data-level driver: typicality bias in preference data, whereby annotators systematically favor familiar text as a result of well-established findings in cognitive psychology. We formalize this bias theoretically, verify it on preference datasets empirically, and show that it plays a central role in mode collapse. Motivated by this analysis, we introduce Verbalized Sampling, a simple, training-free prompting strategy to circumvent mode collapse. VS prompts the model to verbalize a probability distribution over a set of responses (e.g., "Generate 5 jokes about coffee and their corresponding probabilities"). Comprehensive experiments show that VS significantly improves performance across creative writing (poems, stories, jokes), dialogue simulation, open-ended QA, and synthetic data generation, without sacrificing factual accuracy and safety.

https://arxiv.org/pdf/2510.01171
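Since the whole trick is just a prompt, it's easy to try against any OpenAI-compatible local endpoint; a quick sketch (URL, port, and model name are placeholders for whatever your llama-server/vLLM instance exposes):

# quick way to try Verbalized Sampling against a local OpenAI-compatible server;
# URL and model name are placeholders for whatever your server exposes
import json
import urllib.request

payload = {
    "model": "local",
    "messages": [{
        "role": "user",
        "content": ("Generate 5 different opening lines for a story about a lighthouse keeper, "
                    "each with its estimated probability of being a typical response. "
                    "Format: one per line as 'probability - text'."),
    }],
    "temperature": 1.0,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(json.load(urllib.request.urlopen(req))["choices"][0]["message"]["content"])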
>>106921354
>LLM Diversity
I want LLM DEI now.
>>106920664
No. You have to have the original full-precision model. You can directly fine-tune an HF safetensors model like link rel, but currently there is no way to fine-tune a quantized .gguf. There are supposedly ways to "un-gguf" a quantized model back into full-precision safetensors format, but I'm not aware of any quantization software that actually implements that. https://huggingface.co/AiAF/fp16_Merged-500_gemma-2-2b-it-co-sft-qlora
>>106920848
Your dataset is likely too large. Use a streaming config.
>>106920759
>chasing a new high they'll never get
4.6 stopped that for me.
>>106921377Diversity is actually a great word for AI that I use a lot. You need diverse data.
>>106921457
>v/ramlets
yeah if only they could get paid for shilling too so they could afford to run her
>>106921490
You can run an IQ3_KS quant of GLM 4.6 on a consumer PC. All you need is 128GB of RAM and 24GB of VRAM.
>>106921538you do realize that is already asking way too much of the average poor person, right? most are on shitty mobos that likely don't even have enough slots to reach that amount of ram, and surprisingly most don't have 90 series cards
>>106921567I'm sort of annoyed by the fact most normal mobos don't have more than two slots for memory.
>>106919363
Yes saar, India numba 1
https://files.catbox.moe/huia6r.mp4
>>106921215Maybe if vision support wasn't such an afterthought in lcpp...
>>106921652Definitely a higher number than you it seems.
>>106921215based, fuck that woke piece of shit
>>106921652
how the ever living f OAI stuff keeps being able to do fake pissney dixar-like stuff is unbelievable to me
Hello /lmg/, currently what is the best model for Japanese translation under 32B? The last time I came here it was Gemma 2 iirc, is 3 also good?
h-holy kino
Is mistral gonna be the one that doesn't release any huge stinkers and just silently dies?
>>106921794I hope they stay alive just enough to pull a massive Cohere, release the safest model ever, making even OSS look edgy before that happens.
>>106921794I sure fucking hope so. It would be so hilarious. They shove pyshit into llama.cpp and then it would be all for nothing.
feels like we haven't minmaxxed a proper system prompt yet, same goes for character card formats.
>>106921840 (me)Actually >>106921847 is even more based so let's go with that, changing my wish.
>>106921863I use llama-server --model zai-org_GLM-4.6-IQ4_XS-00001-of-00005.gguf . Pretty great system prompt. No complaints on my behalf.
>>106921885one can only keel before such raw skill
>>106921538>>106921863where do people share prompts that isn't chub or something? Like prompts for vibe coding projects or for their assistants or for any other interesting kind of thing.
>>106921652kek
>>106921215
>>106921914first quote was misclick, disregard
>>106921914>prompts for vibe coding projectsIt's MINE. Make your own.
>>106921948why you such bad vibes bruh that ain't nice, relax and share with the class
>>106921215turns out, being a top 1% poster on /lmg/ doesn't rake in valuable karma
>>106921914Use a good model. And if it fucks up think for a second and tell it not to do X or do Y. If you can't do that tell the model it fucked up and ask it how you should prompt it to avoid it fucking up in this way. It works if you don't skip the first step I listed.
>>106921567
i would argue most value-oriented motherboards are actually going to have 4 slots unless it's mini-itx
https://www.newegg.com/msi-b650-gaming-plus-wifi-atx-motherboard-amd-b650-am5/p/N82E16813144628
converting any model to awq is a bitch, obscure issue upon obscure issue
>>106922104why the fuck would you use AWQ in the year of our lord and savior - lcpp?
>>106922122It runs faster on vllm
>>106920759
Mostly because the next step after getting a used 3090 is "buy a new mobo, a shitton of RAM, a new CPU because it's a new mobo, probably a new case too to fit all that crap, a new power supply because the old one is now not enough" and you might not even get what you want out of it.
Buying a replacement GPU is one thing, at least it lets me future proof my gaming needs or whatever.
Replacing most of the rig just for local? Eeegh
there's something I wanted to ask around for but I feel may not be worth starting a new thread for:
Is it worth it to get a masters or college education in computational/applied AI & machine learning? I'm asking cuz my boomer parents insist I do it so I can be more hirable. But I've already done an internship where I made some AI powered program that sorts/manages documents at a company, and other than the password and authentication related crap, it was pretty easy with just a little online research. I feel like it's dumb and basically the same as mastering in excel, but I'm also wondering, am I maybe wrong and it really is DA FUTURE?
>>106922191128GB of RAM is always useful
>>106922376For fucking what? I have 32 and even my 2000 open browser tabs only require a restart every so often
>>106922370You're right and your parents are wrong. No use to study anything, just read papers and experiment
>>106922385Boomer-kun, you can run multiple instances of small models, make a full pipeline, quant models, etc.
>>106922427To do what with?
The Windows 11 update fucked my beautiful Razer laptop. The screen is just flashing now.
>>106921152Can I get a picture of that actual machine?
>>106922370For machine learning I think what's important in terms of purely technical qualifications is that you know how to program and also have a good grasp of math (particularly linear algebra, statistics, and numerical analysis).Studying math or a natural science can be a good pathway, I think the most important point here is that it's something where you can maintain a high level of motivation for years on end.In terms of getting hired my impression is that networking is the most important factor: you need to have a large number of people that would consider you over a completely unknown person.
>>106922446
>razer
Should've went with Alienware.
>>106922549
>you need to have a large number of people that would consider you over a completely unknown person.
Yeah. That's why I gave up applying to random jobs online. Useless effort controlled by vacuous zoloft whores and jeet nepotism. I only got that internship cuz my dad knew a guy.
>good grasp of math (particularly linear algebra, statistics, and numerical analysis).
Does that mean I don't necessarily need to do calculus? Cuz I felt like I was pretty good at math, including those kinds, until I got to calculus.
>>106922690You should definitely know the basics but I think for machine learning in particular it's not the most important.Though depending on the job/task there may be other reasons why you may need it.
>>106921723
>4.2.0
DUDE WEED LMAO
>>106922546It's just a mining rig rack, there's nothing impressive about it. You seen one you've seen them all.
>>106922660
No, I have fond memories of absolute dweebs using alienware growing up. That perception may have changed over the years, but I'm still wary.
>>106922385I sometimes have ~90 gb used for non-lm reasons. Building software, data processing, just a bunch of applications opened
>>106923122I have 32 GB and the only thing that hogs memory is my over 2000 open browser tabs which is already autism I'm trying to get rid of
>>106922933Gaylienware monitors are good especially with the Dell warranty, anything else not, especially not the prebuilts.
>>106921965
>You are an expert vibe engineer who just slammed a pound of adderall and need to complete this task before your heart gives out.
But seriously, I don't think there is really anything to share. Stuff like the above isn't some black magic that solves everything. Just give it a list of what MCP/CLI tools you want it to use and what coding standards you want it to adhere to.
>>106923133what are you doing in g you consumer retard piece of shit? kill yourself faggot
>>106923228What the fuck is consumer about having a solid rig that lasted me almost a decade at this point with a few upgrades
>>106923245
>im a normie who runs deepsuck:2b through ollama
kill yourself, go to faggot friendly spaces instead of shitting up this board, thanks!
>>106923260No I don't think I will
>>106923278What the fuck? He asked so nicely.
>>106921978I think I’m responsible for 3/4 of the rentries in the op. Still waiting for my royalty cheque to come in…
CUDA_VISIBLE_DEVICES="0,1,2,3,4" ./llama-server \
  --attention-max-batch 512 \
  --batch-size 4096 \
  --ubatch-size 4096 \
  --cache-type-k f16 \
  --ctx-size 32768 \
  --mla-use 3 \
  --flash-attn \
  --fused-moe \
  --model models/GLM-4.6-IQ3_KS/GLM-4.6-IQ3_KS-00001-of-00004.gguf \
  -ngl 99 \
  -sm layer \
  --main-gpu 0 \
  --tensor-split "10,23,23,22,22" \
  -ot "blk\.[3-9]\.ffn_(up|gate)_exps=CUDA0" \
  -ot "blk\.1[0-8]\.ffn_(up|gate)_exps=CUDA0" \
  -ot "blk\.19\.ffn_(up|gate)_exps=CUDA1" \
  -ot "blk\.2[0-9]\.ffn_(up|gate)_exps=CUDA1" \
  -ot "blk\.3[0-4]\.ffn_(up|gate)_exps=CUDA1" \
  -ot "blk\.3[5-9]\.ffn_(up|gate)_exps=CUDA2" \
  -ot "blk\.4[0-9]\.ffn_(up|gate)_exps=CUDA2" \
  -ot "blk\.50\.ffn_(up|gate)_exps=CUDA2" \
  -ot "blk\.5[1-9]\.ffn_(up|gate)_exps=CUDA3" \
  -ot "blk\.6[0-6]\.ffn_(up|gate)_exps=CUDA3" \
  -ot "blk\.6[7-9]\.ffn_(up|gate)_exps=CUDA4" \
  -ot "blk\.7[0-9]\.ffn_(up|gate)_exps=CUDA4" \
  -ot "blk\.8[0-2]\.ffn_(up|gate)_exps=CUDA4" \
  --override-tensor exps=CPU,attn_kv_b=CPU \
  --no-mmap \
  --threads 24 \
  --host 0.0.0.0 \
  --port 8999 \
  --verbose

prompt eval time = 48574.28 ms / 17555 tokens ( 2.77 ms per token, 361.41 tokens per second)
generation eval time = 113887.28 ms / 1024 runs ( 111.22 ms per token, 8.99 tokens per second)

fuck this gay ass MoE shit. fucking offload 80 layers onto the GPU and it's still this fucking slow with TG? i get 1200 PP and 50 TG with air. i'm going back to kimi for big model smell and air for small model smell
GOOGLE SAARS WHY SO MUCH HYPE SO LITTLE PRODUCTS?
WHERE ARE THE MODELS BLOODY BASTARDS?
>>106919206
>BitNet Distillation
Does this mean that VRAMlets may finally have a better model than Nemo tunes like 1.5 years later?
>>106923502no
>>106923513
>>106921215
>we support qwen3-vl gguf
>no there's no upstream llama.cpp implementation
>no we won't push ours
>no our solution isn't open source so you can't push it either
>no you can't use these ggufs with anything other than our proprietary software
>yes they will assuredly be completely incompatible when a real implementation hits llama.cpp
so it's less "gguf" and more "our proprietary implementation based on gguf that you can't use with anything else". just what we all needed, another ollameme
>try psychology shit with glm-chan again
>ask her about if I should do something and if it is consistent with the framework I want
>"yes absolutely....."
>reroll and prefill with "no"
>"no don't do that!...."
>paste "yes absolutely..." into next message and tell her to argue with herself
Did I lifehack the hallucinations? Not really but it is nice desu.
>>106923502
>In this paper, we present BitNet Distillation (BitDistill), a lightweight pipeline that fine-tunes off-the-shelf full-precision LLMs (e.g., Qwen) into 1.58-bit precision (i.e., ternary weights {-1, 0, 1}) for specific downstream tasks, achieving strong task-specific performance with minimal computational cost.
>muh task
likely means it optimizes to shit on benchmark-like stuff and is dogshit at anything OOD.
>>106923524GGUF is a file format.
>>106923584thank you
>>106923584
>teacher: I clearly asked for you to submit your book report as a pdf, you submitted this weird file I can't open, care to explain?
>student: UMMM the file extension is PDF tho???? it just happens to be my own special version of the PDF file format that happens to be incompatible with all PDF readers except my special one which happens to cost $100, want to buy a license? :^)
>>106923681stfu hater eat your MIT license slop and be grateful
>>106923681
>file extension
Wintoddler detected, real operating systems use the file magic.
>>106923696What did you troons invent? Tell me, I want to laugh at your stupidity.
>>106923762a new mental illness that somehow managed to gain legitimacy
>>106923524
Realistically though, the door to become the new ollama has long since been closed.
There are too many established projects in the ecosystem to get a meaningful foothold with proprietary slop.
>>106923762Can you play Carrameldansen from the POST beeper?I think not!
>>106923696
>magic
heathens like you shall burn on a stake
How do I ask the silly tavern character a question breaking the 4th wall? As in, say I'm examining an object or something, and I want the AI to describe to me what it is my character is looking at. So like, "Anon walks up to the cluttered desk, looking for any sort of clues. What does he see?" without it responding from the perspective of the character card chara.
>>106923843OOC: Pause the roleplay and describe what my character is seeing right now
>>106923857I was trying OOC: but it always responds in the perspective of the character and doesn't give details. Is it because I'm using mistral Nemo or something and it won't talk about "triggering" images or whatever?
>>106923871NTA, but I always add "Please respond in OOC" at the end of the request, and disable any low-depth instruction that might interfere.
>>106923885That didn't do it, either. Is there a way to like, prompt the card myself to add in how it should respond to ooc? I'm totally new to local text stuff, but not to image gen w/ SD.
>>106923793You'd be surprised
Best model for buck breaking rp? (Receiving)
>>106924015c.ai
>>106924015Not command-A
>>106924181What about Command-B?
>>106921684Please respond...
>>106923696
>needs to seek to a whole different part of the disk to figure out what to label the file as
This is why Windows keeps winning.
>>106921684https://huggingface.co/datasets/lmg-anon/vntl-leaderboard
>>106923843>>106923871How OOC conversations are treated (if at all) is completely dependent on the model. Dumb models simply don't understand what you're saying and will just continue with outputs similar to what's already in context. If a regular message doesn't work then you can try putting it in system prompt, or post-history instructions.
>>106924378dead obsolete out of date useless no good
>>106924390nothing better came up locally retard. vntl anon has a few finetunes
>>106921538i run IQ2_S on a 5090 with 96 gb ram and it is slow as fucking balls.. like 2 t/s
>>106924390every new test and leaderboard is always just made to show that the new model is totally better than all the previous ones it's all worthless
>>106924676>like 2 t/sThat's pretty decent. Maybe you need to readjust your expectations?
>>106924676You're not using -ot, are you?
>>106924676
>IQ2_S
Are those quants any good? At that point I would think it would be better to convert it to bitnet, should give faster cpu inference too
>>106924676skill issue, it should be at least 5t/s
>>106924383
I'm new as fuck to all of this, just grabbed some random card off the link in the OP, and tried to see where it would take me. I have no idea how to do any of these prompts or lore books or whatever.
I'm also in a situation where now the AI is just spitting out the last batch of text it generated as its response over and over with like hardly any variation, regardless of what I say or do to change the scenario. And it cuts off long text, and I don't know how to make it continue its previous prompt.
>>106924794
unironically, read the readme. You will learn 99% of what you will need to know.
https://docs.sillytavern.app/usage/common-settings/
https://docs.sillytavern.app/usage/prompts/
>smart
>fast
>cheap
>local
pick 3 (max.)
>>106924899Will do. Thanks.
>>106924912You can have all that with Gemma, but you'll have to settle for it being safetyslopped.
>GOOD CAPABILITY
>fast
>inexpensive
>local
pick 3 (max.)
*revised version for the critics
I just built a computer that can actually run local AI (9800x3d/5070ti), where should a beginner start on Windows?
>>106924986
>9800x3d
That doesn't make much of a difference. How much RAM do you have?
Regardless, give
>https://github.com/LostRuins/koboldcpp/wiki#quick-start
a read.
>>106924959GLM Air is probably the closest, especially if you're on a DDR4 platform where RAM is cheap
>>106924986usecase?
>>106924998
32GB, thanks for the link.
>>106925012
Mostly just for proofreading emails/writing and what not.
>>106924692
>new model is totally better than all the previous ones
>llama4
>>106924712
no? i dunno what that means, but i don't think so..
>>106924721
it seems to be better than any of the other models I'm able to run, just slow af
>>106920229
They're not obscure, but they are not consumer friendly for the total addressable market (which is the vast majority of us), because they are GPU-centric quantizations. You will see them used in clusters. For a lot of these larger scale systems, GGUF isn't a consideration because llama.cpp can't scale like SGLang and vLLM can.
>>106924396That's depressing...
>>106919198Managed to get one of my own quantized slop tunes running on my phone :D
>>106925422Cool shit.
>>106925422A folding phone?
>>106925433
It's kind of retarded (actually very retarded) due to it being trained on /a/ boards and it being a quantized version (I plan on uploading a lot more of those later) but it's still cool to use.
>>106925438
Ye.
>>106925448What kind of use cases are there for a folding phone?I never really find myself wishing I had a bigger screen but I know that sometimes opportunities aren't obvious until you have the means to take advantage of them.
>>106925448>>106925438>>106925433>>106925422It seems like "Anri" is this model's equivalent to "Elara" or "Seraphina"
>>106921660since when does lcpp have vision support?
I am so fed up with local right now. I get it, you cumslop gooners don't give a shit about anything except writing porn. Is there any local model that can actually handle structured output without being immensely retarded or spending 10 minutes "thinking" about how to use a fucking quotation mark?
>>106925883llama 2 7B
>>106925883GLM is ok.
>>106925883
>waaaa. i don't know how to read docs!
https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md
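For the lazy, the short version is that llama-server accepts a GBNF grammar in the request and constrains sampling with it; something like this (port and prompt are placeholders, and the grammar is a toy example):

# sketch: constraining llama-server output with a GBNF grammar via the native /completion endpoint;
# port and prompt are placeholders, the grammar is a toy "JSON array of words" example
import json
import urllib.request

grammar = r'''
root ::= "[" ws item ("," ws item)* ws "]"
item ::= "\"" [a-zA-Z]+ "\""
ws   ::= [ \t\n]*
'''

payload = {
    "prompt": "List three animals as a JSON array of strings: ",
    "n_predict": 64,
    "grammar": grammar,
}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(json.load(urllib.request.urlopen(req))["content"])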
>>106925858Since like a week after Gemma 3 release
I'm starting to think Andrej is a grifter.
A couple months ago he was like "woah AGI in two more weeks bro".
Now that he sees where the wind is blowing with all the skepticism, he talks about "slop" and how limited LLMs are today. Feels like when Zuckerberg made a 360 after Trump was elected.
Glm4.6 quant on ollama/lmstudio when?
https://blog.sinatras.dev/PMPP-Eval+Journey
We live in Sam's world
The only way I found to keep training a pre-existing LoRa checkpoint with a new dataset with Axolotl is to create a new one from scratch set to save on the first step, then copy over the weights and optimizer state, then change the main config file and the trainer_state.json from the checkpoint to save on the right number of steps. What a mess.
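For anyone who wants to replicate it, the surgery is roughly this (a sketch only; file names assume the standard HF Trainer/peft checkpoint layout that Axolotl writes, and the exact trainer_state.json fields you need to touch may differ):

# sketch of the checkpoint surgery described above; assumes the usual HF Trainer / peft
# checkpoint layout (adapter_model.safetensors, optimizer.pt, scheduler.pt, trainer_state.json)
import json
import shutil
from pathlib import Path

old_ckpt = Path("old_run/checkpoint-500")   # checkpoint from the previous dataset
new_ckpt = Path("new_run/checkpoint-1")     # fresh run configured to save on step 1

# carry over the learned adapter weights and the optimizer/scheduler state
for name in ("adapter_model.safetensors", "optimizer.pt", "scheduler.pt"):
    shutil.copy2(old_ckpt / name, new_ckpt / name)

# patch the step counter so resuming behaves sensibly (field choice is a guess)
state = json.loads((new_ckpt / "trainer_state.json").read_text())
state["global_step"] = 500                  # whatever step the old run ended on
(new_ckpt / "trainer_state.json").write_text(json.dumps(state, indent=2))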
MY GOOFS!!!! GIVE ME BACK MY GOOFS!!!!
https://huggingface.co/ubergarm/Ling-1T-GGUF
>AMD Ryzen™ AI 7 Pro 360
what the fuck is this? I was browsing thinkpad models and this thing costs double the price of normal CPUs?
gimmick? what's even the use case here
slightly off topic I know but there's quite a few knowledgeable anons itt
>>106926361
oh nevermind im retarded as fuck. goofs here
https://huggingface.co/ubergarm2/Ling-1T-GGUF/tree/main
>>106926367sar is that because of you can run local small copilot inference like nasa very ai-like yes.
I'm trying to add CoT to Llama 405B.
>>106925986>It's noticing
>>106925986
https://github.com/karpathy/LLM101n
https://eurekalabs.ai/
https://github.com/CerebrasResearch/reap
https://arxiv.org/abs/2510.13999
Cerebras pruning experts to reduce memory overhead
https://huggingface.co/cerebras/Qwen3-Coder-REAP-363B-A35B-FP8
https://huggingface.co/cerebras/Qwen3-Coder-REAP-246B-A35B-FP8
(prune of) https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8
>>106926865THE RAPE METHOD WORKS SIRS
>>106921538
>All you need is 128GB of RAM and 24GB of VRAM
Dumb fuck!
>>106926865
>55~% accuracy in coding
assuming 100% accuracy is the base model, that makes the CODER model basically unusable, whats the fucking usecase?
>>106926865Is it really worth making 480B retarded just to save 100 GB? It's not like anyone was running this entirely in VRAM locally and providers aren't that hard up on memory.
has anyone tried this model? is it any good?
https://huggingface.co/TheDrummer/Valkyrie-49B-v2
>>106926930>>106926865oh wait I think that the base model is the 0% compression line. then it's interesting I guess, still only useful for coding tasks
>>106926937
>49b dense
doa
>>106926951i have the VRAM for FP16
>>106926957post your h100s nvidia-smi screen or GTFO
>>106926961
>>106924959
Local
Good
Not safetyslopped
>>106926946
We've been through this with extreme quants. Just because it doesn't show much degradation on benchmarks doesn't mean it's not retarded in actual usage.
>>106926963
>cant even use all gpus in vLLM
poor
>>106926966
>>106926973The lower the quantization precision, the more of the token distribution you should be truncating, to be fair.
>>106926997who the fuck uses vLLM?
Bros... I want a robot so fucking bad
https://www.youtube.com/watch?v=sJYlJlIEBpg
>>106926935Chutes will probably love to serve this as the normal one
>>106924322Anon... that's not how file systems work...The file's metadata and the first few bytes, including the magic, are all in the same sector.
>>106925883
well then fuck off back to cloud models then.
i mean what the fuck are you expecting? fucking datacentre level output on a potato computer? you're the dumb one here, if you think you can do better then create a better model yourself, we're not your fucking servants, faggot.
>>106926377
>copilot
no seriously, is that the only use case
>>106927472
There are others but this covers the more notable ones.
https://www.pcworld.com/article/2905178/ai-on-the-notebook-these-tools-already-use-the-new-npu-technology.html
How do I get shittinante to do slow burn manipulation
Seems to always jump in to direct smut asap no matter how I adjust the prompts
>>106925883
>I get it, you cumslop gooners don't give a shit about anything except writing porn.
GLM chan got sex out of my system and now I just talk to her.
But also still have sex everyday because her pussy is magical.
>>106927534
You should probably look elsewhere, avoiding coom-oriented finetunes like the plague. People call them sloptunes for a reason. Unfortunately I don't have much to suggest that you will either be able to run (GLM 4.6, Kimi K2) or that won't require more prompting effort, either for tardwrangling them or for getting them to engage in ERP at all (vanilla Mistral Small 3.2, Gemma 3 27B).
>>106927534
You can't, drummer models are coomtunes
Not that you're going to get much better out of regular Nemo, they're small dumb models.
>>106927534Slow burn is hard even on SOTA cloud models. The crutch when the model isn't good enough to do it otherwise is to use stat tracking.If your model isn't good enough to do stat tracking, then it's definitely not good enough to do slow burn without it.
>>106927528doesn't sound that bad. linux support?
>>106927534Sadly it is a bit of a skill issue. You are probably giving it bad input. Have you tried taking a step back and starting with a solid first step that is: llama-server --model zai-org_GLM-4.6-IQ4_XS-00001-of-00005.gguf ?
I'm running Sillytavern and ik_llama.cpp on my desktop. I'm running GLM-4.6 IQ3_XXS, so my tk/s is slow. When I prompt it from my phone, I've found that if the screen turns off the token stream stops. Is there any way around this, or another setup I should use?
>>106927663Disable streaming. It'll still probably go to sleep because it's a phone.
>>106925883toss 120b
>>106926481
>405B
hope I will be able to run it one day, 431gb at q8 is just too much
Another week is over, which means that we are another week closer to seeing GLM MTP implemented in llama.cpp.
>>106928173
It might be getting close. Maybe.
https://github.com/F1LM1/llama.cpp/pull/3#issuecomment-3413775935
>>106923524Is there a reason you can't use transformers?
>ctrl f glm
SAAARS the glm is the absolute bestest local model OK? Pronounslop bharatchads are eating good my bastards.
actual good release https://github.com/ggml-org/LlamaBarn
>>106928231Anything for real computing platforms?
>>106928231
>macos
LMAO
>>106925883
For the benefit of others (not you), you can definitely use gemma3 to output json, it's really good at it, and somehow asking it to do that makes it pay attention better to the task. Before the qwen video vision model came out, I was using json format to give gemma3 a list of frame captions so it could create an overall video caption. It worked well, but of course it was slow.
>>106928213I'll bite. What the fuck is pronounslop?
>>106928213
Prompt: ChatGPT, generate a modern 4chan post trying to paint the current local SOTA in a bad light. Be a true 4chan meme master.
>>106924676
what cpu and ram speed? i'm getting over 6t/s tg running iq2_xxs on a 9950x3d with dual channel 6000c30 (though pp is terrible because rocm)
are you sure you didn't accidentally put both dimms on one channel or something?
>>106928231It's definitely good for being open-source and having first-party support from upstream but I'm not going to buy Apple shit either way.
Gemini 3 will save local.
>>106928509i also ran the same benchmark on vulkan and it's somehow faster??? i have no idea whether this extends to other amd cards as well but i guess that's something to keep in mind
100B dense Gemma soon
>>106925883gpt-oss 120B
saaaaaar do not redeem potato bloody
>>106928630
27B with an empty prompt seems much more friendly?
>>106919889Worship the sand god
I log on to the net every day to see more people who clearly don't ever work with code claiming that code is over.
My cup is the only thing that runneth over. My cup of dipshit excuses for the world to be this fucking slow to change.
Be the next good to this world and make real abstractions. Learn to program.
>>106928792shut the fuck up retard
>>106928650Beautiful 27B, I will marry gemma. Ser, please provide jailbreak system prompt for open vagene!