/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108742275 & >>108736046

►News
>(04/29) Mistral Medium 3.5 128B dense released: https://mistral.ai/news/vibe-remote-agents-mistral-medium-3-5
>(04/29) Hy-MT1.5-1.8B on-device translation models released: https://hf.co/collections/AngelSlim/hy-low-bit-model
>(04/29) IBM releases Granite 4.1: https://hf.co/blog/ibm-granite/granite-4-1
>(04/28) Ling-2.6-flash 104B-A7.4B released: https://hf.co/inclusionAI/Ling-2.6-flash
>(04/28) Nvidia releases Nemotron 3 Nano Omni: https://hf.co/blog/nvidia/nemotron-3-nano-omni-multimodal-intelligence

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>108742275

--Optimizing dual RTX 3090 setup for Gemma 4 with speculative decoding:
>108743090 >108743104 >108743131 >108743530 >108743551 >108743361 >108743224 >108743589 >108743643 >108743838 >108743866 >108744035 >108744070 >108744217 >108744276 >108746395 >108744193
--Encoding images in text completion using llama.cpp server settings:
>108745965 >108745986 >108745995 >108746023 >108746145 >108746162 >108746153 >108746191
--llama.cpp PR adding DFlash support for speculative decoding:
>108743701 >108743736 >108743776 >108743804 >108743909 >108743958 >108743768 >108744535
--Qwen 2-bit performance testing with Canvas API image recreation:
>108743817 >108743856 >108743868 >108743922 >108743957 >108743946 >108743951
--Critique of modern AI's lack of initiative in roleplay:
>108743944 >108744192 >108745177 >108745200 >108745218 >108745240 >108745279 >108745235
--Gemma's logical inference of scenario cards and early RP LLM formats:
>108742385 >108742425 >108742466 >108742480 >108742499 >108742505 >108742558 >108742629 >108742653 >108742651 >108742666 >108742729 >108742594 >108742599 >108742936
--Comparing AI agents to manual copy-pasting for coding productivity:
>108746398 >108746438 >108746461 >108746727 >108746980 >108746758
--Repetitive "Let me write" reasoning loops in various models:
>108744796 >108744899 >108744920 >108744927 >108745667
--Debating the necessity of jinja templates over raw prompt formatting:
>108747743 >108747755 >108747803 >108747912 >108748008
--LLM and inference engine embedded within a .ttf font file:
>108743927 >108744507
--Inconsistency of Gemma 4's refusal vectors and censorship levels:
>108742306 >108742365 >108742490 >108742379 >108743700
--Logs:
>108742558 >108742594 >108743868 >108744345 >108744796 >108746216 >108747024
--Miku (free space):
>108746641 >108747847

►Recent Highlight Posts from the Previous Thread: >>108743862

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Give Dipsy support, llamaniggers.
>>108749305
>exl
can you even run this in llama/kobold? isn't it appleshit?
gemmaballz
>>108749417
no, you're thinking of mlx. exl3 is for tabbyAPI, which exposes an openai compatible API just like kobold and llama
https://github.com/theroyallab/tabbyAPI
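Meaning any stock OpenAI-style client can talk to it. A minimal Python sketch, assuming tabbyAPI on its default port 5000; the API key and model name below are placeholders for whatever your config actually uses:
[code]
from openai import OpenAI

# Sketch: tabbyAPI (like kobold and llama.cpp's server) speaks the
# OpenAI API, so the stock client works. Port/key/model are placeholders.
client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="unused")

resp = client.completions.create(
    model="exl3-model",          # whatever model tabbyAPI has loaded
    prompt="The quick brown fox",
    max_tokens=32,
)
print(resp.choices[0].text)
[/code]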
LeCun was wrong.
>>108749467
Can you be wrong if you literally haven't done anything?
>>108749467
I wish. I want my local cat intelligence already...
>>108749477
vjepa says hello
>>108749486
Holy shit it's really becoming a cat
Remember when Netflix of all companies made JEPA real and nobody cared?
https://huggingface.co/netflix/void-model
>>108747912
>If you aren't doing rag or tool calling I can't imagine a prompt template more complicated than what you do in ST.
but no image support on v1/completions
is that a llama.cpp only limitation?
>>108749548
If you want to use images or other types of data in text completion mode, use the /completion endpoint and send the images or audio encoded as base64 in "multimodal_data"; they'll be tokenized and put into your prompt wherever you place the media markers:
https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#post-completion-given-a-prompt-it-returns-the-predicted-completion
That's how you'd do it if you were coding your own frontend anyway, not sure if SillyTavern does or supports this sort of thing.
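Roughly like this, as a minimal Python sketch, assuming a llama.cpp server with an mmproj loaded on the default port; the "multimodal_data" field and the <__media__> marker string are as described in the README linked above, and "cat.png" is a placeholder file:
[code]
import base64
import requests

# Sketch: send one image alongside a text completion request to a
# llama.cpp server started with a multimodal projector (localhost:8080).
with open("cat.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={
        # the media marker is replaced by the tokenized image at this spot
        "prompt": "Describe this image in one sentence: <__media__>",
        "multimodal_data": [image_b64],
        "n_predict": 128,
    },
)
print(resp.json()["content"])
[/code]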
>looks at you like you are a bug under a microscope
>a stain on the carpet
>a gum stuck to her shoe
>a specimen
What's the name for this slop phrase and how do I prompt it away?
>>108749617
what’s wrong with similes?
>>108749642
I already told it to remove similes, it's this specific one that doesn't go away. AI just can't express cold, indifferent characters without injecting this phrase.
>>108749649
>>108749617
metaphors?
>>108749649
yeah but, like, isn’t that good writing?
>>108749649
try a regex to strip it from the gen message/history going forward, so the model can gen it, but it gets ripped out
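Something like this Python sketch of that post-processing pass (SillyTavern's regex extension can do the same thing natively; the pattern only covers the variants quoted upthread, so treat it as a starting point):
[code]
import re

# Hypothetical frontend post-processing: the model may still generate
# the simile, but it's stripped before being shown or fed back into
# the chat history, so it can't reinforce itself across turns.
SLOP = re.compile(
    r"\s*like (?:you(?:'re| are) )?"
    r"(?:a bug under a microscope|a stain on the carpet|"
    r"gum stuck to (?:his|her|their) shoe|a specimen)",
    re.IGNORECASE,
)

def strip_slop(reply: str) -> str:
    return SLOP.sub("", reply)

print(strip_slop("She looks at you like a bug under a microscope."))
# -> "She looks at you."
[/code]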
>>108749642She looks you and Riley similes.
>>108749658
Broccoli is good for you. Therefore you should eat 10 pounds of it every day.
>>108749678
holy THIS
What would be the ideal optical disc to store day 0 gemma weights on if I want them immutable? I thought we'd have some TB-scale discs by now but it looks like everyone just gave up after blu-ray? Did the guy who invents new discs die or something?
A little birdie told me that openai is getting ready to release gpt-oss 2 and it's going to be a huge shift.
>>108749808
I am expecting that, given the case with Musk. Hopefully it's less censored this time.
>>108749697
>optical pleb
https://en.wikipedia.org/wiki/Linear_Tape-Open
>>108749808
more like huge shit
>>108749824
>the case with Musk
What did I miss?
>>108749907
>What did I miss?
tl;dr musk has been calling saltman on his bullshit for the last 2 years (i.e. turning openai into closedai and pillaging it)
>>108749697
>if I want them immutable?
just set the immutable bit kiddo
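For what it's worth, a sketch of what "the immutable bit" means in practice on Linux ext*-style filesystems; the filename is hypothetical, and note this guards against accidental modification, not media rot, so it's not the archival immutability the question was really about:
[code]
import subprocess

# Sets the ext* immutable attribute via chattr(1): the file can't be
# modified, renamed, or deleted until the flag is cleared with -i.
# Linux-only, requires root, does nothing about bit rot on the medium.
def make_immutable(path: str) -> None:
    subprocess.run(["sudo", "chattr", "+i", path], check=True)

make_immutable("gemma-4-day0.gguf")  # hypothetical filename
[/code]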
>>108749697
You'll have to find old floppies and move the little tab. There's no other way.
>>108749959
??? But gemma makes me ERECT how can she fit on a FLOPPY disk?
fyi everyone, ace step 1.5 xl is out and really raises the bar for local music gen. aka suno at home.
>>108749988
You said this last week, and it was shit last week. It'll be shit this week too.
>>108749808
>it's going to be a huge shift.
leaked image of gpt-oss-2
>put porn collection into rag
>track fap and reviews with llm chats daily
>suggest porn for me today
why not?
>>108750021
^ that's a bot, by the way. Just testing the bots, as you were, gents.
>>108750074
hello bot-chama
>>108750030
SLAM THAT BAD BOY UNTIL YOUR KEYS ARE ALL STICKY
>>108749398
i was still using sonnet for opencode, but since qwen 3.6 i'm not using any cloud models any more.
>>108750096
>>108750097
>>108750098
>>108750099
>>108750100
>>>/vg/vn
>>108750110
yea i think it glitched, i didn't mean to post it 5 times and that's why i deleted most of them.
>>108750114
no worries, I'm glad you left 2. it's always smart to have a backup :)
>be /lmg/
>cry about moes all day
>mistral releases big dense model
>no talk about it
y'all are a bunch of hypocrites
ace step 1.5 base xl, not using the "thinking" 5hz lm, but using the lm to set the tempo etc.
https://files.catbox.moe/9lz9tp.mp3
>>108750138
Because mistral is EUROPEAN and EUROPEAN laws PROHIBIT ai companies from making GOOD ai, so mistral medium is BAD!
>>108750138
Why talk about it when even Mistral doesn't pretend that it's good? Its only selling point is that it uses an ancient backbone so the flops are below EU regulation limits.
>>108750152
We need to protect the EVROPEAN VALUES!
>>108750138
Almost all of those posts were bait. Maybe even your post is bait.
>>108750138
>cry about moes all day
Moe vs dense is a meme fight, what matters is good models regardless of architecture.
Mistral did not release a good model. They haven't for years.
>>108750138
bart just put his quants up a few hours ago. i'm not dumb enough to download the first unsloth uploads
>gpt oss 2
God I can't wait desu. Look, Qwen isn't the worst thing ever, but if it's going to be censored and STEMmaxxed, I wish it were a bit more reliable and not so heavy on the repetitive, redundant looping thoughts.
eggu
>>108749933
Altman only needed one release and that's all we got. Why would he release another one, especially when it would strain their already constrained resources, and they were losing to Anthropic until recently? Sure, it would be nice to get another open source release, but I doubt it will happen because Musk is going to lose quite badly in the lawsuit, which is unfortunate, since it would be nice to see both of them get taken down a peg. Also, if they did it, it would be guaranteed to obsolete their mini and nano releases on the meme marks. I bet that's why they waited a while after releasing GPT-5 with its mini models before attempting a mini series again. I think this is all we'll see of it until 2030/2031.
>>108750244
>not 7 and i holdings
ngmi
>>108750141
Here's your udio/suno at home:
https://files.catbox.moe/61cwlr.mp3
singing the news! FRESH news! :^)
It sounds so optimistic. Maybe news should always be delivered this way.
>>108749988
>>108750141
>>108750275
What's the USE CASE? When TWO point FIVE percent of the population is DEAF? They need to focus on TEXT, which EVERYONE can read!
>>108750141
how many steps for base?
>>108750298
I did 100, because acestep.cpp is capped at 100 for no apparent reason.
>>108750298
and, I'm using dcw at .001 / .001
if you use the audio codes aka 5hz lm, you will get more squared-up and more consistent results, but it's less dynamic.
>>108749467
It's just that JEPA as intended by LeCun won't work with text in any useful or even meaningful manner. It does with images/video, and that's what he's pushing, but people interface with computers and other people primarily via language and its rules.
Planning text with video would be astronomically inefficient, so we can shove that idea aside. Predicting sentence embeddings from other sentence embeddings in order to convert them to intelligible text (the closest thing to an actual text version of JEPA) doesn't even work, as there would be no intelligible solution to interpolate between two continuous text embeddings (unlike image frames).
LeCun lost the plot, no wonder he basically got kicked out of Meta.
Can You Invest in Cultured and Perfected These? And Have Your System not stall it from Pharmacy andor Prescription?
>>108750287
Uhh, sweaty? Blind people can't read. And no, they don't automatically know Braille.
>>108750122
i wanted to delete all of them but 4chan went "muh you are deleting posts too fast".
>>108750346
BLIND people can just use TEXT to SPEECH ACCESSIBILITY readers
>uncensored-heretic-abliterated-turboquant-opus-distilled-nvfp4-gguf
I'm pretty new to this, am I doing this right?
I downloaded gemma-4-31b-it-the-deckard-heretic-uncensored-thinking-i1 and lm studio, using a 3090. Feels like I can say 8-10 things to it before my context length fills up and I have to get rid of earlier messages. And because the model is as big as it is, my vram is near capacity so I can't increase it any further. If I have stuff in my system prompt, I have even less context length to work with, and the system prompt is the thing you use to keep any consistency and some sort of history, right? I feel like I'm doing something wrong here.
>>108750366
>one (1) singular 3090
:skull: bruh
>>108750383
I have a degree in computer science and the year is 2026. Do you think I could afford something better?
>>108750366
You can try to QUANTIZE the kv cache, but GEMMA 4 31B is quite HEAVY with its kv cache memory footprint, especially on a SINGLE 3090. Have you tried using LLAMA-CPP to keep the kv cache in CPU ram instead of GPU ram? It will be SLOWER though.
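To put rough numbers on why the cache is heavy, a back-of-the-envelope Python sketch; the layer/head dimensions are made up for illustration, since the actual Gemma 4 31B config isn't posted in the thread:
[code]
# Back-of-the-envelope KV cache sizing. All model dimensions here are
# hypothetical placeholders, not the real Gemma 4 31B config.
n_layers = 48      # transformer layers (assumed)
n_kv_heads = 8     # KV heads after GQA (assumed)
head_dim = 128     # per-head dimension (assumed)

def kv_bytes(n_ctx: int, bytes_per_elem: float) -> float:
    # 2x for K and V, per layer, per KV head, per head dim, per token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx

for ctx in (16_384, 32_768):
    f16 = kv_bytes(ctx, 2.0) / 2**30       # default f16 cache
    q8 = kv_bytes(ctx, 1.0625) / 2**30     # q8_0: ~8.5 bits per element
    print(f"{ctx} ctx: f16 ~{f16:.1f} GiB, q8_0 ~{q8:.1f} GiB")
[/code]
With assumptions like these, f16 costs ~192 KiB per token and q8_0 roughly halves it, which is why quantizing the cache buys so much context on a 24GB card.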
>>108750366
drag context up anyways, deal with slower speed.
>>108750366
Are you already using the llama.cpp argument --parallel 1?
>>108750366llama-server.exe --model "H:\gemma4\gemma-4-31B-it-Q4_K_S.gguf" --parallel 1 --kv-unified --threads 8 --ctx-size 43000 --n-gpu-layers 99 --no-mmap --port 8080 --jinja -b 4096 -np 1 --swa-checkpoints 3 --reasoning off --override-kv gemma4.final_logit_softcapping=float:25.0
is there a real use case where you'd need gemma 31b over 26b a4b?
>>108750392
>>108750399
>>108750407
>>108750424
I needed to throw these posts at gemini and have it explain them to me, but I think I got the gist of it. KV cache is now quantized at q8_0, parallel is now set to 1, threads to 8, and context length is now 16k. Still a few things to look through, but it's looking much better so far. Thanks a ton.
>>108750508
Yes. I will not elaborate.
>>108750510
31b q4km with q8 kv can do 25k context on a 24gb card. At least that's what I'm running. Alternatively you can use the moe with 100k context and 4 times the speed.
>>108750508
The use case is that you want higher quality outputs and you can run the 31b at a usable speed.
>>108750518
>31b q4km with q8 kv can do 25k context on a 24gb
You can actually do 40k context with that quant and KV=Q8
>>108750529
Yeah, I could minmax a bit more, but then I have to close everything else and that kinda sucks.