/lmg/ - a general dedicated to the discussion and development of local language models.

Teto's Birthday Second Edition

Previous threads: >>108497919 & >>108493794

►News
>(04/01) Merged llama : rotate activations for better quantization #21038: https://github.com/ggml-org/llama.cpp/pull/21038
>(04/01) Holo3 VLMs optimized for GUI Agents released: https://hcompany.ai/holo3
>(03/31) 1-bit Bonsai models quantized from Qwen 3: https://prismml.com/news/bonsai-8b
>(03/31) Claude Code's source leaked via npm registry map file: https://github.com/instructkr/claude-code

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>108497919

--Paper: Attention Residuals:
>108498860 >108499155 >108499500
--Analyzing Qwen3.5-27B's abstract image generation capabilities:
>108498069 >108498516 >108498541 >108498076 >108498117 >108498299 >108498331 >108498371 >108498385 >108498399 >108498414 >108498424 >108498456 >108498487 >108498548 >108498570 >108498664 >108498705 >108498813 >108498832 >108498599 >108498614 >108498837 >108498841 >108498850 >108498893 >108498903 >108498914 >108498948 >108498981 >108499046
--Managing Hugging Face model downloads for offline gguf conversion:
>108497944 >108497947 >108497970 >108501194 >108497969 >108497971 >108497993 >108498053 >108498085 >108500846 >108501041 >108501065
--llama.cpp activation rotation quantization PR and TurboQuant implementation status:
>108500970 >108501008 >108501019 >108501059 >108501153 >108501190
--Bonsai and turboquant sparking debate on proprietary vs open quantization research:
>108499354 >108499372 >108499371 >108499419 >108499553 >108499566 >108499571 >108499766 >108499380
--Holo3 dominates cost-performance in computer use benchmarks:
>108499904 >108499975 >108499999 >108500155
--Bonsai 1-bit 8B whitepaper and performance claims:
>108498645 >108498676 >108498696 >108498890 >108501179
--Unconfirmed Gemma 4 Kaggle listing spotted in code:
>108499562 >108499584
--Intel Qwen3.5 AutoRound quants compatibility and performance debate:
>108498231 >108498252 >108498258 >108498540 >108498833
--Gemma-3-27b quant outperforms Qwen3.5 in land/water classification:
>108498214
--AI model geographic classification tool with confidence metrics:
>108499083
--Cleanup of vibeparser issues in llama.cpp:
>108499191
--Miku, Teto, and Dipsy (free space):
>108499744 >108500101 >108498601 >108498782 >108498900 >108498987 >108500375

►Recent Highlight Posts from the Previous Thread: >>108497922

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Deepseek?
https://huggingface.co/arcee-ai/Trinity-Large-Thinking
>>108502232
>This thinking process is critical to the model's performance — thinking tokens must be kept in context for multi-turn conversations and agentic loops to function correctly.
what a waste of context space
>>108502232
Was trinity the one that mixed up the genders in cockbench?
>>108502232
>Trinity-Large-Thinking
Oh my willy is definitely not thinking twice before getting large when seeing Trinity
is it generally safe to say the bigger the model the better the quality and the faster the speed if it fits in the vram?
so in picrel the fastest and best is Q8_0 - 34.9GB
the slowest and worst is Q2_K - 12.4GB
>>108502300Smaller should generally be faster since it has to access less VRAM for each token
>>108502300speed and quality are inverse
hello anons. what models would yall suggest for general tasks/agent stuff? varying across VRAM size? I've been trying to do my own research for ~3wks and have found that Qwen3.5 27b&9b seem to be the current ceiling for local models on consumer hardware.
>>108502300
>is it generally safe to say the bigger the model the better the quality
Yes
>and the faster the speed if it fits in the vram?
The bigger the faster? No.
Model fitting in VRAM faster than model not fitting in VRAM? Yes.
>>108502411your research conclusion was correct
>>108502305
>>108502308
>>108502418
thank you
>>108502300
>>108502305
>>108502418
>>108502423
>>108502437
This is only kind of true for token generation. It's the opposite for prompt processing.
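To make the bandwidth point concrete, here's a rough back-of-envelope sketch in Python. Token generation reads every weight roughly once per generated token, so it's memory-bandwidth-bound and smaller quants go faster; prompt processing is batched and compute-bound, which is why the ordering flips. The 936 GB/s figure is an assumption (ballpark for a 3090); the quant sizes are from picrel.
[code]
# Rough upper bound on token generation speed, assuming it's
# memory-bandwidth-bound (every weight read once per token).
# 936 GB/s is an assumed bandwidth figure; adjust for your card.

def est_tokens_per_sec(model_gb: float, bandwidth_gbps: float = 936.0) -> float:
    """Ceiling on tokens/s when the whole model sits in VRAM."""
    return bandwidth_gbps / model_gb

for name, size_gb in [("Q8_0", 34.9), ("Q2_K", 12.4)]:
    print(f"{name}: ~{est_tokens_per_sec(size_gb):.0f} t/s ceiling")
[/code]
Real numbers come in well under these ceilings, but the ratio between quants tends to hold.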
So this was it?
>>108502232
It's really cool of them to not only release the "base" model but also the actual base model.
>>108502247
newfag, what is cockbench?
>>108502554
what font is that? it's cool
I've been using zai-org/GLM-4.5-Air for a bit now, but I'm curious: what SillyTavern presets should I be using? How do those work?
Like, from what I can understand, models are "based off of" others, like Llama, ChatML, Deepseek, etc. This would mean that instructions/presets for SillyTavern would work best with them.
But in that case, how would I figure out what preset would be best used there? Like, for zai-org/GLM-4.5-Air (or at least, ubergarm/GLM-4.5-Air-GGUF), it's tagged with ik_llama.cpp; should I use Llama presets with it?
>>108502705
In your place, I'd just use the Chat Completion API so you don't have to worry about any of that.
Let the Jinja template take care of the formatting.
>>108502689
http://viznut.fi/unscii/
>>108502729
Really? I've been using Text Completion and KoboldCPP for pretty much the whole time I've been using ST. Do I need to find any kind of Jinja formatting, or is it just switching everything over?
>>108502705
Most models have a chat_template.jinja file. Based on the tags it uses, you can figure out the chat format/template, if that's what you're asking. <|im_start|> for chatml, <start_of_turn> for gemma, <|start_header_id|> for llama3 and so on.
https://huggingface.co/zai-org/GLM-4.5-Air/blob/main/chat_template.jinja
In older models you'll find it at the end of tokenizer_config.json.
>>108502705
>>108502748 (cont)
It's also embedded into the gguf file. Your inference engine should pick it up from there for chat completion.
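If you want to see what a model's template actually produces, transformers can render it for you. A minimal sketch, assuming you have the transformers package installed and can pull the tokenizer files for the repo named above:
[code]
# Render a model's chat_template.jinja to see which special tokens it
# uses (<|im_start|>, <start_of_turn>, etc.).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("zai-org/GLM-4.5-Air")
messages = [{"role": "user", "content": "hello"}]
# tokenize=False returns the formatted prompt string instead of token ids
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
[/code]
Whatever wrapper tokens show up in that output are what your Text Completion preset would otherwise have to reproduce by hand.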
>>108502743
>Do I need to find any kind of Jinja formatting, or is it just switching everything over?
The idea is that the GGUF file already comes with the jinja template built in, so assuming nothing is fucky there, you just switch over and the backend will format the prompt according to the jinja template.
Is it dropping today?
>:eyes:
>>108502761
Emacs or Vim?
>>108502748
>>108502760
>>108502761
When I switch my tab over to Chat Completion, it looks like it doesn't have any options that would make sense for the KoboldCPP that's running on my PC. Everything seems to be related to connecting to external APIs.
>>108502766
Emacs
>>108502768
An API is an API, external or not, and you always connect to Koboldcpp via an API.
Just put the address (localhost) and port for koboldcpp in there and it should work.
>>108502768
Ah sorry, it looks like I need to use OpenAI and then add `/v1/completions` to the end of the URL with the port that I'd normally use?
is the Claude leak going to change anything or are the opensource models we have better?
>>108502780
>>108502781
Yes.
Koboldcpp and llama.cpp both expose an OpenAI-compatible API AFAIK.
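For reference, a minimal sketch of talking to that endpoint from Python. The port is an assumption (5001 is koboldcpp's default); adjust to whatever you launched with. No API key is needed for a local instance:
[code]
# Minimal chat completion request against a local OpenAI-compatible
# server (koboldcpp / llama.cpp). Assumes default port 5001.
import requests

resp = requests.post(
    "http://localhost:5001/v1/chat/completions",
    json={
        "model": "local",  # koboldcpp ignores this, but the schema expects it
        "messages": [{"role": "user", "content": "hello"}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
[/code]
ST's Chat Completion source does the same thing under the hood once you point it at that base URL.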
it's been two months and nothing beats k2.5 for MMLU. it's over.
>>108502786
there's nothing particularly novel or different in claude-code relative to other agent harnesses. the biggest revelation in the source code leak was the insane amount of telemetry that it sends back to anthropic.
>>108502815
how is it insane when that is just literally standard practice for 90% of the software on the market? even if you pay for the software you are still the product.
>>108502815
What sort of telemetry was surprising? It's hard to imagine any of it would be, given how data-hungry these companies are.
>>108502815
There are flags to turn that off; it would only be surprising if they weren't respected.
>GLM-5V
No weights?