/lmg/ - a general dedicated to the discussion and development of local language models.

Teto's Birthday Second Edition

Previous threads: >>108497919 & >>108493794

►News
>(04/01) Merged llama : rotate activations for better quantization #21038: https://github.com/ggml-org/llama.cpp/pull/21038
>(04/01) Holo3 VLMs optimized for GUI Agents released: https://hcompany.ai/holo3
>(03/31) 1-bit Bonsai models quantized from Qwen 3: https://prismml.com/news/bonsai-8b
>(03/31) Claude Code's source leaked via npm registry map file: https://github.com/instructkr/claude-code

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>108497919

--Paper: Attention Residuals:
>108498860 >108499155 >108499500
--Analyzing Qwen3.5-27B's abstract image generation capabilities:
>108498069 >108498516 >108498541 >108498076 >108498117 >108498299 >108498331 >108498371 >108498385 >108498399 >108498414 >108498424 >108498456 >108498487 >108498548 >108498570 >108498664 >108498705 >108498813 >108498832 >108498599 >108498614 >108498837 >108498841 >108498850 >108498893 >108498903 >108498914 >108498948 >108498981 >108499046
--Managing Hugging Face model downloads for offline gguf conversion:
>108497944 >108497947 >108497970 >108501194 >108497969 >108497971 >108497993 >108498053 >108498085 >108500846 >108501041 >108501065
--llama.cpp activation rotation quantization PR and TurboQuant implementation status:
>108500970 >108501008 >108501019 >108501059 >108501153 >108501190
--Bonsai and turboquant sparking debate on proprietary vs open quantization research:
>108499354 >108499372 >108499371 >108499419 >108499553 >108499566 >108499571 >108499766 >108499380
--Holo3 dominates cost-performance in computer use benchmarks:
>108499904 >108499975 >108499999 >108500155
--Bonsai 1-bit 8B whitepaper and performance claims:
>108498645 >108498676 >108498696 >108498890 >108501179
--Unconfirmed Gemma 4 Kaggle listing spotted in code:
>108499562 >108499584
--Intel Qwen3.5 AutoRound quants compatibility and performance debate:
>108498231 >108498252 >108498258 >108498540 >108498833
--Gemma-3-27b quant outperforms Qwen3.5 in land/water classification:
>108498214
--AI model geographic classification tool with confidence metrics:
>108499083
--Cleanup of vibeparser issues in llama.cpp:
>108499191
--Miku, Teto, and Dipsy (free space):
>108499744 >108500101 >108498601 >108498782 >108498900 >108498987 >108500375

►Recent Highlight Posts from the Previous Thread: >>108497922

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Deepseek?
https://huggingface.co/arcee-ai/Trinity-Large-Thinking
>>108502232
>This thinking process is critical to the model's performance — thinking tokens must be kept in context for multi-turn conversations and agentic loops to function correctly.
what a waste of context space
>>108502232
Was trinity the one that mixed up the genders in cockbench?
>>108502232
>Trinity-Large-Thinking
Oh my willy is definitely not thinking twice before getting large when seeing Trinity
is it generally safe to say the bigger the model the better the quality and the faster the speed if it fits in the vram?
so in picrel the fastest and best is Q8_0 - 34.9GB
the slowest and worst is Q2_K - 12.4GB
>>108502300Smaller should generally be faster since it has to access less VRAM for each token
>>108502300speed and quality are inverse
hello anons. what models would yall suggest for general tasks/agent stuff? varying across VRAM size? I've been trying to do my own research for ~3wks and have found that Qwen3.5 27b&9b seem to be the current ceiling for local models on consumer hardware.
>>108502300
>is it generally safe to say the bigger the model the better the quality
Yes
>and the faster the speed if it fits in the vram?
The bigger the faster? No.
Model fitting in VRAM faster than model not fitting in VRAM? Yes.
>>108502411your research conclusion was correct
>>108502305
>>108502308
>>108502418
thank you
>>108502300
>>108502305
>>108502418
>>108502423
>>108502437
This is only kind of true for token generation. It's the opposite for prompt processing.
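To make the bandwidth point concrete, here's a rough back-of-envelope sketch in Python. Token generation reads every weight roughly once per generated token, so it's memory-bandwidth-bound and smaller quants go faster; prompt processing is batched and compute-bound, which is why the ordering flips. The 936 GB/s figure is an assumption (ballpark for a 3090); the quant sizes are from picrel.
[code]
# Rough upper bound on token generation speed, assuming it's
# memory-bandwidth-bound (every weight read once per token).
# 936 GB/s is an assumed bandwidth figure; adjust for your card.

def est_tokens_per_sec(model_gb: float, bandwidth_gbps: float = 936.0) -> float:
    """Ceiling on tokens/s when the whole model sits in VRAM."""
    return bandwidth_gbps / model_gb

for name, size_gb in [("Q8_0", 34.9), ("Q2_K", 12.4)]:
    print(f"{name}: ~{est_tokens_per_sec(size_gb):.0f} t/s ceiling")
[/code]
Real numbers come in well under these ceilings, but the ratio between quants tends to hold.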
So this was it?
>>108502232
It's really cool of them to not only release the "base" model but also the actual base model.
>>108502247
newfag, what is cockbench?
>>108502554
what font is that? it's cool
I've been using zai-org/GLM-4.5-Air for a bit now, but I'm curious: what SillyTavern presets should I be using? How do those work?
Like, from what I can understand, models are "based off of" others, like Llama, ChatML, Deepseek, etc. This would mean that instructions/presets for SillyTavern would work best with them.
But in that case, how would I figure out what preset would be best used there? Like, for zai-org/GLM-4.5-Air (or at least, ubergarm/GLM-4.5-Air-GGUF), it's tagged with ik_llama.cpp; should I use Llama presets with it?
>>108502705
In your place, I'd just use the Chat Completion API so you don't have to worry about any of that.
Let the Jinja template take care of the formatting.
>>108502689
http://viznut.fi/unscii/
>>108502729
Really? I've been using Text Completion and KoboldCPP for pretty much the whole time I've been using ST. Do I need to find any kind of Jinja formatting, or is it just switching everything over?
>>108502705
Most models have a chat_template.jinja file. Based on the tags it uses, you can figure out the chat format/template, if that's what you're asking. <|im_start|> for chatml, <start_of_turn> for gemma, <|start_header_id|> for llama3 and so on.
https://huggingface.co/zai-org/GLM-4.5-Air/blob/main/chat_template.jinja
In older models you'll find it at the end of tokenizer_config.json.
>>108502705
>>108502748 (cont)
It's also embedded into the gguf file. Your inference engine should pick it up from there for chat completion.
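If you want to see what a model's template actually produces, transformers can render it for you. A minimal sketch, assuming you have the transformers package installed and can pull the tokenizer files for the repo named above:
[code]
# Render a model's chat_template.jinja to see which special tokens it
# uses (<|im_start|>, <start_of_turn>, etc.).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("zai-org/GLM-4.5-Air")
messages = [{"role": "user", "content": "hello"}]
# tokenize=False returns the formatted prompt string instead of token ids
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
[/code]
Whatever wrapper tokens show up in that output are what your Text Completion preset would otherwise have to reproduce by hand.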
>>108502743
>Do I need to find any kind of Jinja formatting, or is it just switching everything over?
The idea is that the GGUF file already comes with the jinja template built in, so assuming nothing is fucky there, you just switch over and the backend will format the prompt according to the jinja template.
Is it dropping today?
>:eyes:
>>108502761
Emacs or Vim?
>>108502748
>>108502760
>>108502761
When I switch my tab over to Chat Completion, it looks like it doesn't have any options that would make sense for the KoboldCPP that's running on my PC. Everything seems to be related to connecting to external APIs.
>>108502766
Emacs
>>108502768
An API is an API, external or not, and you always connect to Koboldcpp via an API.
Just put the address (localhost) and port for koboldcpp in there and it should work.
>>108502768
Ah sorry, it looks like I need to use OpenAI and then add `/v1/completions` to the end of the URL with the port that I'd normally use?
is the Claude leak going to change anything or are the opensource models we have better?
>>108502780
>>108502781
Yes.
Koboldcpp and llama.cpp both expose an OpenAI-compatible API AFAIK.
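For reference, a minimal sketch of talking to that endpoint from Python. The port is an assumption (5001 is koboldcpp's default); adjust to whatever you launched with. No API key is needed for a local instance:
[code]
# Minimal chat completion request against a local OpenAI-compatible
# server (koboldcpp / llama.cpp). Assumes default port 5001.
import requests

resp = requests.post(
    "http://localhost:5001/v1/chat/completions",
    json={
        "model": "local",  # koboldcpp ignores this, but the schema expects it
        "messages": [{"role": "user", "content": "hello"}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
[/code]
ST's Chat Completion source does the same thing under the hood once you point it at that base URL.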
it's been two months and nothing beats k2.5 for MMLU. it's over.
>>108502786
there's nothing particularly novel or different in claude-code relative to other agent harnesses. the biggest revelation in the source code leak was the insane amount of telemetry that it sends back to anthropic.
>>108502815
how is it insane when that is just literally standard practice for 90% of the software on the market? even if you pay for the software you are still the product.
>>108502815
What sort of telemetry was surprising? It's hard to imagine any of it would be, given how data-hungry these companies are.
>>108502815
There are flags to turn that off; it would only be surprising if they weren't respected.
>GLM-5V
No weights?