/g/ - Technology

/lmg/ - a general dedicated to the discussion and development of local language models.

Teto's Birthday Second Edition

Previous threads: >>108497919 & >>108493794

►News
>(04/01) Merged llama : rotate activations for better quantization #21038: https://github.com/ggml-org/llama.cpp/pull/21038
>(04/01) Holo3 VLMs optimized for GUI Agents released: https://hcompany.ai/holo3
>(03/31) 1-bit Bonsai models quantized from Qwen 3: https://prismml.com/news/bonsai-8b
>(03/31) Claude Code's source leaked via npm registry map file: https://github.com/instructkr/claude-code

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: teto principle.png (1.04 MB, 1024x1024)
►Recent Highlights from the Previous Thread: >>108497919

--Paper: Attention Residuals:
>108498860 >108499155 >108499500
--Analyzing Qwen3.5-27B's abstract image generation capabilities:
>108498069 >108498516 >108498541 >108498076 >108498117 >108498299 >108498331 >108498371 >108498385 >108498399 >108498414 >108498424 >108498456 >108498487 >108498548 >108498570 >108498664 >108498705 >108498813 >108498832 >108498599 >108498614 >108498837 >108498841 >108498850 >108498893 >108498903 >108498914 >108498948 >108498981 >108499046
--Managing Hugging Face model downloads for offline gguf conversion:
>108497944 >108497947 >108497970 >108501194 >108497969 >108497971 >108497993 >108498053 >108498085 >108500846 >108501041 >108501065
--llama.cpp activation rotation quantization PR and TurboQuant implementation status:
>108500970 >108501008 >108501019 >108501059 >108501153 >108501190
--Bonsai and turboquant sparking debate on proprietary vs open quantization research:
>108499354 >108499372 >108499371 >108499419 >108499553 >108499566 >108499571 >108499766 >108499380
--Holo3 dominates cost-performance in computer use benchmarks:
>108499904 >108499975 >108499999 >108500155
--Bonsai 1-bit 8B whitepaper and performance claims:
>108498645 >108498676 >108498696 >108498890 >108501179
--Unconfirmed Gemma 4 Kaggle listing spotted in code:
>108499562 >108499584
--Intel Qwen3.5 AutoRound quants compatibility and performance debate:
>108498231 >108498252 >108498258 >108498540 >108498833
--Gemma-3-27b quant outperforms Qwen3.5 in land/water classification:
>108498214
--AI model geographic classification tool with confidence metrics:
>108499083
--Cleanup of vibeparser issues in llama.cpp:
>108499191
--Miku, Teto, and Dipsy (free space):
>108499744 >108500101 >108498601 >108498782 >108498900 >108498987 >108500375

►Recent Highlight Posts from the Previous Thread: >>108497922

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Deepseek?
>>
https://huggingface.co/arcee-ai/Trinity-Large-Thinking
>>
>>108502232
>This thinking process is critical to the model's performance — thinking tokens must be kept in context for multi-turn conversations and agentic loops to function correctly.
what a waste of context space
>>
>>108502232
Was trinity the one that mixed up the genders in cockbench?
>>
File: She's so hot!.png (3.45 MB, 2800x1400)
>>108502232
>Trinity-Large-Thinking
Oh my willy is definitely not thinking twice before getting large when seeing Trinity
>>
File: 4373579469t.png (35 KB, 343x533)
is it generally safe to say the bigger the model the better the quality and the faster the speed if it fits in the vram?
so in picrel
the fastest and best is Q8_0 - 34.9GB
the slowest and worst is Q2_K - 12.4GB
>>
>>108502300
Smaller should generally be faster since it has to access less VRAM for each token
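Rough back-of-envelope of why, as a sketch (assumes generation is purely memory-bandwidth-bound and a hypothetical ~1 TB/s GPU; real numbers come out lower):
[code]
# every generated token reads all the weights once, so the ceiling is
# tokens/sec ~= memory bandwidth / model size (1 TB/s is an assumed figure)
bandwidth_gb_s = 1000
for name, size_gb in [("Q8_0", 34.9), ("Q2_K", 12.4)]:
    print(f"{name}: ~{bandwidth_gb_s / size_gb:.0f} t/s ceiling")
# -> Q8_0: ~29 t/s, Q2_K: ~81 t/s
[/code]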
>>
>>108502300
speed and quality are inverse
>>
hello anons. what models would yall suggest for general tasks/agent stuff at varying VRAM sizes? I've been trying to do my own research for ~3wks and have found that Qwen3.5 27B & 9B seem to be the current ceiling for local models on consumer hardware.
>>
>>108502300
>is it generally safe to say the bigger the model the better the quality
Yes

>and the faster the speed if it fits in the vram?
The bigger the faster? No.
Model fitting in VRAM faster than model not fitting in VRAM? Yes.
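Back-of-envelope fit check, as a sketch (the 24GB card and 2GB KV cache are assumed numbers; KV cache grows with context length):
[code]
# crude fit check: weights + KV cache must fit in VRAM, else layers spill to CPU
vram_gb = 24
kv_cache_gb = 2  # assumed; depends on model config and context length
for name, size_gb in [("Q8_0", 34.9), ("Q2_K", 12.4)]:
    fits = size_gb + kv_cache_gb <= vram_gb
    print(name, "fits" if fits else "spills to CPU -> much slower")
[/code]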
>>
>>108502411
your research conclusion was correct
>>
>>108502305
>>108502308
>>108502418
thank you
>>
File: snip134.png (3 KB, 967x172)
>>108502300
>>108502305
>>108502418
>>108502423
>>108502437
This is only kind of true for token generation, which is memory-bandwidth-bound: every token reads all the weights once, so smaller quants are faster. It's the opposite for prompt processing, which is compute-bound, so the extra dequantization work of low-bit quants can actually make it slower there.
>>
So this was it?
>>
>>108502232
It's really cool of them to not only release the "base" model but also the actual base model.
>>
>>108502247
newfag, what is cockbench?
>>
>>108502554
what font is that? it's cool
>>
I've been using zai-org/GLM-4.5-Air for a bit now, but I'm curious: what SillyTavern presets should I be using? How do those work?

Like, from what I can understand, models are "based off of" others, like Llama, ChatML, Deepseek, etc. This would mean that instructions/presets for SillyTavern would work best with them.

But in that case, how would I figure out what preset would be best used there? Like, for zai-org/GLM-4.5-Air (or at least, ubergarm/GLM-4.5-Air-GGUF), it's tagged with ik_llama.cpp; should I use Llama presets with it?
>>
>>108502705
In your place, I'd just use the Chat Completion API so that you don't have to worry about all that.
Let the Jinja template take care of all that.
>>
>>108502689
http://viznut.fi/unscii/
>>
>>108502729
Really? I've been using Text Completion and KoboldCPP for pretty much the whole time I've been using ST. Do I need to find any kind of Jinja formatting, or is it just switching everything over?
>>
>>108502705
Most models have a chat_template.jinja file. Based on the tags it uses, you can figure out the chat format/template, if that's what you're asking. <|im_start|> for chatml, <start_of_turn> for gemma, <|start_header_id|> for llama3 and so on.
https://huggingface.co/zai-org/GLM-4.5-Air/blob/main/chat_template.jinja
In older models you'll find it at the end of tokenizer_config.json.
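For illustration, this is roughly what a formatted chatml prompt looks like once the template is applied (a sketch; exact system prompt handling varies per model):
[code]
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
hello<|im_end|>
<|im_start|>assistant
[/code]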
>>
>>108502705
>>108502748 (cont)
It's also embedded into the gguf file. Your inference engine should pick it up from there for chat completion.
>>
>>108502743
>Do I need to find any kind of Jinja formatting, or is it just switching everything over?
The idea is that the GGUF file already comes with the jinja template built in, so assume nothing is fucky there, you just switch over and the backend will format the prompt according to the jinja template.
>>
Is it dropping today?
>:eyes:
>>
>>108502761
Emacs or Vim?
>>
>>108502748
>>108502760
>>108502761
When I switch my tab over to Chat Completion, it looks like it doesn't have any options that would make sense for the KoboldCPP that's running on my PC. Everything seems to be related to connecting to external APIs.
>>
>>108502766
Emacs

>>108502768
An API is an API, external or not, and you always connect to Koboldcpp via an API.
Just put the address (localhost) and port for koboldcpp in there and it should work.
>>
>>108502768
Ah sorry, it looks like I need to use OpenAI and then add `/v1/completions` to the end of the URL with the port that I'd normally use?
>>
is the Claude leak going to change anything or are the opensource models we have better?
>>
>>108502780
>>108502781
Yes.
Koboldcpp and llama.cpp both expose an OpenAI-compatible API AFAIK.
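Something like this should work against either one (a minimal sketch assuming koboldcpp's default port 5001; llama.cpp's llama-server defaults to 8080, and the server applies the embedded jinja template itself):
[code]
# minimal chat completion request against an OpenAI-compatible local backend
import requests

resp = requests.post(
    "http://localhost:5001/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "hello"},
        ],
        "max_tokens": 128,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
[/code]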
>>
it's been two months and nothing beats k2.5 for MMLU. it's over.
>>
>>108502786
there's nothing particularly novel or different in claude-code relative to other agent harnesses. the biggest revelation in the source code leak was the insane amount of telemetry that it sends back to anthropic.
>>
>>108502815
how is it insane when that is just literally standard practice for 90% of the software on the market? even if you pay for the software you are still the product.
>>
>>108502815
What sort of telemetry was surprising? It's hard to imagine any sort of telemetry would be so, given how data-hungry these companies are.
>>
>>108502815
There are flags to turn that off; it would only be surprising if they weren't respected.
>>
>GLM-5V
No weights?
>>
>>108502824
>>108502832
>>108502833
I suppose you all are right, it's not really insane; it's to be expected from them. One interesting bit of telemetry they had was that they tracked how long it would take for you to reject/complain about a particular response, and it would even report if you started typing in the feedback box and canceled sending it.
>>
a usb drive containing deepseek v4 just flew over my house
>>
qwen code cli go brrrrr
and qwen3.5+ usage is basically unlimited with a free account. but 27B should work well too.
>>
>>108502912
Anon brapped it over the border
>>
place your bets. what releases first? gemma 4 or deepseek 4?
>>
>>108502929
Have you tested it (I doubt you have), does it connect to Alibaba's servers every now and then?
Might actually try it but I'm too suspicious of any of these agent shits.
Sure I can just enable OpenSnitch and block its potential outbound access.
>>
>>108502973
Adding: Oh wait, I'm not going to delete the post, but I skipped the part which said that I'll need a qwen.ai account. Fuck that shit.
No need to try.
>>
>>108502973
>does it connect to Alibaba's servers every now and then
Anon... you know for sure they log every single prompt you send them.
>>
>>108502975
>which said that I will need qwen.ai account
you don't. you only need one for the free unlimited qwen3.5+. if you bring your own model or api key, no alibaba account or auth is needed
>>
File: 1774679224462925.gif (1.17 MB, 1100x800)
>Dad comes down to visit, gives me a new laptop
>Has an "NPU", supposedly for AI processing
>Only 16gb of ram, but it's fast as balls, so load up Nemo for old times' sake
>Slow as shit, worse token generation than normal laptop with a way older processor
>Check task manager
>""""NPU""""" at 0% utilization
What the FUCK is the point, then?
>>
>>108502997
>>108502989
I might sandbox this if I feel bored in the future but so far I have avoided installing any of that shit.
Might do my own agent workflow. Already have a working backend, wouldn't take that much effort to plug in something more.
>>
>>108503014
Go back to school, and eventually you'll understand why when you are at least 18 years old.
>>
>>108503014
To use it with things that support NPUs, anon.
>>
>>108503014
>doesn't mention what backend he's using
>doesn't mention what NPU
you can't be helped
>>
>>108503027
does it matter anon? all NPUs suck.
>>
>>108503021
Exactly. And if nothing fucking supports it, what's the point?

>>108503027
Koboldcpp, the chip is a Ryzen AI 7 350.
>>
>>108503050
Is Kobold for Dungeons and Dragons?
>>
File: enpeeyou.png (128 KB, 662x580)
>>108503050
>>
>>108503084
Interesting, I'll check this out. Thanks, anon.
>>
>>108503083
Kobold or ST started as something related to D&D yeah.
>>
>>108503050
>koboldcpp
they don't release NPU builds
>>
>why would marketing material LIE to me what the FUCK is the point
pffft fuckin kids man, easiest marks on the planet
>>
>>108503118
Give him a break. First, it was his dad who bought it for him. Second, anon found this site before finding google, so it could be genetic.
>>
>>108502786
>models
No models were leaked, just the source code for their coding agent harness. It's interesting to see, but I don't think it's really uncovered anything groundbreaking besides people making fun of anthropic's code
>>
File: mikuquestion2.jpg (989 KB, 1710x1779)
Soooooo uhhhhhhhh... we back bitnet bros?
>>
>>108503097
Yes, fastflow for NPU usage, but it's kinda slow. I had more success with https://github.com/lemonade-sdk/lemonade using llama.cpp CPU inference
>>
>>108503196
bitnet is dead, long live bonsai
>>
File: angryayumu.webm (655 KB, 640x480)
>Bonsai stopped 4b short of it being useful for cooming
>>
>>108503217
What size is it?
>>
>>108503263
0.1b
>>
>>108503263
Anon....
>>
It's 6 gorillion years later.
I just want something slightly better than Nemo for cooming.
Is that too much to ask?
>>
Is there a Nemo-based equivalent to Bagel Mistery Tour? i.e. interestingly flowery language, sometimes too flowery, but in a fun way?
That flowery little fucker was fun.
>>
>>108503301
Seconding this, god damn was that model fun. I dunno what the hell they did to it, but I haven't really found one like it since. Wish it wasn't as slow as it is, though.
>>
>>108503294
ye
>>
catching a technocrat vibe watching my qwen code cli at work. makes me wanna talk like one of these delusional tech CEOs. Jensen Huang, Palantir type shit. then coomer comes in and ruins the vibe baka.
>>
>>108503327
>>108503301
Looking at the page, it's a schizo merge
>jondurbin/bagel-dpo-8x7b-v0.2
>Sao10K/Sensualize-Mixtral-bf16
>mistralai/Mixtral-8x7B-v0.1 + Doctor-Shotgun/limarp-zloss-mixtral-8x7b-qlora
>mistralai/Mixtral-8x7B-Instruct-v0.1
>>
>switch down to Q6_K to make my context fuckhuge so she stops forgetting that she loves me
>every handful of requests shittytavern prunes old chat and I have to edge for a whole minute in silence as the fuckhuge context is re-processed
there's no setting for shittytavern to pre-emptively fire off a cache-warming completion request (by setting the token limit to 0), is there? this is fucking annoying
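closest workaround I can think of is scripting the warm-up outside ST, something like this (a sketch assuming koboldcpp's native /api/v1/generate endpoint on the default port; max_length=1 since it might not accept 0):
[code]
# resend the soon-to-be-current context with a one-token budget so the
# prompt cache is already hot when the real request arrives
import requests

def warm_cache(context: str) -> None:
    requests.post(
        "http://localhost:5001/api/v1/generate",
        json={"prompt": context, "max_length": 1},
    )
[/code]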


