/g/ - Technology

File: noble meeku.png (2.16 MB, 768x1344)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107545298 & >>107535410

►News
>(12/10) GLM-TTS with streaming, voice cloning, and emotion control: https://github.com/zai-org/GLM-TTS
>(12/09) Introducing: Devstral 2 and Mistral Vibe CLI: https://mistral.ai/news/devstral-2-vibe-cli
>(12/08) GLM-4.6V (106B) and Flash (9B) released with function calling: https://z.ai/blog/glm-4.6v
>(12/06) convert: support Mistral 3 Large MoE #17730: https://github.com/ggml-org/llama.cpp/pull/17730
>(12/04) Microsoft releases VibeVoice-Realtime-0.5B: https://hf.co/microsoft/VibeVoice-Realtime-0.5B

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: threadrecap.png (1.48 MB, 1536x1536)
►Recent Highlights from the Previous Thread: >>107545298

--Cost-performance challenges in optimizing K2 models with limited GPU memory:
>107552388 >107552493 >107552518 >107552577 >107552593 >107552650
--Quantization vs model size performance tradeoffs:
>107550012 >107552809 >107552934 >107553336 >107552989 >107553444 >107553425
--Optimizing local AI models for Unreal Engine C++ development:
>107554300 >107554362 >107554461 >107554482 >107554686 >107554731 >107554743
--Prototype speculative decoding methods in llama.cpp lacking server integration:
>107551899 >107552450
--Challenges and considerations in distilling and fine-tuning advanced models:
>107548258 >107548358 >107548387 >107548382 >107548399 >107548441 >107548512 >107548619 >107548693 >107548928 >107552056 >107548781 >107548665
--Comparing safety and filtering of GPT-oss 20b vs Gemma models:
>107546443 >107546488 >107546704 >107546718 >107546734
--ExL3 lacks Kimi-K2 support:
>107550440 >107550450 >107550517 >107550548 >107550553 >107550601 >107550629
--Roleplay model performance tradeoffs: 4.5 Air vs GPT-OSS-120B vs Qwen Next 80B:
>107551643 >107551662 >107551678 >107551721 >107552290 >107552464 >107552586 >107552490 >107552515
--ikllama Windows performance issues likely due to flash attention implementation:
>107549210 >107552291 >107552912
--Token banning compatibility issues between roleplay AI backends:
>107550863 >107550873 >107550885 >107550914 >107550969 >107551045 >107551472
--NVIDIA RTX PRO 6000 GPU configuration and power management issues:
>107545503 >107545537 >107545530 >107545636 >107553858
--Comparing censorship in GPT-OSS-120B vs unrestricted models like GLM Air:
>107546681 >107548705 >107549905
--Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs:
>107546364 >107546435
--Miku (free space):
>107545415 >107547832 >107548687 >107550440

►Recent Highlight Posts from the Previous Thread: >>107545300

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
miku a shit
>>
Unbelievably based developments, llamabro.
>>
miku a love
>>
Gemma soon
>>
No offense, cuda dev, but I don't need settings automation; I need proper NUMA TP because RAM prices are jacked. IK is rolling up and smoking exllama right now.
>>
Gemma 3 27B is the only stable, non-schizo model in the sub-$2k runnable hardware range; GLM-4.5 Air is too schizo and often makes 7B-tier mistakes. So I'll be looking forward to Gemma 4.
>>
>>107557458
Mogged by Mistral Small
>>
>>107557523
I don't think so, but if they finetuned it like Ministral 3 14B (without the bad quirks) there might be some chance. Vision would still lose bigly, though.
>>
File: google-hf.png (59 KB, 592x460)
>>107557425
Context in picrel.
https://x.com/osanseviero/status/2000493503860892049
>>
>>107557568
>if they finetuned it like Ministral 3 14B
Ministral is liquid shit though, it's small for megavramlets with copyrighted stuff ripped out of its dataset.
>>
>>107557585
The latest Ministral 3 models have unexpectedly nice creativity and writing, but their system instruction-following capabilities are very inconsistent and they have issues with message repetition, so they come off as retarded/broken because of that.
>>
>>107557577
WE WILL FINALLY GET NEW SHITTY SYNTHETIC SOTA-SAFE PURPLE PROSE OPTIMIZED MODEL
>>
>>107557577
Can't wait to download Google's new... um, you know... their "thing"...
>>
for erp, I've only ever run nemo and mistral small. If I buy the hardware for glm air, will my mind be blown or will it be disappointing?
>>
►Recent Highlights from the Previous Thread: >>107545298

(2/2)

--llama.cpp updates for efficient GPU settings automation and user configuration debates:
>107556876 >107556898 >107556943 >107557034 >107557060 >107557120 >107557167 >107557163 >107557275
--Text generation parameter debates: temperature, minP, and TopK effectiveness:
>107555084 >107555121 >107555140 >107555175 >107556538 >107556572
--5090 GPU system configuration challenges for Australian buyers:
>107556007 >107556070 >107556107 >107556124 >107556142 >107556143

►Recent Highlight Posts from the Previous Thread: >>107545300

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>107557633
nah it's not great
>>
>>107557633
If you have complicated scenarios where you want the model to pick up how characters feel without having to spell it out, air is definitely smarter. But for simple ERP I wouldn't say it's an improvement. It doesn't really write better.
>>
>>107557619
I'm looking forward to Gemma 4 providing me with better access.
>>
>>107557633
It's a sidegrade. Its prose is a bit nicer, but the low active param count means it will make dumb mistakes more often, and it also frequently parrots the user's replies.
>>
>>107557654
>>107557675
what about a Q3 of glm 4.5?
>>
>>107557633
If you plan to buy hardware specifically to run model X (rather than buying hardware for other reasons, where running model X is just a nice side effect), you really ought to rent some cloud hardware and give it a try for a day or two beforehand.
>>
>>107557691
Buying new hardware in the hopes of running a cope quant is never a good idea.
>>
>>107557633
better hardware just means less forgetfulness and faster tps
the writing quality will be very similar
>>
>>107557704
there is glm 4.6 which is better than 4.5, but it's kinda overbaked and lacks knowledge and intelligence. deepseek r1 q2 does feel like an upgrade. but now that ram is five trillion times more expensive idk what people should do
>>
>>107557805
Crazy that stacking 3090s is now the 'poorfag' option.
>>
>>107557805
nta but which Dipsy is best Dipsy for creative writing?
>>
>>107557832
Original R1 is the best for creatively sucking your dick
>>
>>107557453
That is one of my immediate next priorities, and the only reason I didn't do it first is that multiple other people had expressed interest in working on tensor parallelism (and then didn't deliver).
I will not delegate it again and hope to produce a working prototype over the Christmas break, when I will have plenty of time.
>>
>>107557816
heh, I stick with what I know. About to buy an 8th 3090. I don't want to deal with different cuda versions, etc
>>
File: ram2025.png (1.96 MB, 1520x1024)
>>107557816
picrel
>>107557832
stellar. no model (that i tried) handles unformatted mikupad storywriting better. and yes, original r1
>>
>>107557899
>and then didn't deliver

That's why I don't PR features to llama.cpp, I don't want to fuck your project up with features I know I might not maintain for more than a few months.

Luckily Claude is good at handling merges when I fetch upstream.
>>
>>107557899
It's like the only thing you can count on is yourself. Always in all ways.
>>
>>107557453
Has IK_ done anything relevant in the past few months? I'm still using my version from october for K2/GLM.
>>
>>107558025
We have regular tensor parallel now for fully offloaded models and some MoE.
>>
>>107558029
I assume so, but not yet for the basic -ot exps=cpu scenario?
>>
>>107558035
your prompt processing will get faster if it's on GPU.
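A toy sketch of why, assuming numpy and made-up sizes (nothing llama.cpp-specific here): prompt processing pushes the whole prompt through one big batched matmul, which is exactly what a GPU eats for breakfast, while generation only ever does one token at a time.
[code]
# Toy illustration (numpy, CPU-only, made-up sizes): one batched matmul over
# all prompt tokens vs. one matmul per token, like generation does.
import time
import numpy as np

d = 2048
W = np.random.randn(d, d).astype(np.float32)          # stand-in for a weight matrix
prompt = np.random.randn(512, d).astype(np.float32)   # 512 prompt tokens

t0 = time.time()
_ = prompt @ W                                         # all prompt tokens at once
pp = time.time() - t0

t0 = time.time()
for tok in prompt:                                     # one token at a time
    _ = tok @ W
tg = time.time() - t0

print(f"batched: {pp:.3f}s, token-by-token: {tg:.3f}s")
[/code]
A GPU widens that gap even further, which is why keeping the dense/attention weights on GPU matters most for prompt processing.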
>>
File: gemma-4-200b-jagganath-it.jpg (537 KB, 1024x1024)
>>107557577
sirs we are so back
>>
>>107557995
Share your secret stash of patches, you selfish fuck. Maybe some vibecoder can point Claude at your repo and make the PRs you refuse to make.
>>
>>107557577
I think we should see related PRs soon in the main backends, but there's nothing yet.

https://github.com/huggingface/transformers/pulls
https://github.com/vllm-project/vllm/pulls
https://github.com/ggml-org/llama.cpp/pulls
>>
>>107558113
we are so back
gemma 4 will save us
>>
>>107558115
Just like mistral saved us and air saved us?
>>
>>107558122
true air has never tried
>>
4.6 Air will be released today.
>>
4.6 Air will not be released today.
>>
>>107558137
What are you breathing?
>>
>>107558080
>Share your secret stash of patches, you selfish fuck.

Selfish would be spamming their code base when I know I don't have time to actively maintain it.

My shit is all niche (rpc-server rewrite that requires a copy of the gguf on each node, grpc-server, re-implement training, dodgy xcodec2 implementation, etc) and I don't have the rocm/sycl/metal hardware to test it for all their platforms.
>>
Currently unlisted
https://huggingface.co/google/gemma-4-100b-pt
https://huggingface.co/google/gemma-4-100b-pt
https://huggingface.co/google/gemma-4-100b-pt
>>
>>107558278
Sorry. I messed up the links
https://huggingface.co/google/gemma-4-100ba10m-pt
https://huggingface.co/google/gemma-4-100ba10m-pt
https://huggingface.co/google/gemma-4-100ba10m-pt
>>
>>107558278
>>107558292
jagganath bless.
>>
>>107558292
that would be interesting, therefore it won't happen
>>
https://huggingface.co/google/gemma-4peepeepoopoo
secret — do not share
>>
>>107558341
fuck you racist mc
>>
File: 1738083735147213.png (351 KB, 1080x1073)
>>107558329
>>
>>107558329
I'm just waiting for 10ma100b.
>>
>>107558278
-pt means portuguese only, btw. I hope it's not confusing.
>>
>>107558357
That's a lot of layer reusing.
>>
>>107558385
It's about time somebody seriously explored layer recursion for production LLMs.
>>
>>107558385
The intellect of a god, the knowledge of a nematode worm.
>>
>>107554263
> tl;dr open shorts with leverage, right?
I'm not a fan of any financial instrument that can lose you more than your investment.
If you know how to use shorts and are comfortable with them, great. But those mean you have to have the timing exactly right.
If you're the one writing the laws or cutting the big checks, or know those who do, you can get that timing exactly right. Everyone else is guessing.
>>
>>107558029
so it supports proper parallel requests? like vllm?
>>
>>107558505
Yes, but performance is more like exllamav2 than vllm. 25 t/s llama-3-70b on 3x3090.
>>
>>107557577
Bharat class gemma 3 superfinetune will do the needful.
I am of refreshing page
>>
Bad timing

https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
https://github.com/ggml-org/llama.cpp/pull/18058
https://developer.nvidia.com/blog/inside-nvidia-nemotron-3-techniques-tools-and-data-that-make-it-efficient-and-accurate/
https://huggingface.co/blog/nvidia/nemotron-3-nano-efficient-open-intelligent-models
>>
>>107558565
we are so back
>>
>>107558565
a whole pile of stinky vramlet shit
>>
File: 1752263899934131.jpg (349 KB, 1920x1080)
>>107558565
>math and code benchmax dataset tune of a math and code benchmax model
>>
File: 1742157675597120.png (125 KB, 923x482)
>>107558565
main advertising point is the speed cope (it's as smart as oss-20b on [hand-picked benchmark])
maybe the mamba hybrid jamba wambo thing is interesting but I have no hope
>>
>>107558565
Bloody Vishnu... not Nemotron. This is bollocks.
>>
>>107558583
artificial anal cysts
>>
>>107558550
i just need ik to properly support tool calling to be usable for true local agentic coding so we can plug it into Opencode, roocode...
>>
>>107557633
you're better off running 70Bs
>>
>>107558583
>maybe the mamba hybrid jamba wambo thing is interesting
llama.cpp support ETA: half past never
>>
File: wait.png (19 KB, 912x103)
>>107558565
>>
>>107558685
llama : add support for NVIDIA Nemotron 3 Nano #18058
https://github.com/ggml-org/llama.cpp/pull/18058
>>
>>107558693
uh-oh, stinky!
>>
>>107558565
interesting
>Nemotron 3 Super and Ultra introduce latent MoE, where experts operate on a shared latent representation before outputs are projected back to token space. This approach allows the model to call on 4x more experts at the same inference cost, enabling better specialization around subtle semantic structures, domain abstractions, or multi-hop reasoning patterns.
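Roughly what that might look like as a sketch, assuming PyTorch; the dims, names, and naive dispatch loop are all made up, not NVIDIA's actual implementation. The router still scores experts in token space, but the experts run on a shared, smaller latent vector, so more of them fit in the same compute budget:
[code]
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMoE(nn.Module):
    def __init__(self, d_model=512, d_latent=128, n_experts=16, top_k=4):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)   # shared projection into latent space
        self.up = nn.Linear(d_latent, d_model, bias=False)     # shared projection back to token space
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # experts are cheap because they act on d_latent, not d_model
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_latent, 4 * d_latent, bias=False),
                          nn.SiLU(),
                          nn.Linear(4 * d_latent, d_latent, bias=False))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                          # x: (n_tokens, d_model)
        z = self.down(x)                           # shared latent representation
        probs = F.softmax(self.router(x), dim=-1)  # routing still scored in token space
        topw, topi = probs.topk(self.top_k, dim=-1)
        topw = topw / topw.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(z)
        for k in range(self.top_k):                # naive dispatch loop, fine for a sketch
            for e, expert in enumerate(self.experts):
                mask = topi[:, k] == e
                if mask.any():
                    out[mask] += topw[mask, k].unsqueeze(-1) * expert(z[mask])
        return self.up(out)                        # project back to token space

x = torch.randn(3, 512)
print(LatentMoE()(x).shape)                        # torch.Size([3, 512])
[/code]
Since each expert works on the smaller latent dim, you can route to more experts per token for roughly the same FLOPs, which is presumably where the "4x more experts at the same inference cost" claim comes from.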
>>
>>107558727
too bad it's fucking shit
>>
>>107558727
Maybe they should take their cutting-edge technologies and apply them to a model that wasn't already garbage to begin with
>>
>>107558565
Make a 100A10b or something.
>>
>>107558776
I want a 60BA30B.
>>
>>107558070
https://voca.ro/18gbO3rnlIND
>>
>>107558029
does it need any flag to enable it? i'm launching a few queries but it's putting them in a queue instead of responding to them at the same time
>>
File: file.png (508 KB, 1078x1435)
>>107558565
it's garbage
>>
>>107558760
It was pretrained from scratch on 25T tokens.
>>
>>107558860
if a white man's skin started turning shit brown from being in close proximity of tech jeets, would that be reverse shittiligo?
>>
>>107558565
goof bros let's fucking gooo
>>
>>107558701
took 'em long enough
>>
>>107558574
wow, congratulations, anon. By posting shit like this for the 1 millionth time your dick has fallen off and turned into a vagina, fulfilling your lifelong goal of becoming a real womxnxn.
>>
>We want to hear from you! Share your ideas, vote on what matters, and help shape the future of Nemotron.
>https://nemotron.ideas.nvidia.com/
What would we, as the /lmg/ collective, like these models to have?
More "natural sounding human generated" data?
>>
>>107558655
I haven't tried it recently with Roo. I was using ClaudeCode with Qwen3 with the anthropic endpoint on mainline. I guess I'll try ikllama next week.
>>
>>107558905
but he'll never be a real woman
>>
>>107558918
Powerful log.
>>
File: 1578829723654.gif (3.54 MB, 280x200)
>>107558860
>pajeeted
How do tech companies keep falling for this?
It's literally just been one major tech blunder after another, worldwide, since the great pajeeting began.
>>
Gemma-4 has image gen? Why the diffusers stuff in the PR?
>>
48GB vramlet here
Miqumidnight still queen?
>>
>>107558930
i have a suspicion it was the satan cat anon that suddenly power-moved everyone in this general into never sharing logs again. can't top 'em.
>>
File: synthmaxx.png (203 KB, 706x895)
>>107558909
You'll never get anything like that from Nvidia Nemotron models. They're meant to be safe benchmaxxed models trained on crawled web data and synthetic data.
>>
>>107558860
I understand your prejudice, but just because someone attended university in the US doesn't automatically mean they're unqualified.
>>
>>107558966
>synthetic code
Oh god, it must shit out absurd amounts of comments when writing code.
>>
>>107558966
I'm aware, but the vote is open, so feel free to go wild.
>>
>>107558909
>Introduce a “semantic firewall” layer that optimizes inference at the language-law level — a symbolic energy compression mechanism that cuts redundant compute cycles while preserving meaning fidelity.
>Instead of scaling by GPU count, this layer redefines compute as coherence between intention and output.
>It’s a governance-first, efficiency-driven approach: models learn to “understand” before they “generate,” lowering both latency and energy use.
People sure love posting the llm schizo ramblings everywhere.
>>
>>107558990
https://nemotron.ideas.nvidia.com/ideas/LLAMANEMO-I-47
>>
>>107558918
FUCK YOU SATAN FUCK YOU SATAN
KILL SATAN KILL SATAN
DIE DIE DIE DIE DIE
>>
>>107558859
You're misunderstanding what tensor parallel is. It means splitting the processing across your GPUs, not handling parallel requests on the server.
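A toy numpy sketch of the difference (nothing here is ik_llama code): tensor parallel splits a single request's matmul across devices, while serving several requests at once is batching on top of the same full weights, which is a scheduler feature.
[code]
import numpy as np

d, vocab = 8, 16
W = np.random.randn(d, vocab)            # one weight matrix of the model
x = np.random.randn(d)                   # activations for ONE request

# tensor parallel: the SAME request's matmul is split across two devices
W0, W1 = np.split(W, 2, axis=1)          # each "GPU" holds half the columns
y = np.concatenate([x @ W0, x @ W1])     # partial results get gathered back
assert np.allclose(y, x @ W)

# parallel requests are a different thing: several independent prompts
# batched through the SAME full weights (what a serving scheduler gives you)
X = np.random.randn(4, d)                # four independent requests
Y = X @ W                                # one batched matmul serves them all
[/code]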
>>
>>107559048
A little blunt, but I'll take it.

>>107559032
LMAO
>>
>>107559052
unhinged and based
>>
>>107558905
did I strike a nerve? insult me harder, maybe it will let you run a bigger model.
>>
>>107558565
>The model was trained with 25T tokens,
Synth-slopped and hyper-fit. This shit will be amusing if nothing else.
>>
>>107558959
strawberry lemonade not bad
>>
>>107559086
pajeet/kike level self awareness on display.
You literally just insulted multiple people in the thread and now you're acting like I threw the first punch.
Holy shit.
Your mother really fucked up with you
>>
Considering a cope-quant of super nemotron 49B. Is it any good?
>>
>>107559094
oh no.. the poors are seething. whatever will I do. 5b of their own active parameters are now upset. to the moon rocket emoji.
>>
>>107558959
24gb vramlet here running it at iq2_s
i'm still happy with it and it somehow quantizes really well
>>
>>107558951
Subversion
>>
>>107559048
anons please vote this is our chance
>>
>>107559133
crab
>>
>>107559048
>>107559133
It's obviously a long shot, but might as well.
>>
>>107559048
Will never happen again with NVidia's name on it. They'll only train their models with open source safe and effective datasets, now.


