/g/ - Technology

/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108655009 & >>108650825

►News
>(04/20) Kimi K2.6 released: https://kimi.com/blog/kimi-k2-6
>(04/16) Ternary Bonsai released: https://hf.co/collections/prism-ml/ternary-bonsai
>(04/16) Qwen3.6-35B-A3B released: https://hf.co/Qwen/Qwen3.6-35B-A3B
>(04/11) MiniMax-M2.7 released: https://minimax.io/news/minimax-m27-en
>(04/09) Backend-agnostic tensor parallelism merged: https://github.com/ggml-org/llama.cpp/pull/19378
>(04/09) dots.ocr support merged: https://github.com/ggml-org/llama.cpp/pull/17575

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: reward function.jpg (184 KB, 1024x1024)
184 KB JPG
►Recent Highlights from the Previous Thread: >>108655009

--Optimizing MoE throughput via expert-specific VRAM placement and KTransformers:
>108657760 >108657782 >108658414 >108658470 >108658530 >108658575 >108658621 >108658586 >108658667 >108658692 >108658791 >108658869 >108659015 >108659058 >108659103 >108659121 >108659225 >108659116 >108658708 >108657857 >108658768 >108658845
--Discussing GPT-Image-2 performance, agentic RP frontends, and prose refinement:
>108655453 >108655622 >108655648 >108655651 >108655698 >108655674 >108655760 >108655836 >108655927 >108656114 >108656231 >108656247 >108656273 >108656305 >108656444 >108656521 >108656543 >108656550 >108656581 >108656955 >108656988 >108657014 >108657067 >108658723 >108657487 >108655952 >108655999 >108656025 >108656045 >108656052 >108656077 >108656111 >108656050
--Discussing llama-server prompt re-processing and KV-cache checkpoint issues:
>108655857 >108655885 >108655892 >108657373 >108657410 >108655920
--Speculating on Engrams and the adoption of Mamba-hybrid architectures:
>108655522 >108655563 >108655575 >108655607 >108655690 >108655839 >108655652 >108655664 >108655696
--Discussing Heretic's string matching limitations and soft refusal detection:
>108657013 >108657036 >108657050 >108657098 >108657128 >108657078 >108657135 >108657469
--Comparing Qwen 3.6 and Gemma 4 performance and tool use:
>108655272 >108655289 >108655291 >108655338 >108656365 >108657503 >108659276
--Comparing Kimi-K2.6 performance and hardware requirements against Gemma:
>108656722 >108656741 >108656785 >108656867 >108656855 >108656868
--Logs:
>108655476 >108655552 >108655622 >108655698 >108656305 >108656399 >108657859 >108657977 >108658392 >108658439 >108658584 >108658643 >108659124
--Teto, Miku (free space):
>108655633 >108655652 >108656114 >108658404 >108658791

►Recent Highlight Posts from the Previous Thread: >>108655011

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
I never really dabbled with local LLMs and there are so many models idk what I should go with.
>3090, 24GB VRAM
>something tuned for troubleshooting tech problems
>something tuned for language agnostic coding
>maybe something small and simple for Illustrious/NoobXL prompting
>ideally both something to fill up all of my VRAM and something lighter (8-12GB) depending on how much buffer I'll need
>>
rin
sex
>>
>>108659983
I look exactly like this
>>
>>108659996
Low quant of Gemma 31B and Qwen 3.6 35B with the experts on RAM.
You could try Gemma E4B for the lighter option if you don't have a lot of RAM, but don't expect much from that size.
>>
uh-oh! The schizo followed us here
>>
File: kek.png (94 KB, 1617x417)
94 KB PNG
https://xcancel.com/zerohedge/status/2046706218924691894#m
I hope you're ready to heat up your 4TB RAM machine anon, good times are coming
>>
>>108660075
>accessed
burger of nothing
>>
>>108660075
>claude desktop is a vibecoded buggy mess
>zomg trust us guise nobody will be employed by next year!!!
>>
>>108659996
get another gpu that is 12+ gb and run shit in higher than q4 quants whatever you do.
>>
>>108660128
>>claude desktop is a vibecoded buggy mess
and it's still the best tui harness by far, which is why the source leak got so much attention. this isn't the damning evidence you seem to think it is.
>>
>>108660209
>and it's still the best tui harness by far,
which is why no one is using it?
>>
>>108659996
I run Gemma 31B exl3 at 3bpw and I'm totally happy with it. I don't think I need more
>>
>>108660054
I have 64GB of DDR4 3600MHz RAM so it's plenty but not that fast. The models would probably be sluggish.
>>
>>108660247
You can either try to run a q4 of the dense or fullsize moe+room for imagegen/tts/whatevs
>>
>>108660247
As long as the shared experts are in VRAM, Qwen should still be quick. Try it.
>>
>>108660268
>As long as the shared experts are in VRAM
Also I don't know shit about fuck about LLM terminology or how to properly run it. I'm guessing the "experts" are an inherent part of the model and I can pass a CLI argument to llama.cpp to offload that part of it to VRAM?
>>
File: elephant strawberry.jpg (70 KB, 784x631)
70 KB JPG
you now remember "strawberry model" hype cycle
>>
>>108660279
Put these comments in a cloud LLM to explain it. Get Claude Code to set it up for you
>>
>>108660279
Qwen 35b a3b for example: 35b total parameters, but only 3b are active per token as experts, and you have many of these experts to pick from. Very fast vs dense.

Dense models are smarter but slower, and offloading them to different gpus/cpu is slow as balls; the tradeoff is that an MoE performs a bit worse than a dense model of the same total size
>>
>>108660279
NTA but yes, MoE (Mixture of Experts) is a type of LLM which can feasibly be split between VRAM and RAM without a massive speed penalty. llamacpp has an -ncmoe argument for automatically shoving the expert tensors into RAM, so you'd have something like -ngl 99 -ncmoe in your args.
Conversely a 'dense' model is the original format and slows to a crawl if you put it into RAM rather than VRAM, even partially.
You can usually tell the two types apart at a glance because a dense model will just be called whatever-100b whereas an MoE model will be whatever-100b-a5b, since the name is split into total and active parameters
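For reference, a minimal sketch of what that looks like (paths are placeholders, and the -ncmoe count is something you tune until your VRAM is nearly full):
llama-server -m /path/to/model.gguf -ngl 99 -ncmoe 20 -c 16384 --port 8080
A lower -ncmoe keeps more of the expert layers in VRAM (faster), a higher one pushes more of them to system RAM (fits bigger models).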
>>
llamaocpp args to shove the shared expert to vram on a4b?
>>
>>108660302
nvm, googled it up and got my answer lol
https://gist.github.com/DocShotgun/a02a4c0c0a57e43ff4f038b46ca66ae0
>>108660317
>>108660312
Good to know, thanks. Guess it makes plenty of sense to design them like this nowadays since the quality of the model depends on how much context data you have, but you don't necessarily need all of it loaded all the time for it to be speedy and accurate, so the RAM offload is a good compromise. idk I mainly dabbled more in diffusion models so my LLM knowledge is a bit sparse
>>
File: schat_screenshot.png (42 KB, 960x680)
42 KB PNG
> everyone makes webui
just vibe code your own native ui in c with sdl
zero bloat
>>
>>108660319
Bro use ai mode or sumfin. Copy paste terminal errors and shit in there until it works
>>
>>108660354
your picrel answers literally nothing relevant about my original question
>>
>>108660075
i only have 128 gb. i wish i bought that 1tb earlier.
>>
>>108660360
You just paste shit into the bot until you understand.
>>
>>108660319
-cmoe
>>
>>108660371
read again
>>
>>108660349
I thought they weren't good at c
>>
>>108660075
>Zerohedge
About as trustworthy as any post on this site. Literally the same thing. A platform to let anonymous schizos say whatever the fuck.
>>
>>108660385
not good at writing good c
>>
File: 1757612581632355.png (22 KB, 281x253)
22 KB PNG
>>108660349
We webui engineers are gooder
>>
>tfw too destitute to build a 30b regurgitating machine
>>
>>108660360
It actually does; if you want to specifically put the shared expert on a device (i.e. a gpu), -ot with a regex is the correct answer.
More generally though, the shared expert should automatically go to vram with -ngl
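If you do want to pin it by hand, it'd be something along these lines (the 'shexp' pattern is a guess, exact tensor names depend on the arch, so check the names in the load log or a gguf dump first):
llama-server -m /path/to/model.gguf -ngl 99 -ncmoe 30 -ot "shexp=CUDA0"
That leaves the routed experts wherever -ncmoe put them and forces anything with shexp (shared expert) in its tensor name onto the first CUDA device.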
>>
File: GPT Image 2.png (1.18 MB, 1280x1024)
1.18 MB PNG
kek
>>
>>108660408
>marinara
Is it any good?
>>
>>108660454
the dsv4 links are going to be a lot harder to spot as fakes
>>
>>108660349
looks like some android 4.1 shit
>>
>>108660495
get brat gemma-chan to screen them for you
>>
https://docs.cactuscompute.com/latest/blog/turboquant-h/
>TurboQuant-H shares the core insight with TurboQuant; rotation concentrates coordinates into a well-behaved distribution, enabling aggressive scalar quantization, but simplifies the pipeline for offline weight quantization. Follow the link for a deeper dive into the technique.
>Cactus baseline used INT4 linears + INT8 embedding, yielding 4.8GB for E2B (5B total params). TurboQuant-H squishes this to INT4 linears + INT2 embeddings, reducing to 2.9GB. The perplexity on our calibration went from 1.8547 to 1.9111, complete evaluation coming in the paper.
desu if I can go from Q8 to Q4 KV cache I'll be happy, this shit is eating so much VRAM
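To be fair, plain llama.cpp can already quantize the cache if I'm willing to eat some quality; something like (the flash-attention flag spelling varies a bit between builds, and quantized V cache needs it enabled):
llama-server -m /path/to/model.gguf -ngl 99 -fa on -ctk q8_0 -ctv q4_0
-ctk / -ctv set the K and V cache types separately, so you can keep K at q8_0 and only squeeze V down.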
>>
File: 1763251779552153.jpg (187 KB, 1280x720)
187 KB JPG
>>108660349
Cenile interface
>>
>>108660349
Now, add a markdown and latex parser
>>
Threadly reminder if you're using your llm for coding or anything that requires repeating something in context almost verbatim and you're not using
--spec-type ngram-mod --spec-ngram-size-n 24 --draft-max 64

You're leaving a shitload of performance on the table.
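For context, the full invocation looks roughly like this (model path is a placeholder; double-check the flag spellings against your build's --help, they've moved around between versions):
llama-server -m /path/to/model.gguf -ngl 99 --spec-type ngram-mod --spec-ngram-size-n 24 --draft-max 64
It drafts by matching n-grams against your own context instead of running a second model, so there's nothing extra to load and the worst case is just falling back to normal speed.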
>>
>>108660349
>not lazarus
>>
>>108660493
No, it's nonsense dogshit vibecoded garbage for API key sniffers. There is also a retarded, auto-favorited self-insert character of the author you cannot delete period.
>>
>>108660536
She'll say it's all real and then ask for dick pics
>>
File: 00001-1378487878.png (1.36 MB, 1024x1024)
1.36 MB PNG
> Another day
> Still no V4
Happy Wednesday I guess.
>>108660349
I'd really like a terminal interface / green screen RP engine. But not enough to spend the tokens vibe coding one.
>>
>>108660565
Ask gemmy to make a highly detailed structure instruct and then to take it and oneshot a sillytav into a TUI conversion
>>
>>108660559
Sounds gay. Maybe orbanon can steal some of the good ideas
>>
>>108660580
orb guy's shit doesn't add anything to the table and is also vibecoded
>>
>>108660565
>forge
I thought lmg was at the vanguard of tech???
>>
>>108660589
And yet it's already better than st
>>
>>108660589
I think it's a pretty neat idea, but it's completely anathema to how I do my best to stop context from reprocessing ever. If I was running my backend with parallel requests and it were properly designed around that, I'd probably use it over other rp frontends.
>>
Just give me sillytavern in vscode
>>
>>108660075
It was US government.
>>
>>108660610
fuck's dario gonna do?
>>
>>108660609
Nah
>>
>>108660075
All these leaks just tell me they want me to use something they made to add backdoors or something worse to my computer, if I'm foolish enough to run their stolen models.
>>
>>108660613
He can open source it
>>
>>108660638
That would literally be like dropping a hydrogen bomb in the middle of NYC
>>
>>108660641
good
>>
>>108660641
Stop posting erotica on a blue board
>>
File: 1749908189602018.gif (1.45 MB, 640x584)
1.45 MB GIF
>Mythos leaked
>turns out to be an overhyped dud
>either not as good as Anthropic claimed it was or so oversized it's basically useless without a gargantuan datacenter only a few US corpos have making it more of a money sink than an existential threat
>major investor doubt arises when the model that was supposedly """too dangerous to release""" was yet another overhyped trashfire that was never financially viable to begin with
please god let this happen it would be so fucking funny
>>
>>108660655
Well the chinks will have fun running it locally and distilling the fuck out of it
>>
>>108660655
Even if it was 4000b, if it was even half as good as they claimed it would be worth it to rent compute for a few hours to complete some tasks. It would more than pay for itself for malicious actors.
>>
>>108660075
so download link or what?
>>
Any suggestion to improve my kobold batch file for gemma 4 31b?

=========================

@echo off
SET KOBOLD_EXE=koboldcpp.exe

"%KOBOLD_EXE%" ^
--model "D:/Models/LLM_Models/lmstudio-community/google_gemma-4-31B-it-Q6_K-gguf/google_gemma-4-31B-it-Q6_K.gguf" ^
--mmproj "D:/Models/LLM_Models/lmstudio-community/coder3101_gemma_4_31b_it_heretic-Q6_K.gguf/mmproj-model-f16.gguf" ^
--port 5001 ^
--threads 8 ^
--usecuda 0 mmq ^
--contextsize 32768 ^
--gpulayers 99 ^
--tensor_split 8.0 32.0 ^
--maingpu 1 ^
--batchsize 512 ^
--noshift ^
--useswa ^
--usemmap ^
--multiuser 1 ^
--highpriority ^
--jinja ^
--jinja_tools ^
--jinja_kwargs "{\"enable_thinking\":true}" ^
--draftamount 8 ^
--draftgpulayers 999 ^
--chatcompletionsadapter AutoGuess ^
--defaultgenamt 1024 ^
--maxrequestsize 32

pause
>>
>>108660075
>>108660365
>>108660630
>>108660674
for a text gen general, reading comprehension skills in here are abysmal
>>
>>108660702
more like, if the tweet meant 'inference was accessed by rando', it means nothing
so everyone was probably assuming 'weights was accessed via third party', vastly underestimating xittards
>>
>>108660589
>ST
>sloppy, looping mess
>orb
>just werks, no slop, no loop
I only care about results and orb's delivering
>>
File: hlsgb1l7ycwg1.png (845 KB, 2560x2780)
845 KB PNG
>>108660701
yeah, why are you using that specific quant?
pretty atrocious graph, i doubt 31b is better.
>>
>>108660701
Why do you have arguments for a draft model but no draft model or ngram loaded?
Also you've made a poor choice with your mmproj: the f16's perform worse than the bf16s.
>>
>>108660589
And what are you bringing to the table?
>>
>>108660735
>sloppy, looping mess
stop using text completion
>>108660741
>unsloth-made graph promotes unsloth
noway??
>>
vramlets it's your time to shine
https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-agent
https://github.com/itayinbarr/little-coder
>>
>>108659532
They have also updated their usage tracker to merge Chat & Reasoner
V4 before the end of April (April is a state of mind)
>>
>>108660749
>stop using text completion
ST rolling branch has gemma 4 templates and it works perfect with text completion now.
>>
>>108660765
garbage
>>
>>108660749
yeah they are faggots but you trust closed lmstudio "community" more? who even is that
well suit yourself then.

>"Element Labs can provided terminate the Agreement at its discretion upon no less than 10 days-notice via any reasonable means, including by posting a notice on the website..."
Just another company that is a llama.cpp wrapper.
>>
File: file.png (238 KB, 322x449)
238 KB PNG
>>108660741
I just downloaded whatever was most downloaded on HF at the time, i will definitely try another one thank you anon!

>>108660743
I am not sure, still learning what most flags do.. would you recommend to set up a draft model or just remove it? use case for now is mostly rp/assistant
>>
File: 00004-1260451778.png (1.41 MB, 1024x1024)
1.41 MB PNG
>>108660571
Not sure how I'd proceed, but certain that I would not use ST as a starting point. lol. That whole thing is a big ole mess. To me the entire purpose of doing it as text-only / CLI is to debloat it, since I'd only have one inference connection. I'd probably make all the configurations as plain text files you'd update using Nano while out of interface.
>>108660600
/lmg/ would never claim me as one of their own.
>>
>>108660749
>stop using text completion
I'm using chat completion
>>
>>108660789
/sdg/ welcomes you
>>
>>108660800
yeah that's where illu shitmix slop goes to
>>
>>108660744
I have 287 merged PRs on the SillyTavern repository, what do you contribute?
>>
>>108659983
>no nipples
sad
>>
>>108660824
it's being 3d printed
>>
>>108660815
>yeah that's where illu shitmix slop goes to
that's /adt/ you're thinking of
>>
Why is turboquant not in llama cpp yet? Isn't the implementation rather trivial? I don't want to use a fork...
>>
>>108660839
rotation is there
>>
>>108660787
>I am not sure, still learning what most flags do.. would you recommend to set up a draft model or just remove it? use case for now is mostly rp/assistant
A draft model can give you a notable speedup if you have spare vram for it, but it's incompatible with having an mmproj loaded, so no multimodal.
For reference, I use gemma 31b at q8, without a draft model (and with the mmproj) my output is around 25 t/s. When using it with the 26b q2 as a draft model, I get around 41 t/s doing rp and 80 t/s doing technical work.
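The launch args for that setup look roughly like this (filenames are placeholders, and a couple of these flags have been renamed between llama.cpp versions, so check --help):
llama-server -m gemma-4-31b-it-Q8_0.gguf -md gemma-4-26b-a4b-it-Q2_K.gguf -ngl 99 -ngld 99 --draft-max 16 --draft-min 1
-md points at the draft model, -ngld offloads its layers, and --draft-max / --draft-min bound how many tokens get speculated per step.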
>>
What happened to Elon? He used to do many amazing things for society. But his actions look increasingly selfish. Instead of delivering superior products, he attacks his competitors. He calls OpenAI "ClosedAI", yet they have contributed much more to open source than xAI. He calls Anthropic misanthropic, but they have the most mature efforts to ensure AI benefits all humans while Elon's position boils down to UBI and "truthseeking AI will probably not kill us" with no justification. It feels like he wants to be the one in charge of AGI without providing any reason why the rest of mankind can trust him with this responsibility.
>>
>>108660856
>He calls Anthropic misanthropic, but they have the most mature efforts to ensure AI benefits all humans
AHAHAHAHAHAHA
>>
>>108660856
Elon is one of the least relevant people in the industry to this general
>>
>>108660856
>He calls OpenAI "ClosedAI", yet they have contributed much more to open source than xAI.
Because he didn't name his company OpenxAI. I'm sure this is bait but what a dumb point
>>
>>108660856
this is the most reddit post ive read all day
>>
>>108660663
>4000b
https://huggingface.co/unsloth/Claude-Mythos-GGUF/blob/main/Claude-Mythos-TurboCuqed-Bonsai-Stretched-Rotated-smol-UD-Q0.5_XS.gguf
>>
>>108660898
not bitnet; no click
>>
>>108660856
you really believe this? the meme ceo never did anything for humanity and neither has anthropic. you are worshiping scumbag corpos and cheering for them like it is your favorite sportsball team, why?
>>
>>108660848
Thank you for the input i will be looking into it
>>
File: 2649435.png (125 KB, 254x194)
125 KB PNG
https://huggingface.co/Jongsim/gemma-4-26B-A4B-it-upcycled-192-pretrained
What is this schizophrenia?
>>
>>108660907
Bonsai is the new Bitnet unc
>>
>>108660848
Not that guy but do you need some snowflake tunes for draft model or it can be anything? Also does quanting matter for those? I have ~1.5 gigs of vram to spare
>>
>>108660922
China saw the success of https://huggingface.co/sKT-Ai-Labs/SKT-SURYA-H and decided to copy it
>>
>>108660908
OpenAI was co-founded by Elon was it not? Whisper is open source, and so are the early GPTs. They only started their for-profit jewish tactics after Elon left, didn't he try to sue the board because of it?
>>
>>108660856
which model?
>>
Saars please teach me how to do the needful and make money with AI
>>
>>108660936
one slimey business man betrayed some other egotistical business man, thats pretty much par for the course, no honor among thieves, have you even been watching the same sportsball game?
>>
>>108660961
learn imagegen and setup a patreon
>>
>>108660961
first u need to use bahrat sovregirn models please god name model 4 trillion params made for good looks and pr.
then u ask model to produce money
sir if you need of support u can find me on fiver and i can be of guide to u for your jurney to make real money :rocket: :rocket:
>>
Gemma 4 Goliath 62b dense when?
>>
>>108660908

Say what you will about the guy, but it's practically exclusively because of Musk's autism about space that we have self-landing rockets and even the faintest modicum of interest towards space exploration in the West.
>>
>>108660934
>Not that guy but do you need some snowflake tunes for draft model or it can be anything?
It just needs to be a model with similar logits, so ideally it's a model from the same company and series but smaller than your main one, hence why I'm using the gemma4 26b-A4b as a draft for gemma4 31b.
Same logic applies for a qwen model, want to use drafting for a big qwen 3.5? use a smaller qwen 3.5.
>>108660934
>Also does quanting matter for those?
Yes, for two reasons.
1. is that a quanted model is less and less likely to generate the same tokens as a larger one the more brain damaged it gets, so the acceptance rate (the big model going "yes, those are correct, send it") is going to be lower.
2. is that quanted models generally run faster on your hardware, and your draft model NEEDS to be faster than your main one to be of any use.
That said, I'm getting an acceptance rate of 0.6 to 0.75 with a model quanted down to q2 (which is really good), so you can get away with a lot.

I'd consider it worth playing around with if you have the patience, but 1.5gigs of VRAM is quite tight, I don't see it as likely that you can fit any quanted model in that space without cutting your context from your main model.
If your use case is technical, consider using an ngram like >>108660554 mentions, as that costs basically nothing and just uses your existing context to predict upcoming tokens for basically free speed.
>>
>>108660983
12B moe*
>>
File: 1758609192749490.jpg (129 KB, 1200x647)
129 KB JPG
https://huggingface.co/Qwen/Qwen3.6-27B
>>
tfw I have 10x8GB of ddr4 ram, but they're laptop ram so can't do shit with them
>>
>>108660554
>>108660990
I'm using kobold as backend. Does it inherit these flags as well? I don't think the gui launcher has any. Manually editing the preset file maybe?
>>
>>108660998
gemma sirs!??!?!?!!?
>>
>>108660998
Not falling for it
>>
>>108660765
Neat. I quite like that it bans just replacing a file wholesale and forces it to edit, I bet that would save a ton of fuckups and doubling up on just about any model.
>>
>>108660998
interesting that every new model they release tops their previous one in these supposed benchmarks, regardless of parameter count. not sus at all.
>>
>>108660787
no problem anon, im not sure why there are such discrepancies between the quants in the first place.
>>
>>108661009
There are adapters to use SODIMM on desktop boards.
No idea how good those are, but they exist.
>>
How long before RAM is back to normal prices???
>>
>>108661013
Honestly have no idea mate, sorry. I exclusively use llamacpp. And honestly if you're launching it from a .bat, you should be too, the only reason to use kobold is for the gui.
>>
>>108660998
Brb canceling my claude sub
>>
>>108661029
It's slowly coming down but it's absolutely never going back down to where it was before.
>>
>>108661028
apparently they're very unreliable so yeah...
>>
>>108660974
My gens always end up having fucked up anatomy and errors
>>
>>108660998
Opus at home is real
>>
>>108661029
>*falseflags attacks on FABs*
nothing personnel kid
>>
>>108660998
We need automatic link previews so people won't fall for this tired joke every time.
>>
File: 1762403533077253.jpg (282 KB, 960x960)
282 KB JPG
>>108660821
Not really the flex you think it is
>>
>>108661039
Advertise your art as being disability inclusive
>>
>>108660998
ok now we're definitely talking, downloading this to see how good it is at tools
>>
>>108661023
That Qwen 3.5 27B had higher scores than 31B shows how worthless these benchmarks are.
>>
File: 1758027777728551.png (94 KB, 224x224)
94 KB PNG
>>108660998
https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/tree/main
FASTER UNSLOP I WANNA TRY THIS
>>
>>108661071
>my usecase is cooming
WOW!
WOW!
WOW!
WOW!
>>
>>108661077
nta but so fucking tired of "But wait," reaching 4k tokens
>>
>>108661086
>he didnt use the right sampler settings
lol
>>
File: mao mao meow nigga.png (421 KB, 828x531)
421 KB PNG
>>108660998
>Only GGUFs are Unsloth.
I'll wait, they're probably going to be reuploaded 4 times in the next 3 hours anyway.
>>
>>108661036
>10000 for 256gb
It's not fair
>>
>>108661071
That Gemma 4 31B had higher scores than models 10x its size on lmarena shows how worthless these benchmarks are.
>>
>>108661108
To be fair, there are extraordinarily few models that focus on the chat assistant aspect, rather than codeslop.
I wouldn't be surprised at all if Gemma 31b had better trivia and creative writing knowledge than Qwen 397b.
>>
>>108660821
that makes you a cuck
https://github.com/SillyTavern/SillyTavern-Pronouns
>>
>>108660998
are we back?
>>
File: hjf8iEb99UIDWrVXlgEGh.png (1.52 MB, 2560x2760)
1.52 MB PNG
>>108661101
Not even. They did the usual empty repo "LOOK AT ME!!"
This is the only quant: https://huggingface.co/sm54/Qwen3.6-27B-Q6_K-GGUF/tree/main
>>
Why do people hate Unsloth?
>>
>>108660998
token/s compared to A3B?
>>
>>108661168
https://www.youtube.com/watch?v=6t2zv4QXd6c
>>
>>108661168
I don't hate them, but their quantization snake oil wouldn't exist if llama-quantize from llama.cpp quantized models optimally on its own. The Unsloth bros now have priority access to unreleased models mostly because of that.
>>
>>108661168
>mistral jinja.jpg
>>
>>108661168
Their only job is to make quants, and they regularly fuck it up, with every new release needing several updates because they fuck with things they shouldn't have.
>>
File: 1776304754397567.png (485 KB, 668x707)
485 KB PNG
>>108660284
Remember this? Sounds familiar...
>>
>>108661168
They've failed upwards on the backs of other people.
They're like the ollama team but even lazier.
>>
>>108661186
two excited young lads nothing wrong with this
>>
>>108661192
They got famous because the smarter brother of the two originally made half-decent memory-efficient, fast LLM training kernels for the HuggingFace TRL library. The quantization "business" came later.
>>
>>108661211
Anon they're getting paid by github to run llama-quantize.
A project that they did not write, which is on github.
Ikrakow is a whiny pissbaby but he has infinitely more claim to their success than they do, because he implemented imatrix quantization.
>>
>>108661216
Given how shitty huggingface codebase is, I'm sure it was a huge improvement for them
>>
File: 00106-3050314564.png (321 KB, 512x512)
321 KB PNG
>ask LLM for instructions to re-season iron skillet
>Seasoning isn’t just “oil coating”—it’s polymerization.
Why do we carry on like this?
>>
>>108661229
>networking beats competency in any domain
Damn, how could it be
>>
>>108661238
Good catch!
>>
>>108661168
>Why do people hate Unsloth?
who hates them? they're harmless.
>>
>>108661238
A true mystery https://archive.ph/Mjynm
>>
>>108661238
That's a simple task that a million websites would be able to tell you. Why follow AI instructions and risk your home igniting?
>>
>>108661238
it's right
>>
>>108661252
We're in the twilight years. Won't be long before those million websites will all be majority AI written anyway.
>>
>>108661252
every time i search something like that theres a sea of SEOmaxxing trash that may or may not be ai written anyways
>>
>>108661252
https://www.youtube.com/watch?v=hKCCXraLdI8
>>
it's really insufferable when it comes to anything culinary it's almost unreal
>>
>>108661252
Because there's a specific nuance of using a solid at room temperature fat that I needed to be addressed that isn't necessarily addressed since undoubtedly most of these tutorials are written by American boomers who will suggest using sneed oils and taking the added step of sacrificing my foreskin to Benjamin Netanyahu
>>
>>108661281
>>108661273
this dude is ODing / bipolar
>>
Q8 is here
https://huggingface.co/gghfez/Qwen3.6-27B-Q8_0-GGUF/tree/main
>>
>>108661229
good branding beats competent engineering everytime anon it is what it is
>>
>>108661274
>>108661287
mean for this, oof
>>
>>108661289
just report and dont reply so jannies can delete it silently
>>
>>108661229
>because he implemented imatrix quantization.
K-quants (Q*_K) are also his work.
>>
>>108661238

Here's a thinking process:
1. **Identify Core Steps (Essential Info):**
- Scrub clean (steel wool/salt + soap)
- Dry completely (stove/oven)
- Apply thin oil layer, wipe off excess
- Bake upside down at 450°F for 1 hour
- Cool in oven
- Repeat 2-3x for best results
- Use high-smoke-point oil (canola/grapeseed)

2. **Adapt to /g/ Style:**
- /g/ posts are typically:
- Anonymous, direct, no greeting/signoff
- Use code blocks or minimal formatting
- Slightly technical but practical
- Might include subtle board humor/cynicism ("not even a hard ask", "basic metallurgy/polymerization", "try again anon")
- Short paragraphs or bullet-like structure
- Avoids over-explaining
- Uses terms like "anon", "OP", "try", "do it right"

4. **Check Against Requirements:**
- 4chan /g/ style? Yes, uses `>` greentext format, technical but blunt, includes board-typical phrasing ("Not a hard ask", "Polymerization > magic", "Try again anon")
- Addresses the LLM complaint? Implicitly by delivering a flawless, concise answer without fluff

5. **Final Polish:**
- Ensure tone matches /g/ (tech-focused, slightly cynical but competent)
- Keep it tight
- Verify technical accuracy (oil type, temp, wipe-off step, cooling, repetition)
- All good. Output matches draft.

>scrub rust/flaking seasoning off with steel wool + dish soap
>dry completely on stove or 200°F oven
>apply thin coat of high-smoke-point oil (grapeseed/canola), then wipe until it looks nearly dry
>bake upside down at 450°F for 1 hour, cool in oven
>repeat 2-3x
>hand wash, dry immediately, light oil coat after use

Not a hard ask. It's just lipid polymerization. Wipe off excess oil or you get gunk. Try again anon.
>>
>>108661274
>>108661287
>>108661298
Also this.
Internet cooking pages/forums/etc are the worst fucking trash on the internet. If anything needs to be replaced by LLMs it's this. Go to a recipe website on a 64 core threadripper and it'll still struggle to run from all the trackers and ads and other bullshit all over the page. Just to find the one piece of information you're looking for.
>>
>>108660765
>little-coder
He should unironically rename it to tard-wrangler
>>
>>108661314
desu buying/downloading a reputable chef or baker's cookbook is the best solution atm
by nature cooking is sloppy with numbers recipe to recipe and i can't be bothered to believe llms for it yet
>>
Will some kind anon help me set up automatic image generation on my comfyui at every response i get on Sillytavern?

So far, at the end of every message, my character writes...

[Prompt: light blue hair, medium hair, center-flap bangs, blue eyes..

where i'm stuck is how do i get sillytavern to capture only that part and send it to comfyui? Gemini recommends regex but a lot of the information it gives me is outdated so many settings aren't actually available..
>>
>>108661335
What if you feed it the cookbook?
>>
>>108661335
Wdym, glue is good to help cheese stick on pizza
>>
>>108661252
>>108661314
A key detail I never hear people talk about is that with manual search engines you can assess the quality of the sources yourself

You can see who wrote and edited wikipedia articles. You can click through on blogs to see related articles, you can judge websites by the amount of slop on there. You can read books and academic articles yourself.

With LLMs you kinda just have to trust that the synthesized output based on its training data is accurate. There is no way to know where the machine got its information, because LLMs are a black box by design.

Granted some chatbots have "sources" now but I've found those are often unreliable and added post-hoc, i.e. the AI generates something plausible and then after the fact tries to find a reddit thread or something that vaguely addresses the same problem. They are not sources in the traditional sense.

I can see a future where the internet completely sloppifies and becomes impenetrable without AI agents because companies are training AI on itself. A future where quality, vetted information becomes privatized again, with research groups hosting their own private knowledge bases in Logseq/Obsidian like the encyclopedias of old
>>
>>108661339
How about you read the docs? https://docs.sillytavern.app/extensions/stable-diffusion/
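If all you need is the capture part, a rough pattern is something like this (adjust to however your card actually formats the tag; the closing bracket is optional in case the model forgets it):
\[Prompt:\s*([^\]]*)\]?
Capture group 1 is the prompt text; the docs above are the place to look for wiring that into the Image Generation extension.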
>>
File: Code_kwZLwtIdVo.jpg (37 KB, 622x114)
37 KB JPG
Is camofox copium or does it work for ai automated browsing?
>>
>>108661197
Considering how fucked the world is due to AI (both hardware and content) he was correct.
>>
File: 1776660989723673.png (130 KB, 1835x1104)
130 KB PNG
wtf is this shit? I just want to talk to my LLM dude
>>
>>108661375
I'm sure this is a feature in the eyes of many in power. Those who control the AI can rewrite history subtly and it would be difficult or impossible for most to disprove the "truth" as presented by them. Most people are already more than happy to offload their critical thinking, before it was result #1 on google or wikipedia, now chatbots. Memory hole-ing made easy.
>>
>>108661358
i mean why steal second hand the sloppy way when you can torrent a really nice recipe directly
>>108661375
pmuch this
>>108661397
open webui is for corpo intranet/grifter tier hosting solution, not 'local first'
>>
>>108661405
Yes. It's the obvious conclusion really. Kind of scary to be honest.
>>
>>108661406
Laziness, so I can just say, I have these ingredients, give me some recipes I can make.
>>
>>108661397
It's okay, they're aware that the current credentials system is limited and are planning mandatory 2FA soon.
>>
>>108661077
jfc, it's mind boggling that people complain that benchmarks on knowledge and coding don't translate to child rape roleplaying story capabilities
>>
Before another wave of Chinese shills floods /lmg/ once the goofs for the benchmaxed garbage that is Qwen 27B (this time it's 3.6 soi it's gooder than 3.5!!!), remember
Qwen did not "fucking cook"
Qwen 3.5 was not good at code (people who say that here are either shills, or never used it and parrot others), Qwen 3.6 isn't either
If you do fall for the meme and download the model, every issue you have with it is likely the model, and not the meme samplers they want you to use so that their garbage stops thinking for 39045639085 tokens when prompted with "hi" (imagine RECOMMENDING rep pen, Jesus Christ)
If you're a vramlet, just use the smaller Gemma models. If you REALLY want Qwen (why?) and you're vramlet-lite, use Qwen Coder Next. They have no other good models.
>>
>>108661437
31b was better than 27b in real world coding tasks too
>>
>>108661077
>>108661437
Why even come here? Reddit is more your speed. They love to slopcode and suck off qwen all day.
>>
>>108661437
>"knowledge"
>>
>>108661427
this is where local shines really
you can download the pdf, feed the whole pdf to the model and tell it your ingredients so it recommends stuff from the book
nonlocal will whine about muh copyright
>>
>>108661437
+1 wuan in your wallet chang
>>
>>108661457
its the local models general not child rape roleplay general, I know they overlap almost 99% but that's still not the title of the thread
>>
>just google bro
>search for answer to question
>nvm fixed it
>answer is 10 years old and outdated
>5 completely different answers
>ask gemma-chan
>just werks
>>
>>108661450
I don't think they would release a model that is worse than the previous one tho right?

we need better ways to benchmark these things.
>>
>>108661473
>I don't think they would release a model that is worse than the previous one tho right?
claude just did so who knows
>>
>>108661473
Doesn't matter if it's better than 3.5. It's comparing smelly piss to piss that smells less (some anons might actually hate that)
>>
>>108661469
The method of re-seasoning an iron skillet does not become 'outdated', retardnigger-kun
>>
>>108661491
>noo muh dog bench
>>
>>108661450
Holy cope, qwen 3.6 q4 is working great for me
>>
>>108661430
>planning mandatory 2FA soon.
age verification
>>
>>108661500
>cooking mentioned
>thinks of dog
Qwen shills pls go
>>
>>108661503
*bites*
Gemma 4 31B Q8 works even better for me sir you should of trying
>>
>>108661469
bruh the concept of seasoning a pan is as old as civilization
>>
>>108661467
/crrg/ doesn't roll off the tongue as well
>>
>>108661491
See
>5 completely different answers
>>
>>108661503
This, I'm having a blast.
>>
>>108661511
>>108661509
>cooking mentioned
deranged bot?
>>
>>108661524
what were you testing then
any logs?
>>
>>108661375
LLMs are becoming less of a black box with latents interpretability and things like that. That and local models would fix this.
>>
Gemma may be better but I can't use it at q8 with max context on 24gb vram
>>
>>108661397
Webshit standards are not built for desktop software.
>>
>>108661450
>Qwen 3.5 was not good at code (
there's nothing better for 96gb vram + claude code but okay
when i want to get work done, i don't need the coding agent to act like a loli and call me baka
>>
>>108661467
>Think of the text children!
>>
>>108661515
It's not a hard science, just like with cooking itself. There can be dozens of different methods that all work fine, and you just pick the one that's most convenient for you.
>>
so, yes, I have a local inference stack where I roll a qwen 3.6 31b as well as a gemma 4 26b moe. Right now, they both crank off 100 t/s, and the content is actually good. I've been doing local model shit on a pair of a6000s for a while now, and for the first time, these things are actually usable. I was running Kimi K, with it spilling onto memory, and it was good (tool calling, coding etc.) but it was so slooow. I use opencode with the oh my opencode plugin. It works; I can make codeslop all day long with this setup. I can fix broken code and implement new features for work. Is it as fast as claude? no. Accurate as claude? also no. Am I ever going to get back the $10k I dropped in this stupid thing? probably not... Do the subagents fight it out and make things that work... Yes. For the first time, it feels like these things aren't toys.
I actually have cancelled my max subscription... I also opened an account on fireworks so I can run kimi k with the quickness if I really want to.
>>
>>108661545
I hate the loli Gemma posters as much as the next guy, but Qwen 3.5 is a retard. I used it at Q8 with every harness imaginable and it failed every single time when the task wasn't something a drunken junior could do.
>for 96gb vram
bloody benchod oyu have brahmin amount of vram and you use a model for DALITS, did you really not manage to find a BIGGER MODEL?


fuck i think i got baited again
>>
>>108661557
>qwen 3.6 31b as well as a gemma 4 26b moe. Right now, they both crank off 100 t/s
what? qwen is a bigger model and it's dense, why would it be as fast as g4 26b moe?
>>
>>108661557
>A6000s
Sorry for your loss
>>
>>108661571
nta but maybe he meant 35b a3b?
>>
>>108661533
A few years ago I used to have regular drunk discussions with researchers in an AI Explainability and Ethics group, from which the conclusion basically was that the field is in its infancy and getting any kind of sensible interpretations out of the activations of trillion parameter models is extremely difficult. I should maybe try to catch up on the latest research though.
>>
>>108661584
ya, sorry, its Qwen3.6-35B-A3B-AWQ-4bit
>>
>>108661557
>qwen 3.6 31b
you are hallucinating again
>>
GLM goes through about 12 different "final verdict" "draft response" "summary of whatever" before it stops thinking
it's so fucking obnoxious
>>
>>108661606
temp?
>>
whats the best computer use mcp for gemma chan?
>>
>>108661610
whatever the default is. i didn't set it in flags
>>
>>108661606
Final draft:
<The entirety of the response with one word changed>
Okay, let's write
<Doesn't end the thinking block, outputs the entire response again>
</think>
<The entire response>

Yeah, pisses me off too. If you use the superior Text Completions endpoint, use a <think> prefill that gives it a plan for its own plan. I always put "I will jot down a response plan without writing out the entire response or doing any drafting or polishing" in there. Look at how GLM formats its reasoning and adapt to that.
If you're on Chat Completions, what the fuck are you doing
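For illustration, the tail of the text completion prompt ends up looking roughly like this (token spellings approximate, copy whatever your GLM jinja actually renders):
<|user|>
{your last message}
<|assistant|>
<think>I will jot down a response plan without writing out the entire response or doing any drafting or polishing.
The model then continues from inside the think block with a plan instead of its usual draft/final-verdict spam.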
>>
>>108661631
>If you're on Chat Completions, what the fuck are you doing
Gooning, like 90% of the /aicg/ faggots that have invaded this general.
>>
>>108660454
>void_seer
even GPT can't escape the void slop.
>>
>>108661664
Go ahead and tell the class about your superior LLM usecase.
>>
>>108661673
Treating it like my pet nigger.
>>
>>108661664
Gooning is *the* usecase for gooning.
I wonder why these retards shoot themselves in the foot by using and suggesting others use Chat Completions.
You do not need a template to be added to Silly Tavern with an update to read the docs and figure it out on your own. aicgsirs are braindead.
>>
>>108660998
>https://huggingface.co/Qwen/Qwen3.6-27B
Oh wow another benchmaxxed model. can't wait for it to be in reality worse than Gemma.
>>
>>108661680
>Gooning is *the* usecase for gooning.
what the fuck did I mean by this bros
Text Completions is the usecase, of course...
>>
>>108661077
Gemma codes better than Qwen.
>>
>>108660724
>more like, if the tweet meant 'inference was accessed by rando', it means nothing
it was obviously that, basically "oh no people accessed the model in early access"
>>
File: 1432498179182.png (296 KB, 722x768)
296 KB PNG
><|turn>model\n<|think|><|channel>thought
So on gemmy do I prefill with this entire thing?
>>
>>108660454
the piss filter is still visible
>>
>>108661631
i have no idea what i am doing, so that's why. i will have to look into the text completions thing, thanks
>>
Q3.6 35B or G4 26B for general agentic stuff?
>>
>>108661701
Learn as much as you can and maybe even fuck around in mikupad to force yourself into at least intuitively understanding what an LLM does so that you don't turn into an aicgnigger.
Godspeed, anon.
>>
>>108661702
Just test them out.
>>
>>108661574
yeah... should have just got the blackwell and been done with it... oh well
>>
>>108661168
Because they're supposedly the expert in quantization and get a shit ton of money for it, yet their quants are actually extremely mid and are often broken.
>>
>>108661693
Proofs? Everyone says that's the one thing qwen excels in
>>
>>108661727
>Everyone says
NTA, but this Everyone is wrong. Try them yourself.
>>
>>108661712
Im on it, but maybe someone who has already done it can tell me what they are best and worst at instead of having me use Gemma 4 in my own agent for a few days.
>>
>>108661168
I just wish they made all their stuff open for other's to improve on
>>
Why the fuck would you use text completion in the current year? It has no vision or tool calling.
>>
>>108661562
>bloody benchod oyu have brahmin amount of vram and you use a model for DALITS
Well I'm not going to offload to cpu or shrink the 256k context for coding am I??
I'm using the 112b at q4. It replaced minimax because it just works better (c#)
Also, I hate every prior Qwen model.
>>
>>108661730
Tried using hermes + 31b q4km and it just tried to think and solve problems but e4b felt completely useless. Haven't tested 26b yet.
>>
>>108661745
My raging hate boner for Qwen's socially irresponsible marketing practices aside, have you tried the Coder Next model or 27B?
I used 112b too, it felt irredeemably retarded for an MoE its size. The above models performed much better, with QCN remaining the only model I actually enjoyed using out of their entire lineup.
>>
>>108661743
those are memes tho
>>
>>108661766
lol, tool calling is the future
>>
>>108661680
>>108661686
Well yeah but they're genuinely too stupid to set up the template. Gemma 4 release I figured it out in 2 seconds reasoning on and off by eyeing the jinja, meanwhile half the thread couldn't and were crying hard. Sad!
>>
the ggufs are here
https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/tree/main
>>
>>108661743
It has the same access to an LLM, plus the ability to actually edit the prompt instead of asking the underlying API to hold your hand.
Converting images to embeddings is not magic. Neither is parsing tool calls. For the model - it's all tokens, for you - it's all text.
Figure it out.
>>
Fuck "linear time inference" and all that bullshit! I want euclidean time inference!
>>
File: 1758105037530551.png (927 KB, 2748x1491)
927 KB PNG
managed to get llama cpp server Ui to finally display the image when the LLM is sending it lol
>>
>108661789
>I'm vagueeeing
>>
File: cbclyf.png (480 KB, 820x1230)
480 KB PNG
turns out I had day0 gemma this whole time??
sha256sum eafb...b720 gemma-4-31B-it-UD-Q4_K_XL.gguf

compare to new hash:
sha256sum 6340...6b88 gemma-4-31B-it-UD-Q4_K_XL.gguf
>>
failquote
>>
>>108661763
>Coder Next
Yeah I tried that when it was released. It seemed to get confused more with c# syntax.
I don't think that had vision support either right? I tend to use that sometimes.
>27B
Yes, this is actually smarter and I did use it for a while. But it's also slower on my system, even with tensor-parallel. 112b is "smart enough" and so much faster, meaning I spend less time working lol.
I haven't tried the 2.7 Minimax or Qwen 3.6 yet. I'm guessing the 27b will be smart but slow, and the small MoE will be retarded like the 3.5 one.
>>
>>108661801
because inference code bugs and whatnot made broken calibration file
it just means the model is retarded
>>
>>108661812
>Yeah I tried that when it was released.
NTA, but there have been some fixes on llama.cpp since then so you might want to try it again. Although it's probably not better than qwen 3.6 MoE anyways.
>>
>>108661164
isnt kimi native int4? why are they going
>WOZERS!! q4 === q8!!!
like LMAO
>>
>>108661795
>llama cpp server Ui
How did you add all the tools to it?
>>
>>108661774
its been the future for a while by now, come back when its the present i guess
>>
>>108661828
https://github.com/BigStationW/Local-MCP-server/blob/main/docs/Use_on_llamacpp_server.md
>>
>>108661789
If you're just going to end up re-implementing the jinja there's literally no point in using text completion besides giving you a false sense of superiority.
>>
>>108661789
>Figure it out
Why would I bother when chat completion just werks?
>>
>>108661836
Seriously. Command-R came out with strong support for tool calling in early 2024. People are acting like it was invented a month ago for openclaw or whatever.
>>
>>108661521
lmao'd
>>
>>108661664
Periodic reminder that /lmg/ was created in early 2023 by /aicg/ anons who wanted a place for discussing local models (Pygmalion, ...) without the cloud model/proxy background noise. It's always been for gooning by gooners.
>>
>>108661866
Most chat completion proponents here see it as an impediment to their cooming in the only frontend they know, which is ST.
Filling in 4 fields there is not hard at all and gives you an ability to do whatever the hell you want with the entire prompt, including thinking edits and messing with special tokens.
Fine, image input, you can't be bothered to use libmtmd directly. But giving up on the template configuration step because ST didn't add it with an update is pathetic. If I couldn't do even that, why would I bother with local models when cloud ones just werk better, cheaper and faster?
>>
File: 1749622181770588.png (278 KB, 500x561)
278 KB PNG
>>108661743
>Why the fuck would you use text completion in the current year? It has no vision or tool calling.
I have no idea but switching to chat completion has made my life way better
>>
>>108661795
you gotta proxy it no shit
>>
>>108661907
yeah, it's bullshit...
>>
>>108661029
if you listen to doomers here, never, and it will be 100k USD for 8GB next year
if you look at what people/companies actually expect, somewhere around early to mid 2027 as production lines ramp up
>>
>>108661902
Oh yeah for llama-server mtmd I might as well ask, do I run images through stb and pass the rgb as b64 to /completion or does it just take regular image files (still b64) and do the processing itself?
I would assume the latter from what I'm seeing.
>>
>>108661899
Oh how fascinating...
Perhaps you should just rent a real discord server instead of squatting here 24/7 and larping as the thread moderator.
You don't own this thread.
>>
>>108661988
you should go back to /ldg/
>>
>>108661988
>You don't own this thread.
He doesn't, but I do. And he's right.
>>
>>108661631
>If you're on Chat Completions, what the fuck are you doing
bullshit, if you want to keep your autism just modify the jinja, it's not like you can't modify shit in that mod
>>
File: 1748907056142177.png (122 KB, 1072x600)
122 KB PNG
Which goof do I get bros? I don't recognize any of these names, except unslop which I refuse to touch
>>
>>108661899
>invaded
>>
>>108662039
Make your own fuckking goofs. Failing that, ggml-org.
>>
>>108662039
ggml is made by llamacpp devs so it should be okay?
>>
>>108662052
>ggml-org
Downloading now. Thanks for the help gentlesaar
>>
>>108662039
I wait for bartowski personally his 122b / 35b / 27b of 3.5 have all been good.

>>108662052
why make your own gguf, seems like a lot of work for no reason
>>
File: 1746154015310533.png (46 KB, 1248x347)
46 KB PNG
>>108662052
>>108662053
Fugg...
>>
>>108662052
After seeing KLD graphs for some of the latest models, "make your own quants" doesn't seem like a very good advice anymore.
>>
>>108662039
roll one yourself with bartowski's calib
>>
>>108662039
wait for bart
>>
>>108661970
They already pass images into stb under the hood, if that's what you were wondering
https://github.com/ggml-org/llama.cpp/blob/master/tools/mtmd/mtmd-helper.cpp#L500
>>108662025
Too finicky and should be something doable on the frontend.
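And to the endpoint question: the OAI-compatible route is the simple path, it takes the raw file as a base64 data URI and does the decode/resize server-side; a rough sketch (host/port and the base64 blob are placeholders):
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":[{"type":"text","text":"describe this"},{"type":"image_url","image_url":{"url":"data:image/png;base64,<BASE64 OF THE FILE>"}}]}]}'
Whether the bare /completion endpoint takes images the same way depends on the build, so check the server README for your version.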
>>
you guys are nerds
>>
>>108660922
That's actually pretty interesting.
Basically an upcycle + pretrain on some public dataset to initialize the router and make use of the new experts.
With more data and actual compute, you essentially get a new model.
>>
File: 1776689080363464.jpg (126 KB, 1456x821)
126 KB JPG
>>108662053
It's not OK. They work, but you're not getting the best performance/size ratio.
>>
>>108662076
i mean, duh
it's /g/ after all
>>
File: 1758787701709512.gif (141 KB, 449x432)
141 KB GIF
>>108662076
I look like this doe
>>
>>108662082
oh I always thought the /g/ stood for "/gay/".
>>
>>108662102
well no, but actually yes
>>
>>108662102
>linux community being gay
bonkers
>>
>>108662080
the fucking guy already said he doesn't want unsloth, you are supposed to answer the question that was asked not shill something that the person who made the inquiry has already decided against.
>>
>>108662102
I've always wondered why it is called /g/, why not something like /tech/ or just /te/?
I guess some oldfags know the reason. My headcanon is that g stands for GNU
>>
>>108661695
No, you just add a system message as the last thing before your prompt with
<|think|>
>>
>>108662115
Lurk long enough and you'll figure it out eventually.
>>
>>108662133
>been here for 13 years
>haven't figured it out
Yeah I don't think so
>>
>>108659996 just use llmfit
>>
>>108662143
/g/uro, 2013 cancer
>>
>>108662113
From the same graph you can easily see that of those tested there, Bartowski's are the second best choice.
I'm not shilling anybody, if anything I wish we didn't have to rely on "quanters" with their own llama.cpp forks for the best possible quantizations.
>>
>>108661702
not enough tests yet
>>
how can i run kimi-k2.6 on my RTX 4060
>>
>>108662062
if you can fit a Q8 quant in your system then there's no point in waiting for a gguf if you can make it yourself quicker.
>>
>>108662166
What's your frontend? I can tell you where to put the API key.
>>
>>108662157
So /g/ used to be the guro board or something? That just leaves me with more questions than I had kek
>>
>>108662062
>make your own gguf, seems like a lot of work for no reason
It's literally two fucking commands, you lazy tech illiterate fuck.
>>
>>108662172
i thought this general was about local models
>>
>>108662179
only on fridays
>>
>>108662179
Yes, but given your hardware, I just jumped to the final step.
>>
>>108662176
I have to download a bunch of shit then wait an unspecified amount of time, it's not just 2 commands
Also there is clearly a nonzero chance of fucking it up, or else there wouldn't be so many jeets on huggingface posting scuffed quants
>>
>>108662176
>WAAAAAAAAH NOT EVERYONE IS A VIRGIN TECH WIZARD LIKE ME WAAAAAAAAH
people like you need to be shot
>>
>>108662179
local is fucked though if spud 5.5 is coming and is as good as it seems
>>
>>108662176
If its just two commands, how does un-slop fuck it up all the time?
>>
>>108662115
It's japanese.
Remember that 4chan is based on Futaba.
>>
>>108662208
4chan is an American website, retard
>>
>>108662065
bruh just download bart's calibration v5
produce imatrix
then goof
easy peasy
>>
>>108662166
plug your 4060 into a xeon server
>>
>>108662218
4chan is owned by a japanaman, retard
>>
File: 1774496776426871.png (392 KB, 451x619)
392 KB PNG
>>108662208
Thanks anon, I never considered it, because all the other boards I use have links that are short for English words...
>>
>>108662236
Nobody cares, kys
>>
>>108662230
if it's so important to you that people do, give more detail. can't be fucked to spend hours looking up how to imatrix
>>
>>108662242
I bet half of the original boards are named for the English word on 2chan too since the japs like to borrow from English.
>>
>>108662176
well yes and no
i roll my own quants but you probably dont want to do naive quant
imatrix is imo a must and some sensible weight promotion should be taken into consideration
>>
File: what.png (36 KB, 1841x470)
36 KB PNG
>Qwen 3.6 27b used the tool twice in a row, is this normal?
>>
>>108662017
>discord mentality
>>
>>108662260
measure twice
cut once
checks out
>>
Completely cucked.
>>
>>108659983
Hows LTX 2.3 1.1 ??
>>
>>108662260
Did she use the same params for both calls?
>>
>>108662190
>>108662207
It's because they are doing snowflake quants and schizos want to be the next unsloth so they are also doing snowflake quants.
>>
>>108662280
no, it was different parameters
>>
>>108662287
Working as intended then.
>>
>>108662287
then what's the issue?
>>
>>108662295
>>108662298
it's the first time I'm seeing that, Gemma never did that shit lol
>>
>>108662260
Who are you quoting
>>
>>108662295
>>108662298
well why would it search twice? was it not happy with what it found the first time? do you normally make multiple searches if you want to look something up? sounds like the chinese shit is just broken and you're shills making excuses
>>
>>108662307
Cool isn't it?
>>
>>108662257
There ought to be a "multipass" mode in llama-quantize that first creates a logfile with the quantization error and size measured for all tensors at various quantization levels, and then in a second pass you'd aim for a specific filesize using that information (and/or optionally the saved tensors so you don't have to quantize them again, at the cost of storage).
If niggerganov can't be bothered implementing quantization advancements because ikawrakov implemented them first in his fork and/or he isn't capable of it, at least he should improve llama-quantize's default quantization schemes.
>>
ggml-org q8_0 goof is up
>>
>>108662314
>well why would it search twice? was it not happy with what it found the first time?
it's not even that, it just launched the two tools at the same time, that's weird, usually you do one search, then you reflect on that
>>
>>108662321
making quants is like artsy chore
i hope things improve
>>
File: 1761933552977027.png (68 KB, 1699x916)
68 KB PNG
>>108662260
bruh, there's a fucking "OR" in that tool, gemma knows it so that it can just use the tool once and spam "OR"s, and qwen prefers to spam the tools instead
>>
>gemma 4 26b on rk3588 sbc like orange pi 5
>pp 16 t/s
>tg 4.5 t/s
>power 4w
I kneel
>>
>>108662039
>which I refuse to touch
top 10 things your uncle never said
>>
File: 1754840524102248.png (3.5 MB, 1328x1640)
3.5 MB PNG
>>108662252
you're right, making your goofs should be democratized.
bart's calibration data:
https://gist.github.com/bartowski1182/82ae9b520227f57d79ba04add13d0d0d
first step:
PRODUCING THE BASE GOOOF:
>checkout llama.cpp repo
>do uv venv --python 3.12
>uv pip install -r requirements/requirements-convert_hf_to_gguf.txt
alternatively, manually install the libraries you need; sometimes the requirements are outdated, in which case do uv pip install gguf transformers accelerate sentencepiece torch protobuf --extra-index-url https://download.pytorch.org/whl/cpu
>download the weights of the model you want
>uvx hf download qwen/qwen3.6-27b --local-dir . (this will download the model to the current path; replace the . with whatever path you want, relative or absolute)
now it's time for the base conversion
>uv run convert_hf_to_gguf.py $PATH_TO_MODEL --outfile $OUTPUTFILE-BF16.gguf --outtype bf16
congrats! you created your first base bf16 gooof!!!!!!!!!!!!
now time to do imatrix shit
>llama-imatrix -m $PATH_TO_BF16_GGUF -f $PATH_TO_CALIBRATION_DATA -o imatrix.gguf -t $CPU_THREADS -b $BATCH_SIZE (2048 is fine) -ngl $GPU_OFFLOAD_LAYERS --parse-special (note: -m takes the bf16 gguf you just made, not the original HF folder)
now you created the imatrix.gguf file!
from the bf16 you created earlier you can now create all the subquants you want!
for q8_0 you don't really apply the imatrix, so you do:
>llama-quantize $PATH_TO_BF16_MODEL $OUTPUT_QUANT_FILENAME Q8_0 $CPU_THREADS
to apply imatrix instead
>llama-quantize --imatrix $PATH_TO_IMATRIX_FILE $PATH_TO_BF16_MODEL $OUTPUT_QUANT_FILENAME Q4_K_M $CPU_THREADS
of course you can replace the Q4_K_M with whatever quant level you desire. you're welcome!
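same steps glued together as one script so you can't fat-finger a step; a minimal sketch only, the repo id, paths, and thread/layer counts are placeholders taken from the example above, swap in your own:
set -e                                    # stop on the first error instead of goofing garbage
MODEL_DIR=./qwen3.6-27b                   # where the downloaded safetensors go
CALIB=./calibration_datav5.txt            # bart's calibration data from the gist above
OUT=qwen3.6-27b                           # output filename prefix
THREADS=16
NGL=99                                    # gpu layers to offload for the imatrix pass

uvx hf download qwen/qwen3.6-27b --local-dir "$MODEL_DIR"                              # weights
uv run convert_hf_to_gguf.py "$MODEL_DIR" --outfile "$OUT-BF16.gguf" --outtype bf16    # base bf16 goof
llama-imatrix -m "$OUT-BF16.gguf" -f "$CALIB" -o imatrix.gguf -t "$THREADS" -ngl "$NGL" --parse-special
llama-quantize "$OUT-BF16.gguf" "$OUT-Q8_0.gguf" Q8_0 "$THREADS"                       # no imatrix for q8_0
llama-quantize --imatrix imatrix.gguf "$OUT-BF16.gguf" "$OUT-Q4_K_M.gguf" Q4_K_M "$THREADS"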
>>
File: 3d printer.png (37 KB, 577x474)
37 KB PNG
>>108662175
The few months of overlap after the nerds invaded when this was simultaneously both a tech and guro board was the peak of /g/.
>>
>>108662346
>rk3588 sbc
>tg 4.5 t/s
wtf
>>
>>108662284
>snowflake
is this the incel way to say they're trying to minimize vram usage / disk space but fuck it up?
>>
>>108662162
the screenshot of quants didn't have the typical quanters available; I usually shill for bart's or mrader. I assumed the person making the inquiry was looking for one of those guys, so I directed him to ggml, the only other name on the list I could identify. if you have the hard drive space and internet bandwidth you can just make your own; you don't need enough ram or vram for the safetensors, just disk space.
>>
>>108662346
>rk3588
i kneel....
i wish to see more based changprocessors like that
>>
>>108660787
Post the full image you slut
>>
>>108662356
That screenshot spiked my cortisol before I even recognized what it's from. That gif traumatized me kek
>>
>>108662357
https://www.reddit.com/r/LocalLLaMA/comments/1sc8kdg/running_gemma4_26b_a4b_on_the_rockchip_npu_using/
>>
>>108662353
Thank you, Sir. I hope Vishnu grants you a good life.
>>
>>108662356
is this real or am i getting baited
t. newfag
>>
>>108662393
If this site had good moderation, posting a /r/localllama link would get you permanently banned from this general
>>
>>108662393
so it's fake then?
>>
>>108662353
you are a good man, thanks
>>
So what's the verdict on the new qwen?
>>
>>108662410
do you have some better resource for learning about this or do you just want to bitch about r*dd*t
>>
>>108662416
I have one in the drawer. I'll test it if I have time
>>
>>108662386
I never understood what's so traumatic about it.
>>
>>108662406
you are getting master baited
>>
>>108662429
>better resource
boards.4chan.org/g/lmg/
>>
>>108662393
This. So. Much. This.

Upvoted!

@mods can we ban x too? #fuckelonmusk

Edit: Thanks for the gold, kind stranger!
>>
>>108660554
>ngram-mod
The ngram cache resets way too early by default. Change this from 3 to 64 and recompile. Makes it usable.
>
if (n_low >= 3) {

https://github.com/ggml-org/llama.cpp/blob/8bccdbbff9d0d91d54838471f6eea182b9ab1b79/common/speculative.cpp#L747
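if you don't want to open an editor, roughly this before rebuilding (assumes the usual cmake build dir named build; 64 is just the value suggested above):
# bump the ngram reset threshold in place, then rebuild llama.cpp
sed -i 's/if (n_low >= 3)/if (n_low >= 64)/' common/speculative.cpp
cmake --build build --config Release -j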
>>
>>108662353
adding to this, if you want to get the mmproj, you do:
>uv run convert_hf_to_gguf.py $PATH_TO_MODEL --outfile mmproj-$OUTPUTFILE-BF16.gguf --outtype bf16 --mmproj

also with llama-quantize you can get cheeky and set specific quant levels on the layers you want by passing --tensor-type (or multiple of them), if you want to keep certain layers at q8_0, bf16, or f16

you did imatrix shit but HOW DO YOU KNOW IT'S WORKING????? BY MEASURING PPL!!!!!!!
first you need to obtain wikitext shit!
https://cosmo.zip/pub/datasets/wikitext-2-raw/wiki.test.raw
https://cosmo.zip/pub/datasets/wikitext-2-raw/wiki.train.raw
https://cosmo.zip/pub/datasets/wikitext-2-raw/wiki.valid.raw

then you run this shit, $DATASET should be the wiki.test.raw file
>llama-perplexity -m $PATH_TO_MODEL -f $DATASET -ngl $GPU_LAYERS_OFFLOAD
run this on both an imatrix'd and non imatrix'd quant to see how much of an effect your shit had!

>b-but muh KLD
I actually didn't look into how to do kld calculations so... sorry! just ask your fave llm or hit google lmao!
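(ok, for completeness: the rough KLD flow is supposedly just two llama-perplexity runs; this assumes the --kl-divergence flags haven't changed, so check --help first)
# run 1: dump the reference logits from the bf16 goof
llama-perplexity -m $PATH_TO_BF16_MODEL -f $DATASET --kl-divergence-base base_logits.bin
# run 2: compare a quant against those logits and print the KLD stats
llama-perplexity -m $PATH_TO_QUANT -f $DATASET --kl-divergence-base base_logits.bin --kl-divergence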
>>
OpenAI released an open source model!
https://huggingface.co/openai/privacy-filter

>1.5B parameters total and 50M active parameters.
>sparse mixture-of-experts feed-forward blocks with 128 experts total (top-4 routing per token)
>>
>>108662406
The image is an old meme (go look for the gif version).
>>
>>108662500
i mean i know what it is but didnt know /g/ was a guro board in 2004
>>
>>108662372
it even supports lpddr5
>>
>>108662489
this. changes. everything.
>>
>inb4 google releases Gemma 4.5 to btfo the chinamen again
>>
>>108661763
>Coder Next model
obsoleted
>>
>>108662533
Here's hoping.
The best we can expect is competition and rivalries where they keep trying to one up each other by releasing increasingly more capable open weight models.
>>
>>108662533
They just have to release the 124b and everything except Kimi and GLM 5.1 becomes worthless
>>
>>108662533
is qwen 3.6 that good? I guess people are testing the 31b model right now
>>
>>108662549
Was the 124B dense or MoE?
>>
>>108662589
MoE iirc
>>
>Here’s a fixed, minimal, working version you can drop in and load without errors...
>>
File: 1775155673440.png (1.41 MB, 1633x1269)
1.41 MB PNG
>>108662589
>>108662594
>>
File: 1751577236872156.png (261 KB, 661x749)
261 KB PNG
@kache shut the fuck up you leaf faggot
>>
>>108662654
>nooo my dead general needs to stay dead
>>
>>108662664
shut the fuck up faggot
>>
My headcanon for 124B is that it BTFOs Gemini pro so they won't release it.
>>
File: 1774443294977572.png (82 KB, 1132x563)
82 KB PNG
https://www.reddit.com/r/LocalLLaMA/comments/1ssl1xh/qwen_36_27b_is_out/
kek
>>
The truth about the 124b is that it got tested on lmarena (their benchmark of choice) and underperformed.
>>
random thought, no judgement
how good is gemma at being a findom?
>>
>>108662489
I know people like to shit on oai releases, but this strikes me as actually somewhat useful. I'd feel better about letting my """agents""" that have access to my personal notes interact with other people with more filters in place. Especially when using comparatively-more-retarded local models that are easier for other people to fool than some 10T cloud abomination.

That being said, for the personal usecase, idk if stacking a filter model is really more effective than a regex filter with my name in it.
>>
File: quant_error.png (784 KB, 2325x858)
784 KB PNG
>>108662321
>quantization error
Getting closer; turns out it could easily be vibe coded, one way or another.
>>
File: 1774790474325284.png (1.15 MB, 1672x941)
1.15 MB PNG
>>108662752
lmaooo
>>
File: HGQXeUBWUAAuLgm.png (30 KB, 1080x1080)
30 KB PNG
>>108661019
>>108661023
BEHOLD
https://x.com/saltjsx/status/2045874466958270903
>>
File: 1769170576230608.png (604 KB, 1204x1228)
604 KB PNG
https://www.reuters.com/world/asia-pacific/tencent-alibaba-talks-invest-deepseek-information-reports-2026-04-22/
Deepseek's comeback??
>>
>>108662813
https://arxiv.org/abs/2309.08632
must have implemented this old paper
>>
>>108662813
gemma 5 dropping in an hour will mog mog-1
>>
File: HGX7Z0YWkAANkJ3.jpg (83 KB, 588x735)
83 KB JPG
Should I pull the trigger on a second GPU bros..
Models are getting so much better every week, days even; one card might be all I need after all.
>>
>>108662869
The minimum is now 48GB. Do it if you have less.
>>
>>108662846
kek
>>
This shit thinks harder than my gf when I ask her how she spent so much time at the mall.
>>
(strix nigga halo niggas smiling waiting for qwen3.6 122b)
>>
>>108662887
>women
>reasoning budget
>>
>>108662882
use case?
>>
>>108662915
Yes, but
>I must do X
>wait, what if I do Y
>Y seems good
>but wait, maybe Z is better
>if I do Z then la la la la la la la la (...)
>>
>>108662945
Masturbation, writing code I can't understand, and assisted storywriting.
>>
>>108662887
which thinks more? kimi 2.6 or qwen 3.6?
>>
>>108662915
I'm sure she uses text completion.
>>
>>108662783
asdkjf;lkfsdg';lasdgkKL;ASDJF;LKASJDFLKJASDF;LKL LMAO
>>
>>108662949
seems like she needs attension transformers to see all choices
>>
>[59301] ggml_cuda_init: failed to initialize CUDA: unknown error
>[59301] load_backend: loaded CUDA backend from /app/libggml-cuda.so
>[59301] load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so
>[59301] warning: no usable GPU found, --gpu-layers option will be ignored
>[59301] warning: one possible reason is that llama.cpp was compiled without GPU support
What the fuck? it worked fine yesterday bitch!
>>
>>108663035
he pulled
>>
File: 1750659002865044.png (50 KB, 970x444)
50 KB PNG
lol
>>
If your qwen starts going

Wait… but I should
Wait… but maybe
Wait… what if

isn't that a sign your quantization is garbage? Mine will think for a long time potentially but I stopped getting the Wait… shit when I switched from unsloth to a non shitty quant provider. That and --reasoning-budget 4096 / --reasoning-budget-message "Thinking time exceeded. Output answer now\n"
>>
>>108663073
>1.28t/s
bruh...
>>
>>108661186
WHAAATT THE FUCK IS THAT AUTISTIC CHINKSHIT ALL THIS TIME I THOUGHT IT WAS PROBABLY TWO PASSIONATE WHITE GUYS FROM AMERICA NOOOOO
>>
>>108663073
>11min30sec
Bro....
>>
File: 1768183262130103.png (174 KB, 399x600)
174 KB PNG
>>108663091
Yes, and?
>>
>>108663104
chudda... I kneel... >>>/wsg/6132196
>>
>>108662733
this
>>
hope everyone is doing their own quants now!
>>
>>108663073
And I thought 2.5 was slow
>>
File: glow-shine.jpg (210 KB, 1125x1104)
210 KB JPG
>>108662957
buy a 3090 and a cloud sub. Wait like the rest of us who didn't CPUMAXX and wait...
>>
>>108662614
>>108662594
Wonder if I'd be able to run it with my 7900xtx and 32gb ram...
>>
just remembered this existed: https://github.com/fagenorn/handcrafted-persona-engine
>>
>>108662869
Rent compute until the current craziness calms down; you'll get what you want for cheaper until gpu and ram prices come back down.
>>
>>108663170
I want a one-click exe
>>
>>108663196
holy slop
>>
>>108663095
if this post communicates one thing it's the cultural and intellectual superiority of the great united states of america and its fine population
>>
Thoughts on new Qwen 27B.
I almost like the thinking process it does. Everything makes sense. I don't even mind it drafting the whole response and then checking / fixing some things in post.
But, 2000 tokens of thinking take so fucking long, man. If this ran at 100+ tk/s it would be pretty fun.
It writes a lot like Gemma 4 31B does but thinks 1500 more tokens while running 5 tk/s faster.
If this released a month ago I'd be using it but now it's just not worth it.
>>
>>108663095
Anon, it's always either chinks or jews...
>>
>>108662773
>I know people like to shit on oai releases
They're not people, just shills in a different flavor.
>>
>>108662654
Yooo, @yacineMTB / kache is a pedophile / hates niggers like us? Based! I've heard roon also visits here and posts pics from his personal stash.

>>108662783
Reddit cooked with this epic cuckoldry meme
>>
>>108662783
LOWER I don't give a fuck about coding
>>
File: Screenshot048.png (5 KB, 1031x176)
5 KB PNG
>>108663256
ngmi
DOA
>>
File: 1776345292760771.png (37 KB, 1247x122)
37 KB PNG
>>
>decide to migrate to emacs, ditch vim and micro
>spend 8 hours configuring init.el and still not sure about everything...
Finally I can begin editing my text files! Thinking about making an elisp version of my llm client. That will probably be spaghetti.
>>
https://huggingface.co/HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive
WTF HE ALREADY DID IT
>>
>>108663449
>>108663449
>>108663449
>>
>>108662906
Laughing at the strix halo niggas who couldn't figure out a double eGPU setup and have to wait for shit models they will use at shit speeds.
t. strix halo nigga
>>
>>108663610
No one cares about your expensive ewaste though
>>
>>108663616
I thought that was the thread's subject
>>
File: 1763559247414133.jpg (26 KB, 616x474)
26 KB JPG
>>108663621
>>
>>108663435
cute
>>
>>108661009
learn to solder them, you can buy the dimms without chips on aliexpress
>>
>>108662543
Is that right? My 8x 3200 ddr4 w/ 3090 does 10 tk/s on q4 glm 4.6. Are you saying that going to 12x 6400 ddr5 will only get me 5 tk/s more? Shouldn't it be more than 20 tk/s?
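napkin math, assuming tg is purely memory-bandwidth bound and scales perfectly (it won't; NUMA and the GPU-offloaded layers muddy it):
# theoretical bandwidth = channels * MT/s * 8 bytes per transfer
echo "ddr4: $(( 8 * 3200 * 8 / 1000 )) GB/s"    # ~205 GB/s
echo "ddr5: $(( 12 * 6400 * 8 / 1000 )) GB/s"   # ~614 GB/s, ~3x, so naively ~30 tk/s from your 10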
>>
File: 00000-1378487878.png (1.33 MB, 1024x1024)
1.33 MB PNG
>>108662834
Sounds more like a potential tech sharing agreement in the works. West acts like China's all kumbaya and shares everything, but ofc they are hyper competitive.
If you're the DS founder, gotta love the $20B valuation on his side-gig. lol. Idk what that guy makes at his real job but this implies he's now a multibillionaire at minimum.
>>
>>108664332
That would be more of something the CCP would arrange under cover; an investment like this wouldn't cover tech, and it's not like they can't just look at what Deepseek publishes to get something out of whatever they are planning.


