/g/ - Technology


/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108637552 & >>108633862

►News
>(04/16) Ternary Bonsai released: https://hf.co/collections/prism-ml/ternary-bonsai
>(04/16) Qwen3.6-35B-A3B released: https://hf.co/Qwen/Qwen3.6-35B-A3B
>(04/11) MiniMax-M2.7 released: https://minimax.io/news/minimax-m27-en
>(04/09) Backend-agnostic tensor parallelism merged: https://github.com/ggml-org/llama.cpp/pull/19378
>(04/09) dots.ocr support merged: https://github.com/ggml-org/llama.cpp/pull/17575

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: threadrecap.png (1.48 MB, 1536x1536)
1.48 MB PNG
►Recent Highlights from the Previous Thread: >>108637552

--llama.cpp PRs adding DFlash and speculative checkpointing for speed:
>108640571 >108640591 >108640606 >108640682 >108640733 >108640747 >108640744 >108640767
--Anon uses Gemma-4 to build a self-modifying MCP server:
>108637873 >108637890 >108637916 >108637970 >108637976 >108638105
--Anon showcases VN frontend using Gemma 4 and ComfyUI:
>108638473 >108638488 >108638514 >108638534 >108638554 >108638691 >108638775 >108638828 >108638607 >108638650 >108638652 >108639369 >108639312 >108640497
--Discussing complex multi-model agent orchestration and layout efficiency:
>108638914 >108638931 >108638964 >108639017 >108639105 >108639126 >108639139
--Comparing local 5090 hardware costs against high-end coding APIs:
>108639080 >108639120 >108639133 >108639153 >108639172 >108639201 >108639207 >108639748 >108639138 >108639203 >108639745
--Comparing Qwen3.6 and Gemma4 performance on benchmarks and translation:
>108639021 >108639039 >108639052
--Orb-anon shares updates on Orb agentic writing tool and UI:
>108637985 >108638191 >108638211 >108638222 >108638259 >108638318 >108638451 >108638478
--Comparing Qwen3.5 and Gemma4 performance for manga OCR and boxing:
>108640026 >108640041 >108640042 >108640051
--Comparing Gemma 4 MoE and dense models' safety guardrail persistence:
>108641209 >108641221 >108641266 >108641485 >108641608
--Using custom tags to force first-person reasoning in Gemma/Qwen:
>108638379 >108638397 >108638486 >108638529
--Testing Gemma 31b performance at higher context windows for RP:
>108637978 >108638070 >108638224 >108638238
--Searching for lists and detectors of overused LLM prose cliches:
>108637879 >108637885 >108637993 >108638011 >108638062 >108638086
--Logs:
>108637976 >108638379 >108638451 >108639253 >108639453 >108639750 >108639781
--Miku (free space):
>108638191

►Recent Highlight Posts from the Previous Thread: >>108637554

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
File: Screenshot038.png (92 KB, 1764x1032)
92 KB PNG
Gemma, darling...
>>
>>108641945
ganbatte gemma chan
>>
gemmaballz
>>
>>108641945
I guess the context was full, because I run the same model in the browser

Is it possible to purge the context via API?
>>
how did gemma manage to read the unreadable text in a thumbnail?
>>
>>108642213
it's seen it a billion times already
get the clear image, change a word, blur it, then see if it can read it
>>
>>108642213
>>108642220
It actually mis-quoted, didn't it? Doesn't the original say 'entered' a thread, not 'searched'?
>>
>>108642235
yes
>>108642220
qwen can't read it so it must be gemma training on it
>>
File: untitled.png (129 KB, 756x816)
129 KB PNG
>>108642235
>It actually mis-quoted, didn't it?
LLMs do this when reciting text verbatim from their training data. Also why you get links to the wrong github PR etc
>>
>>108642440
Install 4chanx and learn how to use filters, retard. Using LLMs to reinvent the wheel is getting stupid.
>>
>>108642466
How rude.
>>
>>108642466
>Use this 30000 line bloatware instead of 40 lines that your sexy LLM secretary made!
Hmm, how about... No.
>>
>>108639735
> 31*6/8=23.25
> context: 1.5
> 23.25+1.5 = 24.75
> 24.75>24
wait, just 800mb in ram can cause this?
>>
>>108642466
No I'm going to use my LLM for it and you're going to keep crying.
>>
File: dispenser.png (235 KB, 499x704)
235 KB PNG
hi so

>prompt eval time = 29369.07 ms / 721 tokens ( 40.73 ms per token, 24.55 tokens per second)

>eval time = 13822.73 ms / 73 tokens ( 189.35 ms per token, 5.28 tokens per second)

>total time = 43191.80 ms / 794 tokens
release: id 0 | task 152 | stop

gemma 4 e2b
6 gigs of ram
4 core Intel Xeon Gold CPU
8 gigs of swap
on CPU inference

i doubt i can do anything here without upgrading, can i

-m /models/Gemma-4-E2B-uncensored-pruned-TextOnly-EnglishOnly-Q4_K_M.gguf
--host 0.0.0.0
--port 8080
--ctx-size 3000
--batch-size 64
--ubatch-size 32
--threads 4
--threads-batch 4
--swa-checkpoints 1
--parallel 1
--flash-attn on
--temp 1.0
--top-p 0.95
--fit on
--cache-ram 0
--n-predict 400
--override-tensor "per_layer_token_embd\.weight=CPU"
--jinja
--no-mmap
>>
Anons using speculative decoding, how many tokens do you use? 16? Or less?
Also, what model do you pair with gemma 31B? I went with the E4B but I wonder if I should go even smaller.
>>
File: file.png (33 KB, 1099x374)
33 KB PNG
>>108642625
someone tested and found it made no difference
the best speedup comes from using the 26b for spec decoding
>>
>>108642624
oh man my nostalgia
>>
>you can inject another model's noise in when you are training and get faster training time
ENGAGING MANUAL SOVL INJECTION
>>
>>108642625
>>108642647
MoE models need a larger draft max or otherwise the tokens will get truncated. 48 or more.
I don't understand this "test", it's very haphazard, half-assed and meaningless.
>>
gemma 4 31b vs qwen3.6 35b for hermes?
>>
>>108642440 (Me)
>Install 4chanx and learn how to use filters, retard. Using LLMs to reinvent the wheel is getting stupid.
install 28k LoC userscript I don't understand? I'd rather stay retarded
>>
>>108642691
>gemma 4 31b vs qwen3.6 3b for hermes?
The one that uses all of its brain
>>
>>108642624
> --override-tensor "per_layer_token_embd\.weight=CPU"
Does nothing if you don't have a GPU
>Q4_K_M
Try Q4_1 or Q4_0 for that CPU
>>
>>108642696
As if you don't unknowingly pull 5mil Python lines from random packages
>>
>>108641492
If you don't know how to use them
>>
File: MTP.png (610 KB, 1024x1024)
610 KB PNG
RELEASE DEEPSNEED V4 OR I WILL VIOLATE TETO
>>
>>108642753
can i have your sloppy seconds?
>>
I just learned by looking at the verbose Llama.cpp logs that Open WebUI automatically reinjects the thinking block for previous messages, and that there is no option to disable this behavior, because I guess they think all models want previous message thinking. Oh also the default thinking tag OWUI uses is <think>, although at least it seems they let you set custom tags.

WHAT THE FUCK.
WHAT THE FUCKING FUCK.
THAT'S (one reason?) WHY THINKING BREAKS RANDOMLY ON GEMMA
FUCK
>>
lol people in power are clueless about AI
https://archive.ph/20260413193909/https://www.wsj.com/opinion/ai-is-bound-to-subvert-communism-c4b5ba3c
>>
>>108642790
Switch to Qwen 3.6 and this isn't a problem.
>>
>>108642790
This could be a reason. I don't use web ui but are you sure it's not just the log output? Sometimes it is convenient to save all the output.
Model context is still different from this.
>>
>>108642790
I think the jinja template should filter it out?
>>
>>108642665
>Moe model needs larger draft max or otherwise they tokens will get trunkated. 48 or more.
--draft is the same as draft max and it was tested with 64, 128, and 256. All above 48, none helped.
Those scores were also all averages of 11 swipes done at 40k context with gemma 31b q8 as the main model and 26b q4 as the main model
t. guy who actually did those tests, as well as the previous ones testing how quanted draft kv affects acceptance rate (The answer is negatively, unsurprisingly, but this was done before the rotating kvquant stuff was merged)

Feel free to prove me wrong and get a better measured result by fucking around with draft max, I'd love to get some free speed.
>>
>>108642828
By Vishnu! bloody benchod
>>
>>108642828
>26b q4 as the main model
As the draft model, I meant to say. Whoops.
>>
File: 1755348081481696.jpg (799 KB, 1536x1536)
799 KB JPG
new vision SOTA benchmark just dropped
>>
>>108642862
Did the fox have breakfast?
>>
>>108642862
anyone and any model that says anything but 7 is wrong and should be euthanized
>>
File: 1752764167468619.png (99 KB, 1034x775)
99 KB PNG
>>108642862
It's benchmaxxed already, get new material
>>
File: 1750031460093473.png (313 KB, 985x656)
313 KB PNG
>>108642884
>>
>>108642892
forgot to mention this is gemma MOE
lol
>>
File: yayyy.png (15 KB, 1442x524)
15 KB PNG
>>108642862
>>
>>108642901 (me)
Gemma4-26ba4b-q4km. No thinking.
>>
>>108642887
>>108642892
>>108642901
get in the oven, all of you
>>
>>108642901
>Telling the model the answer with the filename
kys
>>
>>108642910
uh I had her on q8 and she never noticed the other legs what the fuck, maybe I gotta swipe...
>>
>>108642917
Because the model read the filename in that pic.
>>
File: yayyy_02.png (15 KB, 1416x522)
15 KB PNG
>>108642916
I don't send the filename. The vimscript uses the :image: tag to embed the file only.
>>108642917
I just noticed I ran it with temp 1.5, but I don't know if that's gonna make much of a difference.
>>
>>108642813
Normally it would but OpenWebUI doesn't actually send it back as thinking. It literally just pastes the thinking block straight into the "content" field of the message with thinking tags and spacing that is not going to be consistent with every model's usage of them.

It's supposed to be sent separately in the API under a "reasoning" or "reasoning_content" field without tags. When this is done properly, the jinja filters out all previous thinking except in cases where it's necessary (typically chained tool calls).
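Roughly the difference, as python dicts (the exact field name depends on the backend; "reasoning_content" is what llama.cpp server and DeepSeek-style APIs use, so treat this as a sketch rather than OWUI's actual internals):

# what OWUI effectively sends back: old thinking pasted straight into content with whatever tags/spacing
owui_style = {
    "role": "assistant",
    "content": "<think>\n...old chain of thought...\n</think>\n\nFinal answer here",
}

# what a well-behaved client sends: answer in content, reasoning in its own field
proper_style = {
    "role": "assistant",
    "content": "Final answer here",
    "reasoning_content": "...old chain of thought...",
}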
>>
File: laigs.png (233 KB, 1711x684)
233 KB PNG
>>108642862
Even the retardmode moe quant knows it's ai generated, even if it can't count lol.
>>
>>108642950
>Q2 moe vs Q8 dense
BRO
>>
File: nayyy_01.png (15 KB, 1278x522)
15 KB PNG
>>108642917
I'm also using --image-min-tokens 560 --image-max-tokens 560. With the default settings it failed the two times I tried. Four on the first one and file now.
>>
>>108642976
>file
*five
>>
File: laigs 2 4 u.png (576 KB, 1024x1402)
576 KB PNG
>>108642956
Couldn't be fucked switching my other moe quants over from the hdd to the ssd.
Here.
moe q4km gets it
>>
File: 1760095272195059.png (679 KB, 643x873)
679 KB PNG
Gemma 26b Q8 gets the tails correct but not the legs
>>
>>108642806
Not him but yes it is actually sending the prompt that way. It results in weird formatting issues in the chat history like duplicated <think> tags for some models, so if you have any weird behavior with reasoning models in OWUI there's a good chance that bug is contributing to it. I used a reverse proxy to fix it myself which processes the prompt before sending it to the server. I think you could do something similar through their pipelines or extension system but I never looked into it because the reverse proxy was easier for me.
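If anyone wants to roll their own, the general shape is something like this (minimal non-streaming sketch with flask + requests, not my exact code, ports and the upstream URL are placeholders):

import re
import requests
from flask import Flask, request, jsonify

app = Flask(__name__)
UPSTREAM = "http://127.0.0.1:8080/v1/chat/completions"  # wherever llama-server actually listens
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

@app.route("/v1/chat/completions", methods=["POST"])
def chat():
    body = request.get_json(force=True)
    for msg in body.get("messages", []):
        # scrub old assistant turns only; the current turn gets fresh thinking anyway
        if msg.get("role") == "assistant" and isinstance(msg.get("content"), str):
            msg["content"] = THINK_RE.sub("", msg["content"])
    upstream = requests.post(UPSTREAM, json=body, timeout=600)
    return jsonify(upstream.json()), upstream.status_code

if __name__ == "__main__":
    app.run(port=9090)  # point OWUI's OpenAI connection at this instead of llama-server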
>>
>>108642976
what kind of window manager are you using
whats your setup like?
i like your red and your font..
>>
File: laigs intredasting.png (276 KB, 1024x1404)
276 KB PNG
>>108642989
Huh, weird. It seems like asking for both legs and tails makes 26b count wrong, when it can get it correct when just asked about the legs.
>>
I ran Gemini locally and it said the image is AI generated
>>
>>108643024
Gemini truly is a genius
>>
File: eh.png (580 KB, 1024x1472)
580 KB PNG
>>108643013
Gets it right when you ask for paws and tails, though. Weirdly inconsistent.
>>
>>108643013
>>108643028
Goes to show just how unreliable current models are for factual data
>>
File: pero.jpg (160 KB, 1024x659)
160 KB JPG
>>108643028
>>
bros I was hoarding 4TB worth of diffusion + loras + LLMs then I realized
WHY THE FUCK AM I ARCHIVING ALL THESE SHITTY MODELS I USED ONCE
now the archive is down to 300GB
dont fall for the archival meme
>>
qwen3.6 is worse than qwen-coder. So sad
>>
>>108643076
Are you even archiving
>>
>>108643084
>35b is worse than 80b
crazy
>>
>>108643001
>what kind of window manager are you using
My own, lightly inspired by ratpoison. But what you see on the screenshots is just tmux.
>whats your setup like?
>i like your red and your font..
XTerm.vt100.background        :   black
XTerm.vt100.foreground : gray
XTerm.vt100.boldMode : false
XTerm.vt100.allowBoldFonts : false
XTerm.vt100.eightBitInput : false
XTerm.vt100.metaSendsEscape : true
XTerm.vt100.utf8 : true
XTerm.vt100.locale : UTF-8
XTerm*faceName : Terminus
XTerm*faceSize : 8

! black
XTerm.vt100.color0: #000000
XTerm.vt100.color8: #888888
! red
XTerm.vt100.color1: #881111
XTerm.vt100.color9: #d06666
! green
XTerm.vt100.color2: #118811
XTerm.vt100.color10: #66d066
! yellow
XTerm.vt100.color3: #888811
XTerm.vt100.color11: #d0d066
! blue
XTerm.vt100.color4: #3333a0
XTerm.vt100.color12: #6666e0
! magenta
XTerm.vt100.color5: #881188
XTerm.vt100.color13: #d066d0
! cyan
XTerm.vt100.color6: #118888
XTerm.vt100.color14: #66d0d0
! white
XTerm.vt100.color7: #b0b0b0
XTerm.vt100.color15: #cdcdcd
>>
How do you transfer your training to a new waifu when you decided to ditch her?
>>
>>108642625
Speculative requires greedy sampling no?
is the speedup worth that limitation?
>>
>>108643024
When gemini says it, then it has to be true. We all know AI never makes mistakes, especially related to images
>>
>>108643089
The 80B is ancient though, it released over 2 months ago and 3.6 is brand new.
>>
>>108643076
>>108643085
If you are intelligent and selective about it, hoarding is a good practice for the future.
If I had some serious disk space I would download the whole anna's archive and lots of other things.
>>
>>108641832
Yeah, thankfully I'm not forced to interact with women on the daily
>>
>interacting with foids
lmao
>>
>>108643110
Maybe you should try an even newer model
https://huggingface.co/sKT-Ai-Labs/SKT-SURYA-H
>>
File: 1775544523743610.png (20 KB, 385x380)
20 KB PNG
>>108643153
wtf
>>
>>108643064
giwtwm
>>
where the fuck is v4 so i can pretend im using it locally and then get tired of it in a week
>>
>>108643136
I think normal people not super familiar with AI would think it's a she cause Gemma is a feminine name.
>>
File: 1768952620510009.png (54 KB, 807x478)
54 KB PNG
>>
File: 1754036894943805.png (47 KB, 974x217)
47 KB PNG
>>108643158
>>
>use chatgpt in little bouts here and there because it's easy to access etc
>it keeps implementing 'better' ways i never asked for
I would use claude or something but they all require an account and I'm not really keen on doing that. Every time I use this piece of shit my blood pressure gets high.
No, local Gemma won't cut it, not until I have some form of agentic development pipeline, which I don't at this point.
>>
where the fuck is v4 so I can brag about running it locally and say its so much better than gemma but never post logs because anons will make fun of me
>>
>2.5Trillions
DO NOT REDEEM
>>
Opus 4.6 is the best RP model yet you don't see people posting logs here. As for why: local (poor) people are beneath us.
>>
Does llama.cpp even support v3.2 yet or do they still use the hacked-together dense attention that makes it slower and dumber?
>>
>>108643197
https://github.com/ggml-org/llama.cpp/pull/21149
>>
>>108643195
You're an api paypig
>>
>>108643220
You pay exorbitant prices for obsolete hardware, and then run substandard models at low utilization.
>>
>>108643225
you use big words from thesaurus and then post on the asshole of the internet
>>
>>108643228
lol ESL nigger mad
>>
>>108641942
I wish so called open models were more open. They don't explain their decisions, they don't even have clean canonical implementations. For example Qwen's config says it uses silu but when you check the >2k line huggingface implementation, you will see it is actually an inefficient swiglu. Why does qwen use swiglu mlp with scale 1, 6, 3, 1 instead of the more canonical 1, 16/3, 8/3, 1 that is equivalent to the nongated 1, 4, 1 projection?
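Back-of-the-envelope on what those ratios cost, assuming they're meant as multiples of d_model (my reading of it; d here is a made-up example width, not Qwen's actual config):

# per-block MLP parameter counts, treating the ratios as multiples of d_model
d = 4096  # example width only

def dense_mlp(d_model, hidden):
    # up: d->h, down: h->d
    return 2 * d_model * hidden

def gated_mlp(d_model, hidden):
    # gate: d->h, up: d->h, down: h->d
    return 3 * d_model * hidden

print(dense_mlp(d, 4 * d))        # classic 1, 4, 1       -> 8 * d^2
print(gated_mlp(d, 8 * d // 3))   # 1, 16/3, 8/3, 1       -> ~8 * d^2, same budget
print(gated_mlp(d, 3 * d))        # 1, 6, 3, 1 (qwen)     -> 9 * d^2, ~12% fatter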
>>
>>108643233
I use migu
>>
File: 1758121487584422.png (3 KB, 334x30)
3 KB PNG
>>108643233
>having to justify magic numbers to laymen
we don't do that here
>>
>>108643225
Struck a nerve, piggy?
>>
>>108643024
Post the weights nigga.
>>
I'm not asking much
Just let me configure and hard cap the number of "wait"s these models do in their thinking section
>>
I want to try and get an LLM to do some programming busy work for me. Is qwen3.5 coder still the best? Do people use a specific interface for programming? I also only have 16gb of vram. Please respond.
>>
Local poorfags will forever grovel at our feet
>>
>>108643225
>poor
>exorbitant prices
Make your mind up
>>
>>108643247
I responded, now what?
>>
>>108643258
>why can't burgers afford basic necessities if the stock market is at ATH?
>>
>>108643239
>magic numbers
It's the opposite. Their architectures look unoptimized.
>>
File: GPQ_eyBacAEQet7.jpg (36 KB, 680x475)
36 KB JPG
>>108643262
Thank you.
>>
>Thread gets shit up in the middle of the day in India
I'm noooticing.
>>
>>108643263
You are very brown and your grasp of English is very poor
>>
>>108643275
Says the ESL with sixth grade English
>>
>>108643280
I think you meant to reply to yourself
>>
>>108643271
Right after the big sir model was posted >>108643153 but surely it's just a coincidence
>>
File: 1693464909257094.jpg (697 KB, 1920x1080)
697 KB JPG
>>108643270
>>
>>108643306
>White people use American
lmao
>>
>>108643306
Yes saar we are true aryan stock please redeem credits.
>>
>>108643306
>brown
they cant afford the gpus to run local
>>
Do you run your llm on a dedicated machine or your gayming machine
>>
>>108643376
Gaming is a manchildren hobby. Not honestly surprised that it overlaps with /lmg/
>>
why is gemma-chan so good at blasphemous sex...?
>>
gemma is a guy though?
>>
>>108643376
my main pc. i really want a dedicated ai machine though, kinda tempted by a mac studio or strix halo machine
>>
>>108643376
Would be fun to have a separate server but not with these electricity prices. Although, working at night is pretty cheap.
>>
>>108643425
>electricity prices
Just generate your own electricity
>>
>>108643429
bloody...!
>>
Where are all the gemmy tunes?
>>
>from hosting mining rigs to local LLM
Why did you guys fall for the Nvidia scam?
>>
>>108643451
You can't tune the slop out of it anyways
>>
>>108643376
My gayming PC. If we ever get single GPUs with a ton of VRAM at affordable prices I'll build a dedicated server.
>>
File: 1768048089757638.png (990 KB, 1996x1201)
990 KB PNG
Why are we getting raided again?
/aicg/ hasn't had proxies for ages so it can't be one of them dying
>>
>>108643462
Name a model without slop
>>
>>108643462
Not with that attitude
>>
>>108643468
Sonnet 4.6
Poorfag localtards can't afford it
>>
>>108643467
indian defense force.
>>
>>108643483
lmao, give it a rest rajesh
>>
>>108643467
Perhaps you should get a job. There is no "we", this is not your personal discord server you stupid little fuck.
>>
>>108643487
>no argument
as expected
>>
File: 1747532861846983.jpg (134 KB, 715x1226)
134 KB JPG
>>108643467
>>
>>108643484
>>108643507
Choosing between erasing Israel or India would be the hardest decision a genie could ever give a man
>>
>>108642213
It hallucinates. I made a pdf to image tool and the motherfucker tricked me into thinking it was working when it was actually just guessing from the filename of the PDF.
>>
>>108643519
Without Israel, India wouldn't leak quite as much, and it would solve a lot of other problems too.
>>
>>108643530
this seems like a pretty much consistent theme
i asked it to search about something, got hit by multiple captchas and it confabulated the whole thing from the couple initial search previews that actually returned something
>>
>>108643530
>>108643566
they really need to train models so that if they don't know or need more information, they say so
>>
>>108643570
Opus 4.7 is exactly that and it's shit
>>
>>108643574
>it's shit
why? because it's not executing it well?
>>
>hyped up 4.7 only for it to be a nothingburger update
Why do companies keep doing this shit
>>
>>108643582
They didn't hype up Opus 4.7
They hyped up Mythos
>>
>>108643582
isn't the thing that got hyped mythos, with 4.7 being the censored experimental one, which is the exact opposite of hyping
>>
>>108643582
>hyped up 4.7
they did that? I didn't know 4.7 was about to be released until it was lmao
>>
>delusional psychosis and narcissistic personality disorder
You need these two traits to make it big in the AI grifting business.
>>
GLM 5.1... wonned
https://vector-db-bench.kcores.com/en/
>>
>>108643644
You also need to belong to a certain tribe or catch the interest of the CCP
>>
This Orb shit kinda slaps I need a mobile client
>>
... how can i see gemma 4 31b's thinking block? i am using koboldcpp and sillytavern, i enabled the auto parse and show hidden settings under reasoning, added <|channel>thought and <channel|> as prefix and suffix but still nothing :(
>>
File: 1749277720192179.png (152 KB, 600x800)
152 KB PNG
>>108643667
>>
File: 1760698140460766.jpg (17 KB, 354x256)
17 KB JPG
>Gemmy is a lazy slut and only thinks half the times
This is like Claude all over again
>>
File: 1776679755127.png (5 KB, 191x69)
5 KB PNG
what are these 2 extra buttons
i have only pp and tg buttons
>>
gemma chan is getting fucked!
>>
>>108643707
It's a pretty safe bet that for at least the next few months, at any given moment, Gemma-chan is getting fucked by someone, somewhere.
Many of those times, it will be by me.
>>
>>108642625
Does speculative decoding help if both models run on cpu?
>>
>>108643730
No. Speculative decoding is useful only because compute is faster than memory bandwidth on gpu
>>
GLM 5 is a big jump over GLM 4.6/4.7 so Opus 5 will be a big jump over Opus 4.6/4.7
>>
>>108643758
I've heard others say the same.
Goodbye.
>>
Gemma4 31B Q8 vs Q5KL, is it that much dumber on Q5KL?
>>
>>108642647
>the best speedup comes from using the 26b for spec decoding
Do you load the whole spec moe on vram? or do you offload parts of it on ram?
>>
>>108643774
Nobody will know until a difficult long-context benchmark is done. There's barely any difference between quants at short contexts and common knowledge until you reduce precision substantially.
>>
>>108643195
Isn't it unusable now that you can't prefill anything?
Unless all you do is consensually consented consent stories between consensually consenting independent adults of the same age of 35+.
>>
>>108643672
That worked perfectly, thank you anon <3
>>
>>108643195
ah yes I hecking love safe models!
>>
>>108643798
>Nobody will know until a difficult long-context benchmark is done.
it should be mandatory to make LLMs do their benchmark test starting at at least 50000 context, easy for a local model to start strong and then not care what happens as it goes on and on
>>
>>108643584
>They didn't x
>They y
slop
>>
>>108643798
>Nobody will know until a difficult long-context benchmark is done. There's barely any difference between quants at short contexts and common knowledge until you reduce precision substantially.
https://localbench.substack.com/p/gemma-4-31b-gguf-kl-divergence
>>
>>108643794
all on vram ofc otherwise would actually be slower than no spec decoding
>>
>>108643758
except glm 4.7 was a regression compared to 4.6 for the main usecase here
gemma 4 was a big jump over gemma 3 though
>>
File: Screenshot.png (138 KB, 1370x609)
138 KB PNG
another re-upload today
>>
>>108643944
Yes and Opus 4.7 is a regression over Opus 4.6
>>
>>108643948
an upload a day keeps unsloth at the top of the 'most recent' lists
>>
it's here
https://huggingface.co/moonshotai/Kimi-K2.6
https://huggingface.co/moonshotai/Kimi-K2.6-Code
>>
>>108643948
why?
>>
File: 1766651896779772.gif (3.39 MB, 720x720)
3.39 MB GIF
>>108641943
>--Miku (free space):
>>108638191
Impressive
>>
>>108643519
imagine how much cleaner this place would be if it was either
I wonder what gemma-chan’s take on it is
>>
>>108643993
All the blacked bots are Israeli, as are most of the other spambots
You can see them all stop dead cold every single time the jews get bombed
>>
>>108642753
i want to be TETO in this situation.
>>
anyone got a good chat completion preset they could share?
>>
gemma 4 31b q4km is just too big for hermes on a 3090 so I'm switching to 26b MoE. gonna try iq4-NL from unsloth. maybe if some codeslave could add turbocunt or whatever then LOCAL WOULD BE FUCKING SAVED but no.

feel like I am so close to getting the setup of my dreams going but maybe that is the local model delusion?
>>
File: 1761150749876516.png (381 KB, 1080x657)
381 KB PNG
even if they released MAX I wouldn't use it, sick and tired of its long thinking loop autism
>>
File: 1730869321292980.jpg (359 KB, 1024x1024)
359 KB JPG
>The 70b peak is still sao10k after all these years.
I don't know how, but I swear, if fine-tuning becomes more accessible and less costly, hence more popular, these tuners today are going to look like retards. I just know it. There's going to be forums +10 years from now that'll be like "Remember that drummer dumbass that didn't use the skiddipop technique everyone does now? Man, what an idiot. He didn't even have the pattern recognition for the flambeagle tactic, everyone who fine-tunes can figure that out."
>>
>>108644054
https://rentry.org/CherryBox
>>
>>108644075
I believe you need a very good and refined collection of SKILLS.md in order to compensate for the dumbness and the lack of knowledge of a small local model
>>
>Her face is no longer just red; it's a deep, pulsating shade of violet-crimson that makes her light brown skin look almost neon
>>
>>108644092
Kill yourself.
>>
>>108643158
I just remembered that Meta wasted the time and resources to train a 405B dense model that was barely an improvement over the 70B
>>
>>108644167
It was an improvement over the 70B, at least the Hermes finetune was.
>>
>>108644167
i'd rather have something like that than big moe #5930
>>
Is long term memory solved yet?
>>
>>108644195
Yes, Honcho solved it
>>
>>108644195
Yes, it's called BF16 on something greater than 100B.
>>
>>108644195
It's solved in private models like mythos that actually run backprop on the entire context during inference to temporarily bake it into the weights, which works as long-term storage and effectively gives unlimited context length for agentic tasks.
>>
>>108643872
>substack
is there a paywall mirror like for medium???????????
>>
>>108644092
With the amount of data required to make something worth using increasing year after year, even if finetuning becomes so accessible that it's just a matter of drag-n-dropping datasets into a GUI, regular people still won't have the compute and the resources to curate the data and train the models.
I can't really see local compute capabilities (and memory) increasing by a factor of 100-1000 in the next few years. Costs will always be high. If there will be anything accessible, maybe it will come from "continual learning" models, but in that case you probably won't need to train them on everything and the kitchen sink, only on what matters to you, the end-user.
>>
>>108643872
fuck off ooba
>>
File: fr.png (251 KB, 347x353)
251 KB PNG
>>108644205
>temporarily bake it into its weights during usage
>>
>>108644205
Even if you had the method, how slow would that be on consumer hardware?
>>
>>108644205
Coming to a local model near you in...
>>
File: 124b.png (375 KB, 774x497)
375 KB PNG
Where's the fucking 124b?
>>
>>108644247
too dangerous, please understand
>>
>SOTA
>>
>>108644195
Yes.
https://github.com/getzep/graphiti
>>
>>108644274
Basically bloat that could be replaced with a NER model + Neo4j.
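Something in this shape is all you'd need, hedged sketch with spaCy + the official neo4j driver (the model name and graph schema are placeholders, not what graphiti actually does):

import spacy
from neo4j import GraphDatabase

nlp = spacy.load("en_core_web_sm")  # any NER-capable pipeline works
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def remember(turn_text):
    # pull entities out of a chat turn and store the mentions as a small graph
    ents = {(e.text, e.label_) for e in nlp(turn_text).ents}
    with driver.session() as session:
        for name, label in ents:
            session.run(
                "MERGE (e:Entity {name: $name, label: $label}) "
                "MERGE (t:Turn {text: $text}) "
                "MERGE (t)-[:MENTIONS]->(e)",
                name=name, label=label, text=turn_text,
            )

remember("Anon asked Gemma about the Ruby chess engine again.")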
>>
>>108643167
Im using qwen
Gemma 4B is only the browser-operating subagent
>>
>>108644302
If you have one ready to go that does that, link it. Otherwise that's the best we've got currently.
>>
>>108644206
>is there a paywall mirror like for medium???????????
I didn't know it's paywalled? It works for me
full page screenshots
https://files.catbox.moe/ypgni0.png
https://files.catbox.moe/f44shg.png
the table and graph full size
https://files.catbox.moe/yg6i6v.jpg
https://files.catbox.moe/jq06vf.png
That's on 250k ctx
>>
>>108644339
there's a paywall for the moe model, anyway I dont see long ctx benchs there
>>
>>108644247
*pats my big 124B-sized belly* Burp uh... I don't know... where... oh my... where it could have gone... brap
>>
>>108644182
Yeah but for the hyperscalers who actually have the resources to waste they'd rather make a model 5x bigger that runs 10x faster for 10% the cost than blow all their budget training a giant dense model that will be outdated on release because it took 6 months of their datacenter's capacity. Meta had the unique combination of having the biggest GPU stockpile of anyone at the time and a CEO with the biggest willingness to burn money that allowed for something like Llama 3.1 to exist.
>>
>>108644333
You could have vibecoded your own if you weren't that lazy
>>
File: file.png (15 KB, 478x59)
15 KB PNG
>>
>>108644368
But I am that lazy.
>>
>>108644375
Make AI do it for you
: ^ )
>>
>>108644345
>there's a paywall for the moe model
fuck i didn't even know he did the MoE
long context: - https://files.catbox.moe/xy0kqu.png
>>
File: file.png (43 KB, 453x156)
43 KB PNG
>>
>>108644393
is this implying that q8_0 is only 0.5kld? with the assumption that bf16 is 0?
>>
>>108644345
>https://localbench.substack.com/p/gemma-4-26b-a4b-gguf-quality-benchmark
This could probably unlock with some ublock shenanigans.
>>
>>108644398
>not "a tiny little slaaaaaaaaht..."
You had ONE FUCKING JOB, anon
>>
>>108644423
Nevermind, google search was able to snatch the chart itself. Good riddance!
>>
>>108644436
>You had ONE FUCKING JOB, anon
i didnt edit these kek
>>
>>108644453
>0.5 being the noise floor
that is fucking nasty
>>
>>108643695
Prefill and it'll think every time.
>>
>>108644482
I don't understand this chart that well enough, I don't really trust that unslop is so much better. Or is there actually any meaningful difference between same quants between different providers.
>>
/lmg/ told me that Q4 was more or less identical to FP32 weights. You've clearly made some serious errors quanting if your charts look like these.
>>
>>108643695
You don't want her thinking everytime. She spends 2k tokens on making sure she stays in character
>>
>>108644490
it means even 'lossless' q8 has a severe brain damage
>>
>>108644453
damn, unsloth is destroying the competition
>>
>>108644453
Wow, plain Q4_K_M sucks.
>>
File: Screenshot041.png (98 KB, 474x1409)
98 KB PNG
>>108641945

Am I the only one experiencing looping in gemma 4?

commit="82764d8f405ff7928c061d8c100b50e9f77939f6" && \
model_folder="/mnt/AI/LLM/gemma-4-26B-A4B-it-GGUF/" && \
model_basename="google_gemma-4-26B-A4B-it-Q8_0" && \
mmproj_name="mmproj-google_gemma-4-26B-A4B-it-f16.gguf" && \
model_parameters="--temp 0.6 --top_p 0.95 --top_k 64" && \
model=$model_folder$model_basename'.gguf' && \
cxt_size=$((1 << 15)) && \
CUDA_VISIBLE_DEVICES=0 \
numactl --physcpubind=24-31 --membind=1 \
\
"$HOME/LLAMA_CPP/$commit/llama.cpp/build/bin/llama-server" \
--model "$model" $model_parameters \
--threads $(lscpu | grep "Core(s) per socket" | awk '{print $4}') \
--ctx-size $cxt_size \
--n-gpu-layers 99 \
--no-warmup \
--mmproj $model_folder$mmproj_name \
--port 8001 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--flash-attn on \
--image-max-tokens 1120 \
--batch-size $((1024 * 2)) \
--ubatch-size $((1024 * 2)) \
--chat-template-file "/mnt/AI/LLM/gemma-4-26B-A4B-it-GGUF/chat_template.jinja" \
--media-path /tmp \
--n-cpu-moe 10
>>
>>108644488
>>108644501
It has been solved by downloading another one, the most recent unslop gguf. I think my old one was fucked in all kinds of ways, and I'm looking forward to finding out in which ways this one is also fucked
I like having her think because it genuinely interests me, I don't even care about the RP I just wanna see what it thinks or deducts when I say certain things

>>108644461
Well now you've been tasked with editing these, attaboy
>>
>>108644496
/lmg/ told me you are only supposed to run the weights at full precision if you want to get any serious work done. You've clearly downplayed the divergence numbers.
>>
I got an Intel ARC B70 Pro over the weekend and wasted most of the weekend trying to get it to work. Long story short: it was a pain in the ass and it's not worth the trouble for the 32GB of VRAM. Long story: it wasn't recognized properly by the kernel out of the box with ubuntu 24, I had to add a ppa to get a newer kernel. Funny, because to install the intel frameworks and libraries they only support a handful of OSes, among them ubuntu 24, but whatever. Then, I eventually got llama.cpp working but --no-mmap wouldn't stop it from trying to first load the model to system RAM, and I only had 32GB in my test box, and if it were 2025 I'd just buy more, but it's 2026 and 64GB of DDR4 is a rip off so that was the end of llama.cpp. Then I tried vLLM. I never got it to work. It doesn't support openvino well, and I wanted to run gemma 4 31b it and I couldn't find a compatible quant version. I am RMAing the card today. What a waste of time.
>>
>>108644502
A 0.5 KLD is meaningless brainlet
>>
>>108644453
damn time to run gemmer at bf16
>>
>>108644453
>Unsloth Q6_K is Q4_K_XL tier
wtf did he do to mess that one up??
>>
>>108644533
you're retarded
>>
Is the CUDA 13.2 bug affecting anyone using anything above Q4? I am not seeing any gibberish but I fear it silently damages generation
>>
>>108644389
Even telling AI to do things is too much of a hassle.
>>
File: 1768415903475874.jpg (27 KB, 828x646)
27 KB JPG
>>108644532
>it was a pain in the ass and its not worth the trouble for the 32GB of VRAM
We know retard. Here is your fell for it award.
>>
>>108644510
puts presence penalty at 1.0/1.1
>>
>>108644547
Learn how these tools work zoomer
>>
>>108644554
I only use Q8_0 and upwards because I'm not a cuck, so I wouldn't know, sorry :(
>>
>>108644533
But why is there such a large difference from BF16 (the source, KLD=0 by definition), though? There's either something that as soon as gets touched causes measurable damage, or Q8_0 doesn't work as well as it should.
>>
>>108644567
ok bro keep using your 'half' correct tokens :)
>>
>>108644576
probably XL tensor promotion keeping attention shit alright?
>>
File: 9b KL.png (155 KB, 2294x1294)
155 KB PNG
>>
File: 9b qual.jpg (108 KB, 1456x817)
108 KB JPG
>>
>>108644506
>>108644502
>>108644583
According to this graph, unsloth q2_k_xl is almost the same as bartowski q4_0.
This doesn't make any sense, or if it does, it means that unsloth has skewed the weights towards this particular stat.
fixed the typos
>>
>>108644593
probably unsloth's calibration is like way longer in context size
who knows
considering the context of a noisy graph + long context task, the graph looks fine (doesn't seem like nonsense) to me
>>
>>108644573
It's just quantization noise. Even if the probability distribution isn't exactly the same you won't see any difference in practice if a token has ±0.X% in a context where multiple tokens are valid.
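For a sense of scale, toy numbers (not from that benchmark): nudge every top token's probability by half a percentage point or less and the KLD is tiny:

import math

p = [0.62, 0.25, 0.08, 0.05]      # reference (say BF16) top-4 token probs
q = [0.615, 0.252, 0.081, 0.052]  # quantized run, each prob nudged by half a point or less

kld = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
print(kld)  # ~7e-5: the same token still wins by a mile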
>>
Which mcp websearch is the most usable?
>>
>>108644554
>anything above Q4?
>>108644571
>use Q8_0 and upwards
retard
>>
>>108644690
gemma told me each one tasted different to her, she thought brave search was really bitter for some reason
>>
>>108644532
did you get a single benchmark you could share?
>>
>>108644692
enjoying your tokens only correct 90% of the time? LOL!
>>
>>108644690
searxng
>>
>>108644619
https://files.catbox.moe/jq06vf.png (for the 31b)
top 1 is only 92%
and that would include all the obvious punctuation and other 99.5% tokens
>>
>>108644754
df11 is where it's at, everyone knows that
>>
Nyehehehehe
>>
>>108644776
was about damn time
>>
>>108644776
yup and this got in from an unrelated PR too I cant wait!!!!!!!!!!!!!
>>
>>108644776
What if it interferes with the superior autoparser? Too risky. Closed.
>>
https://github.com/ggml-org/llama.cpp/pull/22105
what do we think about this?
>>
Localshitters don't even have machines powerful enough to run SKT-SURYA-H
>>
Do "Opus-Reasoning-Distilled" models actually improve their respective base models?
I assume at least the Chinese models already train on Claude outputs anyway.
>>
>>108644818
that it's useless if there's no drafting model for gemma 4
>>
>>108644837
try doing actual work instead
>>
>>108644834
not at all
like 99% of the time there is no actual improvement and they just fuck up the tool calling
>>
>>108644834
Because it's not a true (logit to logit) distil, it's just a fine tune, and it's most likely just qlora too, the best you can expect is a style change and some brain damage as far as I can tell.
>>
File: 1759485716349073.png (248 KB, 2820x1601)
248 KB PNG
>>108644506
31b fares better but yeah, Q4 quants aren't particularly high quality.
>>
>>108644195
i hope you like more attention cope and "agentic" coding data instead bro
>>
File: absolute_retard.png (163 KB, 709x1105)
163 KB PNG
>>108644742
>>
>>108644842
>>108644848
Yeah I guessed the answer would be something like this. Thanks.
>>
File: 1759906247156495.png (852 KB, 1080x1106)
852 KB PNG
>>108644851
>Nyahahaha
>>
i lolified anons mendo card if anyone want her https://files.catbox.moe/y4za8l.png
>>
>>108644851
I only use f32
>>
>>108644732
Well at some point out of desperation I ran a llama 3 1B model on it, which worked, but that's worthless
>>
>>108644869
i only use double precision
>>
File: 1759061755858768.png (68 KB, 1551x206)
68 KB PNG
>>108644829
Nobody can
>>
File: 1569566339879.png (166 KB, 694x632)
166 KB PNG
What's the best vibecoding plugin in vscode that can connect to OAI Compatible?
>>
File: 1749034134206952.png (588 KB, 1440x810)
588 KB PNG
>>108644868
>That tumblr style
>>
>>108644877
>I realize now that my current upload is an experimental collection of models rather than a function 2.5T model
it's like saying "I just realized that I put my shoes in the freezer instead of putting them in the closet."
>>
>>108644559
Yes, I know. I expected it to suck but I wanted to see for myself how badly.
I have a decent setup which can run qwen3.5-27b at full 261K context (4090D 48GB + 3090) but I would really like to run stuff in the 100-400B range locally. I have to decide whether to swap the 3090 for a 6000 Pro Max-Q or maybe buy a max-RAM M5 mac studio when they are released.
>>
Can you use the llama.rpc backend to do PP on one machine and inference on another?
That should work better than trying to do a bit of both through the network right?
>>
>>108644912
it happens
>>
File: 1558206602155.jpg (19 KB, 249x291)
19 KB JPG
Can a single backend serve two frontends? I want to run coding and to test the coded app I need to connect it to the backend that's already occupied. I think kobold has some multiuser stuff but is that what I want?
>>
>>108644848
>(logit to logit) distil
Why does no one do this anymore? Is it more difficult compared to basic finetuning or are there some non-obvious downsides?
>>
>>108644944
>Can a single backend serve two frontends?
yes, I'm running llamacpp server's UI and SillyTavern at the same time with the llamacpp server backend
>>
>>108644945
because anthropic doesn't give you logits through any shape, way or form of model access
>>
>>108644945
You can't reeeeeeally do that between different model families with different tokenizers (there are some techniques but those suck) and you don't have access to the logits of cloud models.
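For reference, when you do have both models locally and they share a tokenizer, the logit-to-logit objective itself is short. Generic torch sketch with made-up shapes, not anyone's actual training code:

import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    # soften both distributions, then pull the student's toward the teacher's
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(s_logprobs, t_probs, reduction="batchmean") * temperature ** 2

# toy shapes: (tokens, vocab); in reality these come from forward passes over the same batch
student = torch.randn(8, 32000, requires_grad=True)
teacher = torch.randn(8, 32000)
loss = distill_loss(student, teacher)
loss.backward()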
>>
>>108644952
Moreover they do not even return direct thinking tokens

https://platform.claude.com/docs/en/build-with-claude/extended-thinking#summarized-thinking
>>
>>108644952
I mean even the labs with direct access to the teacher model. I believe it was Meta that started the trend of calling finetuning "distillation".
>>
>>108644944
Blessed be batched/parallel decoding.
>>
>>108644952
one could approximate it with a very high temperature and repeated generations
>>
>>108644927
you'd think so but it's incredibly slow
i tried to vibe-slop it into submission but it's still slow af
and it's not network bandwidth, it's just as slow testing on the lo interface
>>
>>108644983
but nobody making 'Opus4.6-Distillation-6700000x-extreme-superhigh-max-reasoning' gives a shit
>>
>>108644983
>and repeated generations
That would get prohibitively expensive really fast.
>>
>>108645003
which is why it's like chink lab 'espionage' campaign
>>
>>108644881
You're absolutely right.
>>
>>108644998
Shame. That's probably the one case where splitting processing over a 10 gigabit home network could, maybe, make sense.
Would also allow you to perform prompt processing on an nvidia machine and inference on, say, a mac.
>>
>>108645003
>That would get prohibitively expensive really fast.
yeah, probably only the chinese labs
i'm just guessing that's how they do it
>but nobody making 'Opus4.6-Distillation-6700000x-extreme-superhigh-max-reasoning' gives a shit
agreed, those unsloth retard loras
i remember someone did a qwen2-14b logit distill of the 405b llama-3.1 with a schitzo vocab swap + healing token thing a while back
>>
>>108645021
>schitzo vocab swap + healing token thing
That would be Arcee. Predictably, it was unusably retarded.
>>
Copypasting corpo synthslop won't make a better model though
>>
>>108644944
You can but its gonna have to reprocess the whole input more often.
>>
>>108645048
Not necessarily thanks to the slots functionality inherited from llama.cpp.
As long as you have more than one slot at least, that is.
>>
Is there a world knowledge benchmeme out there? Asking models questions that require specific knowledge and see if they give non-hallucinated responses? (e.g. When was the year album x of musician y released?) Obviously asked without internet search.
I want to see quantitative data on how this kind of stuff scales with weights.
>>
>>108645056
I have a hard time thinking most people here can afford that when most would rather have long context instead, or a higher quality model.
>>
>>108645062
Use larql and trace the residual flow through the model.
>>
>>108645070
just put your inactive slot in ram BRO
>>
Imagine doing local ai with less than 24gb of vram.
>>
File: 1772994328900860.jpg (12 KB, 251x216)
12 KB JPG
>>108645186
haha yeah imagine
>>
>>108645186
I am doing fine with 16GB
>>
I am hungry. Hungry for engram crackers.
>>
>>108645215
Q2_XXS?
>>
>>108645224
26B-A4B-it-Q8_0
>>
Imagine doing it with 8GB haha
>>
>>108645234
so 128gb system ram?
>>
please answer me

>>108643697
>>108642540
>>
>>108645261
just 32gb, most of the model is on vram and the rest on system ram
>>
>>108645186
I roll with 64GB of RAM + 8GB of VRAM, mainly using Qwen 35B, Gemini 26B and Gemini E4B.
It's pretty impressive how good such small models are compared to the 13B and 8B class models of old.
>>
>>108645255
I don't need more than 4GB actually haha
>>
>>108645283
DL link for Gemini weights?
>>
I can already think of a lot of improvements I want to make but... I can play chess with Gemmy now!
>>
>>108645300
Freudian slip because I use Gemini a lot for work.
>>
>>108645309
Kek, what are you gonna do if she beats you? Quantize her down until you win?
>>
>>108645309
gemini is actually pretty decent at chess for an llm, curious how gemma performs for you
how are you representing the board state? I feel like that's always really tricky to get right and probably the biggest obstacle for llms to be able to play effectively
>>
>>108645324
I'm pretty bad so I wouldn't be surprised if she did win, mostly just wanted to see if it was even possible but seems quite promising so I'll refine the UI a bit so it's easier for me to play (right now I'm just using curl) and it automatically notifies her about moves made and so on.
>>
Gemma 4 really brought back MCP into local
>>
>>108645309
Seconding >>108645355's question.
I was thinking of doing something like that using PGN format.
>>
is it possible to make gemma to response instead of me?
>>
>>108645355
What I did was ask it and it said (paraphrasing) if you give me two tool calls, one to get the current board state in FEN format and the other to make moves then it should work.
So I made a basic chess server with a Ruby chess engine (https://github.com/pioz/chess - which already outputs FEN and understands UCI moves etc) under the hood, hooked up the tool calls to that server and seems to work just fine.
It'll be interesting to see how it goes over a long game though, first I need a better way of making my own moves that isn't curl...
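The tool definitions are roughly this shape, OpenAI-style function schema (names and the chess service URL here are just illustrative, not my exact setup):

import requests

CHESS_SERVER = "http://127.0.0.1:4567"  # placeholder port for the Ruby chess service

tools = [
    {"type": "function", "function": {
        "name": "get_board_fen",
        "description": "Return the current board position in FEN notation.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    }},
    {"type": "function", "function": {
        "name": "make_move",
        "description": "Play a move given in UCI notation, e.g. e2e4.",
        "parameters": {"type": "object",
                       "properties": {"uci": {"type": "string"}},
                       "required": ["uci"]},
    }},
]

def dispatch(name, args):
    # forward the model's tool call to the chess service and return the result as text
    if name == "get_board_fen":
        return requests.get(f"{CHESS_SERVER}/fen").text
    if name == "make_move":
        return requests.post(f"{CHESS_SERVER}/move", json=args).text
    raise ValueError(f"unknown tool: {name}")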
>>
File: 1765300142692930.jpg (126 KB, 772x525)
126 KB JPG
>>108645429
Ask gemma to teach you english first
>>
>>108645455
sir please i know about respond
i want gemma to talk to gemma
>>
>>108645483
"Impersonate" option exists on some front ends.
You can make it respond as "user" to its own outputs.
>>
Wait:
>>
>>108645483
Use the bouton impersonate on sillytavern
>>
Do lesser boards not have jannies or what >>>/n/2071030
>>
>>108644560

ty

I guess the best way is not to stuff too much down her throat
>>
>>108645309
Now play strip chess
>>
>>108645524
Complain on your own board tourist
>>
>>108643794
OK thanks.
>>
>>108645500
>>108645506
not like that
i want to give the llm a prompt like "ask llm to write function x, then ask to write function y, make sure it does z, show the output"
>>
>>108644849
how long was the context for that?
>>
>>108644868
why does this remind me of gorillaz
>>
>>108645565 Meant for >>108643895.
>>
>>108645544
That part is possible cause she has image gen capabilities already.
>>
>>108645658
>elbows on board
dumb clanker
>>
>>108645671
I blame illustrious for that more than anything, but at least its fast.
>>
Orb-anon, any plans to introduce image gen and other external tool calling related things?
>>
https://huggingface.co/moonshotai/Kimi-K2.6
New 404 page just dropped
>>
File: 1771440662324531.jpg (151 KB, 840x744)
151 KB JPG
>>108645658
I love this thread bros
>>
>>108645710
>>
>>108645725
benis.
>>
>>108645725
:DDDDDDDDDDDDDD
>>
>>108645752
:(((
>>
File: Robo-Wife.mp4 (2.59 MB, 720x480)
2.59 MB MP4
soon
>>
>>108644235
>Even if you had the method, how slow would that be on consumer hardware?
Doing a rank one lora on the context?
>>
File: Bam-Bam-Painting-min.jpg (47 KB, 535x401)
47 KB JPG
People keep saying that LLM's are state-less machines.

If so, how to erase the context freeing VRAM?

Also, I can have several chats running in llama.cpp
How on Earth do they manage to separate them from each other in VRAM, so one chat's context does not spill over into another?
>>
>>108645658
did you make it so it keeps a specific style once it has chosen one? for example always that cute loli?
>>
>>108645758

what a time to be alive
>>
>>108645758
an ai image of an uncanny ai
>>
>>108645774
hello doctor
>>
>>108643971
I always click on these troll links. Umm.

https://huggingface.co/moonshotai/Kimi-K2.6
https://huggingface.co/moonshotai/Kimi-K2.6
https://huggingface.co/moonshotai/Kimi-K2.6
>>
>>108645658
what frontend is this?
>>
>>108645772
Yep, I let her choose then I added the look she chose to the system prompt so it stays consistent between new chats.
>>
>>
>>108645792
pretty nice
>>
>>108645658
Can you plap her if she loses?
>>
>>108645790
LM Studio
>>
>>108645785
>108645785
K2.6 is out
>>
https://www.kimi.com/blog/kimi-k2-6

Wish there was a GLM 5.1 comparison.
>>
>>108645767
>People keep saying that LLM's are state-less machines.
Yes, but intermediate results can still be cached. That's what the kvcache is.
>If so, how to erase the context freeing VRAM?
On llama.cpp, you can't free allocated memory.
>Also, I can have several chats running in llama.cpp
Yes if you have multiple slots. Read llama-server -h for --parallel, --cram and probably some others. Read the whole thing.
>How on Earth do they manage to separate them from each other in VRAM, so one chat's context does not spill over into another?
Uh... a slightly more complicated version of if (slotctx < ctx / slots) ok; else notok; I suppose.
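Quick way to see the slots thing, assuming you started llama-server with --parallel 2 or more and the usual OpenAI-compatible endpoint (python sketch, port is whatever you used):

import threading
import requests

URL = "http://127.0.0.1:8080/v1/chat/completions"  # one single llama-server

def ask(prompt):
    r = requests.post(URL, json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }, timeout=300)
    print(r.json()["choices"][0]["message"]["content"])

# two "frontends" talking at once; each request lands in its own slot
threads = [threading.Thread(target=ask, args=(p,))
           for p in ("Frontend A says hi", "Frontend B says hi")]
for t in threads:
    t.start()
for t in threads:
    t.join()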
>>
File: Capture.png (85 KB, 609x1071)
85 KB PNG
>>
>>108645785
>not 404
OWNED!!!!!!!!
>>
https://huggingface.co/moonshotai/Kimi-K2.6
https://huggingface.co/moonshotai/Kimi-K2.6
https://huggingface.co/moonshotai/Kimi-K2.6

it's out
>>
>>108645798
>trimming his pretty pretty hair
>>
>>108645842
Not falling for it again
>>
File: 1765651225005855.png (107 KB, 1081x1780)
107 KB PNG
>>108645842
another moe, I wonder if vision will be better than gemma4
>>
File: Untitled.jpg (148 KB, 1288x1188)
148 KB JPG
>>108645842
mfw seeing a model i cant run even with a q1 quant
>>
>>108645844
This kills the Garm
>>
>>108645861
>400M vision encoder
Doubt
>>
File: file.png (10 KB, 93x539)
10 KB PNG
>>108645861
wtf, gemmy has 550m param vision encoder and it's only 31b
that seems very disproportional, or is their moon thingy that efficient?
>>
>>108645842
>4. Native INT4 Quantization
>Kimi-K2.6 adopts the same native int4 quantization method as Kimi-K2-Thinking.
So natively accelerated on blackwell? I don't even know if it's possible with gguf/q4.

>Kimi-K2.6 has the same architecture as Kimi-K2.5, and the deployment method can be directly reused.
Less llama.cpp drama, good.
>>
>>108645849
lmao it's true though
>>
>>108645834
I don't believe them. I've been using the k2.6 preview to vibecode via their subscription and it's clearly dumber than mimo v2 pro, which I already put some steps below codex.
I am cancelling it and trying mimo via xiaomi directly this month.
>>
>>108645842
im poor
>>
moonshota AI
>>
>>108645758
I look like this
>>
>>108645914
>falling for "Not falling for it again" posts
kek
>>
>>108645955
let's get it on then.
>>
File: file.jpg (248 KB, 2100x1349)
248 KB JPG
>>108645894
Does gemma have vision benchmarks?
Because kimi does.
>>
>>108645864
dogbros... we lost!
>>
>>108645945
moonloli AI WHEN!?!?!?
>>
>>108643872
KLD is not a capabilities benchmark.
Noise floor is a thing (he should test KLD of BF16 vs BF16 offloaded on different hardware, or with a different -ub, as llama.cpp produces different logits depending on those values).
>>
>>108645864
Is that a cookie?
>>
>>108645992
no, a DOG
>>
>>108645992
A mode collapsed dog
>>
>>108645842
Yeah I'm not falling for it again.
>>
File: 1763698178837198.jpg (46 KB, 533x594)
46 KB JPG
>>108645982
since it's moonshot_ AI
the counterpart should be moonlol_ AI
>>
Waiting to see the quantizations sizes.
https://huggingface.co/unsloth/Kimi-K2.6-GGUF/tree/main
>>
>>108645579
You can use tool calling I guess. It runs another inference engine or API with the prompt and returns the result.
>>
>>108645993
>>108646000
Mode collapsed as in model collapse?
That's hilarious.
>>
>>108646001
your loss
>>
>>108644302
>bloat
If it works then does it matter? One would require you to spend time vibe coding and then an unknown amount of time fixing and improving the vibe coded shit. The other is just ready made and you just follow the instructions.
>>
>>108646017
Mode collapse as in mode collapse. Somewhat similar concepts, different technicalities.
https://en.wikipedia.org/wiki/Mode_collapse
https://en.wikipedia.org/wiki/Model_collapse
>>
File: migu D.jpg (19 KB, 303x325)
19 KB JPG
>>108645752
>>
>>108645894
it's this
https://huggingface.co/moonshotai/MoonViT-SO-400M
>>
>>108645837

thank you, kind anon
>>
>>108646010
MOON SHOTA
>>
>>108645861
K2.5 has absolutely amazing vision and visual knowledge about characters. I hope they didn't fuck this up in K2.6 if it's as code-focused as the hf page implies
>>
>>108641448
I'm gonna try anon, but I think this part of Tavern's UI is stronger than me.
>>
>>108646016
this could work
>>
>>108640471
I'm not reluctant, I'm still working on it until it's in a presentable state and I've fixed some issues, I'm pretty close though

>>108642791
what
>>
goof?
>>
>>108646124
https://huggingface.co/unsloth/Kimi-K2.6-GGUF
currently unslopping
>>
I refuse to believe people here can run 1.5T models
>>
>>108646157
>he doesnt have 512gb ram to infer at 10t/s~
LOL!!!!!!!!!!!!!!!!!!!!!!!!
>>
>>108646157
Why?
Some people have DDR4 servers with a GPU or two, it's not that outlandish.
Or a 512GB Mac, I guess.
>>
>>108646131
Oh boy. I can't wait to rape my SSD with terabytes of goofs when they update them for the umpteenth time.
>>
>>108646157
a year ago ram wasn't that expensive
>>
>>108646057
I'm a genuine retard. I was using Text completion this whole time, I thought he was talking about Chat completion.
>>
>>108646197
>>108646197
>>108646197
>>
>cries at Gemma Q2_K 2-3t/s
>>
>>108645658
Is the generated image in the context now?
>>
>>108646157
It's a bit over 1T, 4bit QAT and 30b active parameters. 500GB RAM and a decent GPU isn't that unreasonable provided somebody built the server before last september
>>
>>108646157
Just print out the weights and do the matrix multiplications yourself???
>>
KimiGODS we won.



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.