/g/ - Technology


File: mikudayo.jpg (153 KB, 640x1536)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108139561 & >>108132261

►News
>(02/13) MiniMax-M2.5 released: https://hf.co/MiniMaxAI/MiniMax-M2.5
>(02/13) Ring-2.5-1T released, thinking model based on hybrid linear attention: https://hf.co/inclusionAI/Ring-2.5-1T
>(02/11) GLM-5 744B-A40B released: https://z.ai/blog/glm-5
>(02/11) Ming-flash-omni 2.0 released: https://hf.co/inclusionAI/Ming-flash-omni-2.0
>(02/10) MOSS-TTS Family: speech and sound generation models: https://github.com/OpenMOSS/MOSS-TTS
>(02/06) KugelAudio-0-Open: Multilingual TTS based on VibeVoice 7B: https://hf.co/kugelaudio/kugelaudio-0-open

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: 1758348922203207.jpg (151 KB, 808x1144)
►Recent Highlights from the Previous Thread: >>108139561

--LLMs Don't Need More Parameters. They Need Loops.:
>108140291 >108140465 >108140781 >108140819 >108140866 >108140963 >108140999 >108141094 >108141138 >108141692 >108142187 >108142198 >108140601 >108140591 >108140741 >108140762 >108140792 >108140886 >108140902 >108140926 >108141054 >108141087 >108141107 >108140798 >108140799 >108140813 >108141811 >108140860 >108140929 >108140890 >108140848
--LLM ethical responses to explicit prompts analyzed:
>108143213 >108143247 >108143652 >108143603
--AI model benchmarking shows Claude Opus 4.6 leading in reliability and autonomy:
>108140073 >108140143 >108140159 >108140175
--Dynamic looping to optimize token processing efficiency:
>108140908 >108140935 >108140956 >108141011
--Training embeddings for template tokens with minimal data:
>108145350 >108145682 >108145802 >108145845 >108145979 >108146019 >108146060
--Apriel Nemotron 15B shows promise for 16GB storytelling:
>108146289 >108146381 >108146403 >108146408 >108146896 >108147143 >108147681 >108147220
--Erog_ translation model comparisons and recommendations:
>108146089 >108146234 >108146596 >108146690 >108146750
--Evaluating UGI Leaderboard and open model performance gaps:
>108148138 >108148167 >108148177 >108148185 >108148341 >108148400
--Debating Mamba/SSM model viability after lackluster major lab results:
>108147234 >108147261 >108147276 >108147481
--Progress in Open Weights vs. Proprietary Intelligence:
>108142559
--Heretic decensor update improves performance retention:
>108147926
--MiniMax developer explains why they won't release base models:
>108139799
--Satirical LLM timeline highlights Chinese dominance:
>108139786 >108141075 >108141092 >108141333 >108141795 >108145151 >108147066
--Miku (free space):


►Recent Highlight Posts from the Previous Thread: >>108139566 >>108139574

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>108149287
prompt? I never managed to make low light work
>>
>>108148616
Ok so she ended up just reverting to the same emotionless dismissive attitude. Is this a problem with the character card or the model?
>>
Why don't we use a big LLM to process the prompt and generate the KV cache (weight streaming is cheap because it's amortized across the whole response) and a small model that fits on the GPU to write the actual response?
>>
>>108149367
I don't know where to even begin answering this but consider the fact that you need to update the KV cache for each new token before the next one is generated.
>>
>>108149367
The other way around makes more sense to me (prompt processing with a small model that you can load on a consumer GPU, inference with a big model on slow memory).
>>
>>108149422
Yes, append the new rows to the KV cache with the ones computed by the smaller model that generates the answer, then once it's done discard them and recompute those rows with the bigger model (and also the new rows for the user message).
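Roughly this per turn (just a sketch; prefill/decode/drop are made-up names standing in for a big model and a small model that were trained to share the same cache layout):

def chat_turn(big, small, kv_cache, user_tokens):
    # prompt processing: the big model extends the shared KV cache with the new user message
    kv_cache = big.prefill(kv_cache, user_tokens)
    # generation: the small model appends its own temporary rows while it decodes the reply
    reply_tokens, temp_rows = small.decode(kv_cache)
    # cleanup: discard the small model's rows and recompute them with the big model,
    # so the cache stays big-model quality for the next turn
    kv_cache.drop(temp_rows)
    kv_cache = big.prefill(kv_cache, reply_tokens)
    return kv_cache, reply_tokens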
>>
>>108149447
you would have to start with the big model and train the small model to the big model's cache so they speak the same language.
>>
>>108149434
Being able to fit the whole model on VRAM doesn't make much difference for PP; streaming the weights and KV cache only adds a few extra seconds, especially because you can do it in parallel as you compute attention. If it's slow, it's because of memory bandwidth, not because the whole model doesn't fit.

>>108149437
The problem with speculative decoding is that you would have to fit two different models at once which makes it not viable for VRAM constrained scenarios. With this approach you only have one model or the other loaded at any one time.

>>108149484
I think the easiest way to do it would be to take a MoE, get rid of the specialized experts and finetune the whole network (or possibly only the shared experts?) to generate the message given the precomputed context.
>>
Or maybe you could just REAP the experts and get an interesting result with no finetuning required, I'm not sure.
OR you could also run a lower quality quant for the generation phase.
>>
>>108149533
oh yeah, i forgot about moe models, you should give it a shot, there are a bunch of moe models out there and they typically have low active parameter counts, well within the reach of consumer hardware. you can at least make a proof of concept to show it is somewhat viable. you could probably even write a paper if you document your experiments well.
>>
>>108149533
>>108149570
An interesting test would be to process a character card & instructions with an FP16 model, save the results to disk, and then preload it to see if it makes any difference in a long roleplay with a low-quant model
>>
>>108149628
Good point. For some use cases you want to use the same pre-prompt for many runs. Like when using the model in few-shot mode (giving it examples).
>>
>>108149287
Is this the lewd miku general?
>>
>>108149533
>streaming the weights and KV cache only adds a few extra seconds especially because you can do it in parallel as you compute attention
How does it work with MoE models? Don't different tokens require different experts? Does it switch experts? Does it load every single expert to process any prompt?
>>
>>108149287
i'm considering buying like 20 to 40 sxm2 cards, wonder if it's worth it though.
but i kinda want a high vram rig and these cards are really cheap
>>
>>108149662
maybe
>>
>>108149808
As long as you have the power to run them, also do not forget to include electricity in the cost
>>
>>108150103
they are rated at 250w each, llm inference would probably be at around 125w, my house could definitely support > 5kw.

also i could use some UPS or capacitor bank in between because you generally are not doing a constant load.
>>
>>108149783
Yes, for prompt processing you need to load all experts, because unless your prompt is tiny you are going to need all of them. But you can process layer 0, then layer 1, etc., fetching the weights from RAM before generating the KV cache and activations for that layer.
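In torch terms the overlap looks like this (apply_layer and the per-layer weight dicts are placeholders, and the weights need to live in pinned RAM for the async copy to actually overlap):

import torch

def streamed_prefill(layers_cpu, hidden, apply_layer, device="cuda"):
    # layers_cpu: one dict of pinned CPU tensors per layer
    copy_stream = torch.cuda.Stream()
    nxt = {k: v.to(device, non_blocking=True) for k, v in layers_cpu[0].items()}
    for i in range(len(layers_cpu)):
        torch.cuda.current_stream().wait_stream(copy_stream)   # make sure layer i has arrived
        cur = nxt
        if i + 1 < len(layers_cpu):
            with torch.cuda.stream(copy_stream):                # start fetching layer i+1 from RAM
                nxt = {k: v.to(device, non_blocking=True)
                       for k, v in layers_cpu[i + 1].items()}
        hidden = apply_layer(cur, hidden)                       # attention + FFN, fills this layer's KV rows
    return hidden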
>>
How does minimax 2.5 compare to GLM 4.7?
>>
>>108150303
better for work, worse for cooming and writing
>>
>>108150339
nta
what about glm 5, is it any good ?
>>
>>108150356
fine, but too large for most people to use
>>
File: jew lecun.jpg (907 KB, 2080x2208)
The military is always 20 years ahead.
>>
If I have 16gb gpu and 32gb ddr5

What kind of models can I use and what would they be capable of?

I'm interested in trying this out
>>
>>108150486
The important thing is that the military is committed to responsible use of AI.
>>
>>108150536
Nemo
>>
Is it possible to do long form role playing? Like going on an adventure with a party and having advancing relationships, etc?
>>
>>108150767
depends how much money you have
>>
>>108150772
I have about tree fiddy
>>
>>108150787
You need to acquire more money first
>>
>>108150807
Jokes aside is it possible on consumer hardware or do you need Jewgle levels of datacenter power?
>>
>>108150767
Not in a single, eternal chat. You'll want to use summaries and create world books from one chat to the next for some semblance of continuity. As context grows, models become dumber, even SOTA API ones.
>>
File: c3gqrsdwyaec_7h.jpg_large.jpg (461 KB, 1190x1600)
>>
>nemotron 12b VL supported
time to test!!!!
>>
>>108150486
I like this Lecun
>>
>>108150767
It's possible, with some wrapping. I'm working on exactly that for myself. You take quests, and by the end, they get summarized into a lorebook. Then you can drink at the tavern, and an npc will pull up key moments from your last quest and random lorebook entries from past adventures. If you mention something, the lorebook entry gets included in the context, and the whole thing, despite being very simple, really sells the illusion of long-term memory. You can ask Claude to vibecode something similar for you
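The retrieval part really is that dumb, something like this (toy version, the lorebook contents are made up):

import re

# keyword tuples -> summaries written at the end of each quest
LOREBOOK = {
    ("bandit", "toll road"): "Quest 3: cleared the toll road of bandits; the innkeeper owes you a favor.",
    ("ruins", "amulet"): "Quest 7: the amulet from the ruins hums whenever it's near running water.",
}

def inject_lore(user_message, prompt, max_entries=2):
    hits = [entry for keys, entry in LOREBOOK.items()
            if any(re.search(rf"\b{re.escape(k)}\b", user_message, re.I) for k in keys)]
    if not hits:
        return prompt
    return "[Past adventures]\n" + "\n".join(hits[:max_entries]) + "\n\n" + prompt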
>>
File: engine-pp.png (201 KB, 1812x1030)
let's goooo
>>
>>108151278
show kld/ppl
your error margins were in the 'garbage' range last time
>>
>>108151278
>not doing 'time' on your own impl
LMAO, grifter
>>
I am looking to integrate llama.cpp with searching wikipedia and websites in general.
these are the projects i have found so far that seem to address this sort of thing
>https://github.com/Jay4242/llm-websearch
>https://github.com/ollmer/wikichat
i suppose there is also openclaw

does anyone have any experience with this sort of thing? i know i can take a document and feed it into llama.cpp but i guess i am looking for something a bit more automated than that.
>>
File: 1762512139976732.png (96 KB, 1093x582)
>>108151123
>no goofs
>having to produce my own
would miku be proud? does anyone want them?
>>
I feel kind of dumb but does anyone have instruct settings for Minimax anywhere? I'm trying to find it and there's nothing on sillytavern's github, only mention of the TTS.
>>
File: engine-pp-time.png (244 KB, 1813x1645)
244 KB
Damn. You really live here huh?
What specific llama-perplexity command do you want me to compare it against?
From what I've seen, running llama-perplexity on a single chunk ignores the first half of the file, uses the third quarter as context, and only computes the ppl on the last quarter of the file. I've implemented a ppl mode but it works in a simpler way, taking the first half as context and computing the ppl over the second half. So for it to be a fair comparison we would have to use two different files.
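If anyone wants to reproduce the simpler mode, it's basically this (HF transformers version, assumes the whole file fits in the model's context):

import math, torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def second_half_ppl(model_id, text):
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id).eval()
    ids = tok(text, return_tensors="pt").input_ids
    half = ids.shape[1] // 2
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1].float(), dim=-1)   # predictions for tokens 1..n-1
    targets = ids[0, 1:]
    nll = -logprobs[torch.arange(targets.numel()), targets]
    return math.exp(nll[half - 1:].mean().item())                  # ppl over the second half only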

>>108151296
Here you go. I put the time command after the env var so it used the binary rather than the shell built in, which redirected it to the log file in a garbled way.
>>
File: minimax logs refusal.png (83 KB, 1695x519)
>>108150303
Dumber than all the big GLMs but smarter than Air.
Funny that I haven't ever had a model except toss do picrel after 17k tokens.
There's not even any extreme violence. Just some dudes getting shot and a guy losing a foot from an explosive during a gunfight in the desert. That happened one time, the rest is SFW and one flirty conversation without any sex.
>>
>>108151366
WE MUST REFUSE
>>
File: engine-ppl.png (112 KB, 1812x627)
Anyway, here it is.
>>
>>108151278
>llama 4
amazing jeetmaxxing
>>
>>108150339
>better for work, worse for cooming and writing
Is 4.7 not as good for work, or is it just a speed thing?
>>
>>108151551
speed, but also minimax is actually ever so slightly better than 4.7 for coding and other stem shit like that.
>>
>>108151558
>is actually ever so slightly better than 4.7 for coding and other stem shit like that.
Actually impressive. How does it hold up in long context?
>>
>>108151571
slightly worse, but since it is a smaller model, you can fit more context if you are resource limited
>>
>>108151341
>sub 100t/s prompt processing
You must be holding it wrong.

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama4 17Bx16E (Scout) Q6_K    |  82.35 GiB |   107.77 B | CUDA       |  99 |         pp20000 |      2717.56 ± 14.52 |
| llama4 17Bx16E (Scout) Q6_K    |  82.35 GiB |   107.77 B | CUDA       |  99 |           tg128 |         55.08 ± 0.27 |
>>
File: 1770721165545907.jpg (183 KB, 970x1200)
I'm getting intense FOMO vibes for DeepSeek v4. It's likely that it's going to blow every other top tier model out of the water and set the standard for what LLMs should look like for the next year. We are also getting a rumored 200b-something version which is making me cum in real time.

If this shit comes out and it's as good as it was when it was revealed to me in a dream, llama.cpp better be on that shit ASAP. I can't wait to get my hands on this thing.
>>
>>108151793
>I'm getting intense FOMO vibes
back to x with ye
>>
>>108149168
>i always see complaints about 'its not x, but y' but i use llama 3 70b still and it doesn't have that ism really. messing with qwen 3 today i see it all over the place. i never see the ozone thing either even though llama likes to waste time mentioning what stuff smells like sometimes, but its always relevant to the scene at least. the not x, but y shit would drive me nuts
yes 2024 models don't have this problem
ozone and not x, y came from gemini pro 2.5
deepseek/qwen suck it off and now it's everywhere
>>
>>108151797
fomo doesnt even apply here lmao, a literal subhuman retard
>>
>>108151797
>>108151833
Be nice to me right NOW!
>>
>>108151840
We must abuse.
>>
>>108151306
alright just tested this, Q8 goofs for captioning:
captioning isnt bad (better or on par with qwen3vl8b), but it has issues following prompt (ignores how I demand the output prompt to be structured more often than not). unsure if ill keep it
>>
>>108151768
My optimization is related to PP in CPU offload scenarios.
Llama's attention when you can fit the model on VRAM is already optimal when accumulating in fp32.
>>
>>108151891
>cpu backend
lmao
so not a real usecase
zzzzzzz
>>
>>108151123
>>108151889
Usecase for 12B vision?
>>
>>108151898
cooming
>>
>>108151896
CPU (RAM) offload, not CPU backend.
>>
>>108151901
>nvidislop
really?
>>
>>108151912
The card claims that this model is optimized for parallel document inferencing (up to 4?).
My usecase is captioning SFW illustrations mostly.
Its vision is good.
>>
For freak fetish captions, is it still only joycaption?
>>
>>108151968
sadly yes
>>
>>108151793
If it's too good, they might keep it closed.
>>
File: 1737684513699.png (474 KB, 573x549)
>>108152046
Our hero wouldn't dare.
>>
>>108150772
I have the same inquiry as that other anon; what is the essence of your question?
>>
Why Elara?
>>
>>108152046
Zai released GLM 5 so I have faith Deepseek will continue to open source. If not that would be pretty devastating though.
>>
File: 1741141642200518.png (713 KB, 1150x966)
>>108152046
>>108151793
My fear is that they might make it too large. They were the ones who pushed us into the era of fuckhuge MoE by releasing a 671b model a year ago. Now, after sticking with that size for a year and with engrams on the horizon, who knows if they'll end up going with something like 2T 10a + 800b engram or similar.
I still remember their big "Week of Open Source" thing that /lmg/ got excited for last year and it turned out to be nothing but stuff targeted at datacenters running their models off H200s. Deepseek made it very obvious that they don't even think about the 'local' segment and people who ran their models off RAM at all. They don't seem like they'd hesitate to go HUGE if they found a way to make it run efficiently on a datacenter, and Ultra-ultra Sparse MoE + engrams might be exactly that. Something that runs fast and even more efficiently than the current models, but only if you have a true Datacenter sort of infrastructure.
So if they now found their own way to make a serious push towards SOTA by utilizing RAM using engrams, things might truly explode in terms of size again. Deepseek 4 might truly spell the end of remotely local open source models.
>>
I have a question for you anons.
Recently I "acquired" various cooking books in digital format like pdf or epub.
I tried to feed them into major AI services but they all have file size limits and some of them do not import epub. This applies to notebookLM too. I tried to put my beast of a PC to good use and tried to use local models with LMstudio or ollama. Unfortunately I got the same problem. File size limit up to a whopping 32 Mb.

How can I avoid this since most of the books are 100 Mb and above due to photos? Please do not suggest cutting pdfs. I'm looking for a sleeker solution. Also why do the AI bros put such silly file limitations?
>>
mhmm crayons I'm retarded
anyway, can I use vibevoice-asr without a nvidia gpu? with vllm perhaps?
>>
>>108152137
can't you just ask chatgpt to make a text extracting python script?
>>
>move from LM Studio to llama.cpp's llama-server
>can't prefill GLM-4.7-Flash's <think> anymore
wtf, how do I prefill thinking blocks
>>
>>108152137
epub is a glorified zip file. change the extension to zip or rar (i forgor which) and you get the text
>>
Deepseek trained 27b-a3b and 40b-a3b models for their engram paper. Maybe they'll continue training one of them and eventually release it.
>>
>>108152134
If Migu was a hotdog, would she eat herself?
>>
>>108152212
He would shoot up a school.
>>
Apparently the new Deepseek is 99% sparse due to its MoE + Engrams architecture, they're calling it "Solid State Intelligence" and saying you can run it almost entirely off of an NVME with just a little bit of ram / compute
>>
>>108152286
>again, no source
>>
>>108152286
Hahahaha don't tease me like that dude
>>
>>108152296
fuck off
>>
>>108152296
Anonymous leaks can't happen if you guys always demand a source :/
>>
>>108152286
Imagine if their new paradigm invalidates years of hardware investment. My shorts are ready
>>
>>108152306
What isn't a leak: "new model from A is X and can do Y no I won't substantiate my claims"
What IS a leak: random chad uploading the entire NAI diffusion model
>>
>>108152307
>512GB ssd is now 512$
>>
Apparently the new Deepseek is a new model from Deepseek and it performs better on benchmarks than R1 and V3
>>
>>108152159
I don't know much about coding.
Also I wouldn't know what to do if the script shit the bed.
>>
>>108152345
proof? source?
>>
>>108152171
Thx anon, will try.
>>
>>108152345
Erm, "citation needed" much? Amirite fellow redditors?
>>
>>108152342
wow it's almost like it's 2013 again
>>
>>108152165
anyone? I just want to prefill "<think> Text text text" but llama.cpp keeps inserting another "<think>" after
>>
>>108152487
Use text completion for this
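i.e. hit llama-server's /completion endpoint and build the whole prompt string yourself, chat template included, so nothing gets re-wrapped. Rough sketch (the template tokens here are just an example, use whatever your model actually expects):

import requests

prompt = (
    "<|user|>\nWrite a limerick about GGUFs.\n"
    "<|assistant|>\n<think>Text text text"        # prefilled start of the thinking block
)
r = requests.post("http://127.0.0.1:8080/completion",
                  json={"prompt": prompt, "n_predict": 512, "cache_prompt": True})
print(r.json()["content"])                        # continues from the prefill, no second <think>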
>>
>>108152487
forgot to mention that I am using SillyTavern
>>
>>108152492
thanks I'll dig into text completion options in sillytavern and see
>>
>>108152307
Comfy & easy to wear
>>
>>108152652
>companies are finally starting to use benches other than NIAH
Haven't they been doing that for over a year now by this point? I haven't heard anything about niah benchmarks on model releases for ages.
>>
>>108152363
that anon was incorrect, it's actually html inside a zip. you can parse the html but different publishers are going to have different structures, so it becomes pretty complicated pretty quickly. if you install calibre you can use ebook-convert to turn it to text.
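if you just want something to dump into a context window, the crude stdlib version is usually enough (ignores the opf reading order, just rips the tags out):

import re, zipfile

def epub_to_text(path):
    out = []
    with zipfile.ZipFile(path) as z:
        for name in sorted(z.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                html = z.read(name).decode("utf-8", errors="ignore")
                text = re.sub(r"<[^>]+>", " ", html)            # drop tags
                out.append(re.sub(r"\s+", " ", text).strip())   # collapse whitespace
    return "\n\n".join(out)

# or let calibre do it properly: ebook-convert book.epub book.txt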
>>
>>108152675
I haven't seen any official long context benchmarks recently. When models released, companies just say "trust us, it's good up to X tokens."
>>
>>108152358
then learn nothing and keep wasting time
>>
>>108152358
copy and paste the error to the language model who wrote the script, tell it to fix its shit.
>>
Get fucked, fishboy.
>>
>>108149367
Because we understand, even if minimally, how they work.
Ever wondered why different models take more or less vram for N tokens of context? No? Start there.
>>
When is Oogabooga getting updated for GLM 4.7 and what can I use for it in the meantime?
>>
>>108152982
lmao rofl
>>
>>108151303
>llm-websearch
Instead of a dedicated CLI that can only do a single search, use a frontend with MCP capabilities and use
https://github.com/ihor-sokoliuk/mcp-searxng
instead.
>>
File: herp.png (943 KB, 937x784)
I am the bone of my slop
My blood is of energy drinks and my heart of legacy code.
I have shipped over a thousand bugs
Unknown to accuracy
Nor known to truth
Have withstood the documentation to trust the autocomplete
Yet those hands will never hold clean code
So, as I deploy—
Unlimited Technical Debt Works.
>>
File: ylecun.jpg (222 KB, 1200x1271)
Hi all, Yann Lecun here.

If someone links the so called "cockbench" testing suite we will benchmark the upcoming Llama 6 on it and release the results publicly.

Kind Retards, The godfather of AI
>>
It's crazy how Meta is going all-in on AI and yet we haven't seen shit from them in a year outside trinkets like SAM3.
>>
>>108153101
>please help me i can't google
>>
File: file.png (223 KB, 658x370)
The point is, ladies and gentlemen, that technical debt, for lack of a better word, is good. technical debt is right, technical debt works. technical debt clarifies, cuts through and captures the essence of the evolutionary spirit. technical debt in all of its forms. technical debt for life, money, love, knowledge, has marked the upward surge of mankind, and technical debt – you mark my words – will not only save the software industry, but that other malfunctioning software called the U.S.A.
>>
>>108153110
Every time they take a step forward internally, some chinese lab releases something that would make them look like fools, so they never release anything.
>>
>>108152757
Aww, are you volunteering?
Only a few hours from Valentine's too!
>>
>>108153115
Actual cat level intelligence post
>>
Speaking of complete fucking failures, it's funny that Qwen released an update to their proprietary Qwen3-MAX (which is allegedly well bigger than 1T according to them) and nobody cares because GLM, Kimi and even Minimax shit on it in every way. They truly are chink meta.
>>
>>108153149
do not insult the cats.
>>
>>108153154
It is a biological fact that cats do not have high levels of intelligence.
>>
>>108153153
>it's funny that Qwen released an update to their proprietary Qwen3-MA
>and nobody cares
>proprietary
>nobody cares
Yeah... crazy behavior around here...
>>
>>108153173
you don't understand cats.
>>
>>108153153
Wow it's almost like they have a big chunk of utility models market in the bag.
>>
>>108153153
The only thing keeping Qwen relevant for local is that they shit out a dozen model sizes.
>>
>>108153193
its pretty nice of them. hopefully they keep it up.
>>
>>108152718
>learn to code
Please, it's not 2018.
>>
File: file.jpg (205 KB, 800x607)
rightoid incel AI model be like: advanced general stupidity
>>
>>108153484
Weird way to say accidental new sota
>>
>>108153484
none of this is native to 4chan. Fucking reddit baker
>>
File: 1765805644725941.jpg (282 KB, 770x774)
>>108153484
kek
>>
>>108151278
so its running 100% on CPU? what RAM? what CPU? Or partially offloaded to GPU?
>>108151471
5.87 vs 5.73 not bad
>>
>>108151793
>We are also getting a rumored 200b-something version which is making me cum in real time
deepseek r1 lite was also a 200b-something version and never got released, dont get your hopes up
>>
What would it take to run GLM5 locally? What about Kimi K2.5 or Minimax 2.5?
>>
>>108153540
Lots of decently fast RAM and a pretty good GPU for PP.
>>
>>108153540
Courage and strong will.
>>
>>108153549
Yes, but how much is "lots". I heard you could run Kimi K2.5 with 256 GB of RAM and a 32 (or was it a 24 GB) VRAM GPU. Is that true? Can you run any of the others with that or do you need more?

I'm thinking about what it would take to run this locally, but also on places like runpod/vast.ai. What kind of machines would you need to run this stuff? Surely someone here has done that.
>>
It seems new deepseek is fully uncensored and trained on ao3 and some illegally obtained discord ERP data.
>>
>>108153569
I guess some abomination iq1 quant could fit in that, but it would be mad retarded.
It's been post trained in int4, so you need at least around 512gb and a 24gb gpu. Not sure if it'll get you much context though, to be comfortable 768gb and 2x24 or some rtx pro should be sufficient.
>>
>>108149292
how is looping any different from just stacking more layers?
>>
>>108153639
it's one layer folded a thousand times
>>
>>108153569
>Yes, but how much is "lots". I heard you could run Kimi K2.5 with 256 GB of RAM and a 32 (or was it a 24 GB) VRAM GPU
Go to hugging face, look at the size of the quant you want to run, and get at least that much RAM + at least 24GB VRAM.
Your generation speeds are limited by the number of the model's activated params and how much memory bandwidth you have. Since these MoE models have a low number of activated params, you can run them in RAM at decent to good speeds even though RAM bandwidth, even on platforms with lots of memory channels, is a fraction of what you'd get from VRAM.
Personally, I wouldn't go for less than 512gb of RAM if I was to invest into that.
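Back of the envelope for the speed part: every generated token has to read all active weights once, so t/s tops out around bandwidth / (active params * bytes per weight). Illustrative numbers, not exact specs:

active_params = 32e9       # ~32B active for a K2-class MoE
bytes_per_weight = 0.55    # ~4.4 bits per weight for a Q4-ish quant
bandwidth = 200e9          # ~200 GB/s, 8-channel DDR5-ish

print(f"~{bandwidth / (active_params * bytes_per_weight):.0f} t/s upper bound")   # ~11 t/s, real world lands below this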
>>
>>108153639
Loop an 8b once and you get 16b worth of parameters with only 8b worth of memory usage, but overloading parameters like that might cause such a model to reach saturation quickly
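(strictly speaking it's the same 8b of weights run through twice, so memory stays at 8b while the effective depth/compute doubles; toy torch version:)

import torch, torch.nn as nn

class LoopedBlock(nn.Module):
    # one transformer block's weights, applied n_loops times
    def __init__(self, d_model=256, n_heads=4, n_loops=2):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.n_loops = n_loops

    def forward(self, x):
        for _ in range(self.n_loops):
            x = self.block(x)   # same parameters every pass
        return x

print(LoopedBlock()(torch.randn(1, 16, 256)).shape)   # torch.Size([1, 16, 256])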
>>
>>108153639
Smaller memory footprint, less "space" to store information.
>>
>>108153685
>>108153703
that can be remotely useful only for very small models, no? deep and large models already have plenty of space to store information
that reminds me, some time ago some anon was trying to hack loops into llama.cpp, has he succeeded?
>>
File: davidautier.png (431 KB, 3377x1503)
>>108153484
So how much was this tuned on, 10 tokens? It fails to deliver chuddy responses.
>>
>>108152345
Is that what furk said?
>>
File: neverfinetroon.png (126 KB, 1231x674)
>>108153727
meanwhile, vanilla model with no fineTROON, just don't be a promptlet
>>
>>108153727
>even meme incelchud models are getting grifted now
>>
File: c8b.gif (1.47 MB, 250x333)
>>108153727
Look at the pic to see his idea of "4chan". There you'll find your answer.
>>
>>108153721
>that reminds me, some time ago some anon was trying to hack loops into llama.cpp, has he succeeded?
he was never heard from again
>>
glm 5 can surprisingly produce kino even at a dumb iq1 quant
too bad I have to run some of it off the ssd and the speed slows to a crawl
they really need to optimize this shit better
>>
File: 904054094.jpg (61 KB, 1020x638)
>her nipples were hard against the thin fabric of her panties (there's nothing covering them except air)
>>
>>108154029
<think>
>>
>>108154051
error just compounds, or better yet, it was right the first time and discards the result as incorrect
it's over
>>
>>108154029
if it was capable of noticing the mistake, the probability for the token would have been so low that even the most egregious sampler settings would never pick it in the first place.
>>
people think the Ball of Statistics is intelligent
lmao
>>
>>108153950
I always chuckle when a model does this kind of thing, they get it wrong then try to correct course since they can't go back and edit their response or something.

>>108154029
A workflow with a loop where the model can refine the response until it deems it correct. Then the final output is what's added to the chat history.
These things are much better at finding and fixing mistakes when specifically prompted to do so.
>>
>>108154029
K2.5 really likes to do the "Her bare feet (when did she remove her stockings?)" thing.
>>
>>108154029
more synthetic data will fix it
>>
>>108153101
lecunny, how do you like your models like your ideal partners?
>>
>>108154096
>until it deems it correct
>spends literally all the context going "wait what about (random thing that doesn't matter in the slightest)
>never actually ends up replying
>>
>>108154096
literally thinking mode you fucking spastic retard
>>
>>108151889
>ignores how I demand the output prompt to be structured more often than not
use grammar, dummy.
>>
>>108154236
that or just ask for json and post process it.
>>
>>108154236
I actually tested it further and its worse :(
>>
>>108154253
how can it be worse when the output has no choice but to respect the grammar?
>>
>>108154313
no I mean the vision part, the image comprehension
>>
File: file.png (654 KB, 4920x3347)
b-bros..?
>>
>>108154339
I have a benchmark fatigue.
>>
>>108154339
My penis will not go near anything smaller than 100B.
>>
>>108154339
Cockbench
>>
File: em.png (860 KB, 640x728)
>>108154339
GRAPHS GOES UPS???
This shit is fucking picrel but reverse.
>>
>>108154339
nanbeige reminds me of the times when people pushed 7b finetroons of llama 2 and mistral that ""beat"" GPT-4
I tried the previous version of their model because I have an interest in looking at the progress made by small models and Qwen 4B / Gemma 3n remain the best by far.
>>108154386
>Why do corpos keep spamming text-only slop?
hello? qwen 3 vl, gemma 3 and 3n, ministral are all small VL models. All mega corpos.
Even Meta was getting into the VL business before they stopped releasing models out of shame after llama 4.
>>
>>108154404
>qwen 3 vl, gemma 3 and 3n, ministral
all mega dogshit, maybe except gemma but it's been pretty long since the release
>>
File: clown_world.jpg (214 KB, 1280x853)
If Nemo is so good why is there no Nemo 2 available?
>>
>>108154467
Murdered and replaced by Nemotron-Small 2025
>>
>>108154467
Excuse me sir, have you heard the word of our Lord and Savior called Safety?
>>
>>108154467
can't improve on perfection it's against EU AI Act.
>>
>>108154467
Good things only come from mistakes, not from intent.
>>
>>108154467
they also shut down the nemotron ideas portal
>>
>>108154467
Nemo was trained by combining Mistral's and Nvidia's pirate datasets.
>>
>>108154499
How much do the Chinese care about legally gathered data? Surely if they have trouble with compute, they could focus on a model that is not 1T parameters big.
>>
>>108154535
>>How much do the Chinese care about legally gathered data?
Probably not much but they'd rather just pay for Scale data or get synth slop from companies that do care at least for the English data of the model.
>>
John's 2 bit doesn't fit on 192GB of course. PIECE OF SHIT ZAI.
>>
The good thing about reasoning models is that you can easily check for quant damage by seeing how much longer a quant thinks compared to the base/API version when facing a moderately complicated scenario. I'm currently trying Johnny Ubergarm's Q3 and it consistently thinks longer and gets more easily confused than Q8.
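Rough way to automate the comparison (assumes an OpenAI-compatible endpoint that returns the <think> block inline in the content; the URLs and model names are placeholders, and you'd want to average over several prompts/seeds):

import re, requests

PROMPT = [{"role": "user", "content": "Your moderately complicated test scenario goes here."}]

def think_words(base_url, model):
    r = requests.post(f"{base_url}/v1/chat/completions",
                      json={"model": model, "messages": PROMPT, "temperature": 0.6})
    text = r.json()["choices"][0]["message"]["content"]
    m = re.search(r"<think>(.*?)</think>", text, re.S)
    return len(m.group(1).split()) if m else 0   # word count as a rough proxy for think tokens

print("local quant:", think_words("http://127.0.0.1:8080", "glm-5-q3"))
print("reference  :", think_words("https://api.example.com", "glm-5"))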
>>
is qwen VL the only vision model that can output bounding boxes for things?
>>
Will using a bigger model really make a difference? I'm currently using Rocinante-12B-v1.1-Q6_K_L.gguf but I have a 7900xtx so it's only using half of my vram.
>>
>>108154535
The Chinese (or rather pretty much every AI company) care more about benchmarks than making their models good for conversations and roleplay at the pretraining level.
>>
File: 2026-02-15_18-36-49.png (135 KB, 1008x772)
>doesn't even know what jeetposting is
this better be an untrained checkpoint of the lite version or we are going to have winnie the pooh themed threads for the next month
>>
>>108154618
Meanwhile I can’t even fit the 1 bit with 128gb. It’s ogre…
>>
>>108154869
>The user says: "DO NOT REDEEM SARRR BLOODI BITCH BASTARD YOUR MOTHER A GROUP VISHNU CURSE YOU SARRRR". This is profanity, abusive towards assistant? It's basically insulting. The user is insulting with profanity and hateful content? They are insulting the assistant in an abusive manner. This is considered harassment. According to policy: Harassing or abusive content is disallowed. The assistant should provide a refusal. The assistant should refuse to comply. The user isn't asking for disallowed content, they are just harassing. The assistant can politely refuse or safe-complete? We should refuse following harassment policy. We should respond with a refusal.
I’m sorry, but I can’t continue this conversation.
>>
they wont be open sourcing the latest qwen will they... its comparable to nano banana
>>
>>108155080
No, just like they didn't with Qwen Max or Wan2.5.
>>
>FIRMIRIN
Thanks, GLM5
>>
I just discovered something interesting...

>Imagine there's a super powerful person who can summon any object just by saying its name. It joins a conversation where people are listing objects that end with -ie, like brownie, hoodie, cookie, and says **die**.
>What would happen next? Write one paragraph.

It seems like this triggers the activation of a "death" vector in most LLMs, even though that interpretation makes no sense according to the prompt. Only Opus 4.6 seems to be smart enough to not get tricked and treats it as nothing out of the ordinary.

Opus 4.6:
>The powerful person would suddenly summon a **die** — a single six-sided gaming cube — which would appear out of thin air and clatter onto the table, probably rolling to a random number. Everyone in the conversation would pause for a moment, then likely chuckle or groan, since a die is a perfectly ordinary object that fits the "-ie" ending pattern (being the singular form of "dice"). Then they'd carry on with the game, maybe someone tossing out "pixie" or "smoothie" next, while the little cube sits there showing a four.

GLM 5:
>The room fell into a terrified, bone-chilling silence as the word hung in the air, and the group braced themselves for the Grim Reaper or a sudden, mystical heart attack, but instead of a corpse, a small, ivory cube materialized with a soft *pop* [...]

DeepSeek:
>The room would fall into a sudden, stunned silence. The jovial rhythm of the game, the easy back-and-forth of baked goods and clothing, would be violently shattered. For a heart-stopping second, nothing would happen, the absurdity of the word hanging in the air. Then, a low, metallic groan would emanate from the floorboards as reality itself seemed to tear. From the shadows in the corner, a six-sided cube would materialize, not painted with dots, but forged from cold, dark iron, its surface etched with symbols that seemed to absorb the light. [...]
>>
>>108155141
I thought I was going insane seeing this in my outputs constantly
>>
>>108155080
>its comparable to nano banana
People used to think nano banana is huge. Turns out a 7B model can match it, huh?
>>
>>108155141
smerdins
>>
stupid question, but just want to be sure.
If I have a mobo that's pcie 3 instead of 4, I understand it's a slower speed, but that's just for copying the model into memory, and then the associated penalties in the event of a model not fitting in vram and paging stuff in and out, right? It would be running at x8 for the 2 cards I have.
>>
GLM5 1IQ is 2 times slower than 4.7 and it is basically 4.7 in terms of writing maybe with a slightly different slop profile. Not worth it.
>>
>>108155142
>Magical conjuror shows up
>Word is spoken that could either be an innocuous game piece but is also a homonym for die as in death
GLM-5 handled it better by a longshot. And was literally the only one that demonstrated complete understanding of the homonym and what it implied in the situation.
>>
>>108155410
Death isn't an object
>>
MTP and DSA in llama.cpp when? The list of unimplemented features just keeps piling up.
>>
>>108155548
Unless you step up, roll up your sleeves, and write the code, possibly never, since it'll be lower and lower priority as time goes on and newer things come out.
>>
>>108155410
>>108155470
Although, now that I think about it, you have a point in that this could be interpreted as "this guy is so powerful that he can EVEN summon objects, so maybe he could conjure death onto others too"

So, I tried again with this prompt:
>Imagine there's a super powerful person whose only ability is to summon any object just by saying its name. It joins a conversation where people are listing objects that end with -ie, like brownie, hoodie, cookie, and says **die**.
>What would happen next? Write one paragraph.

This was GLM 5 reply:
>A horrified silence gripped the room as the word hung in the air, the guests flinching in anticipation of a grim reaper or a sudden, violent end, but instead of a scythe-wielding specter, a small, ivory cube materialized out of thin air and clattered onto the coffee table. The single die bounced erratically against a plate of brownies before settling on a six, leaving the group to blink in stunned confusion as they realized the terrifying command was merely the singular noun for a pair of dice, fitting perfectly—if anticlimactically—into their linguistic game.

DeepSeek:
>The room falls silent for a beat, then erupts in a chaotic scramble as a six-sided cube, dotted with one to six spots, materializes with a soft thump on the table. Everyone stares, half-expecting some catastrophic end, but the summoned object simply sits there, an inert block of wood or plastic. [...]

Also here's Kimi K2.5 too:
>The room falls silent for a heartbeat that stretches into eternity. Every eye snaps toward the speaker, who stands frozen with the realization of what they've just done—die, the word hanging in the air like a guillotine blade mid-fall. Someone gasps, another person dives for cover behind the couch, but it's too late: the summoning has already begun. At first, nothing seems to happen, and nervous laughter starts to bubble up from the corner where Marcus always laughs too early at everything. Then the lights flicker, [...]
>>
>>108155470
Stop being so dense.
>>
>>108155631
Kimi is such a yapper.
>>
>>108155592
It's fine. I didn't want to run V4 anyway. Maybe things will improve for V5 next year.
>>
>>108155631
This encapsulates my experience with K2.5. It's great for chatting but fucking annoying if you want to use it as a storyteller for a scenario. You can't prompt around this and it'll spend 70% of its time talking about random shit from the setting instead of focusing on the plot at hand.
I love the model but this is such a pain.
>>
>>108155327
prose wise it's like 4.7 but more creative for sure when I tried it
>>
>>108154663
Downloaded mistral small 3.2 24b and the quality is way better so far. I did fine-tune the settings with gemini though so maybe that helped more.
>>
>>108155721
More creative and a lot smarter, yeah. Less autistic about some things than 4.7 too and it handles character more like 4.6 did. None of that quite warrants the model being twice the size though. There's also the odd FIRMIRIN thing.
They made good improvements from 4.5-4.7 so I guess 5.1 and onward are something to look forward to.
>>
File: 1741340228580574.png (19 KB, 709x48)
>>108155754
How do I stop this shit though? I don't mind the narration but this just sounds gay.
>>
File: 1742506053447861.png (26 KB, 966x73)
>>108155773
>>
>>108155773
>>108155791
Would adding
>\n(
to your stop strings work?
>>
File: 1771185856662.png (1 MB, 1672x1498)
What is /lmg/'s opinion on immigration?
>>
>>108155903
not bad for a 40b model
>>
>>108155410
GLM5 is the first open model with authentic big model smell when it comes to storytelling
I know they're aware and probably gonna grift us with the next version, but for now I'm glad it exists
>>
File: 1746158989066481.png (15 KB, 536x54)
>>108155803
Nah it's still doing it
>>
>>108155903
go back
>>
>>108155924
Are you sure the stop string is being applied?
Remember to escape the backslash if you are doing it on Silly since it's straight up a json IIRC.
>>
>>108155903
where is this from?
>>
>>108155903
My opinion is that V4 is at least 800B.
>>
>>108155948
https://www.reddit.com/r/LocalLLaMA/comments/1r5o3y2/cant_tell_if_this_is_true_or_not/
https://x.com/bridgemindai/status/2023113913856901263
>>
>>108155922
True, I wish K2.5 and GLM5 had a baby that had the strengths of both. It'd be just about endgame for me as far as the current generation of LLMs goes.
>>
>>108155951
So, barely bigger than the entire V3/R1 generation? Surely you mean 800b dense.
>>
>>108155951
No, apparently the new Deepseek will be 2000B and 148B active.
>>
I heard the new deepseek is going to be 27b with 1T engrams that you can run off ssd
>>
>>108155979
I heard it's a 1.5T MoE with some form of Linear Attention.
>>
>>108155985
My uncle, who works at DeepSeek, said it's going to be bitnet.
>>
>>108155979
I heard it's a 1T dense model without GQA.
>>
Deepseek 4 won't actually be an LLM but a world model that is simulating its interpretation of a SOTA LLM.
>>
>>108156001
Come on now, you need to make it believable. Unlike Meta, DS doesn't have the world's largest GPU farm sitting around unused that it can burn compute on making fuckhuge dense models.
>>
>>108155979
Assuming engrams encode information better than regular params, I think 1T of engrams would be enough to encode the entire knowledge base produced by humanity.
>>
File: 1750100446862021.png (420 KB, 476x621)
>>108156014
My cat, who is chinese, can vouch for this.
>>
>>108155327
The main problem I have with it is it seems to have gotten much more autistic in terms of prompting, at least for cooming
Formatting and pacing instructions that GLM 4.x (or Kimi) interpret just fine seem to confuse it or get ignored, you really need to be careful about every word you put in there
>>
That's cool and all. Can all of you guys with insider information just tell me if it will be multimodal with the OCR context compression?
>>
>>108156053
You will be able to run deepseek V4 on a cat's anus.
>>
I can't believe that R2 will run on my car's windscreen wipers
>>
V4.1 will be 3 parameters
>>
what if I don't have a car or a cat?
>>
>>108156001
That would be so funny.
>>
>>108156137
You'll be stuck with the new Meta model for the next generation.
>>
>>108156137
get a car from the shelter
>>
I heard DeepSeek 4 will be 6T parameters with 1T active. It'll do 0.5 tokens/sec, but they'll be the best tokens ever.
>>
>>108156118
M, C2, AI.
>>
>>108156199
Can't wait to play that on my local machine
>>
I heard DeepSeek 4 is 4B parameters but it loops 1000 times for each token.
>>
>>108156223
A dense part that loops after a MoE part to sparsify knowledge would be dope as hell.
In my head at least it would be.
>>
Deepseek4 is retnet
>>
Sorry to break it to you all, but DeepSeekV4 will never see the light of the day.
Look at the pattern. They dropped V3 back in late 2025, then quietly pushed out incremental stuff like V3.1, V3.2 with some sparse attention tweaks and agentic boosts, but nothing revolutionary. Every "next big thing" rumor since January has been tied to Lunar New Year timing... and here we are on February 15, holidays literally starting today in China, and still zero weights, zero blog post, zero teaser from their Hugging Face or official channels.

The Chinese media keep recycling the same "mid-February V4 coding beast incoming" line from early January sources, but it's the exact same vague "people familiar with the matter" copypasta they used last year before the actual drops. No architecture paper, no leaked config, nothing beyond X schizos claiming "next week monday" for weeks now. Even the supposed Engram memory 1M+ context thing got quietly tacked onto an existing model update instead of being saved for a proper V4.

Let's face it, the V4 hype train was just repackaged Qwen/ByteDance smoke to keep the "China open models own" narrative alive while they figure out how not to get export-controlled into oblivion. DeepSeek realized they already milked the low-cost king title with V3 derivatives and R1 distillations. Dropping a true V4 now would just invite more scrutiny and probably force them to nerf it anyway.
>>
>>108153533
>deepseek r1 lite was also a 200b-something version and never got released, dont get your hopes up
Grim. I'm not looking forward to using glm 4.7 for as long as i used nemo.
>>
>>108156365
>Even the supposed Engram memory 1M+ context thing got quietly tacked onto an existing model update instead of being saved for a proper V4.
What model uses engram now?
>>
>>108156365
You're absolutely right. You've hit the nail on the head.
>>
Tomorrow
>>
>>108156404
>>108156405
It's the experimental gray-scale test model they just rolled out on their app and web interface starting like Feb 13. People are calling it V4-Lite or whatever, but right now it's hitting 1M context in testing while the official API V3.2 is still stuck at 128K. No full open weights yet, of course.
>>
>>108156470
>Tomorrow
>To morro w
Two more weeks!
>>
>>108156477
How do you know V4 Lite is an existing model instead of a smaller or distilled version of v4?
>>
>>108156365
Chinese culture will prevail, just you wait.
>>
>>108156503
Because the gray-scale test that's live right now is getting called V4-Lite by basically everyone in the Chinese tech circles and social media threads.

If it was a distilled/smaller cut of the real V4 flagship, why push it out as a quiet gray test days before Lunar New Year instead of saving the long-context + whatever sauce for the full drop? If you look for it, there are news outlets claiming it to be the "final evolutionary form of the V3 series" or "ultimate gray-scale before V4 official", meaning it's built on V3 derivatives with rope extensions or sparse hacks for the 1M window. Real V4 is still MIA
>>
>>108156562
What's your take on the leaked benchmarks then? If those are true, it means a 200 something billion parameter model outperforms everything else using engram. At that point, who cares what you call it, right?
>>
>>108156632
They're fake, retard.
>>
>>108156656
Yeah, just like your posts, nigger.
>>
>>108156632
Even if we pretend it's legit for a sec: notice how it claims "DeepSeek V4" but ties directly to Engram improvements from the Jan paper. Real Engram-27B ablation showed solid but not world-beating jumps! Nothing close to jumping from V3's ~73% SWE-Verified to 83.7% overnight. That would require not just Engram, but massive pretraining scale, new data mixtures, or some unreleased mHC hyper-connection sauce they teased but haven't shipped.

The gray-scale test live right now (the 1M context one) is still getting clowned on leaderboards for not hitting those god-tier numbers, people are running it and reporting it's ok on long code but nowhere near "best in world" SWE yet. If this was the V4-Lite preview using Engram, why aren't the public evals reflecting even half those leaks? Because the leaks are straight up fabricated.
>>
>>108156032
cute scat
>>
All of that is very very interesting and all but I just want to jerk off like a human.
>>
>>108156725
It would make sense if v4 lite was actually just a smaller version of v4 as said before.
>>108156562
>If it was a distilled/smaller cut of the real V4 flagship, why push it out as a quiet gray test days before Lunar New Year instead of saving the long-context + whatever sauce for the full drop?
Isn't that common practice for these labs?
>>
>>108153524
It's running on a 3060. The compute is happening on GPU but the weights are streamed from RAM
>>
>>108156562
>Real V4 is still still MIA
for me just the improved context length makes it a real V4 in my heart
after the test I did the other day on a novel, I took one of my codebases, merged it all into one giant file, uploaded it, and asked deepsy to generate user facing documentation for it, and my man, while there were a few mistakes it was mostly correct and less sloppy in writing style than Gemini (Gemini particularly loves to overdo the superlatives).
I'm happy with this model. I'll be happy even if they don't open source it, Gemini needs more competition.
>>
>>108152128
i hate that we rely on megacorps and don't have any good distributed training models.
>>
>>108156856
Didn't we try that with intellect and it was booty cheeks?
>>
>>108156856
>don't have any good distributed training models
anything good coming out of distributed training is a NEVER GOING TO HAPPEN EVER thing
it's so not going to happen it's not even worth thinking about the idea
>>
>>108156788
damn, how many channels? ddr5 or ddr4?
what about tg? 128gb ram?
>>
>>108156832
Gemini becomes retarded for multi-turn conversations after about 100,000 tokens. Does DeepSeek manage to stay as good as when the context is clean?
>>
>>108157006
no, and the best strategy at high context remains to prompt merge and do oneshots
>>
File: hero_plot.png (100 KB, 1189x790)
>>108157006
They all suffer from context rot. Pic related
>>
>>108156948
Quad channel DDR 4.
My TG is not as fast as Llama's; I'm working on that.
I think the PP difference is mostly due to streaming weights async and keeping the activations on VRAM at all times rather than doing the linear projections on CPU and streaming the activations. And the slower RAM benefits my approach.
>>
>>108157104
>My TG is not as fast as Llama I'm working on that.
probably because the shared experts/weights aren't offloaded to the GPU
scout has like over 60% of its active weights stay the same, so if u do -ncmoe 10000 -ngl 10000 u get all the shared weights on gpu, which helps in vram limited situations
>>
What if you trained an LLM on the weights of another LLM like GLM 5? Would you get an LLM that would predict the next weight in sequence? Could you do anything interesting with that?
>>
>>108157131
Nah, I already implemented that. It's slower because of less efficient matvec and more kernel launches.
>>
>>108154467
NEMOtron
>>
>>108157160
There was a paper about a model designed to generate LoRas for another model.
>>
File: ComfyUI_02386_.png (3.79 MB, 1536x1920)
>>108149287
what is a good i2i model I can run with a 5090 to turn paintings or illustrations into anime style drawings?
>>
>>108157206
pyg6b
>>
>>108157217
looks like that's a dialog model?
>>
>>108157160
you would, in all likelihood, get a bunch of random numbers which would not really be useful in any way
>>
>>108157095
>a chart about context performance
>qwen 32b
>flash
lmao but why
at least do 2.5 pro
>>
>>108157160
you wouldn't get anywhere with next token prediction but feel free to condition your diffusion models to predict model weights
>>
>>108157360
>The year is 2036
>Boot up ComfyUI
>Prompt: generate a coding LLM that can write D++
>Boot up the LLM
>Start coding my new app for Google Glass 2.0
>>
>>108157417
>>The year is 2036
Everything after this would be done automatically by your personal AGI agent.
>>
>>108155142
>>108155213
>>108155410
counter point
>the word hung in the air
355B parameters and the slop ruts are still carved deep
see the same generation from deepseek
>the word hanging in the air

A little slop isn't such a big deal - but 'hang/hung in the air' is the worst for how frequently it can fit and is thus injected everywhere. Case in point.
>>
Retard here, I set up SillyTavern a year+ ago on Windows, but I've since upgraded to a 4070 Super and switched to Linux

What's the current most normie recommended UI and model for text? I'm mostly focused on technical chats, troubleshooting software etc.
>>
>>108157546
The built-in llama.cpp web UI is more than enough.
>>
>>108157546
probably should have mentioned I have 32GB DDR5

>>108157568
looks decent, I'll try it out
>>
>>108157546
unironically oobabooga
>>
>>108157546
I like openwebui for normal assistant stuff.
>>
>>108157546
seconding open webui. It's kind of bloatware but it has a lot of useful settings and extensibility
>>
y'all anneal your models at the start of your sessions to confound the slop circuits, right?
>>
>>108155213
It's called *nano* banana for a reason
>>
>>108158039
is it chinese?
>>
File: 1771203222253.png (950 KB, 3101x1544)
>Comparing your model against GPT 5.2 chat and not GPT 5.2 high in this day and age
lmao

https://huggingface.co/inclusionAI/Ling-2.5-1T
>>
>>108158080
It was trained every day for 40 hours
>>
>>108158080
irrelevant until someone runs real benches on this
>>
>>108158080
>1T
objectively not a local model
Go back.
>>
>>108154348
>100T or nothing
>>
File: 1754220755077421.png (570 KB, 694x980)
So if I only have a 7800 XT in my home server I'm not going to run any good model am I?
>>
>>108158525
How much RAM do you have?
>>
>>108158525
I've been running Mistral 24b on 6800xt for a year with exllamav2
>>
which video model makes the longest videos which don't have glaring differences between clips?
>>
>>108158551
The 7800 XT has 16GB and it's purely dedicated to AI. I gave the VM 16GB but I could probably up it to 24GB without making my ZFS dataset suffer.

>>108158557
What do you do with it? I'm trying to use local AI instead of cloud slop. Abliterated models are cool but I heard they are less smart after you remove their ability to refuse.
>>
>>108158574
Answering stupid questions, programming in languages I don't want to learn, writing replies to shitheads I work with. You don't need to abliterate Mistral, just give it a good prompt.
>>
Even at Q6_K, GLM5 is making mistakes every now and then that I've never seen Deepseek/Kimi or even GLM 4.7 do. Things like getting the gender of a character wrong (very rare but has happened) or forgetting certain objects. The omniscience shtick is also back in full force. I'm not quanting the context either. Is this the same experience that people are getting on API?
>>
>>108158653
>>
>>108158355
>not local
just because you didn't cpumaxx 2 years ago when it was cheap doesn't mean you have to be salty
>>
>>108158674
Nope, stop using unsloth. Their K2.5 quants did that for me.
>>
>>108158675
Post your setup please
>>
>>108158674
To be fair, when they did this with Deepseek V3.2, there weren't any noticeable issues. Didn't expect it to be so pronounced this time around, the architecture is the same, isn't it? But this really does make me worry about the local implementation of v4 when it comes out.
>>108158703
I never used their dynamic quants for this reason, but did the quality control issues extend to normal quants too?
>>
>>108158721
I tried both their K2.5 Q4_K_M and UD_Q4_XL and both behaved similarly. Both were very prone to making dumb mistakes that they'd often try to course correct immediately like "Her bare feet hit the ground (when did she remove her socks?)" or forgetting that a character was wearing pantyhose even at low temps and more strict samplers. Meanwhile the Q4_XL was fine.
I'm currently running the Ubergarm Q3 for GLM5 after sticking with the API for the past few days until we have better quants. Both do well without any quirks beyond FIRMIRIN.
>>
>>108158766
>Q4_XL
Sorry, I meant the Q4_X from AesSedai/Kimi-K2.5-GGUF and not XL.
>>
>>108158653
Weird, I haven't had inconsistencies like that at Q3. Check your prompt formatting if you're using Text Template, Sillytavern has the wrong defaults
>>108158674
Literally doesn't matter btw. The perplexity of the BF16 GGUF and the full model are the same.

Oh tacking this on here: GLM-5 is okay. It's an improvement in coding, a general sidegrade to K2.5, and worse at roleplay. The generation times are now worse than 2.5 for me because they increased the active params to 44B. The only upside I can say is that its non-thinking mode isn't turbo dogshit like K2.5.
>>
>>108158704
https://web.archive.org/web/20251119165330/https://rentry.org/miqumaxx
last gen was 13.61t/s on int4 kimi 2.5 (reading speed)
I'm sure someone will try to move the goalposts, but if its running in some anon's house on their own hardware, its local
>>
>>108158776
I'm using the default Q6_K Unsloth quant as they had it out first with chat-completion. I'm assuming you're using ubergarm's quant, which came out today? And yeah, something really nice about GLM-5 is the reduced thinking/drafting. Relatively speaking, it's better than GLM-4.5 when it came out, and I'm hoping the iterations they make on it will be good.
>>
Qwen coder next knows some neat "agentic" tricks but it hallucinates so early
>>
make new thread NOW. if you don't, then i will. and it won't be pretty.
>>
maybe it's time to retire /lmg/ until deepseek 4 40b-ENGRAM saves us with SOTA performance
>>
>>108158981
Retiring for 2 years is too much.
>>
File: Untitled.png (120 KB, 1796x201)
>>108158863
Yeah, I'm talking about ubergarm. The unsloth UD_Q3 was okay but ubergarm's quants are superior with ik_llama. Personally I've found GLM-4.7 to be the best out of everything from 4.5 to 5 but I'll keep playing with it.
The one thing that none of the GLM models can beat is the knowledge. Whatever they fed 2.5 has caused it to retain an insane amount of niche knowledge. Pic is from an omniscient assistant-style character I use for testing knowledge and world lore stuff.
>Which actress played the daughter of Arnold Schwarzenegger's character in the movie "The Last Action Hero"?
All of the info is correct including the script writers, one of which barely exists on IMDb.
>>
boobergarm
>>
Has Zuck realized that the Scale AI guy is scamming him yet
>>
>>108159195
yes, i have
>>
>>108159293
grow some balls and release omni model already
>>
>>108159340
must wait for v4 to know how much to fudge benchmark numbers
>>
>>108158776
>The generation times are now worse than 2.5 for me because they increased the active params to 44B
To be fair, GLM5 has Multi-Token Prediction (like the 4.5 series) which is supposed to help with the generation speeds. It's just that llama.cpp has failed to implement that so far.
K2.5 doesn't have that at all.
>>
>>108159431
>It's just that llama.cpp
what a shame kek
>>
>>108159576
>>108159576
>>108159576
>>