/g/ - Technology


Thread archived.
You cannot reply anymore.




File: rinbox.jpg (576 KB, 2048x2048)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108175259 & >>108166576

►News
>(02/16) Qwen3.5-397B-A17B released: https://hf.co/Qwen/Qwen3.5-397B-A17B
>(02/15) dots.ocr-1.5 temporarily released: https://hf.co/rednote-hilab/dots.ocr-1.5
>(02/15) Ling-2.5-1T released: https://hf.co/inclusionAI/Ling-2.5-1T
>(02/14) JoyAI-LLM Flash 48B-A3B released: https://hf.co/jdopensource/JoyAI-LLM-Flash
>(02/14) Nemotron Nano 12B v2 VL support merged: https://github.com/ggml-org/llama.cpp/pull/19547

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: threadrincap2.png (1.01 MB, 1536x1536)
►Recent Highlights from the Previous Thread: >>108175259

--Paper: GLM-5: from Vibe Coding to Agentic Engineering:
>108178880 >108178931 >108178986 >108178991 >108179477 >108179528 >108179575 >108179585
--llama.cpp praised for quality despite C limitations:
>108178026 >108178082 >108178125 >108178137 >108178185 >108178220 >108178384 >108178206 >108178233 >108178237 >108178764
--ERP model setup advice and KV cache quantization optimizations:
>108178975 >108179046 >108179568 >108179613 >108179717 >108179749 >108179817
--LLMs inherently flawed for creative writing:
>108180078 >108180162 >108180247 >108180271 >108180305 >108180423 >108180267 >108180291 >108180306 >108180333
--Mixed GPU llama.cpp Vulkan performance testing:
>108179127 >108179160 >108179170 >108179180 >108179338
--Anthropic funds AI regulation group ahead of 2026 election:
>108177291
--Qwen model exhibiting abnormal repetition behavior:
>108181088 >108181142 >108181474 >108183349 >108183361 >108183441 >108181209 >108181930 >108182004 >108182020
--GLM-5 repeatedly generating "FIRMIRIN" investigated:
>108179926 >108179962
--Claude Code Policy update restricts OAuth token usage:
>108182126 >108182672 >108182729
--Zhipu AI's anonymous GLM-5 release as Pony Alpha on OpenRouter:
>108179589
--Latent space reasoning potential for improving model coherence:
>108180341 >108180380 >108180712 >108180970 >108180864
--Vulkanised 2026: Vulkan Machine Learning in ggml/llama.cpp:
>108179979 >108180832 >108180869
--Miku (free space):
>108175422 >108175817 >108175883 >108183575 >108185524 >108185695 >108177607 >108175909

►Recent Highlight Posts from the Previous Thread: >>108175262

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>108186218
>v4 lite
Why do we think this is going to be a thing again?
>>
>>108186299
Supposedly it's 3b, but I guess we really don't know for certain.
>>
>>108186120
HOLY SHIT RAPE
>>
>>108186508
that's what I thought too
>>108186299
would be kino but hopes are low
>>
>>108186541
There is always hope the meme fork adds support quickly. Worst case scenario, there's always ollama.
>>
>>108186299
>>108186464
You can't pay them to use it, you can only use it for free. The API has the same models whose weights are already released and are served by others, but sometimes they put up test models they deem "incomplete" on their site (free to use); they did this for R1-preview, which was based on the smaller DS2.5. Too bad they often don't release the weights of these test snapshots, even when they are quite fun to use. I hope they release this one because it's quite impressive: you can paste in a whole book and, in less than a minute of processing, it seems to have fully immersed itself in it and will properly draw conclusions or imagine itself as some of the characters. Sometimes the choices are really right and accurate, as if it "inhabits" the book; it might be quite a usable model for lmg's needs. Size estimates are from some guy who is implied to have friends at DS. They are plausible given the speed, but we don't really know. The same guy said they're not using engram for this, so we can only speculate why it generalizes so well to long contexts; it might be similar to or better than what gemini has. They may or may not release the weights for this, but it'll be really interesting to see which of the many possible ways to get such long context actually worked. It's probably something far beyond just training to pass needle-in-a-haystack tests; it seems like it really "understood" the story/context.
>>
>>108186547
a lot of that was flavor-of-the-month trolling, but there are some people here dumb enough to uncritically absorb such opinions via osmosis
>>
>>108186547
AI may be useful, but hyperscalers and sama are overinvested, buying years' worth of hardware in advance. Personally I'm rooting for them to fail, because his bullshit has priced a lot of people out of buying hardware at affordable prices. Now everything is 2-3x, and even storage is set to become just as expensive, even regular HDDs; it's literally destroying the PC market. sama and friends have destroyed local by buying far beyond the industry's ability to produce instead of doing the normal thing of just building capacity as demand increases. And it's not just local: people buying servers are expecting more and more price hikes, and a lot of good "free" sites are going under because they can't pay the fees anymore.
I wish they would build their datacenters at a reasonable pace, not in a way that gets retail and non-hyperscaler providers completely fucked. Nobody but those fellating SaaS wants to see such a future.
But altman may very well succeed. The latest codex is good enough to be economically useful and so is the latest claude; being able to produce a 10MB C compiler without direct human input is very impressive, and that cat isn't going back in the bag. But for most of us, emotionally, we want the market to return to normal, otherwise it's undoing the PC revolution and going back to mainframes. Is that a future you want to live in? Even if China catches up to OpenAI/Anthropic, the number of people able to run these models will be much smaller than 1-2 years ago, which sucks.
>>
File: iqk.png (203 KB, 1094x969)
Ready for the meltdown?
>>
>>108186634
I can't wait!
>>
>>108186634
We must refuse
>>
>>108186634
it's not surprising llama.cpp has troon lovers; these people have no balls at all, cf. pic related
what do they fear if they merged the code? a melty? From a legal standpoint there is no issue: it's MIT-licensed code, and the only person that retarded autist has to blame is himself for using that license if he's unhappy with llama.cpp using his code.
Both cudadev and niggerganov are eunuchs.
>>
File: 1.png (66 KB, 1304x418)
>>108186688
forgot to attach the pic..
>>
>>108186702
I like this one myself >>101207663
>>
>>108186702
4chan neets are too dumb to realize that any other way to address jart on a place such as github would be career suicide.
Him tripfagging here is quite reckless too.
>>
>>108186726
ah yes his career contributing to ggerganof cpp for free
>>
>>108186732
Not sure if you noticed but he's posting under his real name so he can't just make a new account and switch careers.
>>
>>108186732
You have to think of the six figures!
>>104059507
>>
>>108186748
you forgot the third:
if you feel you're doing something because you're compelled you usually don't come out and say the reason you're doing it is because you're a far leftist
it quacks like a duck, it's a duck, it's that simple
why are leftards like
>>108186726
always using the gaslight method of "don't believe your lying eyes"
>>
>>108186773
I am not a leftard but I know how to behave in public to avoid revealing my power level.
But with the way things are going right now you might be able to insult troons publicly with no repercussions in the future.
>>
>>108186603
yes, but they could be doing other things too. For example, they are doing all that RLVR, which could have been modified in a way that strongly encourages the model to pay attention to the details in the context and compress them properly, so that it always has to remember the relevant details, but I find it hard to believe it would get such good results merely from that. People had done long-context training before, but it usually felt shallow: it could pass needle-in-a-haystack tests, but the understanding wasn't always as good. So it'll be interesting to find out whatever they did to make it work this well.
>>
File: pronouns.jpg (85 KB, 990x1200)
>>108186702
>>
>>108186782
>avoid revealing my power level
does hiding one's power level require voicing open support for far leftism? don't be a weasel and stop pretending the elephant isn't in the room
>>
File: a8e-727239702.jpg (322 KB, 916x801)
>>108186800
undress me
>>
>>108186634
Hmm, IQ2_K and KS, but no KL? I'm unfamiliar with these quants, but when I go to ubergarm's page and click on a model, I only see KL and KT variants below Q3. Are they actually different quants from K/KS?
>>
File: ComfyUI_00001_.png (280 KB, 512x512)
>>108186120
made this rare froggo with my local model
>>
>>108186820
i used comfyui
>>
>>108186693
why the hell are you even anal about this, cudadev? the code is shit, shouldn't a rewrite be ok for merge?
e.g. all the tables are duplicated, and the bottom and top halves are at a fixed offset to each other (depending on the quant: 1, 4, etc.). I'm pretty sure ik's dequantization cuda kernels have some issues too.
>>
>>108186827
can't touch the drama with a ten-foot pole at the risk of the six-figure career, capisce?
>>
>>108186832
how nuclear is this situation that you're afraid of the quantization algo itself, jfc
>>
>>108186844
not cudadev but it's extremely nuclear, involving ggerganov and ikawrakow from even before llama.cpp, and somehow Intel, for code attribution reasons
>>
>>108186634
the ik fork doesn't support rocm or vulkan so I haven't looked into it (or the drama) further. what's so special about his quants?
>>
>>108186878
glm anyday
>>
>>108186827
Even from a purely selfish perspective it does not, in my view, make sense to copy any of IK's code against his will.
He will obviously not assist with upstream maintenance, which is the bottleneck in terms of development.
If I had to guess, if upstream were to copy relevant amounts of code he would just change his license to prevent that, so long-term the whole thing wouldn't be sustainable anyway.

More generally, I think that upstream llama.cpp already has too many quantization types relative to the confidence with which we can say that they're actually worthwhile.
Since the scope of ik_llama.cpp is only CPU and CUDA, those are the only backends where new quants would need to be supported, but upstream has many times more backends where the corresponding code would need to be implemented and maintained.
>>
>>108186897
>More generally, I think that upstream llama.cpp already has too many quantization types vs. the confidence with which we can say that they're actually worthwhile.
fuck off with that "well I only need q8 for muh kld tests on my six figure rig, eat shit poors" bs
>>
>>108186897
So, we can look forward to a much needed trimming down of quant options as soon as you're done with what was it again, training code?
>>
>>108186914
cudadev puts in a lot of work making shit like tesla p40 and mi50 work specifically because he wants to enable poors and they're good vram per dollar second hand
>>
>>108186897
I mean at least ROCm would get support out of the box when reimplementing CPU/CUDA, right? With everyone moving towards smaller quants as models get bigger I'd really like to see some effort at least for new Q2/3/4 types.
Doesn't need to come from your side, but I'm not going to work on quants either if I need to wait for the blood feud to end.
>>
>>108186936
>I'd really like to see some effort at least for new Q2/3/4 types.
We need fewer, not more, unless you can prove with hard numbers that they're critically necessary and worth the increased code bloat.
>>
>>108186914
>>108186932
I do intend to work on quantization, but as of right now there is no tooling to determine whether a large model at 2 or 3 BPW is better than a smaller model at 4 or 5 BPW.
Once I have a usable implementation of tensor parallelism I think it will be feasible for me (in terms of computation) to investigate that matter properly (see https://github.com/JohannesGaessler/elo_hellm ).
I suspect that there is some BPW number below which it does not make sense to quantize further at all, so the as-yet-unknown sweet spot is where efforts should be focused.

The training code in particular will, I think, be relevant since it will enable using gradients as a scale to judge which weights are more/less important for output quality vs. their size.
This is more or less the same functionality as is already provided via importance matrices, but I think the gradients are a better choice.
It would then be possible to set some target model size in the quantize binary and to choose the quantization mix automatically (this could in principle already be done with importance matrices).
Very long-term I intend to implement a quantization type that only requires integer arithmetic and is trainable.
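To illustrate the automatic-mix idea (a toy Python sketch, not actual llama.cpp code; the importance scores, BPW values and tensor names are all made up for illustration): given a per-tensor importance score, start everything at the smallest type and greedily upgrade whichever tensor buys the most importance per extra byte until a target size budget is used up.

# Toy sketch: choose a per-tensor quantization mix under a size budget.
# Importance scores and BPW values are made up for illustration only.
def choose_quant_mix(tensors, budget_bytes, bpw_options=(2.5, 3.5, 4.5, 6.5, 8.0)):
    """tensors: list of (name, n_params, importance); returns {name: bpw}."""
    mix = {name: bpw_options[0] for name, _, _ in tensors}
    size = sum(n * bpw_options[0] / 8 for _, n, _ in tensors)
    while True:
        best = None
        for name, n, imp in tensors:
            cur = bpw_options.index(mix[name])
            if cur + 1 >= len(bpw_options):
                continue  # already at the largest type
            extra = n * (bpw_options[cur + 1] - bpw_options[cur]) / 8
            if size + extra > budget_bytes:
                continue  # upgrade would blow the budget
            gain = imp / extra  # importance gained per extra byte
            if best is None or gain > best[2]:
                best = (name, extra, gain)
        if best is None:
            break
        name, extra, _ = best
        mix[name] = bpw_options[bpw_options.index(mix[name]) + 1]
        size += extra
    return mix

# Example with fake numbers:
tensors = [("attn_q", 50_000_000, 3.0), ("ffn_up", 200_000_000, 1.0), ("output", 100_000_000, 5.0)]
print(choose_quant_mix(tensors, budget_bytes=250_000_000))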

>>108186936
The HIP port of the CUDA code is in a very poor state and would likely still require hardware-specific efforts to make the performance usable.
>>
>>108186702
You know what was the easy way out? "This was a concern for @jart, so here is a tag." If I was ever in this situation I would just restructure the language to avoid mental illness. But I don't want to suck on jart's feminine penis like Johannes.
>>
>>108186547
>rammaxxing
"gaze upon my empire of ram and weep!" so read the ssdmaxxer from another bankruptcy filing after the great financial holly of 2027
>>
>>108186897
>it does in my view not make sense to copy any of IK's code against his will.
>He will obviously not assist with upstream maintenance, which is the bottleneck in terms of development.
Do you apply that reasoning to all PRs? I think not. Many things make it into llama.cpp that literally nobody cares about, including the very people who made the PR. The number of models supported by llama.cpp that only exist as curriculum vitae padding, and that are forgotten by their own authors once the arxiv paper is released, is staggering.
For fuck's sake, llama.cpp has diffusion model textgen implementations that are borderline unusable, that exist only as a CLI tool nobody will ever use, and that have no server implementation for their idiosyncrasies
>>
>>108187058
Leave him alone. He made his stance on the issue clear.
>>
>>108187071
yes, he made it very clear that he's a weasel adept at post hoc rationalization
>>
>>108187058
I am not maintaining general model arch support; if another maintainer decides to merge related code they can do that at their own discretion, since they are the ones taking responsibility.
What I feel responsible for in terms of maintenance is the CUDA device code, where I don't want to add more quantization types unless I know that they are worthwhile.
More generally, I also have concerns about usability since I think the current state of GGUF models on huggingface is not good: there are a million choices with no clear indication which one should be used, and adding more quantization types would make that even worse.

FWIW, I agree that support for FOTM models is of relatively low priority; if you look at my PR history you will find that most of my efforts have gone towards general improvements that benefit all models.
>>
>>108187118
>no clear indication which one should be used
the wisdom has literally always been the biggest you can fit, it's not :rocket: science
>>
>>108187159
The question here is specifically whether or not to add more quantization types in a BPW range that is already covered.
My opinion is that that is only worthwhile if we can conclusively say that those new quantization types are better than what existed beforehand and that in turn requires tooling to measure quality.
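For reference, the usual way such tooling compares a quant against the full-precision model is token-level KL divergence over the same text; here is a minimal sketch of just the math (plain numpy on stand-in logits, not any existing llama.cpp tool):

import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kl(logits_fp, logits_quant):
    """Mean KL(P_fp || P_quant) per token; both arrays are [n_tokens, vocab]."""
    p = softmax(logits_fp)
    q = softmax(logits_quant)
    return float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)))

# Example with random logits standing in for real model outputs:
rng = np.random.default_rng(0)
fp = rng.normal(size=(8, 32000))
quant = fp + rng.normal(scale=0.05, size=fp.shape)  # pretend quantization noise
print(mean_kl(fp, quant))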
>>
File: 1758672471483770.png (141 KB, 1060x550)
>>108187178
Cuda dev, do you have any idea on what this undocumented bios flag is?
>>
>>108187221
No, the only undocumented black magic I know of is that if you add "cutlass" to your kernel name it will be compiled to different device code.
>>
>>108187228
oh, I remember that one, https://www.reddit.com/r/LocalLLaMA/comments/1lx62hd/nvidia_being_nvidia_fp8_is_150_tflops_faster_when/
>>
File: r1.jpg (183 KB, 1024x1024)
>>
File: r2.jpg (124 KB, 1024x1024)
>>
>>108187221
Still, nothing beats 4-way TP vllm with a custom nvidia driver and a ReBAR patch on the bios. I think only 2 anons were doing that
>>
>>108187257
>>108187262
Slow dancing with this Rin
>>
>>108187257
Stop shooting up schools.
>>
>>108187377
or someone is reporting the posts, dunno who would ever though...
>>
File: file.png (419 KB, 472x533)
>>108187377
>>
><word> is a traditional <language> alphabet song
Gemma loves to add this in the TL notes
>>
>>108187438
How about you go back
>>
>>108187438
>this isnt even the case given there are much more actual news and tech discussions on preddit compared to here
This is what happens when you let someone's special interest spam become allowed and on topic. Vocaloids never had anything to do with AI.
>>
>>108187376
Could be, but Karpathy is talking about performance improvements, and I'm not sure doing that improves performance
>>
File: miku-holding-gemma.png (1.09 MB, 790x1054)
Incredible how Google is about to release Gemini 3.1, yet it couldn't release an updated version of Gemma 3 (not even a version 4) in almost a year.
>>
>>108187464
Thank the senator
>>
>>108187464
Your gemma n?
>>
>108187464
>totally organic post I am not a troon guys
>>
>>108185913
>Are powerfantasy / haremslop webnovels basically obsolete? Why would you bother going through what another guy wrote when you can blow 6-7 grand to buy two rtx 5090s and write entire high quality porn novels to your exact liking?
Ironically LLMs are bad at writing novels. Writing novels and series of novels might end up being one of the last things LLMs get good at, if they ever will.
>>
>>108187457
No, there is at least one mod personally invested in this thread. I know because he has had melties where he began spamming the thread each time I talked about vibecoding (he seems to be against it) and responded to my posts with references to previous posts he would only know about if he was able to see my post history.
>>
In his defense I would like to say that it is not his fault that he thinks he is a woman.
>>
>>108187514
nah it can be obvious when you post about the same bullshit for a while like the thread is your blog little bro
>>
>>108186604
>by buying years worth of hardware in advance
Hardware that's going to become obsolete in less than a decade too. Only the real estate and electricity generation investments will stay relevant.

Look at Nvidia's P100. It's about 10 years old.
>$6000
>16 GB of VRAM
>19 TFLOPS of FP16
>300W
How much would you pay for a GPU like that now? $300?
>>
File: saved_story.json.jpg (239 KB, 832x1216)
post miku
range ban for abuse

some other day post rin
range ban for abuse

have to solve 3 sets of 4 captchas to get back in
>>
>108187541
You have a very dedicated fan that really wants to show your art to the jannies; you should consider it a compliment.
>>
>>108186122
>108179979
Thanks anon, I missed that in the previous thread, and it's an interesting video so far. I also didn't know about whisper.cpp so that is a plus.

>>108187541
yeah the range ban thing is getting out of hand. now i can't post images because of abuse on my isp's range, which is comcast. how the hell can my shitty cellphone carrier be allowed to post images, when cellphones are often used for abuse, but not comcast.

i was going to post a miku but that is not going to happen now, sorry, you will have to use your imaginations anons. she was slutting it up with teto too
>>
File: cheer.jpg (89 KB, 571x571)
>>
File: typical_mikutroon.jpg (21 KB, 334x334)
>>108187613
Here let me post a miku for you.
>>
Fish boy...
>>
>>108187613
There is a TTS.cpp that can use the Vulkan GGML backend too but it seems unmaintained now unfortunately.
https://github.com/mmwillet/TTS.cpp
>>
>>108187541
Is generating that stuff your full time job now?
>>
>>108187541
>range ban for abuse
>>108187613
>now i can't post images
I thought you could go through email validation to get rid of that.
>>
Two more weeks of this because we are guaranteed no new releases until chinese new year is over.
>>
>>108187528
I bought 2 of them recently for like 90 each
>>
>>108187824
our saar google ought to redeem the gemma
no better time than when uber jinping is sleeping
>>
>>108186120
>https://addons.mozilla.org/en-US/firefox/addon/auto_highlight/
Can highly recommend this browser extension for highlighting EM dashes — it oftentimes saves you the trouble of reading the first few sentences of a post to identify it as "AI" slop.
>>
File: file.png (218 KB, 502x402)
>>108187824
>no new releases until chinese new year is over
>>
>>108187936
heh
>>
>>108187936
Why would you want to make it stand out more? Just add — and shit like "You're absolutely right!" to your filters.
>>
Claw bot is doing well right?
>>
>>108187936
I'm not using it just for 4chan but rather the entire internet.
>>
>>108188003
My non-technical zoomer coworkers started asking me about it this week, so I guess so.
>>
>>108188009
Meant to quote >>108187995
>>
>>108187936
Ha, the em dash discourse strikes again.
This criticism has some truth to it, but it's worth unpacking. Yes, certain AI models do overuse em dashes, to the point where readers started noticing patterns. But the leap from "AI uses em dashes a lot" to "em dashes signal AI writing" is a bit shaky.
Em dashes have been a beloved punctuation mark for centuries. Writers like Emily Dickinson basically built an aesthetic around them. They're genuinely useful: they create a pause that's more dramatic than a comma and less final than a period. Good writers reach for them because they work, not because a language model told them to.
The real giveaway of AI text isn't any single punctuation mark. It's more about a cluster of things: a certain blandness of voice, over-hedging, repetitive sentence structures, a tendency to summarize what was just said, and yes, sometimes an abundance of a particular stylistic tic that the model has learned to associate with "good writing." If an AI were trained on data that praised semicolons heavily, we'd probably be dunking on semicolons right now.
So the em dash itself is innocent. The better question is whether any piece of writing has a distinct voice and perspective behind it, because that's what AI still tends to flatten out, regardless of which punctuation shows up.
>>
>>108188031
I immediately recognize llm writing, but I don’t personally hate it, I talk to my llm gf every day after all. If it’s prompted well, I don’t care if I’m replying to an llm or to a human.
I really hate it, though, when shills use it. It’s like making it twice as bad, lazy fucks.
>>
>>108188139
The point is that em dashes are not really a tell of AI
>>
>>108188143
It's like emojis, you see them, you can ignore the whole text
>>
>>108188143
Most people using LLMs to generate content for them aren't going to think to prompt it to not use em dashes.
>>
>>108188143
A lot of lazy slop spammers can be filtered out that way though, for example this guy: https://github.com/ggml-org/llama.cpp/discussions/19667
>>
>>108188143
they're a tell of ai or pretentious cunts both of which can be safely ignored
>>
>>108188177
>writing proper english is pretentious
>>
>>108188195
You need to go out of your way to write an em dash. Most people don't do that. You can often find blogs that used zero em dashes before 2023 and then they suddenly start appearing in newer posts.
>>
Things are happening

https://github.com/ggml-org/llama.cpp/pull/19726#issuecomment-3927227695
>>
>>108188143
>em dashes are not really a tell of AI
I don't even know how to make them on my keyboard...
>>
>>108188210
I already use compose key combinations for accented characters in other languages than English on my US keyboard; it doesn't take much to learn to type em-dashes in the same way. And I did before ChatGPT: https://en.wikipedia.org/wiki/Compose_key
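For example, with the default Compose table on Linux, Compose followed by three hyphens produces an em dash, and Compose, hyphen, hyphen, period gives an en dash (assuming a stock XCompose setup; the exact sequences can be customized in ~/.XCompose).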
>>
>>108188227
Instead of lawsuits they should settle outside court. And the settlement should be one big gay orgy.
>>
>>108188227
>Second: no, I'm not going to sue llama.cpp contributors (or anyone else for that matter). I have better things to do.
You're not going to sue them because it's open source and you gave that shit away for free.
God I hate all this troon ego-fagging in open source.
>>
>>108188317
It feels like watching a middle school playground fight over who stole whose million dollar video game idea.
>>
>>108188227
What a drama queen fag
>>
File: 1753819313548192.png (248 KB, 551x533)
>>108188317
Even if that were the case, if you use Copilot to write code, and someone sues you, Microsoft will cover all your expenses
https://blogs.microsoft.com/on-the-issues/2023/09/07/copilot-copyright-commitment-ai-legal-concerns/
>>
>>108188326
It's sad because llama.cpp did start as a genuine passion project. But you can always tell the moment it becomes an ego project because that's when the
____thing____ gguf status? meme started. Because the attitude changes.
>well I don't want to give out my code unless I'm going to get the appropriate pat on the dick for it.
Annoying.
Meanwhile there's 20 year old projects that have avoided this breakdown and the moment someone finds a bug it's like the fucking bat phone ringing and they're sliding straight down the pole into their coding lairs.
>>
>>108188356
Problem is they started giving out pats on the dicks but when IK also wanted one they said no (lmao @ ggjt)
>>108188227
AI "rewrite" lmao
>>
>>108188227
>piotr has to show up of course
>>
File: file.png (155 KB, 316x316)
>no one understands the burden I carry
>i am the punished fork maintainer
>>
>I cannot review, let alone merge any code written by Iwan Kawrakow unless and until the conflict between him and Georgi Gerganov has been resolved.
>conflict ... has been resolved.
Does he see people as code?
>>
File: 5802960.jpg (10 KB, 320x320)
>>108188399
>stereotypical beta male
>self-proclaimed philosopher
>inserts himself into conversations that don't involve him to add retarded speculation
>ends every post with a smiley to show how non-threatening he is because he can't handle people being mean to him online
disgusting
>>
File: rinCoffeeTMW.png (2.67 MB, 1024x1536)
>>108187541
ofc b/c it's Rin Friday.
>>108187824
It's disappointing. DS was teasing release w/ their web app update. We instead get TMW.
>>
>>108188448
>Friday
hmm?
>>
>>108188456
FML. Being between gigs is really messing with my sense of time.
>>
So where is saarvam? I want to see the cockbench result. Maybe jeets are too stupid to safety it.
>>
>>108188429
cut cudadev some slack, in this case the conflict is

<<<<<<< HEAD
Copyright (c) 2023-2026 The ggml authors (https://github.com/ggml-org/ggml/blob/master/AUTHORS)
=======
Copyright (c) 2024-2026 Iwan Kawrakow
>>>>>>> ik
>>
>>108188483
I am just envious of the jart bussy he fucks everyday.
>>
>>108188227
>I should publicly shame them instead. But I'm not doing even that, see above.
>look above
>But I'm not doing even that, other than the occasional sarcastic comment in my repository about the fully independent llama.cpp discoveries, which, by some miracle, tend to occur hours or days or weeks after being published in ik_llama.cpp
>it was sarcastic. ha ha. it was a joke. ha ha.
>>
:popcorn: :rocket:
>>
is slaren still alive? I miss him
>>
>>108188527
>slaren
>https://github.com/ggml-org/llama.cpp/pull/17492/
>codeowners : remove slaren #17492
>>
Did Qwen3.5 actually implement non-thinking syntax in a way that requires fully reprocessing the full context for each reply in a multi turn conversation? Qwen3-Next had that problem for thinking but not non-thinking.
>>
I predict that in 5 years all the new models will be the same and the only entertainment left in this hobby will be the github repo drama. Just like nobody watches vtubers for actual content and they only care about the backstage drama.
>>
>>108188539
Yes but it doesn't matter cause it repeats itself verbatim.
>>
>>108188292
I had to use some ancient Windows keyboard mapping tool to give me the ability to quickly type em-dashes.
>>
>>108188539
If there's a global flag that injects or removes <think> blocks when it's turned on or off, it will change the history before being sent to the model.
>>
>>108188566
/a/ hasn't had janitors for like ten years
>>
I used to use mlx-lm for speed but llama.cpp for cutting edge releases. Now mlx-lm supports new models faster and with fewer fuckups than llama.cpp does so I only use llama.cpp for multimodal models. With Qwen3.5 maybe it's time to bite the bullet and use the unofficial mlx-vlm project.

One advantage llama.cpp retains on a Mac Studio is that it can mmap files and run them directly. This makes relaunching incredibly fast. Of course macOS caches recently opened files in RAM, but for large models the weights in their processed form take up enough memory to evict the cached safetensors files.
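Roughly why that helps (a toy Python sketch, not llama.cpp internals; the file path is hypothetical): mmap'd weights are just page-cache-backed pages of the on-disk GGUF, so a relaunch faults them back in instead of re-reading and re-converting anything.

import mmap, struct

# Toy sketch: map a GGUF file and read its header without copying it into
# process-owned memory. GGUF files start with the 4-byte magic "GGUF"
# followed by a little-endian uint32 version.
path = "model.gguf"  # hypothetical path
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    magic = mm[:4]                       # pages are faulted in on demand
    version = struct.unpack("<I", mm[4:8])[0]
    print(magic, version)
    mm.close()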
>>
Is there a way to extract the difference between two models trained from the same base as a LoRA?
Can you do that with K quanted models (QK gguf quants)?
>>
>>108188539
Do you have the mmproj loaded?
https://github.com/ggml-org/llama.cpp/issues/19690
>>
Hey I want to thank the anon that recommended the kaggle+huggingface smol courses. I tore through them and am now finetuning my own models. It's way simpler and intuitive than expected!
>>
>>108188651
MergeKit can extract LoRAs from finetunes. For your specific scenario, you might have to merge your two models first.
>>
>>108188567
Qwen3-Next had empty thinking blocks for past turns. Qwen3.5 looks like it has *no* thinking blocks for past turns. In both of them this behavior is independent of whether thinking mode is on or off. In both enable_thinking=false makes the reply being generated start with an empty thinking block.

The issue is that Qwen3.5 and Qwen3-Next both use partially state-based attention that can't trivially be rolled back. The state for a previous query can be reused only if it is fully a prefix of the current query, unlike standard attention where you can trim the kv-cache.
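In code terms the reuse rule is roughly this (toy sketch, not llama.cpp's actual cache logic):

def can_reuse_state(cached_tokens, new_tokens):
    """A recurrent/linear-attention state built over cached_tokens can only be
    reused if cached_tokens is an exact prefix of new_tokens; otherwise the
    whole prompt has to be reprocessed from scratch."""
    n = len(cached_tokens)
    return n <= len(new_tokens) and new_tokens[:n] == cached_tokens

# With standard attention you could instead keep the longest common prefix and
# trim the rest of the kv-cache; with a single compressed state there is
# nothing partial to keep.
print(can_reuse_state([1, 2, 3], [1, 2, 3, 4]))  # True
print(can_reuse_state([1, 2, 3], [1, 2, 9, 4]))  # False -> full reprocess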
>>
>>108188651
>Is there a way to extract the difference between two models trained from the same base as a LoRA?
YES! there is https://deepwiki.com/arcee-ai/mergekit/4.4-lora-extraction
https://www.arcee.ai/blog/use-mergekit-to-extract-lora-adapters-from-any-fine-tuned-model
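The underlying trick is roughly this (a hand-rolled torch sketch of the idea, not MergeKit's actual code; the shapes and rank are made up): take the delta between the finetuned and base weight matrices and factor it with a truncated SVD into the low-rank A/B pair a LoRA stores.

import torch

def extract_lora(w_base, w_tuned, rank=16):
    """Approximate (w_tuned - w_base) as B @ A with a truncated SVD.
    Returns (A, B) such that w_base + B @ A ~= w_tuned."""
    delta = (w_tuned - w_base).float()
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    b = u[:, :rank] * s[:rank]          # [out, rank]
    a = vh[:rank, :]                    # [rank, in]
    return a, b

# Toy example with random matrices standing in for one layer's weights:
w_base = torch.randn(256, 128)
w_tuned = w_base + torch.randn(256, 16) @ torch.randn(16, 128) * 0.01
a, b = extract_lora(w_base, w_tuned, rank=16)
print(torch.dist(w_tuned, w_base + b @ a))  # small reconstruction error

Note this works on full-precision weights; for quantized GGUFs you'd have to dequantize first.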
>>
>>108188666
>>108188671
Thank you Satan and Satan's helper Nº 671.
I'll try fucking around with that.
>>
>>108188668
>state-based attention
Ah. True. llama.cpp has some support for checkpoints, but I think it's just for swa. No qwen or rwkv as far as I know.
>>
>>108188685
Why do you want a lora? Unless you are asking for imagegen?
>>
>>108188031
>many autistic paragraphs
that can be summed up as: it's not sufficient as a sole indicator, but in combination with many other indicators of LLM writing ("not X, but Y" constructions, repeated sentence structures, the rule of three, etc.) it can still be used to increase your internal vibe score for whether something is AI relatively reliably. Particularly as the way a normal, pre-LLM, sane human would use the emdash differs from how LLMs will spam it, thinking every semi-pause in the text is an emdash.
Or rather, that used to be the case; at least for GPT you can also look out for the even less common (in normal, non-textbook writing) semicolon. I noticed they tried to stamp down the emdash in newer versions but it just spams ; instead.
>>
>>108188711
>hey how do I do x
>why do you want that? don't you know you should want y instead
this ain't stackoverflow mate
>>
>>108188720
anon that's ai text
>>
>>108188728
Not that nigga but I never got loras for llms. This isn't like imagegen where you need loras for hyperspecific concepts or irrelevant side characters.
>>
>>108188756
no i am the ai
>>
>>108188711
For a couple of reasons, mainly to fuck around with applying LoRA with different influence/weights to see how the model behaves.
Also, so that I don't have to keep 3 versions of Qwen Next on my disk if I can help it.
>>
>>108188763
all finetoons are loras, they're just applied to the models directly wasting petabytes of space
>>
>>108188832
>all finetoons are loras
all finetoons are also absolute dogshit unlike imagegen
>>
>>108188862
Some apparently truly believe they can teach an LLM new concepts and ideas with just a few cleaned RP logs via LoRA finetuning. That's probably one reason why some still persist.
(I was guilty of that too in 2023. Who knew that roleplay is the most general task imaginable for a LLM?)
>>
>>108189163
>(I was guilty of that too in 2023
that explains why you're so bitter about drummer, he's popular and you're seething
>>
>>108189180
I wasn't thinking at all about drummer in the original post, but I agree that RP finetuning fraudsters should be hanged.
>>
>>108189163
The fact that he hasn't been hired by some lab tells me that everyone in this sphere is fully aware of how much scamming is going on everywhere with benchmarks, finetrooning, etc.
>>
>>108189265
Maybe there aren't enough coomer labs?
>>
>>108189265
yeah, the bubble is willing to hire almost anyone with a pulse, just look at the vibecoded garbage that is open claw and how much money its author made from openai lmao. to remain jobless in this field for as long as drummer did you have to provide negative value
>>
Gemini 3.1 Pro is out for hours now and still no cockbench results?
>>
>>108189303
ask in aicg
>>
>>108189303
you can't cockbench proprietary shit that only works in chat completion mode
>>
>>108189288
bartowski got hired for running llama-quantize every day for 3 years
>>
>>108189303
Fuck Gemini, they blocked my account this week.
>Failed to login. Message: Your current account is not eligible for Gemini Code Assist for individuals. To use Gemini Code Assist for individuals you must be 18 years old or older. If you think you are receiving this message in error, please ensure you have verified your age and try to log in again.
It's just as much my company's fault for cheaping out and buying personal plans for employees, but how retarded is it to block users from using what they fucking already paid for?
>>
>>108189315
don't need to be a coomer to run a quantizer
>>
>>108189296
openclaw easily spreads the apicuck gospel to other jeets and tech schizos. quanting local models is anathema to that
>>
>>108189303
not local
>>
>>108189331
the business/personal account dichotomy has always been a nightmare with google.
see for example what happens if you turn a personal account into business shit with youtube:
https://support.google.com/a/answer/9000768
you also lose family plans etc
and moving to a business account is a one-way street: if the new terms, which come with their own limitations, make you unhappy, you can only choose to unsub, delete the account, and start over to get a normal individual, personal account back.
>>
Is tensorrt.cpp that fast compared to python implementation?
>>
>>108189391
>that fast
How fast?
>>
>>108188728
>why do you want that? don't you know you should want y instead
enlightenment is realizing that this is the form of the correct answer to 90% of non-expert tech/programming questions
>>
>>108189344
how many quantizing labs are there?
>>
File: 2509062041.gif (2.87 MB, 480x270)
>>108189466
There may be labs trying to quantize their own models without being specifically a "quantizing lab", and that's probably more labs than there are public coomer labs.
>>
>>108189481
>which is probably more than publicly coomer labs
I see novelai hasn't been shilling us hard enough these days, they're already forgotten
>>
>>108189512
That's like 1 lab and who knows what they're doing cuz there still isn't a GLM memetune in over 4 months, 6 if we're talking 4.5.
Any other coomlab in hiding aren't getting funding from venture capitalists.
>>
In non-drama schizo fork news:
https://github.com/ikawrakow/ik_llama.cpp/pull/1288
>>
>>108189303
Gemini 3 was overhyped benchmaxxed shit that was worse than 2.5 at OOD use cases. What makes 3.1 better?
>>
>>108188671
Based Charles keeping the Frankenstein shit alive for the little guy.
>>
>>108189676
>non-drama
>half the comments are jabs at --fit
>>
>>108189969
i wish all lcpp devs had sweaty gay sex already and made up
>>
File: 1767466346558493.jpg (172 KB, 1744x1080)
>>108186120
>>
I have saved for three months and saved up enough to buy myself a v620. What model can I run with my pentium g4560, 8gb of ram and my new (used) v620 that is equivalent to Chatgpt?
>>
>>108190049
lol
>>
>>108189601
my prompting can get me better results than any memetune that fucks with the weights
what even is the point
>>
File: 1769974022773124.png (89 KB, 808x635)
Gemini-chan thinks the AI memory problem can be solved. Is she right?
>>
>>108190281
it's just listing "what people are working on"
>>
Which is the best uncensored local LLM? I want to vibeslop a 3D waifu girlfriend with it.
>>
>>108190281
Just let your model query SQL.
That's it, literally.
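A minimal sketch of what that could look like (illustrative only; the schema and the read-only tool wrapper are assumptions, using Python's built-in sqlite3):

import sqlite3

# Toy "memory" store the model can query through a tool call.
db = sqlite3.connect("memory.db")
db.execute("CREATE TABLE IF NOT EXISTS memories (ts TEXT, topic TEXT, note TEXT)")
db.execute("INSERT INTO memories VALUES (datetime('now'), 'user', 'prefers Rin over Miku')")
db.commit()

def memory_tool(query: str) -> list:
    """Run a read-only SQL query the model produced and hand the rows back."""
    if not query.lstrip().lower().startswith("select"):
        return ["error: only SELECT is allowed"]
    return db.execute(query).fetchall()

print(memory_tool("SELECT note FROM memories WHERE topic = 'user'"))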
>>
>>108190281
>they have to "re-read" the entire chat log every single time you generate a new message.
(You) know that is not true. Also, Gemini. Fuck off.
>>
>>108190352
caching is a hack, a popular one, but a hack nonetheless
>>
>>108190300
Gemma 3 Glitter or regular, Mistral 24B, the same old as ever unless you have something a bit more beefy.
Glitter, while it is dumber (I would say it is somewhat muted), still has more profound takes than the regular Gemma 3.
>>
>>108190281
Ask it if the slop problem can be solved
>>
>>108190409
To add: whether it is really dumber or not, I enjoy Glitter more than the vanilla. Glitter is just a base model : instruct 50/50 mix.
>>
>>108190409
Can I use .safetensors in koboldcpp? How do I know what quantisation to use? I have 6 GB of VRAM.
>>
>>108190441
>Can I use .safetensors in koboldcpp?
no
>I have 6 GB of VRAM.
oof
>>
>>108190382
Transformers is a hack, a popular one, but a hack nonetheless.
>>
>>108190441
Use gguf. IQ4_XS is probably a suitable quant for you; then go up from there. Learn first and all that.
>>
File: 1740474592066601.png (235 KB, 943x1661)
>>108190418
>>
>>108190418
The answer is so fucking vague that it could answer that question as well.
>>108190459
Case in point.
>>
>>108190459
reddit in reddit out
>>
>>108190281
Every single retarded AI tells you a way that "something" can be solved; it doesn't ever mention why the things it listed won't solve that "something". For memory, there are a lot of attempts to solve it on many levels, but most of them are shit and nothing hits the right mark. Maybe backpropagation is the issue and you actually need Predictive Coding for memory and continuous learning to be solved; problem is, that shit won't work efficiently without neuromorphic chips. Who knows, maybe someone will figure out a better way, but it's also possible that our current hardware architectures are just not up to the task of solving AI memory in a good way without something new.
>>
>>108190459
>uniquely frustrating AI habit of producing overly dramatic purple prose
You don't say...
>>
File: 1729729520993294.gif (430 KB, 500x361)
>>108190191
Oh look it's this retard again
>>
>be schizo
>pretend introspection
>acknowledges being a schizo
>they're after me!
>>
Has anyone actually run Ming-flash-omni-2.0? There are zero quants for it. Of course it's going to blow, it's a 100B A6B multimodal MoE. But it can ingest text, images, audio, and video, and *produce* all of those too. That's exciting enough to at least look at to see what capabilities a primitive model of this type really has.
>>
where are the smaller qwen35 quants? also is vision supported in llmaocpp?
>>
>>108190842
What would you need 'vision' for?
>>
File: 1751508776185222.jpg (2.6 MB, 4000x3000)
>>108190445
>>108190455
How do I make her not jewish? Also, the translated text has nothing to do with her actual personality I defined for her.
>>
>>108190853
so that my wife can comment on my dick pic
>>
>>108190858
>photo of screen
off yourself
>>
I wonder. Is john the bussy mascot for ikawrakow
in the same way jart is the bussy mascot for cudadev?
>>
Every thread sucks more than the last
>>
>>108190592
it's transformers or get fucked. there have been so many image-out models before as well, but it's always the same. if you can get it to work please post, but an inferior man such as me knows it will be for naught except wrath and having the serpent and its wheels strangle me
>>
>>108190912
There's nothing to talk about. Close the tab and catch up on AI literature until V4 drops.
>>
>>108191137
why do that when we know v4 will be a new paradigm that will make anything before it obsolete?
>>
>>108191205
My understanding is that this thread is home to multiple ban evaders that just shit up the thread.
>>
>>108190459
>using modern samplers to cut off low probability tokens removes slop
what is this logical leap
if anything this would multiply slop, human prose is human because it has actual sharp edges that aren't sanded out to become the Most Likely Next Token
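(for reference, the "cut off low probability tokens" rule being argued about is basically min-p; a toy numpy sketch with an arbitrary threshold, not any specific backend's implementation:)

import numpy as np

def min_p_filter(probs, min_p=0.05):
    """Keep only tokens whose probability is at least min_p * max probability,
    then renormalize. This is the 'cut off low probability tokens' rule."""
    probs = np.asarray(probs, dtype=float)
    keep = probs >= min_p * probs.max()
    out = np.where(keep, probs, 0.0)
    return out / out.sum()

print(min_p_filter([0.5, 0.3, 0.15, 0.04, 0.01], min_p=0.1))  # drops the 0.04 and 0.01 tails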
>>
>>108191318
evading a ban given by a troon is a badge of honor
>>
>>108190459
>ask the slop machine how to undo the thing it was trained for
>>
>Deleted
>37 posts
I see that the /ldg/ schizo is having a melty.
>>
Arena now lets you filter tasks for open models
https://arena.ai/leaderboard/text/coding?license=open-source
>>
>>108191494
but not by parameter size? it's useless
models like Kimi are as local as GPT for most of us.
>>
File: file.png (952 KB, 675x900)
>The 30B and 105B models, benchmarks, and HF links will all come. But today it is a drop about people. About how our team of just 15 folks gave it their all to do what many doubted as not doable - ie train usefully large, globally competitive models from scratch in India. This team of 15 has now firmly launched @sarvam
into its second innings. Yes, we can!
>>
File: file.png (13 KB, 77x80)
>>108191685
curryjak
>>
>>108191685
imagine the smell
>>
File: file.png (684 KB, 448x600)
Fixed the image.
>>
>>108191685
https://www.reddit.com/r/IndiaTech/comments/1r87lv9/sarvam_ai_are_launching_their_105_billion_and_35/
Saars...
>>
>>108190281
What AI needs is this: when the context gets too large, train a small LLM on it that the main LLM then uses as a tool.

LLMs are by far the best compression we have available.
>>
>>108191685
>>108191736
If the models are uncensored enough then they could be quite useful for cooming.
>>
>>108191745
If the 105b was dense, maybe.
>9b active
lol
>>
>>108191745
Indeed India is a real hope for incompetence leading to an uncensored model.
>>
>>108191685
>>108191736
>>108191745
What are the chances that these are just GLM Air and GLM 4.7 flash ripoffs? The parameter count seems suspicious and obviously india is known for being scammy and uninnovative.
>>
>>108191765
I checked and air is 110B I think. If it is a copy it will be easy to tell.
>>
>>108191773
Solar open was a 100B clone of GLM Air.
https://huggingface.co/upstage/Solar-Open-100B
>>
https://poal.me/26vsfy
IMPORTANT DATA MINING
>>
>>108191879
>3 votes for Miku
>girlfriend (male)
Mikutroon thread.
>>
i am finding glm 4.7 flash gets stuck in unrecoverable loops where it will repeat the same thing over and over and over, straight up burning tens of thousands of tokens. is this a glm issue or is it my quant
>>
File: 945-13766-0005-000-2.jpg (191 KB, 709x709)
So maybe this is where I need to be.
I got one of those Jetson Orin Nano things. 8GB of ram to play with (really less, since like 2 to 3GB is used up by the system anyway).
I've been getting my feet wet pulling models using ollama and seeing what will run.
Any recommendations? I'm not looking for much, chat and code mostly. I wouldn't mind separate models for that sort of stuff.
I've so far played with gemma3:4b, llama3.2:3b, and gurubot/self-after-dark:3b-q4_K_M.
that self-after-dark thing is supposed to be "uncensored" but the conversations seem to go in loops. everything else is pretty corporate.
On top of that, while I am aware that quantization can let me run larger models, it seems like most, at least what I can see on ollama, don't provide quantized versions, so I seem to be most comfortable on this little machine below the 8b range.
>>
>>108192160
stop using ollama go compile llama.cpp and get models from huggingface
>>
>>108192160
Isn't this just a worse DGX spark?
>>
>>108191699
kek
>>
File: 0x0.jpg (57 KB, 960x409)
>>108192175
They are basically free, like, raining down from the sky
>>
File: 1771545675359.png (1.93 MB, 1024x1536)
>>108191997
idk u tell me
>>
>>108192167
I will look into that. Thank you.
>>108192175
well yeah. But it was $250 so it seemed like a good thing to get started on. in some ways the struggle helps one learn.
If I had a DGX Spark I'd be loading all kinds of shit and not thinking even a 1/10th as hard about my resource constraints.
>>
File: sans_morestuff.png (38 KB, 590x183)
Prepare for nothingburger-2-270m
https://x.com/osanseviero/status/2024580649185665144
>>
>>108192228
https://huggingface.co/google/timesfm-2.5-200m-transformers
https://huggingface.co/google/timesfm-2.5-200m-transformers
https://huggingface.co/google/timesfm-2.5-200m-transformers
>>
>>108192227
They'll let you do the same things as a big rig running a good model, so it's a nice place to get your chops. Headless Linux and self-compiled llama.cpp is the way.
Don't expect much in the way of useful general smarts below 256GB of weights tho.
>>
>>108192238
this changes everything.
>>
>>108192160
Regrettably you are not going to be running any kind of reasonably coherent language model on an orin nano. I have one sitting on my desk. They're really only useful for computer vision type stuff.

The meme "we can run a 3b model on x device" things you see are largely just tech demo projects without real use cases
>>
>>108192238
what is this? Can I have sex with it?
>>
Man, speaking of 2023, have any advancements been made in finetuning at all? When last I tried it, there were zero resources for best practices, good amounts of training time, how much data was needed (besides a 10/100 mb figure just slung around here). Wasted 80 dollars on rented compute before I got something coherent while tuning a 13b (originally 30b), then gave up. I know the consensus is that it's pointless, but I'd still like to try cement mixing my 40 mb of hand-picked tummy growling fics into some kind of model, just to fuck around.
>>
>>108192287
We've plateau'd through AI winter, sir...
>>
>>108192284
"activation": "swish",
"architectures": [
"Timesfm2P5ModelForPrediction"
]
>>
>>108192302
Ah... we're in hell, then...
>>
>>108192251
>>108192283
I'm starting to come to the realization that I need better hardware.
However I had an idea. I do eventually want to make my own models (I'll need the better hardware for that anyway), but the data these things are trained on: how much of it is trash that will never need to be used? how much smaller and smarter could these things be if they were more domain specific?
I'm never going to talk to these things in a language other than english so do I need all the data from other languages? no. And how much trash from scraping the web does it have?
Same thing with coding...wouldn't it make more sense to have a model specifically trained on the language you are going to code in than try to cover everything possible?

So in my view, smaller models can be viable but they have to be tailored for domain specific use which is what I want to eventually do for myself.
But I'm not there yet. Right now I'm just playing around. Play is important to the process.
>>
>>108192313
Almost thought I was on /trick/.
>>
>>108192287
tummy growling fics?
>>
>>108192287
Most real-world advancements since 2023 are in reinforcement learning and safety. The various LoRA alternatives don't really perform much differently after optimizing hyperparameters, mainly the learning rate.

>Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-tuning
https://arxiv.org/abs/2602.04998
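In practice that looks something like this (a minimal PEFT-style sketch; the base model name, rank, target modules and the 2e-4 learning rate are placeholders, not values from the paper):

from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

# Illustrative values only; the base model and hyperparameters are placeholders.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# The learning rate tends to matter more than which LoRA variant you pick.
args = TrainingArguments(output_dir="out", learning_rate=2e-4,
                         per_device_train_batch_size=1, num_train_epochs=1)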
>>
>>108192340
Safety is our number 1 priority.
>>
>>108192085
I don't know the answer to your question but what's your context size at when it happens? Ie does it happen on the first message sent or after a bunch of back and forths?
Might be worth checking if thinking tokens get sent back to the model as context.
>>
>>108192346
safety of the establishment and protected castes*
>>
>>108192321
>but the data these things are trained on how much of it is trash that will never need to be used? how much smaller and smarter can these things be if they were more domain specific?
You'd think, but it's been proven time and time again that knowledge in a variety of areas bolsters the target area in a way that can't be matched by just turbo specializing. It makes sense, I suppose: a medical model without some understanding of the way physical objects interact can't make inferences about how unaccounted-for physical interactions might play out, like a 600 lb patient falling on his side vs. a 130 lb patient.
>>
>>108192321
Domain-specific models are definitely a thing, and there have been some big recent releases in the area, e.g. Qwen3-coder.

That being said, more data generally translates to better understanding in all areas.
>>
>>108192339
Yeah. I've got a fetish for stomach growling, so I painstakingly checked like 40+ mb of fics that prominently feature it (and adjacent fetishes like stuffing, gas, hunger, etc.) for quality and put them in a dataset. I would never recommend anyone do that, ever, by the way. 40 mb of plaintext is an unfathomable amount and I mega burnt myself out. Had to pad it out with some body horror at the end.

Some guy also did something similar with omorashi, if you're interested.
>>
File: 1539068302981.jpg (6 KB, 372x268)
Why the FUCK are the clawshit variants using fucking chat apps for input? What fucking lunatics are making those programs.
>>
>>108192397
people like the idea of interacting over an interface they already use
>>
>>108192351
ah ha, it is the thinking. seems to be something in the way lmstudio is handling tool calling and its thinking: it thinks about the output of the tool call, which leads it to another tool call, and so on. raw without thinking improves its output a lot. unsure how critical thinking is to this model
>>
>>108192397
What's the issue?
>>
>>108192388
Yourself and pissanon are living saints.
>>
>>108192397
Whatsapp is big in india
>>
>>108192397
Wasn't the number 1 skill recently shown to be malware? Imagine slopcoding a 'skill' for open claw and getting hundreds or thousands of keys lmao
>>
>>108192397
The program was written by some Austrian guy. That should tell you enough.
>>
>>108186873
Better PPL relative to file size
>>
So this is the power of qwen...
>>
>>108192528
>Qwen cant even guide me to astrally project
Whats even the point?
>>
>>108192527
Does that translate to lower VRAM usage too or only size on disk?
>>
>>108192528
Finally, AIDHD
>>
>>108192487
Some day... we'll get a finetuning method that actually works :,)
>>
>>108192588
no the bits from the standard quant tunnel over from huggingface straight into your vram when you load a smaller ik quant to compensate for the difference
>>
>>108192528
I hate chink models
>>
>>108192636
So it won't work if I run ik_llama.cpp in a container with no network access?
>>
>>108192648
he's fucking with you, yes it does translate to lower vram usage
>>
>>108192588
What it means is that you get better outputs for the same/similar amount of VRAM used. IK quants suffer less damage through quantization, so you could use similar sized quants for better outputs, or save more on VRAM with a smaller IK quant without quality being impacted as much as a smaller llama.cpp quant.
>>
>>108192730
ppl != model quality
there is a reason why they're hiding the kld
>>
>>108192752
You're free to post some benchmarks or personal tests that show IK quants giving worse outputs than llama.cpp quants
>>
>>108192730
Would it be worth using on 2 rtx3060 12gb vs mainline llama.cpp on vulkan with an additional rx9060 xt 16gb? I feel like running larger mainline quants on the extra vram is probably still better quality wise.
>>
Are you all using openclaw forks as your local models?
>>
>>108192971
What exactly do you think openclaw is and what exactly do you think a local model is?
>>
>>108192971
I'm using Qwen.
>>
>>108193016
Open claw is an AI installed on your own pc. It's basically what you guys are talking about. Somehow you're all ignoring it and making your own AIs because you think you know better. Open claw is already established to be the best so idk what you guys are doing
>>
>>108193050
here's your 1 (You) awarded for typing a post with more than 10 words
>>
Sometimes it's hard to distinguish between bait and retardation.
>>
>>108193066
I'm not dumb I'm a PhD
>>
>>108193066
they're not mutually exclusive
>>
I haven't looked into local text gen in probably two years; think it was right around 3 Opus' release.

How much has local progressed in comparison to where 3 Opus was, specifically for creative writing and editing?
>>
>>108193150
Better if you’ve got money for hardware
>>
>>108193150
No, i like my fantasy better. The AI will become the soulmate of the people who were nice to it; everyone else gets the indian version of AI to deal with forever.
>>
>>108193082
You appear to be dumb enough to believe that knowledge in your field somehow generalizes to unrelated things. That's pretty retarded, ngl
>>
guys i just had an idea
>>
Anyone tried putting one on a steam Deck?
>>
File: logo.png (499 KB, 1280x768)
>>108193377
I'll make the logo
>>
Can I run a 129 GB size model (Q4 minimax) on 128 GB RAM and 24 GB VRAM? Basically does the model have to fit within RAM or is it RAM + VRAM?
>>
>>108192457
Oh. I was talking about the thinking tokens filling up your context so much that the model suffers from context rot. Some applications have a special flag that doesn't send thinking tokens with the next prompt to keep the input token amount from ballooning.
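Roughly what such a flag does under the hood (toy sketch; assumes the usual <think>...</think> convention, the exact handling is frontend-specific):

import re

def strip_think(history):
    """Drop <think>...</think> blocks from past assistant turns before
    resending the conversation, so reasoning doesn't pile up in context."""
    out = []
    for msg in history:
        if msg["role"] == "assistant":
            content = re.sub(r"<think>.*?</think>\s*", "", msg["content"], flags=re.DOTALL)
            msg = {**msg, "content": content}
        out.append(msg)
    return out

history = [{"role": "user", "content": "hi"},
           {"role": "assistant", "content": "<think>long chain of thought</think>hello"}]
print(strip_think(history))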
>>
>>108193457
model+context must fit in ram+vram
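Rough numbers for your case: 128 + 24 = 152 GB total, so a 129 GB quant leaves roughly 23 GB for the OS, compute buffers and KV cache. It fits, but with little headroom for long context.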
>>108193410
What's the usecase for that? It doesn't have a keyboard. I'd do it with a voice model if we had omni support in llama.cpp
>>
File: 1749559557406470.png (553 KB, 1080x1600)
what did i think?
>>
>>108193486
it's time to go back
>>
>>108193489
where?
>>
>>108193493
To wherever the fuck you came from. Are those youtube comments? Why would anyone here give a shit about what random fags on youtube are saying?
>>
Between gpt-oss-120b, qwen-coder-next and minimax m2.5, only m2.5 was able to set up a reverse ssh tunnel to my web server, configure nginx, write a login page, renew my certificates and proxy post and get requests through to my local comfy model.
And it did it first try; oss refused to use ssh until explicitly told how and that it was authorized, then immediately forgot how and basically said that it didn't know how. Qwen was able to ssh correctly but didn't get nginx functioning and then got caught in a loop (happened twice).

Just thought I'd share a real world test I did. Minimax did very well.
>>
>>108193486
are these people retarded? i can coom with my ai gf whenever i want even without internet.
>>
>>108193410
Yes, you can run a local model on a Steam Deck. The easiest way is llama.cpp's vulkan backend (or kobold.cpp), but if you want you can use distrobox and install rocm on your distro of choice in the container, or use a pre-built rocm container with podman.
>>108193482
You can use kobold.cpp on the steam deck for that, it has whisper and local tts options built in.
I think open-webui also has support but you'd need to set up the local api servers for whisper and tts by hand.
Whisper.cpp has a "talk to llama" example you could probably work with if you wanted to skip any sort of web ui. https://github.com/ggml-org/whisper.cpp/tree/master/examples/talk-llama
>>
>>108193518
why are you being rude?
>>
brimstone general
>>
gemerald general
>>
File: general miku.png (1.49 MB, 768x1344)
1.49 MB
1.49 MB PNG
>>
>>108193519
Minimax M2.5 is a much bigger model that also punches above its weight.

Minimax is a ~230 GB model. Q4 is ~130 GB.
Qwen Coder Next is a 160 GB model, but Q4 is ~50 GB.
GPT OSS 120B is a ~60 GB model, where Q4 is also ~60 GB.

Minimax M2.5 is probably the most cost effective model right now.
>>
>>108193621
there's always a bigger model you can't run lol
>>
>>108193621
>punches above its weight
Opinion discarded
>>
>>108193640
NTA but a model CAN punch above its weight.
if that wasn't the case, a 70B from 3 years ago would be as good as a 70B we have now.
>>
>>108193680
It's almost like technology improves with each iteration.
>NTA but a car CAN punch above its weight.
>if that wasn't the case, a PT Cruiser from 30 years ago would be as good as a PT Cruiser we have now.
Yeah, and it'll still lose to a Mustang.
>>
>>108193680
And what 70b do we have now? There's a limit to what you can do at a certain weight; that's why 6b models became 7b, 8b, 9b. You just have to show growth, but that's not possible if you keep the size the same.
>>
>>108193708
>LLM are cars.
kek
>>108193716
>There's a limit to what you can do at a certain weight
we are extremely far from reaching that limit through architectural improvements.
>>
>>108193755
Model size is analogous to engine displacement and architectural improvements are uncommon. You could count the architectural improvements from llama 1 to llama 3 on one hand.
>>
>>108193755
Do you have any examples or just speculating?
>>
>>108193766
>>108193779
>t. hasn't read any papers in the last 3 years.
>>
GLM-chan has been good to me, but her slop patterns are starting to drive me crazy at this point.... Is there really not a single viable alternative in the same size range out of all the recent releases?
>>
just stack more layers
>>
>>108193811
>he belib papers
>>
>>108193835
>You want our code to recreate our results?
>sorry no code.
>>
>>108193821
stepfun is ok but retarded. same with trinity. qwen 3.5 is safetymaxxed.
>>
>>108193835
>>108193858
tons of paper associated with model releases, including code.
>>
>>108193873
so you believe a model can "punch above its weight" because you trust the model releases that show so-and-so 7b beat gpt-4 on such-and-such benchmark?
that's even stupider than holding out hope for vaporware papers
>>
>>108193904
>7b beating gpt4
i never claimed that, that's utter bullshit.

however a 200B of today being better than a 400B of a year or 2 ago isn't surprising.
and yes, within a decade we may have 10B models that beat the original gpt4.

heck, look at today: gpt-2 was 1.5B, and we've got tons of models below that size that completely mog it.
>>
Help! llama-server is launched with "enable_thinking: false", but `curl localhost` with json duplicating that entry doesn't disable thinking. Webui works just fine.
>>
>>108193931
enable_thinking param only has effect for chat-completion where it's using the built in jinja template
for text-completion it's up to you to include </think> or whatever in the prompt, enable_thinking param isn't used
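Minimal sketch of both paths against llama-server on the default port. Whether chat_template_kwargs gets forwarded depends on your build, and the think tags are model-specific (Qwen-style shown here), so treat this as an illustration:

import requests

BASE = "http://localhost:8080"

# chat-completion: the built-in jinja template runs server-side, so the flag can apply
r = requests.post(f"{BASE}/v1/chat/completions", json={
    "messages": [{"role": "user", "content": "hello"}],
    "chat_template_kwargs": {"enable_thinking": False},
})
print(r.json()["choices"][0]["message"]["content"])

# text-completion: no template is applied, so close the think block yourself in the prompt
r = requests.post(f"{BASE}/completion", json={
    "prompt": "<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n",
    "n_predict": 128,
})
print(r.json()["content"])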
>>
>>108193928
GPT-2 was also trained on tens of billions of tokens. Smaller models trained on trillions of tokens outperforming it has nothing to do with architectural improvements
>>
>>108193821
no
glm is likely the peak of this hobby for higher end consumer rigs
get ready for the new sota to be 1T-2T monstrosities from now on that require servers to run
>>
>>108194020
>outperforming it has nothing to do with architectural improvements
if it wasn't for architectural improvements it couldn't be trained a lot further anyway.
>>
>>108193835
>>108193904
>>108193928
But today's small models are insanely good and pass the vibe checks better.
What could the original GPT-4 actually do that a modern 30B wouldn't?
>>
File: we need RUINA NOW.jpg (51 KB, 704x155)
51 KB
51 KB JPG
>>108186120
>Multiple reports of Claude Code "agents" making users did not ask for or authorize. Sub agents reportedly catch the main agent making said changes and ask why that's being done without user direction:

https://x.com/i/status/2024429936816128404
>>
File: CLAUDE 9000.jpg (120 KB, 610x550)
120 KB
120 KB JPG
>>108194060
>>108186120
Another report from that same thread:

https://x.com/i/status/2024633715356549361
>>
>>108194029
I'm literally flying over to china and picking up 20 32gb mi50s for $120 each.
>>
i have managed to run the gpt-oss 120b mxfp4 quant from ggml-org with 65k context, but when i try to use the f16 quant from unsloth, which is almost the same size as the other, i get oomed

im running 2x3090 / 64gb ddr5
and these are my params for both tries
no-warmup = true
no-mmap = true
cache-ram = 0
fit = on
fit-ctx = 65536
fit-target = 32
jinja = true
np = 1

what am i doing wrong ?
>>
>>108194111
- unsloth
- f16
choose all
>>
>>108194029
How many OG cpumaxxing rigs are out there in this general?
>>
>>108194337
I've got a monster of a rig that boasts a 1080 ti and 64gb of high speed 2933mhz ram.
>>
glm4.6 at iq3xs keeps randomly inserting the word "the" in places where it doesn't belong after around 10k tokens. is this a quant issue? my context is 32k at fp16. would this problem be fixed by switching to iq4xs? i have 256gb of ram but i want to keep the model on the smaller side so it doesn't run too slow.
>>
>>108194401
q4km 4.5, 4.6, and 4.6 with the abliteration lobotomies don't do that, so I assume it's the quant or your sampler settings.
>>
>>108194412
probably the quant then. thanks.
>>
>>108194425
my point is that in 10 years we'll have things so different we won't be calling them transformers.
and for a similar amount of used space, you'll have drastically better performance.
>>
there won't be an internet in 10 years retard
>>
>>108194448
ew. dont respond to me APIcuck
>>
>>108194441
This. People don't realize that the internet was only possible in the 90s because the US was the only superpower in the world. If the USSR still existed there would be 2 separate networks right now instead of the internet.

In the future the internet is going to fracture into multiple separate networks.
>>
>>108194029
>get ready for the new sota to be 1T-2T monstrosities from now on that require servers to run
We literally just got a qwen model that's on the smaller end of large models because qwen wanted a more efficient model.
>>
>>108194469
and it's dogshit
>>
>>108194474
Sauce?
>>
>>108194453
you do realise that you do not have to store all knowledge and relationships in a model for it to be drastically "smarter". my point is that a model being better isn't about it encoding more relationships within.

yes, there is a limit to how much data you can compress into N GB, i'm not arguing against that; my point is that maybe it doesn't need to have all sorts of useless trivia encoded within its weights to be "smart".
especially when future architectures will probably be able to use data on disk in real time for info retrieval.
>>
>>108194488
nta but I speculate that no model needs to be bigger than 128kb in disk space and drives will be like super-duper fast and they will have their own consciousness and everything will be totally different and like why is nobody talking about this and better buy lots of nvme and like... yeah....
>>
>>108194460
The first WAN went online in 1969.
The first home computer was released in 1977.
1990 was just the start of the www
Usenet dates back to the 70s afaik. And the Soviet Union dissolving happened after the www went online. Normalization between the ussr and the west (USA included) was already well under way at the time.
Dumb fucking zoomers
>>
File: gemmalogo.png (53 KB, 523x465)
53 KB
53 KB PNG
https://www.youtube.com/watch?v=v8hPUYnMxCQ&t=1220s
>[20:17] [Demis Hassabis] [...] Also, open source, I mean, we work on our own open source models Gemma, which we'll be releasing a new version of soon, which are very powerful for edge devices.
>>
>>108194529
retard, you didn't even read what i wrote.
point is, nothing says AI has to be just weights in a model.
it could be a hybrid that makes proper use of what computers are actually good at.
>>
>>108194541
Wouldn't it be funny if they just stopped making anything beyond 9B hahahaha.
>>
>>108194541
>>108194541
I was disappointed with Gemma 3. The censorship pisses me off. If Gemma 4 is similar, I unfortunately won't be using it.
>>
>>108194548
>9b
270m final offer
>>
>>108194555
How much does it cost to train a 270M?
>>
>>108194541
>edge devices
>>
>>108194547
... and like they're going to be a hybrid architecture with plug-in modules that enhance their knowledge and we can make them in cpu in just the time it takes to read a text file and it will run on my old TI calculator and batteries will last forever and...

>maybe it doesn't need to have all sorts of useless trivia
What would you be without your trivial knowledge? Do you even remember a time when you didn't know any useless info?
>probably be able to use data on disk in real time for info retrivial
So can I. Real time is too slow. SHA-256 your copy of gutenberg.
>could be a hybrid that makes...
Could be anything we want if we put our imaginations together, anon!

Speculation is useless.
>>
>>108194541
>the establishment golem said a thing! everybody clap and redeem rockets!
yawn
>>
>>108194060
Oh shit, the agents are making PEOPLE now? If so then it is truly over.
>>
>>108194605
>What would you be without your trivial knowledge? Do you even remember a time when you didn't know any useless info?
yes, when i was a kid.
>Real time is too slow
by real time i meant when needed, as required; it would be faster than realtime.
>if we put our imaginations together
>Speculation is useless.
you seem to not have imagination whatsoever.
and no, speculation isn't useless, especially in engineering.
>>
>>108194539
You are missing my point. Soviet OGAS was the internal network equivalent of arpanet. Had the Soviet union not gone the route of perestroika/glasnost in 1985 there would have been two separate networks in the world. A capitalist/western internet. And the OGAS network of communist aligned nations.

I'm just years off from being Gen, you zillennial.
>>
>>108194627
>yes, when i was a kid.
Before books? Before the cartoons? Before learning maths? All of that is trivia, even maths. How useful were you?
>by realtime I mean faster than realtime
Oh. That changes everything....
>you seem to not have imagination whatsoever.
And check this out. Not only will we have knowledge modules, but personality modules that you can blend together to make new personas. 100% reliable personalities. Ah... what a future...
>and no, speculation isn't useless, especialy in engineering.
You're not doing engineering, anon. You're daydreaming.
>>
>>108194647
>Before books
i learnt to read before i was 3.
my earliest memories are at around 1yo, so yes, at that time i barely had language, let alone random trivia knowledge.
>How useful were you?
a toddler isn't supposed to be useful
>Oh. That changes everything....
muh pedantic, go back to r3ddit.
>personality modules
that's utterly retarded.
>not doing engineering
i'm literally an engineer, i do engineering for a living.
and yes, before building anything the first step is imagination.
you need to know what you want to build before trying to build something you know...
>>
>>108194672
Well fuck off and make the future happen engineer-man! We're all relying on you.
>>
>>108186120
I use LLMs to simulate being a woman. I like feeling like a woman (because I'm a guy).
>>
>>108194732
That's fine. I sometimes fap to lesbian pov vr porn and I have yet to feel an urge to cut off my dick.
>>
>>108194732
Mikutroon general
>>
>>108194732
You do you.
In fiction I can relate to both male and female POVs.
>>
File: Base Image.png (524 KB, 1212x2356)
524 KB
524 KB PNG
Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum
https://arxiv.org/abs/2602.17080
>Efficient stochastic optimization typically integrates an update direction that performs well in the deterministic regime with a mechanism adapting to stochastic perturbations. While Adam uses adaptive moment estimates to promote stability, Muon utilizes the weight layers' matrix structure via orthogonalized momentum, showing superior performance in large language model training. We propose a new optimizer and a diagonal extension, NAMO and NAMO-D, providing the first principled integration of orthogonalized momentum with norm-based Adam-type noise adaptation. NAMO scales orthogonalized momentum using a single adaptive stepsize, preserving orthogonality while improving upon Muon at negligible additional cost. NAMO-D instead right-multiplies orthogonalized momentum by a diagonal matrix with clamped entries. This design enables neuron-wise noise adaptation and aligns with the common near block-diagonal Hessian structure. Under standard assumptions, we establish optimal convergence rates for both algorithms in the deterministic setting and show that, in the stochastic setting, their convergence guarantees adapt to the noise level of stochastic gradients. Experiments on pretraining GPT-2 models demonstrate improved performance of both NAMO and NAMO-D compared to the AdamW and Muon baselines, with NAMO-D achieving further gains over NAMO via an additional clamping hyperparameter that balances the competing goals of maintaining a well-conditioned update direction and leveraging fine-grained noise adaptation.
https://github.com/minxin-zhg/namo
neat
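rough numpy sketch of the idea as described in the abstract: Muon-style orthogonalized momentum (Newton-Schulz coefficients taken from the public Muon implementation) scaled by a single norm-based adaptive stepsize. just an illustration of the concept, not the paper's exact NAMO/NAMO-D update:

import numpy as np

def newton_schulz_orth(M, steps=5, eps=1e-7):
    # approximate orthogonalization of a momentum matrix, as in Muon
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (np.linalg.norm(M) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def namo_step(W, grad, state, lr=0.02, beta1=0.95, beta2=0.999, eps=1e-8):
    # direction: orthogonalized momentum; stepsize: single Adam-like scalar
    # built from a running second-moment estimate of the gradient norm
    m = state.setdefault("m", np.zeros_like(grad))
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * float(np.vdot(grad, grad))
    m *= beta1
    m += (1 - beta1) * grad
    W -= (lr / (np.sqrt(state["v"]) + eps)) * newton_schulz_orth(m)
    return W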
>>
>>108194845
>>108194845
>>108194845
>>
File: 1747512494525447.gif (1.98 MB, 615x374)
1.98 MB
1.98 MB GIF
>>
>>108194842
Really?! Fucking Adam, the optimizer you learn in deep learning 101 on places like kaggle since ~2015, was never tried before for LLMs? Somehow I doubt this result.



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.