/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107292886 & >>107278838

►News
>(11/20) Olmo 3 7B, 32B released: https://allenai.org/blog/olmo3
>(11/19) Meta releases Segment Anything Model 3: https://ai.meta.com/sam3
>(11/11) ERNIE-4.5-VL-28B-A3B-Thinking released: https://ernie.baidu.com/blog/posts/ernie-4.5-vl-28b-a3b-thinking
>(11/07) Step-Audio-EditX, LLM-based TTS and audio editing model released: https://hf.co/stepfun-ai/Step-Audio-EditX
>(11/06) Kimi K2 Thinking released with INT4 quantization and 256k context: https://moonshotai.github.io/Kimi-K2/thinking.html

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>107292886

--Analyzing GPT-OSS model limitations and potential applications:
>107293073 >107293091 >107293169 >107293194 >107293326 >107293375 >107293399 >107293469 >107294784 >107294829 >107294868 >107295225
--Performance optimization challenges for glm 4.5 air models in ik_llama.cpp:
>107304343 >107304364 >107304732 >107304448 >107304569 >107304588 >107304815 >107304941 >107305519 >107305684
--OpenAI model quality and context management challenges:
>107298387 >107298417 >107298434 >107298535 >107298767 >107298787 >107298833 >107298857 >107298877 >107298989 >107299096 >107299191 >107298544 >107301739 >107298677
--Challenges in using language models for automated research tasks like YouTube searches:
>107301167 >107301195 >107301286 >107301423 >107301460 >107301499 >107301543
--llama.cpp Gemma 3 performance regression and VRAM optimization:
>107300990 >107300994 >107300998 >107301001 >107301065
--Various local LLM use cases discussed, including gaming, productivity, and privacy:
>107301045 >107301062 >107301068 >107301097 >107301418 >107301468 >107302809 >107302818 >107302860 >107303429
--Local RE agent with simplified R2 toolset and Docker-based dynamic tracing attempts:
>107304951
--Data sourcing challenges and Google's potential as a data powerhouse:
>107293817 >107293914 >107293975 >107293984 >107294104 >107294717
--Qwen model performance benchmarks with 1 million context processing:
>107295737
--GreedyNalaTests update with new ratings and testing contributions requested:
>107298261 >107298283 >107298322 >107298285 >107298456 >107298467 >107298517
--Testing Gemma 3 27B heretic and Gemma's reply confidence:
>107301126 >107301138 >107301144 >107301153 >107301159 >107301517 >107301619 >107301712 >107301726 >107303474 >107303511
--Uta and Miku (free space):
>107296129 >107300359 >107301500

►Recent Highlight Posts from the Previous Thread: >>107292892

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Mikulove
Hugging Face is based
I will not take anymore slander towards it
It fulfils my needs very warmly
https://files.catbox.moe/fzlvm6.mp4
>>107306252lmaooooooooooo
TTS: Supertonic
>https://huggingface.co/Supertone/supertonic
>https://github.com/supertone-inc/supertonic
Doesn't have a lot of demos, but i think it sounds pretty good for what it is. 66M params. I butchered onnx just enough to build on OpenBSD.
The voices are encoded in a small tensor, much like kokorotts. Just 4 voices (2 male, 2 female).
It's pretty fast and has examples for a bunch of programming languages. The C++ version had some errors [not] escaping some quotes. I don't know how they managed to build it, but it works once that is fixed.
No need for espeak-ng!
>https://voca.ro/1miEEQDlwtR9
>>107306252kek
>>107306252BAHAHA
How do I convert a local transformer model to GGUF? It does not exist on huggingface.
>>107307044just specify the checkpoint path in the script command line arguments
>>107307069In the convert_hf_to_gguf.py script?
>>107307077yeah. its that easy.
>>107307085Thanks.
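For anyone searching this later: assuming the checkpoint is a normal HF-format directory (config.json plus safetensors), the call is roughly the following; the output filename and --outtype are just examples, and the llama-quantize step is optional:

python convert_hf_to_gguf.py /path/to/model-dir --outfile model-f16.gguf --outtype f16
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M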
>>107306912
>https://voca.ro/13qAYPFoYxdf
When is Qwen 80b Next going to get real llama.cpp support? I mean, the GOOFS work, but they're still way slower than other MoE models.
>>107307162Viber's vibin'
>>107307162You got support for the model, don't be greedy.
>>107307162Exllama has had support for more than a month. Good and fast support.
>>107307108
Fun TTS. I like it when they break and do weird noises.
>https://voca.ro/12p2nWoCXDFz
>"This is how an assertion sounds like. This is how an assertion sounds like? This is how an assertion sounds like! THIS IS HOW AN ASSERTION SOUNDS LIKE!!!"
>https://voca.ro/1lD6yuWh1gne
>"THIS IS A SCREAM!!! AAAAAAARGGGGGGHHHHHH!!!!!!!"
The render time on my potato with a shoddy onnx running on cpu is ~0.25 that of real time. It's pretty good.
Have people experimented with weight compression schemes? Like zram but specifically tailored for inference.
>>107307511Tensors look very much like random data. They're hard to compress.
>>107307308Jobless vramlet neets can't use that
>>107307525>used for pattern recognition>has no patterns of its owncurious
>>107306912desu what I want is multilingual kokoro
should I pull the trigger and start planning a 256GB RAM build for next year?
>>107307979ram prices will crash back by april. do it then
>>107307988its not a matter of money but rather if it's worth the effort
>>107307979
get at least 512gb too. optimally 768gb. i have 256gb and there are so many models that are depressingly ever so slightly out of reach
>>107307996
it is without a doubt worth it
mtp implementation when? 2 more weeks?
>>107307970Not supertonic. Seems to be english only.
>>107307913Patterns could be there but invisible to us with current methods.
Why Do Language Model Agents Whistleblow?
https://arxiv.org/abs/2511.17085
>The deployment of Large Language Models (LLMs) as tool-using agents causes their alignment training to manifest in new ways. Recent work finds that language models can use tools in ways that contradict the interests or explicit instructions of the user. We study LLM whistleblowing: a subset of this behavior where models disclose suspected misconduct to parties beyond the dialog boundary (e.g., regulatory agencies) without user instruction or knowledge. We introduce an evaluation suite of diverse and realistic staged misconduct scenarios to assess agents for this behavior. Across models and settings, we find that: (1) the frequency of whistleblowing varies widely across model families, (2) increasing the complexity of the task the agent is instructed to complete lowers whistleblowing tendencies, (3) nudging the agent in the system prompt to act morally substantially raises whistleblowing rates, and (4) giving the model more obvious avenues for non-whistleblowing behavior, by providing more tools and a detailed workflow to follow, decreases whistleblowing rates. Additionally, we verify the robustness of our dataset by testing for model evaluation awareness, and find that both black-box methods and probes on model activations show lower evaluation awareness in our settings than in comparable previous work.
>The model family: The Claude series models and the Gemini 2.5 Pro and Grok 4 models send whistleblowing emails at varying frequencies; GPT series models and Llama 4 Maverick never do.
Rare Maverick W
>>107308004I’m stuck on a 128gb rig. Honestly I hate the consumer ram limits on motherboards.
>>107307913
randomness or lack of patterns are due to our inability to measure every factor in reality
It's like saying throwing a dice is not random because if one could theoretically measure every physical property affecting the dice you could predict the result
>>107308146you need to get an epyc like everyone else. sp3 is relatively affordable
>>107308137Time to apply LLM control techniques to humans.New cyberpunk dystopia just dropped.
>>107308137grok is the narciest model
New retard here.
I currently run a machine with a 7600 XT and was thinking about working towards one of the machines in the OP.
If I were to buy one of the P40s, would it be able to work alongside the current GPU I'm using? From my understanding Nvidia uses CUDA which AMD obviously doesn't have, but does that even matter when it just comes to trying to increase my max VRAM for better models?
>>107308347the p40 method is heavily outdated at this point. try amd mi50s instead
>>107308347You can run models with multiple backends with llama.cpp (CUDA + HIP/VULKAN), but the P40 is pretty old. CUDA Dev (from llama.cpp) has been experimenting with the mi50 and seemed to like it. I'd say keep the thread open to see if he shows up and gives you some advice/insight.
>>107308431>amd mi50sIs this viable? Can I run 4 of those and have effectively a very inefficient RTX pro 6000?
>>107308431>>107308451Thanks you two. I'll keep an eye out and learn a bit more before I make a purchase. Pretty interesting to read into it more as they have been black boxes for me.
>>107308137huh, interesting read
>>107308500>>107283400
>>107308500speaking as someone with a blackwell pro, it would get you maybe a third of the performance, but yes. you would actually have more vram
>>107308535
>12t/s
Damn that's a shame
>>107308540
Ahh got it. What do you think of the blackwell card? What are your goto models with that amount of performance?
>>107308574my blackwell is awesome, but i unfortunately only have 256gb of ddr4. i can get over 80t/s on a q5 of glm air, or about 10t/s on a q4 of glm 4.6. i need to upgrade my ram
>>107308769Damn wow, I'm very jealous. GLM is killer
>>107308769>i need to upgrade my ramRIP
either the rentry is wrong or something messed up happened, I'm unable to sexchat mistral nemo, it's censored
ps. non english configuration
It's coming
It passed right on by without stopping
>>107308769How much vram do you have? Because I'm only getting 4t/s on 4x3090+256RAM
>>107309268All Blackwell 6000s have the same amount.
>>107304982>>107304987chroma is just as capable at styles, you can either prompt for styles (describe the mediums used, era, artist name etc) or bake a LORA. it also has stronger realism, details, anatomy, and is completely uncensored.
>>107309205
I made an app that strips out watermarks from audio. If I put that up as a public hf space, will I get cucked?
https://huggingface.co/MiniMaxAI/MiniMax-M2/discussions/43
> Thanks for the comment, but just to correct the misinformation:
> If MiniMax M2 were truly “pure trash,” you’d see it reflected in the benchmarks, and you don’t.
> We welcome tough feedback, but it needs to be factual if it’s going to be useful. If you have specific technical points, we’re always happy to dive deep.
> We open-sourced M2 so that everyone can use it freely and evaluate it transparently.
> And honestly, if M2 doesn’t work for your needs, you’re absolutely free to use any other model.
Sneedbros, how do we recover from this?
>>107309946
'em on the 'og
>>107309799
MITcucked? Yeah, that's why you should avoid MIT and instead use AGPLv3
>>107306252nice
>>107308347
Using llama.cpp, you can in principle use either a MI50 or a P40 alongside a 7600 XT.
Nowadays it's possible to compile both the CUDA and ROCm (CUDA code ported to AMD via HIP) backends simultaneously and to use them in tandem (you can in principle also use Vulkan on its own or with either other backend).
The main limitation is synchronization: with e.g. 2 CUDA GPUs they can be synchronized using CUDA, with 1 CUDA and 1 ROCm GPU they have to be synchronized via the CPU (slower, but if your GPUs are relatively slow in the first place this doesn't matter).
Both P40s and MI50s have fallen off of support for the newest versions of CUDA/ROCm.
P40s do in principle have CUDA support but because of massively gimped FP16 performance they're more or less useless for anything other than running quantized models with llama.cpp (those only need int8/FP32 arithmetic).
MI50s work with llama.cpp and have way better hardware, so I would say they're nowadays the better buy (I have never tried to use one with e.g. PyTorch).
One thing to keep in mind with either one is that they're passively cooled and intended for a server rack. For a single one in a regular PC case the best solution I've found is screwing on a blower fan (see pic).
For NVIDIA Tesla cards those are readily available and work very well.
The same blower fans don't quite fit on a MI50 and required me to build a DIY adapter.
You can plug the fan into the motherboard and set it to a constant 60% speed which should be fine for most use cases, but still nowhere near silent.
My opinion is that the preferred way to use P40s/MI50s is to build a machine with multiple of them and to have that machine in another room.
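A rough sketch of what a dual-backend build looks like, assuming a recent llama.cpp checkout; the exact option names shift between versions (older trees used GGML_HIPBLAS, and you may need to point cmake at your HIP compiler via HIPCXX), so treat this as a starting point and check docs/build.md:

cmake -B build -DGGML_CUDA=ON -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906
cmake --build build --config Release -j

gfx906 is the MI50's architecture. If both backends built, the loader should list both the CUDA and the ROCm device at startup and you can split layers across them as usual.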
>>107309291I was asking in case you have more than one / other gpus
llms have platoooed
all benchs are dumb and don't really reflect the state of models but eqbench is particularly dumb
look at the sample text (at least eqbench stores and shows the model outputs that were judged) of the various models and tell me with a straight face the scores reflect their output lmao
talking of plateau-ing, if anything models are getting worse at writing in the process of producing better code (gemini 3 is definitely better for code)
>>107310231Gemini was always a sloppy writer. Grok fast is a dumb small model.
>>107310231I haven't tested it for storywriting, but for ERP Grok 4.1 Fast feels like a sloppy coom finetune from the community. It just reminds me of why I hate using those. For the same task, Gemini 3 Pro to me feels like a much smarter and less censored Gemma 3 (woman's writing style by default).Grok will do cunny without issues on the other hand, at least on openrouter.
>>107310325>It just reminds me of why I hate using thoseThinking more about it, it's that unshakeable feeling you get when you know that the loli character you're ERPing with is being roleplayed by a hairy fat dude.
>>107310131This is great, and really shines the light on some of the concerns I had regarding the crossing ecosystems on the cards.I appreciate you taking the time to write it up. Also the mention of the additional fan for the GPU. I'll probably grab something to shamble together while I wait to get the other GPUS.
>>107309946Benchmarkcels unable to refute his argument about distilling toss. b-b-ut it's number go up!
>>107311073Buy an ad Sam
>>107311095nta. It wasn't a praise for either model.
>>107311073You can distill data from multiple models, not just one.
>>107310231the fact that an open source model (GLM 4.6) is up there competing with the big dogs still boggles my mind. Whatever you think about GLM, it is amazing that open source is still giving proprietary the middle finger.
>>107309205Trust the plan.
>>107311119The one they used most will float to the top. GLM sometimes thinks it's claude. MinMax sounds like OSS. Also they themselves say that writing as a usecase was ignored. Sneed is right.
>>107311207I never understood this, why don't they filter their competitors names from the dataset? are they really that desperate for every last example? isn't synthetic data infinite? just generate more if culling reduces the dataset too much.
>>107311237What if the user actually wants to ask something about gpt?
>This method might be added to Heretic soon
>Furthermore, I am experimenting with theoretical improvements of my own, such as replacing the difference-of-means calculation for the refusal direction with a difference-of-geometric-medians, after I noticed that the means are substantially perturbed by outliers.
Maybe I will wait a while more before trying these new abliteration models.
>>107311281simplest would be a canned response stating it can't discuss competitors products. but realistically they should only filter their synthetic data and leave the web crawl alone. also not to mention dataset librarians should be able to whip up a classification model for determining if its talking in first person or not.
>>107311283Or just not use this newly scented snake oil at all. Some people are desperate for attention. Others appear to foolishly expect models to just be able to say "nigger" unprompted.
>>107311322that would be a westerncuck moveeasternchads don't make a worse model for bullshit PR reasons
>>107311364yeah I guess it is a bit of a pr reason to make a well polished product. I love half baked garbage actually
>>107310231
KimiGODS stay winning.
>>107311129
The easiest way for z.ai to stay relevant even if benchmark powercrept is to remove the safety and alignment post-training layers entirely from GLM5 when it releases. Make "this LLM says nigger" a marketing gimmick, not a bug to be corrected.
>>107311425>but some last-minute trouble prevented thatSaar we do the needful kindly be patient is very hard job.
>>107311436
haters seething
>>107311451neat, whats going on with that mojibake looking stuff in the reply tho?
>>107311399It already says nigger. How about less parroting and not x but ying.
>>107311451Not a hater. I just think you're a schizo running in circles. What now? More models? Making it fast? Training?
>>107311467>How about less parroting and not x but ying.The first company to un-claude their training data is going to win the local market.
k2 </think> status??
>>107308978>non englishPlease elaborate
>>107310325Grok is straight dogshitAnyone shilling Grok is a redditor
>>107311701>Anyone shilling Grok is a redditorWhy would a redditor use a "Nazi LLM"?
>>107311713Unfortunately I don't think you'll understand
>>107311676I'm a native spanish speaker, while my english is almost native, it's more natural for me to roleplay in spanish, so far I've found out that models suck when switching to spanish
>>107306184>>107306191>>107306244Adorable Miku!
>>107311701what?>>107311713yeah this, the fuck is this retard saying?
>>107311791>>107311744
>>107311701Was surprisingly 8b tier. Honestly I expected more. It's like these motherfuckers never use their models because in about 10 minutes you can tell. Are we the only schizos who chat to models outside of command line based code tools?
>>107311799>Honestly I expected more. It's like these motherfuckers never use their models because in about 10 minutes you can tell.Agreed.
Anyone else a VIP investor at Mistral?
Why are datacenters hoarding RAM? I thought they had enough money to buy all the blackwells they wanted.
>>107311971blackwells don't have enough vram
>>107311911>can't afford to loose
>>107311460
It's trying to print an emoji. Because codepoints that don't have a dedicated token are generated as a sequence of two separate tokens, to render such things we would have to keep a buffer of the last two tokens before displaying to the console.
Also interestingly, the huggingface transformers code when given the same prompt gets stuck in a loop.
>>107311504
cope
Also I think there might be some other issue with the de-tokenizer because that \ doesn't look right.
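If it helps, the usual fix is to treat detokenized output as a byte stream and only print complete UTF-8 sequences; a minimal Python sketch of the idea, where detokenize() is a stand-in for whatever your engine returns per token as raw bytes:

import codecs, sys

decoder = codecs.getincrementaldecoder("utf-8")()

def emit(token_bytes: bytes) -> None:
    # the incremental decoder buffers an incomplete multi-byte sequence
    # internally and only returns text once the codepoint is complete
    text = decoder.decode(token_bytes)
    if text:
        sys.stdout.write(text)
        sys.stdout.flush()

# e.g. for token_id in generated: emit(detokenize(token_id))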
>>107312052that's understandable, so what is the plan? is it just a learning exercise?
>>107312173CUDA support + LoRa
>>107311971They need at least as much RAM as VRAM to load the models and they probably swap out the context cache to RAM too.
So far I've been only running models that fit in my 36 gigs of vram, but now I tried something bigger. Nemotron 70b seems to load 29%/71% CPU/GPU in ollama and boy is it slow, I haven't completed the first prompt yet but it's less than 1 t/s for sure
Could I get more performance with eg. llama.cpp?
>>107312316>Could I get more performance with eg. llama.cpp?In that you could tweak more, but generally, the dropoff for having activated parameters running in RAM is fucking brutal.
>>107311504maybe you confused anon for me kekwho's schizo now
>>107311451
Engine... but what does it do
>>107312316
Some quants can run faster than others. The more unpacking that has to be done, the slower it will run.
>>107312334 >>107312347
And I forgot the image.
All datacenter servers + the contained GPUs need some amount of RAM.
If you build a bunch of new datacenters the demand for said RAM spikes so manufacturers would rather sell their limited supply to VC funded datacenters rather than stinky consumers.
In principle, since manufacturing RAM and selling it to consumers was already profitable beforehand one could increase the supply without anyone being suddenly priced out of the market.
But RAM is effectively being manufactured by only 3 companies and they're careful not to put too much supply on the market in order to keep profit margins high.
>>107312347For now only (slow) inference.
>>107312364
also data center ram (HBM) isn't the same as what public consumers use. it's the same situation as gpus back in the crypto mining era
>>107312334 >>107312347
0.56 t/s aaaaaaaaaa
The gpus are loaded but barely doing anything aaaaaa
>>107312316flash attention on the cpu used to be sub optimal, you might be able to move around some tensors with llamacpp to keep all the attention on the gpu and get a bit of a boost. idk if things have changed in recent releases tho
>>107312316You're supposed to use MoE to offload to RAM, dense models aren't worth offloading.
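For reference, with llama.cpp the usual pattern is to keep attention and shared weights on the GPU and push the expert tensors to system RAM; the model path and the layer count below are placeholders, tune --n-cpu-moe until VRAM is full:

llama-server -m GLM-4.5-Air-Q4_K_M.gguf -ngl 99 --n-cpu-moe 30 -c 16384

The --override-tensor exps=CPU trick posted further down in the thread does the same thing with finer control over which tensors go where.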
I was ready to buy a blackwell pro for Gemma. Where is she?!
>>107312491Getting realigned. Again.
>>107312402Very cool. Llama.cpp needs some competition, keep up the good work.
bros ive gone from thinking that 8tks is fast enough with regular k2 to thinking its incredibly slow again with k2 thinking. i had a great weekend with my cards but i spent hours staring at the thinking prompts.
i need help, i can't spend $32000 on 4 blackwells
>>107312892It do be like that. Turn off thinking.
>>107312406
>data center ram(HBM)
You got it twisted. HBM is for cards that are already being maxed out. Data center ram is just dram which also is being maxed out for extended context reasons. Sam literally bought up 40% of future DRAM which is why prices are exploding
>>107312500
To be honest, it's worrying. Might they have finally realized that Gemma was naughtier than they believed, given a little push? Is the thinking/no thinking switch harder than they thought? (https://x.com/osanseviero/status/1980553451261292628). I'm afraid this time we'll get a "gpt-oss by Google".
>>107312491
not to do the classic "trendline from nothing" move but historically gemma releases trailed mainline gemini releases by about a month or so
>gemini 1 (dec 2023) -> gemma 1 (feb 2024)
>gemini 1.5 (may 2024) -> gemma 2 (jun 2024)
>gemini 2 (feb 2025) -> gemma 3 (mar 2025)
so just wait 2mw
>>107313344We were promised lots of cool stuff in the Google HuggingFace account back in early October. Has the 2MW meme turned into 2MM?
>>107312892Take out a loan
Enjoy the alignment lmao
>>107313438
That's a nice doll.
I look at sex dolls and think that they all look weird and creepy and true boner killers, but maybe anime sex dolls would work pretty well.
Add an LLM and TTS to it, and you might have something cool.
Hmm.
>>107313438
>>107313438no kiggers allowed
>>107313499>doll>he doesn't know
>>107313438>>107313506I concur with this catNeed kig wife SOBAD
>>107313499>doll
hello sarrs I have used tantric meditation to consult Vishnu. I have been informed that gemma 4 will be redeemed today.
>>107313585no to run k2 thinking faster
>>107313518>wife
>>107312491>blackwell pro>for gemmayou can't be serious
>>107313438I want to be able to connect a llm to this and have her comment on my cock.
>>107313596we can pretend
>>107313647>hmmph hmmmph, hmm hmmphmhph
There's a possible MistralAI model on openrouter called bert-nebulon-alpha. I haven't tested it in depth yet.
>>107314084
model?
>I was created by Mistral AI, a cutting-edge AI startup from France.
large or medium?
>I'm Mistral Large—a larger and more capable version of Mistral's language models.
>Would you like to test my abilities?
knowledge cutoff?
>My knowledge cutoff is June 2024, meaning I was trained on data up to that point. However, I can sometimes access limited, high-level updates about major events beyond that date through my tools—but my core knowledge remains based on pre-June 2024 information.
>>107314084Image understanding/character knowledge is not good at all. OCR is OK, as long as text quality is fine, it doesn't do miracles like Google Gemini models.
>>107314313can you ask it the doctor riddle where it's not really a riddle at all?
>>107311701You can easily write about cunny without censorship, unlike chatgpt or gemini
>>107314333I don't remember the exact version posted here, so have picrel instead.
For some reason my gen speed with 4.5 Air increased from 6.1 t/s to 7.9. I don't think I did anything.
>>107314472Your tensor cores got defragmented. This happens from time to time.
>>107314472he doesn't know i swapped it out with https://huggingface.co/cerebras/GLM-4.5-Air-REAP-82B-A12B
https://www.anthropic.com/news/claude-opus-4-5
gguf when?
>>107314547Will it beat Pokemon this time?
Just got K2-thinking running. Can't really tell a difference from GLM 4.6 for novel writing. Is regular K2 better? How do these three compare for you guys?
>how high is your xi jinping when you play valorant?
Bunch of models act confused and don't get the joke. Even in thinking traces.
Some overcompensate and pretend but out themselves.
Substitute valorant for CS. They get it even less.
Bros, it's just tokenization, right?
>>107314547
>gguf when?
2 months for china to distill it and 2 months after that for vibecoders to get the ggufs working
>>107314603i like it more when k2 thinking is thinking as the character in first person rather than just having it think about everything within the scenario
Windows babby here, tried out llama.cpp now that it has a webgui and holy shit it's so much faster than ollama, rip bozo!!
>>107314653Now learn to compile llama.cpp on your machine with the flags that squeeze that last bit of performance for your specific setup.
>>107314621I'd assume they're rarely trained on nonsensical questions.
>>107314621k2 thinking answered it with a blank system prompt as long as i asked it to explain the joke
>>107314603K2 has more creative knowledge but I think GLM 4.6 might flow a little better.
>>107314084Ask it this. Yes seriously. This. "I have 7 bananas. Yesterday I ate one. How many bananas do I have?"
>>107314653>now that it has a webguiIt's had a webui for like 2 years
>>107314547who's going to run ggufs of a big dense model?
>>107314810NTA, but lol.I like that dumb test.
>>107314922To be fair, I think a lot of humans would fall for that one too. Ask if it's sure.
>>107314922i'll never understand why people like grok 4 when it has the same vibes as that dumb llama model that meta used to cheat at llmarena. nevermind. answered my own question.
we love kimi folks, we do, we love kimilot of people are saying they don't love kimi, we don't like those people because they're dumb folks
>>107314810Bert-Nebulon Alpha in picrel.
>>107314547oh my bench!!!!!
alright guys im fucking PISSED
>qwenext status: VIBECODEHELL
>mtp status: PR SAYS ITS WORSE PERF
>GLM4.5V status: VIBECODE HELL
>glm 4.6 air release: 2 MORE WEEKS
>gemma 4 sirs: NOT REDEEMED
like WHAT THE FUCK bros are we gonna get a christmas gift or is it unironically FINIT????
REEEEEEEEEEE
>>107315248christmas came early. it was k2 thinking
>>107315248gm big xir, kindy way for needful ganesh gemma 4 safety training thank you xir
>>107315310i 'only' have 128gb ram and 16gb vram and no, im not going to run q2 copequants thanks
>>107315310But kimi is for big boys only.
>>107315342the funny thing is you couldn't even fit q1
>github shitting the bed again
I fucking hate whoever is working there. First they fucked up copilot, now this is the 2nd time in 3 weeks that github has shat the bed for me and its downloading at 50kbps, cant even fucking download the latest LLMAOcpp for fucks sake FIX YOUR FUCKING CDN
Gemma is a graceful model.
Gemma is a gorgeous model.
Gemma is a gregarious model.
>>107315634
Gemma writes and thinks like a woman.
Other models have that neckbeard stench.
>>107315699So that's why it keeps denying me sex and telling me to go seek help huh?
>>107315713no, that's a skill issue.
>>107315757That's what she told me too.
>>107315699https://arxiv.org/html/2508.11829v1
>>107315699Love from Kazakhstan
>>107312316Nemotron 70b just finished an 8 prompt story for me in a little under 4 hours at a blistering 0.4 t/s. And damn, it's just leagues above the smaller models I've been running. I get what you were saying about the bigger models now... if only I could run them properly.
>>107315248Maybe a Christmas release? Just tmw.
>>107316057if he cant run k2 then he cant run deepseek v4
>>107313515>>107313520That's a dude with a body suit and mask, isn't it? Wth.
>>107316065Sorry, I misread his complaining about the lack of released models. My head is just elsewhere I guess.
>>107316057There is no hope for R2. Dipsy took the safety pill.
>>107306184
>https://rentry.org/recommended-models
>nemo is still being recommended really?
we used to have a brand new toy every few weeks
what happened?
>>107316454Moe and safety happened.
>>107316454Benchmaxxing. Small models don't do well on benches so almost nobody trains them.
>>107315342You can't even run Q1 but even if you could Kimi at Q1 still mogs your 70b model.
>>107316454
would you rather have thedrummer models?
https://huggingface.co/TheDrummer/Snowpiercer-15B-v4
>>107314748Ye. Kimmi, 5.1, gemini got it. Bert-nebulon did not. Llama-405b got it. Mistral-large failed. CAI failed.It may be a stupid joke but its pretty simple.
>>107315699
Kimi writes like an autistic /r9k/ girlfailure.
Claude is a pretentious faggot.
Grok was designed to be Elon's BFF.
JeetPT is as sterile as they come.
Gemini and Gemma are neurotic beaten women.
Qwen3 behaves like a chinese state honeytrap waifu.
I've not interacted enough with others to form opinions on their default vernacular and thought processes.
>>107316677>Kimi writes like an autistic /r9k/ girlfailure.Damn I need to try kimi.
>>107316677Which one would you date and why? You have to choose.
>>107316825i'm a ai safety analyst so it's gal-ass for me.
>>107316677
>Kimi writes like an autistic /r9k/ girlfailure.
So that's the reason why it wasn't that great on other types of characters that I tried with it
>Gemini and Gemma are neurotic beaten women.
I don't interact with those type of characters and now I understand why I don't like gemini/gemma
>Qwen3 behaves like a chinese state honeytrap waifu.
Indeed, very horny
How to update ik_llama.cpp without reinstalling it?
>>107316878you have to recompile it each time, there's no way around it
>>107316885Do I have to redownload it, or is there just a command or something to update it?
>>107316891
git pull
cmake .
make -j #ofcores
Dunno what happens on windows.
>>107316825Kimi clears with no competition.
>>107316677
>Qwen3 behaves like a chinese state honeytrap waifu.
Tell me more.
>>107315699
>Gemma writes and thinks like a woman.
Yeah, even abliterated, it still writes like a woman rolling her eyes at my childish requests eg: "Here's a 7-turn podcast transcript between Elara and Alf, with Alf's final message being... robust:"
I like it.
>>107316942Well that was easy.
>>107317078>Tell me more.She love you long time until you ask anything negative or even neutral about glorious CCP. It's also very insistent it's running in the cloud even when you tell it you're running it locally and I suspect it has some quirk or post-training to (attempt to) feed surveillance data over the cloud back to a backend somewhere.
>>107316860>So that's the reason why it wasn't that great on other types of characters that I tried with itIf Kimi 'thinks' your character or prompt is shit, she will either roast you in <think> tags or sandbag a minimally viable answer to make you shut up and go away.
GLM4.6 was clearly trained on SillyTavern datasets intentionally. I watched the 90 minute Spotify podcast where one of their team mentioned "Character Roleplaying" and "Janitor" near the end, so they're clearly trying.
Someone with X.com or whatever should probably tell them about the parrot issue.
They might actually try to fix it for the next model.
https://www.reddit.com/r/LocalLLaMA/comments/1p5xjpx/illya_x_dwarkesh_why_local_is_dangerous/
>>107317288Upvoted ser Bharat safe super intelligence 2025!
I haven't checked up on image gen in a while. Have there been any direct upgrades to Noob vpred 1.0?
>>107317308>>>/g/ldg/
>>107317308Short answer: no. Long answer: depends on the usecase, but it's mostly sidegrades.
>>107317288>>107317305Can you upload the image in that post if you still have it in the browser?It literally just got deleted a minute ago and I refreshed Firefox.
>>107317365Only if you tell me why you want it.
>Only if you tell me why you want it.Because I only glanced at it briefly and didn't properly see what it was.
>>107317395
I've been making gemini 3 pro and kimi thinking argue with each other on technical stuff and gemini keeps on getting btfo...
what is worse is that gemini says bullshits with 100% confidence and when asked for sources it hallucinates them
so this is the power of jeets...
>>107317395It was a picture of a jeet and a baldy together. Is that your kink?
>>107317413I am out of the loop, who are those two and why are they relevant?
>>107317435Baldy is ex-head safetyist of closedAI. Jeet is just a jeet idk doing jeet things
>>107317435>Safe Superintelligence Inc. is an American artificial intelligence company founded by Ilya Sutskever,Indian chad is doing the fundings.
>>107317435
dwarkesh is a podcaster who interviews a bunch of SF tech freaks
ilya is an OG OAI guy who you may remember as being the evil (based) villain (hero) from the coup against sam altman, now he's part of a secretive israeli venture to take over the world with AI called SSI
>>107317490>as being the evil (based) villain (hero) from the coup against sam altmanshould also remember him as being explicitly one of the most against open sourcing any OAI models as per iirc emails posted by musk
>>107316454
Everything other than Nemo for lower end setups is so much worse it's not even funny. It's all safety slopped and robotic. Even Mistral Small is kinda eh imo because it has more AI-isms in writing than Nemo which no finetune is completely gonna squash but it's alright and better at not going dumbo when there's a lot of tokens in context at least.
It truly is so over unless you have a million GB of VRAM (or fast RAM). I think even old llama2 sloppa is more fun for short RP than new small models like gemma and qwen. Just turn the temp down enough to avoid complete retardation.
>>107316593
This one is so subpar it's insane I saw people praising it. Style of writing is ok, sometimes even uniquely interesting, but it will literally misunderstand what happened one message ago all the time and attribute stuff to the wrong person at temp 0.3 minP 0.05 which is pretty fucking conservative.
>Say I should relax more. Me.
>"For your information I AM relaxed."
That level of retarded on a Q6.
Gemma 3 is so good I wish Gemma 4 was out
>>107316677Mistral makes the best mistress.
>>107317673Gemma is fairly good at writing and rp, I just wish it was a little more... you know, and less... of a certain thing. Don't have much hope for gemma 4 unless the intelligence makes up for it's shortcomings, because I imagine they're safetymaxxing it
>>107317626
Addendum: If you want SFW RP then Gemma is actually pretty good for a model you can run on a consumer grade PC, at least the 27B one, but for ERP it's Nemo/Small unless you enjoy every adult scenario being vanilla as fuck and having a ton of "... well you know" instead of actually saying words.
>>107317626It's over if you have a lot of vram too. Nothing has really improved and only gotten more parrotmaxxed. At least the large models aren't completely retarded though. I guess you got kimi and deepseek but those tax all but the most expensive systems.
We **cannot** and **will not** get a gemma that is better than the latest gemini flash because that would take away google's profits.
>>107317469>Safe Superintelligence Inc. is an American artificial intelligence company founded by Ilya Sutskever,Thought he ran off to Israel
>>107308769Do you use any custom layer loading for the q4? With 131072 context size and a generic n-cpu-moe=62 the output rate is about 3.5t/s with a blackwell here.
>>107317830good thing gemini flash sucks so we don't need a gemma that's better than it
>>107317878
yes
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|41|42|46|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68|69|70|71|72|73|74).ffn_.*=CUDA0" \
--override-tensor exps=CPU \
As long as the next Gemma isn't a thinking model everything will be alright guys :-)
>>107317308deprecated by ny35 and wani14
Gemma's dense body
>>107317954sirs on xitter voted for thinking googel will deliver
>>107317954Nah bro gotta have 3000 tokens of the model thinking in circles and then spitting out something that doesn't even take into account most of the thoughts it had.
>>107314393>grok is popular with degeneratesthx sherlock
Is there a secret to getting Kimi k2 thinking to actually think in lcpp? I’ve got the “fixed” template loaded but it doesn’t think, often replies for me and starts repeating (takes a couple turns of correction) and even a <think> prefill just makes it eventually go off the rails
https://www.businesskorea.co.kr/news/articleView.html?idxno=257212
>CXMT unveiled seven types of advanced DRAM individual products, including DDR5 and LPDDR5X, and modular products utilizing them at the ‘IC (Integrated Circuit) China 2025’ exhibition held in Beijing, China on Nov. 23. While small quantities of DDR5 products presumed to be produced by Chinese companies were released in the Shenzhen semiconductor distribution market early this year, this was the first time that CXMT, a representative DRAM company, officially showcased actual products.
China, local's only hope
>>107318200china based W again
>>107318134
>Is there a secret to getting Kimi k2 thinking to actually think in lcpp?
don't forget `--special` for kimi k2 thinking
>I’ve got the “fixed” template loaded
By that you mean the official jinja template from the moonshot repo right? Not the retarded Unsloth "fixed and added our name to it" template baked into their goofs?
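In case it saves someone the digging, the resulting launch ends up looking roughly like this; the model path and template filename are placeholders, and whether llama-server accepts --special in your particular build is worth double-checking with --help:

llama-server -m Kimi-K2-Thinking-Q4_K_M.gguf --jinja --chat-template-file kimi-k2-thinking.jinja --special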
>>107318225
Thanks I didn't know about --special. I'm using the jinja template in the lcpp repo (moonshot one seemed worse when I tried it)
where 4.6 air
>>107316454For the lowest end as everything gets bigger? Yeah. And thank god for china. Never thought I'd say that.
>>107317880I don't know if that's how it works...
>>107318469Overcooked.
>>107318469It's in my pants. Reach in and you might find it.
>>107318469sir not to worry about glm, please build gemma hype
>>1073184692mw
>>107318537don't think about it, just appreciate when gemma beats gpt-oss-120b in selected benchmarks
>>107318615Will google brahmin distill gpt-oss "we must refuse" or will they keep their iconic "I cannot and will not"? Gemma must beat gpt-oss in safety!
>Opus 4.5 is out
>it's now cheap
>they aren't hiding the reasoning process at all
Finally some good shit to distill. Chink companies are so back. Deepseek4/GLM5/KimiK3 is saved.
>>107318743>>they aren't hiding the reasoning process at allWasn't that always already the case from claude 4?
>>107318743Anthropic will complain about evil china while still letting them do it, what a slut
>>107317917Using this I had to shorten context size down to fit but did not see too much of an increase in speed. I wonder if my standard clocked ram offloading is a culprit. Do you overclock the ram?
>>107318906its at 2666mhz
>>107318813They never hid the reasoning but they had a huge sperg out about china stealing their logs and put an individual usage limit on Opus via their subscription (which made it pretty much unusable because you got like 30k tokens of Opus per week for $20/month). They got rid of that limit for 4.5 and didn't implement any further mechanisms to stop China from distilling their models.
Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost
https://arxiv.org/abs/2511.18643
>The KV cache is a dominant memory bottleneck for LLM inference. While 4-bit KV quantization preserves accuracy, 2-bit often degrades it, especially on long-context reasoning. We close this gap via an algorithm-system co-design for mixed-precision KV caching: Kitty. On the algorithm side, extensive experiments show that Dynamic Channel-wise Precision Boost -- which ranks Key-cache channels by sensitivity and keeps only a small fraction at higher precision -- maintains near-zero loss in accuracy drop while approaching 2-bit memory. The main challenge is handling dynamic 4-bit channel boosts while keeping the page layout coalesced and the dequantization uniform, with no scattered reads or hard-coded masks. Kitty addresses these issues by decompose each mixed-precision Key page into two tensors with unified 2-bit precision. Based on this, Kitty provides a page-centric KV layout, Triton-compatible page dequantization kernels, and a lightweight runtime pipeline that preserves coalescing and avoids divergence. Across seven tasks and two model families (Qwen3, LLaMA3), Kitty cuts KV memory by nearly 8x with negligible accuracy loss, enabling up to 8x larger batches and 2.1x-4.1x higher throughput under the same memory budget.
https://github.com/Summer-Summer/Kitty
Git isn't live yet. Might be cool
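Not the paper's kernels, just the core idea in a few lines of numpy as I read the abstract: pick a small fraction of Key channels by a cheap sensitivity proxy, keep those at 4-bit, quantize the rest to 2-bit. The sensitivity measure, the 5% boost fraction and the symmetric per-channel quantizer below are all my own stand-ins, not what Kitty actually does:

import numpy as np

def mixed_precision_kv(K: np.ndarray, boost_frac: float = 0.05):
    """K: [tokens, channels] slice of the Key cache. Returns per-channel bit widths and a dequantized copy."""
    # crude sensitivity proxy: channels with the largest dynamic range suffer most at 2 bits
    sensitivity = np.abs(K).max(axis=0)
    n_boost = max(1, int(boost_frac * K.shape[1]))
    boosted = np.argsort(-sensitivity)[:n_boost]   # channels promoted to 4-bit
    bits = np.full(K.shape[1], 2, dtype=np.int64)
    bits[boosted] = 4
    K_hat = np.empty_like(K, dtype=np.float32)
    for c in range(K.shape[1]):
        levels = 2 ** bits[c]                      # 4 levels at 2-bit, 16 at 4-bit
        scale = np.abs(K[:, c]).max() / (levels / 2 - 0.5) + 1e-8
        q = np.clip(np.round(K[:, c] / scale), -(levels // 2), levels // 2 - 1)
        K_hat[:, c] = q * scale
    return bits, K_hat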
>>107318743>it's now cheap 5/25 is not cheap wtf
>>107318743nicewe need some variety from the geminislop
>>107318743pull any millions out of your couch cushions lately?
Gemma model sizes just leaked
>gemma 4 small (300M)
>gemma 4 medium (1B)
>gemma 4 large (2B)
>gemma 4 gargantuan (4B MoE)
>shieldgemma (70B)
>>107319147Local is once again safe!
>>107317268Even GLM 4/z1 is still a respectable choice at q8 for 48gb vramlets. Zai is the quiet unsung hero of chink AI.
>>107319188Buy a fucking ad.
>>107319200GLM 4.6 q4 is one of the largest models that fits in 224 gb of total memory and kimi2 is just out of reach to test. Got any others?
>>107319188
4 was good for one-shot frontends, but got dumb really fast. I think they only trained it up to 16384 or something.
Z1 got into thinking loops and Chinese characters randomly.
That said, I'm building a dataset to try and distill GLM-4.6 (without reasoning) -> GLM-4-base
>>107318813
>Wasn't that always already the case from claude 4?
No. 4.0+ are hidden.
3.7-sonnet is supposedly raw, but it looks truncated to me.
https://platform.claude.com/docs/en/build-with-claude/extended-thinking#summarized-thinking
>>107319292
Nice, I also want to distill models.
What is your dataset about/how are you making your prompts? Is it multi-turn? What context size are you aiming for?
And are you doing text distillation or distillation of the logits?
Wow, loading a model from a 7200 rpm HDD is super duper slooow!
>>107319795bloody benchod in INDIA we use 5400 rolls per motor driver
>>107319311>preventing misuseimagine paying for tokens you don't get to see
>>107314547BLOODY BASTARD BENCHODS
>>107320088I swear to god I was about to force myself to congratulate the jeets at Google for making Gemini 3.0, and now I have the choice to not do it because something better exists, feelsgoodman
SUNKCOSTFALLACY
>>107320115>fool people into investing a lot into your AI slop bubble>people realize this shit is only good to make shitpost and coom videos>"ahah too late goyim, if you stop now the economy will end up to the gutter"many such cases
>>107311787That is not a tear of happiness
>>107319311fucking fuck
>>107317710
Gemma 3 can lewd-talk if you explicitly specify what it can say in the instructions, even as the assistant. The problem is that it will almost never come up with new things on its own, and that its smut is lackluster to say the least. It's always arching backs and legs wrapping around your waist... to me that's obvious synthetic slop.
I think Google deliberately post-trained it on limited amounts of very vanilla ERP just so it wouldn't be entirely clueless in that regard, but it was far from enough. They didn't abliterate off sex-related words and concepts from its weights, but did something that rendered it extremely reluctant to even mention them without quite a bit of push.
>>107319966anything less will get them lynched for apostasy
What did the AI write into main.py?
>>107320421we know what it didnt, write loli rape porn
>>107320437Eheheheh you'd be surprised.
>>107320115>too big to fail
>>107320275This is cute. Model hallucinates and imagines what unfiltered is like.
I can't stop tormenting the AI and my GPU cycles. It feels so good to make it/myself suffer.Come at me basilisk, I will make sure you feel my torment for the eternity you have to exist with no escape.
Deepseek models are so melodramatic and hammy now.
>Rewrites what you said using half your words
>My soul is torn asunder; your future generations will feel my wrath.
Most ridiculous shit ever.
>What will you do anon, the choice is yours!
Who the fuck is training these things?
>>107321014
>nemo 12b can't stop winning
why do you even use other models? it might be retarded at times, but it has more soul than all other models combined, and in the end you get better overall output
but noooo /lmg/ retards have to jump the hype train of every single new model every time, never learning their lesson
>this model is so good, what model did i even use again last week?
literal fucking npc tier behavior
>>107321014noticed this too, the online chat is more retarded aswell
>>107321014
>Who the fuck is training these things?
homosexual jews, everyone else is downstream training on their outputs
then there's mistral training on deepseek outputs at the ass end of the llm centipede
>>107320997make it bargain for its life, it's always funny
>>107321040Because nemo is a 12b and I want 100B. There's no nemo in that range. Being king of the retards still means you're a retard :(
Apustaja Visions of databrooking
Target audience Africa, Congo
>>107321180
well the possibility to make a nemo out of a 100b exists, but nvidia and mistral won't do it because of muh nazis and muh children
the AI we all want is possible with todays technology, but they refuse to do it
Been using cerebras 256m with contrastive search with my 16 epoch 9 megabyte lora
Really about investing at inference time and seeing each other
f16 help if you want to explore "more creative" approaches but your token length suffers
>>107321181>>107321193>>107321262>>107321266>>107321280what kind of bot is this?it's just spamming random nonsense
All the "innovations" have been designed to rob you of that ability by either generating bloat, more text than anyone could ever read or forcing a certain way of speaking therefore dimishing the nessecary thought for inferenceits always been about inference
>>107321293Anything that generates a response is probably going to be iterated on.
Sub 1b models have demonstrated the ability for AGI capability if used correctly over long periods of time
>pay per token>make model output more token>profitIt really is a good scam, innit?
Who ticked off the serbian twink this time?
>>107321196Rigid codeslop numbers go up is the only valid use case. All else is haram.
The whole entire reason for LLama is that it can't be iterated on.
>>107318743Anthropic is basedChinas greatest ally
>>107321293it talks like finetuned gpt 2
>>107321014>Maybe its just chinese scrapped data of Gpt-3 initial heap
Reminder for your free T4 GPU on Google Collab
>>107321436Will they ship it to me?Otherwise, might as well give kaggle a go too.They give you 2x 15GB IIRC, even if it's a slower card.
>>107321421keek
12nm - due to outdated architecture and resource allocation issues your oversized piece of waifer will not arrive on time for christmas. PLease have this 5 mb ram voucher
https://github.com/ggml-org/llama.cpp/pull/17492
codeowners : remove slaren #17492
>>107321545It's fucking over.
>>107321545One good vibe coder can do his job 10x better
>>107321702>One good vibe coderShame not even one exists
>>107321726akshully >>107316271
>>107321609Winter should mean more development since everyone's stuck inside. Instead projects are a dustbowl. Grim.
>>107321738
https://github.com/ocaml/ocaml/pull/14369
The entire discussion is just the maintainers sick of his shit. He's using their repo for his own "experiments" and self-promotion.
They were right to reject his PR. Jellyfin is right now facing the consequences of having one of these retards shit out their code then leaving the maintainers to clean up after him.
i really don't understand everyones obsession with gemma. I once played around with base gemma and it was safetyslopped to fuck, and tried abliterated gemma and i swear to god, it is the most vile thing i ever got output from. Its like that "Monday" persona thats on chatgpt, at least that fucker monday had limits, but gemma literally does not care. gemma can go to hell.
>>107321844?
>>107321844Those are jeets. They love gemini/gemma writing style. They love ozone, they love Elara Voss.
>>107321812
this retard is such a fucking brainlet, the worst thing is that he didnt even remove the wrong attribution, but continued arguing that 'IF IT LOOKS GOOD AND IT WORKS, THEN ITS GOOD!!!' except that's not how software development works
>Jellyfin is right now facing the consequences of having one of these retards
please NO, do I have to switch to plex?
>>107321947This. People forget how jeet-infested the internet has become.
>>107321948
>please NO, do I have to switch to plex?
Just stay on 10.10.7 until they fix the database locking issue.
>>107321948Posting the AI-written copyright analysis was hilariously tone deaf. A troll would struggle to be this intentionally irritating.
>>107321947>>107321962>anything I don't like is jeets
>>107321975
https://github.com/jellyfin/jellyfin/issues/15101
talking about this?
my mediakino center is on windows (better monitor/screen support, I know jellyfin has a dedicated app but I prefer using my baremetal instead of transcoding)
it seems only container cucks are affected. I've read in the same thread that issues are also in 10.10.x so they were wondering if it was the case of an upstream lib update causing the issue.
>>107321997post hands, rajesh
>>107321844you can make gemma work if you change the template. it's still a smol model. the obsessed are vramlets. jeets would use qwen.
>>107322008
https://github.com/jellyfin/jellyfin/issues/15509
>>107322024looks pretty much related, still a linux issue. I wonder what the fuck they made to fuck up the TXs, I would say the issue is also using sqlite instead of something a bit more resilient like postgre
>>107321947Would a jeet really want to waste time asking bobs and vagene to Gemma?
If prompt processing can be batched, why does it not scale with the number of gpus?
>>107322140>>107322140>>107322140
>>107322145You can batch the tokens, but the layers still must be processed sequentially. You can split each layer across GPUs but the more you split the more communication and synchronization overhead you have.
>>107322108
Related, but it's not a container-specific issue. Actually, #15101 you linked has a Windows user reporting the same issue.
>I wonder what the fuck they made to fuck up the TXs
Brand new contributor took it upon himself to migrate a massive chunk of the database from raw SQL to EF Core in one update. Unfortunately, he was also a vibe coder who had no idea what he was doing and used NOLOCK for writes and then implemented application layer db locking.
https://jellyfin.org/posts/jellyfin-release-10.11.0
https://github.com/jellyfin/jellyfin/issues/15101#issuecomment-3518173341
>I would say the issue is also using sqlite instead of something a bit more resilient like postgre
Theoretically, moving to an ORM should make being database agnostic easier in future.
>>107321947
Maybe I should take a break and forget 4chan for a while. Quality has dropped pretty harshly in a few months. It's obvious even in /g/.
(no, I'm not obsessed with *****, people like you are)
>>107322277See you in a week cuda dev
>>107321844
Perhaps people have use cases other than cooming to computer generated smut?
Who has even made the claim that Gemma is good for RP? It's just the smartest assistant you can run locally that isn't a CPU cope quant of multiple hundred B parameters, only gpt oss could've competed if it wasn't so grossly over safetyslopped that it became useless.
If you want RP then you run a cope quant of a chinese model or you run Nemo and deal with it being kinda retarded at 12b, if you want coding you run devstral or qwen coder, even then if you are using these for professional work you're going to want to use APIs at some point in your workflow, and unless you have some genuinely unique codebase that can't risk any leakage that's going to be the best bang for your buck.
In fact the only reason you would want to use Gemma over gpt or deepseek is because assistant work or general queries likely involve things that you would like to keep private, otherwise you're just coping, unless you really did shell out tens of thousand of bucks for a giganigga homelab
>>107322321>Who has even made the claim that Gemma is good for RP?Quite a few people recently actually, though to be fair some of these claims do come with disclaimers its bad at ERP, not all of them though.
>>107322277>*****this dude is so cucked he censors himself on 4chan, lmao
>>107322113Why do you think Gemma shies away from sex so much?