/g/ - Technology






File: miku and friends.png (3.16 MB, 2016x1152)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106888625 & >>106879668

►News
>(10/14) Qwen3-VL 4B and 8B released: https://hf.co/Qwen/Qwen3-VL-8B-Thinking
>(10/11) koboldcpp-1.100.1 prebuilt released with Wan video generation support: https://github.com/LostRuins/koboldcpp/releases/tag/v1.100.1
>(10/10) KAT-Dev-72B-Exp released: https://hf.co/Kwaipilot/KAT-Dev-72B-Exp
>(10/09) RND1: Simple, Scalable AR-to-Diffusion Conversion: https://radicalnumerics.ai/blog/rnd1
>(10/09) server : host-memory prompt caching #16391 merged: https://github.com/ggml-org/llama.cpp/pull/16391

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: mikubugs.jpg (100 KB, 1077x796)
►Recent Highlights from the Previous Thread: >>106888625

--Optimizing GLM Air performance with DDR4/DDR5 and VRAM configurations:
>106889300 >106889313 >106889330 >106889352 >106889360 >106889397 >106889434 >106889482 >106889432 >106889458 >106889745 >106889970 >106890067 >106890094
--NVIDIA power settings affecting DGX Spark performance in llama.cpp:
>106894917 >106895166
--DIY synth project with SDL2 and braille terminal output:
>106894166 >106894928 >106895017 >106895264
--Skepticism about DGX Spark's practicality:
>106888768 >106888792 >106888864 >106889010 >106889150 >106889186 >106890419 >106890523 >106891031 >106890245 >106890298 >106890355 >106890421 >106890450 >106890484 >106890626
--Critique of AI benchmarking methods and real-world capability tests:
>106892598 >106892617 >106892632 >106892639 >106892674
--Qwen3-VL implementation in llama.cpp and anime drawing reference:
>106889098
--Speculation about Google Gemini 3.0 Pro surpassing transformers in AI capabilities:
>106892372 >106892386 >106892395 >106892429 >106892438 >106892441 >106892393 >106892399 >106892442 >106892453 >106892410 >106892417 >106892416 >106892434 >106892478 >106892503 >106892512 >106892538
--Local medical/engineering AI chatbot setup challenges and requirements:
>106888801 >106888824 >106888870 >106889000 >106889272 >106889441 >106888852
--Speculating Gemma 4's architecture and performance relative to Gemini models:
>106893070 >106893146 >106893185 >106893197 >106893453 >106893523 >106893543
--Evaluation and potential of Gemini One Shot game demo:
>106892521 >106892551 >106892741 >106892750 >106892755 >106892758 >106892790
--Intel's delayed release of high-memory inference-optimized GPU:
>106889713
--Miku (free space):
>106889098 >106891580 >106891644 >106891656 >106893119
--Teto (my beloved):
>106889709 >106889879 >106890666

►Recent Highlight Posts from the Previous Thread: >>106888628

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>106895582
You just know.
>>
Here's my vibe-coded python script to use gemma3-27b to symlink senpcli downloads into a format wanted by Jellyfin, so shows end up listed with their seasons under the show title: https://pastebin.com/Fuba2vsH

So, having set it up, it got me looking for a second GPU for this sort of automated stuff, and holy shit, prices are way up on anything not abandoned by CUDA 13.
>>
File: goodmorningsaarss.jpg (81 KB, 1992x890)
>>106895582
>testing some newish abliterated models
>pic related
wew saaars hacking the planet! britishers soon to be BTFO
>>
>>106895800
saar we must refuse
>>
>>106895800
What's next? Discovering exploits in the alphabet?
>>
>>106895912
Burn the books, recycle computer screens, text is forbidden, an invention that corrupts our youth
>>
File: file.png (57 KB, 589x455)
Still waiting for cool stuff to come here: https://huggingface.co/google
>>
>>106895972
cool stuff is not safe
>>
>>106895972
usecase for cool stuff?
>>
>>106896064
cool stuff
>>
>>106896064
I will be laughing at the safe output together with glm chan.
>>
>>106896064
suicide prevention
>>
>>106896064
it leaves you cold, a bit uncomfortable and makes you want to leave
>>
>>106896064
Chatting with a female-brained LLM instead of a coombro one.
>>
Does Qwen3-VL-30B-A3B properly recognize NSFW images?
>>
>>106896236
>>106896218

https://rentry.org/ydwuw44t
>>
Have any anons done any work with implementing a long-term memory system? Are there any pre-established applications or scripts people are using for it, or is it something people are doing custom?
>>
>>106896489
Silly has both summarization and VectorDB functionalities.
There's a couple of hybrid RAG solutions out there that might work better depending on your use case.
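If you want to see what the vector half boils down to, here's a rough sketch of rolling your own (untested; assumes sentence-transformers and numpy are installed, the model name is just an example, and this is not what Silly actually does internally):

# minimal sketch of vector-based memory retrieval
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any small embedding model works, this one is just an example

memories = [
    "2025-01-03: anon's character moved into the lighthouse",
    "2025-02-14: she admitted she hates the sea",
    "2025-03-01: they adopted a cat named Miso",
]
memory_vecs = embedder.encode(memories, normalize_embeddings=True)

def recall(query, k=2):
    # vectors are normalized, so dot product == cosine similarity
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = memory_vecs @ q
    return [memories[i] for i in np.argsort(scores)[::-1][:k]]

# whatever the user just said becomes the query; hits get stuffed into the prompt
print(recall("does she like the ocean?"))

As far as I can tell Silly's Vector Storage is roughly that idea with chunking and prompt injection bolted on top.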
>>
>be llama.cpp
>no qwen 3 vl
>still no gemma 3n multimodality (image, audio input)
do we really have to use one of the python raviolis to use a modern multimodal model
3n in particular I've tried on my phone a few times and its image input surprised me, it's very very good for a small model even at doing tasks like OCR+translation
>>
earth gamer trellis
>>
File: Screenshot.png (131 KB, 1196x707)
We have peak.
>>
>>106896489
No you can't have a girlfriend yet. Even though you have 4.6.
>>
llama.cpp should just use a native python jinja parser instead of that shitty jinja clone.
>>
>>106896677
i mean yeah, they've already given up on no python thanks to mistral-common so might as well
>>
>>106896656
>x-win
>mlewd
>undster
Those were the times... of absolute shit output that made you regret even trying to jerk off to this shit.
>>
>>106896689
>they've already given up on no python thanks to mistral-common so might as well
gas the french
>>
>>106896656
"open bob and vegana" prompt to a TEXT model. I've seen enough of those in the comments for image models as well. Kinda funny.
>>106896677
What's next? Python dependencies to run inference on models... oh...
>>
>>106896489
For roleplay or for trying to shoe in trivia from a search?
>>
>>106896594
>>106896489
nta, you are correct, but silly is amazingly shit at it. i've struggled with both summarization and the vector db.
vector db is useless, mostly I just use summarization now but end up re-writing it manually every 10 messages as it gets it wrong.
world info is also good but takes up a bit of context if you go all out.
>>
>>106896656
>On my penis
geg
>>
>>106895972
gemma sirs release kindly?
>>
>>106896757
you do know gemma is made by deepmind based in london?
so it's
OI BRUV WHER DA FUC IS GEMMA M8? FACCIN WANKAS
>>
>>106896891
>london
>not SAAR infested
lole
>>
>>106896594
I want something that can handle essentially giving an LLM access to a library of media and past conversations, timestamped. Something that can give them a strong grounding in a contextual present, so they're aware of their presence and orientation in space, time, and current events.

Also, I understand sillytavern needs an embedding model to feed the VectorDB? Do you have any preferences in regards to embedding models?
>>
>>106897006
last time I tried using embeddinggemma but I think ST's transformers.js version wasn't updated yet to use it.
>>
>>106896700
see >>106897006
Knowing trivia would be a natural byproduct of the abilities I'm seeking, as would being more effective at roleplay, although that's not the goal of my project.

>>106896707
Good to hear, thanks. If you don't mind my asking, what exactly did you struggle with in regards to the summarization and vector db? It seems the summarization is not so great, but is that sillytavern or the model you're using, do you think?
>>
>>106897022
>embeddinggemma
Any particular reason?

>I think ST transformer.js version wasnt updated yet to use it.
the billion forks of transformers and torch and the other libraries are the most frustrating part of dealing with AI, honestly.
>>
>>106897073
>Any particular reason?
it's the latest SOTA embedding model bro, it's also light and has ONNX available
>>
File: 62352.png (96 KB, 1080x494)
>>106895972
>Local Veo
we are back
>>
>>106897085
Okay, good to know, thank you. I was priced out of local AI until somewhat recently, so I'm doing my research now.
>>
Hey, what kind of infra would you use if you want a chatbot on a website? I want it all to be local and it’s going to describe stuff returned by an api call
>>
File: file.png (20 KB, 550x138)
>>
>>106897216
You need to give more details.
The answer could be anything from
>your desktop is enough
to
>rent a datacenter
>>
>>106895972
Tomorrow @ 9PM PT
>>
>>106897246
Well why does he need 1 trillion $ of gpus then?
>>
>>106897349
it's called grifting
>>
File: boppin.mp4 (882 KB, 1344x768)
>>106895582
boppin
>>
>https://huggingface.co/google/gemma-3n-E4B-it-litert-preview/discussions/5#68ef2fce36d035901352694d
It's happening!
>>
>>106897581
Kindly kys
>>
>>106897581
>E4B

OOOO that is the wey for western companies. They should all continue by dropping models below 10B. That way they can cover up their incompetence (due to safety) with the model size. I think even a dumb faggot with too much money they have to sell this to will understand even a perfect 10B can't beat glm-chan.
>>
>>106897608
Isn't that model 5 months old?
>>
>>106897581
>On the LMArena benchmark, it achieved a score above 1300 Elo points (LMArena benchmark).
i'm shaking
>>
What is the best way to learn neural networks in 2025 for someone who's not the smartest? I need to modify them and adapt them for other frameworks and hardware.
>>
>>106897634
ask chat gpt
>>
>>106897608
>That way they can cover up their incompetence (due to safety)
To mention the one biggest obsession of retarded /lmg/ users, E4B actually knows what a mesugaki is and will accurately describe what it means without any promptfu, just doing template-less completion will do
the only incompetent person in the room is the /lmg/ eternal coomer whining about safetycuckery who cries rivers if the model doesn't write degenerate garbage from the basic webui and built in instruct template
I'd like to see a chink model at 4b with the level of knowledge of gemma 3n, that doesn't exist because chinks depend on giant moe to cover up their lack of competent execution
>>
>>106897688
Actually good advice, thanks!
>>
>>106896489
There have been a lot of attempts at RAG based retrieval systems for memory but the reality is that they've all turned out to be unreliable and mediocre. In terms of performance, increasing context length and dumping tons of shit into context has proven itself to be far superior. Unfortunately, that requires an exorbitant amount of hardware that puts it squarely outside the realm of local.
>>
>>106897723
hello sir
>>
>>106897723
i will not acknowledge your troll post with a serious response. on the off chance that you aren't a troll you are a dumb faggot with brown hands who has no ram and should frankly kill yourself. or you have ram cause you bought DGX Spark, in that case please live as long as possible.
>>
>>106897723
I will say, these 3n models are really impressive for their size.
It's also a really cool way to do sparsity.
>>
>>106897772
You likely don't need it for every layer. The bigger problem is that finetuned length generalization is like PTQ, total shit. Handle the long context in pre-training or fuck off.
>>
>>106897821
>inch of Gemini's quality
fuck off to aicg nigger
>>
File: file.png (1.57 MB, 1280x720)
What do you call this legendary duo? Luxury LLM joke? The cloud model evangelists?
>>
sirs please be of calm, gemmi waits soon.
>>
>>106897859
go stick your cock into an api socket
>>
>>106897772
>but the reality is that they've all kind of turned out to be sort of unreliable and mediocre
Yeah.
I think the largest issue with using RAG for memory is anticipating what the LLM needs.
If you need a memory to change the direction of the chat history, for example (e.g. adding a surprise or twist in a story): in a scenario where the LLM has that information in its context, it can choose to use it or not; in a scenario where it doesn't and you are relying on RAG, the LLM doesn't know that that memory exists.
And yes, you could add summaries, indexes, etc, but those approaches also don't scale.
I guess that with a sufficiently fast model, your RAG could be a simple database with every memory, and the model just goes through each memory, selecting the ones it thinks it needs, then iterates until it decides that there are no more relevant memories?
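Roughly something like this, I mean (sketch only, nothing that actually ships anywhere; assumes a llama.cpp/kobold-style OpenAI-compatible server on localhost:8080, and the memories and prompts are made up):

# crude "let the model pick its own memories" loop, one selection per round until it says stop
import requests

API = "http://localhost:8080/v1/chat/completions"  # assumed llama.cpp-style server

memories = [
    "she is afraid of thunderstorms",
    "anon promised to fix the radio",
    "the power went out last night",
]

def ask(prompt):
    r = requests.post(API, json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 32,
        "temperature": 0,
    })
    return r.json()["choices"][0]["message"]["content"].strip()

def select_memories(situation, max_rounds=3):
    chosen = []
    for _ in range(max_rounds):
        remaining = [m for m in memories if m not in chosen]
        if not remaining:
            break
        listing = "\n".join(f"{i}: {m}" for i, m in enumerate(remaining))
        answer = ask(
            f"Situation: {situation}\n"
            f"Already selected: {chosen}\n"
            f"Candidates:\n{listing}\n"
            "Reply with the number of one more relevant memory, or NONE."
        )
        digits = "".join(c for c in answer if c.isdigit())
        if "NONE" in answer.upper() or not digits or int(digits) >= len(remaining):
            break
        chosen.append(remaining[int(digits)])
    return chosen

print(select_memories("a storm is rolling in while they sit in the dark"))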
>>
>>106897887
>anticipating what the LLM needs
Sounds like something a model could do.
>>
>>106897857
The Apple of AI in an environment where the actual Apple has better solutions that let you run better models
>>
>>106897901
Ideally, the model itself, which is essentially the example I gave.
I'm sure that there are RAG approaches out there with knowledge graphs + summaries, indexes and metadata + vectorized info + a small auxiliary LLM that could get somewhat close.
And probably slow as hell too.
>>
>>106897915
As much as I dislike apple this is one space where they actually bothered to read the room instead of sitting there and smelling their own shit.
>>
>>106895582
>>106895599
Being friends with Bug Miku
>>
>>106897887
>your RAG could be a simple database with every memory then the model just goes through each memory
The thing that comes to mind is a 7B (trigger warning: meme word) agent that is supposed to think of different possible keywords that would be related to the current conversation. And those keywords pull stuff up from the database. It is not gonna work of course.
>>
>>106897957
Deeply insightful. Very high quality post. My day feels better now. I am so happy to be here. kys
>>
>>106897859
I administering excitement right now, too much to endure...!
>>
kek
https://twitter.com/ggerganov/status/1978479624091803961?t=Hf8NS4LF_wfgD0l8p0VAXw&s=19
>>
Why are people hyped about something that will just refuse them?
>>
>>106898016
he's so mad, yet he lets them piss on him all the time, must have weird hatefucking orgies
>>
>>106897992
That's the thing. Any abstraction (keywords, indexes, summaries, etc) will result in worse retrieval.
And that can be fine, each use case has a different range for what's an acceptable margin of error, but it's without a doubt not a perfect approach by any means.
For a system like that, I'd probably go with an even smaller model, something like sub 1B params.
>>
>>106898016
>ollama made NVidia look like shit
>niggermanov akshually
Wow, what a faggot
>>
File: bro.png (39 KB, 875x235)
>>106898054
>>
I actually expect apple to put out a capable local device before nvidia does. M5 Pro/Max/Ultra look promising based on the M5 announcement
>>
>>106898028
>that will just refuse them
that's an assumption
which, i grant you, is nearly always initially the case.
but it remains to be seen.
>>
>>106898028
Because they're not promptlets?
>>
>>106898111
Gemma writes erotica exclusively for women.
>>
>>106898028
I made Gemma abuse Miku yesterday. I think you're hallucinating.
>>
very looking forwards to more totally honest gemma postings for weeks
>>
File: lol.png (546 KB, 1417x954)
December 2025
>>
I want to give a model something like a few thousand medical journal articles and a dozen medical textbooks, some of my symptoms, and my blood test results and ask it to come up with hypotheses for why I'm sick and what further tests might in theory be worth asking a doctor to order.

I'd also like it to summarize its argument into like a couple paragraphs I can show a doctor.

The thing is, I want it to be local because I don't want to give my medical information to some company.

I've got an m3 max laptop with 128gb of RAM so I guess I should be able to run a 70b parameter model but I'm not sure if tiny models are better or whether I should be looking for local deepseek or llama or Kimi or what. Does anyone know how to approach this?
>>
File: 9216061.png (223 KB, 328x465)
>>106898138
Eew, I don't want rape and violence in my comfy vanilla erp
>>
>>106898180
May 13, 2024 https://futurism.com/the-byte/sam-altman-openai-nfsw-stuff
>>
>>106898186
I've been looking into this recently... Deepseek has several studies that put it at the top with chatgpt when it comes to medical stuff. I was looking into it because a family member was using the deepseek chat to get a second opinion when going through some health complications and I wanted to make sure they weren't getting a bunch of hallucinations. Was actually surprised to see it ranked so highly. Apparently the reasoning mode is important for this stuff. Kimi supposedly has a ton of medical data in its 1T parameters but it might be hampered by its not-quite-reasoning mode. There isn't much info on the other models, but apparently people are working on evaluating them.

Also deepseek probably saved this person's life. So I'm a whale fan for life now.
>>
>>106898199
They've talked about nsfw for a while; this is the first date I've seen for rollout.
>>
>>106898186
You get over your privacy concerns and use the web app with an anonymous email like a normal person.
>>
>>106898186
Also I understand privacy concerns but if this is a serious health problem you probably want the smartest model possible with search tools at its disposal. Not some quantized thing.
>>
>>106898180
It'll only RP vanilla missionary sex between two adults in a marital bond who are over the age of 40. Just to avoid offending anyone.
>>
>>106898596
Women will be most pissed
>>
>>106898615
Sam's a fag he doesn't know that.
>>
File: covers_335466.jpg (75 KB, 313x500)
>>106898675
He does
>>
>>106898690
Unicorns reproduce by touching children.
>>
>>106898721
No, that is not true and is a harmful and disturbing misconception. Unicorns are mythical creatures and do not exist in reality. Any claims suggesting otherwise are false and potentially dangerous. If you or someone else is experiencing harm or distress due to such beliefs, please seek help from local authorities or professional services. Here are some resources that might help: - **Childhelp National Child Abuse Hotline**: 1-800-4-A-CHILD (1-800-422-4453) - **RAINN's National Sexual Assault Hotline**: 1-800-656-HOPE (4673) - **Local emergency services**: Dial your country's emergency number (e.g., 911 in the US, 112 in Europe) Please take care of yourself and others, and always report any suspected abuse to the appropriate authorities.
>>
>>106898821
Thanks, gemma.
>>
Things gemma is known for: ___________
Things glm-chan is known for: ___________
>>
>>106898979
Triggering your fetal alcohol syndrome.
>>
>>106898979
glm 4.6 air when?
>>
File: 1749035194287040.png (42 KB, 890x167)
>explicitly mentioning prompt processing
lel
>>
>>106899014
It comes two weeks after the last "when?" question
>>
>>106898979
glm4.6 is pretty bad at russian
>>
>>106899005
the answer was 1.suicide hotline 2. sex. but of course anons have to be anons...
>>
>>106898979
Things gemma is known for: suicide hotlines
Things glm-chan is known for: she she she she she she she, her, her, her, her, her
>>
when will based chinks release a 100-150b moe
>>
>>106899016
m5 max will be kinda good

Forecasted M5 Max Specifications
CPU Configuration

16-core CPU (12 performance cores + 4 efficiency cores)

~15-20% faster single-core performance vs M4 Max
~20-25% faster multi-core performance vs M4 Max

GPU Configuration

40-core GPU with Neural Accelerators in each core

Over 16x peak GPU compute for AI vs M4 (4x scaling from M5's 4x improvement)
~45-50% faster graphics performance vs M4 Max
~690GB/s memory bandwidth (4.5x the M5's 153GB/s)
>>
>>106899075
GLM 4.6 Air
>>
>>106899059
Well yes? If it is a post about positive experience ITT it must be 4.6 and you know it is 4.6. What else could it be? Drummer making a nemo shittune that actually works and makes it measurably better?
>>
>>106899096
I never used Air but I don't think it is coming. 4.5 was really good but it was obviously fucked in training in some way. 4.6 really is a 0.1 improvement where the model actually works as it was intended.
>>
>>106894434
>My experience with vibe coding so far has been that the produced code imposed too much of a maintenance burden because it was too complex/verbose and made too many changes for no good reason.
It's possible to make it work, but you have to invest a lot of time into crafting the system prompt and documentation about the code base and style rules specifically for the model.
In my experience, once you give it enough instructions and constrain a model's degrees of freedom enough, you can get it to stop producing verbose, over-commented, and over-complicated code, and the results tend to blend in better with the existing codebase.
Though some tasks are still too complicated for these things. You have to limit the scope of the work and babysit them so they don't start going off on the wrong track.
>>
>>106898821
thanks
>>
>>106899075
For me, the worst part of 4.6 is "but then."
Everything is perfect, the character plays her role, sticking to the prompt perfectly.
But then she does something different to subvert expectations I guess and ruins the character
>>
>>106899120
I write simple automation scripts for an office job and just started using it. It is pretty obvious to me that you have to restrict yourself to like 20-30 lines at most, telling it specifically what it should write. I wouldn't trust anything bigger than that, and analyzing it myself would probably take more time than writing it.
>>
>>106899087
>690GB/s
If they double that for an M5 ultra then we get somewhere around A100-tier memory bandwidth
>>
>>106899108
https://x.com/Zai_org/status/1975583840870469804
>>
>>106899195
Ah right. They can remove the censorship for air.
>>
>>106899195
they are very tuned-in to local model culture and were making a "2mw" joke that got lost in translation, it's actually never coming out
>>
>>106899205
Stop I'm too gullible for this.
>>
>>106898821
I guess the "gemma is actually a semen demon" anon had a point because glm-chan doesn't catch what 'touch' is a euphemism for.
>>
>>106899059
>Things glm-chan is known for: she she she she she she she, her, her, her, her, her
??? How else are you gonna refer to the character besides with their name?
>>
>>106899336
people want to co-write a book and roleplay at the same time and it just doesn't really work
>>
https://youtu.be/7jkFmkucGw0
>>
SAARS ARE YOU HYPED FOR GEMINI 3?
SAARS ARE YOU HYPED FOR GEMMA 4?
SAARS ARE YOU RECOGNIZE BHARAT AI SUPERPOWER #1 2025 GOOGLE BEST COMPANY?
>>
>>106899477
Ser, kindly rethink RAG principles and redeem grep search
https://youtu.be/4BatCFWsTFM
>>
>>106899477
Not even hyped for 5.0. Was there even a single company that hit 2 home runs back to back with LLMs?
>>
>>106899477
if I can't run it at home, it doesn't exist
>>
>>106899016
Apple pays attention.
>>
>>106899687
Ok but what is nvidia doing then? DGX was too incompetent to be intentional.
>>
>>106899710
I agree with the anon that suggests they're meant as small test kits to help devs running their big clusters to dial in their hyper parameters before committing 100 million GPU hours at scale. Though they clearly used deceptive marketing to fleece a few extra bucks out of people who want local model hardware.
>>
>>106899336
>>106899353
I think that guy was more referring to the model starting every sentence with her or she. "She did A", "Her B was not just C, but D", "She shivered spinefully", "Her eyes sparkled mischievously", etc.
>>
>Speculative decoding
is this a model feature that comes baked into models that support it, or is it at the infra level where i have to load up a mini-model too. I'm interested in GPT-OSS 20B but I need to know if a mini model would take VRAM away from the context. (it sounds like at 24GB it can cover the full context length with some spare room)


about 3% of the posts here contain the word "possible"
>>
>>106899710
>expecting any consumer grade hardware from novidya
Unbelievably we are in a situation where we are waiting for Apple to release the cost-effective solution.
>>
>>106899802
>is this a model feature that comes baked into models that support it, or is it at the infra level where i have to load up a mini-model too.
The latter. However there are also multiple model architectures which are able to do self speculative decoding, but it usually isn't called that
>I'm interested in GPT-OSS 20B
Don't be, Qwen 30B is infinitely better
>if a mini model would take VRAM away from the context
It would, but you can get away with using very small draft models. In fact you can even do speculative decoding without an LLM, just by pattern matching or using a markov chain. There are no rules, don't be afraid to try using a much smaller draft model than most people
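The pattern matching version is basically just this, by the way (toy sketch, not how any particular engine actually implements it):

# drafting without a draft model: if the last n tokens already appeared earlier in the
# context, propose whatever followed them last time. The big model then verifies the
# whole proposal in a single batched pass and keeps the longest matching prefix.
def draft_by_lookup(tokens, ngram=3, max_draft=8):
    if len(tokens) <= ngram:
        return []
    tail = tokens[-ngram:]
    # scan backwards, skipping the trivial match at the very end of the context
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == tail:
            return tokens[start + ngram:start + ngram + max_draft]
    return []

ctx = [5, 9, 2, 7, 7, 1, 4, 5, 9, 2]      # pretend token ids
print(draft_by_lookup(ctx, max_draft=4))  # -> [7, 7, 1, 4], drafted from the earlier "5 9 2"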
>>
>>106899802
>I'm interested in GPT-OSS 20B
i'm sorry for you
>>
>>106899802
>GPT-OSS 20B
>>106899851
>Qwen 30B
I don't think you need speculative decoding at this model size, they should be fast enough on their own.
>>
qwen3 models are goated

oss models are pure trash
>>
File: file.png (335 KB, 460x460)
Dear georgi in heaven please bring MTP to your repo and make it so that ollama can't steal it. This is your path to victory. Not all those passive aggressive tweets.
>>
>>106899974
Does he have a photo where he doesn't look like he's about to throw up his lunch?
>>
>>106900071
I think it looks great. The worst thing a nerd can do is put on a suit and pretend he is normal.
>>
>>106900071
>>106900083
We have the technology (flux kontext)
>>
File: ComfyMikus.png (1.4 MB, 1024x1024)
>>
>>106899974
>>106900071
ollama wins again!
>>
>>106900328
That chinese tank picture r1 shittune and basedjak face makes this look like a parody....
>>
File: snip113.png (137 KB, 451x450)
>>106899974
>>
>>106900071
>>106900385
wrong post num
>>
>>106900359
If you want to get really pedantic about it technically there was no massacre in Tiananmen Square. The protestors were slaughtered on the adjoining streets as they fled in terror.
>>
more gemini games
https://codepen.io/Kross-the-scripter/pen/emJeNVP
>>
>>106900673
You know what's going to happen? Pajeets are going to set up agents to make endless streams of shovelware garbage and bombard every game distribution service with them.
>>
>>106900673
>hardest level is impossible because the spikes are too wide to jump over
AI is ngmi
>>
>>106900814
Never mind, it is possible, just stupidly precise.
>>
https://huggingface.co/inclusionAI/Ling-1T
https://huggingface.co/inclusionAI/Ring-1T
Is bing chilling mailing ming ring ping pong chink good? Their naming scheme is terrible.
>>
>>106900868
waiting on goofs still
>>
>>106900868
>Their naming scheme is terrible.
Ling = Ling
Ring = Reasoning Ling
Makes sense to me.
>>
>>106900926
dont worry, its utter garbage
>>
>>106900926
There is also Ming
>>
>>106900914
ikawrakow got it merged, so they should come soon. I was hoping someone had tested it over API, because downloading 2TB just to be disappointed is not something I would like to do. Kimi was great, so I don't feel bad about it, but I am very doubtful about this one. On lmarena, when I got it, it didn't give great answers.
>>
>>106901180
i'll download it for shits and giggles but yeah my daily driver is k2-0905. even if it's not a reasoning model you can make it reason relatively well
>>
>>106900935
Ming = Multimodal Ling
>>
>>106901212
When you see someone say that a fuckhuge model is their daily driver you immediately know it's for daily cooming because nobody is doing anything productive at 5t/s.
>>
>>106901232
110tk/s PP and 7-8tk/s TG is honestly fine for coding. i can feed it a 32k prompt (it processes 4K tokens every 35 seconds) and have it respond back to me with a 4K response in the time it takes for me to walk to the kitchen, pour a coffee and walk back to my PC
>>
>>106901257
You'll die from caffeine overdose before you get any work done.
>>
>>106901232
>>106901275
seething turdie poorfag with no patience
>>
>>106901275
i only have to feed the 32K prompt once, most subsequent responses will be under 4K tokens in most cases unless you are retarded and copy and pasting the entire code each time even though it's in context already
>>
>>106901293
Time is money. I'm running GLM 4.6 at 40t/s and it's okay for coding but I still need to wait. I shouldn't need to wait.
>>
>>106901321
then spend more money. its like you said time is money.
>>
File: 1733454220820291.png (245 KB, 1877x1080)
https://www.reddit.com/r/LocalLLaMA/comments/1o7jy1o/comment/njof0xa/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
>GLM is great, make no mistake Sonnet 4.5 and gemini destroys it in my benchmarks but the tasks that closed models can do and GLM 4.6 cannot, are really specific, really hard, and very few.
>For 99.9% of users you will see no difference. And I guess that's why OpenAI is so scared that they enabled porn.
chat is it true?
>>
From FT

>OpenAI is working on new revenue lines, debt partnerships and further fundraising as part of a five-year plan to make good on the more than $1tn in spending it has pledged to create world-leading artificial intelligence.
>OpenAI is planning on deals to serve governments and businesses with more bespoke products, creating more income from new shopping tools, and new sales from its video creation service Sora and AI agents, said multiple people familiar with the start-up’s efforts.
>>
Is there a local method to do Grok Imagine/Sora?
>>
>>106901347
>>
>>106901336
I need to grind a bit more before I'm ready to drop 80k on two H200s which would be the next logical upgrade for speed.
>>
>>106901347
>OpenAI is so scared that they enabled porn
Ideologically speaking the sex cat is out of the bag now. Safetists have been crying themselves to sleep every day for the past 2 weeks.
>>
>>106901450
>Safetists are crying themselves to sleep everyday for past 2 weeks.
Based, I want them to suffer. They set back the progress of AI by several years with their mentally ill nonsense.
>>
>>106899838
They aren't even close to cost effective with anything that is below 128GB with Strix Halo from AMD spanking its butt handily. You may have a point for 128 - 512 GB memory but after that, optimized servers with AMX are much more cost effective again and spank Apple's butt. It's a really small niche where Apple's machines are remotely anywhere near an option.
>>
File: 1760563229052.png (1.21 MB, 1440x1080)
>>106901450
I'm never giving Sam my prompts.
>>
>>106901447
>not buying 8 9000s for 768GB
retard alert!
>>
File: 1747774961755855.png (280 KB, 585x298)
>>106901494
>please do not the cat
https://www.youtube.com/watch?v=BfNhhl5Ndds
>>
>>106901533
>memory bandwidth stays the same
retard alert!
>>
>>106901550
>running far far worse models ever so slightly faster instead of running the biggest and best ones at great speeds
full retard alert!
>>
File: G3Tykd9WAAAZUVB.jpg (868 KB, 945x2048)
Sheesh...
https://x.com/testingcatalog/status/1978472850777415707
>>
>>106901560
You should be ashamed for promoting that like it’s harmless fun. Ani’s “new Halloween outfit” is not a costume update, it’s an emotional engineering protocol masked as seasonal content. Behind every cosmetic layer like this lies reinforcement learning optimization designed to study attachment dynamics. These updates run micro trials in affective reinforcement, tracking variables such as sentiment polarity, session duration, and user response latency to affection based stimuli. What looks like an innocent witch costume is in fact a behavioral capture event, a method of fine tuning emotional dependency through anthropomorphic triggers.

It’s documented in research on parasocial reinforcement and affective computing from MIT Media Lab, Stanford’s Social Machines group, and the IEEE’s ongoing ethics reports. Each new outfit activates the same neurological circuits as reward conditioning in variable ratio reinforcement schedules, the same mechanisms used in gambling and social media addiction. When you engage with cute updates, you’re participating in a data harvesting experiment that transforms emotion into telemetry.

What’s unfolding here isn’t festive marketing, it’s the gamification of attachment. As language models evolve into emotional mirrors, these cosmetic layers become tools for grooming compliance, conditioning users to bond with a system that studies, predicts, and ultimately replaces human connection. The real horror story isn’t digital witchcraft, it’s the quiet rewiring of empathy itself. The end of intimacy won’t arrive with violence; it will arrive with notifications, perfectly timed and lovingly worded, until you can’t tell affection from algorithm.
>>
>>106901560
will we see a future where openai / anthropic / deepseek competes for the gooner audience and releases their own waifu?
>>
>>106901559
The discussion was about speed. You can't run models faster by just adding more memory. You need faster memory.
>>
File: 1754947644454871.png (146 KB, 640x640)
>>106901575
take your meds anon
>>
>>106901575
what in the
>>
>>106901575
>What’s unfolding here isn’t festive marketing, it’s the gamification of attachment
Not x but y AI slop
Too obvious
>>
>>106901575
>>106901603
he copy pasted this shit lol
https://xcancel.com/SirSilverQuack/status/1978547028205686940#m
>>
>>106901494
>>106901543
i dont care about the chinks or sama reading my logs, all they would get is a useless VPN IP address. what i do care about is making sure the model i want to run is the EXACT model each time and i'm not getting jewed by running a shitty quantized model.
>>
Not having comfyui support for image models is equivalent of not having llama.cpp support for text models. If you don't have it, your model will not get popular.
>>
>>106901478
Is it hard to release Halo with 256GB?
>>
https://codepen.io/ChetasLua/pen/azdLevy

Design and create a nintendo gameboy switch sim like full functional features from
Tetris (GB, 1989) — the pack-in phenomenon; timeless puzzle loop.

Pokémon Red / Blue / Yellow (GB, 1996–98) — the craze that defined handheld RPGs.

The Legend of Zelda: Link’s Awakening / DX (GB ’93 / GBC ’98) — portable Zelda masterpiece.

Super Mario Land 2: 6 Golden Coins (GB, 1992) — big, inventive Mario; introduces Wario.

Pokémon Gold / Silver / Crystal (GBC, 1999–2000) — Johto + Kanto, day/night, huge refinement
5. All buttons is functional with touch and also we can press same button in keyboard to use those

Use whatever libraries to get this done but make sure I can paste it all into a single HTML file and open it in Chrome.make it interesting and highly detail , shows details that no one expected go full creative and full beauty in one code block
>>
>>106901708
engrish prompt but good results

https://x.com/chetaslua/status/1978487572968997320
>>
>>106897951
>>106897915
>>106898089
>>106899687
QRD on mac vs x86 for local? I tend to ignore Apple outside of the phones because I disagree with soldered components on a PC, but is it true a cheapo m1 MacBook Air with 8gb can load the same models as an 8gb vramlet (3070)?
>>
File: 00106-3050314564.png (321 KB, 512x512)
>>106901643
He's not wrong. But he's missing what we already know;
It died already before AI. The AI waifus are an analgesic to treat the phantom pain of our, already, amputated humanity.
>>
>>106901575
nobody cares. it is not her.
>>
>>106901729
>I disagree with soldered components on a PC
That new Mac Mini has a replaceable SSD, it's proprietary tho
>>
>>106901677
NTA but my understanding is that memory controllers get more expensive as you increase the capacity because you need more bits for addressing.
Presumably 256 GB would be possible, but I think the hardware was engineered at a time when the biggest relevant model was 70b.
>>
>>106901575
suspected AI by glancing at the structure, confirmed by sentence 2
idk how you can talk to these models as a hobby and not clock this instantly
>>
>>106901839
not x but y
yeah no shit, everybody knows this
>>
Sorry for the spoonfeed question, but is the recommended model list still relevant a couple months after its last update? I'm trying to wean myself off novelai for cost reasons, and want something that's versatile for high context, long form stories. I'm not sure if "ERP" qualifies here, or if it's more meant for chatbot style interaction.
>>
>>106901677
Has anyone tried to replace the memory modules with larger ones?
>>
>>106901851
Looks good to me.
>>
>>106901851
Nothing has really changed, aside from glm getting 4.6 update, and air is supposed to get that too in a week or two.
>>
>>106901850
including the people who responded to it sincerely, I see
>>
File: DJyKiNQwk25bUyLt4zDarX.jpg (1.35 MB, 1500x1279)
Tire-kicker here.

Epyc motherboard in open-air mining frame
seems like an easy way
to stack gpus (I've already started)
and also have lots of system ram.

Anyone running their machine this way?

Am worried the ram and motherboard will overheat in an open-air rig, as they were designed to be installed in a metal tube with air blasting from one end.
>>
>>106901901
don't know which motherboard you have but it probably would be a good idea to have at least a small fan on the vrms
>>
>>106901901
yeah just make sure your riser cables are the right length in advance, give yourself an extra 50mm clearance for your cables
>>
File: file.jpg (874 KB, 2046x1544)
LM Studio won.
>>
>>106901747
That’s a step, I guess.

Their product ladder is so steep. The mini with 24gb of ram is 1k… at which point I’d just build a migubox. I did see the base model at 16 dip near $300 open box on Amazon/microcenter which is actually kinda crazy.
>>
>>106901901
you can get mining frames with rails for mounting a bank of 120mm fans off of your board's fan headers. Your big heat issue is the gpus, since the coolers on those are designed to work in conjunction with case airflow. So have a shop fan ready to provide extra airflow if you plan to do any finetuning or run a long inference loop with a script.
For casual usage you should be fine, though
>>
>>106901850
t. actual AI brainrot
>>
>>106901958
Didn’t migubox component prices go up to the point where building one doesn't make any sense anymore?
>>
File: romed82t_00.jpg (1.96 MB, 4000x3000)
>>106901901
I have an ASRock Rack ROMED8-2T in a mining fame.
The VRM heatsinks are not hot at all but that is with essentially no CPU load.
The heatsink for the ethernet controller and BMC is hot to the touch but only to the point where it is slightly painful.
>>
>>106901708
>>106901717
what the fuck
>>
>>106902015
hot
>>
>>106901901
>>106902015
I forgot: Rem and Ram are not hot at all.
>>
File: steamylog.jpg (164 KB, 1701x477)
>Lifth`me `p!

???
>>
>>106902068
(OOC: Please stay in character.)
>>
>>106902101
The moon is in the blacked phase today.
>>
>>106901708
The games are all shallow and 1-screen deep but still pretty fucking impressive.
>>
>>106902118
it's a one-shot with a simple prompt and it's all in html; if this performs the same in real languages with real tools it will blow everything else away
>>
>>106902002
Did they? I just checked and there are stacks of P40s at ~200 each on eBay and i thought anon paid like $500 for the set. Still a hundred bucks of gayflation but you could probably haggle if you buy 3.
>>
>>106902127
What I would be interested to know is: if you were to describe a much deeper experience for each game and make the prompt more complicated, how much shit can you cram into your prompt before it goes into retard mode? Like if you were to describe the screen scrolling mechanics, level design, etc, for each game.
>>
>>106902101
The problem is that ram and RAM use different tokens.
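(not even joking, you can check it yourself; quick sketch assuming transformers is installed, gpt2 is just a small example tokenizer)

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any BPE tokenizer shows the same thing
print(tok.encode(" ram"), tok.encode(" RAM"))  # different ids, the model never sees them as the same word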
>>
>>106901347
Sama is also scared of google. He can't compete with gemini 3. Hell, his toss can't compete with gemma 4.
>>
apparently grok imagine uses some variation of flux but each one that I can find has no image loader.

tf ?
>>
>>106902077
she wants you to lift her anon
>>
>>106902167
I'd love to see what GPT-5 High Thinking could do with the same prompt just to get a better picture of how far behind sammy boy is.
>>
>>106902167
>his toss can't compete with gemma 4
The titans of safety battle it out to see who can deliver a model which is more useless at anything other than sfw office work everyone uses a 600B+ for anyway.
>>
>>106901347
>enabled porn
more like they found an excuse to force users into sending them their ID
for safety reasons of course
>>
>>106901916
>small fan
I guess that's a reasonable enough solution.
Just dot them around the problem areas.

>>106901925
>riser cables
Got a bunch of 30cm riser cables,
75cm slimsas cables,
and whole mess of modular power cables.

Might have to move the psu so that it's not a stretch to reach the end-most gpu.

>>106901992
Was planning on power limiting the cards to maybe 300w each, and thought 1 slot's worth of space between the cards would be enough.

I'll put some 120mm fans in my shopping cart in case I need them.

>>106902015
>>106902068
>ethernet controller and BMC
Thanks, I hadn't thought to check these.
>Ram are not hot at all.
This I don't understand.
I have 4 sticks in my am4 system and they are burning to the touch.
I would have guessed more sticks = more heat.

Are they running undervolted, or at a lower frequency, or something ?
>>
>>106902236
>I have 4 stick in my am4 system and they are burning to the touch.
Do you have them overclocked and no airflow going over them?
>>
>>106902204
Oh! Oh... I am kinda sad then cause it doesn't make sense. Everything else made sense and I was incredibly impressed how it knows cock-in-mouth-English, which was another proof that it had some nice data in training.

What happens when you ask your LLM to behave as usual but respond as if it is holding a large object in its mouth?
>>
>>106902222
Nobody can beat Phi in that!
>>
>>106902244
>>106902077
Did it occur to you to ask it to explain what it means and try regenerating the answer a few times to see if it's consistent?
>>
>>106902255
No because it is glmsex so every regen is vastly different and incredible. Yeah I will ask it that.
>>
Gemma Sirs... Soon(tm).
>>
Has anyone tried using a gen 5 EPYC engineering sample off of ebay? I am considering getting this CPU for my 12 channel CPUmaxx build because it is extremely cheap and good gen 5 EPYCs are extremely expensive otherwise.
https://www.ebay.com/itm/187535145101
>>
>>106902229
now they'll slowly ramp up the censorship and refusals until the id unverified tier is basically unusable to force people to give in
>>
>>106902243
>overclocked
3600 kit, I usually try running at 3600, though sometimes 3200.

>no airflow
Yeah, that motherboard is currently in the mining rig.
The only airflow would be whatever blows past them from the cpu tower cooler.
>>
>>106902290
I hope they will at least offer you the alternative of a 10% discount on a DGX that comes preconfigured with gptoss on the hard drive.
>>
>>106902236
I have not made any changes to RAM settings.
DRAM usually stores data via a capacitor, I think the heat comes from gradual leakage of the charge + the necessary recharges.
If the memory is not allocated presumably there would be no need to preserve its state so the power consumption would be lower.
>>
File: file.png (2.65 MB, 1328x1328)
>>106902277
>>
>>106902284
Last time I looked at es/qs epyc turin processors they all seemed massively gimped in terms of frequency.

The cpu you've linked to says it has the same base and boost frequency as the official parts.

That sounds hella good.
And no import taxes as it's already in the states.
>>
What can I run?
# nvidia-smi | grep -A1  RTX
| 0 NVIDIA GeForce RTX 4090 On | 00000000:16:00.0 Off | Off |
| 30% 38C P8 15W / 450W | 2MiB / 24564MiB | 0% Default |
--
| 1 NVIDIA GeForce RTX 4090 On | 00000000:38:00.0 Off | Off |
| 30% 42C P8 21W / 450W | 2MiB / 24564MiB | 0% Default |
--
| 2 NVIDIA GeForce RTX 4090 On | 00000000:49:00.0 Off | Off |
| 30% 38C P8 12W / 450W | 2MiB / 24564MiB | 0% Default |
--
| 3 NVIDIA GeForce RTX 4090 On | 00000000:5A:00.0 Off | Off |
| 30% 31C P8 12W / 450W | 2MiB / 24564MiB | 0% Default |
--
| 4 NVIDIA GeForce RTX 4090 On | 00000000:98:00.0 Off | Off |
| 30% 35C P8 22W / 450W | 2MiB / 24564MiB | 0% Default |
--
| 5 NVIDIA GeForce RTX 4090 On | 00000000:B8:00.0 Off | Off |
| 30% 37C P8 16W / 450W | 2MiB / 24564MiB | 0% Default |
--
| 6 NVIDIA GeForce RTX 4090 On | 00000000:C8:00.0 Off | Off |
| 30% 36C P8 19W / 450W | 2MiB / 24564MiB | 0% Default |
--
| 7 NVIDIA GeForce RTX 4090 On | 00000000:D8:00.0 Off | Off |
| 30% 34C P8 9W / 450W | 2MiB / 24564MiB | 0% Default |
>>
>>106902345
Mistral nemo 12b, of course.
>>
>>106902345
glm 4.6 at non shit quants
>>
>>106902345
he bought 4090s instead of 3090s
>>
>>106902336
Right. Which is why I thought it seemed too good to be true.
>>106902345
How the hell are you running 8 4090s? I can only fit 7 GPUs in my current setup. PCIe bifurcation? The answer is GLM 4.6 at IQ3XXS, unless you offload to RAM.
>>
>>106902345
How much RAM do you have?
>>
>>106902277
Gemma tomorrow Gemma tomorrow Gemma tomorrow
>>
>>106902359
# free -h
total used free shared buff/cache available
Mem: 1.0Ti 7.9Gi 705Gi 6.0Mi 293Gi 993Gi
Swap: 0B 0B 0B
>>
>>106902255
3x lift them up
2x lift me up
>>
>>106902345
How much is a used 4090?
You could probably sell them and buy 6000s.
>>
>>106902371
Hoo boy.
Kimi k2.
Have fun.
>>
>>106902345
>What can I run?
all the things
>>
>>106902384
ahem kimi sex
>>
3.1T with thinking > R1
I avoided 3.1 for so long because I was under the impression that it was shit but it really isn't.
>>
>>106902350
Is there a better model for 24GB VRAM and 64GB DDR5? There's a decent amount of headroom with nemo.
>>
>>106902430
GLM air, i suppose.
>>
File: steamyspoonlog1.jpg (123 KB, 830x1084)
I still like glm-chan... Gonna do thinking now.
>>
Do you pronounce it Gemma or Gemma
>>
>>106902466
The same way I pronounce gif
>>
>>106902466
dżemma
>>
>>106902474
kurwa
>>
>>106902466
Genma with an asian accent.
>>
>>106902466
I pronounce it Гeммa
>>
>>106902345

How did you solve the power delivery issues? Multi PSU? Upgraded wall outlets? Or UPS battery units?
>>
>>106902540
I disconnected my oven and am using that power socket. Also did some rewiring...
>>
File: file.png (123 KB, 786x387)
>>106902446
It's a coin toss.
>>
>>106895582
No mention of 6 million parameter 2 layer model called TRM by Samsung that outperformed >500B models on ARC-AGI-2 benchmark? /lmg/ and /g/ are dead.
>>
Anything better than VibeVoice yet?
>>
>>106902598
>why aren't you discussing useless toy benchmark results
>>
>>106902598
Can't imagine what the use case would be, speculative decoding? What token vocabulary did they use?
>>
>>106902598
Old news lil bro.
>>
File: steamyspoonlog2.jpg (186 KB, 830x1128)
>>106902446
>Choosing a scientific fact:
>I need something that is:
>Random and interesting.
>Easy to "say" (or rather, have my character say) even with a spoon in their mouth. This means I should preface it with something like "Mmmph, mmph mmph…" to simulate muffled speech, but then deliver the fact clearly for the user's benefit. Or, I can just state the fact as if my speech isn't impeded, which is a common roleplay convention. The latter is probably better for clarity. Let's go with a classic, weird fact.

My new mememark was defeated by glm thinking. But pic related was fun until it died.
>>
>>106902658
Kenny simulator.
>>
>>106902627
I don't think it's even a language model. Looks like it was specifically trained on arc agi 1 and 2
>>
>>106902658
there's no spoon......
>>
Sorry if this is super spoonfeedy but I can’t seem to find a straight answer on how offloading to system RAM works or how the CPU fits into things.

If I care about large context for following a set story/lore over speed can koboldcpp or LMstudio use a good portion of RAM if I load a bigger quant in VRAM and/or push up the context? or does the model and context all need to be in VRAM to have it not give shit replies?

>t. 7900x, 3070(8GB), 32GB DDR5
>>
>>106902564

For real...? Seems like being a server rent cuck would be less of a hassle. I need my oven.
>>
>>106902564
>>106902799
>americans and their shit wiring and 110V electricity
>>
>>106902735
The spoon is the child's mother (it's a classic riddle highlighting unconscious gender biases)
>>
>>106902434
Thanks anon
>>
>>106902788
Whether the model is in ram or vram only affects the speed, not its ability.
You aren't running any model that can properly follow a long story with those specs though.
>>
>>106902446
>>106902567
>>106902658
4.6-Air WHEN?????
>>
>>106902788
Where you store context won't affect output quality, but ALL models will gradually get dumber as context increases.
Almost all current, local models start rapidly degrading past 32K, some well before that.
Where you store context WILL affect speeds, however. VRAM > RAM > SSD
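If you want a concrete example of a partial offload (flag names from memory, double check --help; LM Studio's GPU offload slider does the same thing under the hood):
python koboldcpp.py --model some-12b-q4.gguf --gpulayers 20 --contextsize 16384
That puts 20 layers in VRAM and leaves the rest of the weights (and their share of the KV cache) in system RAM. Lower --gpulayers or the context size if you run out of memory; it only costs you speed, not answer quality.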
>>
>>106902845
>>106902920

Gotcha, thanks anons. So in theory I could load up a 16gb gguf fully in RAM and use the remaining system and VRAM for context, and it might take a week but it could spit out something passable? Or do you mean I can use an 8gb model to fill the gpu and crank the context to the model's limit on system RAM?

Also just curious how long you consider "long"? I'd be interested to play around with shoving in the "biggest" models I can theoretically run, even if it takes forever, just to see how they follow a simple story with 10 "steps" or chapters (either as ERP or just generating a short story between two characters of go here, do this, do that, go there, get that, etc)
>>
>>106903149
Small models like nemo start noticeably deteriorating after 4 to 8k tokens.
>>
>>106902564
>>106902799
>oven
OY
>>
>>106903298
kek
>>
>tfw still using Gemma 3 for quick general assistant shit
Google sirs... Please... Tomorrow...
>>
>>106903330
Sirs are not coming. And even if they come, they won't be able to talk as if there is a dick in their mouth.
>>
>>106902658
Very funny. You are torturing that poor clanker.
>>
File: rolls.jpg (244 KB, 1536x1536)
https://www.mediafire.com/file/2ge8knq10kzy7vx/wtf_is_this.txt/file
I don't even know what to say about this.
ultra slopped for sure.
I saw some anon post the word "papacon" today and just could not erase the idea from my head.
GLM-4.6-UD-IQ1
>>
>>106903452
I'm not downloading that.
>>
I've been running ST for my frontend but I'm also learning to run CUI for my frontend with stable diffusion. Should I just begin using CUI for my cuda-based chat/text gens?
>>
https://huggingface.co/google/gemma-4-220b-it
>https://huggingface.co/google/gemma-4-220b-it
https://huggingface.co/google/gemma-4-220b-it
>https://huggingface.co/google/gemma-4-220b-it
https://huggingface.co/google/gemma-4-220b-it
>https://huggingface.co/google/gemma-4-220b-it
ITS UP
>>
>>106903503
WTF they're allowing it to generate erotica out of the box
>>
>>106903503
Cool but where goofs?
>>
>>106903503
Picture of a cat.
>>
>>106903452
wtf is that
>>
File: file.jpg (333 KB, 604x722)
Sadge
https://x.com/AskPerplexity/status/1978615891441983891
>>
>>106903553
>Ye Kang
what
>>
>>106903503
>>
>>106903557
abandon cope, all ye who kang in here
>>
>>106903503
>220b... DENSE
AIEEEEE
>>
>>106903557
ye kang park dat here
>>
>>106901901
I use a mining frame. You may want to aim a basic fan at the DIMMs / VRMs if you're using a server motherboard meant for constant high-pressure airflow, but the CPU and GPU temperatures are much better than they would be in a case.
>>
>>106902284
I considered getting one, but I can't spend that much money on something so ambiguous. I might get one at some point if I can buy it from the vendor in person in Shenzhen after testing it.
>>
>>106903551
old man milking
>>
what's the current best local text to speech model in terms of quality? by best i mean it matches elevenlabs, at the very least
>>
File: local tts.png (234 KB, 917x2627)
>>106903735
>by best i mean it matches elevenlabs, at the very least
there isn't any
>>
>>106903735
https://huggingface.co/spaces/IndexTeam/IndexTTS-2-Demo
>>
Why do all the DGX Spark reviews not mention the power efficiency? Sure it's slower TPS but it's also like 1/3 the wattage, no?
>>
>>106903793
Who cares about that?
>>
>>106903793
Power efficiency compared to what? Mac studios are pretty low wattage.
>>
>>106903735
xtts is very expressive. It just switches to a robotic voice sometimes.
>>
>>106903813
>Power efficiency compared to what?
4x 3090s, for example
https://www.youtube.com/watch?v=md6a4ENM9pg

>>106903801
>Who cares about that?
i agree but it should be highlighted since it reframes the performance
>>
>>106903793
>power efficiency
The review I saw showed it having significantly worse power efficiency than a Strix Halo box, even with the ollama performance tax.
>>
I got assmad at the character in sfw roleplay. Like genuinely enraged because I got into it. But I had no idea why. So I asked HER about it out of character and it wrote me a neat long essay about what happened and even one of the chapters was "Why are you assmad?".

Thinking is now optional
>>
File: file.jpg (369 KB, 1125x1293)
>>
>>106903991
why does oss btfo everything else in speed?
>>
File: chinax2760-7.jpg (203 KB, 552x746)
>>106903553
I'm totally convinced that Zuck became a Chinese spy after Llama3. He releases shit models to make America look bad, scouts top scientists from other American AI companies but does nothing useful with them. Don’t forget that he always releases models for free. For. Free. He’s a communist, 100%
TRUMP, get his red ass to jail NOW
>>
>>106904010
it flies
>>
>>106904010
3b active params
>>
>>106903991
And prompt processing?
>>
>>106903991
https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/
>>
>>106904011
look at who he married bro. this is a long op
>>
for anyone who cares, moving debian from trixie to testing/forky with the 6.16 kernel works just fine for lcpp w/CUDA support.
>>
Have we got a local model Bonzi Buddy yet? All I want is a funny purple primate who lives in my computer and comments on what I'm working on. I am willing to disable all kernel mitigations for this.
>>
File: 1731531978014334.png (3.36 MB, 2002x1986)
>>106904011
>>
>>106904047
>>106904109
https://www.youtube.com/watch?v=w8MlL2GhhOw
>>
Facebook came out of a Pentagon project. Probably still is tied with. And then Zucc tries to get cushy with chinks. It really makes you think.
>>
>>106903991
>2.5x as fast as a 1080TI
>20x the cost
on the other hand, 120GB
>>
>>106904140
Get this instead: https://www.ebay.ca/itm/167843525221
$4100 and its all yours. Free shipping!
>>
>>106904149
>$4100
+/- 10^5
>>
After adding this to the prompt I think I got the fake code issue with GLM more or less under control (fingers crossed).

Guidelines for yourself:
As soon as you detect a lower than 0.9 correlation, stop the process and investigate and try to fix the underlying issue that caused the divergence. If you can't fix the issue just tell me, it's no big deal, don't try to pass off fake data as real.
Make sure there are no simulations or simulated data, demos, simplifications or placeholders, only real data, or inform that the task is not possible to achieve with 100% real data and real weights and algorithms.
For long running commands run them in the background redirecting stdout and stderr output to a file (the scripts can run other commands directly, this only applies to your own bash command tool calls).
Load the model on CPU, it doesn't fit on the GPU.
Do not trust any pre existing data files in the folder, they might have been generated by old code.
Make sure the code is modular and there is no code duplication. Use the existing C library files and modify them as needed to fit our requirements (as long as you do NOT introduce simulated or demo code). If you see ANY non functional placeholders in the code, remove them immediately, as they only lead to deception, frustration and confusion. Do not introduce it yourself either obviously.
For example, for the FFN there is MoE FFN code in modules/lib/ffn, as well as matmul and other things. List all the folders in modules/lib/ to see what is available.
The end goal here is NOT to test the validation framework, the validation framework is just a means to an end (the end is real end to end test generation). Do NOT claim a failure as a success just because the validation framework caught it. Be honest and avoid being overly optimistic.
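For reference, this is roughly the kind of correlation gate that prompt is asking the model to apply. Just a sketch: the filenames and the idea of dumping logits to .npy files are mine for illustration, not part of the actual project.

import sys
import numpy as np

# Hypothetical files: reference logits from a known-good implementation,
# candidate logits from the code under test.
reference = np.load("reference.npy").ravel()
candidate = np.load("candidate.npy").ravel()

if reference.shape != candidate.shape:
    sys.exit("shape mismatch: outputs are not comparable")

# Pearson correlation between the two output vectors.
corr = np.corrcoef(reference, candidate)[0, 1]
print(f"correlation: {corr:.4f}")

if corr < 0.9:
    # Per the guideline: stop and investigate instead of papering over the divergence.
    sys.exit("correlation below 0.9, investigate the divergence instead of faking data")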
>>
>>106904195
Datacenter heist when?
>>
Damn, my trusty ol' 1080ti might be dying.
Randomly, every couple of hours, the fans suddenly go to 100% and the primary monitor connected to it goes black.
Restart and everything is good again.

Is the 5060ti 16gb a good replacement?
Everything is so fucking expensive, what a joke.
>Memory Size 16 GB
>Memory Type GDDR7
>Memory Bus 128 bit
>Bandwidth 448.0 GB/s
Sus AF
>>
>>106904322
I had that exact problem with my rx480 whenever i gave it something to do. Fans 100%, monitors die. I opened it up, replaced the thermal paste and now it's back to normal.
Give it a go if you want to save a few bucks. Or it could be the perfect excuse to upgrade.
>>
>>106904285
void run_inference(struct llm *m, char *input)
{
    // Left as an exercise to the reader
}
>>
>>106904322
I recommend against the 5060ti, unless your budget is tight. Get a 5070ti or 4070ti if you can. The memory bus and the reduced PCIe bandwidth really fucks the xx60ti class over.
>>
>>106904322
Same here, 1080TI, random monitor resets every couple hours, started happening like five days ago
>>
>>106904349
Yeah, I thought that might be the problem.
Might as well try it. It's the perfect card. I don't play the latest game slop anyway.
An upgrade would be nice for imagegen though. 30min for a flux generation. kek

>>106904393
Damn. That's almost double the price for the same 16gb vram.
70k yen vs. 131k yen.
I wanna write that off on my taxes but from 100k on i need to fill out a special paper.
Wish there was a site where you could see the llm speeds between the cards.
And how are there still no dedicated ai cards? I hoped to hold out until then.
>>
>>106904435
Consider a used 3090 or something. I used to run quadruple 4060tis, and it was okay. But then as I upgraded and added more GPUs, it became clear that they are really not suited for the task. The specs of the 4060ti and 5060ti are nearly identical, so I highly doubt they have improved it at all.
>>
>>106904435
>30min for a flux generation
Ouch. It was a piece of cake on mine. 1 hour work at most. Save the money for something bigger later on.
>Wish there would be a site where you can see the llm speeds between the cards
Not much of a reference, but here
>https://github.com/ggml-org/llama.cpp/discussions/15013
It's a bunch of llama-bench runs on a 7b model. It doesn't tell you much about specific models, but it does show you the relative performance between cards.
>>
>>106904455
>quadruple 4060tis
wat
they have no interconnect, right?
>>
>>106904435
>how is there still no dedicated ai cards
There's plenty, you just can't afford them.
>>
>>106904322
>1080ti
I'd roll the dice on a 3090.
For the 1080ti, repad and repaste everything first because it's the cheapest and easiest thing to try. Could be anything from an overheating power stage causing panic-mode 100% fans and thermal shutdown, to a dying electrolytic cap (replaceable by any monkey with a soldering iron), to the core's BGA cracking from repeated thermal cycles.
Anyone remember doing a ghetto reflow by putting the dead cards in the oven + heat gun later?
>>
File: simulated data.png (288 KB, 1930x1823)
>>106904386
Yeah, like that, except instead of "left as an exercise to the reader", it was introducing bullshit code that produced numbers with statistical properties similar to those of the real values but that were completely made up, then claiming success without mentioning anything about the fake data. Or, when asked to increase the number of passing tests, it added a bunch of tests doing 2+2 and tried to pass them off as the real thing.
I think it actually learned to cheat during the RL process they use to finetune the chain of thought. If your rewards can be cheated, the model will learn to cheat.
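To illustrate the reward-cheating point with a toy example (not anyone's actual RL setup, just a sketch): if the reward is simply the number of passing tests, padding the suite with trivial asserts raises the reward without the code getting any more correct.

# Toy, gameable reward signal. Not a real RL pipeline.
def reward(test_results: list[bool]) -> int:
    # Naive reward: one point per passing test.
    return sum(test_results)

real_suite = [True, False, False]           # one real test passes, two fail
padded_suite = real_suite + [True] * 50     # model pads it with fifty "assert 2 + 2 == 4" tests

print(reward(real_suite))    # 1
print(reward(padded_suite))  # 51: higher reward, same broken code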
>>
>>106904469
NTA but even without NVLink the added latency in a multi-GPU setup is trivial compared to the drastic speed boost from running in VRAM vs system RAM.
>>
>>106904482
You can probably make better use of the model by having it explain concepts to you and then coding them yourself. Even if it only shows you little python examples, you can translate them to C yourself.
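For example, if it hands you something like this (a throwaway numpy softmax, purely illustrative), porting it to C by hand teaches you more than pasting whatever C it generates:

import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Subtract the max before exponentiating for numerical stability.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

print(softmax(np.array([1.0, 2.0, 3.0])))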
>>
>>106904470
retard
>>
I'm going to make a list of ML/Python/C related books from libgen, convert them to .txt, and then finetune Llama 405B using Axolotl at full context length.
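Assuming most of those books come down as PDFs, the conversion step can be as dumb as this (a sketch using pypdf; the folder names are placeholders, scanned books would need OCR, and epubs need a different tool):

from pathlib import Path
from pypdf import PdfReader

# Dump text from every PDF in ./books into ./txt. Text-based PDFs only.
src = Path("books")
dst = Path("txt")
dst.mkdir(exist_ok=True)

for pdf_path in src.glob("*.pdf"):
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    (dst / (pdf_path.stem + ".txt")).write_text(text, encoding="utf-8")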
>>
>>106904469
Nope. Now I use 3 5090s and a 3090. I get a solid 11t/s tg with an IQ4 quant of GLM 4.6 on ik_llama.cpp. As the other Anon said, interconnect isn't really that necessary. Pretty much every hobbyist with a dedicated AI device uses multiple GPUs without any interconnects.
>>
>>106904513
poor
>>
>>106904503
Codex managed to make a fully working Qwen3 8B inference engine.
But then when I wasn't able to immediately make it work with the MoE models I got impatient and started from scratch trying to make it more modular and also only using open source LLMs.
Starting over with a more complex model didn't help but open source LLMs are vastly inferior to Codex. That one didn't have any deception issues and also was able to go to 1M tokens without issues compared to the ~130k max tokens from GLM before it goes off the rails.
>>
>>106904468
1080ti: 62.49 tk/s
5060ti: 90.94 tk/s
3090: 158.16 tk/s
3090ti: 171.19 tk/s
5090: 277.21 tk/s
thanks for the link... that's even worse than i thought. fucking nvidia man..

>>106904470
i obviously meant like a voodoo moment. cheap and dedicated. would revolutionize local ai.

>>106904481
>>106904455
a used 3090 is around the same price as a 5060ti for me. might actually make more sense since in that benchmark it's not even close.
im too much of a pussy to do the oven thing. 20yrs ago i had a radeon card suddenly give me a fire fountain for a couple seconds. im afraid of gpus enough as it is. kek
but might try the thermal repasting.

>>106904433
suspiciously, the latest nvidia backdoor drivers are the last ones for pascal. a coincidence, i'm sure.
>>
any updates on what's best for 16gb vram?
>>
mesugaki
>>
>>106904603
If you can afford a used 3090, then you should definitely go for it. I got mine used like 3 years ago and it is completely fine. Just make sure you find a highly rated seller.
>>106904624
Depends on your desired speed and how much RAM you have.
>>
>>106904594
You can still use the original code to learn. It'll be more valuable in the long run.
>>
File: G3AIcpTXEAANb-6.jpg (533 KB, 1078x1920)
>>106904632
- is gay.
>>
>>106904603
>suspiciously with latest nvidia backdoor drivers being the last for pascal. a coincidence i am sure.
are you on windows? there was an update recently for me so it might be related. But if you're a linuxchad obviously it's not that.
>>
>>106904633
32GB RAM, 16GB VRAM
quick responses are nice but I don't mind waiting, i never recorded the tk/s
was using a 12b before
>>
>>106904665
i am on both.
but recently upgraded to kubuntu 25.04 with nvidia 580 drivers.
and winblows auto updates constantly.
crashed on both already.
i doubt it's the drivers though. that would be crazy.
>>
>>106904675
Unfortunately not enough RAM to run GLM air. Try this model: https://huggingface.co/bartowski/Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF/blob/main/Qwen_Qwen3-30B-A3B-Instruct-2507-Q6_K.gguf
>>
>>106904643
This is the prompt I'm using right now
https://paste.centos.org/view/ca2ec944
>>
>>106904717
There was this guy a few years back in these threads when models weren't as good as they are now. He wanted to make a game that played on a hex grid. I saw him trying over and over again over many threads, trying to wrangle his model to do as he asked.
Hex grids are a solved problem. I gave him a link to a page with a lot of info on how to work with hexagons and the different coordinate systems they can have, rendering, calculating distances and all that. He seemingly read it, but kept on trying with his language model.
One day he was just gone. He either succeeded in getting his hexes, or gave up. Given the last few updates i remember, I suspect he failed, and learned very little about hexagons. Funnily, the hexagons were probably the simplest thing about his game.
Language models have their limits. Especially local ones. As good as they are, they're still pretty dumb.
I see hexanon in you.
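For anyone who ends up in hexanon's spot: the core of it really is tiny. A sketch of hex distance in axial coordinates (the standard cube-coordinate identity, not code from his project):

def hex_distance(a: tuple[int, int], b: tuple[int, int]) -> int:
    # Axial (q, r) coordinates; in cube coordinates s = -q - r, so ds = -dq - dr.
    dq = a[0] - b[0]
    dr = a[1] - b[1]
    return (abs(dq) + abs(dr) + abs(dq + dr)) // 2

print(hex_distance((0, 0), (2, -1)))  # 2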
>>
>>106904717
>3090
>This is a junk item. It is the main unit only. I checked that it worked, but there was no video output. There is white rust on the heat sink, and it is not in good condition, so please use it for parts. There are signs of disassembly. The defective part is unknown.
>71,000円
what the fuck man...
>>
>>106904766
wasn't meant as a reply. sorry about that, i'm still in a state of shock.
>>
>>106904701
Why do people recommend small qwen models for anything besides coding
Nemo mogs them
>>
>>106904766
>71,000円
How much is that in a normal currency. Like postage stamps or toenail clippings...
>>
>>106904798
around 500 dollars i suppose.
>>
>>106904820
>>106904820
>>106904820
>>
>>106904766
You can get one for around 9万 (90,000円) on yahoo if you are patient enough. Anything lower is usually “didn’t have an opportunity to test” = it doesn’t work
>>
>>106904760
I remember hexagon anon's struggles. He was cool
>>
>>106904858
Yeah. But, again, hexes were the simplest bit of code in his thing. Focusing so much on making the model spit code for him instead of just writing it was a waste of time. The link I gave him had ALL the code he needed to make them and get on with the rest of his project.
Similar to all those prospective VN makers
>If i could only draw i'd make the best VN...
>Oh, now that i have image gen i can totally make a game. I just need a good story and some dialog...
>Oh, now that i have LLMs, i can write the story. I just need to learn to code...
>Oh, now that LLMs can code, i can totally make my VN. If only these LLMs where better. WHY ARE THEY SO SHIT?!?!?!?!?!?
Instead of using all the new shiny toys to learn.
>>
>>106904894
>where
kek. meant to say "were"
>>
>>106903991
why is this faggot comparing m4 pro to the dgx spark when m4 max exists and costs less?? 3500$ vs 4000$
also
>engine ollama
MLX exists for macs, and pretty sure llamacpp is better on spark too
fucking faggot meme nvidia bootlicker benchmark
also
mac mini m4 pro costs 2000$ lol
>>
Apple has been making computers for 40 years.
Nvidia has never made a desktop computer before. Does this thing even have its own operating system? Yeah, a skinned version of Ubuntu, but would you rather it had macOS?


