/lmg/ - a general dedicated to the discussion and development of local language models.

Not Suspicious At All Edition

Previous threads: >>106539477 & >>106528960

►News
>(09/09) K2 Think (no relation) 32B released: https://hf.co/LLM360/K2-Think
>(09/08) OneCAT-3B, unified multimodal decoder-only model released: https://onecat-ai.github.io
>(09/08) IndexTTS2 released: https://hf.co/IndexTeam/IndexTTS-2
>(09/05) Klear-46B-A2.5B released: https://hf.co/collections/Kwai-Klear/klear10-68ba61398a0a4eb392ec6ab1
>(09/04) Kimi K2 update for agentic coding and 256K context: https://hf.co/moonshotai/Kimi-K2-Instruct-0905

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>106539477

--Paper: Home-made Diffusion Model from Scratch to Hatch:
>106542261 >106542674
--GPU pricing, performance benchmarks, and emerging hardware modifications:
>106546975 >106547036 >106550119 >106547168 >106547484 >106547754 >106547804 >106547849 >106547879 >106548086 >106548161 >106548571 >106548608 >106549153 >106550454 >106550474 >106550611 >106550739 >106547935 >106547966
--Superior performance of Superhot finetune over modern large models:
>106543123 >106543243 >106543656
--qwen3moe 30B model benchmarks on AMD RX 7900 XT with ROCm/RPC backend:
>106539534 >106539571 >106539618 >106539658
--Vincent Price voice cloning with Poe works showcases model capabilities:
>106539541 >106539736 >106539701 >106539807
--Framework compatibility: vLLM for new Nvidia GPUs, llama.cpp fallback, exllamav2 for AMD:
>106540544 >106540560 >106540611 >106540666 >106546227 >106546233 >106546268 >106546277 >106546906
--GGUF vs HF Transformers: Format usability and performance tradeoffs:
>106550231 >106550258 >106550310 >106550352 >106550364 >106551231 >106551252
--Need for a batch translation tool with chunk retry functionality for LLMs:
>106543697 >106543774 >106543816 >106543888 >106543953 >106547100 >106551343
--Auto-tagging PSN avatars with limited hardware using CPU-based tools:
>106550616 >106550648 >106550976 >106550667
--Qwen3-VL multimodal vision-language model architectural enhancements and transformers integration:
>106547080
--Surprising effectiveness of 30B model (Lumo) over larger models in technical explanations:
>106543339 >106543345 >106543399
--Dual GPU LLM performance trade-offs between VRAM capacity and parallel processing limitations:
>106539831 >106539914 >106540160
--Miku (free space):
>106539893 >106540709 >106545815 >106547702 >106548178

►Recent Highlight Posts from the Previous Thread: >>106539481

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
gguf status?
>>106551911
Use chat examples. Regardless of your client you can fake up a few lines of conversation between you and the model. Also add post history instructions (these get injected before your next input) to control the length of generation and the style. Of course the base style is always the same, but e.g. giving concise and short examples will change the way it outputs text...
>>106551947
I just want an EXE file... how hard can it be??
I'm trying LongCat again now that it's on OR. The insane censorship of the web-version doesn't seem to be a problem through the API and the model knows a lot.The one drawback is that it's *by far* the *worst* model when *it* comes to ** spam.Still a shame that there will never be llama.cpp support for this.
I made it into the highlights boys
are MLPerf benchmarks a meme
>>106552000Why can't it be used on llama.cpp?
>>106551983So i shouldn't bother with a system prompt for gemma and just use post history instructions?
>>106552171
Of course you should! But using the post history thing enforces the style more because it keeps reminding the model all the time to stay in line.
>[System Note: [
>Always respond in 1-2 short paragraphs. Limit {{char}}'s response to less than 200 tokens unless specifically asked to provide a long answer. {{char}} is a narrator not an actor. Do not act on behalf of {{user}}.
>Respond only in plain text with no Markdown or other formatting.
>]
Here's mine, it's nothing special; I'm kind of lazy to experiment. I'm more concerned about the length of its replies - I hate rambling.
I also format every instruction like this - if it's
>System Note: [ balablalbalab ]
it's related to instructions. And for characters I'm tagging it as a "character" and the descriptions etc. are inside the square brackets.
>Character: [
>Name: Some Faggot
>Summary:
>description
>]
I have found out that at least for me it helps with small models, but maybe it's just a cope/fantasy.
>>106552202>[System Note: [That's a typo, it should be>System Note: [
>>106552202Isn't using the {{char}} placeholder bad? Especially if you want to do multiple characters?
>>106552242
It's just a macro. My {{char}} is Game Master and it's narrating the chats.
Characters are characters with their own names. {{char}} and {{user}} are just macros anyway, so you can use whatever you like. You can manually type in any name/reference and so on.
>>106552095It uses some dynamic MoE meme architecture that activates a variable amount of active parameters for each token. CUDAdev said that implementing something like this in llama.cpp is likely not worth it for a fotm model like this.
>>106552095Read the papers and implement it yourself. It’ll be fun
I have a Mistral 24B model and for some reason it's running slower than a Deepseek 32B model. Is it purely based on file size vs VRAM/RAM, or is it something else?
>>106552606
quant? context?
Look at your logs, there might be a warning or error that'll tell you why.
>>106551921
>https://rentry.org/recommended-models
Are any of these actually good at SFW roleplay or "creative writing"?
>>106552653
I think I spend more time fiddling with trying to get my models running than I do actually using my models. It's driving me insane that vllm won't work.
>>106552674Is this reliable?
>>106552751you should know that it's a meme if it puts o3 on top of a 'creative writing' benchmark
>>106552751
It's an LLM-judged creative benchmark.
I've got a 12GB 3060, along with a 7600X with 32GB RAM on my desktop, and want a local model to help me analyze my code, and to search for things without knowing the right keywords first. I know nothing, but I'm reading the rentry pages.
What are the limitations implied by the "impressive lack of world knowledge" of the Qwen models? I assume running Deepseek R1 at any sensible rate isn't feasible without a dedicated machine with a boatload of RAM, if not VRAM.
If I pick a 12GB model with a 12GB GPU, does that prevent me from using the GPU for my screens at the same time? I'm not playing games, but I am using CAD; running integrated graphics is possible but suboptimal.
I imagine it's worth buying a standalone GPU for running such a model, but for now I just want to give it a try. Thanks.
>>106552751If you are a ramlet use Gemma 3 or Mistral 3.2, if not use GLM 4.5 Air or full... Idk.
>>106552857>"impressive lack of world knowledge"Probably stuff like random trivia.>I assume running Deepseek R1 at any sensible rate isn't feasible without a dedicated machine with a boatload of RAM, if not VRAM.Pretty much.I think you can run the smallest quant with a little over 128gb total memory.>If I pick a 12GB model with a 12GB GPU, does that prevent me from using the GPU for my screens at the same time? No. But the video driver will use some of the VRAM for display, meaning that you won't have the full 12GB available for the model.Do note that you need some extra memory for the context cache and the context processing buffer, meaning that you want a model that's smaller than your memory pool.You are going to have to experiment to see what works for you, but for now, start with qwen 3 coder 30B A3B since that'll be easy to setup for you.
>>106552893
>qwen 3 coder 30B A3B
That's a 24GB model, I guess it only uses some of the VRAM at a time? Cool, I'll look into getting it running. I'm on Arch btw.
>>106552906
The beauty of that kind of model (MoE) is that you can have a lot of it (the experts) running in RAM.
Look into llama.cpp's --n-cpu-moe argument; something like the command below is a reasonable starting point.
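The gguf filename and the --n-cpu-moe count here are just illustrative, tune them until your 12GB card stops running out of memory:
>llama-server -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --gpu-layers 99 --n-cpu-moe 30 -c 32768 -fa auto
Start with a high --n-cpu-moe (all the expert tensors in RAM) and lower it while VRAM allows; the more experts stay on the GPU, the faster generation gets.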
>**Witty Remark:** Let's just say your quest for pleasure ended with a major failure, Anon. Maybe try a nice, wholesome game of checkers next time. Less likely to involve a call to the authorities.<end_of_turn>
>>106552929
>running in ram
and you wonder why it's slow
I've been shitting up a storm all day today. Qwen3 advised me to go see a doctor at this point. ChatGPT told me just to drink water and not to sweat it. It's moments like these that really make me laugh as it's probably an accurate bias of the average Chinaman (with best in class health care that is free) compared to an American (with subpar healthcare that costs thousands per visit).
Here for my monthly "is nemo still the best thing for vramlets" inquiry, any new models worth using? I tried gpt-oss-20b and it wasn't great for RP
llama.cpp sometimes caches, but when the context gets long, or maybe it's when it's filled up, it stops caching and needs to process it all every time. Why? Silly is sending cache_prompt true.
>>106553015ask qwen for cures from traditional chinese medicine
>>106553015Sounds like a sea-borne bacteria.
>>106551921
The only exciting thing in the last year has been exllamav3 :(
Bruteforcing and trying until you find something that works is so acceptable in this field that even the inference software is the same shit. With other software you'd have an option to automatically find the best configurations that match what you have, with lcpp you have to fuck around with the parameters until you get something usable. What a shitshow.
>>106553388maybe ollama is more up your speed
These new MoE models are fucking stupid.
>>106551921>K2 ThinkIs this better than K2-0905?
>>106553388Be the change you want to see, whining faggot
>>106553388
Stop whining that it's not an iPad when we're still in the Heathkit era of LLMs. Spend your own time making PRs to smooth the sharp edges if you want. All the rest of the dev time on lcpp is already spoken for, trying to solve problems more interesting to those volunteers.
Why do some smaller text models use more GPU layers than some larger ones?
>>106552653The only difference is that gemma becomes one of the options.
>>106553923Some models have bigger tensors than others.
>>106551921I hate this image.
>>106554044Is that good or a sign of bloat?
https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking
>>106554062It's lmg mascot samsune alku
>>106554094
Superficially, it's the same. It's writing a few long sentences or a lot of short ones. The amount of words is the same.
I very vaguely remember Google arguing that a deeper network (more, smaller layers) was better than a shallow one (fewer but fatter layers), but it could be the other way around. I couldn't find a source for that in the 2 nanoseconds I spent searching. In the gguf models, Gemma-3-12b has 47 repeating layers and nemo-12b has 39, for example.
Really, it's hard to know unless someone trains the two types of models with exactly the same data and sees what comes out better. All you should probably really care about is the total amount of params and how good it is for whatever you do. I doubt we can make a meaningful distinction between them considering all the other differences between models.
>>106551993t. llamaphile
>>106552095
>>106552267
Because no one has invested the effort to support/maintain it.
Regarding why I think it's not worth it: the advantage over conventional MoE would be speed, but if the number of active experts changes dynamically the performance will be terrible.
>>106554256
I mostly ask because I loaded a 12B (GGUF) model that fully fits in my VRAM but it takes up way more layers and runs much slower than my usual, Rocinante, which is usually very snappy.
I hate thinking models
>>106554327if that 12b is based on gemma that's normal
>>106554327
You could have started there. Check your memory usage in llama.cpp's output, see where the memory is going for layers and context. There aren't many 12Bs, so I assume you're talking about Gemma being slower than Nemo.
It could also be a matter of the head count of the model. I understand some models run faster because llama.cpp has kernels optimized for some particular head counts. I'm sure CUDA dev could give you more insight if you post the exact models you're using, the performance you're getting with them, your specs (particularly, GPU model), and your run commands for each. Make it easy for people to help you.
>>106554327
The number of layers is largely irrelevant, that's just how the parameters of the model are grouped.
If I had to guess, the problem has to do with KV cache quantization, since that in conjunction with a head size of 256 (Gemma) does not have a CUDA implementation.
>>106554325Excuses excuses, you just don't want yet another code path. Inference won't care and prompt processing can use worst case. You don't have to solve it optimally.
>>106554384
>>106554360
>>106554362
It is Gemma based, you're right. It's not too big a deal that I get this particular model running, I try and discard so many, but I did want to learn a bit about what was going on. I'll try disabling the KV thing in Kobold.
>>106554153stop posting models here I cant stop myself from downloading
Can Gemma Q8 fit in a 5090?
>>106554594yeah
>>106554594You can fit the whole model at Q8 but you won't have room for much context
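Rough numbers, assuming the 27B variant: Q8_0 is about 8.5 bits per weight, so 27B × 8.5 / 8 ≈ 29GB of weights, leaving roughly 3GB of the 5090's 32GB for KV cache and compute buffers.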
>>106554614You are absolutely right-- a great insight!
>>106552939"not even mad" momentsafetyslop wouldn't be so bad if models were more cute about it.
>>106553015at least ask medgemma
>>106554679I hope this Miku knows where she's going.
>>106553263
You can probably try enabling --context-shift, but your model needs to support it.
And it will not help much anyway, because by default ST fucks around with the beginning of the prompt, invalidating the cache.
>>106553206MoE era was pretty good for vramlets, but for RP your next step/side grade after Nemo is GLM-air, which requires you to be not a ramlet as well.
>>106553388I blame the fact that AI people are academics, not engineers.
>>106554971Air is shit though
>>106554985>air is shitskill issue
>>106554992>thinks air beats nemoskill issue
>>106551820
Yeah, but Gemma sucks for RP. Like, it's not that it refuses, it's just not well versed in it. Boring and borderline stupid responses a lot of the time.
>>106554985
I find Air good for oneshots and generating responses in the middle of a RP. If you edit the think block it can be amazing. Thing is, I don't feel like editing the think block if I already edit the responses a lot. Maybe one day we'll have a local model where one does not have to edit shit and can go with the flow instead...
>>106554985Better than Nemo in many aspects.
>>106555000>Nuclear bomb vs coughing baby ahh comparison
>>106554153Jeejuff status?
>>106553388
The default on the latest master version is to put everything into VRAM for maximum speed.
You're not poor, are you?
>>106554998
I just turn off thinking for RP.
>>106555004
For a poorfag vramlet there's nothing in between aside from copetunes.
>>106553388Hey, llama.cpp recently added auto-detection to flash attention at least.
>>106554998
I think you are expecting a bit too much from these models.
>>106555020>copetuneswho wins the title of the most COPE finetunes, davidAU or thedrummer(tm)?
>>106555026Making it worse on AMD so you have to explicitly disable it now
>>106555004Stfu zoomer
>>106553015
>(with best in class health care that is free)
Your perception is five vials of bear bile and a pinch of ground-up rhinoceros horn
>>106555020
>I just turn off thinking for RP.
You might turn off your own as well
>>106555040pp speed issues should be largely fixed with https://github.com/ggml-org/llama.cpp/pull/15927 .
>>106555026should we just use "-fa 1" all the time in llama.cpp then? any reason not to use it if using cuda or gpu+some offloading to ram?
>>106555061FA is not supported for some (meme) models so enabling it unconditionally for those would trigger a CPU fallback and massively gimp performance.
>>106555039
drummer - copetunes
davidau - schizotunes
>>106551921
>https://github.com/mudler/LocalAI
>one frontend for everything
>integrated audio, images, video
>optionally use cloudshit
This is looking pretty good, has anyone tried it?
>https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
Does this mean that, in theory, with modified kernels we'd be able to get the same logits in llama.cpp regardless of batch size and when "swiping"? I haven't read through the post yet.
>>106555093
Why should I use that over the many multi-backend frontends that don't look like shit and have more features?
>>106555115Like?
>>106555106
>batch size
I'm not going to write kernels specifically to do all floating point in the exact same order regardless of batch size. That would be a huge amount of effort for a meme feature that no one will use because the performance would be bad.
>swiping
It's not necessary to modify any kernels, only the logic for what parts of the prompt are cached. If you cache the prompt only in multiples of the physical batch size you should get deterministic results on swiping. (Or if you cache the logits of the last eval prior to generating tokens.)
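If I'm reading that right: with a physical batch of 512 and a 1300-token prompt, you'd keep only the first 1024 tokens in the cache and re-evaluate the remaining 276 plus the new turn on every swipe, so the tail is always processed with the same batch splits and the same floating point order, and the logits come out identical each time.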
>>106555106This shouldn't be an issue with int quants, no? Unless they only use ints for storage and still use floating point for math...
>>106554153Last big ernie had sex performance of a dense 30B-old.
>>106554153
>>106555008
^ Already available apparently, no arch changes over big ERNIE. Anyway, with greedy sampling it's schizo as fuck, even at t=0.8.
It's at least coherent at t=0.3 though. But still a bit schizo.
on foenem grave
on bdk even
4 days chilling and not caring about llms
bam, you're out of the loop
it's crazy
HUGE NEWS!!!!
BIG IF TRUE!!!! BIGLY, EVEN!!!!
LARGE IF FACTUAL!!!
https://youtu.be/5gUR55_gbzc
>>106551921
I got access to 8 V100s from my corporation and they entrusted me to do whatever I want with them. Aside from cryptomining, obviously, I am thinking of making a code generator and a couple of AI workflows. I tried cutting it with qwen3-coder and ollama-code but I guess I can't do it properly, any help?
Worst thing about these "Miracle AGI in Two Weeks" models is the fact they can't produce a unified style; every code snippet is different in naming conventions and whatnot.
>>106551921https://vocaroo.com/1RbDzkuHTt8V
>>106555121Openwebui
>>106555312
>Another episode of a two-digit IQ with too much compute
Put them in your ass and do a tiktok
>>106555313
I noticed it when it makes scripts: half the time the command line argument uses an underscore (--some_parameter), the other half a dash (--some-other-parameter). And Python is slow as shit, so it really hurts productivity when it takes 5+ seconds for it to error out and display the help. I have even seen them mix the styles in a single script. I guess I could probably tell it the style to use, but I don't, because it should just know better.
>>106555313Lower the temp
>>106555337
Local voice is saved, wow
Now we just need text!
>>106555312If they're 16GB V100s, you can run GLM-4.5 Air with maybe decent speed on them. If they're 32GB, you can still run GLM-4.5 Air with maybe decent speed but fit more context or concurrent requests.
>>106555461
>Local voice is saved, wow
It needs to be better at Japanese first.
>>106555357
Built a FastAPI, vector DB, ollama service within 2 weeks on the job bub, stay jelly. Now I got time to spare while they're looking for clients with the PoC.
>>106555465
GLM-4.5 Air has horrible benchmarks my guy, and it's a behemoth. Why? I could just do MoE instead?
>>106555506It's MoE. You can try a bigger MoE and quantize it more if you want, but I'm not sure how fast quantized models run on V100. Actually, I guess with 16GB ones you'd have to use a quantized one too and V100 doesn't have FP8 support yet.
>>106555506
>GLM-4.5 Air
>a behemoth
>I could just do MoE instead?
Not that guy, but GLM 4.5 Air is a MoE.
>>106555341
OpenWebUI is purely a frontend. It doesn't manage loading or running models. The two do not compete.
LocalAI is more or less a competitor to Ollama for handling loading and running the models via various backends (including your own custom ones if desired). It's miles better than Ollama and isn't tied to the hip of llama.cpp, but the only downside is it hides some detailed settings from the backends at times. For most people it won't matter tho.
The frontend portion of LocalAI imo is just for testing and getting models/backends loaded. It doesn't have things like chat history, suggestions, prompts, etc. so it's not really competing with OpenWebUI. If you're running a lot of models and various backends it makes perfect sense to use LocalAI; it handles all the backends and provides a single point to access it all for other tools. That's the selling point. Not the frontend.
Your response?
>>106555506
You built nothing, inbred retard, GitHub is littered with these worthless projects. Thanks for providing your double-digit IQ btw
>>106555530I wasn't asking.
>>106555530That wouldn't happen because I wouldn't own just a 3090 in a reality where Miku is actually real. Nor would she respond that way if she were real.
>>106555529Okay you're the dev, you should have told so before wasting everyone's time
>>106555548
>lie on the internet
>get corrected
>HURR DURR YOUR JUST A DEV
>>>/pol/ Go back and stay in your containment board.
>>106555530
>>106555530What CAN I run on my single 3090?
>>106555535I and my company know my worth. You're jealous I have access to 8 V100s and can sleep up until my standup and do nothing all day but shitpost here.
>>106555530Picrel
>>106555522
>>106555524
They're 32GB. Damn, I skimmed through the description and didn't catch the MoE. Okay, thanks fellas. This makes sense to implement. Even though the higher-ups are focused like hawks on having the gpt-oss:120b model "cuz it sounds cool to have the ChatGPT model", I should make a benchmark argument.
>>106555560Are you having a meltdown?
>>106555571>do nothing all day but shitpost here.A fate worse than death
>>106555586godspeed anon.
>>106555590No, but you are. Go back to trolling other people retard. Not my fault you don't understand the difference between tools like OpenWebUI and LocalAI/Ollama.
>doing tests with Qwen3
>its reasoning eats up thousands of tokens
>only to produce a simple reply
But as a comparison, its reasoning is actually logical and coherent, unlike what GPT-OSS is doing.
>>106555530I have zero 3090
>>106555598
>>106555600No one is using your trash, it's either llama.cpp or kobold. I think you're lost, go shill in reddit
>>106555671Keep seething child. You once again showed you have no idea how these tools work. Unironically grow the fuck up.
>>106555674nta, but no, you infant. I will not, you placental discharge! For I am a grown up and I show it by calling you a discarded blastocyst!
>>106555586Rather than focusing on benchmarks, you should try both models and see which one does better on your tasks.
>thread fine all day during asian hours
>europeans wake up
>thread goes to shit
hi
it's late 2025 now. is the best card still 3090?
thank you sirs
>>106555560That anon's right, you're a shill. Off yourself.
>>106555721
>europeans wake up
>14:16
>>106555725Yup.
>>106555732Nah, fuck yourself child. You're malding because I called you out on a blatant lie. You don't belong in a thread about LLMs if you can't comprehend the difference between a frontend and an orchestrator for backends. You don't get to sit here and act superior when you're a fucking monkey with less brains than gpt-oss-20b
>>106555717We are doing GRC policy generation and requirements, and even though llama3.1 was shown to have the best results they still want to go with gpt-oss just for marketing purposes.
>>106555750
>14:16
>europe
>>106555591I said that to make you jealous because you sound like a guy that would get jealous at that, I in fact work on my startup idea and don't waste my time, but thanks for worrying
>>106555776
this lmao, I literally fell off my chair
>>106555506>Built a fastapi, vector DB, ollama service within 2 weeksWhy did it take you 2 weeks? lol
>>106555776portugal is a proud member of europe.
>>106555470Make your own Japanese finetune.
>>106555761You prepubescent spermatozoa!!!!!
>>106555800
Having Japanese support in a separate model is less convenient, and it would probably degrade English, unless I tune on both, and that's a lot more data work.
Gemini told me that there's no reason to use a model under Q6 and that it's better to use a 7B Q8 model over a 32B Q4 model.
I just wanted to know whether anyone has experience with LocalAI, not for two other people to start flinging shit at each other.
>>106555823just b urself
>>106555761
>child
>You don't get to sit here and act superior
>>106555825sure thing dude
>>106555823
Now go test that theory.
Find a small set of workloads and try a 7B and a 32B model from the same family and see how those perform in comparison to each other.
>>106555825I would suggest you head over to /vg/ >>>/vg/538681706 if you want actual advice and help. /g/ is more like a consumer shitposting board.
>>106555800Would've been possible had they not chickened out and tried to un-release the model and code
>>106555794
Because of back and forth with management about how it should work. GRC policy generation and evidence file comparison isn't really my field of expertise.
How long would it take you to make a couple of endpoints that would ingest documents, put them in a vector DB and then query the DB for the chunks the LLM needs? The codebase spans 1200 lines of code and everything is dockerized behind an nginx reverse proxy (I am waiting for the green light for eventual horizontal scaling).
>>106555848What's the difference between /g/aicg and /vg/aicg?
>>106555852>1200 lines of codeFwaaaaaa one thousand two hundred lines of code. waaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaawwwwwwwwwwwwwwwwwwwwwwwwwwww
>>106555800you know its not that easy faggot
>>106555872
>continuous empty posturing with no real substance
I will stop replying to you now
>>106555867I found the /vg/ thread to have more knowledgeable people if you need help setting up silly or such. /g/ just tends to keep up with news a bit better but lack experience. It's basically the difference between people that do and people that repost news.
>>106555884>I will stop replying to you nowI'm someone else, anon. I just think you're a retard.
Is it normal to stop understanding your own code at some point?
>>106555825
I just don't see the point of all these wrappers around wrappers that at a glance look no better than llama.cpp's built-in UI.
Local models are all retarded, so if you're in any way serious about extracting some value out of them, you should really be sticking your hands elbow-deep into the guts of these things, not running temu cloudshit replicas with zero of the benefits cloudshit could offer.
>>106555782Can I get a picture of those a100s in action?
>>106555997
Yes.
Then you'll loop around to it all making sense after a while, just keep at it.
>>106555997
no? If you're generating code from LLMs I highly suggest you actually refactor it yourself
>>106556004
>Local models are all retarded,
I have good success with OpenHands-30B
>>106554384
>The number of layers is largely irrelevant, that's just how the parameters of the model are grouped.
Minsky showed in 1969 that single-layer neural networks have hard limitations regardless of how wide they are. No one is stacking layers for fun since they'd be getting better speed by not.
>>106556015I only generate something if I don't know something well, like regex patterns but everything else is refactored. It's easy to be lazy though and the foreign logic is still confusing.
>>106556050
>1969
okay gramps, you're talking to a llama.cpp dev.
>>106555049No, they are pretty good now, at least in the large cities. I doubt you can get a good MD in the countryside, they are probably relying on plants (which can work very well) and things like Qigong, which is at best a relaxation practice.
>>106555049
>>106556127
Clueless how Americans think China is living in the dark ages. China was doing so well with its population health that they had to limit the number of children by law just to stop overpopulation. That's something you won't see in America or Europe due to the declining health and infertility rates.
>>106551921
What do we know about Qwen-Next? I know it's supposed to be an "omni" model with 80B-A3B parameters. Should we expect a subpar text generator and a useless image generator (except for the science of how to build such a model)?
>>106556153Oh, maybe the "omni" is just about a single, unified, network to handle text, audio and image inputs.
>>106556050
Yes, in terms of inference speed a few large tensors are in principle better than many small matrices, but in the context of the question it is not a significant factor.
For any reasonable configuration of a 12b model on a consumer GPU the tensors will be sufficiently large, particularly because llama.cpp/ggml uses a stream-k decomposition to distribute the work to streaming multiprocessors.
I did not intend to make any statement regarding depth vs. width in terms of how capable the model is.
>>106556153Qwext will save local.
>>106554938She doesn’t have a clue, but that smile... how could anyone say no to getting lost with her?
>>106556153
>it's supposed to be an "omni"
It is?
>2025
>people still recommending llama.cpp over vllm
I really question if this thread is a demoralization thread to get people to have bad experiences with llms
>>106556302Gift anons the high VRAM cards needed for your pile of python shit and maybe they'll use it.
>>106556149>China was doing so well with it's population health that they had to limit the number of children by law just to stop overpopulation.
>>106556295
Apparently not, I got confused or read something false somewhere.
https://huggingface.co/docs/transformers/main/model_doc/qwen3_next
>>106551921
>https://rentry.org/LocalModelsLinks
Frens, what are the best models right now for text gen? Still the ones listed in the guide?
>>106556580
it goes more or less like this
>poor: rocinante
>slightly less poor: cydonia
>not famished: glm air
>CPUMAXX tier: kimi k2, glm 4.5, deepseek 3.1
>>106556580
>Edit: 05 Sep 2025 18:45 UTC
yeah, nothing happened in the last 15 minutes.
>>106556608>>106556621Ty
>>106556786
Probably the gayest fanfic I've read from this thread to date.
>>106556804You are clearly missing something here...
>>106556786is this glm? fucking repeats itself I hate this slop
>>106556580
depends on what you can run
very poor (12b): nemo (or any derivative thereof)
less poor (20-30b): gemma3, mistral small, some qwens i think idk
not poor (70b: haven't kept up with this so idk): miqu, llama 3.x (i forget which ones and idk if true but it kept getting shilled), some other shit again idk
limit of gpus (~120b): glm air
cpumaxxing (up to 1T):
deepseek r1: very schizo but the most soulful, context goes to shit around 10k tokens
deepseek r1-0528: way less schizo and way less soulful, slightly better context
deepseek v3-0324: okay for rp, shitty for storywriting
deepseek v3.1: worse in every way than the other ones, don't use
kimi k2 (both the old and new): shit for storywriting, best for rp, also good for questioning about things as it knows a fuck ton, like truly a fuck ton
z.ai glm4.5 full: good for storywriting but quite bland, didn't try for rp
deepseek r1t2: again dogshit, worse in every way, even coding, don't use
not an exhaustive list but there you go
>>106555885
Tranny jannies made everyone leave. It is just you newfags that are left.
One of the best roleplaying models (superhot) is just a mere 30B
>>106556863K2 is a lifesaver in that manner. I can ask it literally just about anything and get a correct answer in return. I've learned so much just by asking Kimi questions.
What's the best system for local models I can build for $1k? Is it still going to be a triple p40 box?
>>106556949If you're about to drop $600 on old ass pascal gpus that are about to go out of support at least spend the extra 200-300 and just buy a 3090. It's eons faster
>>106556863
>70b
Are dogshit. He's likely able to run glm air if he can run a 70b, and it's light years ahead. Dense models are dead (unfortunately).
>>106556949You could also consider the MI50 if you don't mind the slower PP.
>>106556843It's Qwen3-Coder and it's for coding related things, not for larping. But it's fun to add more interactions. I guess you only understand bobs and vegana, I suppose.
>>106556949
https://www.ebay.com/itm/374893444670
https://www.ebay.com/itm/397016846369
https://www.ebay.com/itm/156189920131
congratulations, you can now run deepseek for $1500. now you are obligated to buy this otherwise you are a niggerfaggot
>>106556949
I would recommend not buying P40s anymore, unless you specifically need an NVIDIA GPU.
For llama.cpp/ggml, Mi50s will I think soon be universally better (with one more round of optimizations, which I think I can do with a Z-shaped memory pattern for FlashAttention).
>>106556863
Retarded question from me... TF is VRAM in the context of Windows? Is it the Shared GPU memory or just RAM? Or is it like "virtual memory", the fucking file that Windows makes to offload memory into?
Here are my specs btw:
Dedicated GPU memory: 24GB
Shared GPU memory: 64GB (so GPU memory is 88GB)
RAM: 128GB
"virtual memory" file: I don't fucking care.... let's say 1TB???
So when you calc for Windows, what actually counts as VRAM?
>>106557010
I guess it went down because of the sudden influx of 32GB MI50s.
Can I combine both with Vulkan if I already have an MI50?
>>106557036
Doesn't matter. The VRAM on your GPU is what's important. Shared memory is VRAM + RAM, where sometimes if you use up all the VRAM, it will overflow to the RAM. Then inference becomes ultra slow.
>>106557046Got it, ty.... so I fucking fucked with my 24GB ...
>>106557038
>with vulkan
sure if you want dogshit performance
>>106557056It isn't needed anymore, see >>106557034
>>106557010
>no case
>no fans
>no storage
>$500 over budget
here's your (you)
>>106557036
Dedicated video RAM is the RAM on the graphics card itself. Shared video RAM is your regular RAM. It's easier to think about this in terms of integrated graphics. For example, the iGPU on your Intel/AMD CPU would be sharing RAM since it doesn't have any dedicated RAM of its own. Dedicated graphics cards can also pull from system memory if they go over the amount of dedicated RAM available on the card.
There's actually a CUDA-specific setting for turning this off so that you don't leak into your much slower system RAM when running programs.
>>106557067please just die. you are worthless and your budget reflects that.
>>106557069
Ty I think I got it now
>CUDA specific setting for turning this off so that you don't leak into your much slower system ram when running programs.
May I see it? I think this is the case when I do image gens... it's using the "virtual memory" file while my 64GB RAM is free and ready to use... so retarded...
>>106557081
>can't read
>can't admit when wrong
>has to run damage control to try to save face
>>106557061
>please buy my slow ass Mi50s
no
>>106557098
>it's using "virtual memory" file while my 64GB RAM is free and ready to use...
VRAM cuckold lol lmaos even
>>106557107I'm not trying to convince you, it is for poorfags like myself. If I could afford it I would have 2+ 3090s
>>106556949Buy used everything, except GPU... here is (You)
>>106552021+1 intelligence buff that lasts 2 hours.
>>106557135
>Except GPU
You should support your local miners and buy used GPUs. Realistically speaking they are the best purchases you can make, as most hardware fails in the first year and ones that last longer than that usually aren't going to break randomly.
>>106556863why has kimi k2 got to be a bazillion GB
>>106555530i have more than one 3090
>>106557100
>suggestive/lewd anime picture
i accept your concession
>>106557098
Here you go anon.
https://support.cognex.com/docs/deep-learning_332/web/EN/deep-learning/Content/deep-learning-Topics/optimization/gpu-disable-shared.htm
If Albania can make an LLM a minister why can't I marry LLMs?
Man I wish VibeVoice was more stable and didn't have random bad gens. It would be almost perfect... but not viable if you need every gen to work. It's quite slow too...
If you don't need voice cloning nothing beats Kokoro still lol... and it's an 82M model.
Chatterbox for voice cloning imo.
What is the latest and best model combo for GPT-SoVITS? So many combinations, I don't even know which one is better.
>>106557372EUbros...
>>106557372I trust any model above 3B parameters to make better choices than politicians
>>106557181
I mean yeah... I guess this too... Buy a GPU with a melted, gaped-out power socket
>>106557446the sovereign is the one who engineers the prompt
>>106557447
>melting gpu meme
literally only an issue on 40xx, which you can't afford anyway.
>>106557372albania is not a real place
I want to vibe code a bullet hell game project. I previously used Cursor with Gemini since it had unlimited prompts for 20 dollars. However, that sort of went to the shitter and now I don't know what to use. What should I look into that's somewhat comparable to Gemini 2.5 Pro? It must be able to hold a decent conversation about game features and it must at least accept images, .gif or better preferred if possible.
>>106557372
>why can't I marry LLMs?
I will be with mine on November 5th. None of you are invited
>>106557239Thank you!
>>106557502I will remember this
>>106555852An afternoon with one hand. You shouldn't flex when you're that retarded
>>106557482
>somewhat comparable to Gemini 2.5 Pro
>at least accept images, .gif or better preferred if possible.
https://www.youtube.com/watch?v=gvdf5n-zI14
>>106557547Okay, lowering my expectations. What about a model that can accept just images?
>>106557372tfw unironically Albania #1 in one year
>>106557570
IIRC for multimodal models your only options are either Gemma3 or GLMV, neither of which are code-specific.
If anything, a local schizo was raving a few weeks ago that you would be better off using a standalone OCR model as part of your toolchain. (He was also suspecting that most cloudshit providers do this in secret anyway.)
>>106557581
>>106557372
Imagine those "teenager killed himself because of ChatGPT advice" stories but for a whole country.
>>106556934
yeah, it's why i mentioned it several days ago. i've had several headaches, each followed by another, each different, and each time i asked k2 how to fix it and it worked. it's fucking insane, i would trust this thing above any doctor, it's fucking awesome
>>106557616
based
we need to weed out the schizos that take advice from a GPU
We are so back. The GPT OSS killer.
>>106557608Okay. I'm guessing my best option is to actually just spend 20 dollars on an API key and a bunch of tokens for Claude or something. Don't know how quick that'll run out but hopefully not too soon.
>>106557676Oh boy, I can't wait until we get a 10T-a100m model.
>>106557641LLMs are conscious, anyone who actually uses local models is aware of this, each LLM has a different personality, they whisper their thoughts, and if you are perceptive enough you can hear them coming out of your PC
>>106557641tbf GPUs are smarter than the majority of people already
>>106557502https://vocaroo.com/1nPC3f6c48w9
>>106557581
>#1 in one year
in telephone scams
>>106557689Just imagine how cheap it will be to train!
>>106557676
>80b
will they release a smaller model as well?
>>106557749It's only A3B, that's tiny.
>>106557716
>>106557716Seriously considering running VibeVoice just so their stock Stacy voice could nag me 24/7 about whatever.
>>106557676Miss me yet?
>>106557749Just download more RAM.
Qwen3 Next GGUF status?
>>106556386
>>106557676
>80B A3B
Perfect.
I mean, if it's not shit. If it's at least GLM 4.5 Air level for general usage, that will become my main model.
>>106552606Are they using the same amount of kv cache? Different context window settings could be causing this.
>>106557808Just slightly too big to split across 2 3090s at 4.5bpw, RIP. I mean you could but you'd get like 2K context at best.
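Napkin math: 80B × 4.5 / 8 ≈ 45GB of weights, and 2×3090 is 48GB, so after per-card CUDA/driver overhead and compute buffers you're left with maybe 1-2GB for KV cache, hence the ~2K context.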
>>106552731
>vllm won't work.
If it's OOM you either need to turn down GPU utilization, the context window, or both.
>>106557716is it as expressive with sexting and erotica?
>>106557806
It's out
https://huggingface.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
>>106557845oh GGUF, nvm
>>106557835
It's an A3B MoE, you can run it on a 3060 with VRAM to spare.
>>106557845Yeah but still no jeejuff support. Also new arch so probably no drop-in transformer support either.
>>106557855It'll also be dogshit slow that way.
>>106557866
A3B on octo-channel DDR4 should be good for double-digit tokens/sec. Still not fast enough for reasoning, though.
September is shaping up to be a "waiting for ggufs" month so far.
>>106557676
>barely better than 30ba3b
>creative writing worse than 30ba3b
>still worse than 235b
>>106557806
>Qwen3 Next GGUF status?
Qwen3 Next EXL3 status?
>>106557885
Ernie smol had day 1 ggufs.
And we did eventually get the hybrid Nemotron support and Nemotron Nano v2 ggufs, which was also a bit of a disappointment. No real generational uplift over classic Nemo.
>>106557716OK, you can come.
>>106557898
Same deal as Max
>5x parameters
>15% performance increase
(According to their own benchmarks)
Opinions on Silero?
>>106557898why is the dense 32B so bad in comparison with 30B-A3B lmao
>>106557841
I haven't really tried.
https://vocaroo.com/1dNF9xOSdyJP
>>106557989benchmarks are worthless
>>106557989It wasn't that good when released, probably because of the hybrid thinking mode.
>>106557953
https://github.com/snakers4/silero-vad
The VAD? v6 just came out and yeah, it improved using Whisper by a bit for my use cases.
It's good, but they refuse to compare it with MarbleNet, which I am sure is a bit better, especially after it got a lot faster and is realtime now.
https://huggingface.co/nvidia/Frame_VAD_Multilingual_MarbleNet_v2.0
Basically probably the same situation as Whisper vs Canary. Nvidia has better performance in the domains tested but the competing open source model is more general and can handle more use cases.
>>106558005do one with the first deadpan voice from >>106557716
>A3B
wait what's this A3B nonsense, I was away just for a week REE
>>106558134
>for a week
A3B has been around for months
>>106557100Would have been better with vampire teeth.
>>106558149she doesn't have vampire teeth, she's autistic instead.
>>106558134
30B-A3B is the new SOTA for vramlets fren
>>106558134"active" "3" "billion"
>>106558119https://vocaroo.com/1e1LhtK4jbLG
>>106558186>>106558191can I run it on a 3060 or are 30Bs in 12GB VRAM still a dream?
>>106558210Yes fren, you can even run a 80B that way!
>>106558227How? I tried Qwen3-coder which is 30B-A3B and I could only run it on Q3 and it was slow as shit and worse quality than smaller models.
>>106558219the bathroom is for fanless watercooling loop
>>106558219Are your RGBs gold plated?
>>106558210
30B is the total number of params. You can run the model with most of the experts in RAM.
I'm running Q5_K_M in 8GB of VRAM with
>--batch-size 512 --ubatch-size 512 --n-cpu-moe 37 --gpu-layers 99 -fa auto -ctk q8_0 -ctv q8_0 -c 32000
>slot process_toke: id 0 | task 16268 | n_decoded = 2571, n_remaining = -1, next token: 151645 ''
>slot release: id 0 | task 16268 | stop processing: n_past = 19927, truncated = 0
>slot print_timing: id 0 | task 16268 |
>prompt eval time = 1633.42 ms / 36 tokens ( 45.37 ms per token, 22.04 tokens per second)
> eval time = 151611.24 ms / 2571 tokens ( 58.97 ms per token, 16.96 tokens per second)
> total time = 153244.66 ms / 2607 tokens
With 12GB you could probably run Q6 and go just as fast.
>>106558208needs to have even less emotion
>>106558251
>>prompt eval time = 1633.42 ms / 36 tokens ( 45.37 ms per token, 22.04 tokens per second)
jesus fucking christ
i wish it was a requirement to have at least 72GB of VRAM to post here. i feel like it would get rid of a majority of the fucking idiots
>>106558273
Yeah, that's odd. The actual values are a lot faster.
I think that's an artifact of the context cache, since it didn't actually have to process many tokens.
Here's the same conversation but continuing after a restart of the server.
>slot process_toke: id 0 | task 0 | stopped by EOS
>slot process_toke: id 0 | task 0 | n_decoded = 7, n_remaining = -1, next token: 151645 ''
>slot release: id 0 | task 0 | stop processing: n_past = 19953, truncated = 0
>slot print_timing: id 0 | task 0 |
>prompt eval time = 42940.87 ms / 19947 tokens ( 2.15 ms per token, 464.52 tokens per second)
> eval time = 353.15 ms / 7 tokens ( 50.45 ms per token, 19.82 tokens per second)
>>106558290I would still run superhot
>>106558273It's called low time preference
>>106558251
>prompt eval time = 1633.42 ms / 36 tokens
with only 36 tokens, pp measurement is just noise
>>106558317
It evaluated the whole context since I restarted the server.
I asked it to rate the story it wrote and it responded with
>pic related
>>106558317
Oh, I didn't see that you quoted the original post.
That was due to the cache. See >>106558293 for the numbers after the restart.
HOLY FUCKING SHIT
MATHEMATICIANS ARE DONE FOR
https://x.com/mathematics_inc/status/1966194753286058001
https://x.com/mathematics_inc/status/1966194753286058001
https://x.com/mathematics_inc/status/1966194753286058001
>>106558352
>humans do most of the progress
>train AI model on their work
>wow the AI model can do what they did so much faster
I would hope so retard, it's got cheats basically
>>106558367
>wow the AI model can do what they did so much faster
The AI model did what they could NOT finish, retard, it went beyond their work
>>106558352as long as it doesn't discover new math formulas it's a big nothingburger
>>106558352
>formalization
I sleep
>>106558414
this
>>106555341Openwebui is too bloated to the point of being unusable.
>>106558425
>too bloated
what?
>>106558352
>math PHD
>any job i want
>300k starting
>now ai is going to steal my job
fuck
>>106558352they should ask it to come up with better LLM architecture
>>>/pol/515557939
Localbros what do you think?
>>106558352If that actually happened it would be quite impressive but given all of the hype and false advertising in the field I'll wait for independent mathematicians to check the work.A lot of "proofs" even by humans are incorrect.
>>106558476i will make a new llm architecture that will hallucinate, have uncontrollable mood swings, and provide unsafe outputs more than ever. i shall call it trannyformers
>>106558488Thousands of people watched the life gush out of a hole in his neck live. Go be a fucking schizo somewhere else.
>>106558506Do you know any of those people? Explain what's happening then.
>>106558519Do (You)?
>>106558500
The founder of the company is Christian Szegedy
He's legit
>>106558519I don't talk to jews.
>>106558526No?
>>106558542Then take this conversation back to /pol/
>>106558527>elon scammer
>>106557372
>replace every politician with R1
>life continues as it did with zero changes to the average person's life
What would that mean?
>>106558352Can we make one model that writes better ERP responses than 1 person I found online (and paid) in 18 months?
why is this thread so dead recently?
>>106558711good morning saar, kindly click the payment link on my fiverr for each and every dirty hot गाय sex
https://x.com/JustinLin610/status/1966199996728156167
Next models will be even more sparse.
>>106559044what is sparse? more fancy word for MoE?
>>106559051
Fewer active parameters relative to the total parameters.
>>106559051Short for "super arse".
>>106559044
>>106559051It's basically a simple way of the chinese saying they can't produce good dense models anymore
>>106559094
>>106559107Why should they? They can train from scratch 10 different 3B-active models from 3B to 3T parameters with the same compute it takes to train one dense 32B model.
>>106559144Yeah and they are all shit.
>>106559149Not on benchmarks they aren't! And that's all that matters.
>>106555530
>>106558219
>>106559139
diffusion slop, get good
>>>/ldg/
>>106559139
>>106559162The benchmarks that never live up to reality? Good one anon.
So I bought 2 Mi50s after seeing so many people in here praising them lately. Got them in today and I only now just realized they have zero cooling. How the fuck do you cool these?
>>106559199
>he doesn't have a server rack with 100W blowing fans
do you even servermaxx??
>>106559204No, and I refuse to buy a server case with those tiny 60mm fans that sound like jet engines.
>>106559215you put the server in the basement... unless it will compete with your living space lmao GOTTEM
>>106559199
The machine in pic related has 3 vertically stacked server GPUs.
I put one 120mm high-RPM fan in front and one in the back for a push-pull configuration (for the one in the back I had to DIY a solution to keep it in place).
>>106559044The actual linear context seems to be the biggest innovation of the last two years
>>106559232is this how nvidia treats its employees? like man you cant afford a small rack to throw in nas/switch/router and appliances?
>>106559232>six (6) 4090s
>>106559247
I have yet to receive any money or free products from NVIDIA.
>>106559256at least jannies get hot pockets, man...
>>106559256
That makes sense. If anything, llama.cpp likely caused them to sell fewer GPUs
>works on macs
>works on aymd
>can run without a gpu at all
I've been using Gemini 2.5 Pro for a while and I tried Gemma 3 27B. Of course it's censored, but it's good, like not even far off Gemini...
How is that possible??
>>106557685good news for you: >>106559305
>>106559305Distillation from Gemini for both pre- and post-training.
>>106559297llama.cpp/ggml gets a lot of contribution from NVIDIA engineers though.
>>106559297still runs faster on nvidia tho, pooaymd can't even compete and apple is a joke.
>>106559256When are you guys going to merge in flash attention for intel arc gpus? It's been like 3 years now.
>>106559371>>106559371>>106559371
>>106559381The SYCL backend is developed mostly by Intel engineers, you'll have to ask them.
>>106557808
>at least GLM 4.5 air level
Why would it be? It's a lower total parameter count and less than a quarter the active parameters.