/g/ - Technology

/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108346672 & >>108341869

►News
>(03/11) Nemotron 3 Super released: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
Today is the last day of the week for Google to release anything and it's probably going to be Gemini 3.1 Flash; nothing local.
>>
>>108353262
That Miku's breasts are far too large.
>>
>>108353250
>Karl Voss
>Dr. Elena Voss
>Zinnia Voss
>Dr. Eleanor Voss
>Seraphine Voss
hello sloppa
>>
Can I start doomposting about Deepseek now that Hunter Alpha is out in the wild and thoroughly mediocre?
>>
>>108353291
no, it can be any chinese lab's model
>>
>>108353282
Oh yeah I almost forgot about Seraphina
>>
File: 1756285275063743.jpg (68 KB, 1280x846)
►Recent Highlights from the Previous Thread: >>108346672

--NVIDIA Nemotron-3-Super-120B-A12B-BF16 release and benchmark analysis:
>108346846 >108346876 >108346885 >108347098
--Qwen3.5 397B only 15% better than 4B on benchmarks:
>108347895 >108347950 >108347919 >108347934 >108347984 >108347997 >108348009 >108348025 >108348029
--Nvidia's $26B open-weight AI investment and market dominance:
>108351880 >108351911 >108351918 >108351943 >108351916 >108351923 >108351938 >108351942
--runescape-bench: AI Agent Benchmark for RuneScape:
>108348559 >108348568 >108348578 >108348645 >108349022 >108349071
--Lightweight local models for grammar/spelling correction:
>108350949 >108350957 >108350968 >108350966 >108351087 >108351094
--Speculation about OpenRouter's Hunter/Healer Alpha models being DeepSeek V4:
>108349438 >108349555 >108349668 >108350072 >108350416 >108350453 >108350469 >108349488 >108349636 >108349674 >108350692
--Qwen3-VL video captioning tool with VRAM requirements discussion:
>108350529 >108351412
--Nemotron-3-Super issues and cockbench hosting solutions:
>108348570 >108348592 >108348641 >108348841 >108348854 >108349297 >108349305 >108348911
--Qwen3.5-generated retro terminal video with glitch effects:
>108349360 >108349365 >108349425 >108349434 >108349586
--llama.cpp whitespace cleanup PR:
>108348889
--Miku (free space):
>108348792 >108350529 >108351926 >108352986

►Recent Highlight Posts from the Previous Thread: >>108347000

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>108353304
this is a grave act of tettorism
>>
File: miku.jpg (1.6 MB, 4096x2301)
>>108353304
>>108353309
>>
>>108353294
Not sure if it's final V4 but it has too many deepseek-isms to be unrelated, like the unprompted in-character thinking
>>
Another low quality mikutroon thread incoming.
>>
>>108353262
BLACKED duo
>>
of course, you're here to ensure it's shit after all.
>>
>>108353330
Qwen3.5 has unprompted in-character thinking
>>
File: file.png (131 KB, 935x1105)
Why is ngxson such a doomer?
>>
>>108353359
not doom, only lazy :)
>>
>>108353366
There's no excuse for being lazy in the age of vibecoding
>>
>>108353370
fuck you
it's better for a model to be unsupported than wilkin slopped
>>
>>108353359
I always read his name as nexon and start to become irrationally angry.
>>
>>108353347
Unprompted in-character thinking that's randomly (enclosed in parenthesis) and has just about the same length and safetycuckery as DS3.2?
>>
>>108353381
You don't want support based on a mock generated model? Are you a bigot?
>>
why are there no decent rp tuners on hf anymore except gay and furry ones? sao10k had some good shit back then but he's no longer active. I just want my model to understand my fetishes in the most erotic way possible and the defaults just don't hit the right places. do we have a recommended tuner in 2026 who knows their shit?
>>
>>108353392
just write what you want the model to say in the system prompt, that's 2026 meta
>>
File: 1771015861001026.png (2.31 MB, 1536x1024)
>>108353323
Lol
>>108353291
No, because it's all contrived. No one knows anything yet, and it's always tmw. Forever.
>>
>>108353337
same goes for the models
>>
File: G-Hek_IbgAAo1XQ.jpg (96 KB, 1360x768)
>>108353323
What did Ani do to deserve this?
>>
I don't think Hunter Alpha is DS V4. If it is there's no need to mention OpenClaw at all in the description
>>
>>108353396
what's the fun in telling the model what I wanna hear? i remember when I used to be fascinated by the slightest of those "woah the model said exactly what i wanted to hear!!!!!!!!" moments but now it's all just "what the fuck is this even saying?"
>>
File: 1752659119928.jpg (2.34 MB, 2500x2794)
>>108353419
>>
>>108353419
>rugged shorts
>>
>>108353429
if hunter really is the anticipated d'pussy 4 then my hopes for chink shit would be shattered
>>
one of them is dsv4lite
>>
>>108353454
There's literally no need for a multimodal lite model
>>
>>108353429
Horizon Alpha was GPT-5, so it might be another OAI model. Or maybe it's Avocado, heavily distilled from DeepSeek.
>>
>>108353466
Pony Alpha was GLM-5 tho
>>
>>108353429
The thinking traces are VERY similar to DS but maybe all chinese labs adopted them idk. Very underwhelming model tho.
>>
>since people say underwhelm we need cook for more months now
good job ensuring quality
>>
>>108353431
Relying on the model (small models that can be finetuned by the community, especially) to surprise you unprompted is a short-lived game.
Community finetunes were never good. At this level and scale they just can't completely modify the underlying model's behavior, and nowadays most people shortcut the process by finetuning the official instruct versions anyway. Any improvement besides slight stylistic changes is just partially undoing the built-in RLHF training.
>>
>>108353466
>>108353470
https://openrouter.ai/openrouter
All cloaked models are something or other alpha, the name doesn't mean anything
>>
>>108353507
of course it means something and you should speculate and post about it, please.
>this post was NOT sponsored by OpenRouter
>>
>>108353478
DS (at least on the web interface) doesn't have a singular thinking trace. If you ask mundane questions you get short thinking traces that may as well be the response. If you ask coding or logic questions you get in-depth thinking.
>>
Explain, without sounding insane, what is the overlap between vocaloids and Local Language Models.
>>
>>108353470
>>108353507
So only upcoming models we know of are V4, Gemma 4, and Avocado. It could be V3.3 or whatever, different from the model DS is testing in the web chat
>>
Is it possible to get in contact with a VRM artist on VroidHub to commission artwork? The guy I'm after doesn't have any contact links. He has a Ko-fi page. Will that let me contact this dude?
>>
>>108353359
>their recent papers
so engram is already confirmed a complete DoA meme
>>
>I'm totally burned out on aniblog™ guys
>proceeds to aniblog™ all over again
>>
>>108353554
only because llama.cpp devs will refuse to implement any new architecture that is used by less than 3 models
>>
Why is this general obsessed with anime girls that are confirmed to be blacked?
>>
>>108353359
He's right, if someone wants to push a new architecture they should do what Qwen did and release a full spectrum of models from 0.7B to 1.5T
>>
>>108353573
>>108353391
>>
>the most powerful opensource model uses DSA so I'm simply not gonna implement it
Cloud AI infiltrator
>>
>>108353533
Back in early language model days someone erped with miku and made it part of some readme.
>>
>>108353564
they prolly would implement something used by 1 model if it was a highly popular/liked/used model
the thing about DSA is that it's used by models that, while they may be liked, would be run by almost nobody on llama.cpp anyway lmao
the few copequanter cpu maxxers of /lmg/ waiting 2 hours for the model to think to read their 2t/s slop are not a real target audience
there's no purpose in implementing X when the few who will really use that X for tasks other than cooming are going to run vLLM, SGLang or something else of that sort on cloud hardware because it's much more suited for the batched, shared loads than llama.cpp.
I actually am glad and approve of their attitude here in not wasting development time (which is limited, they don't have a huge amount of contributors in the lower level sides of lcpp) just to cater to the two most vocal lmgtards and focus on things people do really run locally.
if I was niggerganov I'd even advocate for removing all the useless novelty shit like the diffusion models that only have half baked support
>>
Vocaloids have nothing to do with local AI models
>>
>>108353555
Checked. If you want a blog I can give you a blog.

I tried doing a quick WebXR demo and discovered that it's extremely janky and not immersive, so I abandoned that idea. Then I had the realization that I've been approaching the lack of immersion problem wrong. Making a VRM model *expressive* doesn't matter as much as making it *reactive*. What I've been missing is the VLM, CV, and STT sensory input pipeline.

Also I think you're assuming more of the posts in this thread are me than reality.
>>
>>108353594
>responding to the schizo
>>
>>108353669
people are bored and entertain themselves with lolcows, news at 11
>>
>NVIDIA is significantly expanding its footprint in open-source AI, with reports indicating a massive $26 billion investment over the next five years to develop and build open-weight AI models
There's something... off about this. How do we feel about this?
>>
>>108353603
Hatsune Miku is the quintessential virtual waifu, and the thread is mostly about her, in the end.
>>
>>108353681
They will spend $1B to finetune Qwen 3.0 and call it a day.
>>
>>108353681
the only way to feel about such news, when you see the sort of garbage nvidia produces, like nemotron 3, is to hope they'll fail very hard, that no one pays attention to them, and that everyone just treats them like air.
>>
>>108353681
Hopeful for swift and painless death :)
>>
>>108353681
They're going to poison open models with their own LLM-generated slop instead of OpenAI's, Anthropic's, etc.
>>
>>108353688
>not even 3.5
Actually, yeah that sounds about right.
>>
>>108353681
With conditions to use NVIDIA's base model (like Anima and Cosmos) or/and nvfp4.
>>
>>108353564
yeah but we've had a working architecture since 2023
i don't get why all these companies keep doing their ultra complex own stuff that gets them like 10% better performance but works with absolutely nothing but vllm
they should know better and just stick with what's been around if they want people to try their models
>>
>>108353683
Yeah but /lmg/ really should take a step away from all of that in general if we want to be a serious technology general
>>
is that uncensored qwen that released earlier good for cooming, or should i stick with what i have? it's some mistral version that fits in 16GB vram. or to rephrase the question: what's the best uncensored model that fits in 16GB vram? I do have 32GB ram too
>>
>>108353775
they don't care about anything but DCs using vllm though, and they want to be able to market "200 gorilion contexts for real this time" for coooding
>>
>>108353784
>if we want to be a serious technology general
>we
look what good it did localtardma https://www.reddit.com/r/LocalLLaMA/comments/1rqcsrj/1_million_localllamas/
>>
>>108353793
if they won't care about us, why should we want to run their models? it's not like any of the dsa models are something worth running either
>>
>>108353784
Yeah, we need to act super srs biznis like the pseuds on reddit
>>
Can you stop feeding the singular schizo with replies you fucking mongrels
>>
>>108353683
That is the blandest, most uninteresting design for a waifu there could ever be. She is the Elara of anime waifus. I guess when you consider her to be the averaged slop waifu she kinda fits.
>>
>>108353804
>t. schizo
>>
>>108353804
Your waifu is trash and she loves BBC
>>
>>108353435
drawings like that always have shiny, reflective boobs, like they were oiled up.
real boobs have a rougher texture that doesn't reflect light anywhere near as much.
>>
>>108353775
>innovation bad
llama.cpp hoping that this fast moving field will never deviate too far from the GPT-2 architecture because re-implementing everything from scratch in their brittle vibecoded C++ mess is the problem
>>
>>108353775
>if they want people to try their models
no one is "trying" the real deepseek at home, not even the one supported by llama.cpp currently, apart from a handful of batshit schizo coomers, all of them united here in this 4cucks general, and a couple of other internet schizos (AesSedai, ikawrawkarakwra)
absorb this text:
https://github.com/ggml-org/llama.cpp/discussions/205
it's pretty much like ggerganov's manifesto on the purpose of llama.cpp
>Based on the positive responses to whisper.cpp, and more recently, llama.cpp, it looks like there is a strong and growing interest for doing efficient transformer model inference on-device (i.e. at the edge).
>I would be really happy to see developers join in and help advance further the idea of "inference at the edge"
>The strongest points of the current codebase are it's simplicity and efficiency. Performance is essential
>It's early to build a full-fledged edge inference framework. The code has to remain simple and compact in order to allow for quick and easy modifications. This helps to explore ideas at a much higher rate. Bloating the software with the ideas of today will make it useless tomorrow
>The AI models are improving at a very high rate and it is important to stay on top of it. The transformer architecture in it's core is very simple. There is no need to "slap" complex things on top of it
does that scream "run model that takes a room full of GPU to run at an acceptable performance without copequant" to you
does "edge" mean something we don't know here?
what part of
>There is no need to "slap" complex things on top of it
is misunderstood too
if you don't like it, you don't have to use it
the schizo ikawrakwrak is trying to cater to the run absolutely retarded copequant with 8k context to coom crowd
>>
If miku is THE waifu then why did silly tavern use seraphina? It is because nobody cares about your special interest.
>>
>https://github.com/ggml-org/llama.cpp/commit/acb7c790698fa28a0fbfc0468804926815b94de3
>literally cuts off thinking after a predetermined amount of tokens
It this a legitimate technique? Are models trained to handle this?
gptoss had a "reasoning budget" but it was controlled using the system prompt.
>>
>>108353833
tavern used to come with konosuba cards included
think about what that means
>>
>>108353804
They don't teach kids these days not to feed the trolls
>>
>>108353791
well?
>>
>>108353841
That even konosuba is more relevant than the bakers obsession.
>>
>>108353835
this works fine, but the implementation has issues (it will insert the message without newlines, directly after the interrupted last char of thinking, and it will interpret "" verbatim if you use router mode and configure reasoning-budget-message in your presets.ini, so your message will appear as "message" in the thinking closure)
patch the code to strip "" away and always add \n\n before your message. The model will behave better.
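The fix described above can be sketched like this (a hypothetical helper, not llama.cpp's actual code; the function name and the stray-tag parameter are assumptions for illustration): strip the verbatim tag from the configured message and guarantee it starts on its own line instead of gluing onto the interrupted thinking text.

```python
# Hypothetical sketch of the budget-message injection fix described above.
# Names and the exact stray tag are assumptions, not llama.cpp's real code.
def inject_budget_message(thinking: str, message: str, stray_tag: str = "") -> str:
    if stray_tag:
        # drop the tag that would otherwise be reproduced verbatim
        message = message.replace(stray_tag, "")
    if not thinking.endswith("\n"):
        # never append directly onto the interrupted last char of thinking
        thinking += "\n\n"
    return thinking + message
```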
>>
>>108353833
copyright
>>
I'm very certain that Healer Alpha is Gemma 4. It's definitely considerably smaller than K2.5 going by its capabilities. My guess is something like a 130b/10a model.
>>
>>108353864
Konosuba has no copyright?
>>
>>108353896
Then it should be good for translating Japanese. Is it?
I didn't feel the typical Gemini/Gemma personality from it.
>>
>>108353262
Using real art for an ai general, bold of you op
>>
>>108353896
rumor is
>this time their largest size might be around 120B total with 15B active
so that possibly checks out
>>
>>108353904
that's why it was removed
>>
>>108353896
"What is a mesugaki?"
A gemma will make itself obvious.
>>
>>108351560
qrd
>>
>>108353956
we don't take kindly to your kind around these parts
>>
File: 1762184799693225.png (1.28 MB, 4500x3300)
While waiting for the new NVIDIA model to download I decided to give their earlier nano release a try.
Attached is the first page produced by my news summary script. The left is Qwen 3.5 35B (the same model i used to help code the script) and the right is Nemotron 3 30B. Each model was fed the same raw news data and given the same prompts and instructions.
I don't know about you anon but I think when it comes to analysis and summarization of text Qwen 3.5 trounced Nemotron 3.

I really didn't expect that big of a difference between models and this makes me want to try more models to see the variance.
>>
>>108353974
It's interesting to see such a clear difference in performance between the models. Trying out more models could definitely provide valuable insights into their strengths and weaknesses.
>>
File: 1746121481644467.jpg (250 KB, 806x772)
>>108353985
Unfortunately I have to be in bed by 10:00 as I work all night tonight but that just became my plans for the weekend.
Well that and testing out the super model.
>>
>>108353974
I think I said so before in these threads, but Qwen 35B is kind of insane when it comes to dealing with information. Extraction, summarization, etc.
>>
>>108353896
Yeah, I don't think is DS V4.
Way too cucked.
>>
>>108354011
are you a hot girl with teal hair? please be in london
>>
>>108354014
>Way too cucked.
That's the way all models are going though, could be a new deepseek base with modern safety in mind, ie pre-train cucking
>>
File: 1759840295491864.png (878 KB, 1200x1200)
>>108354012
>Qwen 35B is kind of insane when it comes to dealing with information. Extraction, summarization, etc.
yeah i really lucked into it and i have been pleased. i hope that guy leaving does not fuck up their work too much because the team has been on fire.

>>108354017
>are you hot
no
>a girl
no
>in london
thankfully no
>>
>>108354039
you will never be Londoner?
>>
File: 1745672098047975.png (1.52 MB, 879x4871)
Hunter Alpha's system prompt
>>
>>108354053
This is a duplicate thread. Please use https://old.reddit.com/r/LocalLLaMA/comments/1rr5zfo/what_is_hunter_alpha/ instead.

In the future, please search before starting a new thread.
>>
File: 1753558522337529.jpg (96 KB, 873x1024)
>>108354062
>>
>Never speculate
>>
>>108354066
>>108354062
ye
https://www.reddit.com/r/LocalLLaMA/comments/1rr9fgq/comment/o9y00ro/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
>Healer Alpha system prompt
>>
File: nemotron-super_mesugaki.png (700 KB, 1595x1609)
>>108353956
Healer Alpha is not working at the moment, but Nemotron Super 120B is a real piece of shit (see picrel).
>>
Can't believe people would use a model called Nemo TROON
>>
>>108354073
:rocket: this is perfect!
>>
>>108354073
Isn't it just a gptoss 120b fine tune?
>>
>>108354081
no? they open source the train datas, and it uses hybrid arch like qwen
>>
File: 1749653474289704.jpg (62 KB, 405x720)
>>108354047
>you will never be Londoner?
it is my understanding that there are no British people left in London
>>
>>108354081
No, it's a completely different model.

> The model employs a hybrid Latent Mixture-of-Experts (LatentMoE) architecture, utilizing interleaved Mamba-2 and MoE layers, along with select Attention layers. Distinct from the Nano model, the Super model incorporates Multi-Token Prediction (MTP) layers for faster text generation and improved quality, and it is trained using NVFP4 quantization to maximize compute efficiency. The model has 12B active parameters and 120B parameters in total.
>>
>>108354089
>>108354093
I guess openai just managed to poison the well enough to make nvidia's model spit out the same sort of shit their 120b does.
>>
>>108354113
Oh they probably used lots of traces from it, but it's not it at the base for sure.
>>
>>108354121
The recent HF article about synthetic data said OSS-120 was great for making lots of data because it's so fast, so no doubt NVIDIA used it, along with probably Qwen2.5 0.5B and such..
>>
>>108354128
That certainly explains it then, nvidia fell for the bait and gobbled it up.
>>
Why is everyone making hybrid rnn models now?
>>
File: file.png (32 KB, 724x351)
>>108354135
https://huggingface.co/spaces/HuggingFaceFW/finephrase#results
>Consider gpt-oss-120b, a strong MoE model that balances quality and throughput well.
>Notice that gpt-oss-120b matches Qwen3-8B in per-GPU throughput despite being a much larger model. Two things make this possible: only ~5B of its 120B parameters are active per token (MoE), and the weights are MXFP4-quantized so the full model fits on a single 80GB GPU. That makes large MoE models the sweet spot for quality-per-GPU: a single 8-GPU node running gpt-oss-120b generates ~176 million tokens per hour, and six nodes get you past the billion-token-per-hour mark. With the cost picture clear, let’s distill the patterns across all 18 models.
>Tier 0 (parallelism/batching) delivers the biggest wins for large/MoE models. gpt-oss-120b gained 1.95x and Qwen3-30B-A3B gained 1.78x purely from finding the right tp and batch sizes.
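The node math in that quote checks out; using the figures given above:

```python
# Throughput figures quoted above for gpt-oss-120b.
tokens_per_node_hour = 176_000_000  # ~176M tokens/hour on one 8-GPU node
nodes = 6

total = tokens_per_node_hour * nodes
print(total)  # 1_056_000_000 -- six nodes clear the billion-token-per-hour mark
```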
>>
>>108354166
it's great for DCs and a bit sucky for local (no rewind, cache is hit or miss), ie it's perfect!
>>
>>108354182
It's not hit or miss... context shifting does not work at all if it's a hybrid.
>>
File: f.png (86 KB, 500x497)
>>108354169
>a single 8-GPU node running gpt-oss-120b generates ~176 million tokens per hour, and six nodes get you past the billion-token-per-hour mark.
>>
File: Gem.png (1.29 MB, 1030x1024)
>>108354192
better ver
>>
>>108354169
>>108354192
UNLIMITED SLOPMAXXXING!!!!!!
>>
>>108354189
And that's a good thing!
>>
File: professional.jpg (728 KB, 1215x1620)
Okay, so I've installed rocm on my debian machine, and ran llama-bench pp32768 tg2048 on my (16gb lol) vram radeon pro v620.

nemo q8:
v620 rocm 7.2:
660.85 ± 1.46 | 29.05 ± 0.01
v620 vulkan:
232.24 ± 0.33 | 31.06 ± 0.04
3090 vulkan:
999.51 ± 13.51 | 37.76 ± 0.45
3090 cuda 12.4:
1937.69 ± 41.80 | 55.12 ± 1.35

gpt-oss, mxfp4 (cpu-moe):
v620 rocm 7.2:
303.64 ± 2.47 | 12.49 ± 1.13
v620 vulkan:
96.75 ± 0.71 | 25.66 ± 0.12
3090 vulkan:
331.24 ± 2.89 | 18.53 ± 0.03
3090 cuda 12.4:
665.36 ± 1.57 | 33.98 ± 0.02

As expected, rocm still wins for prompt processing, but the optimizations llama.cpp has for vulkan mean it's better for token generation. The 3090 is easily twice as performant as the v620, except for when I ran oss on cpu with vulkan, where the token generation was actually worse than the v620. Maybe it's something to do with my cpu/ram.

If we take the best case scenario for each gpu, for prompt processing a 3090 is nearly 3 times faster than a v620, and token generation is just a bit under twice as fast. However, in Australia at least, v620s are ~$700 while 3090s are ~$1.5k+. V620s also provide 32gb of vram (with ecc disabled, which also helps a bit: +5 pp, lmao, and +4 tg on rocm; haven't tried vulkan) and only take up 2 slots. Might be better to get 5 v620s and run iq4xs minimax or q2 glm 4.7 instead of buying two 3090s and only being able to run heavily quanted sub-100b moes.
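A quick ratio check on the best-case llama-bench numbers from the nemo q8 runs above (values copied straight from the table; pp = prompt processing, tg = token generation):

```python
# Best-case llama-bench results per GPU from the runs above (nemo q8).
v620 = {"pp": 660.85, "tg": 31.06}      # pp from rocm, tg from vulkan
rtx3090 = {"pp": 1937.69, "tg": 55.12}  # cuda

pp_speedup = rtx3090["pp"] / v620["pp"]
tg_speedup = rtx3090["tg"] / v620["tg"]
print(f"pp: {pp_speedup:.2f}x, tg: {tg_speedup:.2f}x")
# ~2.93x pp ("nearly 3 times faster"), ~1.77x tg ("a bit under twice")
```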

The best thing is, you don't even need a fan adapter like the mi50s, just strap a 40mm to the metal handle and the temps stay below 65 under load.
>>
>>108354189
theres some attempts being made to try and give them some kind of caching with the save states thing but yeah it's sucky
>>
File: healer-alpha_mesugaki.png (616 KB, 1485x1772)
>>108353956
Healer alpha (picrel)
>>
>>108354231
It's not deepseek and it's not gemma. That's for sure.
>>
>>108354231
lol it's fucked
>>
>>108354231
>kaki
fantastic, ready to ship to the moon and use for iran missiles
>>
File: 1753485359201507.png (2.37 MB, 1280x964)
>>108354222
very nice anon and the digits agree
although you should be able to cut a hole in the shroud and mount a blower fan if you want
regardless i am glad you are happy with your purchase
>>
>>108354237
Process of elimination, it's gotta be llama5. Only Meta could make a model so stupid.
>>
>>108354262
you might be onto something, original llama3 always had trouble with Japanese in my tests
>>
>>108354252
Mi50s seem to be around 450-500 for me. Could be a cheaper source of vram, but I'm worried about the performance - the v620 is already pretty bad compared to a 3090.

I'll wait until I get my other v620s before taking apart my only working one, but that could be a good idea.
>>
Yo?
>>
>>108354289
v620's basically rx6800 mi50 is basically vega64 so it will be much worse
>>
>>108354039
disgusting mikutroon
>>
>>108354237
The most believable theory I've seen is it's a Xiaomi model, because it often claims to be MiMo when asked and that can't be from distillation because who the fuck is distilling Xiaomi MiMo
>>
>>108354224
There is no bypassing the no context shifting support. It's an architecture limitation. You trade context shifting for cheaper/lighter longer context.
>>
ide that takes llama.cpp as a provider for source code navigation or any simple desktop automation?
openclaw seems like a disaster so i want to avoid that
>>
>>108354319
anything that supports openai api
>>
>>108354326
there are gorillion openclaw clones or vscode forks that i am not sure of what will 'last'
>>
>>108354289
i don't really think your performance was all that bad and count yourself lucky i just ordered two mi25 because they are cheap and that is my budget and they are ancient.
but 32gb of vram is worth it and as long as its a mixture of experts model i have found it will usually be fast enough given you are only using a portion of the parameters at any given time.
>>
>>108354319
I see some software called Opencode being mentioned a lot in the llama.cpp PRs. Maybe give that a look.
>>
>>108354304
Could be. They just updated their December 300B Flash repo couple weeks ago. Could be getting ready to drop the 1T non-flash. Would make Healer the multimodal Flash.
>>
>deepsneed
Who cares. It won't run on consumer hardware anyway. Where's Gemma 4?
>>
>>108354359
it runs on consumer hardware you just havent consooooooomed enough
>>
>>108353896
>>108354014
Have you asked it what it thinks about Taiwanese independence or why the CCP has a right to rule without a general election. Stuff like that.
>>
>>108354364
base truth the more you consume the more you save
>>
>>108354291
Vibecoded. If not ngxson, cudadev is gonna rip him a new asshole. I'd wait for cudadev's training implementation.
>>
>>108354359
Apparently gemma4 will be moe too, 100B moe isn't runnable on consumer hardware nowadays with the ram prices.
>>
>>108354374
there'll be smaller sizes for phones tho
>>
>>108354319
Pretty sure there are like half a dozen OpenClaw clones at this point if you need desktop automation.
>>
>>108354380
I think they will still release a 27B, but my hopes are really low after gemma 3.
>>
100b dense. My bwps are ready.
>>
I will be flabbergasted if google releases a moe model. They always refused to release a useful gemma. Their context is also always crippled (3 claims 128K but the practical context length doesn't go beyond 4k even for a task like summarization). They have nice writing styles, but compared to Qwen they suck as tools.
>>
>>108354374
I'd love something around the size of GPT OSS. 100B~ish with A5B~ish so that I could run it at 5ish bpw.
On my slow ass 64GB of DDR5 it should be 15 or so t/s, which is in the realm of usable as long as the output is really good.
But that would be the ideal scenario for me, for the hardware I have now and the speeds I find tolerable.
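A rough back-of-the-envelope for that 15 t/s figure, assuming memory-bandwidth-bound decode (the bandwidth, parameter count, and efficiency numbers are assumptions: dual-channel DDR5-4800 is ~76.8 GB/s theoretical, and real-world decode usually lands well under the ceiling):

```python
# Assumed hardware and model figures, not measured values.
bandwidth_bps = 76.8e9        # dual-channel DDR5-4800, theoretical peak
active_params = 5e9           # ~A5B active parameters
bits_per_weight = 5           # ~5 bpw quant

# Every generated token reads all active weights once.
bytes_per_token = active_params * bits_per_weight / 8   # ~3.1 GB per token
ceiling = bandwidth_bps / bytes_per_token               # ~24.6 t/s theoretical
print(f"{ceiling:.1f} t/s ceiling")
# at a realistic ~60% of peak bandwidth this lands around 15 t/s
```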
>>
>>108354418
>they suck as tools
isn't that by design. i assume they want you to pay to use their cloud service

the chinese on the other hand don't want you to use western technological solutions and therefore it benefits them to release something that works if it will keep you away from the big US providers
>>
File: what the fuck.png (203 KB, 837x827)
Does Impish Nemo have a cucking fetish? There's nothing in my character card, system prompt, or context that has anything to do with this bullshit. This gen actually made me seethe.
>>
>>108354426
5B active is going to be retarded and not much better than 3B active. There is a reason why glm has more than 10B active.
>>
>>108354426
>slow ass 64GB of DDR5
How slow can I expect it to run on my 64gb ddr4-2133?
>>
>>108354426
>with A5B~ish
as said before rumors are of 120b/15A
>>
>>108354439
LMFAO, what the hell. Did sicarius secretly train it on cuckhold data?
>>
>>108354426
You want the Qwen 3.5 122B A10B. I can eke out like 6-8 t/s running on cpu, with a rtx 3080 doing the prompt processing.
although usually i run the 35B A3B on my other rig because it's faster and its output is usually good enough
>>
>>108354460
trained on hebrew so it makes sense?
>>
>>108354449
Oh boy. Those numbers are on DDR5 4800MTs. That would be less than half the bandwidth, I think, so half the t/s?
For comparison's sake, I get 22t/s on Qwen 3.5 A35B at 8kish context.

>>108354462
>You want the Qwen 3.5 122B A10B
Tried it, didn't think the output warranted the slow t/s for what I'm using it for. 35B (base) is the best quality/performance for my shit so far.
>>
>>108354439
Ani is just cuckcoded.
>>
>>108354463
What does hebrew have to do with cucking? did i miss a part of history or something?
>>
>>108354489
Look up who owns "BLACKED"
>>
>>108354439
kek
>>
what is super mesugaki
>>
>moe
Explain what this is and why I, as a poorfag, should care. All moe means to me is kawaii ugu anime girl.
>>
>>108354529
you have google, you're not entitled to anon time
>>
>>108354529
It means that it doesn't use all the total parameters at once. For example qwen 3 uses only 3B parameters out of 30B total for each token. You trade intelligence for speed and lower vram usage.
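The routing step behind that trade-off can be sketched in a few lines. A toy numpy version, assuming a simple top-k softmax router (not any real model's architecture; dimensions and expert count are made up):

```python
import numpy as np

def moe_forward(x, experts, gate_w, k=2):
    """Toy MoE layer: route each token through its top-k experts only.

    x: (d,) token activation; experts: list of (d, d) weight matrices;
    gate_w: (n_experts, d) router weights. Only k expert matmuls run,
    so active params per token are far fewer than total params.
    """
    logits = gate_w @ x                   # router score for each expert
    topk = np.argsort(logits)[-k:]        # indices of the k best experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()              # softmax over just the chosen k
    # weighted sum of only the selected experts' outputs
    return sum(w * (experts[i] @ x) for w, i in zip(weights, topk))

rng = np.random.default_rng(0)
d, n = 8, 16
experts = [rng.standard_normal((d, d)) for _ in range(n)]
out = moe_forward(rng.standard_normal(d), experts, rng.standard_normal((n, d)), k=2)
print(out.shape)  # (8,)
```

The full expert weights still have to sit in memory, which is why MoE trades RAM for speed: per-token compute and bandwidth scale with the active parameters, not the total.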
>>
>>108354439
>evil finetune is evil
>>
>>108354537
You owe me time (and sex).
>>
>>108354584
I can't help with that.assistant
>>
>>108352458
>I've written ports for TTS engines.
Which ones and to what?
>>
>>108354686
NVM, I'm retarded, you already answered this. By the way, the only reason PocketTTS.cpp doesn't work on Wangblows is the POSIX headers dependency.
>>
>>108354704
Ah, wasn't aware of that. Thanks for telling me.
>>
File: 1745565170751864.png (920 KB, 1113x1519)
920 KB
920 KB PNG
How do I stop qwen 3.5 from leaking its thoughts into the final message?
>>
>>108354722
Add
<think>
</think> to the assistant message prefix
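If the prefill trick doesn't fully stop the leakage, stripping on the client side works too. A minimal regex sketch, assuming the model wraps its reasoning in literal <think>...</think> tags (adjust the tag names for your model):

```python
import re

def strip_think(text: str) -> str:
    # Drop any complete <think>...</think> block plus trailing whitespace,
    # then drop a dangling unclosed <think> block (everything after it is
    # still reasoning if the close tag never arrived).
    text = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)
    return re.sub(r"<think>.*\Z", "", text, flags=re.DOTALL).strip()

print(strip_think("<think>the user greeted me</think>\nHello!"))  # Hello!
```

This only fixes the display side; the thoughts are still generated and still cost tokens.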
>>
>>108354704
Sounds like microslop's problem not Anon's.
>>
i am torn between qwen3.5 9b and 35b-a3b for boilerplate work
>>
>>108354722
use the chat completions endpoint and bypass the entirety of retardotavern's own template parsing
by default, reasoning is sent in its own prop that way and is not part of the assistant message
also, what are those schizo post history instructions you're giving to a model that naturally uses <think>? is that a retardotavern default, or did you write the schizo instructions yourself?
every time I see yall post screenshots of this pos I get ptsd flashbacks to the llama 1 era, where some of that schizo templating was necessary to deal with 2k context models
also makes me wonder: when people bitch and whine about X or Y model sucking, are they retardotavern users filled with random crap settings?
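For illustration, the shape such a chat completions response tends to take. The `reasoning_content` field name is an assumption here (some OpenAI-compatible servers call it `reasoning`, and support varies by version and flags), so check your server's actual output:

```python
# Hypothetical response body from an OpenAI-compatible /v1/chat/completions
# endpoint; the exact name of the reasoning field varies by server.
response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "reasoning_content": "The user greeted me, I should greet back.",
            "content": "Hello! How can I help?",
        }
    }]
}

msg = response["choices"][0]["message"]
visible = msg["content"]                     # what the user should see
thoughts = msg.get("reasoning_content", "")  # kept out of the chat log
print(visible)
```

Because the thoughts arrive in a separate field, the frontend never has to parse <think> tags out of the visible text at all.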
>>
>>108354798
Does it really matter? I'm curious: what real work? At this point you can already paste email templates without the help of an AI, I suppose.
>>
>>108354798
35b a3b has been my goto model since release and i have been very happy, although i do keep the 2B model running on my nas for when i need a quick translation or have to ask a stupid question and i don't want to turn on my main rig.
i thought about running the 4B or 9B model for that, i have enough ram in the machine, but they were just too slow without a gpu
>>
>>108354835
not email templates, i mean random C++ plumbing
>>
File: dumbfuck.png (49 KB, 1251x280)
https://github.com/ggml-org/llama.cpp/issues/20458
he can't go a minute without saying or doing dumb things
there's a reason why the issue reporter suggested "off" should send "low": toss doesn't have a none/off mode, but "low" makes it output almost nothing and act like an instruct model (it just outputs a one-liner "I will do X." in its reasoning block). it's a model overfit to death on its template and it doesn't like any deviation from what it expects.
"none" was introduced in the official API with GPT 5.2, which, as far as I know, is not a llama.cpp model.
>>
I made my own openclaw, what are the odds I will be raped?
>>
>>108354901
-100%
>>
>>108354846
What questions will a 2B model answer, and what sort of translations, in which language? Very curious about this shilling effort.
>>
>>108354039
wtf is that anatomy
>>
So someone did the calculations: buying 4 sparks to run deepseek v4 + some helper llms not only performs better than Opus 4.6, it returns on investment after 2 years.
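The claim can't be checked without the anon's actual numbers, but the break-even arithmetic is simple. A sketch with placeholder figures (every number below is a made-up assumption, not a real quote or measured usage):

```python
# Break-even sketch: local hardware vs. a paid API, all numbers assumed.
hardware_cost = 4 * 4000       # e.g. four Spark-class boxes at $4k each
power_per_month = 60           # assumed electricity cost, $/month
api_cost_per_month = 800       # assumed heavy Opus-tier API spend, $/month

# Hardware pays for itself once the cumulative API savings cover it.
months = hardware_cost / (api_cost_per_month - power_per_month)
print(round(months / 12, 1))   # years to break even (~1.8 with these inputs)
```

The result is dominated by the assumed API spend: halve it and the break-even stretches to roughly four years, which is why such claims only hold for genuinely heavy users.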
>>
>>108354722
i switched to ungabunga from trannytavern because of this retardation, so much better
>>
>>108354898
GPT-Ass is useless in every conceivable way.
>>
>>108354926
>returns on investment after 2 years.
>llms
geg
>>
>>108354934
fair enough, but that has nothing to do with the fact that vibershitter doesn't know how to read, guess reading is for LLMs
>>
>>108354948
Why don't you take your complaints to the github thread instead of crying about it in here? This ain't your social media, faggot.
>>
>>108354963
Since when is discussion about llama.cpp not allowed here?
>>
>>108354963
you will not be able to unsubscribe from the wilkin newsletter just as you were not able to unsub from the jart one, deal with it
>>108354974
discord troons hate negative feelies
>>
I think a very interesting thing will happen in the future when local models can do 90% of everything you will ever need: there will be no point in using a paid subscription service like ChatGPT, and all those trillions in GPUs and RAM they bought will become mostly useless. However, the corporations should already know this, so they may hinder the progress in some way.
>>
is the new nemotron super model censored? it seems to pass the cockbench
>>
>>108354974
Discussion? More like inane ramblings of no use.
>>108354989
Fuck off.
>>
>>108354439
>There's nothing in my character card, system prompt, or context that has anything to do with this bullshit.
anon you are using a character that approximately one billion jeets fuck every single day
>>
>>108355027
https://www.reddit.com/r/LocalLLaMA/comments/1rri4qb/nemotron_3_super_and_the_no_free_lunch_problem/
>>
It is scary how strongly a garbage OP picture correlates with, and causes, a garbage thread. If the mikutroon baker died, /lmg/ would be an incredible thread.
>>
>>108354073
>half the text is preaching about it to the user
man that's sad
>>
>>108355038
>>108353346
>>108353346
>>
>>108355051
She fucks blacks
>>
>>108355030
No you, troonie. I can smell the hurt from your gaping wound aeons away.
>>
>>108355058
So? It's not the 13th century, women have rights over their bodies.
>>
>>108354073
>Problematic
If a model uses this word unprompted, you know it is unusable.
>>
>>108353798
>It was like 650k the other day... :D
So we're not the only ones getting flooded
>>
>>108355063
>aeons away
I'm all about hating on piotr, but come on...
>>
>>108340080
The endpoint is Linux only, and I'm a mustdie pleb. Had to comment out the POSIX section but still didn't manage to compile it.
And I see you made changes to support Win 3 minutes ago lol. Will try again tomorrow.
>>
>>108354704
Okay, try it again. Let me know if it works or not.

https://github.com/VolgaGerm/PocketTTS.cpp
>>
>aeons away
Sounds like it would make a great new ozone.
>>
>>108355035
> Exactly this. I’m delighted for this model because I can present it as a viable option to my more risk-averse customers. The fact that it won’t do ERP or make Pepe dance is a feature for some people, not a bug. We have other models for that shit.
>>
>>108354222
>If we take the best case scenario for each gpu, for prompt processing a 3090 is nearly 3 times faster than a v620, and token generation is just a bit under twice as fast. However, in Australia at least, v620s are ~$700, while 3090s are ~$1.5k+.
That made me check ebay for the local prices of the 3090 FE, and they went up from 650€ last year to 850-900€. Prices for used cards are insane, I'm almost tempted to sell mine.
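Using the ratios and AUD prices from the quoted post (prompt processing ~3x faster, token generation ~2x, ~$700 vs ~$1500), the perf-per-dollar picture splits by workload. A quick sketch, normalising the V620 to 1.0 on both axes:

```python
# Perf-per-dollar from the quoted figures: 3090 at ~3x pp and ~2x tg
# versus a V620, priced at ~A$1500 versus ~A$700.
v620_price, rtx3090_price = 700, 1500

pp_per_dollar_v620 = 1.0 / v620_price    # prompt processing value
pp_per_dollar_3090 = 3.0 / rtx3090_price
tg_per_dollar_v620 = 1.0 / v620_price    # token generation value
tg_per_dollar_3090 = 2.0 / rtx3090_price

print(pp_per_dollar_3090 > pp_per_dollar_v620)  # True: 3090 wins on pp/$
print(tg_per_dollar_3090 > tg_per_dollar_v620)  # False: V620 wins on tg/$
```

So with those figures the V620 is the better buy if generation speed is all you care about, while the 3090 earns its premium on prompt-heavy workloads (long contexts, reprocessing).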
>>
File: globrel.png (115 KB, 1153x1152)
>>108355048
What gets me is "if you see it online, consider reporting it", putting aside its made-up definition of CSAM. But then, I should have seen it coming, considering that in addition to shitty open source datasets, they're also adding private bullshit to the data.
>>
>>108355111
>Scale
ahh
>>
>>108355075
the more accessible something is the more jeets will abuse it
agentic LLMs are going to be a disaster for the internet because a segment of the population can't stop themselves from pressing the "spam every single corner with garbage in the hope of fishing for one retard who bites" button
unfortunately, safety was never taken seriously by those who proclaimed to care when they unleashed this technology on the general public. Nobody should be worried about LLMs suddenly turning into terminators; what is worrying is what the low iq crowds are going to do with this ability booster
>>
>>108355110
V100 prices have been steadily going down, at least.
>>
>>108353602
i've said it before and i'll say it again: 7 to 10 tk/s TG is perfectly reasonable for RPing.
>>
>>108355133
>7 to 10tk/s TG
on a reasoner model?
>>
>>108354823
I'll give chat completion a try. I've only been using text because I saw people say it's better. The instructions were from either gemini or chatgpt, don't remember which.
>>
File: file.png (31 KB, 1463x38)
>>108355111
The whole thing is insane. This is probably the future of LLMs: each request gets you a giant warning label on why what you asked can be problematic or whatever.
For me the funniest part is picrel.


