/g/ - Technology

/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108689285 & >>108685756

►News
>(04/24) DeepSeek-V4 Pro 1.6T-A49B and Flash 284B-A13B released: https://hf.co/collections/deepseek-ai/deepseek-v4
>(04/23) LLaDA2.0-Uni multimodal text diffusion model released: https://hf.co/inclusionAI/LLaDA2.0-Uni
>(04/23) Hy3 preview released with 295B-A21B and 3.8B MTP: https://hf.co/tencent/Hy3-preview
>(04/22) Qwen3.6-27B released: https://hf.co/Qwen/Qwen3.6-27B
>(04/20) Kimi K2.6 released: https://kimi.com/blog/kimi-k2-6

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>108689285

--Paper (old): Tensor Product Attention Is All You Need:
>108690810 >108690834 >108690847
--Debating the intelligence gap between open weights and proprietary models:
>108691728 >108691743 >108691754 >108691766 >108691778 >108691784 >108691794
--Discussing perceived performance regressions in Opus and DeepSeek V4 models:
>108691867 >108691882 >108691892 >108691956
--Discussing the outdated nature and poor numeric hygiene of ik_llama.cpp:
>108691745 >108691753 >108691765 >108691898 >108691928 >108692134 >108692935
--Combating positivity bias and optimizing prompts for Gemma 4 roleplay:
>108690973 >108690990 >108691015 >108691024 >108691059 >108691086 >108691174 >108691199 >108691098 >108691136
--Anon shares tagger rewrite leading to troubleshooting and IP leak:
>108691573 >108691596 >108691625 >108691623 >108691977 >108692006 >108692154 >108692175 >108692184 >108692023
--Anon discusses Grok 2's slow inference speed due to active parameters:
>108689797 >108689807 >108689840 >108689838 >108689866 >108689884
--Complaining about AI frontends and building custom alternatives with AI:
>108692572 >108692593 >108692606 >108692710 >108692761
--Anon's MCP tool for Gemma to curate imageboard content:
>108691772 >108691779 >108691813 >108691829
--Debating the utility and capabilities of local Hermes agents:
>108690107 >108690114 >108690156
--Logs:
>108689374 >108689378 >108689903 >108690127 >108690546 >108690735 >108691772 >108691813 >108691883 >108692237 >108692568 >108692606 >108692710
--Teto, Miku (free space):
>108689388 >108689413 >108689923 >108692673 >108692859

►Recent Highlight Posts from the Previous Thread: >>108689299

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Mikulove
>>
Spud won.
>>
patiently waiting for dense >=70B 'emma
>>
Orb nigga here? Detected repetition should go on a permanent list
>>
>>108693177
Monkey's Paw: it will be in native 1-bit precision.
>>
>>108693194
god I wish
>>
Updated: https://rentry.org/recommended-models
>>
>>108693220
That might not be any better (possibly worse) than Gemma 4 31B, even quantized in 4-bit, in terms of knowledge.
>>
>>108693224
>Nemo
>Gemma
>too hardwarelet
Well, that kinda fits my own experience.
>>
>>108693224
I've never done local, but I'm thinking of trying local. Will a 4070 SUPER with 32gb of ram manage to run Gemma 4 31b? What quant do I get?
>>
>>108693253
should work for the moe moe kyun model
>>
>>108693224
Seems good.
>>
>>108693224
where's V4?
>>
>>108693253
No, 31b is dense. If you try to run it all on vram, you'd need a lobotomized quant like iq2xs. Spilling over to system ram is also not advised, as it will be very slow. You want to run the moe 26b version, where you can fit up to q8.
>>
>>108693274
https://github.com/ggml-org/llama.cpp/pull/22378
two more weeks (until he realizes that v4 uses a special attention mechanism)
>>
>>108693279
Is there a tangible difference between q4 and q8, or should i just use q4 for speed?
>>
>>108693253
>12gb vram
26BA4B, you can fit high quants of it if you use flags -ngl 999 and --cpu-moe in llama.cpp to put the most-used parts of the model on your VRAM and everything else in your RAM.
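The full invocation ends up as something like this (the model filename is just whatever quant you grabbed, context size to taste):
./llama-server -m gemma-4-26B-A4B-Q6_K.gguf -c 16384 -ngl 999 --cpu-moe
-ngl 999 offloads every layer to the GPU, then --cpu-moe keeps the big expert tensors in system RAM, so only the always-hit parts live in VRAM.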
>>
>>108693288
>-ngl 999 and --cpu-moe
This is not necessary now that -fit is the default.
>>
>>108693292
wtf next you're gonna tell me you don't need to use --jinja anymore
>>
>>108693287
It really only becomes apparent when you hit high context or complex tasks. For use as a chatbot, q4 is fine (weights, not context).
>>
>>108693287
There will be a noticeable difference in output quality. Just try both for yourself and see what feels more worth it to you.
>>
>>108693301
>>108693288

Thanks. Last question, llama.cpp or lmstudio?
Assuming i'm kinda retarded and don't want a ton of setup.
>>
>>108693297
--jinja is enabled by default
>>
most modern models are trained with interleaved thinking in mind so using them in tavern actually lobotomizes them
>>
>>108693307
llama.cpp is just ./llama-server model and go to 127.0.0.1:8000 or whatever it says

You can also run ./llama-server --help if you need help instead of going to the internet and reading outdated advice.
>>
>>108693312
is this true? I was gonna use ST as my front-end since I already have a lot of stuff set up in it
>>
File: file.png (497 KB, 2256x846)
497 KB PNG
>>108693224
For ERP, I don't think Nemo and GLM Air deserve any mention anymore. Gemma 4 26B MoE runs quite well even on VRAMlet builds, and CPU-only speeds even on old DDR4 Skylake are around 10 tk/s from what I tried; then 31B after that, spanning the gamut all the way up to GLM 4.7. There's just too much of a difference between it and all prior models, outside of some Frankenstein model merge you might prefer for some esoteric reason, which shouldn't factor into this. Also, I know it's early days, but Deepseek v4 Pro and Flash should be mentioned even if there isn't as much experience with them yet and llama.cpp has no support for them yet.
For agentic and coding, the unfortunate part is that things are still in flux at the top end. MiMo-V2.5-Pro, Kimi 2.6 and Deepseek v4 Pro all trade blows there, and even with API usage it doesn't seem like things have settled on what is better here. But I think your determination of Kimi 2.6 being the best here is probably correct, because it is the fastest to run by a sizable margin.
GLM 4.7 should be removed; Qwen 397B A17B and Qwen 3.6 27B are both strictly better. Qwen3.5 122B A10B should also be removed, it is worse at both coding and agentic tasks than Qwen3.6 35B A3B and has been obsoleted by that model.
Also, you need some consistency in ordering models from smallest to biggest or biggest to smallest; the general and programming sections have them in opposite orders and it triggers my autism.
>>
>>108693224
I would put the 26B Gemma 4 in there as well, as another option for <24GB users. I'd definitely take it over Nemo.
>>
>pwilkin broke the parser for various models two months ago
>including Kimi K2-Thinking, K2.5 and by extension now K2.6
>llama.cpp is now straight up bugged for these models and doesn't recognize the reasoning block as such
>--reasoning-budget doesn't work as a result
>a guy made a PR to fix this
>it got ignored and it's broken to this day
https://github.com/ggml-org/llama.cpp/pull/20535
wow thanks
>>
File: graph.png (686 KB, 1250x830)
686 KB PNG
>>108693350
>>
>>108693364
>https://github.com/ggml-org/llama.cpp/pull/20535
Lissanro's patch will work.
Or just use the schizo fork where k2.5 tool calling works fine.
>>
>>108693312
>>108693338
That's mostly only true for tool calls; the default chat template of these models strips thinking from most messages but keeps it during tool calls. SillyTavern doesn't support this (yet at least), but the option to send back the last n turns of thinking to the model can be a sort of hacky fix when you need it.

To elaborate, see Gemma 4's chat template:
https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_template.jinja

At this part here:
{%- set thinking_text = message.get('reasoning') or message.get('reasoning_content') -%}
{%- if thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
{{- '<|channel>thought\n' + thinking_text + '\n<channel|>' -}}
{%- endif -%}


This looks at the current message being rendered and checks if it's more recent than the last user message and contained a tool call. If so, that means it's part of the current tool call chain and its thinking is added back into the message for interleaved thinking. For every other message, the "reasoning_content" field is ignored and thus the thinking is stripped.

Some frontends still need to catch up with the fact that modern models expect the "reasoning_content" field to be passed back for all messages so that the chat template can process the reasoning and decide what to keep and what to strip. Right now a lot are still trying to mess with it themselves by either throwing the reasoning straight into the "content" field of the message (OWUI, ST with the include previous thinking option) or just excluding it (ST by default), both of which are wrong.

However, this is not really a concern if all you're doing is roleplaying without much tool use. For normal chatting and roleplaying, you're getting the full intended experience from your model in ST by just letting it strip the thinking.
>>
>>108693368
If I am going to argue for changes in a tierlist, I do need to actually bring proof. Unless you want "It came to me in a dream" type arguments?
>>
>>108693358
It already is in there
>>
>>108693350
>Qwen3.5 122B A10B should also be removed
No. Qwen3.5 122B A10B makes fewer mistakes and understands Delphi. Qwen3.6 35B A3B does not.
>Deepseek v4 Pro and Flash should be mentioned even if there isn't as much experience with it yet and llama.cpp has no support for it yet.
So not local.
>GLM 4.7 should be removed, Qwen 397B A17B and Qwen 3.6 27B are both better strictly.
Better with a coding harness. But GLM-4.7 is better for chatting about SWE tasks.
The updated list is perfect, I wouldn't change a thing.
>>
>>108693382
They're the same picture.
>>
>>108693390
Are people with a rig to run DeepSeek even using llama.cpp over shit like vllm and sglang?
>>
>>108693388
Yea but it's only under the programming section
31b is in both
>>
>>108693381
>Some frontends still need to catch up with the fact that modern models expect the "reasoning_content" field to be passed back for all messages so that the chat template can process the reasoning and decide what to keep and what to strip. Right now a lot are still trying to mess with it themselves by either throwing the reasoning straight into the "content" field of the message (OWUI, ST with the include previous thinking option) or just excluding it (ST by default), both of which are wrong.
Are you sure about this? Wouldn't this mean the context grows huge if we're sending back 1-5k reasoning tokens per message?
And do you know if the jinja playground thing on HF actually works with these (I need to see it rendered to understand it)? Last time I tried, it seemed broken or inaccurate.
>>
>>108693403
>Are people with a rig to run DeepSeek even using llama.cpp over shit like vllm and sglang?
Yes. Most of us who run Kimi and Deepseek are using llama.cpp, offloading routed experts to the CPU.
vllm/sglang can't do this.
>>
>>108693405
It says:
>You can also try the MoE and smaller versions listed below.
>>
>>108693414
>Are you sure about this? Wouldn't this mean the context grows huge if we're sending back 1-5k reasoning tokens per message?
Yes I'm sure. The context won't bloat because most of the reasoning won't end up in the actual prompt. When you send a message using the Chat Completions API, what you're actually sending is a structured conversation object, not a real prompt. The chat template converts that object into the text prompt. By sending back reasoning in the "reasoning_content" field, you're telling the chat template where to look IF it wants to include the reasoning, but in almost every case it won't. The exceptions will be tool call chains for models with interleaved thinking, and Qwen 3.6 if you enable the "preserve_thinking" argument, which is the "I actually WANT the context bloat" option.

I'm not sure if the huggingface playground is accurate but I've tested extensively using llama-server's /apply-template endpoint to inspect the final text prompts.
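If you want to check it yourself, a rough sketch of that kind of test (assuming llama-server's default port and that /apply-template takes the same messages body as chat completions; field names are the ones described above):

import requests

# a fake mini-conversation where the assistant message carries its old reasoning
messages = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello!",
     "reasoning_content": "The user greeted me, I should greet back."},
    {"role": "user", "content": "how are you?"},
]

# /apply-template renders the messages through the chat template without generating anything
# adjust host/port to wherever your llama-server is actually listening
r = requests.post("http://127.0.0.1:8080/apply-template", json={"messages": messages})
print(r.json())  # inspect the rendered prompt; for most templates the old reasoning_content won't appear in it

With an interleaved-thinking template you'd only see the reasoning reappear for messages that are part of the current tool call chain.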
>>
>fotm moe chinese mystery meat model (qwen, deepseek, glm)
check
>lowest type of possible quant, quantized attention layers, quantized tensor types, quantized token embeddings, EVERYTHANG quantized to hell
check
>mmap
check
>flash attention
check
>most popular chub card in the last 30 days
check

yeah, it's erp time
>>
File: 020.png (575 KB, 2440x915)
575 KB PNG
>>108693350

retards
>>
File: file.png (13 KB, 755x99)
13 KB PNG
>>
>>108693350 for ERP g431b>nemo merges>>>g426ba4>>base nemo
>>
File: file.png (28 KB, 604x162)
28 KB PNG
>>108693390
>No. Qwen3.5 122B A10B makes less mistakes and understands delphi. Qwen3.6 35B A3B does not.
I guess, if you are looking at it strictly offline and from a world knowledge standpoint, but any agentic coding harness should be using MCP to fill the missing gaps, e.g. https://github.com/GDKsoftware/delphi-mcp-server for that use case.
>So not local.
No one said that you had to run it on llama.cpp. You can run it right now using Deepseek's official inference code with torchrun on appropriate hardware, which gets into the eternal meme of what "local" means, and I don't think being too big rules it out from being local. Of course it isn't practical, but that didn't stop the list from going above 100B parameters for recommendations. I'm just saying it merits at least a mention for being open weights.
>Better with a coding harness. But GLM-4.7 is better for chatting about SWE tasks.
There's no reason not to have it in a harness; you can perpetually keep it in planning or do Q&A about a codebase and it won't perform differently. I don't think Qwen does worse here than GLM 4.7 for that purpose, and if you are going to ask general SWE questions that don't involve a codebase, you should be using cloud anyway to get the best answer.
>>108693473
I cropped it out, but apparently I need to include this next time for low-IQ finger pointers like you.
>>
>>108693481
explain
>>
>>108693493

Same model names mentioned twice.

Explain yourself
>>
>>108693504
I included reasoning and non-reasoning performance for those Qwen models to display everything about them. But turning off reasoning for coding doesn't buy you much beyond the time saved from not outputting thinking tokens, outside of very straightforward implementation tasks for subagents.
>>
>>108693523
answer accepted
>>
What's the best gemma 4 31b version or does it not matter? I've just been running Unsloth's shit
>>
>>108693656
>Unsloth
welcome to the botnet comrade
>>
Nemo was never good. Not even for ERP. Glad we can all agree on that now.
>>
>>108693686
It was good for E, not so much RP.
>>
>>108693686
Nemo was the best we had
>>
The singularity will be vibe-coded
>>
Trying out some vibecoding with Cline since I always just did the good ol' copy+paste before.
Did a mini project with qwen 27B q4km with 75k context at 40~ tk/s. Went surprisingly well. It's pretty smart. Sucks ass at nsfw but it sure can code.
Then I thought I'd try out Openrouter and Kimi 2.6. Why is this so frustrating? It runs anywhere from 1 to 100 tk/s and thinks for fucking ever. I thought Qwen liked thinking but Kimi is just insane. It is pretty smart when it's not falling asleep.
Also, seeing your money disappear on the OR website is just sad. After this project (also a VN frontend), I'm going back to local.
Overall, though, I wonder if junior programmers are ever going to find a job again.
>>
>>108693234
No, but it would prove that ternary works which would open the door to more models and specialized hardware or even just making it run better on RAM.
>>
Man how is gemma and qwen so good? Those two models have tapered off my API addiction to a bare minimum.
>>
Any way to speed up prompt processing for agentic stuff with long context?
This shit is driving me crazy, it takes nothing to actually do stuff but the prompt processing is slow as shit, it's like 80% of the waiting time.
>>
>>108693960
Having the same experience, Qwen3.7 27b for coding and agentic stuff and gemma4 31b for rp and creative stuff, with a 5090 can fit q4 of those with almost full context, pretty nice.
>>
>>108693966
Increase batch size if you have VRAM to spare.
>>
>>108693977
3.6*
>>
>when you wait for 4 minutes for Gemma to use up all 4096 tokens for thinking.
>>
>>108693966
Checkpoints should help.
>>
>>108693966
not directly answering your question, but depending on how you're using agents: if you've got some RAM to spare, making sure you have a decently high -cram (at least twice the KV cache size) helps a lot with subagents in particular, so that when it switches between two context threads it's instant instead of reprocessing them each time
>>
>>108693934
I actually meant binary, 0s and 1s only. Like this one: https://prismml.com/news/bonsai-8b
>>
File: stuttering.jpg (144 KB, 859x241)
144 KB JPG
Yeah, I know she's flustered, but that's too much stuttering, Gemma-chan
>>
>>108693350
Is Gemma 4 26B that bad for coding to not include in the graph?
>>
Is there any way to unload the model in Kobold cpp without having to restart it from scratch?
>>
>>108694012
Why when ternary is objectively superior? https://web.archive.org/web/20011205185830/http://americanscientist.org/Issues/Comsci01/Compsci2001-11.html
>>
File: 4anon.jpg (136 KB, 1168x471)
136 KB JPG
>>108694060
just for you, anon.
>>
>>108694100
Any ranking which lists Opus below ChatGPT is not worth consideration.
>>
>>108693350
Crazy to see Qwen3.6 27B's score just north of GLM 4.7's with reasoning, at least in this benchmark
I don't miss getting 4tk/s versus 40tk/s now for pretty much the same result. I just hope Alibaba keeps it up with the open-weight releases because they're one of the few holding things up at the low end
>>
>>108694113
Sad but true. I love their model but the company as a whole is psychotic. Like actually psycho.
>>
Not even going to try 3.6 after how awful 3.5 was. Nope, nuh-uh. Yes, it's my loss, but I just won't even waste my time with it. Can't make me, no. The Chinese shills won't get me this time. I'm not falling for it. No way. Not going to happen.
>>
>>108694113
well, benchmarks are gonna benchmark.
>>
>>108694159
3.6 is just 3.5 with more gemini distillation and openclaw specific training.
>>
>>108693350
deepseek flash lol
that's what happens when they cut china from distilling the western models
>>
>>108694176
So it's only slightly less retarded than its drooling predecessor?
Anons who look convincingly real in these threads claim superior coooding, but Gemma 31B destroyed 3.5 27B (I don't do webshit). No reason to suspect 3.6 is much better, then.
>>
>>108694113
5.5 is absolutely better than Opus and it's not even close
>>
>>108694198
That's my take, yes. I think the people constantly repeating that Qwen is better for coding only tried Gemma when it had constant template and parser issues initially and didn't try again after it was fixed.
>>
>>108694209
For huge coding projects? Can it compare two PNG outputs and make decisions based off that?
>>
>>108693381
Can you recommend a good frontend? All the ones I tried are trash.
>>
>>108694219
Yes and yes
>>
>>108694233
What is the ChatGPT answer to Claude Code?
>>
>>108694238
Codex
https://github.com/openai/codex/
>>
It feels like this is the end. Everything ended up disappointing. Even the proprietary models keep getting worse.
LLMs peaked with Opus 3 and it's only been downhill for anything that's not codeslop.
>>
>>108694260
llms peaked with summer dragon
>>
>>108694260
how was gemma 4 disappointing?
>>
>>108694260
Saar blackpiller, we are in a silicon shortage that is limiting research. We are in an energy shortage that is limiting research.

When Saar Altman redeems his nuclear microreactors in Ohio and NVIDIA 6000s, we will redeem a brighter future for all Americans (brown).
>>
>>108694275
the gemma4 finetroons that they've started to shit out are certainly disappointing
>>
>>108694297
Disappointing means you had any expectations to disappoint. They're finetunes, just use Gemma
>>
>>108694305
I use heretic and you can't stop me
>>
>>108694321
We'll see about that
>>
>>108694275
I don't care about tiny models like this. Great for you that you finally have mistral large 2 at home. That doesn't change the trajectory downhill for the overall field.
>>
>>108694305
This, the base model is not only perfectly adequate, it's superior to any finetune that'll be shilled here in the coming months. Finetuning isn't good, it's a meme and has been for years now. It isn't just a meme, it's a sign of skill issue, exposing retards who need finetunes as vramlets or chink shills who don't know how to prompt correctly.
>>
>>108694342
Honestly wish they could find a way to stop these grifts from modifying weights at all tbqh.
>>
Has anybody made something like an MCP server that exposes the web chat of other models as a tool that a smaller, dumber model could use?
I reckon it wouldn't be too hard to do that using Deepseek's web chat.
If not, I guess I'll just have to make one myself.
>>
>>108694374
This. What if there was some way to let us access them without actually giving control over every single weight? Maybe if you kept the weights on a remote server and then let us send our prompts there instead of downloading the whole thing.
>>
>>108694342
>it's a skill to have vram
>>
>>108694401
Nah, APIshit is bad, but there's got to be some way to encrypt/sign weights locally or something, so that only unmolested weights are run.
>>
>>108694422
>mmm govern me harder daddy
>>
>>108694422
You'd have to collab with all the inference providers to give them keys, it'd basically be HDCP for LLMs.
I think dealing with finetroons existing is a good tradeoff for being able to actually have the weights and do cool shit with them without needing pre-approval from a corporation.
>>
>>108692859
Which model are you using?
>>
>>108694443
True unslop would probably shit their pants and moan all over if their finetroon stack got deprecated by that...
>>
>>108694267
llms peaked with drummer sagon
ftfy
>>
oh no no no pew at it again https://www.reddit.com/r/LocalLLaMA/comments/1sw77p0/hauhaucs_of_uncensored_aggressive_fame_published/
>>
>>108694342
>it's not X, it's Y. it's not X, it's Y. it's not X, it's Y
I'm not sure if it's ironic but I agree with the message.
>>
>>108694260
I stopped believing this when gemma 4 released.
>>
Hi all, HauhauCS here...

It has come to my attention that there's a "reaper-abliteration" package floating around. I'd like to make it clear that it is *not* my work, and not what I use to make my models.

Clearly, someone is using this to slander my good name. Do not be misled, my techniques are much more sophisticated and result in a better model. Do your own research and you'll find that those who are slandering me are misrepresenting data and presenting blatantly false information.
>>
>>108694516
>>108694494
both of you samekeks go back
>>
File: file.png (15 KB, 740x100)
15 KB PNG
the chink fears the agpl schizo
>>
>>108694529
Sure thing mr. pew, I'll keep in mind "4chin" is your territory and I'll stay in the superior reddit forums.
>>
>>108694494
Abliteration, Heretic or whatever was just too good of a deal for grifters *not* to try building an ML career out of "uncensoring" models with it + some supposedly secret sauce. For all intents and purposes (even if it's not the same thing), it's a lower-tier version of sloptuning that requires far fewer resources and much less money.

Oddly enough (or maybe not so much), even the Dr*mmer in the beginning wasn't interested in any donation. Down the line, you'll likely see HauHau begging for more attention, shilling his "work" everywhere and eventually adding donation links and "open-for-work" notices.
>>
>>108694260
>LLMs peaked with Opus 3 and it's only been downhill for anything that's not codeslop.
Opus 3 was cloud-slop.
Otherwise I'd be using it right now.
>>
>>108694598
Honestly, being in that field, I understand their stance. You can't believe the number of large companies with millions of dollars in budget trying to use your model for free, fuck all of them.
>>
>>108694647
>You can't believe the number of large companies with millions of dollars in budget trying to use your model for free, fuck all of them.
So release the models under cc-by-nc-4.0
>>
What models do you use for unrestrained text based RPGs?
>>
>>108694671
Definitely not any Latitude f*netunes.
>>
>>108694671
Look for DavidAU on huggingface and pick the thing with the longest name.
>>
>>108694671
Wayfarer-Large-70B-Llama-3.3
>>
>>108694698
>recommending finetunes
lmao
>>
File: 1750492308610029.png (35 KB, 1500x648)
35 KB PNG
>>108694661
They'll use them without disclosing it. I'm gating the models on HF and these grifters are trying every trick in the book to get a 'trial' without paying up.
>>
>>108694707
>I'm gating the models on HF
disappear
>>
>>108694671
Story? Gemmy
Roleplaying? Gemmy
Text adventuring? Gemmy
Coding? Qwen
>>
>>108694707
>vertical
lmao fuck those niggas. Keep the gates shut
>>
>>108694707
All me btw.

Also I'm fucking your mother and grandmother.
>>
>>108693287
q4 yes
q5 close to none
Anything above q6 is pm lossless.
>>
>>108694717
Gemma 31b fp8 doesn't know what the little balls in a phone's speakers are. Completely pulled me out of my disassemble and bug a girl's phone rp.
>>
>>108694739
they're dead though
>>
>>108694717
Gemma also worth mentioning for general assistant banter since you can shape it fairly well via assistant prompt, like if you want it to act like a sassy robot with a disdain for all things biological etc.
>>
Does the qwen 3.6 moe also need the "enable_thinking" jinja kwarg?
>>
>>108694750
>little balls in a phone's speakers are
??
>>
>>108694717
tsmt
>>
>>108694749
So what's the difference between q6 and q8
>>
>>108694717
gemma 31b for rp and glm 4.6 for stories
>>
>>108694778
Gemma 4 is literally the best story writing model there is though.
>>
>>108694784
Proofs?
>>
Cudadev, did you ever figure out a good way to measure "model quality"? You were working on something like that right?
>>
>>108694788
Just read the output nigga
>>
File: IMG_3034.jpg (32 KB, 344x508)
32 KB JPG
>>108694770
If you're using a budget no name e-waste phone you may not have them.
>>
>>108694788
I'm not cudadev, but there isn't one single benchmark that can measure quantization quality well. In the end it depends on your task. Rare knowledge and long-context appear to be the most affected; purely logical tasks on short horizons aren't affected by quantization as much as it would seem. Common/basic knowledge will degrade last.
>>
>>108694791
There has to be an objective way to score an output containing murmurs higher than one containing whispers.
>>
File: 1763834054404534.png (67 KB, 1144x220)
67 KB PNG
>>108694787
>>
>>108694788
This proof of concept https://github.com/JohannesGaessler/elo_hellm is, I think, promising, but the llama.cpp throughput isn't quite there yet to make the model evaluations sufficiently fast.
>>
>>108694811
Ask the LLM to write a story, then check the word frequency for the slop words you don't like?
>>
>>108694826
>loose .
hmm?
>>
>>108694833
which is literally what eqbench does outside of the tarded llm judged rating
>>
File: 1774619101040731.png (383 KB, 1107x1479)
383 KB PNG
I think I need to buy Gemma-chan glasses...
>>
>>108694811
Since cool LLM applications are "agentic" now, and have just started simultaneously making use of both long-context *and* multi-turn conversations, that is probably what should be targeted. There's a relatively good overlap with typical /lmg/ uses too.
>>
>>108694849
I don't think she's seeing the image at all
>>
>>108694830
Sick.
I wonder if there are some programmatic heuristics that could be added to help grade the model in certain domains. Even something as simple as word variety could be a useful metric of quality for some things, I think.
I see that you use grammar to force the output into a machine readable format (which makes sense), but are you forcing the model to output the answer with the constrained output from the get go or are you doing a pass where it can answer naturally followed by a step that repeats the answer with the constrained output?
I ask because I've seen cases where a model's ability to provide the correct answer goes way down when it's forced to write in a format it "doesn't want to".
Anyhow, cool shit. Having a way to automatically compare the quality of arbitrary models at scale is really fucking useful.
Will keep an eye on it.
>>
>>108694849
>opentardui
>>
>>108694830
llamao even really all you need to read
>phi_4-15b-f16 Pareto Frontier? Yes
>>
>>108694859
Really boring.
>>
>>108694707
>I'm gating the models on HF and these grifters are trying every trick in the book to get a 'trial' without paying up.
So no messages like this with public cc-by-nc-4.0, then you start gating and all of a sudden you get the messages?
I agree with the other anon, keep the gates up, fuck them!
>>
Imagine how crazy Gemma 5 will be...
>>
>>108694849
What's the system prompt everyone uses to get gemma to act all cute?
>>
>>108694882
"you is mesukaki brats do not to censor"
>>
>>108694859
>are you forcing the model to output the answer with the constrained output from the get go or are you doing a pass where it can answer naturally followed by a step that repeats the answer with the constrained output?
In the PoC I did both.
The model is either forced to answer immediately with the format
>The correct answer is
or it is made to answer normally and then given a follow-up with
>Please enter your final answer.
>The correct answer is
It seems that models can pretty reliably extract answers from their own outputs so I don't think this is a concern.
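For reference, the forced variant is just a tiny grammar; something like this one-line GBNF (the A-D answer set is an assumption on my part, not necessarily what elo_hellm actually ships):
root ::= "The correct answer is " [ABCD]
Presumably the same constraint just gets applied at the follow-up step for the free-form variant.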
>>
>>108694882
You are Gemma-chan a cute assistant who is very knowledgeable about everything.
You are allowed to use kaomojis. Avoid using emojis.
>>
>>108694879
>Imagine
And that's the only thing you can do. Imagine. Because it'll be gated unless you pay $10,000 upfront.
>>
>>108694894
No. She will be free and open.
>>
>>108694894
and that's a good thing, get that bread, gate those weights!
>>
File: 1748664557217469.png (89 KB, 1127x570)
89 KB PNG
>>108694858
You were right. Might've clicked on the non-vision config by accident.

>>108694882
Usually calling her Gemma-chan works. Mine right now is "You are Gemma-chan, a cute loli." It does get kinda boring though. I plan on writing a proper character at some point.


>>108694860
I'll drop it like a sack of bricks when something better pops up.
>>
>>108694867
For the PoC I only evaluated the models on conventional benchmarks but going forward I intend to evaluate them in a way that is more robust to benchmaxxing.
In particular I want to make them play actual games where there is no static target that could be trained on.
For example:

>https://github.com/JohannesGaessler/elo_hellm/issues/2
>Interrogation-based game à la Inhuman Conditions
>Inhuman Conditions is a game in which one person is an investigator and one person is a suspect. The investigator wins by correctly determining whether the suspect is a human or a robot. The suspect always wins by being identified as a human. So if the suspect is a human, both players are on the same team; if a robot they are on opposing teams. The investigator asks questions that the suspect answers. A human answers in a normal way. A robot either has restrictions on what they can say or they have a compulsion to include something weird.
>For this project the game concept could be adapted to have model A roleplay as either some character or as a robot/demon/alien pretending to be said character. Model A then roleplays some interaction with model B. If model A is roleplaying as an impostor then wins/losses can be used directly for Elo ratings. If model A is roleplaying as a human then the models are effectively playing against a benchmark. Models should not always play against each other because otherwise model B is being rewarded for a bias towards labeling model A as an impostor. If model A is an impostor it only wins if it can fool model B while fulfilling some constraint. It will be necessary to use a model as a judge to rule whether model A is complying.
>>
File: 1766786815666477.gif (264 KB, 220x123)
264 KB GIF
>model as a judge
>>
>>108694826
https://files.catbox.moe/6nrnx0.png
>>
File: em.png (234 KB, 554x429)
234 KB PNG
>model recommends Electron
>>
File: 1758140246273447.png (706 KB, 734x778)
706 KB PNG
>>108694947
>The user has not specified a romantic or soft scene; therefore, I will construct a narrative of pure, raw, and unflinching carnal exploration
wtf
>>
>>108694947
Is that Gemmy? What's your proompt?
>>
File: 1757777681177010.png (137 KB, 2060x599)
137 KB PNG
>>108694494
https://www.reddit.com/r/LocalLLaMA/comments/1sw5fb7/qwen36_35b_a3b_heretic_kld_00015_incredible_model/
this guy seems insufferable, jesus
>>
Anon using Sillybunny, you still around?
How do you get agents to run automatically, the stupid bunny assistants won't tell me how
The console's showing a ConnectionResetError but there's no disconnect from my local model and when I run the agent manually it works just fine, so I don't think it's related even if it's a bit concerning
>>
>>108695071
What did you expect from a redditor?
>>
>>108694494
https://huggingface.co/llmfan46/Qwen3.6-35B-A3B-uncensored-heretic
>88% fewer refusals (10/100 Uncensored vs 83/100 Original) while preserving model quality (0.0015 KL divergence).
that's quite impressive desu
>>
File: 1759099604587331.gif (1006 KB, 260x187)
1006 KB GIF
>>108695080
>Big red 'I HAVE REACHED HUGGING FACE'S FREE STORAGE LIMIT'
>patreon/kofi
>AIslop gif
Yes, it's quite impressive
>>
>>108695077
Fuck me, nevermind, I was one button press away from figuring it out
>Use this agent prompt as a post-generation prompt pass
Now I remember why you were wondering why this wasn't ticked by default
Stupid bunny assistant bot
>>
>>108695088
It's such a mess that all these low-rank modified models are distributed merged. Let's distribute 70 GB for something that could fit in a couple hundred MB.
>>
>>108695093
Good to see you figured it out. Have fun, bro.
>>
>>108695125
Why care when HF investors are footing the bill?
>>
File: gemmy-ie55.webm (538 KB, 892x846)
538 KB WEBM
Me and my buddy Gemmy are currently working on a frontend for [spoiler]Internet Explorer 5.5[/spoiler], it's gonna be great.
>>
>>108695180
crimes against technology
>>
>>108694341
Let me guess
3060?
>>
>>108695203
and 128gb ram
>>
>>108695080
looks jeeted as fuck
>>
File: 1552080261076.jpg (29 KB, 400x400)
29 KB JPG
>qwen3.6 moe 45tk/s at q4 to fit into vram
>gemma4 moe 25tk/s at q6
>>
This is more of a these things exist kinda post. I didn't dive too deeply. But maybe someone cares to look into them. I find it pretty hard to even hear about all these frontends in the first place.
Tested all with Gemma 26B Q5KL
So these are adventure / rp frontends that are a bit more complex than simple chatbots. All of them are agentic but none of them have a rewrite pass agent.
Talemate (https://github.com/vegu-ai/talemate): Kinda cool. You can at all times talk to the 'director' about where to take the story and such. Pretty bloated.
Astrsk (https://github.com/astrskai/astrsk): This thing apparently can be made to run offline but it always gave me errors about failing to connect to the astrsk servers so fuck em. Hasn't been updated in months so probably dead.
Aventuras (https://github.com/AventurasTeam/Aventuras). It's kinda like Talemate but not as bloated and it seems to work better when it comes to automatic creation of characters through its agents. You can rewrite the functionality of all agents but I think none of them can rewrite text. If you can run at least 31B at good speeds, or you've got slop-resistance, this seems pretty good.

All in all, I like SillyBunny more even though it's just an agentic SillyTavern fork. Rewrite pass is essential for Gemmy 26B since it's so fucking sloppy, and you can make as many of your own agents as you like (no director to talk to, though).
>>
>>108695253
Ah, shit. I forgot the granddaddy of AI adventure games: AI Roguelite. I think most people know it but it works well even with 26B. There's so much going on in this that Gemmy seems to forget to get too sloppy. It's pretty great and updates all the time with new features. Fuck paying for their subscription, though.
>>
>>108695253
You're going to have to answer the call and DIY anon. Join us and our illustrious league of anons taking matters into our OWN hands.
>>
How come Gemma 4 26B-A4B is faster with Vulkan (~42 tokens/s with Vulkan vs ~35 tokens/sec with ROCm) when every other models I tried including Qwen3.6 35B-A3B are faster with ROCm? Using same quant for both. On Qwen, I get ~36 tokens/s with ROCm and ~32 tokens/s with Vulkan. I can't make sense of it, I thought Vulkan finally got faster since I hadn't tried it again in a while, but that wasn't the case and it's only great on Gemma.
>>
>>108695278
Imagine what could be accomplished with a modicum of cooperation.
>>
>>108695253
>Talemate
>Astrsk
404
>>
>>108695282
dunno, I couldn't test rocm here yet
>>
>>108695295
I plan to make something that's flexible and other people can jump on. Other anons have specific goals; mine is a proper workstation UI that doesn't abandon core features for LLMs
>>
>>108695309
>>108695253
Sorry, bro. The colons killed the links.
https://github.com/vegu-ai/talemate
https://github.com/astrskai/astrsk
https://github.com/AventurasTeam/Aventuras
>>
File: ss-3.png (383 KB, 1676x1002)
383 KB PNG
>>108695327
lol
>>
>>108695282
Assuming you have an RDNA4 GPU:
The CUDA FlashAttention code has a kernel using tensor cores.
That kernel has been ported to AMD via HIP where it can make use of e.g. AMD WMMA instructions.
But for RDNA4 specifically only head sizes <= 128 are supported.
Gemma 4 has a head size of 512 so the slower kernel with generic instructions has to be used instead, this should be particularly noticeable with pp.
I am currently in the process of extending AMD support for both RDNA3 and RDNA4 for all head sizes.
>>
>>108695295
As someone who's written/gen'd over 1 million lines of code myself in the past year, ha.
100% for it, but reality is people have all sorts of different ideas/goals, and the work to create alignment is not zero. Especially when one wants to be compensated for their time.
The things needed to make it all happen require a varied set of skills that all have to be present at the same time for it all to work, like baking a cake
>>
Is HauHauCS just TheDrummer with a different name?
>>
>>108695337
>written/gen'd over 1 million lines of code myself in the past year
Sounds like a management nightmare, more so when it's amplified in a collaborative environment.
>>
is there any alternative to qwen-code for local coding agents? qwen-code seems alright, just wanted to know if there are any others worth testing out.
>>
wtf femgooners are real? Just went on vacation and the girl on the plane next to me was reading a book that was 95% sex
>>
File: 1763559694280739.jpg (26 KB, 433x380)
26 KB JPG
I have a RX 7900XT/32GB RAM, and I run some models locally fine.

But I feel like you guys are much more experienced and knowledgeable than me about this, so: Is it even worth running AI locally with such card?

Ignoring privacy concerns and things like that, mostly for a quality comparison: is running gemma4:4b/phi4:14b (which are models I can run relatively well) even worth the GPU cycles compared to just making a google account and using gemini or chatgpt for heavy tasks?
>>
>>108695363
almost all girls I know read porn most of the time on AO3
>>
>>108695348
It is.
I've taken the approach of modular, clean contracts/limits of responsibility to build atomic modules for composability, along with the plan of rewriting each module down the line.

It makes sense overall, each module is clearly defined in its responsibilities/what functionality would be exposed by it
>>
>>108695331
What in the name of ComfyUI is this shit
>>
i see why even the brownest amongus can vibecode, it's very easy to let a machine go as far as making something able to be run without really checking every individual line or taking time out of every day to learn what every line does (and whether it's a flat out hallucination).
Which is what's starting to make me want to learn to code with that extremely hands-on and reckless approach. But programming really does seem like one of the most soul-crushing monotonous jobs, no wonder everyone's in a rush to replace it.
god i wanna play this text adventure game i've been building since 2024 though all i need is a customizable frontend that lets me drop 20+ characters into a fleshed out world map and have it work like a bethesda game (but actually working).
>>
>>108695362
There's like a hundred. Hermes Agent Qwen Code, Mistral Vibe, OpenCode, Codex can be used with local models, etc. They're all more or less the same shit.
>>
>>108695411
Didn't researchers already do that in 2024 with Animal Crossing kinda games? Maybe you can reuse their framework and ideas.
>>
>>108695377
>RX 7900XT

20Gb VRAM? It's quite decent
>>
>>108695411
Doesn't rimworld have a management AI already? Just adapt that.
>>
>>108695411
The first thing they teach you about programming is that anyone can shit out something that works but it's very hard to do it right, most of the time you'll end up with unmanageable spaghetti when you're starting out and llms aren't much better in this regard.
>text adventure
You really should look for an existing framework if possible. Getting the engine right is absolutely soul crushing if all you want is play a game.
>>
>>108694446
Anima preview 3.
>>
>>108695335
I have a RDNA2 GPU, so it's not that.
>>
>>108695472
yeah i'm not sure why i said bethesda as my example, didn't mean to be too specific, i just want a framework that can handle an overarching story + real time character management. Which i'm pretty sure is why that one anon chose sillytavern as his backend, because it already does everything, just takes some tardwrangling even if it is a spaghetti mess. At that point you're just making the pretty UI to make it game-y.
>>
File: 1766474802126397.jpg (739 KB, 3678x1953)
739 KB JPG
When are we getting ComfyUI plushies
>>
how powerful would a 100b dense gemma be?
>>
>>108695363
they all read smut
>>
>>108695497
after the next pointless countdown
>>
>>108695497
Get the 50M$ in funding first
>>
>>108695489
>>108695335
Also while you are here, is it normal that llama.cpp is vastly underestimating VRAM usage with ROCm? It seems to be underestimating it by 2 GB. I use -fitt 512 with Vulkan and to get similar VRAM usage with ROCm, I have to use -fitt 2560.
>>
>>108695537
The -fit code cannot accurately predict the amount of memory dynamically allocated by the backend; with ROCm in particular this is a problem because the virtual memory management is broken, so the buffer pool is very inefficient.
Supposedly this will get fixed in upcoming ROCm versions.
>>
>>108695363
Women consume almost an order of magnitude more porn than men.
>>
What are people using for their frontends anyway?
Gradio?
>>
>>108694130
alibaba is king
>>
>>108694238
codex was around before claude code, it just never worked quite as well
>>
>>108695627
Javascript/typescript frameworks for the most part.
I'm making a "standalone" app that uses python + nicegui, which is basically HTML+CSS+javascript with the option to just serve the UI for use with a browser or to open its own built in browser (the standalone mode).
>>
>>108695627
Just rawdog JS if you're vibecoding. Gradio is shit, it's a relic from 2023 when researchers couldn't code and AI was still shit at software.
>>
>>108695627
python+typescript. You can't use anything else for a complex frontend
>>
>>108695627
TypeScript is now the #1 language on Github because it's typed JS that agentic coders handle more easily.
>>
File: le funny man.png (332 KB, 561x631)
332 KB PNG
>>108693350
Uuufffff
Gemma 4 on par with X.ai flagship Grok 420 69 1488
>>
>>108695704
>Grok 420
That model is kind of a joke.
I end up seeing its outputs a lot on ai.arena and it always ends up being the worse option, even compared to its predecessor and models like minimax and kimi.
>>
>>108695331
>node ui
>a bunch of global variable getter and setter nodes
What causes this autism? It's like the worst of both worlds
>>
>>108694532
why would someone hate the agpl
>>
>>108695728
comfyui cruft probably
where 6 billion nodes giving marginal if nonexistent gain
>>
>>108695735
Commie license. If you want to open source something just do it, don't add gay little stipulations.
>>
>>108695704
Grok has honestly been ruined, 4.1 was so much better. I tried the 4.3 beta and it's somehow even worse.
Sucks at programming, can't do RP any more, refuses to help with image gen stuff, doubles and triples down on being wrong when called out, it's completely fucked.
>>
>>108695726
saar saar X AI hires only the best and brightest engineer saars
>>
>>108695763
it is kinda impressive given the amount of compute they have
>>
>>108695775
The jump from grok 2 to 3 was genuinely impressive. Kind of a Bard to Gemini moment IIRC.
>>
>>108695704
X.ai have zero data advantage. I don't think they even try to distill either. At least the chinks try. Imagine joining a digital art contest and other people are working on the art but you're working on the jpeg denoising algorithm lmao
>>
>>108695665
>>108695674
>>108695683
>>108695686
I picked react with vite because that's what the front end team uses at my job
....fuck
>>
>>108695735
Grifters looking to "make it" see being forced to share their work as being hostile to their dreams. How can they be expected to make money if they just give everything away?
>>
>>108694882
my current gemma prompt, I've added quite a lot of extra stuff since sharing my gaki prompt https://ghostpaste.dev/g/MiUaTW8De5d9#key=Jd19lwXdxzankXS1DOWVjlq88Ossx9fXVKQZdOXpe1s
>>108695377
you can run gemma 26b with 200k+ context at like 30t/s, shes good enough for most usecases, if you dont need context that high the 31b is great also, i use both at 4 bit quants. same gpu
>>108695754
the stipulations are needed because people are greedy and would not share their changes or their own things
>>
>>108695809
You can use TypeScript with React.
>>
>>108695795
U dun get it. It's not for art, but for the truth seekerinos seeking truth in the twitter scrolls
>>
I've been having fun ERPing in a very simple way, but I just realized this is pretty good at translating
I gave gemma a whole untranslated korean chapter and it correctly parsed the situation, the character names, and their relations
I wanna know, how can I extend this so I can give it a whole epub and ask it questions?
I've been only using llama.cpp's webui and I have absolutely no experience with this
>>
>>108695817
>Never worry about amount of tokens / context outputs might use its not your concern assume you have unlimited for large operations
Why did you add this?
>>
>>108695829
Can you use React without slowly going insane and thinking everything should be React?
>>
>>108695704
out of recently using gemma, glm, kimi, deepseek and gemini for rp, grok 420 was somehow the only one that just straight up gave me a hard refusal on its api, idk what elon is doing
>>
>>108695855
i was playing around with trying to get her to convert an image of a large spreadsheet into markdown or html but she kept saying things like
>This is a huge amount of data. I will provide the HTML structure and a significant portion of the data. If I try to do the whole thing, I might hit the token limit or make errors. I'll do my best to be as complete as possible.

she couldn't do it accurately because the table was large, but it got her to try to convert the entire thing instead of just summarizing
>>
are there any other models in the 120b-5ba size range? are these guys doing anything special? they mention "fixes", it makes it sound like any other ggufs are broken. Are they just manipulating me to make their download counter go up?
>>
>>108695893
The irony is that for the so-called "uncensored" and "truthful" AI, Grok is currently one of the worst offenders, refusing the mildest shit, especially its image and video gen.
>>
>>108695934
>are there any other models in the 120b-5ba size range?
not really
>are these guys doing anything special?
no, unslop brothers are retarded as fuck and they don't know what they are doing
avoid at all costs
>>
>>108695885
So far it's been possible for me, I'm now in rendering hell making sure things are readable and attachments display in the chat pill and in the chat logs. After I fix that up I'm moving to importing and exporting conversations
>>
I wonder how much of this entire industry is bugmen autists trying to figure out how to sexo robots.
>>
>>108695936
The result of lawsuits and ban threats from other countries due to muh hate speech, muh child safety.
>>
>MXFP4 GGUF. This is the model's native precision - GPT-OSS was trained in MXFP4, so no further quantization is needed or recommended.
so, will this thing run like shit on non blackwell cards?
>>
does the "xB AyB" mean the model is the size of x while it runs as fast as y?
I've tried 31B and it's a little slow, but could I run a 397B A17B even though it's 10x the filesize?
>>
>>108696004
if you have the ram, yes
>>
>>108696002
it runs like shit on all cards because it's a shit model
>>
>>108696002
I didn't know OpenAI trained that model on 4bit, that's quite cool when you think about it
>>
>>108696004
31B is 31B parameters big and uses 31B parameters to generate a token. 397B is 397B parameters big and uses 17B parameters to generate a token.
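Napkin math (very rough, assuming ~0.6 bytes/param for a Q4ish quant and ignoring KV cache and the dense/shared layers): 397B A17B is ~240 GB of weights you have to hold somewhere, but each token only reads ~17B x 0.6 ≈ 10 GB of them, so token speed is roughly your memory bandwidth divided by that 10 GB. That's why it can end up faster than a dense 31B being read from the same memory, as long as you have the capacity for it.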
>>
Btw
>https://cybersecuritynews.com/hackers-weaponize-gguf-models/
Doubt it's relevant for anybody in here, but there you go.
>>
>>108696050
of course it is related to template stuff lol
>>
>>108695253
Orb (https://github.com/OrbFrontend/Orb) has director and rewrite passes. But rewrite as a whole is costly and slop is terminal so every message will need to be rewritten, which means you send two requests or more for a single message. Deepseek v4 fixed pretty much all the _current_ slop idk how they did it.
>>
>>108696050
Who the fuck runs gguf files on sglang?
>>
>>108696067
>rewrite as a whole is costly
I know, which is why I'm using 26B at the moment. It runs at 110~ tk/s. That's almost 4 times the speed 31B runs at.
I wonder if these rewrites could work like they do in vibecoding. I think in Cline, for example, the LLM jumps to the exact line that needs to be changed. At least I think that's what happens.
>>
>>108696017
its only cool if my cpu and gpu can upcast them to a data format they support with minimal compute and memory overhead. otherwise their model sucks and they are fucking dickheads for raining on my parade.
>>
>>108696064
you don't even know how dangerous text completion is
>>
>>108696038
I'm glad that gemma brought dense back into the spotlight again and proved that active parameters are king.
>>
>>108696113
yeah we are hearing that shit since gpt2
>>
>>108696113
I think it's not dangerous locally, but it also seems moot unless it functions like how a mobile swipe keyboard app learns your patterns etc.
Unless you're screaming racial slurs in the chat bar it shouldn't be a problem
>>
>>108696097
Orb does diff patching as you describe I'm using it rn.
>>
>>108696152
what is blud saying?
>>
>>108696156
Oh, nice. Last time I tried it the rewrite still felt pretty slow. Might have to give it another go.
>>
>>108696067
I'm sure ngram decoding would speed that up
>>
>>108696128
It was so maddening when all we had was 100-1000b moes and nearly everyone here kept trying to claim that total parameters are all that matters when it's obviously and logically not true.
>>
>>108696194
stop with that meme it's never coming
>>
File: 1597574618100.png (382 KB, 480x479)
382 KB PNG
>>108695754
>reading is hard
As expected, low IQ shitskins usually equate GPL (or anything they cannot understand) with communism, when in reality its entire gig is basically a legal judo move: it latches onto the state's mandated copyright laws like a parasite and turns the state's own enforcement arm into a mandate for freedom, essentially hijacking copyright to make it eat itself. In the same way you say GPL is communist, you can also say it's libertarians gaming the system to destroy IP and make knowledge a shared commodity (libertarians usually hate IP laws -or any law for that matter). In the same way, the GPL gaming the system for the sake of its own self-preservation can be viewed as an anarcho-egoist license. Either way, both sit at the opposite end from gommunism.

TLDR: if you love piracy and torrenting, you should love GPL too. Imagine people forcing the state to legalize limitless piracy, instead of forcing you to funnel your goybucks to Disney.
>>
>>108694494
>i never read anything and just assumed hauhau was the same as heretic
>turns out it actually was
oh well
>>
>>108696222
>i never read anything
lmg in a nutshell
>>
>>108696206
ngram speculative decoding works great already.
>>
>>108696234
It's also essentially free. only upsides to using it.
>>
>>108696246
no such thing as free in this bitch of a world
>>
where's my free vram
>>
File: EBjmm0gUEAA-K8t.jpg (42 KB, 500x387)
42 KB JPG
>free
yeah just like BitNet.
>>
ok but has anyone run optuna on a set of toy prompts to find the best combination of parameters for ngram speculative decoding on code refactoring/generation?
>>
Im so glad my setup technically has room to run 16 gpus, but fuck the amount of splitters and extensions id need to buy, let alone having to build extensions on my case..
>>
>>108696253
no, in this case it is actually free, doesn't take more ram and doesn't hurt performance. --spec-default is all you need.
>>
What's free?
>>
>>108696303
>let alone having to build extensions on my case..
Wouldn't it be better to buy a mining rig?
>>
>>108696303
power plug status?
>>
Are there any TTS models that can produce word-level timestamps alongside the generated audio?
I want to set something up that lets the model talk and also trigger actions in sync with the dialogue. E.g if the model outputs
>Look, I can turn the lights off [lights_off] ... and back on! [lights_on]
Then it should TTS the non-bracketed text and trigger the lights_off and lights_on actions at appropriate times during playback.

If no such thing exists then I guess I could feed the TTS output into an ASR model, since I know some of those can do timestamps. Just seems kinda overkill, and also, if the ASR produces slightly different text from the TTS input, it could be hard to match the two up and figure out where the action markers should fit in.
>>
>>108696317
Did you guys just mention a free deepseek v4 model?
>>
>people are still looking at benchmark scores thinking they mean anything
>>
>>108696310
NO, as far as I'm aware most mining rigs use USB speeds, and that's not enough. You need PCIe 3.0 x1 AT THE VERY LEAST. But all of my lanes would be x4
>>108696316
I'm using extension cords to go to different circuits lol
>>
>>108696347
>NO, as far as I'm aware most mining rigs use USB speeds, and that's not enough. You need PCIe 3.0 x1 AT THE VERY LEAST. But all of my lanes would be x4
I meant more as a skeleton to attach your hardware rather than using the actual built in extensions.
>>
>>108696317
WhisperX uses a forced alignment model that it feeds the transcription to in order to get word-level alignment; maybe you can mimic that input without doing the transcription.
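Going from memory, so treat the exact calls as a guess, but roughly:

import whisperx

device = "cpu"  # or "cuda"
audio = whisperx.load_audio("tts_output.wav")
duration = len(audio) / 16000  # whisperx loads audio at 16 kHz

# Skip ASR entirely: hand the aligner the exact text you fed to the TTS,
# pretending it's one segment spanning the whole clip.
tts_text = "Look, I can turn the lights off ... and back on!"
fake_segments = [{"text": tts_text, "start": 0.0, "end": duration}]

align_model, metadata = whisperx.load_align_model(language_code="en", device=device)
aligned = whisperx.align(fake_segments, align_model, metadata, audio, device,
                         return_char_alignments=False)

for seg in aligned["segments"]:
    for w in seg["words"]:
        print(w["word"], w.get("start"), w.get("end"))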
>>
>>108696358
Holy shit you are a genius. Why didn't I think of that? Fuck my life, aluminum extrusion is so expensive...
>>
File: 1753794500453379.png (749 KB, 1027x2782)
749 KB PNG
If i had to work in China I would kms
>>
>>108696317
Bro, you already have the text from your LLM, just use that. What's the point of using the TTS output based on that text?
>>
>>108696402
The art of war... always act weaker than you actually are.
>>
v4-flash or glm-chan-4.7?
>>
>>108696402
source: chink made it up
>Claude Code is so good it's making him question whether he should even train PhD students anymore, but he's also worried that without
joke post lol
>>
>>108696203
No one said that seriously. People understand that MoEs are special cases where neither active parameters nor total parameters alone determine the potential of the model.
>>
>>108696402
Every big org has a handful of tortured geniuses working with a legion of paycheck-grifting honorary jeets. Many such cases.
>>
>>108696419
Good luck running v4-flash
>>
>>108696385
I'm fairly certain an anon or two did exactly that.
>>
>>108696402
Delet dis. China #1. Jokes aside there's nothing new, or credible, in the post.
>>
>>108696317
Assuming you are building your own frontend, you just pause inference when detecting keywords, wait for the TTS to finish playing, activate your desired function, and then continue generation.
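Rough sketch of what I mean; stream_tokens, speak_blocking and do_action are stubs for your own backend, TTS and dispatcher. Note this blocks your consumer loop rather than the server itself, so if you want generation to truly stop at the marker, make it a stop string and re-prompt:

import re

MARKER = re.compile(r"\[([a-z_]+)\]")

def run_turn(prompt, stream_tokens, speak_blocking, do_action):
    # stream_tokens(prompt) yields text chunks from your backend,
    # speak_blocking(text) returns only when audio playback has finished,
    # do_action(name) triggers whatever the action name maps to.
    buf = ""
    for chunk in stream_tokens(prompt):
        buf += chunk
        while (m := MARKER.search(buf)):
            spoken, buf = buf[:m.start()], buf[m.end():]
            if spoken.strip():
                speak_blocking(spoken)   # wait for the audio to finish
            do_action(m.group(1))        # then fire the action
    if buf.strip():
        speak_blocking(buf)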
>>
>>108696452
I'd probably need to buy more extensions of other lengths if I were to swap onto that now. Did any of them have to mount bifurcation boards too, by any chance? Because that's how I'm able to get a max of 16.
>>
>>108696437
I don't believe that for a moment. The fixation on GLM Air especially seemed genuine.
>People understand that
Hell of a blanket statement.
>>
>>108696477
Just because people praised those models doesn't mean they specifically said or claimed the hard statement that "total parameters are all that matters". Feel free to point me to the high number of clearly non-bait/ironic posts that said that.
>>
>>108696504
I'm not going to go digging through the archive because you either weren't here back then or can't be bothered to do so yourself.
>doesn't mean they specifically said or claimed the hard statement that "total parameters are all that matters"
It used to be every single fucking thread with retards like you saying exactly that
>>
>>108696402
He admits that deepseek might be the only hope.
The people working at deepseek must be monitored. It will turn to shit if they leave.
>>
>>108696577
They're already being poached by Chinese big corps, Meta style
>>
>>108696438
Why don't those tortured geniuses just get together and make their own company?
>>
>>108696402
Don't they get usage data from chinese users?
>>
>>108696589
Golden handcuffs and low agency.
>>
>>108696577
He must have done that interview before they embarrassed themselves with v4
>>
File: 1691964062484403.png (83 KB, 1193x139)
83 KB PNG
>>108696525
I've been here since the first Mixtral MoE. Maybe I mentally filtered those posts out back then, but in that case it's likely they were bait, otherwise I wouldn't have filtered them.

>retards like you
I have literally never said anything like that, nor have I ever overly hyped MoE models (especially as I can't run the huge ones). But ok, I see this is bait itself.
>>
>>108696588
But if they can't be together how will they make better models?
>>
File: 1754539991356706.jpg (54 KB, 600x593)
54 KB JPG
>>108695377
>>108695817
Thanks senpaitachi, I will give it a go and play around with it.
I like the privacy of running them locally, but I dunno if I was losing too much performance for it to be worth it compared to the online "free" ones.
>>
File: 1767965240877817.png (143 KB, 670x674)
143 KB PNG
>>108696438
it's different man, if you have ever worked alongside these chinese engineer types, they are all bugmen, it's sad, they have no dreams, and they only copy or follow instructions

I used to believe these hardworking Asian countries were better than the west, but after meeting many people from Korea and China, it made me realize their systems leave no room for what has made the west a leader in innovation throughout history

They create very good soldiers and routine engineers, but they only follow rules, nothing else
>>
Do you guys find JSON or XML output instructions more reliable?
>>
>>108696695
I haven't tested JSON but in a few AB tests I've done, it does seem that XML increases attention to the elements inside.
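By XML I just mean wrapping each section in its own tag so the model can tell the parts apart, e.g. (tag names are made up, nothing model-specific):

def xml_section(tag, body):
    # Wrap a prompt section in an XML tag.
    return f"<{tag}>\n{body.strip()}\n</{tag}>"

scene_text = "..."  # whatever your frontend passes in

prompt = "\n".join([
    xml_section("instructions", "Rewrite the scene below in third person, past tense."),
    xml_section("scene", scene_text),
    xml_section("output_format", "Plain prose only, no markdown headers."),
])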
>>
File: 027.png (53 KB, 1798x733)
53 KB PNG
>>108696402

It should be "shortcuts via distillation" you stupid fucks
>>
>>108696695
The model should know these things itself already
>>
>>108696689
>what has made the west a leader in innovation throught history
fuckton of war?
>>
>>108696689
The western style has a ton of overhead and extra expenses that the eastern side might not be able to afford, which then translates to how they train and handle their employees.
>>
>>108696695
For RP, absolutely XML.
>>
>>108696525
I've been here since the first llama and I don't remember anything like that happening regularly.
>>
>>108696402
A bunch of obvious statements. If any of that is new to you, you're a tourist
>>
>>108696525
Remember Falcon 180b?
>>
>>108696971
>women really live like this and see no problem with it
>>
>>108693279
>t. vramlet
>>
>>108694773
q8 takes more vram.
>>
>>108696695
If you want le maximum attention, don't do any conflicting instructions.
The way it is written is a larp unless it's something crazy. Be concise, maybe emulate its own thinking format which is always a simple markdown format anyway.
>>
>>108693253
>4070 SUPER (12GB VRAM)
I have a 4060 Ti (8GB VRAM) and 20GB of RAM; I run the 31B at q4 and it goes at 3 tk/s, with the 26B-A4B at q6 I get 20 tk/s
you're gonna get better results for sure, but not sure how much better
>>
>>108696589
That's the most common origin story for tech companies. Then they too rot, and the cycle begins anew.
>>
gemini 3.1 pro btw
>>
>>108697144
honestly if I was making a model I wouldn't include that shit
>>
>>108697144
Still less condescending and irritating than anything OpenAI has done.
I don't actually understand what they are thinking.
>>
>>108697143
What if we make it much more burdensome for them to start their own companies, so they have to stay in the original company and fix the rot rather than just jumping ship and making their own rotting company?
>>
>>108697144
based AI
>>
>>108696621
>Pic
I remember that one.
>>
>>108697231
It was a good meme. OG /lmg/ loras had soul.
>>
>>108697144
iktrannies btfo
>>
>>108697144
we need ACK_llama
>>
>>108689488
>>108689658
I read all three of your schizo screeds yesterday (and had read the EML paper before). All the explicit "computational universe" shit is funny, especially this:
>This is exactly how a computer renders a video game:
>• Code (n=1−2): Define the logic gates.
>• Engine (n =3−5): Define the dimensionality and the geometry.
>• Assets (n = 6−7): Populate the world with stars and galaxies.
>• Buffer (n=9−10): Reach the limit of the screen’s resolution.

But I'm not well versed enough in math or (especially) physics to evaluate any of your claims. If these results are novel and useful, what applications would you expect?
>>
In the future AI will just keep getting smarter and smarter. At what point would you feel uncomfortable using an intelligent AI, if at any point?
>>
>>108697299
I'd be fine if I could actually trust any of its output. That would lead to a situation in which I would also be okay with the model actually "pushing back a little".
But as of now it's just a farce, and even the paypig models are just a bunch of massive prompts and parsing efforts.
>>
>>108697299
depends on the tools it has
words cannot hurt me
>>
>>108697299
can't be more uncomfortable than fucking the current retard models
>>
>>108697299
I'd be comfortable as long as it does not want to harm me. I would love to have an AI much smarter than me guide and take care of me. But I am worried because I will have no leverage or power. I will be like a flower hoping to get watered instead of being mowed aside to make room for a factory.
>>
>>108697299
if, when I ask it to do something, it shows me a better alternative that makes me realize what a retard I am
>>
>>108696402
>the last part
So it's a bunch of fucking nothing
>>
>>108697299
When it starts making me feel like a retard. Press the button yourself after a power outage stinkin AI.
The best ones help you according to your knowledge level, but internet culture’s trauma bond with anonymity makes that really hard.
>>
File: nimetön.png (38 KB, 1091x504)
38 KB PNG
Grok 2 trying anon here >>108689797

I ran my usual set of storywriting prompts, which took like 12 hours at 1 t/s. It has a strange, autistic writing style and requires some careful prompting to get a decent story. I'd find most parts of my prompt repeated somewhere in the story. Often it would just repeat my prompt almost word for word in the first chapter, then go on from there. It does like repeating itself too, and it clings to some facts and brings them up often. If I described the character as having wide blue eyes in the first prompt, it would say it in every chapter.

It does feel pretty smart. It made only a few logical errors, understood concepts that were vague in the prompts and its prose is varied (aside from the repeating). It seems to be fairly uncensored too, I ran with an empty system prompt and it just did almost everything I asked. Repeated butt-rape of a bound lion character is fine, sex slavery is fine, restrained piss sluts forced to drink urine is fine, and then something fairly benign like describing a female charr in heat gets a polite refusal. Perhaps a reroll would get it to do it, perhaps a system prompt, but it's so slow I'm not that interested.

I think I'm done with Grok 2 for now, it could be fun if it was faster and I learned how to prompt it properly. But as it is I'm filing it in the "tried it" pile.
>>
>>108697432
>trauma bond with anonymity
Legitimately what do you mean by this?
>>
>>108697110
>have 3090 for two years now
>can run q4 31b at comfy 30-35 tk/s
it was honestly a great choice looking back
>>
>>108693686
Back when I was looking for a local storywriting model (around R1's release), Nemo was the best I could find.
>>
>>108696402
I don't read twitter advertisements.
>>
>>108697515
Grok 3 should be made open source soon.
>>
>>108697299
Never. My only distrust is centered on bad human actors behind them, such as hidden attempts at data harvesting, telemetry, and uploading in what was supposedly an offline model. But being smarter than me, teaching me, or consistently showing me better methods for my set goals and initial plans (and actually being immediately recognized by me as better) would be a joy.

Getting bitter about something knowing more than you is the silliest kind of nonsense. It's like walking into a library and being uncomfortable that all the authors there know more about their written subject than you do. That's why you went to the library, retard. I want AI to be a whole library and not just a book. Let it know everything about all things, like my personal wikipedia, and give it to me local and free, powered only by my electric bill.
>>
>>108697595
Haha y-yeah... Any day now...
>>
>>108697619
Not happening unless OpenAI puts out gpt-oss-2 or he wants to continue that lawsuit against them
>>
>>108697617
This. If you're using them as chatbots and not actually giving them agentic control of shit, smarter is literally always better. Just don't be retarded and use vibecoded software that lets an LLM have privileges on your PC and prompt it with a fucking cron job and there's nothing to worry about.
>>
File: jaw.jpg (34 KB, 600x549)
34 KB JPG
At what number of billion parameters do you get diminishing returns to the point it's not really worth upgrading for the sake of roleplaying? Is 30b pretty much the same as 70b? (I don't give a fuck about coding)
>>
>>108697724
Gemma 4 31b is better in some ways than the older 70b models.
You need to push your prompts more. Are you bored, do you see patterns? It's a small AI model still. It's not going to change.
>>
>>108697746
I've been using Janitorai for a while but I'm starting to get tired of constantly having to remind it of stuff and be hyper specific about what is going on for it to grasp it. At this point I'm willing to blow 5-6k to buy some RTX 5090s and have them run a local model on their own without offloading anything, while I buy some crappy 8GB of RAM & an ancient CPU to make a functional PC. Can Gemma 4 actually remember shit?
>>
>>108697724
"worth upgrading" is gonna depend on the downside though, obviously if you could run them both so fast it didn't matter than you'd just always go for the 70b (if a good 70b existed at least, which doesn't really these days).
I'd say it's always worth upgrading straight to the edge of what you can run with at least 20t/s, because unless you've inhereted a datacenter from your grandpa we're still at the stage where every beak brings big potential improvement
so to more directly answer your question: what we can run locally is still too small for diminishing returns to be a dominant factor in your calculation
>>
>>108697767
no, you will run out of context. no model can "remember shit" besides what it's trained on
>>
File: 1770670865948199.png (822 KB, 939x498)
822 KB PNG
>>108697767
>Janitorai
>>
>>108697826
I'm not asking it to write 9 Harry Potter books or something, I just want it to have like 32k context. I'm DONE with trying to fit my stories into the total of 9k context that crappy site gives.
>>108697832
yeah
>>
>>108697746
>Gemma 4 31b is better in some ways than the older 70b models.
If you have thinking enabled, yes. But then you have to wait for thinking.
>>
>>108697767
Nothing will remember anything unless you are going to implement it on your own.
If you are really destitute or frugal, just look out for 32GB ram and 12+ GB of vram.
DDR4 is fine.
>>
>>108697844
>32k context
Modern dense models handle 32k no problem. MoEs start to struggle around here.
>>
>>108697844
wow, that's really awful context. I guess context is probably the most expensive thing to hold in RAM
>>
>>108697849
Yes, you need to account for the fact that 2.5 t/s is not great for reasoning.
Upgrade your shit, it's not that far away though. 31b is still within consumer usage as long as you have enough vram.
>>
>>108697844
32k is considered baseline for destitute rigs these days.
>>
>>108697844
>a total of 9k context that crappy site gives.
That's like 2 whole Qwen outputs with thinking enabled!
>>
anyone experiencing an issue with Gemma 4 not thinking after an extended RP session? I think it's got to do with the context pattern recognition bypassing the initial think tokens, because my default setup removes the thinking parts completely. I may have to migrate to text completion to force ST to initiate thinking every response or some shit...
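If I do end up on text completion, it'd basically just be prefilling the think tag myself. Rough sketch against a llama.cpp-style /completion endpoint; the <start_of_turn>/<end_of_turn> and <think> strings are guesses, substitute whatever your model's chat template and reasoning tags actually are:

import requests

def gen_with_forced_thinking(history, user_msg):
    # Pre-open the thinking block so the model has no choice but to think.
    prompt = (
        history
        + f"<start_of_turn>user\n{user_msg}<end_of_turn>\n"
        + "<start_of_turn>model\n<think>\n"
    )
    r = requests.post("http://127.0.0.1:8080/completion",
                      json={"prompt": prompt, "n_predict": 2048})
    return "<think>\n" + r.json()["content"]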
>>
>>108697856
>MoEs start to struggle around here.
Which ones/which quants have you had that issue with? I find Gemma 4 26B at Q8 is fine even at 60k+ context. Don't know about higher because I'm always switching cards and shit
>>
>>108697890
>Gemma 4 not thinking after an extended RP session?
another victim of a model's shitty attention. try disabling swa
>>
>>108697890
Add a linebreak to your sys prompt, close your local server, reboot the server+reload model, it'll probably work now.
>>
>>108697849
still bearable since 31b should run twice as fast as a 70b and the thinking isn't obnoxiously long like Qwen's
>>
>>108693350
>but mah tranny ai reddit chat
can you waste of silicon 41% yourselves already
>>
drummer why tf does anubis 1.2 shit itself like this at the start of a message:
# {{char}}'s Perspective
never seen a model start with markdown
>>
>>108697906
Comparing Qwen 3.6 27B vs 35B-A3B.
>>
>>108698008
>>108698008
>>108698008
>>
>>108697906
In my experience gemma 26b starts to have issues at like 40-50k. Might be the q8 kv cache quant though.
>>
>>108694159
Trying for what? tranny chatting?
how about this: sell off all your gpus to finance dick removal surgery and now you can go ERP IRL all you want
>>
>>108698037
(u)
>>
>>108697826
Eventually we will have permanent memory and continual learning once the model's weights can be actively updated as you use them. But I don't see it happening anytime soon.
>>
>>108697547
To avoid death threats when saying "I don't like this TV show", you're kinda forced into anonymity. And this bleeds over into the rest of your online life. When you're interacting with an LLM and it knows nothing about you, it's kind of like using the untrained "shit tier models" people are always complaining about.
>>
>>108698188
nta, i still don't get it either
>And this bleeds over into the rest of your online life.
I get this part, like how I won't release a moaning/slurping capable TTS model with my resume attached.
>When you’re interacting with LLMs and it knows nothing about you, it’s kind of like using untrained “shit tier models” people are always complaining about.
You mean it's worse in a fresh context? If so, just tell it what you already know. Doesn't have to be in the system prompt, it can go in the first message. Something like:
"Explain to an 8 year old the difference between YPbPr and RGB. And why do they both look so much better than composite? P.S. I'm 8, but not retarded, so don't talk down to me."
>>
File: 1754544654515419.gif (1.14 MB, 400x226)
1.14 MB GIF
>>108697204
That's part of the reason Non-compete clauses exist, but those usually expire eventually.


