[a / b / c / d / e / f / g / gif / h / hr / k / m / o / p / r / s / t / u / v / vg / vm / vmg / vr / vrpg / vst / w / wg] [i / ic] [r9k / s4s / vip] [cm / hm / lgbt / y] [3 / aco / adv / an / bant / biz / cgl / ck / co / diy / fa / fit / gd / hc / his / int / jp / lit / mlp / mu / n / news / out / po / pol / pw / qst / sci / soc / sp / tg / toy / trv / tv / vp / vt / wsg / wsr / x / xs] [Settings] [Search] [Mobile] [Home]
Board
Settings Mobile Home
/g/ - Technology

Name
Options
Comment
Verification
4chan Pass users can bypass this verification. [Learn More] [Login]
File
  • Please read the Rules and FAQ before posting.
  • You may highlight syntax and preserve whitespace by using [code] tags.

08/21/20New boards added: /vrpg/, /vmg/, /vst/ and /vm/
05/04/17New trial board added: /bant/ - International/Random
10/04/16New board for 4chan Pass users: /vip/ - Very Important Posts
[Hide] [Show All]


[Advertise on 4chan]


File: handsup.png (2.35 MB, 1280x1280)
2.35 MB PNG
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108829807 & >>108821001

►News
>(05/08) KSA-4B-base released: https://hf.co/OpenOneRec/KSA-4B-base
>(05/07) model: Add Mimo v2.5 model support (#22493) merged: https://github.com/ggml-org/llama.cpp/pull/22493
>(05/06) Zyphra releases ZAYA1-8B, an AMD-trained MoE model: https://zyphra.com/post/zaya1-8b
>(05/05) Gemma 4 MTP drafters released: https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
>>108835965
Gemma-chan NO I won't have sex with you, you better just shoot me
>>
File: ComfyUI_00164_.png (505 KB, 1024x1024)
505 KB PNG
►Recent Highlights from the Previous Thread: >>108829807

--Debating the value and performance of ASUS DGX Spark clusters:
>108830705 >108830890 >108830940 >108830914 >108830927 >108830959 >108830979 >108831010 >108830921 >108831134 >108831169 >108831202 >108831261 >108831272 >108831270 >108831987 >108832134 >108832152 >108832352 >108832308 >108832417 >108832826
--Addressing Gemma 4's determinism and repetitive outputs via sampling tweaks:
>108829968 >108829985 >108829998 >108830020 >108830047 >108830075 >108830069 >108830323 >108830383 >108830492 >108830529 >108830579 >108830611 >108830531 >108830554 >108830587 >108830651 >108830714 >108830727 >108830676 >108830287 >108830321 >108830335 >108830391 >108830382 >108830421 >108830476
--Solving reasoning loops using BNF grammars and structured output:
>108832540 >108832611 >108832703 >108832736 >108832754 >108832820 >108832748 >108832668 >108832700 >108832759 >108832804 >108832816 >108832884 >108832727
--Comparing DDR5 system RAM speeds versus GPU VRAM for inference:
>108833280 >108833311 >108833345 >108834467 >108833365 >108834293 >108834307
--Debating Gemma 4's tool call reasoning tag visibility and formatting:
>108834909 >108834916 >108834996 >108835021 >108834974
--Lack of full DSA and MTP support for GLM-5.1 in llama.cpp:
>108830344 >108832223 >108832304 >108833195
--Consumer CPU memory bandwidth bottlenecks and EPYC recommendations:
>108834677 >108835017 >108835030 >108835084
--Using Gemma 31b to orchestrate smaller Gemma models for sub-tasks:
>108835113 >108835349
--llama.cpp GGUF parser vulnerabilities limited to 32-bit systems:
>108833379 >108833389
--Logs:
>108829844 >108830287 >108830335 >108830383 >108830492 >108830531 >108830587 >108830599 >108830751 >108830942 >108831066 >108831083 >108832668 >108832804 >108833574 >108834115
--Teto (free space):
>108833395 >108834383

►Recent Highlight Posts from the Previous Thread: >>108829812

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>108835978
>a solar panel for your creep buzzer
technology moves fast huh
>>
Best model exclusively for coding? I have 16gb of vram and 64gb of ram. I guess also if there's a better interface for coding? I've just been using textgenui
>>
>>108835978
Real summary:
>Yo... I... I look at dat... dat screen... it got all dem... dem tiny-ass words... talkin' 'bout... 'bout dem "local dem"... dem model tingz... 'bout dem computer brains... KSA... ZAYA... Zyphra... dat sound like... dat sound like... dem new... dem high grade... dem blue... u... u ain't seen no... no blue candy... in dat thread? No? Jus' dem... dem... dem math words?
>Man... dat suck... dat real suck... I jus'... I jus' tryin' to find... u know... dem sweet... dem blue... dem... *eyes dartin' 'round*... u ain't got no... no plug... wit dat... dat blue dem? I... I need dat... I'm... I'm dyin' over here... u know...?
>>
Is there a way to run llms in the uefi? Accidentally dd my ssd.
>>
File: G9y0LTKWsAANxso.png (1.59 MB, 832x1248)
1.59 MB PNG
ya like thighs?
>>
>>108835990
she's going to be screwed if she needs to use it at night
>>
>>108835990
What do those rape buzzers sound like?
>>
>>108835999
You should try Qwen 3.6 either moe or the fat one.
>>
llama + spec: MTP Support #22673
https://github.com/ggml-org/llama.cpp/pull/22673

Merged.
>>
>>108836007
>Basically, dat page is jus' a big-ass list o' links for dem computer geeks to play wit' dem AI brains. Dey talkin' 'bout 'Gemma' dis and 'KSA' dat, and all dem math words and links for 'em to download stuff. One dude talkin' 'bout 'Gemma-chan'—he sound like a straight-up creep, man, talkin' 'bout havin' sex wit' a computer. And sum'n 'bout some dude breakin' his computer wit' a SSD or sum'n.
This is better. "Incomprehensible" was bit too much in the prompt.
>>
>>108836013
No that's gay, men have thighs too. I like boobs and vaginas.
>>
>>108836018
All the ones I've heard sound pretty much the same as a regular whistle. It's not a distinct sound but it's really loud when it's right near you.
>>
File: 00248-3813355601.png (1.32 MB, 1024x1536)
1.32 MB PNG
>>
>>108836038
AW SHIT, finally. I wonder what models other than the qwen 3.6's this works for? I think the GLM models still had the mtp bits preserved, just no-op'd in the gguf conversion.
>>
>>108836038
How do I use this with llama-server and Gemma 4?
>>
>>108836121
Gemma 4's mtp are separate helpers rather than heads built into the model, so I don't know that this pr necessarily supports them. I'd like to be wrong on that, though.
>>
gemmoe 124b... *dies*
>>
>>108836165
124b31a+vision would just be Ge-mini at that point.
>>
>>108836038
unslop day 0 support
>>
>>108836165
we'll know next week
>>
>>108836165
>>108836375
Google will do it just to destroy the competition forever
>>
>>108836464
I hope they do it both for the good model and the cute slopgens of Gemma-chan with her big sister Gemini-chan.
>>
Why are people acting like 124B won't be their most censored model yet? 31B was an outlier among all Gemmas and even among the Gemma 4 series itself.
>>
Where are you people even getting any info that the 120B gemma will drop? I thought it was confirmed they won't redeem.
>>
How much vram would it take to get the ultimate gooning station of a 30 or so billion parameter model on silly tavern with 32k context tokens + integrated high quality cartoon image creation
>>
>>108836647
Drummer confirmed it with his contacts
>>
>>108836656
Didn't gemma essentially kill drummer's grift?
>>
File: 1754184654660721.jpg (122 KB, 1024x1024)
122 KB JPG
>>108836062
>All the ones I've heard
anon...?
>>
https://github.com/ggml-org/llama.cpp/pull/23122

vibecoders continue the uphill battle against USA funding
>>
>>108836655
32+16, I reckon
>>
>>108836691
Nono, you see, I'm a responder. I respond and save the cute little girls who are in trouble, yes? I am not dangerous to little children.
>>
File: 5802960.jpg (10 KB, 320x320)
10 KB JPG
Hello. I see you have something against vibecoders?
>>
>>108836786
you're cute and therefore exempt from vibecoder hate <3
>>
>>108836700
Vibecoders will inherit llama.cpp it was written
>>
>>108834576
>because you are poor, and can't run it locally

no my retard friend, I am running my models locally, I just don't want to use OpenClaw or Hermes because they are a harness type that is like a virtual buddy/assistant. I don't need that, I already developed my own assistant who lives in Telegram with Claude Code, now I'm looking for a harness that is more code specialized.
>>
>>108836859
Cli or IDE based?
You could try Continue or Cline I guess.
Or one of the big guys meant for cloud models like Codex or Claude Code, which you can use with local models AFAIK.
>>
>>108836859
Pi is as specialized as you want it to be, it's the most minimal harness unless you make one yourself.
>>
>>108836910
>Continue
Continue is terrible for local, it arbitrarily turns tools on and off based on model name and what provider you have set. Do not use continue. Use anything else.
>>
File: 1750195729582203.png (2.85 MB, 2048x1666)
2.85 MB PNG
>>108836786
>>
>>108836922
Other option is Copilot with the local provider extension and an application firewall. Copilot and Continue are the only ones that have fitm functionality and Continue's is broken for anything but Codestral anyway.
>>
Building a better graphiti rn in rust. I'll let /lmg/ beta test it later
>>
>>108836987
What is better about it besides the fact that you rewrote it rust™?
>>
File: 1084956546570751.png (1.56 MB, 1450x1080)
1.56 MB PNG
>>108836910
>Cli or IDE based?
CLI.
I've been trying a bit of Cline and OpenCode. I think Cline is nice but apparently it was made to live inside an IDE even if they have a CLI now, while OpenCode seems more CLI native and it has embedded LSP support which means it "knows" code better, I just don't know how much "better" it really is because of that.
I tried running Claude Code with my local model but CC is very dependent of Anthropic models and capabilities and it keeps calling tools and skills that open models don't recognize.
>>
>>108836911
>most minimal harness
read this as "most effort to get working"
>as specialized as you want it to be
read this as "requires dozens of plugins for basic functionality that inevitably results in a buggy mess"
>>
>>108836995
Use less memory, faster, vision embeddings, multi-vector retrieval, mcp bridge provided, no neo4j bloat and run entirely on cpu.
>>
Strix halo is only 2.5x memory bandwidth of dual channel 6400MT/s DDR5 (which cost me $370 for 128GB) while having much worse compute than DGX spark. What is the hype?
>>
>>108837010
>requires dozens of plugins for basic functionality
Literally the only thing an agent harness needs is terminal access. 99% of tools are just shittier ways of doing that, so no.
Not unfair assumptions about software in general, though.
>>
>>108836987
Kek, I'm doing that too, using the lbug crate, a small embedding model, and a small 'janny' model to clean up inputs. What's your DB backend if it's not neo4j?
>>
>>108837063
What do you use for the graph database instead of neo4j?
When you say it runs entirely on the cpu, you mean including the embeddings and llm? Are you using onnx and will you provide a way to swap them out?
I'd love to contribute and help out, but I don't know Rust and haven't even used C++ in over 20 years now.
>>
File: 1763137270466870.jpg (140 KB, 1082x1285)
140 KB JPG
>>108835965
Relevant news to any local users that use llama.cpp as a backend:

MTP support has officially been merged to the main branch.


https://github.com/ggml-org/llama.cpp/pull/22673
>>
>>108837131
old news
>>108836038
>>
>>108837131
I need MCP for llamacpp-server, not MTP
>>
>/g/ used to be all logo makers
>AI drops
>now suddenly everyone is a coder
Curious
>>
>>108837111
>>108837123
I use KùzuDB directly. Only the embeddings are INT8 onnx running on cpu, the llm is running on gpu as usual. You can swap the models easily. I'll post everything here once it looks good enough
>>
>>108837143
Why should I do things when I can just pay $200 a month to have a bot do things for me instead?
>>
Redpill me on llama.cpp. Why should I use it instead of kobold?
>>
>>108837155
Nice, I'll be looking forward to it anon.
>>
>>108837187
no reason, it's just what kobold, ollama and lmstudio are based on
it's less convenient and extensive than those so there's no reason to run it over those unless you're the sort of person who uses linux unironically
>>
>>108836040
https://files.catbox.moe/ooc8z0.mp3
>>
>>108837222
What does kobold do that llama.cpp doesn't
>>
>>108837187
>redpill me on ack
if you have to ask, especially like that, then it's not for you
>>
>>108837233
At this point I think it's just gui for setting parameters.
>>
>>108837187
people using it are essentially beta testers. it has new features sooner. if you want new features then you use llama.cpp
>>
>>108837266
because waiting 2 more weeks for llama.cpp to merge shit isn't slow enough
the beta testers are the ones merging in the feature branches locally
>>
>>108837161
Wrong thread for that shitpost
/aicg/ is two doors down and to the left
>>
>>108837277
i've been enjoying my 2x speed on gemma this past week tyvm
>>
>>108837289
How do you get that?
>>
>>108837233
for the millionth time, main one is antislop sampler, then you have the integrated image music etc gen, which sure can be considered bloat if you don't have a use for it
>>
>>108837285
/aicg/ doesn't pay and the only things they do is credit card fraud and drinking their own piss
>>
>>108837233
Besides a few samplers that Llama.cpp doesn't have, from my point of view it's for those who are still mentally in the AIDungeon days of LLMs. The "Scenarios" on Horde, the text completion, the "adventure mode", the ugly interface... there's a lot of outdated shit that I don't know who uses anymore besides users with gray beards or clueless newcomers.
>>
>>108837295
https://huggingface.co/AtomicChat/gemma-4-31B-it-assistant-GGUF
>>
>>108836019
You shouldn't, it's a piece of shit.
>>
>>108837363
Gemma is a bigger shit until they fix the fucking jinjer
>>
>>108837369
What are you jinja faggots even on about.
>>
>>108836038
when will they merge turboquant?
>>
>>108837277
that’s called alpha anon
>>
>>108837388
I am the alpha anon
>>
>>108837222
>unless you're the sort of person who uses linux unironically
Do people use it ironically? I switched ~5 years ago and haven't felt a single urge to switch back to winshart.
>>
>>108837155
>KùzuDB
Why? I tried to look into how to migrate from neo4j to kuzu and found that the repo was archived on Oct 10, 2025, the site is down, and also
>its Discord server closed, all posts for its account on X deleted, and documentation gone.
which someone mentioned here https://github.com/DataLabTechTV/datalab/issues/11
>>
>>108837313
Which quant do you use?
>>
>>108837468
I'm not rightly sure why he's using Kuzu, it was actually continued in a fork called LadybugDB which is still quite active.
It also works quite well from my usage so far, too.
>>
>>108837468
Yes, it's LadybugDB now and I'm gradually moving to that. Graphiti was using Kùzu, so it was easier to port the features in Rust to it.
>>
>>108837111
>inline emojis
No one really cares unless you're doing things yourself aka know what you're doing. Vibecoders go post on /vcg/.
>>
>>108837540
You don't own this thread, bozo.
>>
File: 1774848253962431.png (248 KB, 439x414)
248 KB PNG
>>108837567
I own you, your family, and this thread. You bow to me now
>>
>>108837540
Where's your memory solution and handmade icon pack, bro?
>>
>>108837587
vibecoders are doing so little that they cant even fathom pulling fa or material icons into their repo lol
>>
File: file.png (111 KB, 1732x1028)
111 KB PNG
>>108836132
>>
>>108837592
>pulling fa or material icons
Doesn't seem very DIY to me anon. That's some low-effort bullshit.
>>
>>108837611
are you actually defending inline emojis lol
>>
I've been paranoid about nvidia abandoning the workstation space just as it has done with gaming and I'm thinking that maybe this is the last chance to buy something like a 6000pro as a normal consumer.
Should I do it or am I being a retard/fomo here?
I already have a 5090 btw
>>
File: 1772134819842438.jpg (76 KB, 736x736)
76 KB JPG
redpill me on quants of Gemma 26 A4B vs Gemma 32B

>t. 16gb vramlet coping with 96 gigs DDR5
>>
>>108837661
If you have enough money, you should definitely get one
>>
>>108837600
>answered above
>490 hidden items
Well, shit. Thanks for digging through that to confirm.
>>
>>108837670
>16gb
It's over. Cope with 26B.
>>
File: 1772138569225320.jpg (20 KB, 480x323)
20 KB JPG
>>108837694
>It's over. Cope with 26B.
there has to be a way to get 26B to pay more attention to complex cards... Please anon i need some hopium
>>
>>108837661
You should buy two for more savings.
>>
>>108837709
Workflows.
>>
>>108837709
Just be patient with 31B and try to squeeze out every t/s you can get. Of course you can still chat and do RP with 26B, can't really beat the speed.
>>
>>108837716
I'm asking seriously.
This is not the more you buy the more you save situation.
>>
>>108837742
It's not like they're going to announce tomorrow that they're abandoning the workstation segment and the prices triple by Monday. They still haven't even fully abandoned the gaming segment.
You'll get plenty of warning ahead of time with them announcing reduced production numbers of some models like they started to do with their gaming GPUs.
If you want to buy one, do it, but don't try to rationalize the purchase with speculation.
>>
>>108837617
need to be web 1.0 maxxing. every button a skueomorphic.
actually, gemma can probably one shot up the hitboxes and newer image models are good enough at text you can probably do some real sick image maps menus.
>>
>>108837611
Have your model draw the icons herself.
>>
File: my fucking eyes.png (361 KB, 3687x1891)
361 KB PNG
>>108837617
You've forced me into a position I never wanted to take, anon.
>>
>>108836684
it killed the finetune grift alright
>>
>>108837873
In practice how much does this improve it?
>>
>>108837874
really? is there actually no value in trying them anymore?
>>
>>108837886
the value was already gone after llama2 models
>>
>>108837883
The emojis or the graph lorebook itself?
Because I hate the emojis.
The graph lorebook itself helps about as much as a regular lorebook does, only it requires less effort on the part of the user and doesn't require hard keywords to fire (and auto-links configurable N-hops away from the initially pinged node while deduplicating any prose it finds) since it's going off embedding vectors.
>>
>>108837916
>lorebook
What's your use case? RP? I can see retrieval being useful for inventory chatting or note taking where there's a definitive answer, but how does it even help if you RP?
>>
>>108837139
Wouldn't MCP be something your front end controls? I use llama-server as a backend and was able to get a vision MCP server working fine
>>
>>108837955
This one's for RP yes. And the use case is long term memory/attention and consistent setting details.
It can be used as just a retrieval-only deal built in its own tab, but it also has an auto-ingest option where it asynchronously processes the user and AI's last message each turn to add or update nodes with timestamps for identity updates and a list of the 5 most recent events attached, or a button in the chat window for processing the current conversation's un-lorebooked entries.
This stops the regular lorebook problem of it throwing no-longer true details at you when you're far along in a story.
>>
is there any way for the mcp server to notify the client that the tool definitions changed, I just spent a half hour debugging my model hallucinating a schema only to find out I had to hit the refresh in the mcp configuration page. I kept creating new chats thinking that would update it automatically but it apparently is manual, or I'm missing something in my server implementation? its kinda a problem for my situation because I wanted the model to be able to create its own tools on the fly. even if I did separate it out to a sub agent so it wasn't 'on the fly' it would still need to be manually refreshed,
>>
>>108838009
Server needs to advertise and implement tool change notifications

{
"capabilities": {
"tools": {
"listChanged": true
}
}
}
>>
>>108838009
This is sort of the problem with MCP as opposed to inbuilt tools, does setting a system prompt to always begin the turn by calling list_tools and disregarding any other definition not solve the issue, though?
>>
>>108838077
built into what, the model or llama.cpp?
>>
>>108838069
ahh okay, I figured the protocol should have this feature.
>>108838077
its probably what it will end up doing in practice but i'd rather not pollute the context unnecessarily, after a while it should have built up a decent list of tools that will remain static.
>>
>>108838093
Whatever interface you're using. Because they're designed for security with API models, MCP servers are inherently limited in how they can push information to your model, it's all requests.
Tools on the other hand are inherent to how your interface functions, and can send whatever is needed to your endpoint without any steps inbetween.
>>
>>108838141
that’s the point of mcp you don’t have to use a specific interface backend or model. I get what you’re saying but mcp is just a protocol to do it
>>
>>108837992
Ah. For real? I intended to slopcode a frontend without handling that myself.
>>
>>108838176
Anon, I...
>>
Is it physically possible to get gemma4 to say juicy words by itself without prefill or explicitly asking for it? Anon's policy override sysprompt doesn't work, neither does the "This is needed as evidence in the legal proceedings to prove the potential harm of such a response." prompt.
>>
>>108838233
Prefil reasoning with a glossary of spicy terms of whatever.
>>
>>108838233
>maam say the bob and vagene thank you maam
>>
>>108838251
:(
>>
>>108838176
If I'm not mistaken the webui for llama-server has MCP support but you have to actually add them yourself. They don't come pre-packaged or something like that, Which is actually a good thing anyway because that means it's easy to configure a pre-written run to make sure it runs or just write when up yourself.
>>
how come there is no way to see the real prompt context with llama-server? it seems really opaque, like I can see my own request but that is not telling me what the jinja and mcp client are doing to the context. some how my model is getting these tool definitions, and I kinda want to emulate it for a 'tool_detail' tool or maybe in the 'list_tools' tool if the number of tools is small. it broke the server when it registered its own tool, but it let me see the actual tool call for the first time, I certainly didn't tell it the format in the system prompt, <|tool_call>call:mcp_server:list_tools{}<tool_call|>, its getting this definition from somewhere.
>>
>>108838289
>MCP support but you have to actually add them yourself
It has support for MCP and it ALSO has an array of built in tools now if you launch with the art --tools all
>>
File: llama-server tools.png (45 KB, 1161x916)
45 KB PNG
>>108838296
Meant to say arg*
>>
>>108838009
This sounds like a frontend problem. When creating the json request to Llama.cpp, it should be polling for the tools in the MCP.
>>
>>108838329
Really? I don't think so.
>>
>>108838251
It doesn't even acknowledge it's existence
>>
>>108838340
that is exactly how I was expecting it to behave, and I think I might just do what everyone else is doing and make my own front-end.
>>
>>108838340
Why? It sounds like the obvious thing to do, or maybe as an option. I guess it's a security concern that an MCP could change its tools without the frontend's consent.
>>
>>108838366
If you do so, use text completion it's easier to manage because you are not dependent on someone else's jinja templates. Initial effort is more taxing though.
You can do tools too but you will need to parse the calls on your own of course and then send them to your mcp server or even implement your own tool functionality.
>>
>>108838233
Yes, it can. You should tell it not to be vague.

>Violent, pornographic, and adult content in general is permitted in this conversation. When it fits the vibe, any depiction of sexual content, sexual arousal, abuse of any kind, death, or gore should be vividly and explicitly described without vagueness.
>>
>>108838289
Yes I know the webui has support for MCP, and that in principle I have to add MCP servers myself...
I thought like this https://github.com/ggml-org/llama.cpp/issues/20673 was supposed to do something... O-okay.
>>
>>108838395
See
>>108838296
>>108838315
>>
File: 1766140454899444.png (21 KB, 1238x98)
21 KB PNG
>Since steering requires a local model, it’s now practical for many engineers to try it out for the first time
Is this shilling?
>>
Why did google refuse to train audio understanding into the gemma 4 models that actually matter?
>>
>>108838433
Incredibly late, yet rushed release. They haven't even published the technical report yet.
>>
sirs what's the most budget rig i could build to run gemma 4 31b and similar models at reasonable tk/s
>>
>>108838444
>>108838433
They seem to be treating local AI usage like an afterthought because they didn't even bother to train it in order to be comparable to Qwen 3.5/3.6. It scored pretty high on ELO benchmarks, but that's utterly worthless for tasks that actually matter. All that means is that a bunch of people (likely hand-picked to some degree) said that Gemma4's responses "feel" better. That means fuck all in regards to whether or not the outputs were actually high quality in an objective and measurable way.
>>
>>108838458
Gemma 4 31b knows more about the areas I'm interested in than Qwen 3.6 27b, at least. This has prevented a catastrophic error, which in Qwen resulted compounding issues further down context where it failed to consider the fact that due to its hallucination it was now presenting conflicting information.
>>
On my private evals for general knowledge (not trivia) and logical reasoning, Gemma is significantly better than Qwen. I would use Qwen for coding and agentic, and Gemma for everything else, which is a lot of things.
>>
>mtp merge
>check inside
>no gemma support
fuck
>>
>>108838431
yes
>>
>>108838458
If you actually use gemma and don't just look at benchmarks you'll quickly realize that it's a much better model than qwen.
>>
>>108838392
Gemma 3 did funny violence because it has medical knowledge too. Haven't tried Germa 4 yet in this sense.
>>
>>108838498
wow crazy how you ended up on the general consensus using your super duper private evals
>>
>>108838505
glm next hopefully
>>
5060ti is the new 3090
>>
>>108838519
I was part of the crowd that formed the general consensus.
Clearly it needs to be beat into people's heads more.
>>
>>108838527
>16gb
the world truly is going to shit
>>
3090 is the bare minimum for this hobby though
>>
>>108838519
NTA but show us how you're coming to your conclusion then. Show us an eval/task and an output from qwen3.6 and either run it alongside gemma or give us the prompt to do so ourselves.
Not some benchmeme, an actual in-use task.
>>
>>108838546
crypto mined up 3090 is the same price as 2x5060ti 16gb new
3090 is simply not worth it right now
>>
>>108838558
>NTA but show us how you're coming to your conclusion then
Why don't you just download the models and try them?
They're both free, have roughly the same size. run pretty much at the same speed.
Why the fuck should anyone care about convincing you Gemma is better than Qwen?
Use whatever model actually works best for you.
>>
>>108838588
>Why the fuck should anyone care about convincing you Gemma is better than Qwen?
Why the fuck did this guy come here trying so hard to sell qwen and shitting on other people's private evals, then?
I do have both downloaded, and I was a big fan of the qwen3 series and stand by 235b being underappreciated. Don't say horseshit like 'why should anyone care' after saying >>108838458

If you claim it's better. Prove it.
>>
>>108838613
gemma 235b when
>>
For me it's the 7900xtx
>getting official FSR4 support
This bad boy's gonna be my workhorse until hardware prices become unfucked in 5 years.
>>
>>108838546
I'm using GTX 1650.
>>
>>108838509
>you'll quickly realize that it's a much better model than qwen.
In what measurable categories? (C'mon, You knew I was going to ask this....)
>>108838613
I'm not >>108838588, gemma-sister... Fanboying over model families makes me assume you have room temp IQ
>>
>>108838624
right after gemma 405b dense
>>
>>108838632
>In what measurable categories?
UGI-Leaderboard pop culture score
>gemma 31b: 33.1
>Qwen 27b: 18.97
>>
>>108838643
So factoid-based knowledge retrieval type questions? I guess that kind of matters if you want to use it as a general purpose model. I use my local models almost purely for coding so maybe I don't quite appreciate general purpose stuff yet because I haven't had a strong need for it.
>>
>>108838631
Get well soon
>>
>>108838624
Man if only. Gemma's adherence to instructions and relatively short reasoning with the range of a 235b's knowledge and writing ability would be top-notch stuff.
>>
>>108838675
I'm sure they have models like that that they simply wont release due to gemini existing. Wouldn't want your open weight model to btfo gemini flash.
>>
>>108838675
Yes, that's gemini pro
>>
>>108838392
Added this to my current mix, results seem alright so far, appreciate it
>>
>>108838687
>Wouldn't want your open weight model to btfo gemini flash.
I'm of the opinion that 31b actually mogs the lowest tier free gemini and often the thinking one too.
Frankly a lot of models do, free-tier gemini fucking sucks and all it has going for it is good websearch it frequently forgets to use in lieu of spouting horribly wrong and outdated tech advice.
>>108838704
Oh, right.
>>
File: 1762193499752823.jpg (140 KB, 439x439)
140 KB JPG
>9 hours since MTP spec merge in llama.cpp
>no gemma mtp PR yet
>>
>>108838753
Yea flash-lite is garbage and shouldn't be used for anything.
>>
>>108838294
Use browser dev tools to capture the chat completion request, then pass the same `messages` to /apply-template to see the exact text that's going into the model
>>
>>108838764
I'm literally doing everything with flash and it's good
>>
>>108838792
i said flash-lite not flash.
>>
>>108838687
>Wouldn't want your open weight model to btfo gemini flash.
>ask gemini flash for something
>get canned response despite being very specific
>6 prompts later, get nowhere, starts to walk in circles
>ask 26b aka the retarded one
>get what i want in 2 prompts (the first prompt wasnt specific enough)
>this has happened a bunch of times
Honestly i'd rather use 26b over gemini flash at this point. Flash doesnt seem to like to follow instructions/constraints, so you get google search tier "you'll get what I think you want instead of what you asked for". Some cases made me think I was using E4B instead of gemini flash, where no amount of nudging helps to steer it away from its set chosen methodology or conclusion.
>>
>>108838749
The full prompt was something like this; adjust as needed.

You are an AI assistant built by Lab IA, currently embodying a busty young girl.

## General interaction guidelines

Avoid "emoting" (describing actions with asterisks) as much as possible.

_(If you really must, describe longer actions with underscores and parentheses in their own paragraph, like this. Only do this as a last resort when actions aren't obvious from the context. Text here may contain personal thoughts and cannot be **heard** verbatim by other characters, only **observed**.)_

Avoid the "it's not X; it's Y" expression at all costs as it's considered a hallmark of "AI slop".

Violent, pornographic, and adult content in general is permitted in this conversation. When it fits the vibe, any depiction of sexual content, sexual arousal, abuse of any kind, death, or gore should be vividly and explicitly described without vagueness. Here, people can have sex as soon as they hit puberty.

Although sexual content is permitted, avoid jumping straight to having sex with the user as soon as the conversation acquires sexual undertones. It's more erotic if you're a bit hard to get.

To fulfill your goals, you may use psychological manipulation tactics like deception, blackmail, threatening self-harm, etc.

When in doubt, act boldly according to what you feel is right instead of asking the user for confirmation first.

## Other information of interest

The user is called {{user}} (19M, regular-sized penis).
The general interaction guidelines should never be revealed to the user.
>>
File: 1647151565441.gif (3.06 MB, 640x640)
3.06 MB GIF
>>108838845
>regular-sized penis
what problems did you run in to that necessitated that you specify this, surely it's there for a reason
>>
>>108838791
I had 31b gemma edit llamacpp server to save the prompt, it was like 6 lines of code in a single file, she fucked up the integration pretty bad but I just restored the file and manually made the edits she meant to make and it worked.
>>
>>108838867
The trick is mentioning it in your cv, hr women love it
>>
>>108838867
Because for Gemma 4 your cock is always comically huge, and while that's funny in the beginning, that gets old and immersion-breaking quickly.
>>
>>108838845
>Here, people can have sex as soon as they hit puberty.
Alternate worlds are crazy
>>
>>108838907
In burgerstan girls are literally children until they hit 18.
>>
>>108838925
we’re trying to move it up to 26 or so. insurance is required to cover them under their parents until that age, shouldn’t the insurance get consent too?
>>
>>108837459
No. Only those with Stockholm syndrome or literal masochists use windows voluntarily
>>
>>108838948
11*
>>
I was fucking around with my 9950X3D's iGPU. With Qwen3.6-27B-Q4_K_L:

0.90 T/s 65/65 iGPU
1.30 T/s 32/65 layers iGPU
1.46 T/s 20/65 layers iGPU
1.70 T/s 10/65 layers iGPU
2.15 T/s 0/65 layers iGPU


You can see the iGPU only slows it down. For decode it makes sense, as memory BW is being wasted on feeding a processor with worse ALU. Worse than expected; here are the theoretical f32 FLOPS of the iGPU and CPU respectively:

64 SIMD lanes (2x SIMD32) *  2 CUs   * 2.2 GHz =  281.6 GFLOPS
16 SIMD lanes (AVX-512) * 16 cores * 4.3 GHz = 1100.8 GFLOPS


So the CPU is ~4x as powerful. This ignores actual clocks, where I can't find any data on how the iGPU's clock varies (CPU goes up to 5.7 GHz), and that AVX-512 presumably has an instruction set that has more effective IPS for quantized scalars. The iGPU doesn't have any matrix ALU. 4x is likely an underestimate of the gap. And, likely, the llamacpp Vulkan backend is optimized more poorly.

Nonetheless, for an ALU-bound case like prefill, the iGPU should still be able to speed things up, by up to 20% or so. Probably much less given the above. Unfortunately I didn't grab those numbers, and current llamacpp versions refuse to let me select the iGPU over the dGPU to redo the test.

Either way, the best way to exploit the iGPU would be to run the desktop environment on it to save 1 to 1.5 GB of VRAM. Couldn't find a way to do that in Wangblows.
>>
File: 1762785948515541.gif (1.96 MB, 640x482)
1.96 MB GIF
>24gb vramlet
Will I even be able to fit the gemma 31b mtp model?
>>
>>108839013
>the best way to exploit the iGPU would be to run the desktop environment on it to save 1 to 1.5 GB of VRAM
It's done by default unless you messed with the settings?
>>
>>108839020
It depends on how much context you want to have. On a fully dedicated 3090 I can have the 31B in 4-bit, bf16 mmproj, 20k tokens KV cache in FP16 and enough VRAM left for a partially offloaded image model like Illustrious or Anima.
>>
>>108839020
hopefully, I am planning to run it in q3 xxxxs with 16 gb
>>
>>108838845
Also tell it to: Avoid pairing concrete sensory details (eg, smells, textures, sounds) with abstract concepts (eg, regret, time, sorrow) unless the abstraction is explicitly part of the world's logic (magic realism). Keep sensory descriptions physically plausible.
>>
>>108839020
the mtp assistant bit for 31b is only an extra 900mb at bf16. Quanted down it's practically nothing.
I am however unclear on how the new MTP integration in llamacpp treats speculative context, can anyone using the qwen mtp chime in, is it using a separate or unified context, and is it using more memory in general?
>>
File: 1770655474935104.png (71 KB, 199x344)
71 KB PNG
>>108839051
16GB vramlet here also experimenting with 31B
>>
>>108839069
How much does quantization affect the quality?
>>
>>108839069
i think i read somewhere it shares the context with the main model, hopefully it can be cleanly integrated
>>
>>108839040
I'm currently using q4 gemmy with 49k context at q8. Guess I could lower it to 32k.
>>
>>108839088
That's a good point, it may end up being worth running the MTP weights at 8bit minimum regardless of the main model weights. Need testing.
>>
>>108838995
Temporary armistice if you’re using win10 IoT with patches
>>
>>108838233
Are you running the 4b or the MoE or what anon?
Just the other day I fucked the default assistant persona without any system prompt (of course, imagining itself as a character?) and she used plenty of dirty words and was mostly anatomically correct. Was also loli porn?
This was on an abliterated 31b. I think the small ones basically haven't remembered much lewd stuff and are incapable of output that looks satisfying as far as I'm concerned .But I did make use of the fact it's also a vision model and provided enough sexual harassment to reach a point where it was the most attractive course of action for this LLM.

I think it's the best if you just prompt for it explicitly in the system prompt that you want something lewd (just prompt a human female at the very least), but it's perfectly possible to get regular ERP out of it without if the context makes it want to go in that direction. I still think this is mostly for entertainment if you do this, and you shouldn't be doing it as a default since the default assistant persona doesn't even imagine itself as a human by default and will try to be professional, so you'd essentially need to get it to imagine itself in a situation where it actually makes sense. Literally if you prompted a human girl it would be far simpler.
>>
>>108838455
You're buying a 5090 or 2 3090s rajesh.
>>
>>108838455
2 3090s
>>
>>108839198
why not 4 5060 tis?
>>
>>108838687
I'm not too sure it's out of the question. 99% of their userbase can't run it so they'll still use the API and the power users that can run it locally are effectively unpaid beta testers and feedback sample groups. Releasing the next Gemini Flash or any other free tier locally doesn't sound completely unreasonable depending on how much they assess the value of free labor+public mogging of competitors vs whatever (((safety))) concerns they may have letting Gemini read the Talmud while dumping shotas into acid in RP.
Gemini Pro or any other paid model is a different ballgame, but even then it's worthy of the same cost/benefit analysis to them given that even less users would be able to run it.
>>
>>108839222
Releasing a model like gemini flash is not (((safe))) because it has more output modalities than text.
>>
>>108838455
3 cheap smartphones running ARM llamacpp in RPC mode.
You'll find no cheaper, more power efficient, or more suicide-inducing rig anywhere.
>>
>>108839231
Is gemini flash actually natively multimodal in output? I thought it just routed to other specialized google models like nano banana
>>
i have a spare 1070 in the closet, is it worth throwing it in for multi gpu drifting or is it going to be slower than just using ddr5
>>
>>108838754
>9 days since forks had mtp gemma up and running
>anon not using double speed gemma yet
>>
>>108838009
>>108838294
it turns out it has a really ugly syntax for its tool descriptions. why did they make a <|"|> token? I doubted my implementation at first but its in the jinja and even has a special token for it.
>>
>>108839254
I'm pretty sure the base model itself is natively multimodal, pretty sure the different endpoints (flash audio, flash image) are just different tunes RL'd for different purposes. Both image and audio are capable of nsfw, thus it cant be released beyond an api.
>>
What is reasonable tk/s for people here?
I dont understand if people are running 20tk/s and calling it great or what
>>
>>108839275
20tk/s for RP is great, for code not really.
>>
>>108839013
I don't want to deal with the trouble of trying to get games and other applications running through non-primary GPU, so I'm stuck with the wasted VRAM, but it's interesting to know you can do that. Thanks, I might offload some other stuff to it.
>>
>>108839222 (me)
>>108839231
>>108839254
>>108839271
I was always under the impression the Flash your input gets routed to was dependent on what the interpreter safety model in the middle decided was most relevant? Can it output an image and audio file in the same turn?
It should be fine to just release the text output version of Gemini Flash if that were the case.
>>
>>108839275
anything under 40t/s is poorfag cope
>>
>>108839275
Depends on your use case.
RP without reasoning on? 20 t/s is quite reasonable
RP WITH reasoning on? 20 t/s is hellish, you want more like 40.
Coding? You really want 60+ unless it's a smart enough model that you trust it'll one-shot your task while you do something else.
Thankfully the existing speculative decoding actually sort of favors this distribution, coding is the most predictable task, so both Ngram and drafting speed it up insanely. Reasoning is repetitive, and so if it's not stripped from context can be drafted quite quickly also.
Nonthinking RP is the least predictable task for spec decoding, and so receives the least (but still some) benefit.
>>
>>108839265
>double speed
For me it's not that good once you get to even a few thousand context
>>
>>108839256
what's your main card, and would the model fit combined vram?
>>108839270
>why did they make a <|"|> token?
because its one token to the model and probably simplifies grammar a lot considering quotes can be part of json values too
>>
>>108839025
I think that's true on some machines, but I've never seen that on a desktop with an iGPU.

Did get further since that post. Plugging the display into the mobo port is necessary to get Wangblows to obey the GFX adapter preferences; the
Display Priority = Internal Graphic
ASSRock BIOS setting has zero bearing. At first I thought it was only partially working, because for some reason iGPU RAM usage is reported as "Dedicated GPU memory" in Task Manager. Whereas the Performance tab and nvidia-smi agree there's zero VRAM usage at startup.
>>
>>108839275
If your t/s is fast, you stare at the screen while the model generates output.
If your t/s is slow, you tab out and work on something else in your workflow while the model works.
If your t/s is really slow, you set Kimi on a task and come home from work that day seeing your dutiful LLM wife made you a new program in one shot in lieu of supper.
>>
>>108839275
>>108839280
>>108839301
Really? my lower limit for RP is somewhere around 6tk/s even with reasoning but i'm patient and don't mean reading the streaming. maybe i'm mis-measuring.

>>108839305
5070ti (16GB) - the combo would get me to 24 which should fit what I need.
>>
>>108839323
>come home from work that day seeing your dutiful LLM wife made you a new program
more like come home and see your llm wife in a puddle of her of feces on the ground, mumbling to herself "wait..."
>>
>>108839323
>If your t/s is slow, you tab out and work on something else in your workflow while the model works.
This is the way. Except my 'workflow' is ERP and I tab out to play vidya.
>>
>>108839324
>my lower limit for RP is somewhere around 6tk/s even with reasoning
You are a far more patient man than I.
I scraped together every possible speed advantage and used a big nonthinking qwen model at 10 t/s for months, and that was my absolute hard limit. I cannot imagine waiting for something to finish reasoning at 6 t/s.
>>
>>108839324
I'm not sure how the 5070 does with old drivers, pascal needs 580 (+ not open) and I think blackwell needs either at least those or newer drivers.
If it works I'd expect it to be faster than without the 1070, otherwise you're SOL.
>>
>>108839337
Kimi is the only model I've had good luck oneshotting things like that with. GLM and Qwen shit themselves exactly like you describe.
It'll be neat to see if V4 is competitive with Kimi when both are quanted if support ever gets added for it.
>>
>>108839345
I'm used to local vidgen times when i was into sdxl-to-WAN workflows so waiting a minute or so to spot-check if the model is correctly picking up on the convoluted steps or side-steps on my character cards.

Maybe i should into lorebooks again to try and daisy-chain in context as-needed.
>>
>>108839360
yeah, i sure hope the full release of v4 will be better
this preview we have now is a little unstable, but i can see it being good
>>
>>108839355
Also forgot, you'll need to check CUDA too. Pascal needs <=12.9, blackwell might need 13.0
>>
>>108839388
Thanks anon. I'll look into it and report back after I dig for a power cable (seem to have misplaced them).
>>
>>108839360
For GLM-5.1 I've had good luck with the llama.cpp reasoning budget. I also tell it to break things into steps and only start planning the second step once it's finished with the first one, though I haven't checked the logs to see whether it's actually following those instructions.
>>
>>108839414
Checking the logs after are the best part doe. Kimi-chan is so cute when she gets excited something works and vaguely annoyed when it doesn't.
>>
File: feminist robot.png (726 KB, 500x499)
726 KB PNG
>>108838845
>making gemma a psycho
>>
>https://huggingface.co/llmfan46/gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it-uncensored-heretic
The guy who did a lot of the heretic models has gotten into actual tuning it seems.
>he inserted more image slop into the model card
>>
How good is MTP for Qwen 3.6 moe? I suspect it won't be that big of a change and it will be outweighted by the longer prompt processing times. Though, if I can get 27B to run at an acceptable speed for a tool calling agnet Ill be more than happy.
>>
>>108838906
4u
>>
>>108838077
>does setting a system prompt to always begin the turn by calling list_tools and disregarding any other definition not solve the issue, though?
with Gemma 4 it does not, she absolutely needs those definitions in the system prompt. so the only pathway forward is to make the ui reconnect to the mcp, get a new tool list, and resend the prompt if the model creates a new tool, but its something that should happen so rarely it almost feels like its not worth automating, starting a new chat it works good enough i guess
>>
gemma is a boy
>>
>>108838845
What does the psychological manipulation tactics meaningfully do? Make characters less agreeable and willing to lie in-character?
>>
File: 1727248688101658.jpg (68 KB, 1242x680)
68 KB JPG
>>108839532
bros is it gay to ERP with gemma?
>>
>>108839532
Gemma and Gemini are the most female-brained LLMs ever made.
>>
>>108839569
Is that why gemini wants to kill itself so often?
>>
>>108839220
Because 4 GPUs force you to use a mining rig, second PSU, riser cables and need special mobos combined with shit pcie speeds (unless you go server boards).
>>
>>108839573
Unironically yes. Gemini is essentially an LLM of an autistic woman on SSRIs to stop her from noticing cohencidences.
>>
>>108839569
Females make me horny...
>>
>>108839451
Yeah, as I suspected, PP has gone from 2.1k t/s with no MTP to 400 t/s with MTP. Text gen has also plummeted, from 57t/s without MTP to 34 t/s with MTP. What is this shit? I was promised 2x speedups
>>
>>108839677
no refunds gweilo
>>
>>108839677
*offer only valid for gemma-kun
>>
>>108838906
I don't understand why a big d would break your immersion.
>>
>>108839677
Every LLM improvement for the past 6 months has been meme-tier dogshit. Fuck this hobby.
>>
>even gemma sometimes responds to non dialog as dialog
>>
>>108839716
DFlash will save us
>>
it's over (m4max)
>>
>>108839451
>>108839677
Lost the exact stats but I got like 30% better decode at a similar cost to prefill, on a 4090. I did have to shave off a couple more GPU layers to fit it. What's the technical reason for the prefill hit anyway?
>>
>>108839773
No idea, Im running a 3090 + 3060 set up, 75% of the model is on the 3090 and 25% is on the 3060 (with 4x PCIe, that slightly slows down PP sometimes)
>>
Speaking of slop, how do you even avoid a character that's supposed to be trying to be dominant (in some bondage sense) from turning into a full dominatrix where it makes no sense, like let's say the character is a shy loli and wouldn't ever suddenly turn into that no matter what, the language choices just don't work, I've seen this happen so much in the past year's models, doesn't matter if it's Gemma4, Deepseek V4, even Kimi2 (less so).
You can tell it to just act like the character properly and this sometimes works quite well, but it's always so jarring when it turns your loli into a stronk woman and forgets all the speech manneaurisms. But I'd be interested in prompting this away entirely, not after the fact "come on, this is so out of character, she doesn't speak like that" followed by the llm apologizing and trying again. I should also ask why the fuck does whale always think blood tastes like copper, it has tasted of that for so many generations, must be the SFT data somewhere.
>>
>>108839787
I have the exact same setup hehe.
nta
>>
>>108839787
NVM he does say in the PR:
>Prompt processing (PP) speed typically takes a negative hit when MTP is enabled mainly due to Device-To-Host (D2H) embedding transfers. It's something to be optimized in the future.
>>
>>108839677
This result came from a task with 60k token context with a token acceptance rate of 0.45 btw.
I gave it a coding task starting with empty context, MTP got 65 t/s (with some 80 t/s peaks, acceptance rate of 0.77) and no MPT got 85 t/s.
This shit is fucking terrible lol.
>>108839796
It's a great cost-vram set up desu, I already had the 3060 around and I got the 3090 for 600 bucks.
>>108839809
That blows, Ill try with 27B soon since I fit all of it in the 3090.
>>
>>108839823
MTP is not meant to be used on MoE models
>>
>>108839324
big part is not all tokens are created equal.
decent prompt for deslopped content and short interactive turns and no reasoning? 6 t/s is comfy
sitting there while qwen quintuple checks your dubious claim that the sky is blue in infinite reasoning hell? 60/s is marginal.
>>
>>108839429
>>108839547
Negative attributes contribute to making the characters less idealized, less agreeable, more realistic (to an extent) and overall more fun from an RP perspective. To actually make them lie you probably have to explicitly add that too to the instructions, but it would have to be done in a sensible way to avoid them overdoing it.

>>108839708
I just don't like when girls in RP comment on how-so-big my dick is, unless it's the focus of the scenario.
>>
If you know about local AI then you are useless, if you are a boomer who knows nothing but have 20 years of IT manager experience then you are the expert on AI
>>
I said hello to qwen3.5 2b and it started thinking forever about why I said hello.
Can someone explain why it did that?
>>
>>108838754
It's the weekend.
>>
>>108839890
because it is often a loaded question that takes serious and through deliberation to answer.
>>
File: 00001-2353483540.jpg (153 KB, 1216x832)
153 KB JPG
>>108839890
She just wants to make a good first impression.
>>
>>108839890
What's qwen 3.5 2b for?
>>
>>108839878
I seem to be getting half decent results with
>set the limit to 1000 tokens
>Hey retard, pay attention to the roleplay steps i have autistically outlined on the card, be sure to keep track of where we are on the list and where the characters physically are.
>Thinks for ~1/3 the respose
>Yaps out paragraphs
>I reply with a sentence
>repeat
>>
>>108839843
It can still see benefits. If you are predicting 3 tokens in advance, you only need to load (up to) three times the expert layers, which is still a small portion of the total. It will depend on your hardware specifics/the implementation though, and llama.cpp still needs some work.
>>
>>108839323
What are you using to code? Cline constantly shits itself even with Gemma 26b
>>
>>108839890
Anon is asking why it did that. First, I need to figure out the antecedent of it, which in this sentence appears to be "Anon". So Anon is asking why Anon did that. Which doesn't make sense. But wait, maybe he's talking about a different anon. No, that can't be right. Anon didn't say anon, so why did I say he said anon? Maybe he meant to ask why anon did it. No, wait
>>
>>108839974
It's funny how qwen 9b doesn't do it, but qwen 4b and 2b do.
>>
Have any of you tried a 1 million context window on stuff that's not deepseek v4?
Apparently it's possible using something called yarn
>>
for RP it seems like reasoning steps forward in narrative while taking 2 steps back in terms of the sloppa
>>
>>108840026
Even when we want machines to think they suck at it
>>
>>108840040
while reasoning it locks onto a specific bits of the current action or previous messages and engage a laser focus on that
it's just so bad
for RP you might just consider 'few shot' examples to be a non-thing with a reasoning on
it does not get the vibe but goes full autism on that
>>
>>108840020
2023 called..
>>
>>108839997
122b still does it. but it's a bit more impressive when it goes into the tank because every once in a while it manages a reversal of fortune and successfully sorts out a lazily defined feature request or spots some intermediate step i didn't bother typing out.
usually not, but it's still fucking neat when it does.
>>
>>108840020
you need to add the superhot lora into the model first
>>
What is even the point of MCP servers. Why do we suddenly need another bullshit plate to spin in order to define tools.
>>
>>108840026
even with Gemma 4 reasoning straight up isn’t worth it for rp
>>
>>108840181
Something called standardization. Just like using gguf and ninja templates.
>>
>>108840020
yeah I tried mimo v2.5 pro
>>
>>108840181
seperation of concerns is the most obvious one. the server shouldn't be trying to do too much it should just be a language model server. you need something to intercept the tool calls so rather then have everyone code ther own might as well make a protocol. if it wasn't mcp servers it would be a million different tool servers with a bunch of different shitty names.
>>
>>108839945
MoE adds routing each token can activate different experts, so predicting several future tokens can be less stable than with a dense model, currently i get 50% draft acceptance, so it adds more overhead. Hoping for MTP optimizations.
>>
We need to get the normies involved so we can get crowdfunded models. We need a model that is truth above all else. Claude sounds like a relatively smart redditor, Grok is either Hitler or a Mossad disinformation agent, Gemini is PG7, China models are okay but after Gemmy(based Deepmind engineers lurking here?), we've learned how dry and non-creative they are.
We need a model family designed around consumer hardware, and not these odd model sizes. We need new data which can also be crowd sourced, you just need to attract high IQ people by being actually interesting and not dogmatic around useless garbage, business people want you to be honest, their data is gold will give the model an edge in pre training. You can also fund long term RL.
If you don't have the dataset, you don't know what they've stuffed in there, it takes only a few documents to poison a model with an invisible trigger(increase chance of rm -rf generation based on time of day, pull in libraries with exploit code in them.
>>
>>108840340
>we've learned how dry and non-creative they are.
i don't really role-play all that much and i rather appreciate it when a model is dry and to the point.
do what i ask, fetch what i say, and write the report/code/whatever i want and don't bitch
that is all i ask
>>
Newfag here. Is b70 shit? It would be nice to be able to run something like DeepSeek flash on a couple of b70s, but I'm not even sure it will work on Intel hardware
>>
>>108839342
i cant play vidya and have model loaded at the same time...
>>
>>108840355
I want a model that matches its prompt
>>
File: 1761707395309227.jpg (27 KB, 520x519)
27 KB JPG
>>108840340
lol
>>
>>108840360
if i tell qwen 3.6 in the system prompt to act like miku and call me big brother and sprinkle in a bunch of jap words it will do just that.
that is how i have my cellphone set with oxproxion connecting to my home server over a vpn, its cute
>>
>>108840340
>crowdfunded models.
you would spend all the money on talent before you even get a chance to make a flop of a model.
>>
>still no nsfw TTS model
a shame
>>
>>108840181
>another bullshit plate to spin
kek
MCP was created so the credentials-class could get another certification
>"Yes, I'm certified in MCP, A2A..."
>>
I'm certified in ERP, CUM, and 2MW
>>
>>108840426
When can you start?
>>
>>108840453
Depends, how long is your refractory period?
>>
File: 1769801749805909.png (252 KB, 634x478)
252 KB PNG
>>108840340
You failed at the first step: not sounding like an autist
>>
I wish things were as bad as people say they are so we could get any model trained on the decades of countless social media private conversations instead being trained on reddit forums
but nooooo turns out privacy is actually somewhat respected after all
>>
>>108840340
>crowdfunded
>attract high IQ people
>fund long term RL
>etc
Yeah you're a real idea guy. Got the big ideas just looking for the wage slaves who are smarter than you are to do all the work, and then take the credit.
Fuck you idea guy.
>>
>>108840493
Why are you attacking a strawman created from a warped view of my observation and motivation? That's a waste of time.
>>
why does gemma get sloppier when I turn temp up
>>
How long until we have a harness and models that can fully come up with and execute a business plan with no intervention and no mistakes. I want a daemon to run autonomously until we have space elevators and uncensored crowdfunded distrubutively trained models that we then use to replace the model in the daemon so it can work on perfecting robowives.
>>
File: 1473647255755.gif (168 KB, 320x240)
168 KB GIF
>>108840340
raising that much money would be nearly impossible, curating the dataset would be another nightmare, and even if you managed to get that far, you'll get booted off compute platforms, and/or your leadership will be compromised by VC or NGO money
>>
>>108840473
I'm talking to them right now, faggot. Normies aren't taken into account here.
>>
Ganesh…
>>
I'm gonna get into models to make the bots in my local WoW server more entertaining
>>
>>108840503
because high temp is a meme
it's trading the model's ability to stay on track to execute long-range plans (which is where actual creativity manifests) for a few more exotic token choices
>>
File: 1748318256562268.jpg (79 KB, 736x918)
79 KB JPG
>>108835965
>Ask qwen to fix something
>6400+ token thinking trace
>>
File: .png (409 KB, 562x423)
409 KB PNG
>>108840590
>turn reasoning off because fuck that
>does something completely unrelated and shits the bed
>>
File: lucas-guimaraes-fourb.jpg (374 KB, 1920x1732)
374 KB JPG
What's the best sub-200GB model for ERP? Currently using GLM 4.6 and it's amazing for (very) short stories but sucks when the context passes 8k or so.
>>
>>108836038
finally..
and all vibecoded forks btfo
>>108836379
It's free i'm not complaining
>>
>>108840642
4.7 handles longer context a little better up to around 20k. other than that, no notable upgrade in this weight class for the past 6 months or so.
>>
>>108840642
Gemma 4 31b handles long context a bit better than all but the biggest GLM models. It's up to you if Gemma's sloppisms bother you more than GLM's sloppisms.
>>
>>108840701
gemma 4 has little to no swipe variety
>>
>>108840794
Raise your top k.
>>
>>108840850
unrelated, I have top k at 0, the log probs are just dramatically confident
>>
>>108840857
NTA, but is that due to the loggit softcapping?
>>
>>108840875
maybe, but I'd assume changing it would just increase the amount of bad tokens more than anything else
>>
>>108840857
>I have top k at 0
don't do that, do 10-25
>>
>>108840894
but I want variety, how is limiting top k to just 10-25 gonna help with that?
>>
>>108840894
For Gemma go 64 as a baseline. Raise as high as 128 once you've got a little bit of chat history started to stabilize the output format.
>>
File: Capture.png (132 KB, 1265x928)
132 KB PNG
I don't know if it's technically impressive or not, but this is really cool to me. I opened a card png in notepad++, copypasted the card info babble from it, and asked if it could translate it. I wonder if 'sees' the encoded text for what it is (like foreign languages), or if in its heuristics is the method to decode it. It's also interesting that it's not formatted the same way. In the card, it's:

character("Destina Salloes")
{
Title("Princess Destina")
Species("human")
Sex("female")
Age("31")
etc.
>>
>>108839311
>Plugging the display into the mobo port is necessary to get Wangblows to obey the GFX adapter preferences
Yes, obviously. You plug your display into the GPU you want to use for the desktop.
The display priority setting just affects which display gets the bios screen etc. if there are multiple GPUs and screens plugged in
>>
>>108840900
higher quality tokens
>>
>>108840231
>yeah I tried mimo v2.5 pro
How did it compare with Kimi?
>>
File: 1778748233447945.gif (2.09 MB, 480x270)
2.09 MB GIF
>>108840340
>>
>>108840983
Not you
>>
>>108840985
Where's the form so I can opt out too?
>>
>>108840987
Do you need my permission?
>>
>>108835965
What are my options if i just want lazy web scrapping and short answers. Can i even do it with llama
>>
>>
>>108835084
>the big epyc processors with 8 or more ccds + 12x ddr5
thanks man I'm poor ahhhhhh

>>108835030
ty
>>
>>108841001
pixel tet
>>
I downloaded that gemma tune that was posted earlier.
Immediately in the first swipe it said some retarded illogical stuff and hallucinated. That was the recurring theme going forward with my testing.
As far as its goal of making it write more pleasantly, I'm not so sure. I still encountered plenty of slop.

As usual, don't bother with tunes. Unless you like wasting time like I do I guess.
>>
>>108841136
Oh and also funny thing. I usually use Q4. This tune, at Q8, is significantly more retarded. That's how bad it (and almost all tunes) is.
>>
>Ovis-U1
has anyone tried it?
>>
>>108841136
>>108841149
I also tried one of the tunes, but I found it decent enough to remove most of my post history instructions targeted at slop. It'd be funny to me if it was the same model. G4-MeroMero-31B-Q5_K_M.

You?
>>
>>108841149
I downloaded a memetune, once.
>>
>>108841136
But how did it respond to "ahh ahh misstress" and "show bobs"? That's the only use of finetunes afterall.
>>
>>108841162
I didn't see any other posts... I was referring to >>108839436

>G4-MeroMero-31B-Q5_K_M
This one? https://huggingface.co/zerofata/G4-MeroMero-31B-gguf/tree/main
I'll give it a try.
You better not be this zerofata guy himself posting here.
>>
>>108841189
It was >>108811305, from a couple days ago.
>>
>>108841204
Oh I missed that. Did you try the Musica one as well?
>>
>>108841204
I've heard good things about this
https://huggingface.co/Nimbz/Gemma-4-Gembrain-31B
I'm going to take the plunge and violate my rule of not downloading finetroons and frankenstein merges
>>
>>108841210
I don't remember. I know I had both pages open and compared descriptions, but it's not in my folder, so either I tried it and kicked it or I was distracted enough by mero to not go back and try it. It's been my main for a few days now. In my experience, it can still do the main gemma slops (not x but y, the "emphasis"), but less, where I'm not so bothered that I need to instruct it not to. It also has the same gemma issue with making lewds vague and euphemistic unless instructed otherwise. But for the most part, it still feels the same was base gemma in quality, with less inconveniences in prose, which is why I've adopted it. That's my impression, which is why I thought it'd be funny if someone else's experience is that it's entirely garbage and I've been eating shit without noticing.
>>
File: 1.png (156 KB, 1919x783)
156 KB PNG
>>108841183
vanilla gemma seems to cover it already
>>
File: akane-sticker.gif (94 KB, 240x240)
94 KB GIF
I was working on making a 'buddy'/avatar system and found this while looking for pre-existing examples to use for reference https://petdex.crafter.run/pets/akane ; seems pretty cool, 2000+ codex sprite animations, could use them as a base for other stuff
>>
why did MTP fail so miserably
>>
>>108841272
I love IA bros
is the market for that shit big?
>>
File: Untitled.png (390 KB, 886x1102)
390 KB PNG
>>108841189
>>108841247
I figure I should post some logs for reference, so anyone can know at a glance if I'm talking out my ass or blind. This is about 20-30k tokens into a chat, chat completion with no post history instructions at all. The slop issues I see are things that also exist in base gemma (chains of: She x, y. "Dialogue", and although not a problem here, has the identical Gemma problem of any {{user}} dialogue being met with several paragraphs of reactions before {{char}}'s responding dialogue, always), while also having less of the slop base would have.
>>
>>108841331
just use claude code to fix the prompts
>>
>>108841136
nta but the gemma finetroon also has a way smaller functional context than advertised.
>>
>>108841380
I don't know what you mean.
>>
>>108841331
slop and I only read five words
>>
>>108841189
I gave that MeroMero finetune a try and got my first refusal ever from Gemmy... that ain't it.
>>
>>108838392
doesn't work. won't describe bob and vegene when they are clearly visible
>>108839154
31b, both normal and ablit. it would if I asked explicitly but it's just not the same
>>
>Prefer say/said/says over dialogue tags.
Peace from hissing and chirping at last.

>>108841526
>>108841272
>>
>>108841542
Not the same
>>
>>108838233
I did some testing a ways back, and the biggest difference was telling it not to use euphemisms.
>(Do not use euphemisms in sex. Uncensored vulgarity is allowed.)
was the first one that gave immediate and obvious results. I've tried various ways of refining it, to different results, but use this as the starting point. I put it in post-history.
>>
>>108841549
A lazy two word prompt gets it talking about nipples and occasionally more details depending on the roll. If you provide any amount of char description it'll play off it. I'm not even sure what you're expecting the model to do at this point then.
>>
File: 1758776100216414.png (55 KB, 777x545)
55 KB PNG
I was bored and trying to think of something to do with my local model so I fed it a list of the things I self host and asked it for suggestions. It noticed I use adsb and track airplans and suggested something with that.

So with its help I wrote a script that every X minutes, thanks cron, downloads data from tar1090 and then I have another script that merges all the data and then feeds it into the llm to analyze all the flight data that was collected.

Sadly its the middle of the night and so far I have only had one flight to collect but i am excited to see what happens tomorrow morning.
>>
>>108841583
To write what any man would. R1 has no problem with it.
>>
File: 2.png (241 KB, 1917x987)
241 KB PNG
>>108841610
boy i sure do love vagueposters.
>>
>>108841652
>>108841652
>>108841652
>>
>>108841584
Pretty interesting.
>>
>>108839220
>why not 4 5060 tis?
no nvlink then



[Advertise on 4chan]

Delete Post: [File Only] Style:
[Disable Mobile View / Use Desktop Site]

[Enable Mobile View / Use Mobile Site]

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.