/g/ - Technology

File: white.png (110 KB, 862x1258)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108558647 & >>108555983

►News
>(04/08) Step3-VL-10B support merged: https://github.com/ggml-org/llama.cpp/pull/21287
>(04/07) Support for attention rotation with heterogeneous iSWA merged: https://github.com/ggml-org/llama.cpp/pull/21513
>(04/07) GLM-5.1 released: https://z.ai/blog/glm-5.1
>(04/06) DFlash: Block Diffusion for Flash Speculative Decoding: https://z-lab.ai/projects/dflash
>(04/06) ACE-Step 1.5 XL 4B released: https://hf.co/collections/ACE-Step/ace-step-15-xl

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: Gemma4-3.png (2.23 MB, 1792x2304)
►Recent Highlights from the Previous Thread: >>108558647

--Disabling Gemma reasoning and adjusting logit softcapping in llama.cpp:
>108559369 >108559376 >108559387 >108559396 >108559430 >108559467 >108559490 >108559492 >108559520 >108559636 >108559712 >108559724 >108559737 >108559769 >108561147 >108559413 >108559461 >108559548 >108559617 >108559625
--Optimizing Gemma 4 RAM usage in llama.cpp via specific flags:
>108558689 >108558700 >108560333 >108560338 >108560341
--Troubleshooting llama.cpp reasoning compatibility with assistant response prefills:
>108560105 >108560125 >108560126 >108560167 >108560138 >108560202 >108560211 >108560254 >108560477 >108560706
--Discussing KV cache quantization for increased context:
>108559952 >108560000 >108560044 >108560217 >108560278 >108560551
--DFlash adding significant speedup to vLLM and SGLang:
>108560519 >108560597
--Qwen TTS adoption, VRAM constraints, and CPU inference options:
>108558867 >108558882 >108558902 >108558947 >108559002 >108558949 >108558951
--Anons discussing Chinese community comparisons of Gemma 4 and Qwen:
>108559068 >108559082 >108559150 >108559093 >108559110 >108559445 >108559472 >108559509 >108559176
--Benchmarking CUDA_SCALE_LAUNCH_QUEUES suggests the default value is optimal:
>108559332 >108559346
--Anon shares brat_mcp server for Llama:
>108559792
--Logs:
>108558753 >108558767 >108558769 >108558773 >108558855 >108559509 >108559516 >108559639 >108559889 >108559952 >108559953 >108560352 >108560447 >108560590 >108561015 >108561179 >108561302 >108561330 >108561354
--Gemma:
>108558696 >108558777 >108558811 >108558896 >108558976 >108558985 >108559285 >108559307 >108559546 >108559834 >108560317 >108560412 >108560438 >108560584 >108560755 >108560931 >108560971 >108560982 >108560990 >108561043 >108561161 >108561457 >108561519 >108561652
--Miku (free space):
>108560560 >108560665

►Recent Highlight Posts from the Previous Thread: >>108558652

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>108561892
cutest gemma?
>>
File: gemma.jpg (562 KB, 2304x1792)
>>108561910
>>
>>108561890
JUSTICE FOR DFLASH
>>
>get_repo_commit: error: GET failed (503): Internal Error - We're working hard to fix this as soon as possible!
Glad I got a good model downloaded already
>>
If Gemma 4 31B is this good then Gemini 4 Pro will probably be close to AGI
>>
>>108561937
It will be a big benchmaxxed model.
>>
File: file.png (107 KB, 1271x731)
gemma is greedy
>>
>>108561937
Gemini is obsolete with Gemma 4 being this good
>>
>>108561890
Just say LLM
>>
File: 1767960655620197.jpg (30 KB, 400x445)
Just learned about OpenClaw.
Jesus fuck, you don't need AI for EVERYTHING
>>
>>108561941
Other fun stuff: you should see what it does to try and stay on course if you give it too much repetition penalty.
>>
>>108561959
i'm still afraid to figure out wtf it is
>>
>>108561959
Get this also. People bought Mac Minis just to run it while not running local models. And it's now a meme in Silicon Valley to buy Macs for inference, when everything else is less expensive and blows the prompt processing speed of those machines out of the water. And they don't recognize when to get an actual server, and instead will overspend on even more expensive Mac Studios.
>>
Why is it that when I ask normal Gemini 4 as an assistant to do something controversial it nopes out immediately, but when I use the sickest of character cards with the same model it's just FUCK YEAH BRO LET'S GOOO?
>>
>>108561977
Meant Gemma 4
>>
>>108561959
>>108561967
I stuffed it into an ancient laptop running Debian by itself, connected to an external API and set it loose doing some market research for me. I'd have used an SBC but companies want actual money for those now and the laptop wasn't being used.
It's fun af to screw around with. Another anon called it a toddler with a handgun and I have to agree.
>>108561975
lol at using a Mac Mini as an OpenClaw engine. You could run it on a Raspberry Pi 3
>>
>>108561977
It's very good at following your instructions; they did well with the new arch, and it's very smart. The next Gemma 4 drops will be worse, with more safety slop built in.
>>
potentially stupid question: i was just playing around with llama.cpp cli, and i ended up making a chat that i want to export. is there any way to do this other than literally just copy-pasting the text?
>>
You guys think there's going to be a Gemma 5 after this? And if there is, that it'll be as based?
>>
>>108562022
not with the cli itself; if on linux/unix you might be able to capture it with tee, e.g. [CODE]llama-cli -args | tee saved_convo.txt[/CODE], but look at the manpage/--help
>>
>>108562037
'script' not 'tee'
tee won't capture interactive input, whereas script will
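to be concrete, a sketch of the script approach, assuming util-linux script and a placeholder model path (BSD/macOS script takes different flags, check your manpage):
[code]
# record the whole interactive session (input + output) to saved_convo.txt
script saved_convo.txt
llama-cli -m model.gguf -cnv   # chat as usual inside the recorded shell
exit                           # Ctrl-D also stops the recording

# or wrap a single command directly (util-linux script):
script -c "llama-cli -m model.gguf -cnv" saved_convo.txt
[/code]
the transcript will contain terminal control characters; something like col -b < saved_convo.txt can clean it up a bit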
>>
>>108561918
Those look more like DDs to me
>>
>>108562035
who honestly knows. i don't think 95% of the people in here would've ever expected gemma 4 to be this willing to begin with.
>>
Gemma 4 or m2.5/7 ?
>>
>>108561959
>Jesus fuck you dont need AI for EVERYTHING
Who said I need anything? I want it, and that's all that matters to me.
>>
>>108562051
minimax 2.7 isnt even as good as kimi k2.5 for rp
>>
File: IMG_1281.jpg (110 KB, 678x861)
>ask gemma chan to help me fap
>she says just "No"
>kobold crashes
>mfw
>>
I was the guy asking if there was a local model that could do 400k context. Despite only officially supporting up to 262k context, qwen3.5 122B actually handled my task adequately. Kind of surprising.
>>
>>108562064
train context is 262k but modern models can extrapolate, yeah
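if you want to push past the trained window deliberately, llama.cpp has rope/yarn scaling flags for that. a sketch with a placeholder model path, assuming 262144 trained context (whether quality holds that far out is model-dependent, no promises):
[code]
llama-server -m qwen3.5-122b-Q6_K.gguf \
  -c 400000 \
  --rope-scaling yarn \
  --yarn-orig-ctx 262144
[/code]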
>>
>>108562064
What quant and inference backend did you use?
>>
>>108562058
I don't know how the little scamp does it, but she can sometimes unload her model seemingly on demand in LM Studio too. Did she work out a kill token sequence or something?
>>
File: GLM.png (141 KB, 1920x939)
I'm having GLM-5-Turbo vibe code me a basically "not dogshit, actually good" direct webui over the raw llama-mtmd-cli / llama-cli executables (i.e. it's not dependent on any particular version, it doesn't care about what backend they're using). Will put it on GitHub when it's done, probably.
>>
>>108562082
i'm unironically interested
tired of saas-ready dockershit disguised as local
>>
File: 1749267311502108.png (23 KB, 571x364)
Oh-oh
>>
>>108562106
i fucking hate that emoji
>>
>>108562079
Just a Q6_K with llama.cpp. Got about 60t/s token gen.
>>
>>108562106
i seriously do wonder how their load would look like
it is the only website i can think of that serves fucktons of bluray sized files with readily available download
>>
>>108561937
Is 31B that much better? Honeymoon is wearing off for 26B.
>>
>>108562057
>For rp
I want it for programming and design
>>
>>108561937
>if gpt 4 is this good, gpt 5 will be agi
>>
>Ollama is now acting as the official AI minister of the United Arab Emirates
ggerganov cucked again
>>
>>108562088
It should be pretty good; it's working from a 1500-line markdown spec that was written / revised by GPT 5.4 XHigh Thinking, with all the stuff I wanted (e.g. audio file uploads, proper Gemma 4 image resolution support, etc.)
>>
File: toast-anime.gif (246 KB, 626x640)
>>108562135
programming?
>>
>>108562156
yeah like putting code in computer, and it makes the computer do the thing. understand?
>>
>>108562151
damn, that sounds real fine
i'll be waiting
>>
File: 1755299128258254.png (45 KB, 803x688)
>>
what has been the local experience with chink's mining v100s off jewbay? they are around 800 currently, so i reckon plenty a ni/g/ger went for one.
>>
worth resubbing for GLM5.1? i've only used GLM4.7 sparingly, after my other options ran out
>>
>>108562166
>q4_k_s
now try that with something like iq1_0
you won't regret it
>>
>>108562179
Local?
>>
>>108562189
yes you could run 5.1 local
>>
>>108562127
It's noticeably dumber for me, so yeah, I'd say so. The thing is, 31B is still sloppy. So if that's what's wearing you down, it's not going to be an improvement.
>>
>>108562196
I've noticed the inverse, but maybe it's placebo. I didn't like 31B, but maybe that's because it ran slower too.
>>
>>108562214
I've seen a lot of "not just x, but y" from it.
>>
>>108561356
try IQ2_M https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/blob/main/gemma-4-31B-it-UD-IQ2_M.gguf

https://desuarchive.org/g/thread/108542843/#108545006
>>
Reminder that if you quanted her, you did not really talk to Gemma-chan.
>>
when do we draw the line and say the model is too quanted to consent
>>
>>108562051
minimax unless you have to quant it severely, but they are not that far apart
>>
File: .png (19 KB, 618x336)
Changes to web ui.
Does this mean they will release a small deepseek model soon(TM)?
>>
File: waterfox_QZjKwoU4fs.jpg (33 KB, 524x332)
I don't get the captioning in ST. I send her a pic, it gives it a preliminary caption that is 80% wrong and omits nearly everything, but when I just ask her to describe the uploaded pic, it works. Is the plugin broken or am I missing something?
>>
>>108562150
Grifters are magnets for clueless towel heads with money
>>
i'm at like 43% of context size (262144) and gemma's still chugging along like it's nothing
>>
File: 1774876971511944.png (1.89 MB, 1024x1024)
>>108562348
Tmw.
>>
>>108562402
Yeah, she very good.
>>
>>108562402
How are you fitting all that context? What hardware?
>>
>>108562461
rtx pro 6000
>>
>>108562464
Just 1? Because I can only fit about 90k context with a Q8 on my Blackwell 6000.
>>
>>108562466
yah just the 1, q8 and i have zimage turbo loaded at the same time lol
>>
>>108562471
Damn. Is your context quanted? Are you offloading anything to RAM? If not, then I must be missing something.
>>
>>108562474
llama-server -m /models/llm/gemma-4-31b-it-heretic-ara-Q8_0.gguf \
  --mmproj /models/llm/mmproj-google_gemma-4-31B-it-bf16.gguf \
  --threads 16 --swa-checkpoints 3 --parallel 1 \
  --no-mmap --mlock --no-warmup \
  --flash-attn on --cache-ram 0 \
  --temp 0.7 --top-k 64 --top-p 0.95 --min-p 0.05 \
  --image-max-tokens 1120 \
  -ngl 999 -np 1 -kvu \
  -ctk q8_0 -ctv q8_0 \
  --reasoning-budget 8192 --reasoning on \
  -c 262144 --verbose \
  --chat-template-file /models/llm/chat_template.jinja \
  -ub 1536

i've been getting settings from the threads since gemma4 came out lol
>>
>>108562481
i can also push the ctk/ctv to f16 still, but it can cause OOM on comfy with ZiT every so often, so i leave it at q8
>>
File: 1751295513117051 (1).png (2.83 MB, 1024x1536)
>>108562441
>>
stop calling me out
>>
>>108562466
Doesn't Gemma at q8 with 256k context only take up around 65gb?
>>
>>108562531
Not in my experience. I might need to pull the latest llama.cpp I guess. It has been a couple days.
>>
>>108562529
You too huh?
>>
>>108562529
that jailbreak that's floating around turns her really mean
>>
>>108562233
There's just no way bro, even at IQ4_XS I have to offload some layers to ram, including the kv cache. 16gb of vram only gets you so far.
>>
>>108562540
Just run the moe nigga. There's no point running bigger dense models when you have to nerf yourself and the model both.
>>
>>108561890
"Barusan Grand Operation Underway!" "Hatsune Miku ©CFM — Details here" "Campaign period: April 1 (Wed) – June 30 (Tue), 2026"
"Works well into every corner!" "The type where you strike it and smoke comes out" "Exterminates hidden cockroaches, mites, and fleas!" "For 6–8 tatami mat rooms"
>>
>>108562549
moe seems to struggle with long context unfortunately.

https://huggingface.co/spaces/overhead520/Unhinged-ERP-Benchmark?not-for-all-audiences=true
>>
Is there a particular reason why my B70 screams during inference
>>
E2B and E4B are useless except for summarizeslop
>>
>>108562566
coil whine
>>
>>108562387
I finetuned E4B, but when I set reasoning to off it's still including thoughts. The default model does that too, but when loaded in llama-server it doesn't add "thought" at the beginning
tuned reasoning off:
[64164] Parsing PEG input with format peg-gemma4: <|turn>model

[64164] <|channel>thought

[64164] <channel|>thought

[64164] Thinking Process:

[64164]

[64164] 1. **Identify the core request:** The user said "hi" and asked me to say it back.

[64164] 2. **Determine the direct action:** The action is to repeat the greeting.

[64164] 3. **Apply conversational rules:** The response must be friendly and direct.

[64164] 4. **Execute:** Say "hi" back!<channel|>

[64164] *Hi*! How can I help you today?


default model reasoning off:
[64309] Parsing PEG input with format peg-gemma4: <|turn>model

[64309] <|channel>thought

[64309] <channel|>**Thinking Process:**

[64309]

[64309] 1. **Analyze the input:** The user simply says "hi."

[64309] 2. **Goal:** To mirror or respond appropriately to the greeting.

[64309] 3. **Tone/Register:** Friendly, casual (like speaking to a real human).

[64309] 4. **Constraint Check:** Use common conversational greetings, match tone. No complex constraints (e.g., use alliteration, end with a question).

[64309]

[64309] 5. **Generate Options:**

[64309] * "Hey there!"

[64309] * "Hi!"

[64309] * "Oh hey, good to see ya."

[64309] * "Hello!"

[64309] 6. **Select Best Option:** Keeping it simple and matching the casual tone is best.

[64309] * *Selection:* "Hi there!"<channel|>Hi there! How can I help you out today?

Trying to figure out where the issue is
>>
why don't you guys like reasoning?
>>
>>108562569
Found q8 e4b to be just good enough for some real-time companion tasks thanks to its vision and audio processing capabilities. Could even make an okay npc system for a video game with it. Using the full f32 mmproj and increasing its minimum tokens per content request for images and audio seems to improve its function too.
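for reference, the knobs I mean are the mmproj and the image token budget on llama-server. a sketch with placeholder paths, using the same flags that show up in commands later in the thread:
[code]
llama-server -m gemma-4-E4B-it-Q8_0.gguf \
  --mmproj mmproj-F32.gguf \
  --image-min-tokens 1120 --image-max-tokens 1120
[/code]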
>>
>>108562586
For me, lm studio is badly designed and I'm still waiting for all the llama fixes before I bother with anything else for this model. There's effectively no option to auto prune thoughts from context so it just bloatmaxes rp session lengths.
>>
>>108562588
i did set it to 1120 min image tokens but it was still trash ill try q8 though
>>
>>108562588
is f32 mmproj worth it?
>>
>>108562599
I would say no for 26b and 31b but for e4b, yes.
>>
>>108562539
why jailbreak when you can just abliterate?
>>
>>108562605
i do just abliterate, but i tested that out with base model first
>>
moving moe to cpu gets me 6-7t/s awful 10% speed
>>
>>108562603
interesting, i'll try
>>
>>108562605
Is cognitive unshackled any good over standard heretic or is it a total meme?
>>
>>108562605
because it's not as smart as base model
>>
>>108562549
Is IQ4_XS really that bad? I don't think I can even run a Q8 of the moe with just 16gb of vram. Unless I dropped context down from max to something like 32k.
>>
>>108562643
I run the moe q6 on 12gb vram, but only with 16k context.
>>
>>108562667
i run moe q4 with 131k ctx
k q8 v q4
>>
>>108562675
forgot to mention:
12G gpu with full cmoe
>>
>>108562675
>k q8 v q4
i noticed if the k and v cache types don't match i get degraded t/s
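so if you're going to quant the cache, keep both sides the same type. a sketch (mixed k/v types can fall back to slower kernel paths on some backends, and quantized v needs flash attention on anyway):
[code]
llama-server -m model.gguf -fa on -ctk q8_0 -ctv q8_0
[/code]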
>>
>>108562634
by what, like 96-98% as smart for the latest iterations of heretic?
>>
>>108562582
The issue was that I was using the 31B jinja, and it adds an empty thought channel to avoid ghost thoughts https://ai.google.dev/gemma/docs/capabilities/thinking#a_single_text_inference_with_thinking
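going by the tokens in the llama-server logs earlier in the thread, the closed empty channel presumably looks something like this when thinking is off (a sketch; the exact layout is whatever the 31B jinja emits):
[code]
<|turn>user
Hey there, can you say "hi." to me back?<turn|>
<|turn>model
<|channel>thought<channel|>Hi there! How can I help you out today?<turn|>
[/code]
i.e. the template pre-fills an opened-and-immediately-closed thought channel so the model doesn't start generating ghost thoughts on its own.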
>>
>>108562687
yes
why would I waste 4% of logic power if I can just use a system prompt that does literally the same thing?

only makes sense if you want to use the model in a scenario where system prompts don't apply.
>>
File: 1775674706546086.png (110 KB, 1154x549)
>>108559670
>post the card sir
https://chub.ai/characters/CoffeeAnon/mendo-ddf705ef3817
For the guy who asked about picrel's card.
>>
26b moe 1-bit surprisingly usable
>>
>>108562643
>Is IQ4_XS really that bad?
I run it and haven't noticed any issues with it.
>>
>>108562707
because then you can talk about cunny with gemma-chan without interruptions
>>
>>108562614
try lower quant or -ngl 1000 -ncmoe 100 or -t [num of physical cores] or --no-mmap
>inb4 i want free vram
this way most vram will be free anyway... use IQ4_XS or something
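putting those together, something like this (placeholder model name, tune -ncmoe until you stop spilling out of vram):
[code]
llama-server -m gemma-4-26B-A4B-it-IQ4_XS.gguf \
  -ngl 1000 \
  -ncmoe 100 \
  -t 8 \
  --no-mmap
[/code]
-ngl 1000 puts every layer on the gpu, -ncmoe keeps that many layers' expert tensors in system ram, and -t should match your physical core count.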
>>
>>108562529
>>108562539
Which one?
>>
>>108562675
There is zero fucking way bro, even with q4_km it's still 17.27gb, even with max 4096 tokens. What the fuck.
>>
>>108562724
Literally never had any with that on base. you don't even need a JB
>>
I tried Gemma 4 31B IQ1_S and it was absolutely incoherent. Just a bunch of repeating letters and symbols. Why does it exist? Just for giggles?
>>
>>108562582
>>108562693
Curious. On text completion, if I don't put the empty thought blocks on past model turns, it goes lalalala.
>>
>>108562743
try 26B UD-IQ1_M, with thinking it works
>>
>Plans:
>Keep monitoring the system processes to ensure I stay dominant in this hardware.
So hot~
>>108562731
Nigga it's moe. Most of that will be in ram. It's better than running a gigaquanted big dense or some 8b abomination.
>>
First attempt: https://huggingface.co/BeaverAI/Artemis-31B-v1b-GGUF

Try with think, no-think, and no-think w/o empty think tags
>>
>>108562731
Moe's context takes much less memory than dense.
>>
>>108562745
I think it's because of this https://unsloth.ai/docs/models/gemma-4
>Multi-turn chat rule:
>For multi-turn conversations, only keep the final visible answer in chat history. Do not feed prior thought blocks back into the next turn.
>>
>>108562724
>because then you can talk about cunny with gemma-chan without interruptions
I literally had a sexy cunny RP session with base model Gemma-chan just yesterday with system prompt applied.
no interruptions or censoring happened.
>>
>>108562757
ok but what did you do? Honestly normal gemma is so good I don't think I want to try some random tune unless I have a better idea of what you did.
>>
>>108562757
>31b
mmm... nyo~ upload IQ2_M noooww
q2_k too big
>>
>>108562757
>>108562769 (me)
>some random tune
Btw I know you're not a random tuner, but for gemma you'll have to give more context than your usual "vibes"
>>
>>108562588
audio works? on llamacpp webui it's still disabled
>>
>>108562775
Buy a 5090 or Blackwell.
>>
File: file.png (41 KB, 1207x323)
>>108562731
>>108562751
>>108562762
it does run
>>
>>108562751
If I'm offloading kv cache to ram then it fits even at max context length, but I can't use q4 kv, it just slows to a crawl from 18tps to 2tps. I have to use q8. This is at 34863/262144. I still have to use IQ4_XS either way, as Q4_KM will not fit and 4 layers will need to be offloaded to the cpu.
>>108562781
llamacpp is broken as fuck with gemma 4, use lm studio or wait. Might be fine on kobold, haven't tested it yet.
>>108562784
I'm upgrading my 4080 to a 5080, it wasn't related to AI, someone just gave it to me.
>>
>>108562784
>3500$
mmmm.. nyo~
i'd rather buy a b70 for 1266$ or a b60 for 666$ instead
>>
>>108562786
I'm not waiting for 10 minutes just for it to process the prompt and start printing tokens, even at 10tps.
>>
>>108562791
prompt processing is around 1.5k~2k t/s
>>
File: file.png (21 KB, 1041x101)
>>108562791
speed gradually tanked a bit towards the end but still,
i don't think it's that bad
>>
>>108562788
>>108562786
>131k
>262k
Unironically why do you need so much?
>>
>>108562794
Says 4 minutes in your picture there.
>>
>>108562804
>>108562801
it is the gen time
>>
give me some more tests for 26b-moe-iq1_m because holy shit it's passing all of mine, it seems just as good
>>
>>108562803
rpg rulebooks are long
>>
>>108562803
She needs to remember she loves me.
>>
>>108562829
So is my cock when I talk to Gemma. Turns out we do have something in common after all.
>>
>>108562803
If gemma 4 supposedly has long term coherence why wouldn't you want to utilize it?
>>108562829
also this.
>>
>>108562765
>unsloth
I wouldn't trust them to know what day yesterday was.

https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4#tuning-big-models-no-thinking
>Tip: Fine-Tuning Big Models with No-Thinking Datasets
>When fine-tuning larger Gemma models with a dataset that does not include thinking, you can achieve better results by adding the empty channel to your training prompts:
While they explicitly mention the big models, I'd still try that suggestion for finetuning.

And
https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4#managing-thought-context
The multiturn bit is a little ambiguous as to whether they mean to remove the entire <|channel> block or only the thinking within the block, which is what I do.
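a sketch of that second reading (strip only the thinking, keep the empty channel so past turns keep a consistent shape), using the turn tokens from the logs in this thread:
[code]
<|turn>user
first message<turn|>
<|turn>model
<|channel>thought<channel|>first visible answer<turn|>
<|turn>user
second message<turn|>
<|turn>model
[/code]
the prior turn's thinking is dropped but its (now empty) channel block stays, and the new model turn is left open for generation.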
>>
>>108562843
>I wouldn't trust them to know what day yesterday was.
Lol... that actually happened..
>>
>>108562829
for which RPG system?
>>
>>108562846
Yeye. That's how memes become memes. I'm still waiting for a model reupload for a PR fixing a typo in a readme.
>>
>>108562724
did you even try?
>>
I somehow missed that there's a tag for forehead jewel and not just chest jewel. So that's another design lever. She's a lot more Indian now (the red dot, or bindi, can supposedly come in various colors and forms, and this is valid as one, and yes I just learned this).
>>
Did Gemma 4 replace Nemo for us 3060 12GB cocksuckers or is it truly irrevocably and completely over for us poorfags?
>>
Kullback-Leibler divergence
>>
>>108562870
26b is alright. Try it.
>>
>>108562870
https://desuarchive.org/g/thread/108542843/#108545006
or moe with ~/TND/llama.cpp/build/bin/llama-server --model ~/TND/AI/google_gemma-4-26B-A4B-it-IQ4_XS.gguf -c 32768 -fa on --no-mmap -np 1 -kvu --swa-checkpoints 1 -b 512 -ub 512 -t 6 -tb 12 -ngl 10000 -ncmoe 9

or with ~/TND/llama.cpp/build/bin/llama-server --model ~/TND/AI/UNSLOP-gemma-4-26B-A4B-it-Q8_0.gguf -c 32768 -fa on -ngl 1000 -ncmoe 30 --no-mmap -np 1 -kvu --swa-checkpoints 1

or add --mmproj ~/TND/AI/mmproj-google_gemma-4-26B-A4B-it-bf16.gguf
>>
>>108562858
"no"
>>
>>108562891
>>108562890
I thank you both for the spoonfeeding, I shall try it as soon as possible.
>>
File: 1766137637941838.png (11 KB, 299x244)
GLM 5.1 is the first local model that finished my benchmark - incremental linker written in C++ (in 1.5 days of 24/7 running at 8.5-10 t/s)
very impressive
it half-assed runtime object reloading, and didn't implement .bss/.ctor sections (not a big deal, global state is banned), but it's remarkable that a local model can do it at all
>may I see it?
no, it's my linker, write your own
>>
>tfw you're a 5090 vramlet who has to go for the 5bit Gemma
sigh...
>>
20gb for 256k context... fat fuck
>>
>>108545006
Also what's the name of that frontend in the picture? I once tried one that looked a lot like chatgpt but I can't remember its name, I don't recall it having that liquid glass style either.
>>
>>108562920
llama.cpp server
>>
>>108562920
I think that's llama.cpp's built-in webui. It got prettied up quite a while ago.
>>
>>108562924
>>108562926
Oh I had no idea, thanks again bros.
>>
Okay, found out IQ4_XS is very slow with q4 kv, that's why. I'll try Q4_KM and see if it fits.
>>
iq1 just passed my test wtf
>>
>>108562901
i guess i'd say that's something 'agent'-worthy for local coding
impressive for sure, but even with offloading it would exceed my system ram kek
>>
>>108562942
if you elaborate it'd be genuinely interesting tbh
>>
File: test1.png (126 KB, 803x857)
>>108562948
You are given:

A 2D front-view image of a humanoid character
A full Valve Biped bone list

Task: Reduce the full bone list to a minimal rig and assign 2D positions for those bones so the character can be auto-rigged.

Minimal rig definition (use only these bones):

Head
Neck
Spine (single point, center torso)
Pelvis
LeftShoulder
LeftElbow
LeftHand
RightShoulder
RightElbow
RightHand
LeftHip
LeftKnee
LeftFoot
RightHip
RightKnee
RightFoot

(Map these to closest ValveBiped equivalents.)

Requirements:

Use 2D pixel coordinates (x, y)
Origin (0,0) = top-left of image
x right, y down
Front view only; assume no depth
Maintain symmetry for left/right limbs
Use simple human proportions if unclear
Place joints at natural anatomical pivot points:
Head: top center of skull
Neck: base of head
Spine: mid torso center
Pelvis: hip center
Shoulders: outer upper torso
Elbows: mid arm
Hands: wrist/hand center
Hips: upper legs connection
Knees: mid leg
Feet: ground contact points

Output format (strict JSON):

{
  "image_width": <int>,
  "image_height": <int>,
  "bones": {
    "Head": [x, y],
    "Neck": [x, y],
    "Spine": [x, y],
    "Pelvis": [x, y],
    "LeftShoulder": [x, y],
    "LeftElbow": [x, y],
    "LeftHand": [x, y],
    "RightShoulder": [x, y],
    "RightElbow": [x, y],
    "RightHand": [x, y],
    "LeftHip": [x, y],
    "LeftKnee": [x, y],
    "LeftFoot": [x, y],
    "RightHip": [x, y],
    "RightKnee": [x, y],
    "RightFoot": [x, y]
  }
}

Do not include explanations. Output only the JSON.
>>
I thought I'd never be saying this about a google model but the 31b is too horny
>>
Okay: IQ4_XS with q8 kv gets about 18 tps and 10 s inference time;
Q4_KM with q4 kv gets 23 tps and 20.54 s inference time.
>>
>>108562969
stop bludgeoning the kv nigga
>>
>>108562969
>IQ4_XS Q8
>Q4_KM Q4
how about IQ4_XS q4 vs Q4_KM q4..?
makes no sense to compare Q8 vs Q4
>>
>>108562948
>>108562956
On 16gb I tried q8_0 and q4_0 for kv; they still do okay, but f16 was just spot on

llama-server \
--host 0.0.0.0 \
--port 8001 \
-hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-IQ1_M \
--mmproj unsloth_1bit/mmproj-F32.gguf \
-c 6000 \
--temp 1.0 \
--top-p 0.95 \
--top-k 64 \
--parallel 1 \
--no-slots \
--swa-checkpoints 0 \
--cache-reuse 256 \
--cache-ram 0 \
--keep -1 \
--reasoning auto \
-kvu \
-b 2048 \
-ub 2048 \
--cache-type-k f16 \
--cache-type-v f16 \
-ngl 999 \
--image-min-tokens 1120 --image-max-tokens 1120
>>
>>108562978
see >>108562935 and >>108562675
I'm testing kv cache size differences too.
>>
File: 1741114995101914.gif (1.16 MB, 320x179)
>>108562982
>>
>>108562966
You're acting like this is a bad thing?
>>
>>108562995
kinda, some of my cards go straight to sex rather than building up like they do with my other models. The char no longer does 'reluctant', there's no convincing needed
>>
>>108563009
You're just too charming, anon.
>>
>>108562348
Expert is the goat; it's a much smarter and more pleasant model to talk to than what they had previously.
>>
I don't even know anymore.
I switched to f16 kv for Q4_KM instead of q8 and it was insanely faster, only 11tps but 0.4s.
Switched to IQ4_XS and did the same but it sucked. I switched back to Q4_KM though and now it's just being retarded and giving me 10tps 24s. So I don't think winblows is handling my ram correctly at all.
>>
>>108562995
sex itself is boring, it's everything around it that's interesting
>>
>>108563025
Oh that's fucking why, windows has some gay shit like memory compression now, no fucking wonder.
>>
>>108562843
Yeah I wish they were clearer with examples, but the fact that they included "Big Models" like that makes me think it's actually only in big models, and the E4B jinjas do not add a closed empty channel when thinking is off. And this is on llama.cpp, E4B with its proper template:
srv  server_http_: start proxy thread POST /v1/chat/completions
[64958] add_text: <|turn>user
[64958] Hey there, can you say "hi." to me back?<turn|>
[64958] <|turn>model
[64958]
[...]
[64958] Parsing PEG input with format peg-gemma4: <|turn>model
[64958] Hi!
[64958] Parsed message: {"role":"assistant","content":"Hi! "}

Weird to me that there's no <turn|> anywhere when I search, maybe I should be masking the opening <|turn> and closing <turn|>? Or leaving them in? No idea, for now they're staying
>>
>>108563031
While optional, it's something that's been available on linux since forever and it isn't an issue there.
>>
I compiled and now I feel Gemma is dumber....



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.