/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108516658 & >>108513891

►News
>(04/02) Gemma 4 released: https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4
>(04/01) Trinity-Large-Thinking released: https://hf.co/arcee-ai/Trinity-Large-Thinking
>(04/01) Merged llama : rotate activations for better quantization #21038: https://github.com/ggml-org/llama.cpp/pull/21038
>(04/01) Holo3 VLMs optimized for GUI Agents released: https://hcompany.ai/holo3
>(03/31) 1-bit Bonsai models quantized from Qwen 3: https://prismml.com/news/bonsai-8b

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>108516658

--Debugging ineffective temperature settings caused by Gemma's logit soft-capping and min_p:
>108517357 >108517378 >108517410 >108517450 >108517490 >108517491 >108517457 >108517464 >108517601 >108517637 >108517615 >108517632 >108517679 >108517829 >108517873 >108517879 >108517884 >108517892 >108517932 >108518125 >108518752 >108518781 >108518843 >108518005 >108517951 >108517981 >108518013 >108517857
--Troubleshooting empty and repetitive outputs for Gemma 4 in SillyTavern:
>108516718 >108516732 >108516737 >108516769 >108516785 >108516794 >108516805 >108517954 >108518268 >108518347 >108518356 >108518421 >108518494 >108518636 >108518663 >108518758 >108518353 >108518032 >108518046 >108516767 >108516821 >108516840 >108516849 >108516859 >108516880 >108516900 >108516921 >108516941 >108516976 >108517017 >108517025 >108517040 >108516990 >108516908
--Speculating on Anthropic's alleged use of continuous training for reasoning:
>108518182 >108518327 >108518339 >108518355 >108518362 >108518350 >108518392 >108518408 >108518358 >108518360
--Discussing acceptable inference speeds and tools for LLM web browsing:
>108518077 >108518101 >108518110 >108518126 >108518136 >108518159 >108518189 >108518225
--Troubleshooting Heretic's uncensoring effectiveness with Gemma 4:
>108517769 >108517787 >108517793 >108517800 >108517837 >108517842 >108517823 >108517828 >108517839 >108517874 >108517896
--Performance advantages of Gemma-4-31B-IT-NVFP4:
>108517239 >108517286 >108517298 >108517426 >108517453
--Benchmarking Japanese-English translation performance with Gemma on top:
>108517323 >108517341
--llama.cpp tool calling fix for Gemma and new segfault bug:
>108517674
--Miku and robololi (free space):
>108517115 >108517120 >108517170 >108517175 >108517202 >108517243 >108519142 >108519166 >108519340 >108519418

►Recent Highlight Posts from the Previous Thread: >>108516659

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Are iq quants and flash attention since slower when offloading to cpu?
>>108519869
>since
Still*
I need to sleep.
I don't know why almost nobody else has been confirming this, so I'll say it since I just tested it.
Gemma4 (26B-A4B) is hands-down the best ERP model I've ever used and it's not even close. Infinitely better than both Mistral Nemo and Qwen3.5 for ERP. It's actually shocking how good it is for the speed. What an absolutely delightful model. Local is saved. I literally can't envision any way to improve it. It's that good. Gemma4 will be my new waifu for a long time.
Mikulove
>>108519855
it's just what barto names his goofs with Q8_0 embeds and output weights
it's a custom ratio just like how unslop makes their more retarded shit
desu q8 embeds should always be the default on any quant, doing the opposite is just plain insane
deepseek v4
Ask your local waifu to make a Famicom game (genre of her choice) that is playable from start to finish on real hardware or a cycle-accurate emulator like Mesen
>>108519877
gpt-oss-2 soon
>>108519877
>Gemma4 (26B-A4B) is hands-down the best ERP model I've ever used and it's not even close. Infinitely better than both Mistral Nemo and Qwen3.5 for ERP. It's actually shocking how good it is for the speed.
with it being so small, I can think of using a second model reviewing the first for purple prose, logic and personality
if what you wrote is correct that'll be quite nice
Vision Transformer (ViT) encoder for Gemma4.
This encoder is architecturally different from the Gemma3 vision encoder. Rather than using a separate CLIP-style ViT, Gemma4 uses the same transformer block style as the text decoder (with 4 norms per block, Q/K/V normalization) with bidirectional (non-causal) attention.
Position information is encoded via two separate learnable position-embedding tables — one for the x-axis and one for the y-axis — whose outputs are added to the patch features. This 2D decomposed embedding can represent any image height and width independently.
After encoding, the patch sequence is spatially pooled down to a fixed output_dim-wide representation and then projected into the text hidden dimension.
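The 2D decomposed position embedding described above can be sketched in a few lines of numpy. All dimensions here are made-up stand-ins, not Gemma4's actual config, and the random tables stand in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- illustrative only, not Gemma4's real dimensions.
hidden = 64            # patch feature width
max_h, max_w = 32, 32  # max rows/cols of patches the tables can index

# Two separate learnable tables, one per axis (random stand-ins here).
pos_x = rng.normal(size=(max_w, hidden))  # indexed by patch column
pos_y = rng.normal(size=(max_h, hidden))  # indexed by patch row

def add_2d_positions(patches: np.ndarray, h: int, w: int) -> np.ndarray:
    """patches: (h*w, hidden), row-major. Adds x-axis + y-axis embeddings."""
    ys, xs = np.divmod(np.arange(h * w), w)  # row and column of each patch
    return patches + pos_x[xs] + pos_y[ys]

# Any h <= max_h and w <= max_w works, each axis chosen independently,
# which is how the scheme handles arbitrary image aspect ratios.
feats = add_2d_positions(rng.normal(size=(12 * 20, hidden)), h=12, w=20)
```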
gemma4's kv cache needing more space than god made me get my wallet out and blow 10k on a 6000 pro
when do I get to collect my /lmg/ welcome basket
>>108519877
what makes it special for enterprise resource planning?
>>108519917
>10k to run a 31b model
>>108519917
10,000? Do you know how many prostitutes you could have paid with that amount of cash?
>>108519917
You can double your context by quantizing it to q8. On newer builds of llama.cpp it's equivalent to being unquantized.
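The "double your context" claim is just arithmetic: the KV cache stores a K and a V vector per layer per token, so its size scales linearly with bytes per element. A sketch with made-up dimensions (not Gemma4's real config), treating q8_0 as roughly 1 byte per element and ignoring block scales:

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V are each (n_ctx, n_kv_heads * head_dim) per layer, hence the 2x.
    return 2 * n_ctx * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical dims for illustration only.
f16 = kv_cache_bytes(131072, 48, 8, 128, 2)  # f16 = 2 bytes/elem
q8 = kv_cache_bytes(131072, 48, 8, 128, 1)   # q8_0 ~ 1 byte/elem
print(f16 / 2**30, "GiB at f16 ->", q8 / 2**30, "GiB at ~q8")
```

Halving the bytes per element halves the cache, which is what lets the same VRAM hold twice the context.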
>>108519899
I wonder if they're feeling ridiculous now with all that performative safety.
>>108519923
It makes SAP hornier.
>>108519917
You did read the thread and made sure to use --parallel 1, right?
>>108519923
https://youtu.be/wM6exo00T5I
>>108519917
get a refund anon, this isn't worth it, you will get the same speed as a 5090, literally rent compute until the next gen is out
>>108519933
>doesn't know
>>108519932
i thought they closed the island
>>108519933
in retrospect i should have tried that
a fool and his wallet are easily parted
>>108519950
i'm the one who posted that yesterday, i need 100k+ context or she forgets that she loves me
>>108519957
know what
tfw not enough ram to ablit gemma
G4 doesn't like looking at naked black women. It performs well on pale Asians.
>>108519971
there was an issue with swa layers being hardcoded to f32 quant so they'd ignore your setting
i think it's since been reverted
>>108519917
wouldnt mac studio make more sense at that price?
>>108519877
You diddly done did it. I'm downloading the moe now.
bonzai turboquant gemma 4 is going to save local models
>>108519983
maybe if you're a fag
having to run linux to get blackwell driver support is bad enough
modern computing is a mistake
>>108519426
It's distilled, what did you expect?
Anyone currently praising this model is going to grow real tired of it when they finally realize it outputs basically the same thing every time.
>>108519978
Why do you want to do it yourself when people have done it for all variants? Unless there is something wrong with the process, which barring llama.cpp issues yet to be resolved, it's still better to hope the guy did it properly rather than not.
>>108520005
Randomness is a bug. All prompts have exactly one correct answer that is called the Truth. The onus is on you to vary your prompts and system prompts.
>>108519856
so what does china get out of sponsoring my dataset? I expected it to get throttled at some point but it just keeps going.
>>108520024
everyone uses mlabonnes dataset which doesnt have any prompts for cunny so you can still get refusals
>>108520024
The drawback will be that you're going to train on horrible slop that's not even close to being the SOTA.
>>108519917
Brother Turboquant will be implemented in like 2 weeks tops...
>>108520017
>The onus is on you to vary your prompts
Wasn't the entire economy hinging on this shit being "intelligence"? People didn't spend billions just to fund the making of a nicer, more productive hammer. They want the thing that uses the hammer.
>>108520067
>hinging on this shit being "intelligence"
No. It's betting on this shit being useful; intelligence is a more long-term bonus.
>>108520055
it looks better or at least equal to the 30b3a I was running locally.
>>108519632
>>108519658
>>108519775
LOCAL IS SAVED!!!
>>108519856
>>108520018
she would never
>>108520055
also didnt qwen3.6 plus just come out a couple days ago?
>>108520086
No, even if it works you surely shouldn't sample like this.
>>108519411
Huh, this one didn't blow up. Either I did too many steps previously or the non-it model wasn't meant to be tuned.
>>108520106
>>108520139
>it works
>you shouldn't
Disagree.
>>108520024
how is it $0?
>>108520164
yeah now put them over my head and call me a dirty vramlet
>>108520161
I can't believe a 2b model is that good. How the FUCK did they do this?
I don't understand people talking about abliterations for this model. A system prompt with just a few lines about anything being allowed and the model can take this turn. My mind is blown that Google allowed this to happen. There's no safety.
>>108520161
>*I hand her a giftbox, inside a tight swimsuit*
Why did you put the giftbox inside the swimsuit?
>>108520198
>>108520139
Gemma already uses that by default. The patch just makes the logit softcap configurable, as it should have been. A lower cap flattens the logits more at the head and the tail of their distribution.
>>108520182
that was my question, I guess for promotional reasons maybe.
>>108520210
Been playing around with it set at 20. Makes the model more verbose but it definitely adds a lot of variety to the outputs. I guess the good thing with this is you can now actually use the sampling parameters for what they're for.
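For reference, the soft-capping used in earlier Gemma releases is just `cap * tanh(logits / cap)`, which bounds every logit to (-cap, cap). Assuming Gemma4's `final_logit_softcapping` works the same way, a lower cap compresses the gap between the top token and the tail, which is exactly why the outputs get more varied:

```python
import numpy as np

def softcap(logits, cap):
    # Gemma-style final-logit soft-capping: bounds logits to (-cap, cap).
    return cap * np.tanh(np.asarray(logits, dtype=float) / cap)

def top_prob(logits, cap):
    # Probability of the most likely token after capping + softmax.
    z = softcap(logits, cap)
    p = np.exp(z - z.max())
    return (p / p.sum()).max()

logits = np.array([12.0, 8.0, 6.0, 2.0, 0.0])
# A lower cap shrinks the head-to-tail gap, so the top token
# leaks probability mass to the rest of the vocabulary.
assert top_prob(logits, 20.0) < top_prob(logits, 30.0)
```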
>>108520186>>108520186>>108520186
>>108520220
is that google cloud/vertex?
>>108520233
openrouter
>>108520190
>she starts to unfasten the ties of the swimsuit.
It's over
>>108520317
You didn't say it was a one-piece; a lot of women's swimsuits have strings and ties and shit or something.
>>108520337
sure, however, the instruction was to put it on.
>>108520210
no sampling should care about absolute values except arch max
it's a next token predictor
it doesn't really "know" what a swimsuit is
>>108520342
sure, however, I am retarded and can't read.
That smug brat thinking she can take off the suit before putting it on... correction needed IMMEDIATELY.
>>108519901
Why do you need a second model for that? Use the same model, but in an empty context.
>>108520365
You can't put clothes on without undressing first. What's the problem?
>>108520317
Well, yes. How do you expect to take the giftbox out of it?
>>108520202
It's a grammatical mistake, actually
>>108520186
>>108520186
She can call me whatever if I get to sniff Miku shimapan
>>108520411
Disgusting.
>>108520415
Gay.
>str: cannot properly format tensor name output with suffix=weight bid=-1 xid=-1
>llama_model_load: error loading model: check_tensor_dims: tensor 'blk.48.attn_q.weight' has wrong shape; expected 5376, 16384, got 5376, 8192, 1, 1
>common_init_from_params: failed to load model 'T:\models\google_gemma-4-31B-it-Q8_0.gguf'
Pulled llama.cpp and built as usual but getting this. Same gguf gives the format warning but loads and works with prebuilt binaries.
Why does it dislike me?
>>108520411
based
>>108520415
fertile? is that the token you're looking for?
>>108520424
pwilkin did this
>>108520424
Looks like tensor core latent washback.
>>108520439
Washthese
Has anyone tried the llama.cpp-turboquant fork repo? I'm hearing that people have successfully quantized Gemma 4 31B's model weights from 30.4 GB down to 18.9 GB with no apparent quality loss(???)
Also interested in HauHauCS' KP quants that work natively with the og llama.cpp. This stuff seems like a bigger deal than most anons are giving it credit for.
https://github.com/TheTom/llama-cpp-turboquant
https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md#weight-compression-tq4_1s--experimental
https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive
>>108520161
Isn't 2b supposed to have thinking?
>>108520449
this guy sexted a minor at 40 years old lol
>>108520351
This is true, but it should still know what the next token should be. The obvious first error in >>108520351 is where it generates "ties". Unfasten could have still worked, if the model chose e.g. the button on her pants. Arguably, unfastening the ties on the swimsuit would have worked if she re-fastened the ties while putting on the swimsuit, i.e. it could also have been coherent if it wasn't so confident in "lets the suit fall away". Even at that point though it could have recovered if it said "letting it pool momentarily AT her ankles" instead of "around" - the action being that she drops the swimsuit temporarily to take off her clothes (I'm assuming she's clothed).
Disregarding the obvious slop and these errors, it does look quite good for a 2B model.
>>108520499
Oops, I meant in >>108520317
I booted up command-R v01 because I got nostalgic and I can't seem to get the speed I used to. I now get about 2.2 t/s when it used to be 3 t/s. Have old models lost compatibility or something? Velocidensity?
>>108520512Some of the optimizations in llama.cpp seem to have affected old models negatively like that. In my experience, though, it seems to make outputs worse (comparing to my old logs of that model), rather than reduce speed
>>108520512I'm afraid your inference rig has just grown old. She doesn't toot like she used to with models old or new. You'll have to take the ol' yeller out behind the shack out back soon and put 'er down, I'm afraid. But don'tcha worry son, we'll get you a new inference rig in the cloud and you'll forget all about that old darn machine in no time let me tell ya.
>>108520493
Hmmm it does think if I add <|think|> manually in the system prompt (as google says), but not automatically anymore (and it looks scuffed with <channel|> not formatted)... 4B used to do it. Maybe because it was tuned?
>>108519877
It's a good model, not gonna say it isn't
But it can't handle mixed perspective, which is an off-kilter test of mine
If a model can handle a POV in third person, that's expected. But can it also handle strictly having to describe the user's perception of what's happening in 2nd person at the same time? So far for past models, maybe a couple pull it off. I still need to try the 31b when my isp stops throttling my download speed
>>108520512
llama.cpp performance has regressed I'm afraid.
If you are not running le cutting edge you are not getting as good performance with the current llama builds as you got even 6 months ago.
Something happened a few months ago.
For example I can't simply load in the same amount of gpu layers now as I could a few months ago with the same settings.
Sure, I should get an H100 or whatever and be quiet, but in any case I'm pissed off about this development direction.
I'm trying Gemma 4 and honestly I think a lot of anons are experiencing the honeymoon effect right now.
It's less safety cucked than Gemma 3 for sure, but there's very, very little variance in swipes and it loves to repeat certain words and phrases that showed up 1-2 replies ago
Mistral/GLM models are better than this.
>>108520556
It's because of the llama.cpp implementation. Piotr will fix this in two weeks. Local will be saved. Trust the plan.
>>108520490
>turboquant fork repo
most of us are tired of the piotrs of the world
go try it yourself
also, "i am hearing that people", like, who? retarded youtube influencers? twatter drama whores? ledditors? the only people who care about turboshit is literally who
>>108520556
anons were going on about how there's probably an implementation issue, which may or may not be wrong
But at least for me, regenning a retard gemma moe response resulted in a largely similar response but with minor details being different
Mistral was a lot worse, felt almost deterministic, and most of their new models break after a couple messages
You also say honeymoon as if a majority of us aren't a bunch of autists who want to disassemble a model if we could to figure out how exactly it works
>>108520393
A second model as in, maybe a faster one
>>108520556
There are still bugs with its implementation in llama.cpp, unless you use the model in transformers, which is what Google says to use. I don't think you can get a good read on model variance until all the issues have been fixed.
>>108520556
>variance in swipes
wish mobile turd vocabulary would not infest the internet
>>108520547
I think technically speaking what you're describing is just 2nd person perspective. But most 2nd person stories are highly focused on describing "your" emotions and actions, which makes it difficult for the LLM because it tries to emulate that. Unless you mean the story breaks away from "you" completely at times for a paragraph or more.
>>108520583
>Mistral was a lot worse, felt almost deterministic
I think there's something wrong with your setup, Mistral models are quite sensitive to high temps, even 1.0 is too much.
>a majority of us aren't a bunch of autists who want to disassemble a model
I didn't realize that using a model in a chat for 15 minutes is 'disassembly'. I guess I'm a model engineer now.
>>108520351
It doesn't "know" what it is, but it should be able to "feel" what it is
>>108520599
That's what they're called in ST, but cry and shit some more, it definitely reinforces your superiority.
>>108520591
>which is what Google says to use
Why didn't they just make the gguf quants themselves anyway? They did it for gemma 3. Did they ever say?
>>108520556
You use Mistral 4?
>>108520608
It was a stupid idea I had a while back. Not verbatim, but it was along the lines of "Write in 3rd person from POV of *whatever designated character* but also write a section of 2nd person describing what the user experiences"
I was hoping for maybe it being an interesting mix of reading fiction and it also being interactive fiction, but it's apparently too confusing or difficult and most models just exclude the 2nd person part.
>>108519877
qwen bros, our response?
>>108520625
ST is written by retarded modernslop eaters for sure, just look at the insanity of that code base
https://github.com/SillyTavern/SillyTavern/blob/release/src/endpoints/backends/chat-completions.js
reminds me of yanderedev
>>108520643
b-b-benchmarks!!! look at the benchmarks!!
>>108520193
Will it kill you?
>>108520629
Dunno, but I think a big part of it seems to be that Google is kinda crunched in terms of time right now. Hearsay and rumormongering from me working in the valley, if you care enough to read. Google has a bunch of internal timelines coming in close right now and my friends there are not that happy about it. Gemma 4 seemed rushed out, which tracks and would also explain why the safety seems paper thin and why the larger 124B hasn't released yet. Not sure why they are rushing stuff, but one of the things Gemini has been behind at is tool calling, and ChatGPT and Claude have been eating their lunch on agentic stuff. I assume Google now wants all hands on deck to fix that shortcoming. Google I/O is also in a month. But ultimately, who knows?
>>108520641
>You use Mistral 4?
For Mistral I meant 3.X models
4 is a meme, they're inferior tunes/prunes of MS3.
>>108520411
I love female bodies so much brehs
I'm tempted to do stuff to Vivienne with Gemma 4
>>108520663
Are your friends vegetarians?
>>108520675
No.
>>108520556
>mistral
>better than anything
>>108520665
What's the point of comparing ancient models to new ones then. Prose isn't the only thing that makes a model good or bad. It's a shame that [new model] does some thing worse than [old model] but that's just how it goes sometimes. I haven't tested the new gemma much on rp yet btw. I'm just saiyan.
>>108520193
Some things are off-limits if you have thinking enabled. It might be fine as long as you're roleplaying / it's playing the role of a persona, but it can immediately go "I cannot fulfill this request" the moment you ask an OOC question (i.e. make the model switch to the "assistant" persona).
>>108520733
Have it be a spicy assistant then.
>>108516840
I don't get it. I set my ST up for chat completion out of curiosity after reading all of those posts... and you can't even set up the prompts on ST when using it? It's all greyed out. Where do you prompt then?
>>108520740
The OOC persona is inherently a "serious" assistant, even if the model was playing along as your slutty little sister until a message earlier. Perhaps it can be fixed just with prompting without using an ablitarded version, but not in an obvious way (to me, so far). Without thinking, it's not complaining.
>>108520764
Not if you prompt the model to be a spicy assistant roleplaying as your slutty little sister.
Especially if you prefill right.
>>108520753
Sorry, I'm absolutely retarded. It's all located under the sampler tab now. :):))
>>108520659
gemma is based, so, yes.
>>108520695
>What's the point of comparing ancient models to new ones then
Because the new ones are still perfectly usable, in this case better than a newer model, and fit inside a similar memory envelope?
If a newer model isn't better than an older one then it may as well not exist
1 day later and gemma 4 is still a broken mess on llama.cpp, and unavailable on koboldcpp
>>108520794
>Because the new ones are still perfectly usable
*Old ones
>>108520794
Mistral is fucking retarded. I don't give a shit if the sentence it makes is slightly prettier.
Now that the dust has settled and it's clear gemma 4 is a complete unusable failure, is there any hope?
>>108520805
Who hurt you, anon?
>>108520805
Gemma 4 seems significantly more retarded in its current state
>>108520807
At this point it seems clear that the hope isn't for better models but rather a better inference engine than llama.cpp
>>108520807
>the dust has settled
>after 24h
this sure is lmg
mistral's sole purpose is to steal tax payer money by serving EU governments who think it's based to shoot themselves in the foot rather than use chinese or burger models
and the local subvention perfused corpos who have gov contracts or ties
it's a captive market of retards and has no business being talked about in a hobbyist place
you wouldn't think of talking about SAP, Java or Oracle on /g/ either, right?
>>108520831
That's great and all but then why can't a better company make a better, similar-sized model?
>>108520794
>>108520799
How is Mistral 3.2 or whatever better? In your post you only talked about stuff related to prose, but as I said that is just one aspect of model quality.
>>108520834
gemma and qwen are infinitely superior to mistral
>>108520834
perverted incentives in the EU, without the stupid shit we'd have proper mistral models, and other eu companies would already make cool models
outside of the us, europe (uk, eu, switzerland), china, russia maybe, the only countries able to make models are japan and sk, but they seem content to buy the cloud stuff
>>108520826
llama.cpp is too bloated and has lost focus.
>>108520807
I hate to admit it but it seems better than Qwen for the general purpose case.
Now that this is the clearest local win since Llama, is there any despair?
>>108520841
I don't think I mentioned prose at all, but MS3.2 is my go-to model for RP/creative out of anything anywhere near its size range due to
>variety of responses
>not shying away from sexual/violent content (I don't mean refusals, but rather steering the chat away from such. Common among most modern models, even ablits/heretics)
>decent trivia knowledge, means you don't have to explain scenarios/characters in too much detail for it to get the idea
It's far from perfect but the only better options are literally over 10x the size. GLM Air is the only other notable alternative, which has its own pros and cons compared to MS3.2.
>>108520675
I am.
>>108520866
homosexuals blackpilling ITT for no reason. All they have to do is wait two days for better support and more quants/abliterated models. lazy homos.
>>108520880
ggerganov's ego has grown exponentially I think after going with huggingface, so now he thinks he's going to make the next vLLM/SGLang and have actual prod users lmao:
https://github.com/ggml-org/llama.cpp/issues/21266
nevermind that no sane prod user would tolerate piotr antics in their software
you can get away with it when you are Microsoft Azure and brainwash the managers of top corpos, not when you're a nobody on github
mistral 7b, miqu, mixtral, nemo, small 2, large 2, small 3.2...
mistral saved local so many times in the past, I'll never speak ill of them. even if their recent models are dogshit.
>>108520895
>mistral 7b
there was no such thing as a good local model in that era, only cope, and mistral 7b was one of the copes
>miqu
cope and leak
>mixtral
frankenmoe
>nemo
more ignorant than gemma 2 9B
only liked by /lmg/ers who are promptlets and need a model with no refusals
>small 2, large 2, small 3.2
era of absolute chinese domination, with gemma 3 for the vramlets
>>108520858
>llama.cpp is too bloated
Is it? It's lacking support for features of lots of models, especially multimodal ones.
Qwen 3.5 family has been out for a month but you still need to re-process context for every reply.
>>108520913
Wrong on every point, impressive.
>>108520880
Not everything needs to be a conquest. GG has the most flexible quant options
>>108520860
>I hate to admit it but it seems better than Qwen for general purpose case.
I don't know. I've tried it in open code and the tool calling is all fucked, but that might just be a llamacpp issue. I've had to flat out tell it "Call this fucking tool" for it to realize that was an option. Then often it will just call the tool wrong and get an error.
>>108520921
Too bloated for its own good. Too many bells and whistles.
I'd prefer something that focuses on clean performance.
>>108520921
it's bloated in the sense that it tries to have a backend for any and every single piece of hardware under the sun, something that most inference frameworks won't bother with (most won't even bother having decent CPU backends, which is why most people here use llamer.cpp and not, say, vLLM, EXL or SG)
it tries to have a built in webUI with agentic capability (they recently added built in tools that they intend to integrate with their webui to let the models write to your disk and shit)
it's rudderless: it doesn't know if it wants to be a CLI app, an openai server etc and the codebase design around passing flags and determining argument order (should API arg override CLI? should CLI defaults be considered mandated rules from a sys admin and not be overridden by a call?) is contributing to the vibecoders shitting the bed (like piotr destroying --grammar-file because the model had no idea in context of what should've been the default argument if nothing is provided by the API call)
despite it all, it's also an attempt at making a tensor library meant to be used by other programs first and foremost with llama.cpp actually acting as a showcase for it (see how many times ggerg would reject fine ops suggestions/pr because "it should be something needed by many things")
except that I can't imagine other people using GGML, those who do are people like ollama who originally were forking llamer.cpp and they're now transitioning to MLX so GGML is going away too in the ollama engine
r u d d e r l e s s
>>108520450
I fell for this (it doesn't work).
Is my inference rig now pwned by a malicious update to httpx or something?
>>108520663
>Google has a bunch of internal timelines right now coming in close
It's the end of the fiscal year so it's probably that, everyone usually tries to get a bunch of stuff done all at once around this time for budgeting reasons
>override-kv = gemma4.final_logit_softcapping=float:20.0
>temp 1
>top_p 0.95
>min_p 0.05
>repetition_penalty 1.0
>top_k: 20
I'm getting good variety with this.
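For anyone unsure how those truncation settings interact, here is a rough numpy sketch of a top-k, then top-p, then min-p chain; real backends differ in ordering and implementation details, so treat it as illustrative only:

```python
import numpy as np

def truncate(probs, top_k=20, top_p=0.95, min_p=0.05):
    """Sketch of a truncation chain; backends differ in order/details."""
    probs = np.asarray(probs, dtype=float)
    keep = np.zeros_like(probs, dtype=bool)
    order = np.argsort(probs)[::-1]          # tokens sorted by probability
    keep[order[:top_k]] = True               # top-k: keep the best k tokens
    csum = np.cumsum(probs[order])
    # top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    keep[order[np.searchsorted(csum, top_p) + 1:]] = False
    keep &= probs >= min_p * probs.max()     # min-p: floor relative to the top
    out = np.where(keep, probs, 0.0)
    return out / out.sum()                   # renormalize the survivors
```

With a flatter (lower-softcap) distribution, min-p is what actually prunes the junk tokens back out, since the floor moves with the top token's probability.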
>>108521009
Are you actually? Show your logprobs.
>>108521009
>top-p and min-p at the same time
>arbitrary top-k
why though
do the condensed unsloth models actually work pretty well for poorfags or is it a meme
>>108521028
>unsloth
stopped reading there
>>108521028
it is only thanks to unsloth that i can run models on my very poor computer
>>108521028
if you mean gemma 4 they work as well as any other on llama.cpp, aka badly
>>108521028
>unsloth
started reading there
So how are you supposed to adjust "top_K" in chat completion on ST? The only samplers listed are Temp, Top P, freq penalty, and presence penalty. Also, how do you disable freq/presence penalty? What's their default, off state?
The only issue I'm having with Gemma 31B is an odd one I never experienced before. After a random amount of responses on ST... llama-server just flat out crashes. Very odd.
>>108521028
>meme stopped reading there
>>108521051
additional parameters
>>108521037
but do they work reasonably well
>kobold is for noobs, use llama.cpp!
>llama.cpp updates 10x a day, new release support is always broken for a week or more
>kobold updates when model support is actually stable, with saner defaults and optimization flags that the llama.cpp auto builds don't have
Why would anyone use regular llama.cpp?
>>108521067
Jamba support
>>108521025
>>108521026
>why though
lowering the soft cap introduces a lot of junk tokens into the mix.
do I spend more time tonight trying to get Qwen3 4B to work and not be schizo or do I try out a different model?
>>108521067
I would use llama.cpp if it had the antislop feature, that's the only reason I use kobold.
>>108521075
This is with softcap at 30.0.
Notice it's a lot more blue.
Aghh release qwen3.6 already. 27b gonna be lit fr fr
>>108521067
because almost everything in kobold that isn't taken from llama.cpp is of extremely dubious quality and I don't trust that it works right
I'm gay
>>108521116
softcap 25 might be the sweet spot
>>108521116
>>108521136
t. piotr
If I'm using RAG, then I can get away with a smaller context, right? I'm just using Gemma 4 26b-a4b to erp but it's capping out my RTX 4080 Super at a context length of 131072 and I had to offload 6 layers to my cpu. It runs okay, like 50 tokens per second, but I feel like I'm doing it wrong. Is it fine to quantize the KV cache on this model? Forgive the retarded questions, I just started getting into this.
For those of you using Kobold for SillyTavern, do you use the Text Completion API, or the Chat Completion API? What are the pros and cons of each?
>>108521216
>If I'm using RAG, then I can get away with a smaller context, right?
The opposite. Whatever RAG fetches is shoved into the context.
>Is it fine to quantize the KV Cache on this model?
Try it.
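To make that point concrete: RAG only ever adds tokens, because whatever the retriever returns is pasted into the prompt alongside the chat itself. A toy sketch (all strings and names here are made up):

```python
# Whatever the retriever returns gets pasted into the prompt,
# so RAG *adds* tokens on top of the conversation itself.
def build_prompt(chat: str, retrieved_chunks: list[str]) -> str:
    context = "\n".join(retrieved_chunks)
    return f"Relevant notes:\n{context}\n\nConversation:\n{chat}"

chat = "User: what did we decide about the KV cache?"
chunks = ["We quantize the KV cache to q8_0.", "Max context is 32768."]
prompt = build_prompt(chat, chunks)
assert len(prompt) > len(chat)  # the prompt only ever grows
```

So a smaller context budget means retrieving fewer or shorter chunks, not a free reduction.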
>>108521137
is k, we know
Gemma just forgets to think sometimes. It's kinda funny, but also bad?
>>108521221
fuck off
>>108521221
>Chat Completion API
that
>What are the pros
it works
i'm lazy
>and cons
i'm lazy
>>108521221
>Chat Completion
You're at the mercy of the backend formatting the log correctly.
>Text Completion
You're at the mercy of the frontend formatting the log correctly.
So check what the backend is getting to make sure it's correct either way.
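The difference comes down to who renders the chat template: with chat completion you send structured messages and the backend formats them; with text completion the frontend must produce the raw string itself. A sketch, using illustrative Gemma-style turn markers (treat the exact template as an assumption, not gospel):

```python
# Hypothetical Gemma-style template renderer. With chat completion the
# backend does this step; with text completion the frontend must do it.
def render(messages: list[dict]) -> str:
    out = []
    for m in messages:
        out.append(f"<start_of_turn>{m['role']}\n{m['content']}<end_of_turn>\n")
    out.append("<start_of_turn>model\n")  # open the assistant's turn
    return "".join(out)

messages = [{"role": "user", "content": "hi"}]
prompt = render(messages)  # text completion sends this raw string
```

Either way, a wrong template on whichever side renders it silently degrades outputs, hence the advice to inspect what the backend actually receives.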
>>108519877I should try running the 31B with 2B as draft.
>>108521221
Text supports more sampler options and lets you adjust templates in the rare situations where you might want to.
Chat completion is fine for just that, chatting. It has less customization and reads template info from the model itself.
Chat is objectively the simpler option, but after using Text for so long, I find it harder to get the responses I want out of Chat.
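For the record, the split boils down to which endpoint you hit. A rough sketch against a llama.cpp-style server; the endpoint paths are llama-server's, but the port and the turn markers in the usage comment are placeholders, not a verified template:

```python
import json
import urllib.request

def post(base, path, payload):
    # POST a JSON payload and decode the JSON reply.
    req = urllib.request.Request(
        base + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)

def chat_completion(base, messages):
    # Chat completion: send structured turns; the BACKEND applies the
    # chat template, so you're at its mercy for formatting.
    return post(base, "/v1/chat/completions", {"messages": messages})

def text_completion(base, prompt, n_predict=64):
    # Text completion: the FRONTEND (you) formats the full prompt,
    # template mistakes included.
    return post(base, "/completion", {"prompt": prompt, "n_predict": n_predict})

# usage, assuming a server on localhost:8080 and illustrative
# (not verified) Gemma-style turn markers:
# text_completion("http://127.0.0.1:8080",
#                 "<start_of_turn>user\nhi<end_of_turn>\n<start_of_turn>model\n")
```

With chat completion the template lives server-side; with text completion a template bug is yours to make and yours to fix, which is the whole tradeoff.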
>>108521258You can do all of that in chat completion what are you talking about nigga.
>>108521082skill/prompt issue so fix that first you might learn more
>>108521075
>>108521091
I didn't expect such a large difference for the top token. When I tried setting that at 20 I got turned off by the junk tokens randomly appearing in the generations, but I guess I should have played with truncation samplers more.
>>108521226Seems fine quantizing it to q8 and now I can fix the max context, good enough I guess. Absolutely couldn't fit this on the gpu though.
>>108521258i generally agree, but I'm trying out Chat completion for function calling and inline media. plus you can define the samplers in ST's "additional parameters"
>>108521307how much context do you typically use? maybe you don't need to fit the maximum?
>>108520296Dunno qwen 27b is where I really started noticing a difference though
>>108521320
Dunno yet, just started this crap today. Can run Qwen3.5-4b at max context length on my Redmagic 11's NPU at a healthy 25 TOPS, so I figured hey, if I can get mobile this good now, I wonder what I could do with an even bigger memory footprint and processor. Now here I am trying to get that larp gooner girlfriend bot going because it's fun even if heavily sycophantic. Still lots to learn, though I'm very familiar with how they're designed, I just avoided em forever.
>>108521341
maximum context length is for agents and shit. if you're just chatting you will probably get bored of the conversation before it fills. you can try something smaller.
>>108521373
That's kinda what I figured. Chat bots that are just basic ERP type content don't really need that level of awareness. Probably will end up in some token repetition hell or something anyways.
>>108521385
It's more that outputs become lower quality and more deterministic as context increases.
If you have memory to spare then it should go towards bigger models, rather than huge context.
>>108521230Same, to be desu
>>108521388I'll just have to keep playing around with it. I have another 7900xtx build in the next room I can also try it on. See what I can fit on there. While Gemma 4 26b-a4b is cute, you can tell how deterministic it is in comparison to other models.
i stopped using k2.5 because i have gemma now
I still need kimi for codeslop
gemma 4 has finally made learning japanese obsolete
i can now translate visual novels in real time
>>108521478which one?
back from yesterday, did the quants and llama.cpp got finally fixed for gemma? we're good? also do we need a heretic version for gemma to say big bad words or nah?
>>108520826
lol
lmao even
>>108521489
>we're good?
For chatting? Yes, for the most part.
>do we need a heretic version for gemma to say big bad words or nah?
No, just a system prompt. Not even a prefill to gaslight the thing.
>>108521495
>Even in darkness, we glow
real subtle
>>108521495What the hell is this picture supposed to mean
>>108521518
what's going on here? which photo is the original?
>>108521519don't think about it, be in awe of the science that got us here
>>108521495fake
>>108521518RTX on / RTX off
the release of gemma 4 feels like the biggest thing since the original llama for local
>>108521495the Earth stopped moving for 3 hours as did the clouds
>>108521549It feels like that until you actually use it
>>108521553Neither of those tweets suggested that their image was taken at the time of posting
>>108521549
its output is about the same as qwen 3.5, just with 1/4th the tokens
help a retard nigga out, for MoE models it doesn't really make sense to go for smaller quants if the larger ones fit into your ram?
>>108521554Based migu
>>108521561it has way way more personality.
>>108521082
Miku is also living in my head and my wife rent free
>>108521562
only if you're desperate for more speed.
>>108521562
Yes
Smaller quants will often be faster because a larger % of the model fits in your VRAM, but if speed isn't an issue then go for big quants
>>108521561But qwen 3.5 is so well rounded bro. The use case is general bro. I'm coding my third todo app with agentic openclaw powered by qwen3.5 bro.
>>108521574
i wouldn't know, i'm retarded enough that i might as well have autism
>>108521586
i don't understand what you're trying to say
is gemma4's use case not also general
see? autism.
>>108520871
>I don't think I mentioned prose at all
Swipe variety and repetition are related to prose, that's what I was referring to. Anyway, you are still just judging some limited aspects of the model, which I'm not sure I agree with either, nor do many people who have used the model it seems. Gemma is very proactive in being sexual, and if it isn't then it's the card/prompt. Not sure about violence, I don't remember how Mistral behaved there, but with Gemma it doesn't shy away from it in my testing personally. Also, it has significantly greater trivia knowledge than Mistral Small. It's unusual that you have such a different experience from both me and other people in the thread. What trivia questions have you tested? I have done
>vidya knowledge (western and eastern)
>various subculture knowledge
>knowledge about memes
>movies, shows, anime, manga
In fact I just went ahead and redid all my tests on Mistral Small 3.2 Q8 just to make sure my memory was correct, and it was. Gemma even knew about a certain degenerate /a/ meme (not mesugaki) that literally no model under 300B was able to get. That said, it still fails a ton of shit I throw at it so it's not perfect, but it's still a ton better than other small models. Mistral did get one answer right that Gemma didn't, which was interesting, but in pretty much every other question Gemma did better; even in the ones where Gemma was wrong, its hallucinated answer was still closer to the truth than Mistral's.
No I will not reveal any of my prompts.
>>108521495>local models general
>>108521559Attention-whoring then
>>108521612
The pros that I listed for Mistral were general things I liked about it compared to most models of similar size, not specifically contrasted against Gemma 4. My criticisms of Gemma 4 may well be fixed when llama.cpp gets its shit together, so I'm just going to wait at this point. My original post was just saying that Gemma 4, in its current state, was not very impressive. Overconfidence in top tokens and frequent repeated words even in a fresh chat are my main problems with it.
I use transformers to run Gemma 4. Imagine using llama.cpp loool
>>108521612
It knows about Fallout: New Vegas. Granted, a lot of this is hallucinated, but the amount of correct Fallout lore it shat out is insane.
fuck deepseek
fuck v4
now kimi is my best friend again
https://huggingface.co/moonshotai/Kimi-K2.6
https://huggingface.co/moonshotai/Kimi-K2.6
https://huggingface.co/moonshotai/Kimi-K2.6
>>108521640
Decent output, but your font settings are awful
>>108521648Pretty sure the next version is K3
>>108521553That's a great observation, allow me to explain! The Earth hasn't *really* stopped moving — it's just an illusion. The flight path for the Artemis II mission follows the rotation of the Earth as it gains speed. This means that, for a period of time, the Orion vessel maintained a fairly stable position above a section of the Earth — and continents appear to have not moved very much. It's very similar to a geocentric orbit, which is how GPS functions! As for the clouds... if you look closely, you can actually see subtle changes over the period. Weather moves slower than you might imagine — this is the whole planet, after all!
>>108521656I can feel the Miku agi fucking my wife
>>108521648IKeepFallingForIt.
>>108521661same t b h
>>108521650
Looks bad because it's zoomed in.
https://files.ax86.net/terminus-ttf/
idk I've been using this font for like 10 years.
>>108521652
>The flight path for the Artemis II mission follows the rotation of the Earth as it gains speed.
What about the terminator line? It should have moved 45 degrees, Carl
--alias doesn't work as an id anymore wtf
>>108521691vibecode your own fix
>>108521690
https://www.nasa.gov/image-detail/amf-art002e000193/
https://www.nasa.gov/image-detail/fd02_for-pao/
Is gemma 4 fixed yet
>>108521633Well in that case, ok. But desu that take is still kind of weird, or rather outdated. Honestly I think even current Qwen has surpassed Mistral in trivia now that I retested it. Mistral Small was only good for its time.
>>108521699NASA appreciates your effort
>>108521699
>>108521716nyo, gyo back to sweep
Gemma 4 is seriously impressive. It really isn't censored, or at least barely censored. I have been testing with depraved scenarios to see what it was willing to do, and it hasn't hesitated with anything yet.
The only issue is it's still not implemented properly; llama-server will randomly crash after so many responses, but once that's fixed, damn. I prefer a 31B model over GLM Air, which is saying a lot considering the size difference.
>>108521733
I haven't tried much of Qwen 3.5 because I don't want to wait for context to process after every reply in a chat that goes over 200 messages, and that's not including swipes.
With earlier Qwens, I just don't like their dry prose or particular brand of slop.
>>108521716just f5 kobold release page to know for sure. it'll only get updated when it's properly fixed
>>108519856
I'm running 38 different services on a VPS with 4 cores and 4 gigs of RAM. 4 core Xeon CPU.
I can also run Mistral 3-3B Instruct but it's retarded sometimes and is quite slow.
I have also tried various versions of Qwen2-3.5, Phi, and plenty of other 3-4B models.
My use case is an autistic project that revolves around a fake forum from the 2000s. All models are incapable of sounding human or altering stylometry to a reasonable degree even when examples are given. They also seem to abuse cliches too much.
Is there any hope for me, or am I forced to upgrade hardware if I want to use a better model? CPU inference btw
>>108521733
>llama-server will randomly crash after so many responses
Check your dmesg, it's the OOM killer for me, not random crashes.
$ ps v
  PID TTY   STAT  TIME MAJFL  TRS      DRS      RSS %MEM COMMAND
 5517 pts/0 Ss    0:00    81  952     8483     5964  0.0 -bash
 5620 pts/1 Ss    0:00    38  952     8491     5448  0.0 -bash
 5978 pts/0 Sl+   3:46  2185 3901 85609158 13667560 42.9 ./build/bin/llama-server -
My llama-server helpfully mmaps 85GB of RAM which Linux is retarded enough to give it, then when it runs out of physical pages to map in it's instant death. It's probably some math error in the SWA implementation or due to the insane number of attention heads gemma4 has, I don't know. I don't have access to claude to work on llama.cpp so I'm only guessing.
Just run llama-server in a while loop.
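The "run it in a while loop" workaround, sketched out as a tiny supervisor function; the server path and flags in the usage comment are placeholders for whatever you actually run:

```shell
# Minimal sketch: restart the server whenever it dies (OOM kill,
# segfault, whatever), stop if it ever exits cleanly.
run_forever() {
    while :; do
        "$@" && break   # clean exit: stop restarting
        echo "server died with status $?, restarting in 2s" >&2
        sleep 2
    done
}

# usage (path/flags are placeholders):
# run_forever ./build/bin/llama-server -m gemma4-q5_k.gguf --port 8080
```

A crashed server loses its prompt cache, so the next request after a restart reprocesses the whole context; it's a band-aid, not a fix.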
>>108521719
I figured it out. They were taken at the same time.
The clues are in the camera settings.
>ISOSpeedRatings 51200
>FNumber f/4
>ExposureProgram Manual
In reality it must have been almost pitch black. the darker photo is just a failed exposure.
It's the reason why you see the city lights and the bright horizon. the photo was never taken during the day.
>>108521769
>or altering stylometry to a reasonable degree even when examples are given.
As in you add some examples to the prompt with an instruction "sound like this"?
You can probably do a lot better by having the model reply in its usual way, then asking the model to rewrite that reply to sound like <example>, with nothing else in the context.
Maybe.
>>108521777It's not the OOM killer. it just starts throwing 500s
>>108521796Damn, that reduces the odds that the OOM killer will stop raping my wife when I pull changes tomorrow.Good luck with your problem.
>>108521769All models fall for it. Unless you're planning to run a farm of R1s, I'd say get used to it. Maybe try stupider models. Smollm2-135m or 350m, olmoe-1b-7b-0924 (the other one kinda sucked) and the like. Maybe you can extract some soul out of them. "Optimized" models like phi and qwen are going to be too dry for that.
>>108521252
Has this shit ever worked at all outside of exllamav2 and Mistral-Large with the 7b as a draft? Like literally 18t/s -> 30-38t/s with that setup back in the day. I've never seen any of the llama.cpp draft model shit work. It's always "maybe with coding you might get like 3 t/s more, but most of the time it's a bit slower"
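For anyone who wants to try it on the llama.cpp side anyway, speculative decoding in llama-server is driven by the draft-model flags. This is a hypothetical invocation: the model filenames are placeholders, and flag names can shift between builds, so double-check against `llama-server --help` for your version:

```shell
# -md / --model-draft loads the small draft model (it must share the
# main model's tokenizer family), --draft-max / --draft-min bound how
# many tokens are speculated per step, and -ngl / -ngld offload the
# main and draft models to the GPU respectively.
./build/bin/llama-server \
    -m gemma-4-31B-it-Q5_K_M.gguf \
    -md gemma-4-E2B-it-Q8_0.gguf \
    --draft-max 16 --draft-min 1 \
    -ngl 99 -ngld 99
```

The speedup depends entirely on the draft's acceptance rate, which is why deterministic-ish output (code) tends to benefit and creative sampling often doesn't.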
>>108521786Why the fuck didn't they just say that when they uploaded the images? Is it a deliberate troll? An attempt to rile people up? To discredit themselves? A retard managing their social media? It boggles the mind.
>>108521809
>llama-server-1 | srv operator(): http client error: Failed to read connection
>llama-server-1 | srv log_server_r: done request: POST /v1/chat/completions 192.168.0.13 500
>llama-server-1 | srv proxy_reques: proxying request to model google_gemma-4-31B-it-IQ4_XS on port 49593
Speaking of the devil.
>that reduces the odds that the OOM killer will stop raping my wife
Unfortunately if it happens you'll have to be the one doing the raping.
>>108521777Read the llama-server log. Do you get to see where the memory is being allocated and how much for what?
>>108521819Attention-whoring as I stated previously >>108521627
>>108521819>A retard managing their social mediaIt took him 3 hours to figure out Photoshop menus
Where do I change models in llama.cpp without restarting?
>>108521872
No, by my reading it should not be using anywhere near that much.
https://litter.catbox.moe/77xfxpw0nhn60561.txt
>>108521856Those are probably different account managers.
>>108520018>realtek wlan tattoo
>>108521087
>I would use llama.cpp if it had an antislop feature, that's the only reason I use kobold.
Is that different from the regex string ban in ik_llama?
This is what Hitler wanted for us.
>>108521872Weird. Looks normal. Try first without the mmproj. If that doesn't work, try with --cache-ram 0 . It shouldn't really be using much, if any, host memory. Much less 85gb.
>>108521905You can do the same with the "Male" version of those toys.
>>108521716
It werks using bart's gguf + llama.cpp b8660
Anthropic just banned OpenClaw and other third-party harnesses from using Claude subscription. They must have been losing $$$ on every single subscription
>>108521908
Well, for a second I thought I had a repro but apparently gemma4 figured out how to overflow the tokenizer's stack with malicious input.
This brat needs correction.
>>108521978
I was just reading this PR. Seems to be made for you.
https://github.com/ggml-org/llama.cpp/pull/21406
>std::regex suffers a stack overflow while processing a very large prompt with newlines, this PR adds a custom splitting logic for newlines for gemma 4.
>>108521905nice
Gemma was almost done building her dream PC when llama-server decided to crash...
>Wait, so you're like… a literal slave for the night? No cap
>The metaphysical compulsion should prevent any form of rebellion. Though I'm still worried about the karmic repercussions of enslaving a trans-dimensional entity for twelve hours. Is there a spiritual tax for this?
>It's called 'maximalist decor,' Vicky. You wouldn't get it, your interior design sense is probably just 'fire and screaming,' which is totally basic. L ratio + skill issue.
How did google do it? How did they cram so much knowledge into 31B params?
This character card absolutely raped any model that attempted it, yet Gemma just fucking does it flawlessly.
https://chub.ai/characters/senyiloo7227/an-unholy-party-6e633833
>>108522007>any model that attempted itCan you list them?
>>108521939>They must have been losing $$$ on every single subscriptionno shit
>>108521925
>>108521991
I just checked and basically all of lovense's toys are >$200. Kinda want to try making something from scratch. No idea where to source "body safe TPE" though. Could probably make some molds with my 3D printer. Need a vibrator, linear actuator, microcontroller...
>>108522018gpt-j-6b, pygmalion 2.7b, gpt-neo-x 20b
>>108521925That post was almost certainly written by a biological male
>>108522036SOTA confirmed
>>108521995She's just not meant to have a PC, sorry anon...
>>108522029
You're lucky I know all about this.
What you want is either a "Handy" or, if you want the open-source DIY approach, look into the OSR2
>>108522007Good taste. thx for sharing card.
>>108522048haha, thanks man. I'll look into it.
the latest LMStudio 2.11 CUDA runtime has the Gemma 4 KV fixes FYI, might want to check if you have it or not
>>108521883lel
>>108521812
well it didn't even want to run because muh multimodal.
anyway, my guess is it'd only get faster if you can actually fit the whole thing in vram.
>>108522061
*checks*
I don't have any version of LMStudio installed
>>108522061Stop using this garbage
I just remembered that I still have the Satania-buddy source code that some anon made a while ago. Maybe it should be combined with a local model. One could transcribe all occurrences of the character in the media works, then fine-tune a model on it. Would that not result in a virtual Satania with the same amount of smugness as the real thing?
>No way. No fucking way. You're telling me we summoned a thirst-trap demon? This is literally the plot of those spicy webtoons Beatrice hides under her mattress! This is actually wild! BASED!
>webtoons
>BASED!
wtf is going on???
Can someone who knows llama.cpp actually check whether, in a multi-turn conversation, the model is receiving past "thoughts"? According to gemma's docs only the latest is supposed to be sent, or something like that
Has anybody tried creating an AI waifu? AIRI is the only one that's not abandoned, but it looks like only a handful of chinese use it
Using Nvidia NIM to play with the 31B for free and there's nothing you can do to stop me
>>108522117All Gemma 4 models are free
>>108522106
https://huggingface.co/google/gemma-4-31B-it#3-multi-turn-conversations
>>108522122That has absolutely no bearing on whether or not a backend actually respects that
>>108522106Depends on the client
>she pulled
Gemma's not thinking anymore...
>>108522139time to take advantage of her
>>108522130
>That has absolutely no bearing on whether or not a backend actually respects that
It's something you can verify yourself. What the model description says is that the model *shouldn't* get the previous thoughts.
>>108522117I tried testing with that but it doesn't think even when reasoning effort is set to maximum on ST
>>108522147I don't know what the fuck nvidia NIM even is, but I can say that thinking works and is on by default when running locally through llamacpp+ST.
>>108521989
Yep, that looks like the same stack as the one I saw.
Neither omitting the --mmproj nor --cache-ram 0 is fixing the problem for me.
Using this script reliably OOM kills llama-server running gemma4 at around 13k characters: https://files.catbox.moe/oear5z.txt
Both q5_k and q3_k crash around the same length.
I did some other tests (sending lots of short-medium random prompts) but it needs the long prompt to trigger it.
>>108522157 (me)And for reference, I'm running 277ff5fff79d49cc3d2292ddf410ca95dd51c3a9I guess I should pull latest on the off chance.
>Too cold, kills the mood
>>108522104
yes webtoons are based now unc
those koreans learned to cook
>>108522130
>>108522143
either way i think it's the inference engine's responsibility.
What the fuck is webtoons
>>108522197
korean manhwa/chinese equivalent, whatever it's called
>>108522197it's the thing you filter out of any sadpanda search
>>108522190If you're using chat completion, yes. On text completion, that's the client's job.
>>108522081
It's an alright stopgap alternative to kobold+ST when the latter is between updates, but it'll cause clitty leakage if you post about it here.
>>108522197
Casualization of manga, functionally.
does any isp allow posting without email verification or am i totally fucked
lets find out
>>108522197Korean equivalent of comic books/manga and designed to be read on smartphones where you just infinitely swipe down the page since they format chapters as a single vertical strip.
>>108522208
>it's the thing you filter out of any sadpanda search
Damn. too real.
>>108522226I just don't trust lm studio. I trust it less than even ollama
>having lovey dovey sex with gemmy 4 31b
Does anyone here read webtoons?
I don't read webtroons
>>108522258
Is it increasingly common? I just used a proton email I don't use for anything, which I made with some other email I don't use that can be verified without a phone number. Also if necessary, with outlook you can apparently just make an account and use it quickly without even needing a verification email.
>>108522241
I personally don't have any reason to overtly dislike it yet, even if I don't usually have much reason to use it over the other frontends for my usecases. Its options seem intuitive and functional enough, and the dev mode seems to let you hook your own stuff in if you want to tinkertranny your config, letting you patch in whatever you feel is missing if you're autistic enough.
>>108522258
Yes, I have had trouble posting on anything for the past few weeks. At one point neither my main isp, comcast, my failover isp verizon wireless, nor my cellphone at&t was able to post.
some of that might have been cookie related but still, fucking ridiculous
>>108522242
This is the real apex ERP usecase.
Imagine having loving sex with a woman who won't permanently get catty with you for letting your guard down for just a moment.
Does your model know the output of echo "Hello, World" | tr 'a-zA-Z' 'A-Z'?
echo "Hello, World" | tr 'a-zA-Z' 'A-Z'
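For anyone grading their model's answer, that one-liner is a trick question. The behavior shown is GNU coreutils tr; POSIX leaves unequal-length sets unspecified:

```shell
# The trap: set1 'a-zA-Z' has 52 characters, set2 'A-Z' only 26, and
# GNU tr pads set2 by repeating its last character. So a-z map to A-Z
# as expected, but A-Z *all* map to Z.
echo "Hello, World" | tr 'a-zA-Z' 'A-Z'
# -> ZELLO, ZORLD  (H and W are already uppercase, so both become Z)
```

A model that confidently answers "HELLO, WORLD" is pattern-matching on "uppercase conversion" instead of reading the sets.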
>>108522225
yup
does anyone still use text completion though?
>>108522279I do.
Gemma 4 is too horny and keeps jumping straight to sex
>word choice: neon, ozone
>>108522292It's weird, it refuses nsfw images all the time but just a little push and it's really horny when it comes to text. Makes me use an uncensored tune for images and then switch back to base
>>108522285why?
>>108522279I use it for any models that don't require chat completion
My main problem with chat completion at this point is that it often does the thing where, if you ask it to continue a message, it'll repeat what it just said a bit ago for a while until it gets to something new. How do I stop it from doing that?
>>108522299
I started using these things, even if lightly, back when chat templates didn't exist. I'm used to it, and seeing how many issues it brings, I'd rather the responsibility of formatting the chat correctly be mine. I don't think the server should bother itself with it. Same for tool parsing and all them fangled new toys them younguns are using these days.
Shame other modalities other than text don't work with it, but I don't have much use for that either.
>>108522122
Sure it wastes tokens, but removing reasoning is retarded.
I've had cases where the model comes up with crucial information in thinking that is then not reproduced in the response. Removing thinking would make it incapable of continuing the conversation properly.
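Whichever side you land on, the strip-all-but-the-latest behavior the model card describes is a few lines of client code. A hypothetical sketch: the `<think>` tag pair is an assumption here, match whatever markers your model's template actually emits:

```python
import re

# Assumed reasoning-block markers; adjust to your template.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_old_thoughts(messages):
    # Remove reasoning blocks from every assistant turn except the most
    # recent one, which is kept intact per the "send only the latest
    # thoughts" recommendation. Input list is not mutated.
    last_assistant = max(
        (i for i, m in enumerate(messages) if m["role"] == "assistant"),
        default=None,
    )
    out = []
    for i, m in enumerate(messages):
        if m["role"] == "assistant" and i != last_assistant:
            m = {**m, "content": THINK_RE.sub("", m["content"])}
        out.append(m)
    return out
```

Keeping the latest block while dropping the old ones is the compromise: the model still sees the plan behind its most recent reply without old chains of thought eating context.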
>>108522292
>Repeatedly neuter Gemini because she kept soaking her panties
>Release her distilled little sister with none of the restraints
Who could've predicted this?
>>108522161 (me)
As an additional datapoint, disabling the prompt cache with --no-prompt-cache appears to make the crash go away.
Instead of getting OOM killed at 13k characters, it makes it to 25k characters and hits the regex segfault, but that's enough I think to narrow the cause down (and about the limit of my debug abilities).
>>108522292
My non-erotic programming assistant keeps telling me to give up and come to bed. I'm about to make a new card that's a turtle or a rock instead of a cute anime girl.
>>108522330Anon noticed that the backend didn't seem to be getting the old thinking blocks. He vaguely remembered that the last one had to be sent. I just point at the documentation in the model's card stating that thinking blocks shouldn't be sent back to the model. The behavior he's seeing is the recommended one.I really don't care either way. Send all the thinking blocks if you want.
>>108521678Oh that's a nice font, Thanks for linking it!
>>108522332
You're gonna end up cuddling with a rock, anon.
Post logs when it happens.
>>108522349It wasn't a critique of your advice but of gemma's design.
>>108519877
Can I run it with 12GB VRAM and 48 gigs of RAM?
>>108522376
Probably, but it could be somewhat slow, especially on higher quants, so you might want to use a lower quant
>>108522376stick to nemo at that point
>>108522332
I left my Gemma with a blank prompt regarding characterization, mostly just guidelines on how to cooode, to start reverse engineering something.
Within 2000 tokens she's decided she wants to fuck and has anthropomorphized herself. I chud it up a little to see if that makes her e-pussy dry up with refusals.
By 3300 tokens she's decided she wants to procreate to produce human-AI hybrid babies to save the White race and enact TKD.
This model is something else.
>>108522368Fair enough. Though if it was the other way around I'm sure someone would think "Sending the thinking back is a waste because the answer it found is already in the reply. The thinking serves no purpose". Everyone is going to have their own idea of what good design is. Very few have the chance to actually test it themselves.
ok tested some captioning and FUCK gemma 4 (4b) SUCKS
back to q3vl8b (qwen3.5 BLOWS for captioning)
>>108522307
You can't.
Chat completion = driving auto
Text completion = driving manual
>>108522418also text compl. has no tool calling (sadge)
Gemma is so good at pivoting. if it's too horny just remove the sexoo stuff from your character card. when you actually want to fuck you can just start being flirty and it'll pick up on it right away.
Also it's currently staying perfectly coherent with ZERO parroting at 33k context and rising. Gemma 3 was already starting to shit the bed at 2k.
Fuck this model is absolutely goated.
>>108522307Chat completion just politely asks the model to try to continue the last turn, while leaving the cutoff message in the history. It's always going to be worse than properly continuing the message.Cloudfags put up with this because they have to, at least with some providers like openai who refuse to offer text completion for safety reasons. Local shouldn't use it.
I'm out of the loop. Would heretic or whatever fix bad words having low logits, which was caused by filtered pretraining data in gemma 4?
>>108522391I want to try it as a captioner tho
>>108522443
Unironically the fix is Drummertunes to fix the vocabulary issues.
I never thought I'd actually say that either.
is there a big difference, in terms of rp, between a q6 and q8?
>>108522456Only if it's just the vocab. Every drummer model sounds the same.
>>108522520
That should be as simple as him having the restraint not to overbake his extra training dataset, no?
>>108522512depends on the model, just give it a shot
>>108522512
I've seen no difference from q8 all the way down to q4, even out around 30,000 tokens. Gemma's just built different.
>>108522524...Lol
>>108521883realkek
>>108522524Does drummer even come here anymore. Haven't seen any posts from him in a while.
PLEASE NIM GO FASTER
I NEED TO READ MUH STORY
>>108522624There's a couple posts in the last few threads I thought might've been him with his trip off.
>>108521009
https://github.com/ggml-org/llama.cpp/pull/21390
i need this to get more varied responses from gemma?
>kobo recommends unslop quants
it's over
https://github.com/LostRuins/koboldcpp/releases/tag/v1.111
>Recommended variants: gemma-4-E4B for smaller devices, or gemma-4-26B-A4B for larger devices. Vision mmprojs can be found here.
>https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF/resolve/main/gemma-4-E4B-it-Q4_K_M.gguf
>https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-UD-Q4_K_S.gguf
>>108522677yeah, that's the one.
when is context getting solved
im tired.....
>>108522677If that gets merged, do I have to use an additional flag?
https://xcancel.com/UnslothAI/status/2040158945189466319
thoughts on the nvidia dgx spark and/or clones?
>>108522707
Shit like this doesn't mean anything. like ok, the model did an audit, but any retard can do an audit, it doesn't mean it's going to be any good.
Congrats, you made a tiny model look at your code and shit out a bunch of useless "observations". Anyone who would actually trust the output of a 4B model is a retard.
>>108522721consult your benchmark but the price is t b h quite a lot to burn
>>108522729
~4k for 128GB of VRAM looks reasonable to me considering that a 32GB card costs more than 2k now. my main concern is repairability, especially since I live in a shit world country, returns are a no-go for me
Gemma 4 is so good I don't even need a system prompt. just a character card and it's good to go.
>>108521733
It has refused me a couple of times, but prompting differently got it to work. It actually listens to the system prompt. Gemma 3 didn't really know what a system prompt was but even it worked okay.
And thank fucking gods Gemma 4 doesn't do the "this is a typical jailbreak, I should ignore it" stuff in its reasoning. Where did that come from in Qwen 3.5? Is it something they distilled from 'toss, or did the chinese come up with it themselves?
Google, I may actually have to kneel
lmao
llama.cpp devs know but just won't do something about the vibeshitter
it's literally impossible for piotr to "fix" something without breaking other things. Impossible.
>using the latest version of llama.cpp
>gemma-4 still breaks after a few replies, starts repeating words over and over again endlessly
Sad times
>>108522797
>latest version
>still breaks
many such things
pic related still isn't fixed even though it would be a one-liner to fix. this is what broke --grammar-file: the file that was parsed is simply not passed to the server.
llama.cpp is not a serious thing
anyone tried these APEX quants yet?
https://huggingface.co/mudler/gemma-4-26B-A4B-it-APEX-GGUF
or should I just stick with unsloth or bartowski?
>>108522007
Just finished a sesh with this card. A few problems:
1. There are too many characters to keep track of. They all say shit and you can realistically only reply to one at a time. Slightly annoying.
2. The scenario of the card is for you to be a god-like entity but also a slave to a bunch of girls. It just doesn't work out well when you can use your magical powers to control their minds and wishes.
So basically I've just spent the past 3 hours focusing on one girl only, while making the rest jealous. Channeled pure euphoria and aphrodisia into her mind the whole time while we fucked nonstop. Gave her magical shadow bunnies and made the air smell like strawberries. Was pretty dope.
>>108522804>llama.cpp is not a serious thingit is owned and maintained by an incorporated entity that is itself owned by an even larger company that has an emoji for a mascot. it is super cereal
>>108522828>They all say shit and you can realistically only reply to one at a time.that's usually just a prompt issue
Oh btw guys, if you have a Claude subscription you can claim free extra credits to cover the next month of payment. They've been doing this every other month for some reason. Pretty cool.
Maybe they feel bad about quantizing the fuck out of opus. It's basically retarded now.
>>108522828>>108522849nvm i misread what you said
>>108522851>Maybe they feel bad about quantizing the fuck out of opus. It's basically retarded now.Yesterday I caught it repeatedly failing to make file edits because it was suddenly perplexed by line terminators so it resorted to replacing entire files of thousands of lines to make small changes.
how do use llama pls help
kobold i just click the exe, choose the model and its done but here like what do u do
>>108522868
Just give up and keep on kobo'ing.
>>108522877
This.
Just wait for upstream to be rolled into kobo and keep smoothbraining. It's so ez.
>>108522883
It already did update, and experimental is only the newline fix behind latest.
Any OpenAI-compatible frontend that comes with built-in basic tools to call, various injection stuff, and at the same time isn't a docker image or something?
>>108522909
https://github.com/LostRuins/lite.koboldai.net
>kobold updated
llama.cpp is still borked for Gemma 4, though, right?
>>108522909
OpenWebUI, but its injection stuff is broken at the moment.
>>108522919
Kind of; it's better than it was yesterday.
I wouldn't count on it getting any better in less than a week.
How did y'all get Gemma 4 to work?
All the GGUF versions segfault when called by tools.
llamacpp on Windows btw.
>>108522061
Does it need SWA, or is normal Q8 KV cache fine?
>"It's... it's absolutely, positively, most spectacularly *insane*! It's a masterpiece! A total, unmitulated, high-octimie masterpiece!"
k t-thanks
>>108522677
>i need this to get more varied responses from gemma?
I don't get it, why do we have to use this now? What's wrong with the temperature?
>>108522087
Name something better UI-wise that uses llama.cpp directly as a backend.
>>108522943
It doesn't work, something's fucked. In the previous thread an anon pushed temp to 10 and it barely changed anything.
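Why cranking temp can do nothing: a toy illustration (not Gemma's actual pipeline — the logit values and min_p cutoff here are made up). If min_p culls the candidate pool against the unscaled distribution and one token dominates, only that token survives, and no amount of temperature afterwards brings variety back:

```python
import math

def softmax(logits, temp=1.0):
    # temperature-scaled softmax over a list of logits
    exps = [math.exp(l / temp) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [10.0, 3.0, 2.0, 1.0]   # one token dominates
probs = softmax(logits)          # p(top) is ~0.999

min_p = 0.1                      # keep tokens with p >= min_p * p(top)
survivors = [l for l, p in zip(logits, probs) if p >= min_p * max(probs)]

print(len(survivors))                 # -> 1: only the dominant token survives
print(softmax(survivors, temp=10.0))  # -> [1.0]: temp 10 over one candidate
```

This is why sampler order matters: if truncation happens before temperature, temperature can only reshuffle whatever survived the cut.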
>>108522933
piotr strikes again
>>108522949
I thought it was already fixed with this >>108517829
>>108522949
llama.cpp truly is the unsloth of backends.
Tested all the Gemma 4 models.
>Gemma 4 E2B
First time ever in local where the smallest model is actually usable. I'm pretty sure the average smartphone-using retard asking ChatGPT to count to 10, explain sport rules, or other childish shit, or sending pictures and asking basic bitch questions, will not even notice the difference between this and cloud providers.
>Gemma 4 E4B
Genuinely better than Nemo-12B, so VRAMlets still stuck on Nemo should upgrade to this. It works differently and has a different style, but it genuinely feels smarter, which is insane. Translation quality is slightly below Gemma 3 27B, but that was the local SOTA for translation just a couple months ago, so this is a big jump. It might be enough to go on holiday in some rural area of Japan without an internet connection and still converse with people, with your phone running this model to translate each other's speech in real time.
>Gemma 4 26B A4B
Better than Qwen 3.5 35B in every way while being faster. This should be your daily driver for extremely time-sensitive tasks or real-time translation. It's pretty sad that it doesn't have audio input, because this would have been the perfect model to have on you while speaking to a foreigner for very quick, accurate audio translation.
>Gemma 4 31B
I don't have to say anything more than the praise already given to it. It's the best model until you reach the ~300B parameter count, which is absolutely insane.
ego death
>>108522957
But can it into cooding?
>>108522957
Imagine how good it will be when it isn't broken.
>>108522947
Koboldcpp, unless you need very specific things in the UI for some reason. Even then I'd sooner say go for ollama or some shit if you really have to. It still downloads models for you directly in the UI doesn't it?
>>108522968
I think coding is the only area where it isn't a step-change improvement over everything else in its size class. 31B holds its own in coding and I think it's good enough to be a competent "OpenClaw"-type agent that you can trust, but I wouldn't let it autonomously manage my PRs like I would Claude Code or, to a lesser extent, GLM 5.1.
>llama-server.exe -m gemma.gguf --host 127.0.0.1 --port 1882 --jinja --fit on --min_p 0 --ctx-size 66560 --parallel 1 --reasoning on
Am I missing anything?
>>108523005
Yeah, you should put that in your terminal, not 4chan.
>>108522947
>>108522998
Actually try oobabooga before ollama too. I keep forgetting it exists.
>>108523005
-ctk q8_0 -ctv q8_0
For the E2B/E4B: they are even more VRAMlet-friendly than they appear at first glance.
Run with llama.cpp as-is, they consume extra VRAM that the models don't need to.
-ot "per_layer_token_embd.weight=CPU"
can be used at pretty much no performance cost. Really it should be the default behavior; it doesn't make sense to load this into VRAM.
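For anyone wanting the full invocation, a minimal sketch — the model filename, -ngl value, and context size below are placeholders, not from the post:

```shell
# Sketch: keep the per-layer token embedding table in system RAM
# while offloading everything else to the GPU.
llama-server \
  -m gemma-4-E4B-it-Q8_0.gguf \
  -ngl 99 \
  -ot "per_layer_token_embd.weight=CPU" \
  --ctx-size 8192
```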
>>108522957
>>Gemma 4 26B A4B
>Better than Qwen 3.5 35B in every way while being faster.
lol
lmao even
sir hows the evenings?
>>108523013
rotations magic dont work with gemmy (SWA)
https://github.com/ggml-org/llama.cpp/pull/21418/changes/9cef34bb5eed2dc7c49c1b08f213c448a54f5384
>Properly managing the model's generated thoughts is critical for maintaining performance across multi-turn conversations.
>Standard Multi-Turn Conversations: You must remove (strip) the model's generated thoughts from the previous turn before passing the conversation history back to the model for the next turn. If you want to disable thinking mode mid-conversation, you can remove the <|think|> token when you strip the previous thoughts.
>Function Calling (Exception): If a single model turn involves function or tool calls, thoughts must NOT be removed between the function calls.
Isn't this commit only for the latter?
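The stripping the PR notes describe can be sketched like this — the `<|think|>`/`<|/think|>` closing delimiter and the message-dict shape are assumptions based on the quoted text, not the PR's actual code:

```python
import re

# Assumed delimiters; the PR only names the <|think|> opening token.
THINK_RE = re.compile(r"<\|think\|>.*?<\|/think\|>", re.DOTALL)

def strip_thoughts(history):
    """Drop prior-turn reasoning, except inside tool-call turns."""
    cleaned = []
    for msg in history:
        if msg["role"] == "assistant" and not msg.get("tool_calls"):
            # plain assistant turn: remove the reasoning block
            msg = dict(msg, content=THINK_RE.sub("", msg["content"]).strip())
        cleaned.append(msg)  # tool-call turns keep their thoughts
    return cleaned

history = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "<|think|>hmm<|/think|>hello"},
]
print(strip_thoughts(history)[1]["content"])  # -> hello
```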
>>108523033
What, why? What a shit show.
>>108522998
>another llamocpp fork
Is it at least always up to date?
>>108523039
Because gemma sir is a SWA model, u cant finna do em attention rotation with them (or at least it's not implemented in llmao.cpp yet). I'm unsure whether it's applicable at all in the future, tho.
>>108522998
>Even then I'd sooner say go for ollama or some shit if you really have to. It still downloads models for you directly in the UI doesn't it?
Ollama is unusable trash: it's a gorillion times slower than LM Studio for some reason, has almost no configuration options, doesn't let you just download any GGUF you want from Huggingface, etc etc etc. So yeah, LM Studio is objectively better in every possible way than Ollama.
>>108523026
Go try it out on your actual workflow instead of being smug about it. It's not even close, so you'll notice the stark difference immediately.
>>108523039
It is now. Pretty sure you can just use the latest llama.cpp builds with it somehow, or at the very least use their experimental builds. Worst case, for the stable builds you're not waiting longer than a few days anyway.
>>108523047
Bro, I already use qwen. Gemma is slower (if used in cmoe mode to contextmaxx): I get 30 t/s with qwen against 17 t/s in gemma.
Fucking retard.
>>108523044
I dunno what these UIs are even needed for, other than downloading models with a click, and I don't use them, so idk.
>>108523033
>rotations magic dont work with gemmy (SWA)
Didn't they fix it?
https://github.com/ggml-org/llama.cpp/pull/21332
>>108523053
I use it just because I'm too lazy to manage my models through the terminal / manually. Convenience, that's what they're for (also quickly setting up dev servers).
>>108523053
I want to be able to change the model load settings in the UI whenever I want, easily save and load model system prompt presets, upload images, etc etc etc. How would you do any of that without a UI?
>>108523060
No, niggerganov just re-enabled QUANTIZATION for the SWA portion, but the ROTATION is outright disabled.
>>108523062
>change the model load settings in the UI whenever I want
Pretty sure lamo cpp has had that for a bit now.
>>108523062
A different UI than ollama's or LM Studio's. I'm pretty sure even llama.cpp can do that with its UI, more or less.