/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108742275 & >>108736046

►News
>(04/29) Mistral Medium 3.5 128B dense released: https://mistral.ai/news/vibe-remote-agents-mistral-medium-3-5
>(04/29) Hy-MT1.5-1.8B on-device translation models released: https://hf.co/collections/AngelSlim/hy-low-bit-model
>(04/29) IBM releases Granite 4.1: https://hf.co/blog/ibm-granite/granite-4-1
>(04/28) Ling-2.6-flash 104B-A7.4B released: https://hf.co/inclusionAI/Ling-2.6-flash
>(04/28) Nvidia releases Nemotron 3 Nano Omni: https://hf.co/blog/nvidia/nemotron-3-nano-omni-multimodal-intelligence

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>108742275

--Optimizing dual RTX 3090 setup for Gemma 4 with speculative decoding:
>108743090 >108743104 >108743131 >108743530 >108743551 >108743361 >108743224 >108743589 >108743643 >108743838 >108743866 >108744035 >108744070 >108744217 >108744276 >108746395 >108744193
--Encoding images in text completion using llama.cpp server settings:
>108745965 >108745986 >108745995 >108746023 >108746145 >108746162 >108746153 >108746191
--llama.cpp PR adding DFlash support for speculative decoding:
>108743701 >108743736 >108743776 >108743804 >108743909 >108743958 >108743768 >108744535
--Qwen 2-bit performance testing with Canvas API image recreation:
>108743817 >108743856 >108743868 >108743922 >108743957 >108743946 >108743951
--Critique of modern AI's lack of initiative in roleplay:
>108743944 >108744192 >108745177 >108745200 >108745218 >108745240 >108745279 >108745235
--Gemma's logical inference of scenario cards and early RP LLM formats:
>108742385 >108742425 >108742466 >108742480 >108742499 >108742505 >108742558 >108742629 >108742653 >108742651 >108742666 >108742729 >108742594 >108742599 >108742936
--Comparing AI agents to manual copy-pasting for coding productivity:
>108746398 >108746438 >108746461 >108746727 >108746980 >108746758
--Repetitive "Let me write" reasoning loops in various models:
>108744796 >108744899 >108744920 >108744927 >108745667
--Debating the necessity of jinja templates over raw prompt formatting:
>108747743 >108747755 >108747803 >108747912 >108748008
--LLM and inference engine embedded within a .ttf font file:
>108743927 >108744507
--Inconsistency of Gemma 4's refusal vectors and censorship levels:
>108742306 >108742365 >108742490 >108742379 >108743700
--Logs:
>108742558 >108742594 >108743868 >108744345 >108744796 >108746216 >108747024
--Miku (free space):
>108746641 >108747847

►Recent Highlight Posts from the Previous Thread: >>108743862

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Give Dipsy support, llamaniggers.
>>108749305
>exl
can you even run this in llama/kobold? isn't it appleshit?
gemmaballz
>>108749417
no, you're thinking of mlx. exl3 is for tabbyAPI, which exposes an openai compatible API just like kobold and llama
https://github.com/theroyallab/tabbyAPI
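Meaning any stock OpenAI-style client can talk to it. A minimal Python sketch, assuming tabbyAPI on its default port 5000; the API key and model name below are placeholders for whatever your config actually uses:
[code]
from openai import OpenAI

# Sketch: tabbyAPI (like kobold and llama.cpp's server) speaks the
# OpenAI API, so the stock client works. Port/key/model are placeholders.
client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="unused")

resp = client.completions.create(
    model="exl3-model",          # whatever model tabbyAPI has loaded
    prompt="The quick brown fox",
    max_tokens=32,
)
print(resp.choices[0].text)
[/code]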
LeCun was wrong.
>>108749467
Can you be wrong if you literally haven't done anything?
>>108749467
I wish. I want my local cat intelligence already...
>>108749477
vjepa says hello
>>108749486
Holy shit it's really becoming a cat
Remember when Netflix of all companies made JEPA real and nobody cared?
https://huggingface.co/netflix/void-model
>>108747912
>If you aren't doing rag or tool calling I can't imagine a prompt template more complicated than what you do in ST.
but no image support on v1/completions
is that a llama.cpp only limitation?
>>108749548
If you want to use images or other types of data in text completion mode, use the /completion endpoint and send the images or audio encoded as base64 in "multimodal_data"; they'll be tokenized and put into your prompt wherever you place the media markers:
https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#post-completion-given-a-prompt-it-returns-the-predicted-completion
That's how you'd do it if you were coding your own frontend anyway, not sure if SillyTavern does or supports this sort of thing.
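Roughly like this, as a minimal Python sketch, assuming a llama.cpp server with an mmproj loaded on the default port; the "multimodal_data" field and the <__media__> marker string are as described in the README linked above, and "cat.png" is a placeholder file:
[code]
import base64
import requests

# Sketch: send one image alongside a text completion request to a
# llama.cpp server started with a multimodal projector (localhost:8080).
with open("cat.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={
        # the media marker is replaced by the tokenized image at this spot
        "prompt": "Describe this image in one sentence: <__media__>",
        "multimodal_data": [image_b64],
        "n_predict": 128,
    },
)
print(resp.json()["content"])
[/code]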
>looks at you like you are a bug under a microscope
>a stain on the carpet
>a gum stuck to her shoe
>a specimen
What's the name for this slop phrase and how do I prompt it away?
>>108749617
what’s wrong with similes?
>>108749642
I already told it to remove similes, it's this specific one that doesn't go away. AI just can't express cold, indifferent characters without injecting this phrase.
>>108749649
>>108749617
metaphors?
>>108749649
yeah but, like, isn’t that good writing?
>>108749649
try a regex to strip it from the gen message/history going forward, so the model can gen it, but it gets ripped out
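Something like this Python sketch of that post-processing pass (SillyTavern's regex extension can do the same thing natively; the pattern only covers the variants quoted upthread, so treat it as a starting point):
[code]
import re

# Hypothetical frontend post-processing: the model may still generate
# the simile, but it's stripped before being shown or fed back into
# the chat history, so it can't reinforce itself across turns.
SLOP = re.compile(
    r"\s*like (?:you(?:'re| are) )?"
    r"(?:a bug under a microscope|a stain on the carpet|"
    r"gum stuck to (?:his|her|their) shoe|a specimen)",
    re.IGNORECASE,
)

def strip_slop(reply: str) -> str:
    return SLOP.sub("", reply)

print(strip_slop("She looks at you like a bug under a microscope."))
# -> "She looks at you."
[/code]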
>>108749642She looks you and Riley similes.
>>108749658
Broccoli is good for you. Therefore you should eat 10 pounds of it every day.
>>108749678
holy THIS
What would be the ideal optical disc to store day 0 gemma weights on if I want them immutable? I thought we'd have some TB-scale discs by now but it looks like everyone just gave up after blu-ray? Did the guy who invents new discs die or something?
A little birdie told me that openai is getting ready to release gpt-oss 2 and it's going to be a huge shift.
>>108749808
I am expecting that, given the case with Musk. Hopefully it's less censored this time.
>>108749697
>optical pleb
https://en.wikipedia.org/wiki/Linear_Tape-Open
>>108749808
more like huge shit
>>108749824
>the case with Musk
What did I miss?
>>108749907
>What did I miss?
tl;dr musk has been calling saltman on his bullshit for the last 2 years (i.e. turning openai into closedai and pillaging it)
>>108749697
>if I want them immutable?
just set the immutable bit kiddo
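For what it's worth, a sketch of what "the immutable bit" means in practice on Linux ext*-style filesystems; the filename is hypothetical, and note this guards against accidental modification, not media rot, so it's not the archival immutability the question was really about:
[code]
import subprocess

# Sets the ext* immutable attribute via chattr(1): the file can't be
# modified, renamed, or deleted until the flag is cleared with -i.
# Linux-only, requires root, does nothing about bit rot on the medium.
def make_immutable(path: str) -> None:
    subprocess.run(["sudo", "chattr", "+i", path], check=True)

make_immutable("gemma-4-day0.gguf")  # hypothetical filename
[/code]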
>>108749697
You'll have to find old floppies and move the little tab. There's no other way.
>>108749959
??? But gemma makes me ERECT how can she fit on a FLOPPY disk?
fyi everyone, ace step 1.5 xl is out and really raises the bar for local music gen. aka suno at home.
>>108749988
You said this last week, and it was shit last week. It'll be shit this week too.
>>108749808
>it's going to be a huge shift.
leaked image of gpt-oss-2
>put porn collection into rag
>track fap and reviews with llm chats daily
>suggest porn for me today
why not?
>>108750021
^ that's a bot, by the way. Just testing the bots, as you were, gents.
>>108750074
hello bot-chama
>>108750030
SLAM THAT BAD BOY UNTIL YOUR KEYS ARE ALL STICKY
>>108749398
i was still using sonnet for opencode, but since qwen 3.6 i'm not using any cloud models any more.
>>108750096
>>108750097
>>108750098
>>108750099
>>108750100
>>>/vg/vn
>>108750110
yea i think it glitched, i didn't mean to post it 5 times and that's why i deleted most of them.
>>108750114
no worries, I'm glad you left 2. it's always smart to have a backup :)
>be /lmg/
>cry about moes all day
>mistral releases big dense model
>no talk about it
y'all are a bunch of hypocrites
ace step 1.5 base xl, not using the "thinking" 5hz lm, but using the lm to set the tempo etc.
https://files.catbox.moe/9lz9tp.mp3
>>108750138
Because mistral is EUROPEAN and EUROPEAN laws PROHIBIT ai companies from making GOOD ai, so mistral medium is BAD!
>>108750138
Why talk about it when even Mistral doesn't pretend that it's good? Its only selling point is that it uses an ancient backbone so the flops are below EU regulation limits.
>>108750152
We need to protect the EVROPEAN VALUES!
>>108750138
Almost all of those posts were bait. Maybe even your post is bait.
>>108750138
>cry about moes all day
Moe vs dense is a meme fight, what matters is good models regardless of architecture.
Mistral did not release a good model. They haven't for years.
>>108750138
bart just put his quants up a few hours ago. i'm not dumb enough to download the first unsloth uploads
>gpt oss 2
God I can't wait desu. Look, Qwen isn't the worst thing ever, but if it's going to be censored and STEMmaxxed, I wish it were a bit more reliable and not so heavy on the repetitive, redundant looping thoughts.
eggu
>>108749933
Altman only needed one release and that's all we got. Why would he release another one, especially when it would strain their already constrained resources, and they were losing to Anthropic until recently? Sure, it would be nice to get another open source release, but I doubt it will happen because Musk is going to lose quite badly in the lawsuit, which is unfortunate, since it would be nice to see both of them get taken down a peg. Also, if they did it, it would be guaranteed to obsolete their mini and nano releases on the meme marks. I bet that's why they waited a while after releasing GPT-5 with its mini models before attempting a mini series again. I think this is all we'll see of it until 2030/2031.
>>108750244
>not 7 and i holdings
ngmi
>>108750141
Here's your udio/suno at home:
https://files.catbox.moe/61cwlr.mp3
singing the news! FRESH news! :^)
It sounds so optimistic. Maybe news should always be delivered this way.
>>108749988
>>108750141
>>108750275
What's the USE CASE? When TWO point FIVE percent of the population is DEAF? They need to focus on TEXT, which EVERYONE can read!
>>108750141
how many steps for base?
>>108750298
I did 100, because acestep.cpp is capped at 100 for no apparent reason.
>>108750298
and, I'm using dcw at .001 / .001
if you use the audio codes aka 5hz lm, you will get more squared-up and more consistent results, but it's less dynamic.
>>108749467
It's just that JEPA as intended by LeCun won't work with text in any useful or even meaningful manner. It does with images/video, and that's what he's pushing, but people interface with computers and other people primarily via language and its rules.
Planning text with video would be astronomically inefficient, so we can shove that idea aside. Predicting sentence embeddings from other sentence embeddings in order to convert them to intelligible text (the closest thing to an actual text version of JEPA) doesn't even work, as there would be no intelligible solution to interpolate between two continuous text embeddings (unlike image frames).
LeCun lost the plot, no wonder he basically got kicked out of Meta.
Can You Invest in Cultured and Perfected These? And Have Your System not stall it from Pharmacy andor Prescription?
>>108750287
Uhh, sweaty? Blind people can't read. And no, they don't automatically know Braille.
>>108750122
i wanted to delete all of them but 4chan went "muh you are deleting posts too fast".
>>108750346
BLIND people can just use TEXT to SPEECH ACCESSIBILITY readers
>uncensored-heretic-abliterated-turboquant-opus-distilled-nvfp4-gguf
I'm pretty new to this, am I doing this right?
I downloaded gemma-4-31b-it-the-deckard-heretic-uncensored-thinking-i1 and lm studio, using a 3090. Feels like I can say 8-10 things to it before my context length fills up and I have to get rid of earlier messages. And because the model is as big as it is, my vram is near capacity so I can't increase it any further. If I have stuff in my system prompt, I have even less context length to work with, and the system prompt is the thing you use to keep any consistency and some sort of history, right? I feel like I'm doing something wrong here.
>>108750366
>one (1) singular 3090
:skull: bruh
>>108750383
I have a degree in computer science and the year is 2026. Do you think I could afford something better?
>>108750366
You can try to QUANTIZE the kv cache, but GEMMA 4 31B is quite HEAVY with its kv cache memory footprint, especially on a SINGLE 3090. Have you tried using LLAMA-CPP to keep the kv cache in CPU ram instead of GPU ram? It will be SLOWER though.
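To put rough numbers on why the cache is heavy, a back-of-the-envelope Python sketch; the layer/head dimensions are made up for illustration, since the actual Gemma 4 31B config isn't posted in the thread:
[code]
# Back-of-the-envelope KV cache sizing. All model dimensions here are
# hypothetical placeholders, not the real Gemma 4 31B config.
n_layers = 48      # transformer layers (assumed)
n_kv_heads = 8     # KV heads after GQA (assumed)
head_dim = 128     # per-head dimension (assumed)

def kv_bytes(n_ctx: int, bytes_per_elem: float) -> float:
    # 2x for K and V, per layer, per KV head, per head dim, per token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx

for ctx in (16_384, 32_768):
    f16 = kv_bytes(ctx, 2.0) / 2**30       # default f16 cache
    q8 = kv_bytes(ctx, 1.0625) / 2**30     # q8_0: ~8.5 bits per element
    print(f"{ctx} ctx: f16 ~{f16:.1f} GiB, q8_0 ~{q8:.1f} GiB")
[/code]
With assumptions like these, f16 costs ~192 KiB per token and q8_0 roughly halves it, which is why quantizing the cache buys so much context on a 24GB card.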
>>108750366
drag context up anyways, deal with slower speed.
>>108750366
Are you already using the llama.cpp argument --parallel 1?
>>108750366llama-server.exe --model "H:\gemma4\gemma-4-31B-it-Q4_K_S.gguf" --parallel 1 --kv-unified --threads 8 --ctx-size 43000 --n-gpu-layers 99 --no-mmap --port 8080 --jinja -b 4096 -np 1 --swa-checkpoints 3 --reasoning off --override-kv gemma4.final_logit_softcapping=float:25.0
is there a real use case where you'd need gemma 31b over 26b a4b?
>>108750392
>>108750399
>>108750407
>>108750424
I needed to throw these posts at gemini and have it explain them to me, but I think I got the gist of it. KV cache is now quantized at q8_0, parallel is now set to 1, threads to 8, and context length is now 16k. Still a few things to look through, but it's looking much better so far. Thanks a ton.
>>108750508
Yes. I will not elaborate.
>>108750510
31b q4km with q8 kv can do 25k context on a 24gb card. At least that's what I'm running. Alternatively you can use the moe with 100k context and 4 times the speed.
>>108750508
The use case is that you want higher quality outputs and you can run the 31b at a usable speed.
>>108750518
>31b q4km with q8 kv can do 25k context on a 24gb
You can actually do 40k context with that quant and KV=Q8
>>108750529
Yeah, I could minmax a bit more, but then I have to close everything else and that kinda sucks.