/g/ - Technology

File: 1768958853289633.png (325 KB, 1478x1374)
Has anyone tried to use the new Gemma 4 models with any agent harnesses locally? I mainly use Opencode, and my current machine is powerful enough to run gpt-oss 120b at q4_k_m quantization (I could use higher quants, but then the t/s and prompt processing speeds fall off a cliff the longer the context gets), but apparently Gemma 4 curb stomps it despite only being 31b. Is it actually worth trying or is it just more benchmaxxing?

Also, I've seen people here say that it's not worth using MoE models because they are inherently "dumber" than dense models, and that the only advantage of MoE is faster t/s, especially if you're on weaker hardware. To those who say that: does that mean I should only be concerned with the dense 31B model? Does the KV cache behave differently? Like, does the MoE KV cache build up slower and lead to smaller slowdowns at long contexts than dense models, or does it behave about the same?
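For what it's worth, the KV cache doesn't care whether the FFN layers are dense or MoE; its size is set by layer count, KV head count, head dim, and context length, so at equal attention dims it grows identically for both. Rough back-of-the-envelope (every number below is a made-up placeholder, not Gemma 4's real config):

```shell
# rough f16 KV cache size: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes_per_elem
# all values are illustrative placeholders, not Gemma 4's actual config
layers=48; kv_heads=8; head_dim=128; ctx=32768; bytes=2
echo "$(( 2 * layers * kv_heads * head_dim * ctx * bytes / 1024 / 1024 )) MiB"
```

So the MoE's win is per-token FLOPs (only a few experts run per token), not cache growth; the slowdown curve over context should look about the same.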
>>
>>108566129
It's way too big for my pc
>>
>>108566129
iirc, gemma is a general purpose model and not a coding model. so you can get pretty good performance on crappier hardware than other similar models, but it's not going to be especially good at coding
>>
>>108566129
Qwen 3.5 Opus 4.6 Distillation is decent for coding. Even better if you use another model to keep it on task
>>
>>108566331
I can run it on my handheld PC.
>>
You would think that the audio and image capabilities would hurt the coding performance.
>>
Asked it to change the Neovim theme in my dotfiles folder. It said "Sure!", read a bunch of files, and then immediately ran out of context and forgot what it was doing. Asked it again, but it wasn't retaining the information it had already read, so it kept rereading the files ad infinitum.
>>
File: 1769672499352563.jpg (37 KB, 922x715)
>>108567093
>Ran out of context

Let me guess. You were running it with llama.cpp as the backend and forgot to set the -c parameter to a reasonably high number
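If that's the case, the fix is just passing a bigger context at launch (model path here is a placeholder):

```shell
# llama-server otherwise picks a small default context; -c sets it explicitly,
# -ngl offloads layers to the GPU. /models/gemma-4.gguf is a placeholder path.
llama-server -m /models/gemma-4.gguf -c 32768 -ngl 99 --port 8080
```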
>>
I'm new to local AI.

Anyone here using wsl and rocm? I'm struggling to get ollama working with it. Just wondering if anyone has any tips or other alternatives
>>
>>108567108
no i increased the context
>>
>>108567158
>no i increased the context
To what? Other parameters like temperature matter a lot too. Did you use their recommended settings?
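In llama.cpp those are just launch flags; the values below are placeholders, use whatever the model card actually recommends:

```shell
# sampler settings on the llama-server command line; the numbers are guesses,
# check the model card for the real recommended values
llama-server -m /models/gemma-4.gguf -c 32768 --temp 1.0 --top-k 64 --top-p 0.95
```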

>>108567153
>wsl
For Windows? For what purpose?

https://ollama.com/download/windows
>>
>>108567170
Windows Subsystem Linux.

I've been using it for 6 years in my job and it's a habit. I haven't created a python script in windows.

I did see that gemma4 recommends "lemonade" so I'll give that a go when I can be bothered to take another crack at it.
>>
>>108566129
I can't even get it to work
it chats fine but crashes when trying to use it with any tools
Did anyone actually try this shit out?
>>
>>108567153
I gave up on wsl and just used PowerShell instead
>>
>>108567335
I think it's about time I learn PowerShell if I get failures tomorrow. Cheers
>>
>>108567258
werkz on my machine. You might have to update the agent harness you're using. pic rel is opencode using the moe version locally.
>>
>>108567750
what gpu
>>
>>108567760
>>
>>108566129
Is it not available at openrouter etc? Try it out there. And report in, it's actually interesting.
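If it is listed there, a one-line smoke test against their OpenAI-compatible endpoint works (the model slug below is a guess, check the site for the real one):

```shell
# OpenRouter chat completions request; "google/gemma-4-26b" is a guessed slug
curl -s https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"google/gemma-4-26b","messages":[{"role":"user","content":"hello"}]}'
```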
>>
>>108567750
I can't be the only one with this problem, it's literally unusable
do you see any vram spike?
https://github.com/ggml-org/llama.cpp/issues/21690
>>
>>108566938
More data to train on, so it ends up with better capabilities overall.
>>
>>108567158
>>108567108
>>108567093
just increasing context during inference won't help you much if the model wasn't trained to work at high context length
>>
>>108567809
Cool shit
>>
>>108567881
My backend is ollama (which is itself based heavily on llama.cpp), so whatever issue you're running into doesn't seem to happen for me. I haven't done any heavy usage of gemma4 yet, so for all I know it could shit itself at long contexts like what >>108568059 >>108567093 are experiencing, so who knows. So far all I've done is have it create a README for this https://github.com/AiArtFactory/llava-image-tagger and then had it spoonfeed me how to push a verified update-commit to main. I think next I'll see if it can create and modify custom nodes and workflows for ComfyUI for me like Kimi-k2.5 was able to do
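One ollama-specific gotcha for anyone comparing notes: its default context window is small, so long agent sessions need num_ctx raised per-model via a Modelfile (the model tag here is a placeholder):

```shell
# raise ollama's per-model context window; "gemma4:26b" is a placeholder tag
cat > Modelfile <<'EOF'
FROM gemma4:26b
PARAMETER num_ctx 32768
EOF
ollama create gemma4-bigctx -f Modelfile
ollama run gemma4-bigctx
```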
>>
>>108566129
If they are not lying about 26B performance, it's worth looking into, maybe.

Gemma4 31B (dense model) Q4
3090 - 36t/s
4090 - 43t/s
5090 - 65t/s

Gemma 4 26B (MoE) Q4
3090 - 120t/s
4090 - 147t/s
5090 - 182t/s
>>
>>108567809
It won't go far actually. High RAM macs are a waste.

gemma4 31B (dense)
M3 Max 36GB - 12t/s

gemma4 26B (MoE)
M3 Max 36GB - 56t/s
>>
>>108568314
performance on my machine. I'm >>108567750
>>
bonus

Gemma4 E2B Q4_K_S (from unsloth)
RX580 - 11t/s

Gemma4 E2B (not sure about Q, it's the AI Edge Gallery version, 2.5GB in memory)
ARM Mali-G57 MC2 - 5t/s
>>
>>108568314
>High RAM macs are a waste.
why?
>>
>>108568314
nobody cares about 36gb macs lol.
"high ram macs" are 128GB mac studios and up.
>>
>>108568385
nta. way ahead of ya
>>
>>108568376
Nice to have big stuff in RAM. Useless when it's too slow to be useful. Like with LLMs.
>>108568404
Not a dense model. But yeah, this one seems usable. But then again... Do you need more than 64GB? Probably not. Even 32 used to be a waste, before OSS were released. What existed prior was mostly garbage.
>>
File: 1752989845418212.png (361 KB, 2072x1307)
>>108568445
>Not a dense model
???

https://huggingface.co/google/gemma-4-31B

>seems usable.

Define unusable. I really hope t/s isn't the metric you're using....
>>
>>108568109
>ollama
Do yourself a huge favor, switch to dockerized llama.cpp
>>
>>108568474
>???
Look again. At the top of your screen shot it says
ollama run gemma4:26b-lalala
>>
>>108568711
You said "not a dense model" but I tested the moe AND the dense model
>>
mmm, I'm more curious about that Google technology for compressing memory usage, I guess they've implemented it in this local model
>>
>>108566129
>Is it actually worth trying
what do you mean? you already have a local llm setup. you just download the file and select it from the agent dropdown


