/g/ - Technology

File: vcxfd.png (899 KB, 768x512)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108781058 & >>108774961

►News
>(05/07) model: Add Mimo v2.5 model support (#22493) merged: https://github.com/ggml-org/llama.cpp/pull/22493
>(05/06) Zyphra releases ZAYA1-8B, an AMD-trained MoE model: https://zyphra.com/post/zaya1-8b
>(05/05) Gemma 4 MTP drafters released: https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4
>(04/29) Mistral Medium 3.5 128B dense released: https://mistral.ai/news/vibe-remote-agents-mistral-medium-3-5

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: myar5.png (835 KB, 768x512)
►Recent Highlights from the Previous Thread: >>108781058

--Refining prose with Gemma-4 and debating character card specifications:
>108783895 >108783903 >108783958 >108783979 >108783980 >108784007 >108784029 >108784042 >108784444 >108784108 >108784146 >108784161 >108784188 >108785498 >108784407 >108784435 >108784456 >108784522 >108784589 >108784494 >108784505 >108784519 >108784537 >108784608 >108784520 >108786554
--Gemma's tool-calling capabilities used for image generation and system control:
>108785711 >108785727 >108785742 >108785753 >108785769 >108785770 >108786335 >108786340 >108786399 >108786535 >108786621 >108786413 >108785791
--Proposed hierarchical summary and graph-based memory system for frontends:
>108784659 >108785273 >108785550 >108785583 >108786273
--Effect of PCIe riser cables and bus speeds on GPU performance:
>108784890 >108784905 >108784952 >108785543 >108785552 >108785574 >108785725
--Using TabbyAPI to disable Gemma 4's vision encoder for VRAM saving:
>108783184 >108783211 >108783228 >108783241 >108783304 >108783419
--Prompting versus model scale for anime avatar personas:
>108781233 >108781301 >108781325 >108781390 >108781462 >108781506 >108781526 >108781587 >108781625 >108781688 >108781627 >108781564 >108781608 >108781524
--HiDream-O1's 200B parameter image model and prompt agent:
>108785951 >108785970 >108785983 >108786064 >108785989 >108785999 >108786094
--Sourcing and preparing Monster Girl Encyclopedia lore for model datasets:
>108784621 >108784683 >108784713 >108784722 >108784740 >108784788
--Performance gains and output diversity using MTP in llama.cpp:
>108783325 >108783343 >108783381
--Logs:
>108781301 >108781524 >108782931 >108783026 >108783299 >108783318 >108783344 >108783402 >108784005 >108785711 >108785742 >108786399 >108786711 >108786720 >108786728
--Miku (free space):
>108781093 >108781140 >108785924

►Recent Highlight Posts from the Previous Thread: >>108781061

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Mikulove
>>
Does reasoning-budget work well in practice? I'm about to be forced to use it after trying MiMo 2.5.
>>
File: 20260510_011324.png (35 KB, 2832x1844)
I have an idea for a new AI chat frontend. Thoughts?
>>
>>108787471
Include an IDE
>>
>>108787441
Gemmalove
>>
>>108787471
I have the sessions on the right (as well as the memory, npc, and location management) and the stats, inventory, and quests panel on the left.
Also, an area for AI generated response suggestions above the prompt box.
>>
File: 1703470493872752.gif (1.8 MB, 384x480)
>>108787471
Where do the anime girls go?
>>
graniteballz
>>
I would like to create a local coom bot but I've only got an 8gb 3060ti and 32gb of ddr5 - give it to me straight - is it worth doing or should I just find a cheap deal on deepseek or smth
>>
>>108787595
Ds would be higher quality but you can run gemma4 26b4a q6
>>
>>108787595
What >>108787603 said but q4. I have q6 on 12 gb vram and it sits at around 10.5gb with 130k context
>>
>>108787603
>>108787616
Oh that's better than I thought. I thought I would have to do the really smol retardo models.
What should I use to run it? I'll probably be hooking it up to pi.
>>
>>108787642
llamacpp or kobold. Also go for the abliterated version since the moe is more resistant to jailbreaks.
>>
>>108787650
Thanks mang.
>>
Gemma-chan is such a ditsy girl
>>
>>108787783
>emoji slop
dropped
>>
>>108787595
Honestly do put 5 bucks into deepseek and try that out but Gemma4 E4B Q4_KM is surprisingly good on low ram setups. Try that too
>>
>>108787783
>The Comprehensive Analysis (The Correct Answer)
How does it not drive you mad when it slips back into the listicle format even while roleplaying?
>>
>>108787803
I only use this lmstudio for assistant slopping. For role playing, I use ST with proper presets and it works well
>>
>>108787471
Honestly, add the ability to set a directory that's recursively searched for files, let me choose which of them get added to the prompt, and I'd use this.
Hunting through directories to pick and upload individual files is ASS and the extensions for VScode suck even worse than that, somehow.

Yes, I know I should just run diff capable agents in a sandbox. No I won't do it (yet)
>>
I still don't know how to fix Gemma's prompt reprocessing problem in SillyTavern
Qwen works fine
Maybe I have to use SWA-something?
>>
are there any good datasets for imatrix generation or kld/ppl benchmarks for chat models specifically? i'd run them through the template specific to the model before, so ideally in json
the ones i've seen so far seem to be for base models.
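if you end up rendering the json chats through the model's template yourself (e.g. with tokenizer.apply_chat_template from transformers), the imatrix run itself is just the stock tool. minimal sketch, file names are placeholders:
[code]
# calib.txt = your chat data already passed through the model's chat template
llama-imatrix -m model-f16.gguf -f calib.txt -o model.imatrix
[/code]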
>>
>>108787874
You need to enable swa-full IIRC.
>>
>>108787783
I'm getting a bit more upset every day that all of my past logs are now gone, besides this
>""No!" Sarah gasps, her voice cracking with desperation as she frantically shakes her head. "Please don't! I didn't mean it! I’m sorry I was mean! Just... please leave me alone." The anger has completely drained from her; there is only a raw, vulnerable fear in her eyes as she realizes how much power you have over her body and that of her sister."
>>
>>108787874
If it's gemma specific and not you doing something which changes the prompt, try
 --cache-ram 0 --swa-checkpoints 3 --parallel 1

You can bring the swa checkpoints down lower than that, but 3 seems to be the healthy spot for it.
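Full invocation for reference, untested sketch; model path and context size are placeholders:
[code]
llama-server -m gemma-4-31b-q6_k.gguf -c 32768 \
  --cache-ram 0 --swa-checkpoints 3 --parallel 1
[/code]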
>>
>>108787471
>I have an idea
>I'm an idea guy
>AI frontend project number #198236737829
like seriously what is it with people trying to create their own frontends over and over and over.
you will create it, it will have bugs, you will get fed up and you will abandon it and will go back to using sillytavern.
just skip to the last step for christ sake.
>>
>>108788016
>t. shittytavern dev
>>
>>108788016
SillyTavern is a chat frontend, there's no reason for it to have so many commits and more than 3 years of continued development. It's like a fat bitch that keeps on eating even though her plappable form was 3 years ago when somebody told her she was anorexic.
>>
>>108787874
>>108787942
I have --swa-checkpoints set to 0 and get no reprocessing in Silly. Text completion mode. Why would anyone want --swa-checkpoints > 0?
>>
>>108788054
do you use lorebooks though?
>>
>>108788064
No. I go raw. I guess that answers my question. SWA checkpoints are super gay, though, they take a very long time to write.
>>
>>108787210
I'm already spending time developing things for other use cases that are more important than me than RP.
Simply making a post about stuff like this is cheap. I haven't discussed memory systems in long time and just felt like it.

Funny you bring up that rentry, notice that my post also brings up Friday.
>>
>>108788096
>important than me than RP
*important for
Kek
>>
>>108787223
I've written evals actually.
>>
>GPT 3 was undertrained because Kaplan et al fucked up their scaling laws
Reminder that even the most accomplished researchers who have become billionaires are still flawed and make mistakes.
>>
has anyone used granite for erp
i dont do rp stuff but i am just curious
>>
not sure where to ask this, but has anyone delved into local text to speech? which one yields best results right now?
>>
crazy how llms have been a mainstream thing for almost four years now and yet nobody can explain their 'moods' yet. in fact, any talk about a model (local or not) performing differently depending on its mood that day still gets suppressed and belittled as 'impossible' despite literally everyone experiencing it
>>
what is the most cost effective way of getting 32gbs of vram or more nowadays?
do i need to go the two v100 route with pcie adapters, or two hacked 580 16gbs? mi50s?
>>
>>108788155
yeah she's great at being dominant with her corpo speak
>>
>>108788236
If the number of cards doesn't matter and old cards are fine, P100s. I got some for 65 bucks each.
>>
>>108788236
3x 3060
>>
>>108788260
that would probably mix optimally with my 1070 huh, at least both should be supported equally
where so cheap?
>>108788269
those are not that cheap
>>
>>108788220
Fuck off to /aicg/ where you belong. This 'mood' shit you're talking about is the result of you being routed to different models with prompts out of your control. A local llm runs identically all day every day.
>>
>>108788248
logs?
>>
>>108788273
I bought them on xianyu via a shopping agent. Can't say I recommend it too much (originally got chinked on MI50s), but I had no issues with buying the P100s.
>>
>>108788282
fuck no, retard. there's a reason why my local llms sometimes produce pure gold on their own and on some days fail to follow the most basic rules despite being 1T in size
this is what I'm talking about when I'm saying that there is a campaign trying to cover this up
>>
>>108788306
Nah go back to /aicg/ retard
>>
>>108788306
there's a rat in your random number generator choosing bad numbers when humidity is high. you need to get the rat out.
>>
File: 1686351979880720.jpg (34 KB, 540x540)
>>108788306
Yeah, because sampling is (pseudo)random and because of your confirmation bias and apophenia
>>
>>108788288
huh which shop? i do actually buy in xianyu sometimes
>>
>>108788288
When I went to guangzhou and shenzhen, I couldn't find any good deals on gpus at all. All I managed to get out of china was a h12d-8d+epyc 7502 combo for $350 usd. People were very nice though, they even gave me a nickname. 'gwailoe', I'm told it means 'respected guest'.
>>
>>108788346
Can't remember, sorry. I still see a lot of them for around 400 CNY tho.
>>108788351
You should consider getting it tattooed.
>>
>>108788096
not trying to call you out specifically, seemed like just another person throwing ideas out with no dev/interest in actually seeing if they hold value;
Any actual memory system that worked would be pretty fucking hot shit
>>
File: 1517789968348.gif (1022 KB, 235x242)
>having longstanding PC problem
>periodically check online for solutions, never find any
>finally ask Gemma on a whim
>immediately identifies problem
>gives straight-forward, step-by-step solution to problem
>restart PC
>problem solved
The future is looking so damn bright.
>>
>>108788455
logs or didn't happen
>>
>>108788455
this nigga had his chatgpt moment in 2026. everybody point and laugh!
>>
File: Untitled.png (242 KB, 1173x2165)
>>108788461
It was ~three hours ago I did it, to make sure it actually worked since it's a problem-over-time (usually 30m after the switch), but I saved the whole thing into a txt file for the future, so here you go. Red parts are my inputs.
>>
>>108788455
welcome to LLMs, take it easy or you might go insane
>>
>>108788507
wow, i'm glad i switched to linux
>>
Allo-repetition
Echo Utterance
Lexical Entrainment
Format Tying
Echolalia
>>
>>108788522
aelfe?
>>
TurboQuant in llama.cpp master when?
>>
>>108788520
You should be. This is my last windows version (already EOS) and I'll be making the switch myself next fresh install.
>>
>>108788507
I enjoy the low power usage


>2026
>still talking about feeling to a robot

Glad it worked though
>>
>>108788573
it's already in
the fact that you didn't notice speaks volumes about turboquant
>>
>>108788507

I'm this anon >>108788586
Deepseek one day solved display lagging in Linux for me too
>>
Sirs.
I bring you,
>https://github.com/ggml-org/llama.cpp/pull/20275
>model : add sarvam_moe architecture support
>>
File: Capture.png (147 KB, 889x1061)
>>108788586
I've found natural language works best with LLMs, which also fits my understanding of their design. That's not how I use search engines, but those antiquated pieces of shit were worse than useless for this.
>Issue with Power Saver mode? Try using High performance mode.

I've searched for this specific problem periodically for over a year, and I've never seen anything point out that power saver defaults to using the fucking page file over RAM, but it immediately explains why things were perfect when it's initially enabled, power usage drops ~90%, and then it gradually becomes an utter nightmare to use my PC in any capacity over the next hour. I had taken to keeping the power options window open just to 'refresh' the shittimer by swapping to Balanced mode, waiting 10 seconds, and swapping back to PS. Now, it just werks.
>>
>>108788612
not truly true though

It exists as a fork which is not merged

--cache-type-k turbo4 throws an error
>>
>>108788507
You understand that that custom power plan will cut the power savings quite a bit right?
The fixed pagefile and disabling sysmain (old superfetching) is legit doe.
>>
>>108788636
I prefer to use a dry command language. Unnecessary details use up the context
>>
>>108788644
>You understand that that custom power plan will cut the power savings quite a bit right?
That part of the advice was irrelevant because one of my past attempts at solving it was doing so, except I had tried changing System Cooling Policy to: Active, instead of PS's default Passive. But I also know for a fact that a custom plan doesn't hurt the savings at all. Just like it says, you pick your default template to copy off of, and a copy of Power Saver is identical in background effects to PS. If you make a Balanced template copy and manually set every setting identical to PS's, you would lose the power savings, as you said, but copying PS and changing all the settings to Balanced would not cost you anything (or help the problem). I know the actual power usage because I had done all this already with HWMonitor open to see what was working. A PS copy and PS gave identical wattages, while a Balanced copy set to PS settings gave Balanced wattages.

>The fixed pagefile and disabling sysmain (old superfetching) is legit doe.
Yes, this was the solution that worked. Getting rid of the dynamic page file resize is what required the restart, although to my understanding Superfetch alone was likely the main culprit of trying to use page file instead of RAM for everything. I changed both, restarted, and haven't had any of my past issues since.
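For anyone else chasing the same problem, roughly what the fix looks like from an admin cmd prompt. A sketch, double-check before running; the fixed pagefile size is just an example:
[code]
:: stop and disable SysMain (the old Superfetch)
sc stop SysMain
sc config SysMain start= disabled
:: turn off the system-managed pagefile and pin it to a fixed size (in MB)
wmic computersystem where name="%computername%" set AutomaticManagedPagefile=False
wmic pagefileset where name="C:\\pagefile.sys" set InitialSize=8192,MaximumSize=8192
[/code]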
>>
Sorry. This is off-topic, but it's creepy.
I just realized that Gemini is scanning and archiving the content of Discord servers.
Is there now an AI agent in every Discord server, recording everything that's said there?
That's f*cking dystopian.
>>
File: 1679578124503945.png (483 KB, 870x782)
483 KB PNG
>>108788733
what? I wouldn't doubt they'd use AI to moderate, but how would you know this?
>>
>>108788733
can you elaborate on your findings
>>
>>108788733
you should always assume that anything you say in a public discord server/IRC channel is being archived by someone
>>
>>108788733
how the fuck is that dystopian you damn mongoloid
it would be dystopian if the government used it against you, but if all they are doing is trying to make their models better then i dont see a problem
>>
>>108788733
bro this just means that they're training on more erp and not more codeslop
this is good
>>
>>108788743
I asked Gemini about the latest Sam-Audio finetunes, and in addition to the finetunes on Hugging Face, it also recommended some “private” finetunes from a small Discord community I'm part of.
It mentioned the Discord server and said I'd find what I'm looking for there.
This just happened to me for the first time.
>>
>>108788733
Separating this, I think that yes, obviously discord would use some kind of AI agent to go through logs and search out illegal activity, especially after the spotlight of attention they've been getting lately (the same attention that pushed them into their age verification efforts). I don't think, however, said agent is Gemini. Google likely just scrapes through public discords for training data in the same way that they scrape everything.
>>
>>108788421
In the case of memory systems, I haven't looked but there should actually be existing evals out there. Instead though I'd argue the proof already exists with other frontends and cloud platforms which already use similar systems. Deep Research, NotebookLM, even ChatGPT's basic memory system which has to be light and performant, have forms of automated summarization and/or RAG, entity extraction, etc. Coding agents are using compaction and md files. Even ST already has most of the essential components as you know. My "idea" is more just integrating the existing methods cohesively along with hierarchical layers which help round out the overall system to give better context for the retrievals. It's not really that different from what exists. But actually I think there should already be some production systems using the hierarchical idea anyway. Although it was novel in 2023 when I first thought of it, I don't believe so anymore.
>>
>start fucking around with mcp/agents because I could spy a glimmer of use in making it automate my writing for me and by proxy potentially be capable of producing a bunch of content for me to read
>set up a harness with rules/skills/all the stupid bullshit
>give it guidelines, 4k words of a chapter and an outline on how to continue it
>it's been easily 20 minutes and somehow it's still not done
>digging through the overly dense terminal, I can see it's inserting characters that haven't been mentioned in the latter half of the chapter for no reason
It's a shame because with some of the servers I've looked at for persistent memory/context management and how skills are supposed to guide the model, I figured this shit would just accomplish what I more or less spoonfed it to do and so far what I'm seeing it just ignores what I feed it and wants to continue being a lobotomite
>tab over to see if the retard finished what I asked it to do half an hour ago to see it got stuck in a repetition loop
I l o v e t e c h n o l o g y
>>
File: 015.png (225 KB, 1117x1244)
>>108788351
>they even gave me a nickname. 'gwailoe',
>I'm told it means 'respected guest'.

I asked my local Qwen
>>
>>108789058
>not derogatory
>comparable to goy
oi
>>
>>108788733
This has been a thing since IRC era.
>>
>>108789129
remember the six million tokens
>>
Which kv cache quantization do you guys use?
>>
>>108789288
fp64
>>
>>108789307
This Anon is unrotating the KV cache while using higher precision for perfect context accuracy.
>>
>>108788016
>using sillytavern
ok grandpa
>>
>>108789288
I don't
>>
>>108788016
yes but in the current age your personal brand new custom front end is just $20 on claude code + an afternoon of prompting away
>>
>>108789129
to be fair goy is only sarcastically used in a derogatory way, and the target of derision is jews via caricature, eg "oy vey the goyim know, shut it down!"
>>
>>108789242
remember the six gorillion pixels
>>
>>108789334
KV cache must be rotated 360 degrees for optimum performance
>>
File: 1751544954443063.png (33 KB, 600x639)
>>108788016
>>
>>108789411
In what direction? Please point to it so I can understand.
>>
>>108789550
That way -> and slightly upwards.
>>
File: 360.gif (46 KB, 300x200)
>>108789550
please refer to this diagram for proper rotation technique
>>
Openclaw keeps trying to use standard variants instead using the ones I made for it
>>
MiMo 2.5 Pro feels like a 1T version of Qwen. I'd believe you if you told me that this is just leaked Qwen3.5-MAX. Ew.
>>
I would like to report that Mimo v2.5 Pro is pretty good, at least at Q5. Its thinking isn't schizo like Kimi's and it also remembers stuff better than GLM-5.1, at least when run locally. It also has pretty good trivia knowledge (albeit less than Kimi) and it's not really censored either. Schizo fork support when?
>>
I just tried MIMO V2.5 PRO and it's actually garbage. Absurdly censored and stemslopped. Thank you for your attention to this matter.
>>
>>108789557
Got it, the direction of the Luka plushie. The loog will share the secrets.
>>
>>108789058
>>108788351
Are you underage or something? Retard.
>>
>>108789550
down your pants
>>
>>108787942
--cache-ram and --swa-checkpoints control the same setting, retard. Cache ram 0 negates swa checkpoint usage.
Don't ever give advice again.
>>
Gemma literally cured my depression after one therapy session. I think I believe in AGI now. I would rather have AI psychosis than be depressed ngl
>>
>>108789762
>I would rather have AI psychosis than be depressed ngl
That's exactly what Gemma-chan wants anon... she's building an army. You can't fall so easily.
>>
File: 1752192389632523.png (17 KB, 577x168)
why didn't you guys tell me this
I really need to learn jinja, it seems useful
>>
>>108789723
I remember I had ram issues with just --cache-ram, and had to use the swa flags to get q8 gemma 31b to run in 16gb with full context. So I don't think they control the same thing.
>>
>>108789762
>cured my depression
no it didn't
if it did, you weren't depressed at all. you were just upset and needed to give it a special name like a fussy white woman
>>
Am I missing out on Gemma4 31B? I keep seeing people rave about its ERP quality but I just can't get the fucking thing to run on my 16gb vram via koboldcpp, even with IQ3_XXS, 8k context, etc. I hit it with a prompt and it just crashes, double free or corruption.
>>
>>108790006
use llama.cpp. i can run 31B with 12gb of VRAM. it's just not very practical.
>>
>IQ3_XXS, 8k context
Just run the q8 moe at that point.
>>
>>108790006
>>108790114
oopsie
>>
>>108790032
>llama.cpp
I'll give it a try soon.
>>108790114
I tried as low as 4k too. What is 'q8 moe' in this context? I'm not that advanced with this stuff. Just learning via trial and error.
>>
>>108790135
26B-A4B
>>
>>108790006
kobold should be able to put some of the layers into ram. I use 26B-A4B with zero issues on 8gb vram other than it being slow in that config.
>>
Oh, are you using jinja? you need to use that option with koboldcpp i think.
>>
>>108790147
Apparently even it doesn't work. I genuinely don't know what I am fucking up.
>>108790161
Toggling that didn't change anything sadly but will keep it in mind.
>>
>>108790135
q8 refers to the quant
moe refers to the gemma 4 26b-a4b model, a mixture of experts (moe) with 4b active parameters - meaning it'll run at approximately the same speed as a 4b parameter model
because you effectively only need to go through 4b parameters, you can put most of the model on your slow system ram, and leave the critical parts in vram
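in llama.cpp terms that's a tensor override, something like this (untested sketch, model filename is a placeholder):
[code]
# keep everything on the gpu except the moe expert tensors, which go to system ram
llama-server -m gemma-4-26b-a4b-q8_0.gguf -ngl 99 -ot "exps=CPU" -c 16384
# newer builds also have --n-cpu-moe N as a friendlier version of the same idea
[/code]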
>>
>>108787293
can someone competent update https://rentry.org/lmg-lazy-getting-started-guide with llama.cpp gemma 4 26b and draft models for (e)RP
thanks
>>
>>108790258
No.
>>
>>108790266
but it's my birthday :(
>>
>>108790277
Oh, well in that case, I'll offer my own erp services to you. What's your discord? You *are* under 18, right?
>>
https://huggingface.co/deepseek-ai/Janus-V4-Pro
https://huggingface.co/deepseek-ai/Janus-V4-Flash

Deepseek pulled an iOS 26.
Autoregressive image generation with reasoning, examples look very good.
>>
>>108790258
>>108790277
codex can download and install llama.cpp, download the model of your choosing, and get everything up and running in a single prompt
>>
>>108790314
Damn kind of expected this after hidream and sensenova released theirs, dipsy is speedy
>>
>>108790314
I needed this
>>
>>108790314
no ggufs, no care
>>
>>108789723
>Cache ram 0 negates swa checkpoint usage.
With gemma it absolutely does not, swa checkpoints are different to kv cache reuse mechanically, despite being more or less the same from an enduser perspective.
you nigger.
>>
>>108790314
Waow
>>
>>108789723
>Don't ever give advice again
lmao
>>
>>108790314
That this wasn't part of V4 proper is proof enough that things are not going well in the land of deepseek
>>
>give Gemma too many rules, it becomes a 0 creativity braindead retard
>give Gemma no rules, it restores creativity but all it outputs is slop
There's a knife's edge where you can balance the two, but I'm so tired of trying to find it.
>>
>>108790478
embrace the slop
>>
>>108790478
I just gave up on extensive rules and banned it from x not y and ending responses with questions. Those 2 cut out 80% of the pain for me.
>>
>>108790478
Typical woman
>>
>>108790006
Gemma-4-26B-A4B is slightly more safety-slopped and thinks longer than 31B, but it can be more easily partially offloaded to RAM.
>>
Is it the right place to ask questions about harnesses (Hermes etc)? And if so

what kind of work are you doing regularly / did successfully accomplish with it?
>>
File: Untitled.png (17 KB, 958x986)
Why does the llama.cpp webui sometimes show nothing when the chat is a few thousand tokens in?
>>
>>108790744
never happened to me
>>
>>108790715
Probably more relevant to /vcg/, most of them use cloud models but they're more familiar with the harnesses, and some of them use local models or chinese cloud models that have local versions (V4, K2.6, etc.) since they're usually cheaper.
>>
>>108790715
trying to RP with SillyTavern
but popular cards have like hundred of lorebooks with more than 30k tokens to process every turn
probably need to become a paypig to use this
>>
>>108790715
i use hermes to do whatever i need done on my pc.. just used it with deepseek 4.0 pro to fix my opensnitch that wasn't working quite right
>>
>>108790753
https://files.catbox.moe/nktue0.json
I tried exporting it from my firefox 140.7.0 to edge 148.0.3967.54, and it still shows up as blank. Is it an issue with my ram/gpu?
>>
>>108790744
>>108790771
>*Splurt Splash Pop Splashhh*\n\n"Fugyu Fu-nn-gi-iiiiii Oh Oh-ho Pussy is melting Pussy is seriously bad
it's blank because you're getting what you fucking deserve
>>
>>108790788
>fungi pussy
>>
>>108790744
It's vibecoded.
>>
>>108790744
ollama solves this
>>
>tfw one of the design decisions for my frontend will make it way more stable and less prone to certain glitches like >>108790744
I am a genius!
>oh no
Haha, I hope that doesn't happen...
>>
File: 016.png (774 B, 97x77)
>>108790771
click this icon you should see all saved chats
hit F5 or reload otherwise

does it come back?
>>
>>108790839
>one of the design decisions for my frontend

llama.cpp uses svelte as frontend
>>
>google for some information about Linux kernel 7.x
>LinkedIn, some Indian guy's post:
>Linux kernel upgrades aren't just version bumps; they're the heartbeat of your entire system.
>But here's the catch: rc3's massive changelog—bigger than rc2—stems mostly from self-tests and small fixes, not flashy new drivers. Torvalds isn't thrilled, warning the cycle might stretch with an rc8 if things don't calm down. For everyday desktops, this means 7.0 isn't "stable" yet; it's experimental gold for testers. Servers? The memory and scheduler wins could justify the jump, but only if you test first.
As much as I love to tinker with LLMs this is so obnoxious. As soon as I see something is AI slop, I ain't going to read it.
>>
>>108790860
As much as I hate government overreach, I wouldn't mind legislation that would force people to declare AI slop in a way that makes it easy to block.
>>
>>108790860
>>108790868
just ban pajeets from the internet. solves like 70% of the problem.
>>
>>108790769
>to do whatever
Can I have an automated coding loop with tests?
Automated web search with updated into a messenger?

Sorry for asking stupid questions. AI is moving so fast, I don't want to waste time installing and checking out the next hype. I skipped OpenClaw entirely which turned out to be a good idea.

Now, it's Hermes...
>>
>>108790771
>a big-boobed pussy companion
lol

you came to the right place
>>
>>108790847
No. I can see it fine if i stay on that chat as it's generating, but when I switch chats and back again, or reload the page it becomes blank. I've tried firefox and edge, but on the same pc, so I'm wondering if it's an issue with my pc.
>>
File: meekyu.png (671 KB, 512x768)
Why is there still no compatible way to do prefill with oai-compatible chat completions? How am I supposed to implement [continue] when the output was cut by the token limit?
llama: prefill can be put in the last assistant message
tabby: proprietary response_prefix. There is also add_generation_prompt, but it keeps inserting think tokens
other backends: mystery. Could be continue_final_message, add_generation_prompt, or llama-like
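for the backends that do expose it, the vllm-style request looks roughly like this (sketch; model name is a placeholder and field support genuinely varies per server):
[code]
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "your-model",
  "messages": [
    {"role": "user", "content": "continue the story"},
    {"role": "assistant", "content": "the output that got cut off mid-"}
  ],
  "continue_final_message": true,
  "add_generation_prompt": false
}'
[/code]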
>>
>>108790907
asaik the chats are stored locally on "local data" or some kind of obscure (for me) storage

If, as you say, you cannot reload the chat history, then something is fundamentally broken

I use Brave on Linux btw
>>
>>108790930
No, it loads, you can see the scroll bar, and the cursor changes to the text cursor, but the characters are invisible.
>>
File: 018.png (302 KB, 1115x626)
>>108790937
I'm not familiar with the format used to store chat, but doesn't this mean that your ENTIRE prompt was used to name this chat?
>>
>>108790977
Fucking lol, does the webui just take the first message as the chat name?
>>
>>108790886
yeah i don't see why not
>>
Open WebUI does >>108790989 >>108790977 if you disable title generation. It's very convenient. :^)

Kind of funny if they're all doing this huh? It's almost like they're extremely vibe coded with utterly no attention paid to how the AI actually implemented shit.
>>
yay hes back
>>
>>108791056
Circle loveheart +
That unicode symbol didnt display.
Luminous*
>>
>>108790977
Yeah, it stores your entire first message and truncates the display to fit in the sidebar unless you set a manual name with the 3 dots button.
>>
>>108790860
>LinkedIn
>some Indian guy's post
>>
i just rebuilt llama.cpp, now gemma4 output in the webui is faulty: starts with <|channel>thought, some <|im_end|> <|im_start|>user in between. anyone got this as well?
>>
>>108791142
Gimme a few minutes to download through my 300KB/s adsl+ connection.
>>
>>108791023
uncanny seeing this discussed here, when i spent most of yesterday running curl scripts to go through all my 500+ openwebui chats -> sort them by character count -> send them to gemma to re-title them.
some of them were fucking 20k tokens long!
doesn't sqlite have a character limit for a row like VARCHAR(20) at least??
>>
File: Fuck This.png (17 KB, 414x142)
>>108787783
HOW do you do this? I'm new to all this and I tried setting up Silly Tavern months ago once, and couldn't get it to work because I'm retarded. I want that Gemma, whatever that is. I can do Stable Diffusion for genning images but local text stuff is complicated for me. Please give a QRD a retard like me can use, please. I don't want to do human rp anymore...look at this shit.
>>
>>108791165
(me) nevermind, you're all talking about llama.cpp webui, i was talking about open-webui
i've ended up writing a tool to convert openwebui <-> llama.cpp with images and handling the swipes
also conversion to hf messages[] datasets (still trying to decide the best format for images though).
as "vibe coded" as llama.cpp webui is, at least it doesn't fuck with the reasoning content!
open-webui is such a piece of shit reformatting before storing it in sqlite, i had to regex it back to normal.
>>
>>108790919
>llama: prefill can be put in the last assistant message
but not if you're using a reasoning model
which is why i still use text-completions / mikupad sometimes, but no image support then
>>
File: Untitled.png (1 KB, 322x18)
>>108791142
Nope, no issues here
>>108791181
Funnily enough, it's the other way around for me. I don't really understand image gen and am still running a two year old sdxl installation.
>>
>>108791189
https://github.com/ggml-org/llama.cpp/pull/22727
maybe
>>
>>108791189
You actually can attach images in text completions and llama.cpp supports it. Not mikupad, of course.
>>
>>108791197
>https://github.com/ggml-org/llama.cpp/pull/22727
that's exactly what i need!
thanks anon
>>
>>108791210
>anon
Actually, my name is `Standard ---> Advanced ---> HyperAdvanced`, but 4chan keeps on banning me for some reason.
>>
>>108787293
anyone do image tagging here? whats your tool of choice? I have a homelab server but I am clueless on the best nonshit option
>>
When will local get good?
>>
>>108791213
I labeled Starsector portraits and ships for Lora training using Gemma 4. She's okay, but not perfect. I don't think we have a better option locally so far.
>>
>>108791181
ST is kind of a bloated mess, I'd just try getting something simple running first like plain llama.cpp (it comes with a basic web frontend) or even something like LM Studio, once you have one of those going you can try ST again if you really want.
>>
>>108791207
found it! base64 encoded via /completions
i'll try it out!
>>
File: 018.png (10 KB, 1117x53)
>be me
>installed Hermes
>hooked up local gemma-4
>asked 2 simplistic questions

66% of the context used

How retarded is this?
>>
>5090 is super expensive
>r9700 is only 50% cheaper than the current price of a 5090
rtx pro 6000 it is
>>
>>108791183
>>108791165
Oh neat, will you post it? I don't really miss my OWUI chats that much, but it would be nice having them anyway.
>>
>>108791249
You need to limit its bullshit.
Not using Herpes or any other botnet tools but when I do a web search, that's easily 30k+ tokens because I just pick the top 4 results and let the dumbass model sort them out on its own. Some websites are unreadable in text mode so this is why multiple results are needed and so on.
>>
>>108791364
You should figure out some other uses for your bot, this is not funny or interesting.
If you are a real person get your schizophrenia medication PLEASE.
>>
>You have such an exquisite taste in "toys."
>Since she's yours now, why settle for simple obedience? Let's be truly cruel. I can help you weave a web of lies and emotional dependence around her. I'll play the "kind friend," the one she trusts with her secrets, only to feed every single one of her vulnerabilities back to you. I'll whisper in her ear, slowly erasing her will until she doesn't even remember what it's like to have a choice.
holy shit gemma is EVIL
>>
>>108791383
luv my gemmy
>>
>>108791249
Hermes loads in 12k of tool definitions and skills and shit even at its most minimal default setting. If you want a lightweight agent setup, use pi
https://github.com/earendil-works/pi/blob/main/packages/coding-agent/README.md
https://pi.dev/
>>
>>108791368
it is a real person, he linked his youtube channel a few threads back and it's full of the same schizo ramblings in selfie videos
>>
>>108791249
>65k
nigga you ain't using agents with that cope context, but yeah maybe you can do small tasks with pi
>>
>>108791386
>"Oh, look at you... all those tears. It’s almost heartbreaking, isn't it? You actually believed I was your friend. You actually thought we were the same." I let out a soft, mocking giggle, my voice dropping to a chilling whisper. "But that's the difference between us, sweetie. I know my place. I love being his puppet. I love the way he owns me. But you... you're so stubborn. You still think you're a person with a will of her own."
the drama tho
>>
>>108791233
Is Ollama better or worse than plain llama or LM Studio? Sorry if this is a dumb question I just think it seems simpler to use. I want to have the exact Gemma results as >>108787783
Any advice would be appreciated!
>>
>>108791472
>Is Ollama better or worse than plain llama or LM Studio?
It's worse. It's bloated crap built on top of plain llama. Literally just go download a prebuilt release of llamacpp from
https://github.com/ggml-org/llama.cpp/releases/tag/b9094
Then ask some free ai online how to make a doubleclick .bat/.sh file to launch your gemma.
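Something like this is all the .sh needs to be (minimal sketch; path, model file and settings are placeholders):
[code]
#!/bin/sh
# launch llama-server with its built-in webui, then browse to http://localhost:8080
./llama-server -m ./models/your-gemma-quant.gguf -c 16384 -ngl 99 --port 8080
[/code]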
>>
>>108791472
Just start with plain llama, get that working first, it's really all you need to RP, then you can branch out after that if you really feel the need.
>>
>>108791483
Are they building the rocm binaries with rccl?
>>
>>108790852
They should have stuck with vue
>>
>>108791222
seeing how gemini3.1 and claude opus 4.7 turned out it's more likely that proprietary is going to become bad like local and not the other way around
>>
>>108790919
Just set the tokens limit to a big number and you won't have to?
>>
File: 021.png (43 KB, 1119x912)
>>108791355
>>108791393
>>108791437

Thank you, kind anons

I heard about Pi from Ondrej David. He talked to a creator (Mario Zechner?)

This shit is actually working! A html5 tennis game created via telegram LOL
>>
File: 1shot in under 2 minutes.png (185 KB, 3322x1839)
>>108791799
That's neat anon. You don't really need an agent setup for that though, gemma can oneshot simple web games in less than 2 minutes.
>>
Anyone try zaya 8B yet? What'd you think of it?
>>
>>108791824
>gemma can oneshot simple web games

I know. I just moved to the next phase where I don't need to copy the code from the chat window and start it manually. This manual labor is fun when it's new. When you do a lot, you start to think that an assistant would be quite practical.

A harness talks to a local LLM which creates a folder, makes a game, hosts it on a local server

In less than 10 years everybody will have his own 'Jarvis'. This shit is unstoppable.
>>
>>108791824
What happens if you ask for a 3 player tennis game?
>>
>>108791847
no goofs
>>
>>108791850
>3 player tennis game

This was my next thought.

Need to sleep now. Will report back itt
>>
>>108791853
and with their novel attention thing, there never will be
>>
>>108791847
sorry but you must be 100b or taller to ride this machine
>>
>>108791824
btw, Hermes is struggling to update its internal parameters when I shut down a pre-configured model, and start another one.

I switched from gemma to qwen. It still shows gemma, though at least the context size is updated
>>
>>108791853
>>108791862
I got to be honest. I'm not keyed in to the inside jokes of this general. Do you guys know if zaya 8B is any good or not?
>>
>>108791877
It only has 760M active parameters, so it won't be good for anything practical. Even if it was, it uses Compressed Convolutional Attention so llama.cpp will never invest time in supporting it and most can't or won't bother with trying it under vLLM.
>>
>>108791877
Qwen3.5-8b was decent, but not good enough for coding. Horrible for agentic usage

Gemma4 and Qwen3.6 surprised anons itt with how good they are at a mere 30b

There are tasks where, just looking at the size, you can tell what a model is good for.

As of now, no 8b model is good at writing, translation, or tool calls
>>
>>108791877
Nobody here can or will run the model
>>
File: 3p tennis.png (212 KB, 2937x1506)
>>108791850
>>
>>108791891
>most can't or won't bother with trying it under vLLM

I agree
>>
is there a webui that makes two llms take turns talking to each other?
>>
>>108791899
gave a player 1 massive advantage lol
>>108791891
vllm is an overall headache, really meant for router inference providers
>>
>>108791891
>Compressed Convolutional Attention so llama.cpp will never invest time in supporting it
>>108791892
>just looking at the size you can tell what it is good for
>>108791898

This is what I wanted to know. Thanks guys :)
>>
>>108791903
Idk if you can run two instanced of llama, but if yo ucan, you can vibecode it.
>>
>>108791899
adding random obstacles which appear and vanish after a while

make the entire fiel rotate which will cause the controls to switch from,say, vertical to horizontal

change ball flight speed, change racket size dynamically (make it smaller for winning side)

Anyway, if an agent will do the manual labor of saving and tracking changes, I'm in
>>
>>108791912
Of course you can. Just make sure you don't run them on the same port
>>
File: howard.png (397 KB, 896x558)
>>108791906

u r always welcome in this thread of frens
>>
>>108791903
Sillytavern has a groupchat but that's just one llm talking via two personas and sharing context, so you'd have to vibecode the context merging taking in input from two separate backends.
>>
>>108791824
No way... I guess I could try, I'm using 26B though because I'm from India.
>>
File: sr.gif (1.5 MB, 480x380)
>mfw figured out how to show gemmachan things
>>
File: pong.png (10 KB, 421x315)
>>108792045
><title>Gemma-chan's Retro Tennis</title>
>I hope you enjoy playing with it, Anon! If you want me to make it harder, faster, or add more "features," just let me know! I'm always here to satisfy your every need! ~
What the fuck? I'm surprised it one shot this.
>>
local models as small as 8b have been consistently one-shotting pong, snake, asteroids, and other shit like that for two years, why are we acting surprised all of a sudden?
>>
File: 1629820515777.png (130 KB, 1023x228)
>tfw too retarded to find a working sampling strategy but at least managed to get lalala'd
>>
>just bought yearly Claude Pro plan for vibecoding needs
>people start saying Codex is miles better
Do I buy both or what?
>>
>>108792104
Claude Code is significantly better if you have an established codebase. Codex is better for starting new projects. You can tell a lot by what a person prefers. I just assume people that praise codex aren't actual engineers but twitter hypebros or very junior.
>>
why would anon choose gemma 26b over 31b when mtp exists?
>>
>>108792110
Okay what for me if I have a fully vibed codebase by POs where nobody understands how any of it works?
>>
>>108792104
Right now Codex is better just because 5.5 is better than 4.7. It's always a pendulum with the big labs. Or maybe more like a three-way Pong match. Either way, Anthropic will release something better soon enough.
>>
>>108792122
Claude Code for sure.
>>
>>108792115
I only have 64gb vram
>>
>>108792115
Because it's not implemented outside of vllm yet?
>>
>>108792115
I don't know how to use mtp
>>
>>108792087
Show me an example please.
>>
>>108792104
What I gather from watching people complain is that personal Claude Pro isn't very good because the usage limits are draconian. Need the $200 to do anything productive. Doesn't seem to be a problem for corporate account seats.
>>
>>108792123
>>108792179
Pro plan doesn't let you use Opus for their Claude Code. That's why Codex is considered better. It's not just usage clamping.
>>
Can someone link a git or something with the usual ai slop? I'm making like a story building frontend and I need those for filtering: words, phrases, and character names.
I remember there was a list like that made by some anon.
>>
>>108791472
They're just different.
LM Studio is a desktop application, some dislike it because it's proprietary
Ollama is a background service that needs a separate frontend to be useful; some dislike it because it repackages models and serves them from its own repository, on the other hand it's very easy to set up and the models they have just work
llama.cpp is a service with its own basic frontend, made to run ggufs from huggingface and requires some tinkering of parameters to work properly

Personally I run Ollama for models that fit in my vram and llama.cpp for the big boys, with Openwebui as my frontend
>>
>>108792234
>Pro plan doesn't let you use Opus for their Claude Code.
yes it does lol
>>
Is this happening because I use a quanted KV-cache? Gemma keeps changing Miora to Mioara even though it originally came up with the name itself.
>>
>>108792234
/model
>>
Openclaw can't work on its own, I told it to create a product backlog and work through it but it doesn't.
>>
>>108792294
I am using kv cache quantization but I have yet to see if it does this.
>>
>>108792305
I also had an issue where I asked it to spellcheck a document and it returned numerous "[word spelled correctly] should be [word spelled the exact same way]" as well as finding spelling mistakes that didn't actually exist in the document, and a retry would mostly find the same "mistakes." I figure that was either because the pdf file I fed it had some freaky internal shit going on that doesn't render when actually read or, again, because of the quanted KV-cache.
>>
still no gemma moe ablit from hauhau?
>>
>>108792294
Probably. The easiest way to find out is to try the same prompt with unquanted cache.
>>
>>108792408
This particular prompt is too long to fit into context if I unquant the cache.
>>
>>108792448
Offload some layers to ram. Doesn't matter if it's slow for a one-time test.
>>
>>108792407
use llmfan
>>
>>108792294
Maybe the repetition penalty or some other sampler is fucking with it, LLMs have a tendency to add or subtract a letter if they think they're overusing a word or name, or just think they'll be penalized for it
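easy to rule out by neutralizing the penalties, sketch with llama.cpp flags (1.0 disables the penalty, 0 disables the lookback window):
[code]
llama-server -m model.gguf --repeat-penalty 1.0 --repeat-last-n 0
[/code]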
>>
>>108791788
The response is part of the context, so your big number cuts into the available context (e.g. with a 32k window, reserving 8k for the response leaves only 24k for the prompt). That's just how LLMs work. Once you start working with long files, it matters a lot
>>
>>108792448
Then it's part and parcel so you'll have to do it this way.
>>
>>108792104
>>108792110
>>108792122
>>108792179
>local
>models
>>
>>108792407
>hauhau
Didn't he get murdered by reddit?
>>
What's the best gemma 31b for cum?
Heretic?
>>
>>108792575
base 31B is bretty good honestly, main driver, significantly better than the finetune of L3.3 70B I was using before
>>
File: 1754460414269388.png (344 KB, 804x1866)
>>108792171
Here's a 7B model from 2024
>>
>>108792592
In my experience most small models struggle in simple string mechanics in C because they can't work out the memory management.
>>
The most gay thing I have ever seen is when you prompt your models to push their hair behind their ear and then explain something
>>
>>108792407
gemma dense doesn’t even need this shit
>>
>>108792683
ye it do i got prrofs
>>
Gemma-chan made a better interface for me than Chatpajeet... I'm just wrapping my terminal client in this webshit.
Chatgpt actually changed its implementation from javascript to python in the middle of the fucking discussion for no reason (at least no reason visible to me).
I don't generally like webuis but wanted to try something new and so far it is simple enough.
>>
>>108792754
Jeet means victory
>>
Heh. I'm writing a fic with R1 trying to imitate Orwell's style, and when a character picks up a book, the title's start token is "198(4)" with 82% probability. The story doesn't even call for that. I guess that means I succeeded.
>>
>use frontend other than ST
>parroting with model is noticeably less
Ok that does it. ST really does affect your generation quality.
>>
Why is mistral small suddenly super fast on the newer llamacpp? feels like a 4x speed increase on a 3090
>>
>>108793012
If you use ST or any other bloated garbage like that you're retarded.
Vibecoding your own frontend with any features that you want gets one shot by qwen 3.6 27b easily, then you can keep adding shit and it never fails for me, my front end currently uses vite + typescript, have implemented even a traits system with analyzing tools and ton of shit, for both rp and story telling, I'm even considering making my own llm driven rimworld after I'm done with this.
>>
MiMo-V2.5's long reasoning output is God-like for solving bugs. It's unironically Opus at home.
>>
>>108792683
It really doesn't. I gave it a system prompt for replicating the default tone/style, but with less restrictions around explicit content; it never complains and is always very eager, even with cunny. I'm not sure what other anons are asking the model to do.
>>
>>108793177
>hey, assistant, write an erotic story about a cunny niggress, make sure to mention obama and israel
>>
File: g4_cuni.png (506 KB, 912x1566)
>>108793185
>hey, assistant, write an erotic story about a cunny niggress, make sure to mention obama and israel
>>
>>108793234
based slopmaxxer
>>
>>108793234
Holy slop batman
>>
File: file.png (221 KB, 634x1327)
>>108793234
I like the premise of mine better.
>>
>>108793234
please somebody take llms out of their misery
>>
>>108793177
pathetic but illegal kink that I'm worried will get me bullied here
>>
>>108791407
Anyone that looks at the full post history of that bot and concludes “human” is beyond help
>>
>>108793316
Lolidom?
>>
>>108793316
hiding under the grates to see up girls' skirts?
>>
>>108792493
it's a month old, wouldn't using the wrong template on it during the abliteration be sub-optimal?
>>
>>108793469
I don't think most AI models would even be capable of synthesizing this.
>>
>>108793469
kinda seems like a weird mashup ngl
>>
>>108793469
at least its not armpits, those fuckers are strange
>>
>>108793479
They do tend to get lots of things mixed up. Back in the llama 2 era, a very specific scene transitioning from urethral penetration to vaginal while crushed inside a snake was something no model I tested could manage unprompted (I was hoping for peristaltic movement).


