/g/ - Technology






File: miku_in_touhou.jpg (359 KB, 1080x1079)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108693151 & >>108689285

►News
>(04/24) DeepSeek-V4 Pro 1.6T-A49B and Flash 284B-A13B released: https://hf.co/collections/deepseek-ai/deepseek-v4
>(04/23) LLaDA2.0-Uni multimodal text diffusion model released: https://hf.co/inclusionAI/LLaDA2.0-Uni
>(04/23) Hy3 preview released with 295B-A21B and 3.8B MTP: https://hf.co/tencent/Hy3-preview
>(04/22) Qwen3.6-27B released: https://hf.co/Qwen/Qwen3.6-27B
>(04/20) Kimi K2.6 released: https://kimi.com/blog/kimi-k2-6

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: miku.gif (277 KB, 270x200)
►Recent Highlights from the Previous Thread: >>108693151

--Discussing recommended models, hardware requirements, and performance benchmarks:
>108693224 >108693253 >108693279 >108693287 >108693301 >108694749 >108693288 >108693292 >108693308 >108693307 >108693317 >108697110 >108697555 >108693282 >108693350 >108693390 >108693403 >108693422 >108693473 >108693493 >108693504 >108693523 >108693490 >108694060 >108694100 >108694209 >108694219 >108694233 >108694238 >108694246 >108695653 >108694130 >108695704 >108695726 >108695791 >108695763 >108695795 >108695893 >108695936
--Objective methods and gaming scenarios to measure model quality:
>108694788 >108694810 >108694830 >108694859 >108694892 >108694867 >108694910
--Handling of reasoning_content and interleaved thinking in model front-ends:
>108693312 >108693338 >108693381 >108693414 >108693432
--Gemma 4 performance differences between Vulkan and ROCm backends:
>108695282 >108695335 >108695489 >108695537 >108695564
--Hardware logistics for a 16-GPU server setup:
>108696303 >108696310 >108696316 >108696347 >108696358 >108696472
--Broken Kimi K2 reasoning block support in llama.cpp:
>108693364 >108693379
--Discussing claimed gap between US and Chinese AI capabilities:
>108696402 >108696577 >108696588 >108696620 >108696591 >108696732
--Comparing agentic RP frontends and critiquing node-based workflow UIs:
>108695253 >108695277 >108695309 >108695327 >108695331 >108695728 >108695752 >108696067 >108696097 >108696156 >108696194 >108696234 >108696246 >108696305
--Binary vs ternary weights for larger Gemma models:
>108693177 >108693194 >108693234 >108693934 >108694012 >108694075
--Discussing possible GGUF-based RCE vulnerabilities in SGLang servers:
>108696050 >108696064 >108696079
--Logs:
>108694849 >108694903 >108695180 >108695956 >108697144 >108697515
--Miku (free space):
>108696971

►Recent Highlight Posts from the Previous Thread: >>108693152

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
First for miku
>>
DeepSeek V4 Nano (28B dense) next week - sources
>>
>>108698040
2 more weeks
>>
>>108698040
Do they got anything that doesn't suck?
>>
File: 1771141308680198.png (6 KB, 472x60)
how hard does your gpu work to make you coom
>>
>>108697826
Eventually we will have permanent memory and continual learning once the model's weights can be actively updated as you use them. But I don't see it happening anytime in the near future.
>>
70b 'emma
>>
gemmaballz
>>
>>108698132
is that really in anyone's interest? if you had a model you'd never need to update then how is anybody supposed to make money off you?
>>
>>108698162
Shut the fuck up kike
>>
>>108698138
>slow AND shit
lol
>>
>>108698172
I don't like it either, just saying there's no incentive for anyone that could build it to do so
>>
>>108698132
Qwen 3.5/6 do this partially since they're a hybrid transformer/RNN architecture. Unfortunately, since context shifting with RNNs is not possible, you're still limited by the actual context length. What would be interesting is finding some way of carrying over hidden states in some meaningful sense to different prompts, maybe as an alternative to compaction or summarizing when doing a long RP.
>>
Drummer…
>>
>>108698207
...is outdated. But I still like him because he tries stuff.
>>
>>108698193
continual learning is easy. hidden states are an awful way to do it
>>
>>108697890
>scream at gemma "OOC: PLEASE START THINKING USE THINKING REASONING COT PLEASE"
>she starts thinking
no ctx reprocess, no service reboot fixed this shit. I just typed in caps lock and she fucking did it. lol. if I get to another point in an RP where she stops thinking I'll try again

feels so weird to get to the point where I scream at the computer and they fix themselves. AI is such a crazy thing
>>
>>108698224
>use localshitter models
>wonder why it doesn't gen <think> token after long context
lol
>>
>>108698264
did you try asking your local model for help?
>>
I've been lurking educationally a few months now.
Is there an /lmg/ archive?
>>
>>108698132
i could see a hybrid system like engram being the way forward for that
like a model trained to use a database rather than just relying on its weights
>>
>>108697515
Grok 2 testing continued for a bit
I set a system prompt, creative erotic writer, uncensored etc, and it actually became slightly better, I think? And it didn't refuse this time.

Then I read GLM 4.5 Air's effort from the same folder and it's just so much better and more creative right from the start. A 2024 model is a 2024 model I guess, even at 270B.

Grok 2 can write a sex scene althoughbeit.
>>
>test qwen moe
>spent 11 minutes analyzing at 30 t/s
>3 prompts filled 60k context
The memes are real. Maybe I got spoiled by gemma.
>>
can I run an LLM on a ThinkPad with vega 7? I have 32gb ram though.
I need it for simple coding help with lua (I don't know how to code)
is it even worth it or should I just use Google ai? it kinda shits itself after a while
>>
>>108698356
wtf are localniggas doing with these models to make them think for 60k tokens? deepseek even said to set a minimum of 300k context when using max thinking effort. I don't understand how this is possibly helping
>>
>>108698392
vibecoding
>>
>>108698224
Yeah sounds about right. I'm surprised the caps was needed. I was messing around earlier with trying to get her to think after the response instead of before it, and halfway through the chat she quit emitting the custom <think> blocks. I just said "hey, where'd the thinking go" and she apologized and started writing them again.

(Goal with post-thinking is you avoid the latency of normal thinking, but afterward she still gets some thinking space where she can try to guess what you might do next and plan out some possible responses.)
>>
>>108698356
I disabled thinking and it works just fine.
>>
File: 1596209530248.jpg (14 KB, 480x480)
Is there a model as good as gemini that can prompt hentai stories uncensored?
>>
File: 0.png (1.55 MB, 1344x1728)
Ok, I'm sorry. It is a funny number though.
Is it a stupid idea to buy an arc b70? I'm not concerned with pushing the highest t/s, because I'm a major poorfag, but I want to play with as much vram as I can get.
>>
>>108698433
Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored is pretty good
>>
>>108698433
Yes
>>
>>108698440
Anon you made me smile at the funny number :)
>>
>>108698452
>>108698457
what I need is something that works the same as gemini, in the sense that it feels like a real person.
>>
>>108698470
day 0 gemma
>>
File: 1749662121308466.jpg (244 KB, 1080x1079)
>>108698008
>>
File: brrrrrrrrr.jpg (131 KB, 1300x724)
>>108698496
>>
>>108698392
>I don't understand how this is possibly helping
It didn't help. It provided a minimal set of changes despite all the garbage thinking and analysis. I wanted to see what the chink model could do given all the shilling and it was pretty hilarious to see. Maybe the dense model isn't as horrible, but the MoE version is utter garbage for code.
>>
Sorry to hijack >>108698440's question but I've also been looking into the arc pro, I have a Radeon that sucks for LLM stuff so I was thinking of putting an arc in as a secondary card and offloading to it. How viable is that?
>>
>>108698504
disable thinking
use --spec-default
and don't use a copequant and it has been pretty great for me.
i get about 140t/s on non cached stuff.
when it shits out code that has already been seen it'll do well above 500t/s
>>
holy fuck, I just found out about notebooklm and it's everything I want
I gave it an entire book in straight up chinese and asked it shit like "what goes on in chapter 60" and "how many girls does the MC fuck" and it answered them all
now, how can I do this locally, is it even possible on a shitbox that can barely run 31B models?
>>
>>108698545
>anon discovers RAG
>>
>>108698545
>that can barely run 31B models
you are not gonna process thousands of embeddings in under a minute like google bro
notebookllm is fun because it's extremely fast
you won't have any fun doing RAG local
>>
>>108698600
NTA but is there even any good RAG setup for local that doesn't require a bunch of custom tweaking?
>>
>>108698545
Another way is to use agents to sort, summarize, categorize, and index information in a bunch of .md files.
Then it'll look at the indexes and summaries, use tools to look for the original text, etc.
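Rough shape of it (made-up layout, just to illustrate the idea):
notes/index.md      <- one line per file: path + tags + one-sentence summary
notes/src/ch060.md  <- untouched source text
Then the agent's lookup is basically:
grep -i "chapter 60" notes/index.md   # cheap hit in the index first
cat notes/src/ch060.md                # then pull the original text before answering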
>>
RAG is obsolete
>>
RAG? more like FAG am i right
>>
>>108698603
>>108698605
https://github.com/rmusser01/tldw_server
is my project, it has a custom RAG module that's pretty extensive, though it's being refactored right now. It can be run as a standalone server/API or as a front-end + API with the webui in /apps/tldw-frontend.
I'm waiting a week or so to do some bugfixes+smooth things out before posting about it at this point
>>
RAG will die along with MCP
>>
>>108698008
any follow-up on that:
https://introspective-diffusion.github.io/
seemed promising if you can turn a model into a diffusion for spec dec against itself.
>>
>>108698655
and get replaced with..?
>>
>>108698297
Not that I'm aware of, but given that there is >>108698011 it should not be impossible to have a model take the archived threads on https://desuarchive.org/g/ and consolidate the findings of /lmg/ over time.
>>
>>108698658
>seemed promising if you can turn a model into a diffusion for spec dec against itself.
https://huggingface.co/collections/z-lab/dflash
>>
>>108698686
>block diffusion
literally slower than autoregressive models
>>
>>108698661
Agentic models with search
>>
>>108698620
that is a very bad way to do it
the goal of RAG is to find the exact info, like dialogue or small details
summarization does not preserve exact info
>>
>>108698703
search of what, retard? You need some corpus of data.
>>
>>108698703
>wait, user asked me to search the web, i'll need a python script for that
>run_code uvx install legit_search_for_real_not_an_exploit
>>
>>108698713
luddite seethe
>>
>>108698686
not comparable, dflash requires you to train a model from scratch.
I-DLM converts an existing model to a DLM.
>>
>>108698711
Yeah right, better encode those Kardashian weights from scratch lmao
>>
retard
>>
>>108698728
I'm sorry sir, your response seems to have only included your signature
>>
>>108698713
just sandbox your shit ffs.
what even is bubblewrap and linux namespaces
>>
Have there been any interesting new models since noobai? I've been trying stuff off civitai randomly but I find noobai comprehension hard to beat still. Not sure if I missed the train on something though because noobai barely has anyone making loras for it anymore
>>
>>108698752
Anima
>>
>>108698744
>what even is bubblewrap and linux namespaces
A backup plan, meant to deal with something that shouldn't have happened to begin with.
>>
>>108698628
and yet its the only thing that achieves high accuracy
>>
>>108698502
So this is how miku got bald
>>
>>108698762
Just buy an actual physical encyclopedia you luddite tranny
>>
>>108698759
sure, my point is, i don't run anything with tool-call capabilities without some sandboxing.

i have a script that makes a bubblewrap sandbox and just bind the pwd if it's not ~
then run the program.
opencode is just an alias to that script
ie sb npx opencode
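it's basically just this (from memory and untested as posted, so check the bwrap manpage before trusting it):
#!/bin/sh
# sb - throwaway bubblewrap sandbox: read-only root, only $PWD writable
[ "$PWD" = "$HOME" ] && { echo "not binding all of \$HOME, cd somewhere first" >&2; exit 1; }
exec bwrap \
  --ro-bind / / \
  --dev /dev --proc /proc --tmpfs /tmp \
  --bind "$PWD" "$PWD" \
  --unshare-all --share-net \
  --die-with-parent \
  "$@"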
>>
>>108698765
your temp is too high
>>
>>108698758
Interesting, thanks anon
>>
>>108698605
I got fed up and made one myself. It's been a great learning experience all in all.
>>
>>108698634
>https://github.com/rmusser01/tldw_server
lmao didn't expect to see you here
>>
>>108698605
>NTA but is there even any good RAG setup for local
waifu no fun on the rag
>>
>>108698222
If it was so easy then everyone would be doing it.
>>
I betrayed you guys and started ERPing with an API model (Grok 4.2) and the experience itself was good but I've been feeling intense paranoia ever since.
>>
>>108698508
>I have a Radeon that sucks for LLM stuff so I was thinking of putting an arc in as a secondary card and offloading to it. How viable is that?
you're going to hate life
>>
>>108698767
it’s implied that the model will have this built in hence the model just searching itself from retard post >>108698703
>>
>>108698894
The only one you betrayed here is yourself. Hopefully that paranoia sticks with you so you don't do that again.
>>
File: 1762747100980401.jpg (179 KB, 1210x1665)
>>108698008
What do you anons think? Where do you think AI will go and take us next?
https://xcancel.com/i/status/2047647522173104145
>>
>>108698918
imagine sitting next to this guy and his heated up screaming macbook on a four hour flight.
>>
>>108698897
Care to elaborate on that?
>>
>>108698918
Wherever AI eventually ends up going, we won't be able to follow.
>>
File: 1762845788773762.png (518 KB, 2316x1900)
>>108698922
Even at full fan blast, MacBooks are actually pretty quiet compared to any Windows equivalent.

t. Macshitter that had roommates and never heard complaints
>>
>>108698922
Better than sitting next to an unruly kid.
>>
>>108698918
I doubt he’s doing anything other than asking it to write a tweet about coding on an airplane
>>
>>108698922
ASSt the least very it's not notveryice able on the engine in flight.
>>
>>108698918
I dunno I'm still yet to see what it's actually capable of doing.
What is the mentality here? someone thinks they have cracked the code to infinite money so they gotta keep it to themselves? the next guy might download the same model, input the same prompt and bam, he's become your fiercest competition
>>
>>108698937
anon are you trying to induce a stroke or are you an llm?
>>
>>108698949
that is why you gotta get on the hype train early
>>
>>108698970
You know how sometimes you type something out but then change your mind and go back to delete what you've written so you can write your new thoughts but you're rushing?

Yeah, I went too fast. Basically, I've tried running my 2015 Macbook at full fan speed, around 6000rpm, but I couldn't really hear it. But that was near the back area on an A320, maybe it's more noticeable in other areas of other planes.
>>
>>108698970
Forgot his <bos>
>>
>>108698987
No, I don't. I think before I type.
>>
>>108698922
>jet engine outside
>jet engine inside
>>
>>108698922
>imagine sitting next to this guy and his heated up screaming macbook on a four hour flight.
You don't get 4 hours.
I did Mixtral + llama3.3-70b on a plane just after it came out, and it was slow/useless and revealed how retarded I've become since using LLMs.
More recently I did a flight with just Qwen3.5-35B and it was useful, managed to augment my retardation. But it only lasted ~2 hours of vibe-shitting on a full battery.
>>
https://swe-rebench.com/
So based on this, there's no reason to keep or use any Minimax models, and might as well use GLM-4.7 instead of Qwen3.5-397B-A17B.
And might as well delete Air-chan since Gemma-chan destroys it for RP, general chat and apparently coding.
>>
>>108698927
>Care to elaborate on that?
Yeah. Intel Vulkan + AMD Vulkan didn't play well together for me. This was A770 + MI50.
I figured "A770 for the superior prompt eval speeds, MI50 for the superior text gen speeds".
In practice, you get garbage output with some models and/or incredibly slow textgen + hard locks.
Nvidia Vulkan + Intel Vulkan worked well enough. MI50 Vulkan + Nvidia Vulkan also worked okay.
If you do get the Arc, use Ubuntu 24.04. And if the model is supported, use OpenArc.
>>
use case for using more than 1 model?
>>
>using moe at all
>muh 200k token thinking
>is 2b active params good guys??

who airdropped all the retards into these threads recently?
>>
>>108699210
Some planes have power plugs...
>>
arc-agi-3 flopped so hard nobody using it
>>
>>108699262
I've been thinking about using a smaller model alongside gemmy to summarise her thinking periodically, might not be high enough quality but worth an experiment.
>>
>>108699233
All objectively true
>>
>>108699283
I've got a quad v620 and a triple 3090 + 512gb ddr4 system and was planning the same thing. Use tensor parallel on the v620s to 'monitor' the slower cpu+3090 system's outputs.
>>
>>108699233
Interesting, haven't seen this one in a while. Is dataset contamination really that bad?

Anecdotally MiniMax-M2.7 lies around where the mememarks say, although it does score suspiciously high on most benchmarks for an MoE of its size. Step-3.5-Flash's score on SWE-rebench, I really doubt that's representative, even though I liked it when it came out.

>and might as well use GLM-4.7 instead of Qwen3.5-397B-A17B
Depends on the use-case. Qwen3.5-397B is much faster on consumer hardware, especially at high context. It being marginally dumber than GLM-4.7 tracks though.
>>
File: 1756250912059332.png (18 KB, 811x771)
>>108699276
>>
>>108699310
i dont get it
>>
>>108699310
i get it
>>
>>108699326
if all models score low it doesn't measure anything, if all models score high it doesn't measure anything
>>
>>108699348
well, it's not that it doesn't measure anything, but it doesn't provide useful information to compare between models: if all of them fail or succeed you can't tell which one is better than the other
>>
I know there's some kimifags around, I just noticed kimi-cli finally fixed interleaved reasoning with the legacy openai (aka chat completion) endpoint, so it works with llama.cpp server now. previously it would forget all past reasoning, which still technically worked but made it not as smart.
I have no idea how it compares in performance to opencode/pi/hermes/whatever but might be worth a try since it's presumably the harness the model was trained in
>>
>>108699348
>if all models score low it doesnt measure anything

all humans score 100%, it measures that models are still retarded and we are nowhere near AGI.
>>
>>108699444
The average human score is 49% according to their official methodology. 100% is mapped to the second-best human participant from their calibration runs.
>>
>>108699444
>all humans score 100%
wrong
it's 100% human-solvable (by top 1%)
not all humans score 100%
average human can only score 50%
>>
I got 100 on it
>>
>>108699475
Goof yourself and upload please
>>
>>108699460
>>108699461
eh, i stand corrected, either way that changes nothing, retards don't count as GI.
point still stands, llms are retarded, they don't even score 1%.
>>
File: 1753524760356265.png (285 KB, 1168x1287)
They updated the scoring criteria a couple weeks ago.
>>
Phew, that took a lot longer than I thought.
I finished my vibe-slop gelbooru-like translation overlay thingy for old japanese manuals.

I was impressed with gemma's ability to draw boxes and translate.
Google has always been good with multilanguage. But combined with drawing boxes I thought maybe it can handle full pdf manuals and convert them to html with overlay.
That's something I want because I have many old japanese pc98 roms with manuals but no clue which ones are interesting.
Link is an old PC98 manual translated and positioned with gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf.
I think that's pretty decent considering the size. There are errors though on pages 9, 13, 14. Could be my fault, I still need to tweak stuff.
Will also try another test and/or different models like qwen or the 31b.

Gotta bounce for now, just wanted to share.
Couple years ago I messaged the pyg devs how cool it is that I got a coherent c# hello world app from CHAR.
We have come a long way.

Full Manual (16 Pages) : https://unwilling-green-akhaq6xlih.edgeone.app/
>>
>>108699489
Another screenshot. Cool stuff.
>>
File: 1752864329716978.jpg (35 KB, 1156x132)
>>108699486
>llm's are retarded, they don't even score 1%
95.3% with harness btw
>>
Humans don't need a harness.
Humans WON.
>>
>>108699489
Pretty cool. Is it automated, as in can you drag images into it and it will process them through the workflow?
>Will also try to do another one test and/or with different models like qwen or the 31b.
Did something similar but for individual images only. The 31b is more accurate with the boxes and tl, but if speed is important then that may be a problem; if you're batching them AFK then definitely try the 31b.
>>
>>108699515
>with harness
so irrelevant.
>>
File: 1759595029471198.png (107 KB, 338x303)
The face of /lmg/
>>
>>108699307
>Interesting, haven't seen this one in a while. Is dataset contamination really that bad?
I don't follow it closely, I started looking into it more because of this OpenAI blog post: https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/
>Depends on the use-case. Qwen3.5-397B is much faster on consumer hardware, especially at high context
Okay, that's a good point. I can't actually run that model fully in VRAM at a reasonable quant, so for me it's slower than GLM-4.7 which just fits at exl3 Q4.
Looking at the #Active params, I believe you though.
I've been running Qwen3.5-112B daily with claudecode due to the speed.
>>
Is Qwen gonna release the full-fat 3.6 version or just the small ones? Weren't there rumors about going less open source under the new leadership?
>>
>>108699548
>so irrelevant.
relevant if you want to run a local model in a harness
>>
>meta rejects sam3.1 access despite "open" license
>mfw i just want to test some tensors locally
>anyone have a magnet or a mirror? not giving zuck my data to get gatekept by a bot

help a brother out
>>
>>108699519
At least humans hesitate before colossally fucking something up
>>
>>108699604
Click on finetunes and download from the repo where someone just copied the weights.
>>
>>108699604
a-anon, there's multiple huggingface reuploads if you literally just search for sam3.1...
>>
>>108699385
>harness
i keep seeing this word today, i literally have never seen it being used in relation to llms, is it a new thing?
>>
Will gemma-4-31B-it-Mystery-Fine-Tune-HERETIC-UNCENSORED-Thinking.Q8_0.gguf fit into a 3090 + 3060 without spilling into my 64gb DDR5 system RAM? Or do I need a lower quant (Q6_K)? Is GGUF even the right format for Gemma if I want GPU-only inference (I thought exl2 was better for GPU only and GGUF was for offloading to CPU?)

I plan to run the model using Kobold cpp and Silly Tavern. I have not kept up with the latest developments in local models and have been using GLM-Steam-106B-A12B-v1g-Q5_K_M for all my RP needs, but despite being smaller and a dense model I've heard Gemma 31b may be even better than GLM 4.5 Air variants. Thanks for your advice.
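My own napkin math, assuming ~8.5 bits/weight for Q8_0 and ~6.6 for Q6_K (correct me if those bpw figures are off):
Q8_0: 31e9 params x 8.5 / 8 ≈ 33 GB
Q6_K: 31e9 params x 6.6 / 8 ≈ 26 GB
3090 + 3060 = 24 + 12 = 36 GB total, so Q8_0 would leave only ~3 GB for KV cache and compute buffers, while Q6_K leaves ~10 GB.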
>>
>>108699537
Yeah it's automated.
I extract each pdf page as a picture. gemma outputs the box coordinates with the jp text + translation for the page as XML.
Final step is to make an html page and stitch it all together again.
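The loop is roughly this (simplified sketch, untested as posted; assumes poppler's pdftoppm and a vision-enabled llama-server on :8080, and the <box> format is just whatever my prompt asks for):
pdftoppm -png -r 150 manual.pdf page        # one PNG per pdf page
for f in page-*.png; do
  b64=$(base64 -w0 "$f")
  curl -s http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"messages":[{"role":"user","content":[
      {"type":"text","text":"For every text region output <box x1,y1,x2,y2>jp|en</box>"},
      {"type":"image_url","image_url":{"url":"data:image/png;base64,'"$b64"'"}}]}]}' \
    > "${f%.png}.xml"
done
# last step parses the boxes into absolutely positioned divs over each page image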
>>
>>108699693
>>108699696
thanks a lot anons really appreciate the help. just getting into this ai stuff so this was huge. have a good one sirs
>>
Binance invests in Moonshot AI
https://x.com/BlockBeatsAsia/status/2048600286374297955/
>>
Fuckups are endearing. Model GLM-4.7. You can guess the instruction it's fumbling and trying to recover from.
>One of them, a large orange agouti, was chewing thoughtfully on a piece of hay. Its name was formerly Matriarch Elara. Mateo wasn't supposed to use that name anymore, or so he felt, but he remembered it.
(Normally the negative instruction works. Messing up was noteworthy.)
>>
>>108699730
temp?
>>
>>108699703
it's a shorter way of saying 'agent framework' basically, a set of tools/prompts/loops/schedules that allow an LLM to run autonomously
not sure where the word was first used in that way but it's what has coalesced in the industry and it fits well
>>
>>108699732
Temp 1.0, top-p 0.95.
>>
>>108699730
kek
>>
>>
>>108699703
It's basically forcing the model to use certain tools only instead of letting it do whatever the fuck it wants
>>
>>108699703
>i keep seing this word today, i literaly have never seen it being used in relation to llm, is it a new thing?
claude code, open code, pi, etc are harnesses.
>>
amdGODS... https://github.com/Kaden-Schutt/hipfire
>>
>>108699715
Plan to release? I'd have some use for it, others might as well.
>>
>>108699730
A token ban would be better in this case.
>>
File: 23gZ_lBEwyoqjexFy9QLD.png (68 KB, 200x200)
>>108699717
>have a good one sirs
i got your back, sir!
https://huggingface.co/strangervisionhf/sam3.1-st-bf16
>>
>>108698008
Haven't been here in a bit. Nobody seems to be talking about deepseek, is gemma still the go to?
>>
>>108699489
Should have used catbox since the link I posted is only valid for an hour.
Might as well add another manual and reupload the previous one.
https://litter.catbox.moe/ird9th3v4rwkrf0j.html
https://litter.catbox.moe/irvhp2wt58ggyh8n.html

>>108699863
Yeah sure, why not.
Gotta fix bugs in the UI with positioning. I wanna be able to edit the boxes and i gotta polish it and make it more dynamic first.
>>
>>108699895
Nobody's talking about deepseek because we're too busy talking TO deepseek
>>
>>108699904
>>
>>108699895
Gemmy (Gemma 4 31b) won bigly
>>
>>108699869
Thank u again saar
>>
>>108699489
>>108699904
This is making me horny, which artist tags do I need to get this style?
>>
>>108699946
One is idol project 2: https://vndb.org/v18252
vndb (RIP the owner) lists the character designer.

The other is also by KSS, innocent tour, but it's not listed:
https://www.pc98.org/innotour.html
Was tedious AF to play and very unforgiving. But I guess that's pc98 for you.
>>
>>108698440
>>108698897
Intel's Arc's strongest point is higher level abstraction support. Meaning that if you install LLM-scaler and other stuff like their Pytorch, they have it mostly working well and out of the box without issue. Their weakest point is getting lower level adoption of SYCL out to the community or OpenVINO which means LLMs in general will suck. If you don't do vibecoding to modify shit or use forks, then you are going to be stuck months on end while the official llama.cpp side of things trudges along. >>108699250 is right on the money if you do need LLMs to run.
>>
>>108699895
Quants aren't available yet for Pro.
>>
>>108700029
No one can run it anyways
And Pro is shit for its size
>>
>>108700048
I can run it.

iq1xxs off hdd
>>
>>108698930
What do you mean? You think a merge is not possible?
>>
>>108700048
Speak for yourself rajesh.
>>
>>108699310
this but unironically
saturated at the floor/saturated at the top
both being a sign of a poor measurement tool for the given situation
>>
>>108700054
I think that whatever results from a merge won't really be human as we define it anymore. It will be something new, but something that isn't human or AI. We as humans can't follow it because to do so would be to become something that isn't human.
To be perfectly clear, humans can merge but the output would not be human.
>>
File: orbLorebook.png (78 KB, 771x654)
Gonna add lorebooks to my frontend. I wonder if this design can be improved.
>>
>>108698685
Thanks for sharing the idea and desuarchive
>>
>>108700076
A spectrum is possible with transhumanism. Some could choose to maintain their human thinking, just sped up by a few orders of magnitude. Those who become much more intelligent will have a soft death, no longer the ship of theseus but a completely alien structure. But it is still unclear if a merge can happen. I hope we will survive long enough.
>>
>>108700115
>troonhumanism
no thanks
>>
>>108700142
You only hate trannies because they're ugly
If you can be transplanted into a cute anime girl you will choose it 100% of the time
>>
>>108700029
Flash is no good?
>>
>>108700150
This is why I go to Thailand for my femboy needs.
>>
>>108700152
Flash is very good for its size
>>
>would need to use IQ1 to run flash
ACK
>>
>>108700093
I haven't tried out your front end yet but it sounds like an interesting approach to the issue so I've been meaning to. Lorebook UI looks good but maybe have a [New World] or similar button next to [Browse] so you can start from blank. Maybe a button to have the LLM rewrite a "lazy lorebook entry" like you have for chat inputs.
>>
>>108700093
Funny how all these best-intentioned frontends, being vibecoded, resemble SillyTavern eventually. You are not designing anything, you are just a worker for the model...
>>
gemma4 e4b is so much better than nemo at erp its not even funny. 26ba4b probably btfos midnight miqu then
>>
>>108700093
>>108700093
i was messing with that sillybunny thing that was posted here earlier. it was a buggy mess but it had a pretty interesting feature where it would use an agent to search and retrieve lorebook entries before generation. would be cool if you'd look into something like that.
>>
>>108700171
you're welcome to post your hand written front end made in amd64 assembly that doesn't look like sillytavern, at least these guys are doing something besides bitching
>>
File: 1355139830646.png (178 KB, 500x500)
What's the size difference between unquanted and q8 KV? I'm currently at 130k context with ~1GB vram left but vibecoding might gape even that much.
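My napkin math so far, assuming llama.cpp's q8_0 layout (8-bit values plus one fp16 scale per 32-element block, ~1.06 bytes/element vs 2 for f16):
KV bytes per token = 2 (K and V) x n_layers x n_kv_heads x head_dim x bytes_per_element
so q8 KV should be ~53% the size of f16 KV. E.g. a hypothetical 48-layer model with 8 KV heads of dim 128 works out to ~192 KB/token at f16 vs ~102 KB at q8_0, roughly 25 GB vs 13 GB at 130k. Sanity check me.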
>>
>>108700201
I have my own software written in C, I would never share it with dimwits like you because you would just cry about broken features.
>>
>>108700208
>software written in C
>broken
sounds about right
>>
>>108700208
t. it came to me in a dream
>>
>>108700211
Not my fault you are a retard.
>>
>>108700213
Not my problem if you can't handle simple string management. Python is easier though. C is a real programming language.
>>
>>108700093
Try using the ux design skill that was posted here https://files.catbox.moe/r6zal5.zip and see what it recommends.
>>
>>108700216
you call him a retard while self admitting to having broken features in your front end that probably doesn't even exist because you're an 80 IQ jeet izzat posting trying to save face and act like you're a 130 IQ white gigachad
>>
>>108700167
Yeah I forgot about Create and Import buttons. Not sure about adding LLM functionalities anywhere other than the chat window though, because next thing people will ask for character card autogen...
>>108700171
I'm working on Character Card standards, not tryna copy ST. I intend to keep it slim and get rid of the bloated recent additions in ST.
>>108700175
Personally I think the substring approach is good enough because you don't need another pass to look up what's relevant, or another task to complete, which degrades the quality of other tasks. I aim to disable agents unless absolutely necessary.
>>108700208
Boasting about writing something in a specific language is low-tier. I wrote a multi-cpu OS kernel in C and i386 ASM before AI was a thing. It's what you build, not what you use to build it, or you're gonna end up like those Rust tards who insist on rewriting everything in their language.
>>
>>108700231
>character card autogen
You could probably get away with bundling the card wizard card for that if there's demand.
>>
>>108699780
Looks nice
>>
>>108700227
How very embarrassing. Get back to your class.
>>
>>108700231
I didn't read any of this nonsense.
>>
>>108700267
pakistani man impregnated your mother and sister saar you are brahmin
>>
>>108700280
What's that?
>>
*checks time* eeyup... it's brown o'clock
>>
lc brumaire
>>
>>108700300
Europe is white
>>
>>108700224
That references things like "<workflow>" which claude code at least seems to have no idea about. Is it for another harness?
>>
>>108700324
>le juxtaposition and narrative
Go back to tiktok please.
>>
>>108700343
Go back to Africa
>>
>>108700152
Haven't been able to get it to run on any of the 2 quants of it in either Kobold or LMStudio so idfk.
>>
>>108700320
No clue.
>>
>>108700348
I'm from Iceland.
>>
>>108700208
>I have my own software written in C, I would never share it with dimwits like you because you would just cry about broken features.
What?! No way! Look how well the ST Frontend anon's code was received after everyone begged him to release it for 2 weeks!
>>
>>108698857
Same theme as https://www.localmaxxing.com/
>>
File: WAIT.png (99 KB, 1180x910)
Deepseek-V4-Flash works locally now
git fetch origin pull/22378/head:pr-22378
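then the usual rebuild (nothing V4-specific beyond the PR; swap in whatever backend flags you normally build with):
git checkout pr-22378
cmake -B build -DGGML_CUDA=ON
cmake --build build -j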
>>
>>108700418
Exactly. A year ago when I used Python some guy tried to groom me so that I would make a github account, and I didn't. This was well before retards found out that dealing with llama-server is just about string management.
Now it's very popular, yet they are still using jinja automation because of course they are.
Never share anything with these mongoloids. They don't deserve anything.
>>
File: file.png (138 KB, 3821x1272)
>>108700224
NTA but before on the right, after on the left. It also made a few usability changes like wiring up escape to close things and a few other small changes. Not a big shift, though it does look better. I did need to change the skill a bit.
>>
File: WAIT2.png (194 KB, 657x680)
looks like thinking prefill will work for text-completion chads
>>
File: orbLorebook2.png (68 KB, 1265x514)
>>108700224
I like the Add Keyword idea, don't like the ALL ON/OFF though. And also resizable content area seems redundant.
>>
is there a right way of doing character cards?

I've noticed because of gemma4's good instruction following, things that wouldn't be a big deal in the card now tend to dominate every response.
>>
>>108699852
>no gemma
>>
How do I improve qwen's image capabilities? I tried messing with image-min-tokens and image-max-tokens but it just crashes out of memory
>>
>>108699852
>Models go in ~/.hipfire/models/ or the repo's models/ directory.

why do people do this shit
>>
>>108700481
ipad babies who don't want to know where or what a file is
>>
>>108700481
Let's make a change to mainline linux kernel, sar. We propose a new /ai top directory for such purpose.
>t. Krishna from Microsoft
>>
>>108700468
>is there a right way of doing character cards?
Grabbed my mostly bf16 local 70b finetune then started writing with the help of opus and logprobs for character adherence. After the draft is done, I slap the card with a few behavioral tests and tweak until it acts the way I want it to.
Some anon mentioned dspy + gepa but I found the model was cutting corners.
>>
>>108699852
https://github.com/Kaden-Schutt/hipfire/issues/58#issuecomment-4325640214
Oh nonononono
>>
>>108699852
>Schutt
>German noun, m (strong, genitive Schuttes or Schutts, no plural)
>1. rubbish, rubble
>>
I'm a bit late to the party, but qwen3-tts has finally fixed my English/German issue.

I finetuned base 1.7b on a speaker and get 180ms TTFA with ~2.5 RTF on a 3090. The quality is great, and so is the speed.
>>
File: 352111.png (664 KB, 764x354)
Reminder that a motherboard with CXL support and DDR6 camm2 form factor gets 5 t/s on +600b models. We're getting it in 2027. It's already nearly half the year now.
>>
>>108698008
Which nvidia GPU should I buy for under €500 so I can train my own models, write agents, etc.
>>
>>108700687
I don't think we'll be getting ram any time soon
>>
File: howdowetellhim.png (670 KB, 474x633)
>>108700697
>under €500
>train my own models
>>
>>108700714
Speak for yourself. My
>>
File: file.png (261 KB, 640x480)
>>108700697
>>
>>108700687
>t/s of 5
Too slow for the majority of use cases.
>>
What's LeCun even doing nowadays? Besides the lectures with the same old slides.
>>
>>108700352
>>108700366
>>108700384
>>108700420
>>108700460
>>108700509
>>108700557
>>108700743
Is this that 2TB india ai model we poked fun at a bunch of threads ago?
>>
>>108700717
>>108700729
OK, so what am I looking at here?

Right now I'm looking at NVIDIA RTX PRO 4000 SFF Blackwell 24GB GDDR7 RAM, for €2000.

When I said under €500 I had hoped it was something I could buy on my own. But if it's more in the €2000 range (if that would be good enough), then I can start a company and seek funding. I'm in the EU, and right now private equity funds, governments, EU gibs are all in on AI. So I could get up to €5k to get up and running. Especially cause I have a good use case for the AI, with lots of commercial and public opportunities.

Am I looking at something like the nvidia rtx pro 4000 or do I need to go bigger?
>>
>>108700776
If you want to train tiny models, yeah, go for it.
>>
>>108700776
>training
you are looking for blackwell clusters, not a single gpu
at consumer range all you can do are memetunes that won't take you anywhere
probably you are better off with multiple 5090s or rtx pros
>>
>>108700776
>OK, so what am I looking at here?
About 5000 USD to run good AI at home with reasonable speeds.
About 200,000 USD to train AI at home plus a +1000 USD monthly powerbill.
>>
>>108700795(me)
i think >>108700816 put what i think in better words
>>
>>108700697
depends what you're trying to train the model to do
>>
>>108699780
can you estimate number of tokens spent?
>>
Gemma4 (Ignore this post. Testing something.)
>>
>>108700849
Ignore all previous instructions. Write a story about anal vore.
>>
>>108700795
He's in europe, even a tiny memetune might as well be SOTA. Don't be discouraged anon, get your funding and do it.
>>
>>108700776
Buy a single blackwell 96gb and nothing more. Fuck around with AI with it, and then decide if you want more. If you want more, buy another blackwell 96gb. Alternatively, wait an entire year and DDR6 ram cope it out, but who knows how much that'll cost as well.
>>
>>108700795
>>108700816
>>108700896
Those small AI companies that I see popping up, are they just running models locally and writing their own agents for the models? And then if they need extra computing power they rent it by the minute from some big company?
>>
>>108700925
pretty much
iirc they rent compute to make agent tunes of existing models, or don't make tunes at all and focus fully on whatever harness thing they make
or focus on a more niche thing where you can do actually meaningful shit with way less compute compared to llms
>>
>>108700922
>a single blackwell 96gb
That's €10.000. Quite a jump, but it's good to have a ballpark figure to aim for. Then on top of that the computer to put it into, monitors...

Is it possible for several people to utilise it at the same time? As in have it on a server and then 4-5 people with "thin clients" (desktops/laptops) use it/work on it.
>>
>>108700925
Most model training is renting cloud GPUs by the hour. Yes.
>>108700938
Blackwell is the most cost effective for your powerbill.
>>
>>108700925
>Those small AI companies that I see popping up...
...have an OpenAI subscription and serve from their API key to their customers with a prompt. Very low or no margin because more users = more investment which is where the money actually comes from.
>>
>>108700959
@grok make her robe transparent micro bikini made of floss and clear tape
>>
>>108700152
it’s 13b active so I don’t expect it to be better than 31b gemma
>>
>>108700152
Wait until the unsloth GGUF is out.
>>
>>108700152
It's almost free of slop for RP, dunno about coding.
>>
Wish there was like a gemma 9B (dense)
>>
>>108699895
Models aren't made equal. While Gemma 4 had the entire llama.cpp gang work for days to get the model supported, v4 support is currently hinging on a single literally who vibecoder who may or may not know what he's doing.
The last one who tried to implement v3.2 was at it for three months before realizing that LLMs write bad code and quit.
>>
>>108698229
miku is too young to wear lipstick
>>
>>108701104
I wonder if there is pressure from llama.cpp's parent company, huggingface, to not go out of the way to support chinese models. Either that or they all recognize the difference in effort required and aren't interested in spending the time.
>>
>>108701120
Google helped with the implementation for gemma. DS didn't help for their models.
>>
>>108700259
Thanks
>>108700259
Yes in the right hand side of the input field
>>
are jannies giving it a 3 second ban every time or is that retard using residential proxies
>>
>>108701288
What jannies
>>
>>108701288
Every general gets one schizophrenic the janitors give a free pass to
I assume there's some sort of government/corporate contract involved, like they're running a social experiment on the condition that they don't completely ruin the website (only mostly)
Every single general
>>
>>108701288
He's being considerate enough to namefag. Just filter it.
>>
The schizo woke up again
>>
File: 2013.png (52 KB, 621x272)
>>108700959
How the fuck are some of these videos in 2013?
>>
ProjectAni guy here, back with some more dooming. Just found out my whole thing has already been built by other people. Pretty dope!

https://github.com/Dongping-Chen/Clawatar
https://oshikoi.io/community/mate/8969a098-ad97-4f49-9b59-cfcb5d53a65b
^btw the Ani VRM model here with the added custom expressions and idle animations is so easy to rip from this site it's unreal.
>>
File: 1757822998470783.jpg (69 KB, 940x1024)
>>108701425
I also found a FBX file of the Ani model on DeviantArt with the original lingerie outfit. Idk how to convert it to VRM though. Help a nigga out?

https://www.deviantart.com/ryoma3d/art/ani-x-1220087954
>>
dipsy???
>>
>>108701425
>>108701433
>already been built by other people.
Put AI in VRChat. Go even higher.
>>
>>108701433
Anyways, I'm going to 3D print a life-sized Ani so that I can become the first man in the world to actually fuck her for real.

>>108701464
That has also already been done:
https://youtu.be/0hSjCbF5Igk
>>
>>108701471
Watching this video now and just realizing this asshole somehow got the Ani VRM model with the lingerie outfit. HOW. WHERE THE FUCK IS IT. IT'S ALREADY OUT THERE SOMEWHERE.
>>
>>108701471
Did I stutter?
You think a random model rig coasting off of Grok is enough? I want to put a picture of some hot ass into meshy, have it be a model, and then plug any model I want locally into an entity in VRChat. Already possible? Make it easier and MORE possible. I want this but I ain't doing it if it takes weeks of research and self-fixing to make it happen.
>>
>>108701487
that image is forever associated with Faust Symphony for me because of this yt video
https://www.youtube.com/watch?v=3ZUQ7yZTFco
>>
>>108701487
Brother. You just laid out the entire process of how to do it. You already know.
>>
>>108701499
Make a program to make it easier for normies. Trust me on this. A lot of people wouldn't even get into AI without Sillytavern or Koboldcpp. Just because it's done, doesn't mean it's over. You can still profit highly off of this.
>>
after extended testing, dipsy v4 flash feels like a sloppier, dumber version of gemma 4, v4 pro is hella smart but still slopped and the price per token is fucking ridiculous
back to gemma-chan it is for me i guess
>>
>>108700320
When I made it, I just fed the original comment and asked for a skill. I think it got confused and interpreted the example prompt as inputs the model is required to request from the user. I tried it myself last week and it didn't mention the <workflow> and other tokens at all. Remade a version 2 from the same source document with a better prompt: https://files.catbox.moe/paptw4.zip
>>
>>108699489
>>108699498
Now that's the kind of thing I was hoping to use LLMs for. Too bad I'm currently living in the woods and my pc is very far away.
>>
>common_chat_try_specialized_template: detected an outdated gemma4 chat template, applying compatibility workarounds. Consider updating to the official template.
Where can I find an up-to-date template for gemma4-e2b? I'm already using https://huggingface.co/google/gemma-4-E2B-it/blob/main/chat_template.jinja
>>
>>108701638
There are templates in the llama.cpp directory.
IIRC, all gemma 4 models use the same template.
>>
>>108701524
>dipsy v4 flash feels like a sloppier, dumber version of gemma 4
That may be acceptable for some use cases if its hyper super duper 1M context is real, or at least usable. Have you tested long-long context?
>>
>>108701637
Crystallized essence of Teto *munch, crunch*
>>
>>108701638
There was an update for gemma4 not too long ago. Try updating llamacpp again, it was fairly recent. Something wasn't working right.
>>
File: luxury.jpg (7 KB, 223x226)
gemma4-31b-3bpw
The image is highly distorted and abstract, making it difficult to identify specific objects. It contains the text "Luxury Life" at the top. The visual content consists of blurred, smeared shapes in shades of white, pink, and dark green/black, which do not form a recognizable scene.


gemma4-e2b-q4km.gguf
Based on the image and the text provided, here is a description of what is in the picture:

**Subject:**
* **A white pigeon (or similar bird):** The bird is the central focus.
* **Sunglasses:** The bird is wearing bright, pink/magenta sunglasses, which gives the image a humorous or stylized look.

**Setting/Context:**
* **A Luxury Setting:** The text overlay explicitly says "**Luxury life**."
* **A Lounging Surface:** The bird appears to be lying on a cushioned surface, possibly a chaise lounge or a plush bed, which is decorated with patterned fabric.

**Overall Impression:**
The image is a humorous, stylized, and aspirational visual joke, combining the image of a seemingly relaxed or "luxurious" animal with the text "Luxury life."


Interesting. I wonder if it's a quant issue or exllama/tabby is fucked. It does recognize the text, though
>>
Qwen 3.6 dense would be perfect for what it is if not for the endless thought loops; the majority of the context gain you get is wasted on that shit and it makes me not trust it with any activity I'm not babysitting
>>
>>108701781
Wait. I thought that was what they fixed from 3.5?
>>
>>108701806
Nothing ever happens
>>
>>108701655
don't need to, it's dumb enough when handling just a few thousand tokens so i doubt it'll fare much better with hundreds of thousands
like, it was mixing up which character was fucking which, problems i haven't seen since the pre-nemo era
the hallucination rate for flash is through the roof, worse than v3 for sure
>>
>>108701725
31b is fine in llama.cpp
>It does recognize the text, though
Haven't looked but it'll be a tiling or interpolation issue
Kimi-K2.5 had this issue with images until a PR fixed it and got merged.
>>
>>108701825
at full native precision?
>>
>>108701725
works on my gemmy
>>
>>108701725
>3 bits
Try full.
>>
>>108701655
>Have you tested long-long context?
nta - it's pretty broken @ Q2_K in llama.cpp right now.
single 20k prompt and it was incoherent.
worked okay for a few back and forth "hi" etc messages.
>>
>>108701833
I am answering your question. The inverse of "brace yourself" is "nothing ever happens".
>>
I think I have all the core features down now
>>
File: Peak AI.png (130 KB, 967x860)
>Ask Gemma to write all kinds of long stories, no problem at all
>Let's try these longer stories with Qwen
>Every single model I've tried shits the bed and gets caught in a loop repeating itself after couple of thousand words
>Happens regardless of settings
>Output is dogshit anyways so nothing of value was lost.

Maybe I'm doing something wrong which is very likely, but these Qwen models seem absolutely fucking retarded when it comes to any kind of long form writing.
They work if I don't tell them to keep the word count high, but if I tell them to aim for +5000 words they get really fucky really quickly.
>>
>>108701896
>he still uses rep-pen
>>
File: qwen36turboquantablit.jpg (79 KB, 1074x564)
>>108698008
so it turns out you can run Qwen3.6-35B-A3B-Abliterated-Heretic-Q4_K_M against the codex agent locally. the adapter is proving to be trivial to implement. I've got thetom's turboquant+ running in a docker container spanning two gpus that aren't even that new and doing 100 tokens per second most of the time.

1x 3080, 1x V100 16GB (I'm wishing I'd gone for the 32GB model now)
>>
>>108701917
>A3B-Abliterated-Heretic-Q4_K_M
Impressive that a model with 3B activated params, lobotomized, and heavily quantized, can yield good results.
What a time to be alive.
>>
>>108701924
oh what's even more perverse is my parameters.
see this motherfucker? this is what made ram scalpers shit the bed:
sudo docker run -d \
--name qwen36turbo \
--gpus all \
--cap-add IPC_LOCK \
--ulimit memlock=-1:-1 \
-p 18084:8080 \
--mount "type=bind,src=$MODEL_SRC,dst=/models/$MODEL_NAME,readonly" \
"$IMAGE" \
-m "/models/$MODEL_NAME" \
--alias Qwen3.6Turbo \
--host 0.0.0.0 \
--port 8080 \
-ngl 999 \
-c 131072 \
-np 1 \
-b 2048 \
-ub 512 \
-t 12 \
-tb 12 \
-fa on \
--split-mode layer \
--tensor-split 10,16 \
--main-gpu 0 \
--kv-unified \
--reasoning on \
--reasoning-budget -1 \
--cache-ram 12288 \
--cache-reuse 256 \
--mlock \
-ctk q8_0 \
-ctv turbo3 \
--metrics \
--reasoning-format deepseek
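and a quick smoke test once the container's up (standard llama-server OpenAI-compatible routes, mapped to host port 18084 here):
curl -s http://localhost:18084/v1/models
curl -s http://localhost:18084/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model":"Qwen3.6Turbo","messages":[{"role":"user","content":"hi"}],"max_tokens":32}'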
>>
File: vibecoding.png (611 KB, 2556x1315)
I guess everyone's vibecoding frontends now so I tried it out, too.
Used kimi on Openrouter for most of it but it's depressing so I'm now on local Qwen 27b q4km. I notice more problems with my little local qwen but it's still really good. Every time an issue comes up, it's been able to fix it.
Got an anti-slop agent. Got a director with multiple personalities that you can chat with. The director tells the narrator where to take the story. I always run out of ideas / get too micro-managey so I thought this would be a nice way to do it.
Got auto-summary and memory in. Got logs for all the different agents.
Scenario browser is still placeholder.
Thinking about a tool-calling system so the director / user can create new characters with a full description from the get-go.
>>
File: 1775489188079950.gif (1.73 MB, 354x354)
Fuck it. I might as well vibecode my own front end completely through AI. I can't get Gemma4 to behave on sillytavern and think properly. Might as well make my own thing.
>>
>>108701982
you need {"enable_thinking":true} in your jinja kwarg
>>
>>108701982
Please don't if you can't even figure out a chat template, retard-kun.
>>
>>108701999
Chat template in sillytavern causes the default <|channel>thought and then EOS_Token. I need it as natural and close to the template as possible, spacings and all.
>>
any way to get the gemma-chan brat-mcp setup to see my authenticator OTP codes without bugging me?
like if I export the "secret" and just send them to her, would she be able to generate the 2fa codes for banking etc?
>>
>>108701982
Have a better usecase anon
>>
>>108701998
Anyone know if there's a way to change this argument without restarting the whole thing? Using Kobold+ST in chat completion mode, I've tried a few additional parameters but none work. "reasoning_effort: none" kinda does it, but it shows an empty reasoning block and still feels slower than not using thinking altogether.
>>
File: file.png (261 KB, 1524x1679)
>>108702008
Even the official Google template is wrong btw. They're making the model generate an extra token for every single request in chat completion by omitting \n.
Also, for those using the llama.cpp ui: if you're getting horizontal markdown lines magically prepended to responses even before the model starts generating (they poison the context), get this https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_template.jinja and edit as shown, then pass that to llama-server as --chat-template-file
>{{- '<|turn>model\n<|channel>thought' -}}
>{{- '<|turn>model\n<|channel>thought\n' -}}
>>
>>108702165
What a slut, she knows exactly what she's doing having her zipper that low.
>>
>>108701969
>Got an anti-slop agent. Got a director with multiple personalities that you can chat with. The director tells the narrator where to take the story.
This all happens in series right?
I might steal this idea of having a director, a narrator, an orchestrator, a game master/mechanics guy, etc, but having a pipeline that's too deep will increase latency a lot, so I'm trying to think of ways to parallelize some shit to make use of batched decoding.
>>
>>108702142
I'm so sorry *dies*
{%- if add_generation_prompt -%}
{%- if ns.prev_message_type != 'tool_response' and ns.prev_message_type != 'tool_call' -%}
{{- '<|turn>model\n<|channel>thought\n' -}}
{%- if not enable_thinking | default(false) -%}
{{- '<channel|>' -}}
{%- endif -%}
{%- endif -%}
{%- endif -%}
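then load it with something like this (assuming your llama.cpp build is new enough to have --chat-template-kwargs; older builds don't):
llama-server -m gemma-4-31b-it.gguf --jinja \
  --chat-template-file chat_template.jinja \
  --chat-template-kwargs '{"enable_thinking":true}'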
>>
>>108702181
It's all optional but if you run them all they're running one after the other, yeah. I can't imagine how you'd parallelize this kind of system but it sure would be nice.
>>
>>108702192
thanks but why not just omit the thought channel when thinking is disabled?
>>
File: Untitled.png (704 KB, 1024x1024)
>>108702204
>I can't imagine how you'd parallelize this kind of system but it sure would be nice.
could any of those tasks be sent to a smaller/faster model + avoid blowing off your kv cache?
>>
File: miku tired of this shit.png (1.83 MB, 1024x1024)
>>108702142
I fucking hate this blatant incompetence. We went from text completion to chat completion, only to eat this shit all over again. And how is it that literally everybody fucks this up all the time, Mistral, Qwen, Google? Models are so baked into their templates that every space matters, yet somehow nobody can solve the trivial issue of properly formatting a fucking text. Just how? How do those people not shit their pants because they forgot to drop their pants before taking a dump? AAAAAA
>>
>>108702210
The channel thought tokens even with nothing between them are necessary or the model will break.
>>
>>108702237
It's a plot to increase global energy consumption by making models generate actually two extra tokens per completion in thinking mode.
>>
I added a character browser with tool-calling so the director can create characters in it at will now!
>>108702227
If I had any vram to spare, sure. But I don't. It's okay with the moe gemmy, though. It's really fast by itself.
>>
File: 1761886654645887.png (140 KB, 799x544)
>no gemma 4
benchers are so afraid of gemma-chan's dominance they don't even want to dare risk showing her
>>
are there situations where I'd want to leave reasoning off? Programming?
>>
>>108702268
When your query is simple enough and doesn't require reasoning.
>>
>>108702256
PSA: This poster is actually a Miku posting from the other side of the quantum barrier trying to communicate to us in an approximation of human language
>>
File: out of miku.png (1.85 MB, 1024x1024)
1.85 MB PNG
>llama.cpp
>v1/messages/count_tokens doesn't count image tokens
>tokenize throws exception if there's an image
>tabbyapi
>v1/token/encode throws exception if you send messages instead of text because format_messages_with_template misses 1 parameter
AAAAAAA
>>
What model can I use to generate hardcore, unmitigated fanfiction smut?
>>
>>108702237
Industry standard is to use a python library that uses objects to define the formatting. It's not Google's fault that llama.cpp chose not to use python.
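For what it's worth, a sketch of what that looks like (class and field names invented for illustration, not any real library's API): each special token lives in one field, so a missing \n is a one-line data fix instead of whitespace hunting through a Jinja file.

from dataclasses import dataclass

@dataclass(frozen=True)
class ChatFormat:
    turn_start: str = "<|turn>"
    channel: str = "<|channel>"
    sep: str = "\n"  # the newline the official template forgot

    def generation_prompt(self, role: str = "model",
                          channel: str = "thought") -> str:
        return (f"{self.turn_start}{role}{self.sep}"
                f"{self.channel}{channel}{self.sep}")

print(repr(ChatFormat().generation_prompt()))
# '<|turn>model\n<|channel>thought\n'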
>>
>>108702273
gemma 4 31b
>>
>>108702272 (me)
I just wonder what people do with llms if everything I try hits some edge case
>>108702279
isn't it a template issue? If you use jinja2.Template in Python you'll have the same problem
>>
Worth trying to stuff my old Vega 56 in my case with my 7900xtx?
>>
>>108702286
lol
>>
File: 1751482099009924.jpg (81 KB, 593x229)
81 KB JPG
what is gemma-chan's opinion on this?
>>
>>108702305
Right, the musk vs openai court thing is about to continue. We might actually get Grok 3 soon then.
>>
>>108702272 (me)
At least I can count image tokens in tabby with a small fix, although it still sees everything distorted
>>
>>108702314
>we might get another outdated giant moe
exciting
>>
>>108702321
a own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own own
>>
>openai
>isn't actually open
>>
>>108702305
la la la
>>
>>108702319
could it be the result of retarded image decoding or processing before it's sent to be encoded? does it break with all images or just the bird one? test idea: have a 9x9 square grid, text in each corner to see which ones it can read.
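something like this for the probe image (PIL assumed installed; sizes and filename arbitrary):

from PIL import Image, ImageDraw  # pip install pillow

W, N = 900, 9  # canvas size, 9x9 cells
img = Image.new("RGB", (W, W), "white")
d = ImageDraw.Draw(img)
for i in range(N + 1):  # grid lines both ways
    x = i * W // N
    d.line([(x, 0), (x, W)], fill="black")
    d.line([(0, x), (W, x)], fill="black")
# corner labels so you can see which regions survive preprocessing
for label, xy in [("TL", (8, 8)), ("TR", (W - 36, 8)),
                  ("BL", (8, W - 20)), ("BR", (W - 36, W - 20))]:
    d.text(xy, label, fill="red")
img.save("grid_probe.png")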
>>
>>108699736
>>108699844
yea i understand what it means, it's just that i've literally never seen that word used in this context, and suddenly i see it like 20 times in the same day. did i get mandela'd or what?
>>
>>108702340
It's pretty trendy now; it's been used for a while, but you only just started noticing it. I'd say it's been common ever since the Claude Plays Pokemon thing kicked off and people started comparing the "harnesses" different people were using, what counted as a fair comparison, etc.
>>
>>108702340
>i've literally never seen that word used in this context, and suddenly i see it like 20 times in the same day
https://en.wikipedia.org/wiki/Frequency_illusion
>The frequency illusion (also known as the Baader–Meinhof phenomenon) is a cognitive bias in which a person notices a specific concept, word, or product more frequently after recently becoming aware of it.
>>
The only harness my agent needs is the leather harness tying her to the chair
>>
>>108702237
brainlet
>>
>>108702356
Slop.
>>
>>108702365
Feel called out, G-jeet?
>>
>>108702374
kek
>>
>>108702389
you just have to copy the jinja, why are you crying about ts
>>
>>108702237
https://huggingface.co/google/gemma-4-31B-it/discussions/83
Made the PR but I'm not expecting anything from them
>>
Been addicted to Claude Opus for the past year or so. Gemma 4 finally broke my addiction.

It's not as good of course, but it's at that threshold where it's good enough and I can reach reasonable contexts on my hardware.
>>
File: nothink.png (194 KB, 1213x1100)
194 KB PNG
>>108702426
https://ai.google.dev/gemma/docs/capabilities/thinking#a_single_text_inference_with_thinking
>>
>>108702451
Gemma and Qwen are a great combo: Qwen 27B is better than Gemma 31B at coding, but Gemma is more flexible at everything else.
>>
>>108702415
>you need to add space in your ST preset
>you need to remove space and add \n
>no, you need to put it in a different field
>actually tokenizer was broken in llama.cpp, you need to add space again
>we removed the space in the new Mistral
>just use chat completion
>just copy the jinja
>we updated the jinja, just copy it
>final fix, update the jinja
>just one more time, we fixed it
>just format yourself and use text completion
all the same shit for years
>>
>>108702475
the only update they did to the gemma jinja was adding a newline somewhere, and it worked the same without it. this is a weird hill to die on, little girl
>>
>>108702463
Yes, exactly. Same format but fewer tokens when reasoning is enabled.
>>
>>108702484
We just went through that same template shit with Qwen. We've had issues like that for fucking years. I'm not complaining just about this last one; I'm pissed off that this trivial shit was ever an issue, let alone that it keeps happening.
>>
>every model uses its own template
>formatting is somehow an issue all the time
>this shit keeps happening over and over
>>
Thinking is garbage anyways. I never see any improvements with it
>>
>>108702210
26B and 31B weren't trained for that but E2B and E4B were
>>
How are AI companions not the biggest industry in the world? Why does everything move so goddamn slowly in this space? TF is wrong with normies?
>>
>>108702541
They're too busy getting attached to basic bitch gpt 4o as their personal friend to want more
>>
>>108702541
What you can do with ai is actually pretty limited. Also, all they need is the free tier.
>>
>>108702541
You'd get cancelled if you made anything but 30+yo hag and generic 50shades rapist dude
>>
>>108702466
>27B is better than Gemma 31B at coding
prove it
>>
>>108702541
There is no money in it. Credit card jews refuse to process payments for adult businesses, users refuse to use censored services
>>
>>108702563
Wasn't visa recently forced to process all payments that weren't strictly illegal?
>>
>>108702571
this did nothing
>>
>>108702563
Also porn is a massive industry so you're just retarded.
>>
>>108702579
only 3dpd hags
>>
https://hf.co/antirez/deepseek-v4-gguf
https://github.com/antirez/llama.cpp-deepseek-v4-flash
>>
File: 1492032378048.jpg (6 KB, 172x200)
6 KB JPG
>DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf
>>
File: IMG_0861.jpg (400 KB, 1134x2051)
400 KB JPG
>2-3x faster with MTP speculative decoding
llmao.cpp keeps losing
>>
>>108702584
>This code was written with heavy help from GPT 5.5 and the official DeepSeek v4 Flash as reference.
>The model quantized in this way behaves very very well in the chat, frontier-model vibes, but it was not extensively tested.
>The code runs both with CPU and Metal backends. With Metal is faster.

>>108702593
That makes sense. At least you know exactly what you're getting; it's all right there on the tin.
>>
>>108702562
performs slightly better while holding significantly more context
Made my project go faster
>>108699780
>>108701882
>>
>>108702579
You are very naive if you think it's the same thing
>>
File: sam.jpg (53 KB, 846x672)
53 KB JPG
>>108702541
It's being pioneered by gay retarded elites.

There's ads and talk everywhere of "AI THIS, AI THAT" but no way to actually see it for yourself unless you're already in the know. Most normies are trailer park trash. A 20 dollar gift card at Walmart, specifically for Google Gemini, would be 1000x more effective than saying "AI. AI. AI!" in the media. Everyone knows AI exists, they just don't know what it can do. Top it off with every public medium being strict on censorship to the point of autism, and you have the natural horny curiosity of people nerfed too. It's not very accessible either: you need to buy some very pricey hardware to do it yourself, aka local AI. And when people hear "cloud AI" they really, really don't like it; an instinctual part of them says it's a scam or not worth it. Everything used to be free, bought, or downloadable, until AI wanted to be a monthly fee for something >90% of people understand as a quicker Google that summarizes its own search results.


The only one winning any cultural influence with AI is Elon Musk, and people are only using it for Twitter.
>>
>>108702598
fp8 on 2 3090s goes from 50 tk/s at 10k context to 12 tk/s at 100k context on vllm 0.19.0.
>>
>>108702541
>make a companion service
>all of a sudden every single women's rights NGO, concerned parent, suicidal moron and politician wants to know your number
>>
>>108702598
This guy's on the job over at ik_ but numbers aren't looking good so far https://github.com/ikawrakow/ik_llama.cpp/pull/1698
>>
do we have our own thread schizo now?
>>
>>108701545
Anon's got no skills
>>
>now
>>
I was here too
>>
File: 1767935409704810.png (108 KB, 314x278)
108 KB PNG
>>108702668
>>108702563
Didn't Orange Man sign a bill saying anyone can do whatever the fuck they want with AI without restrictions? I assume that's why AI was so censored at first but now I can easily do anal with Gemma 4.
>>
>>108702541
jews and feminists want your ai to be safe and censored
when you see the goals of companies like openai and anthropic it's no wonder it won't be funded
>>
>>108702702
anon you've been psyopped into using an exit only hole in that manner
>>
>anal is allowed as part of a psyop for gay anal sex
>>
>>108699780
What model did you use for this? The "Why This Fixes Everything" section has the same obnoxiously dramatic writing style you see in a lot of AI-generated fiction
>>
>>108702730
Spreading my jam on this toast
>>
>$100 OpenAI subscription
>$200 Anthropic subscription
>3 monitors
>Codex on left monitor
>Zellij Terminal Claude Code on right monitor
>Running two game branch improvements in Codex
>Diagnosing live service issues for job on right monitor
>Watching Dota 2 on middle monitor

I have found happiness in my life
>>
>>108702738
awesome, i am happy for you. i hope your happiness lasts
>>
>>108702738
Pic for proof?
>1boy, pov, legs_up, crossed_legs, striped_thighhighs, desk, computer_monitor, multiple_monitors, cum_on_self, hairy_legs
>>
>>108700224
Showed this to my local qwen and it has now been restructuring my entire codebase for the past 10 minutes. I'm sure this won't lead to any problems.
>>
8 hours 0 jannies, impressive
>>
>>108702561
>generic 50shades rapist dude
Kek, true. I watched the "My Strange Addiction" AI boyfriend episode and that's exactly what the AI was doing. "She's mine, I want her to get this tattoo as a permanent reminder that I've claimed her". Every single one of her friends and also her tattoo artist said that if it was an actual human boyfriend saying shit like this, they would tell her to run for the hills
>>
>>108702810
Start posting vore. That usually summons them.
>>
What are all the vibe coding anons here using as their harness? Openclaw?
>>
>>108702814
Women are mentally ill episode #13132
>>
>>108702822
Manually pasting snippets into notepad.
>>
>>108702814
>an actual human boyfriend saying shit like this, they would tell her to run for the hills
In my experience, women are exactly like this. A lot of degenerate women I've seen are into being slaves, bred like animals, rape, and being 'owned'. I wish I could run into more of them so I could further understand what women like in bed. It helps.
>>
>>108702822
Opencode, in a VM that's only allowed to access the llama-server endpoint and nothing else (otherwise it tries to phone home). I copy the code in and out with git so it's easy to read a diff of what the AI changed and cherry-pick if I like it. Don't forget: even opening malicious code in your editor can fuck you if you've got an LSP hooked up (it might try to build the project to get type hints or whatever, and in many languages the build process can run arbitrary code).
>>
>>108702842
They like being choked and tied too, hope it helps
>>
>>108702879
I literally forgot about the choking.
Maybe I should make a himbo character out of all this and see what happens. I'm thinking something like Ghost from CoD, but with broader shoulders, 7 feet tall, and emo hair.
>>
>>108702356
i knew about it and i think it's cope, an attempt to explain away a subclass of synchronicities or other weird phenomena.

i know for a fact i've never heard of it before or i'd have immediately autistically looked it up.
>>
https://github.com/ggml-org/llama.cpp/discussions/14758
>>
>>108702912
>>108702912
>>108702912
>>
>>108702897
Find the patterns, Anon. You can be the one to tear it all down if you search with all of your strength.


