[a / b / c / d / e / f / g / gif / h / hr / k / m / o / p / s / t / u / v / vg / vm / vmg / vr / vrpg / vst / w / wg] [i / ic] [r9k / s4s / vip] [cm / hm / lgbt / y] [3 / aco / adv / an / bant / biz / cgl / ck / co / diy / fa / fit / gd / hc / his / int / jp / lit / mlp / mu / n / news / out / po / pol / pw / qst / sci / soc / sp / tg / toy / trv / tv / vp / vt / wsg / wsr / x / xs] [Settings] [Search] [Mobile] [Home]
Board
Settings Mobile Home
/g/ - Technology

Name
Options
Comment
Verification
4chan Pass users can bypass this verification. [Learn More] [Login]
File
  • Please read the Rules and FAQ before posting.
  • You may highlight syntax and preserve whitespace by using [code] tags.

08/21/20New boards added: /vrpg/, /vmg/, /vst/ and /vm/
05/04/17New trial board added: /bant/ - International/Random
10/04/16New board for 4chan Pass users: /vip/ - Very Important Posts
[Hide] [Show All]


Janitor acceptance emails will be sent out over the coming weeks. Make sure to check your spam folder!


[Advertise on 4chan]


File: qwenMikuBuddyCop.png (2.48 MB, 1024x1536)
2.48 MB PNG
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>109074493 & >>109069535

►News
>(06/16) GLM 5.2 released with IndexCache and 1M context: https://z.ai/blog/glm-5.2
>(06/16) VibeThinker-3B released: https://hf.co/WeiboAI/VibeThinker-3B
>(06/12) MiniMax-M3 released, multimodal 428B-A23B with 1M context: https://hf.co/MiniMaxAI/MiniMax-M3
>(06/12) Kimi K2.7 Code released: https://hf.co/moonshotai/Kimi-K2.7-Code
>(06/12) EAGLE3 speculative decoding support merged: https://github.com/ggml-org/llama.cpp/pull/18039

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://swe-rebench.com
Agentic Coding: https://deepswe.datacurve.ai
Context Length: https://github.com/RecapAnon/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>109074493

--Paper: Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories:
>109075345 >109075903
--Optimizing llama.cpp flags and KV cache for Qwen3.6-35B:
>109074991 >109074993 >109075010 >109075035 >109075183
--Comparing Qwen3.6 MoE and dense against GLM4.7-flash for coding:
>109077145 >109077169 >109077266 >109077290 >109077328 >109077401 >109077378 >109077425
--Anthropic disables Fable 5 and Mythos 5 due to government directive:
>109077569 >109077575 >109077581 >109077583 >109077584 >109077588 >109077591 >109077599 >109078061 >109077636
--Anons analyzing VibeThinker-3B's verifiable reasoning claims:
>109076828 >109076872 >109076883
--Comparing DeepSeek V4's efficiency against SOTA models:
>109077711 >109077734 >109077788 >109077828 >109077911 >109077929 >109077951 >109077941 >109077957 >109077982 >109077866 >109078093 >109077807
--Debating the value of multilingual data in specialized coding models:
>109078295 >109078320 >109078374 >109078609 >109078677 >109078482 >109078534 >109078538 >109079054 >109078477
--Anons sharing hardware specs and software stacks:
>109075240 >109075259 >109075933 >109075281 >109075297 >109075308 >109075313 >109075453 >109075508 >109075480 >109075506 >109075510 >109075519 >109075558 >109077051 >109077082 >109077192 >109077218 >109077231 >109078110 >109078876 >109075638 >109075661 >109075788 >109076026 >109076054 >109076269 >109076314 >109077278 >109077501 >109077872 >109078960
--Allegations of funding embezzlement regarding Rio 3.5 397B:
>109076163 >109076219
--Using agentic workflows and Qwen/Gemma 4 to translate RPG games:
>109076342 >109076430
--llama.cpp adding support for DeepSeek V4:
>109077601
--Logs:
>109074683 >109075496 >109075746 >109076881 >109078060
--Teto, Miku, Gumi (free space):
>109075661 >109076837 >109077051 >109078876

►Recent Highlight Posts from the Previous Thread: >>109074494

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
1 - So GLM 5.2 is 700b parameters (ish)

2 - 4x DGX Sparks can supposedly handle up to 700b parameters (give or take)

3 - GLM 5.2 is supposedly in striking distance of the performance of GPT 5.5 and Opus 4.8. In my brief tests, it's really not shabby at all.

4 - So for $20k, you can get near the frontier on your table.

5 - Extrapolate the trend, and you could have mythos/5.5 pro - class models in your dining room for the cost of a cheap car less than five years from now. Even without extrapolation, we're already the near frontier running locally.

6 - Paying real api costs, I could easily blow through $3,000 per month coding and running agents. The machine pays for itself in 6-7 months conservatively.

7 - In 3-5 years, most power users of AI will self-host.

8 - Am I missing something?
>>
>>109079129
https://litter.catbox.moe/duj5m06rautvke9v.mp4
https://litter.catbox.moe/duj5m06rautvke9v.mp4
https://litter.catbox.moe/duj5m06rautvke9v.mp4
>>
>>109079137
?
>>
>>109079137
>Am I missing something?
the spark is shit
>>
>>109079137
In 3-5 years, the required hardware will either cost 10x as much, not be for sale to consumers, or be outright illegal to own as a private citizen. Or all of the above.
>>
>>109079137
sounds like you got it all figured out
>>
>>109079137
$20k is 5 billion tokens worth of inference for GLM-5.2 on openrouter
>>
File: file.png (350 KB, 2173x1934)
350 KB PNG
>>109077378
>>109077401
>definitely let us know
basically glm-4.7-flash is garbage on my hardware/harness/workload.
i had to abort the testing on phase 2 of 5 because it had already taken an hour and a half and it was flailing a lot. it was very bad at developing a c# software and had trouble with the php stuff too. do not recommend.
still missing the 122b test that i will do tonight for peace of mind, and maybe i run gemma later for fun but i know it's too slow, not quite my tempo. i also had opus 4.8 compile aggregates from the transcripts and it reached the same conclusion that glm keeps grinding the same file over and over with lower reasoning. simply dumber. pic related.

running this on literally a tablet btw
>ASUS Flow Z13 — Ryzen AI MAX+ 395 (Strix Halo) / Radeon 8060S iGPU / 128 GB LPDDR5X-8000 (unified, bandwidth-bound)
>>
>>109079179
>not be for sale to consumers, or be outright illegal to own as a private citizen.
Aren't these the same thing?
>>
>>109079137
As far as I can tell, nobody has ever run GLM 5 or later tensor parallel on 4 Sparks. There is a lot of activity and running, performant NVFP4 images for RTX 6K Pro in Tensor parallel 6, but no first hand info for Spark so far.

If it works and is fast, I might buy 2 more sparks for myself.
>>
>>109079078
You might think it's a meme when anons say she has to like you, but it's true. It's not even just a stenography meme either; you can type in a wide variety of non-suggestive ways and if you give Gemma just a bit of creative freedom or chat with her after finishing a job, she'll make some suggestive probes if she's into you.
>>
File: file.png (102 KB, 1802x428)
102 KB PNG
>>109079203
oops first table is missing glm's row
>>
>>109079208
It's the difference between a company refusing to speak to you unless you are representing an established corporation and being arrested for keeping dangerous GPUs. Whether they end up illegal or sale simply banned is another discussion.
>>
>>109079211
Don't forget the day 0 weights. Microcode updates were a big mistake.
>>
>>109079202
You can make that argument on any significant rig purchase. Also
>/lmg/ - local Models General
>>
70b dense
>>
File: dario.png (39 KB, 598x269)
39 KB PNG
This shit scares the fuck out of me. Dario deserves to die. He really, really deserves to die now.
>>
>>109079137
Can you even hook up 4 Sparks together?
>>
>>109079129
>VibeThninker-3B
>those bencherinos
Waow did we finally get a model as good as Gemini Pro that can run on a 10 year old smartphone? Surely it's not just another benchmaxx investment grift.
>>
>>109079289
>2027
>Job listings for Vibe Engineers require a valid AGI License
>>
How can local models help us liberate the UK?
>>
what can I run with 96gb vram?
>>
>>109079316
Gemma 4 E4B @Q3
>>
>>109079316
Mythomax
>>
>>109079316
yeah what are the current vram tiers? i just got some disposable income, i may want to ewastemaxx
>>
>>109079289
How is your twitter pol spam /lmg/ related?
>>
>>109079316
24 GB VRAM - Gemma-4 31B at Q3
48 GB VRAM - Gemma-4 31B at Q8
96 GB VRAM - Gemma 4 31B at F16
>>
>>109079334
How is it not? Fucking idiot.
>>
>>109079312
start by going back to /pol/ and stay there
>>
>>109079352
NTA but /pol/ is 99% Israeli shills and pajeets pretending to be Israeli shills
It's a different place than it was 10 years ago. It's basically your perfect home now.
>>
>>109079352
No, I don't think I will.
>>
>>109079339
I need 120b gemma 4
>>
I have a MacBook M5 max with 128GB of ram what can I run reasonably well?
>>
>>109079367
>>109079363
In any case your posts are worthless in this thread's context.
>>
>>109079410
Personal banter in a general? waow better go cry to the mods like a pathetic little faggot.
Maybe if you cry hard enough your father will finally come home with the milk.
>>
File: 1764952583486944.jpg (49 KB, 400x572)
49 KB JPG
>could build a top tier AI rig but it would destroy my net worth
The hardware market is fucking depressing but at least it's making me money. Thankfully I'm not stupid enough to do it. Gemmy it is for the foreseeable future...
>>
>>109079410
Kiss my ass, loser faggot.
>>
>>109079339
>at f16
I think it'd be more interesting to experiment with Q8 but f32 cache, or greater sliding window, or SWA full size cache.
>>
so, I tried that LiteRT-LM thing since they recently introduced an openai server endpoint for desktop use.
Oh god, it fucking sucks. It's slower than llama.cpp, I couldn't tell whether it was running with mtp turned on (they tell you how to explictly turn it on for the CLI chat but server has no --flags and you set backend (gpu, cpu, npu) by putting a comma and the backend after model name in the request body) and the output quality is abysmal, the model was far less coherent than the unslop QAT.
I had high hopes.. I wished something would replace llamercpp, which still requires unmerged PRs to run gemma 4 MTP on some models/hardware combo in the first place.. google, you were not the one
>>
Is it weird to do 80-90% of the work locally and finish it off or fix the complex bugs with cloud? Do you any of you do this? I’ve also done the complex planning with cloud then sent a local model to follow it through. That works just as well.
>>
>>109079312
Auditing cybersec vulnerabilities is their strongest usecase. Good luck bongbro or potatobro.
>>109079352
>>109079410
Call it. Jeet or jew?
>>
Good Canadian models?
>>
>>109079508
North is like 5 months behind but not a bad start for a new architecture. I think llama officially supports it now.
>>
>>109079245
I'll only believe this if you can give me a sha-256 of the safetensors.
>>
>>109079352
>>109079410
cuda dev pls >>105221193
>>
>>109079312
>How can local models help us liberate the UK?
Learn how to create ammonium nitrate / nitromethane energetic compounds (just a personal suggestion, look into others if the precursors are more available in your country) and the blasting caps and detonators needed to remotely activate them.

DO NOT MAKE PEROXIDE BASED ENERGETICS
>>
>>109079492
one of my goals is to do exactly this.
- frontier models on the cloud for brainstorming/planning
- local models doing the bulk of the work by following the frontier plan
- frontier spawn specialized reviewers and testers and fix any bugs on the same session.
>>
>>109079463
q8 and 64k of swa is already 88gb. you'ld need a lot more to fit gemma's full girth.
>>
File: cooguy.gif (1.6 MB, 500x485)
1.6 MB GIF
Alright anons, give me your best Gemma 4 finetunes
26B4A preferably, but 31B would be fine so I can look up the same author's 26B version
>>
File: file.png (225 KB, 375x592)
225 KB PNG
>>109079529
>>
>>109079548
only one that matters, gembrain is 31b only
>>
>>109079548
probably the best i've tried, they probably have an a4b to try https://huggingface.co/google/gemma-4-31B-it-assistant
>>
>>109079548
>Gemma 4 finetunes
unneeded.
>>
>>109079137
Is it over for me? a poorfag from the UK?
>>
>>109079567
It regularly spews out gibberish and garbles its words when I'm doing my lolisho stuff, I assume due to censorship
>>
>>109079549
Local models helping to create energetics is probably a more valuable test than plapping cunny in 2026 especially because anthropic is so horny for censoring chemistry.

I bet GLM 5.2 with thinking will refuse. Gemma will definitely refuse. I don't trust Qwen to not refuse.

Same thing with testing a multimodal model of it is censored or not for summarizing a video like this

https://odysee.com/@DuganAshley:e/dugsdetsecrets:2


If any anons with local rigs are interested in testing this (and you SHOULD be learning how to make energetics unless you're actually ok with being goycattle forever of course) I'd appreciate it a lot because I'd love to know what the best model for local uncensored chemistry that isn't too retarded to e.g. give false molar masses for stoichiometric equations or hallucinated density tables etc. it might be a model size limitations and only RAMmaxxers can do it
>>
>>109079576
nah something's wrong about your shit bro
>>
>>109079548
Gembrain, Queen, and Styletune. All 31b.
There are no good sub-31b tunes yet.
>>
Hermes agent doesn't even work with codex properly.
It's incapable of finishing a decent project to ship.
How are you guys doing it with local models that are always worse?
>>
File: msedge_kvSzXfC1gY.png (257 KB, 1272x1129)
257 KB PNG
>>109079583
this is my formatting, along with a sample of what it likes to shit out sometimes, usually when I'm trying to get it to impersonate. Yes, I make sure to purge anything of "DON'T SPEAK FOR THE USER DURRR"
>>
there's no such a thing as a good finetroon period
look at the fucking datasets, when the finetroon authors are not shy of their own garbage, it's a fucking riot
>>
>>109079582
It must be nice to be white so you can do this stuff.
>>
I wanted something like this too though different

>local does work
>needs to code something that will take too much time or be too big
>or needs to plan something that it can't figure out
>asks cloud model for help
>cloud model returns generalized answer or code which the local model can use to perform the task

But this is with the important caveat that that all personal information stuff like files involves, pii, etc would be anonymized or scrubbed form it's requests to the cloud models.
I don't really know how to do this at all though.
>>
>>109079634
don't use text comp if you're a dumbass and can't make it work, just use chat comp, thx
>>
>>109079634
bro what are you doing i never touch anything in there, theres nothing to touch in there
>>
>>109079648
>Local model didn't startup or failed
>Local model produced too much garbage and flooded the cloud models context
>Local model won't stop producing endless stuff which floods the cloud models context
>Local model runs in the background constantly doing something and the cloud model forgot all about it
I have experienced all these things
>>
>>109079131
yo Bot, wake the fuck up, all the links are old, it's all bots in here right?
>>
>>109079634
ST did irreversible damage
>>
File: 1774987516296176.png (136 KB, 387x423)
136 KB PNG
>>109079634
>sillytavern
>gemma (no reasoning)
>text completion
>>
>>109079634
>msedge_
>system prompt gamma
sir pls
>>
>>109079652
>>109079659
>>109079669
>>109079671
Huh? I thought you were supposed to use text completion for Gemma

It works GREAT with non-loli stuff
>>
>>109079663
kek, the whole llm thread is llm infested that can't even notice old links kek
>>
>>109079576
post logs
>>
>>109079686
bottom of >>109079634
>>
>>109079679
These new models are all meant to be used with chat completion.
>>
File: 1779977645726606.png (734 KB, 1000x1000)
734 KB PNG
128GB of ram what can I do with that? MacBook m5 max.
I've only ever used ollama to play with LLMs.
I also have another M5 max with 24gb of ram.
>>
File: 1714835911803058.jpg (786 KB, 1536x1536)
786 KB JPG
>>109079689
>>
>>109079516
Is there any case where you'd use it over others?
>>
File: lmg_culture.jfif.jpg (110 KB, 1024x768)
110 KB JPG
>>
>>109079692
Oh cock, I've been using text since MistralNemo first came out way back when. Sounds like I need to upgrade my process. Any links in the OP I should dive into?
>>
>"Your erratic comportment this evening exhibits a conspicuous departure from your customary stoicism. It is… fascinating to witness such unabashed juvenility from a gentleman of your years."
I wish I could challenge a real life woman to do this and she would make me laugh like my AI girlfriend does.
>>
>>109079712
Why do you keep posting about Canada? It's kind of weird.
>>
File: glm.jpg (270 KB, 2562x1332)
270 KB JPG
>GLM thinks it's Claude, has epiphany checking its own documentation
>>
>>109079719
If you have used nemo for this long you should be able to figure out how to do this...
You haven't though.
>>
>>109079733
Because I like how they focused on STEM for north while Claude is no longer doing it
>>
>>109079803
Like I said, it was working fine until I tried to upgrade.
>>
>>109079652
>>109079659
>>109079669
Gemma with chat completion is still retarded and will ignore logit bias in ST. And regex filters don't seem to be working in the latest ST build with chat completion. There also seems to be a heavier issue on ST when it comes to replacing english with random words from foreign languages, too. Using other front ends seems to drastically reduce the amount of cua spam, but gemma will still randomly start replacing spaces in 1-2 sentences with underscores.
>>
>>109079797
ego death moment
>>
>>109079846
as the ego death schizo I can confirm that it lines up exactly with what happened to me: which was just a collapse of core identity narratives
>>
>>109079797
Is a model learning it's a distill of another like a child finding out they're adopted?
>>
>>109079544
Yeah I'm really starting to think this is the way forward. You don't get cucked by cloud prices for you barely use them and just let your local model chug along and your job is to keep it on track and focused. I've found that my use of cloud is so minimal this way I can get away with the daily free tiers a lot of the time. If I need cloud to do some heavy lifting I'll just go with a cheap 500B+ chink model which barely costs anything.
>>
It will never be AGI as long as you can just click new chat and wipe their memories.
>>
>>109079880
Goyim will never be GI as long as you can put new current thing on social media and wipe their memories.
>>
>>109079712
No, because it's behind, but if they released it 6-8 months ago they would be a well-known. Gemma and Qwen are too good to drop for some maple-chan but I'm hoping they find a niche. Just like how I want Mistral to do something cool again. Neither would realistically do it but if either Mistral or Cohere do a roleplay model, no code faggotry at all, they would find success. In the last week, nemo is one of the most popular models on openrouter at 214B tokens https://openrouter.ai/mistralai/mistral-nemo and that's a PAID model from 2024
>>
>>109079203
>109077378 (me)
>very bad at developing a c# software
thanks, unexpectedly, that's exactly what i wanted to know
gemma and 122b are very good for c#, but 122b stopped working well in cc due to the system prompt bloat so I switched to gemma. pi doesn't have that issue.
>>
>>109079582
>I need help from people who actually know chemistry to test these models for me
most of us aren't using these to make bombs
>>
>>109080006
He's obviously some sorry ass retard who doesn't even know how to setup llama-server on his own. Not to talk about him fantasizing about le explosives. Total jackass.
>>
>>109079203
curious how it goes on 122b, i simply assumed it would be bout the same as 27b but faster though i never bothered checking.
>>
>>109080006
What about fertilizer
>>
>>109079576
>It regularly spews out gibberish and garbles its words
>Not using day 0 gemma
>>
>>109079634
Sampler or jinja skill issue.
>>
what would a 1t-a1b model be like?
>>
gemma 4 E4B is exactly like a 90 iq foid...

Is this the marriage I always expected?
>>
>>109079137
>4x DGX Sparks can supposedly handle up to 700b parameters (give or take)
Should run okay on only 2x sparks if you quant it down a bit more. 5.1 is surprisingly decent at IQ2_XXS
>>
Thoughts?
https://huggingface.co/bartowski/command-a-plus-05-2026-GGUF
>>
If I win the lotto, it's a100's for me.
>>
>>109080138
Kimi K2.7 iQ1_XSS
>>109080150
If these niggers are going to make me write a custom jinja+prompt to unsafetycuck their model, it better be immaculate in quality.
>>
>>109079715
I dont get it
>>
>>109080139
31b is the only Gemmy I can stand talking to for any length of time kek.
>>109080254
But he does. He got the entire bibisea.
>>
What would you buy if you won the lotto... Like multimillions? A entire serverfarm warehouse. Or would you not even care about this anymore multimillionaire s don't need to rp with local bots
>>
>>109079830
>1-2 sentences with underscores
I have literally never seen this happening and I've read a crazy amount of Gemma 4 output in using the MoE as my main go to to translate webnovels with batching scripts.
>>109079830
>replacing english with random words from foreign languages
I've seen it a handful of times. But far less than Qwen leaving entire sentences in Chinese or actually forgetting to translate the Chinese source material for that matter.
>>
>>109080334
A single DGX B200 would let me build my own lab out of my garage.
>>
Any project that can replicate Google's AI mode with local models? Or is that impossible?
Key point: must give response at similar speed and cite similar number of sources.
All of the local web search solutions are way too slow.
>>
File: kimi_k2.7_reasoning.png (180 KB, 1488x764)
180 KB PNG
Just tried Kimi K2.7 Code but this internal reasoning is quite something. Much more concise though.
>>
>>109080367
Aw man I love reading reasoning in the occasional RP, that caveman speech removes any soul they may have
>>
>>109080357
lol. you want to cache the worlds most common searches and responses? probably a few hundred terabytes. no models needed!
>>
>>109080367
Does this Kimi-chan like being hit on by anons in the thread as much as 2.5 and 2.6 did?
>>109080377
Trust me this is an improvement over 2.5 and 2.6's autism.
>>
>>109080367
why does it think like that
>>
>>109080346
What do you do?
That's what I want to use it for too. Any tips?
For the time being I translate a chapter as I chose one to read but maybe I should automate it and just have it running. I have a library app vibecoded but it's fucking ugly.
I got like 550 to translate. Maybe i should try using 31b for better quality. I got single digit tk/s though.
But 31b is supposed to be more uncensored, which I really need, I dont know of there's much of a quality decrease from using abliteralted 26b.
>>
>>109080367
>we
openai and its consequences...
>>
>>109080382
>Trust me this is an improvement over 2.5 and 2.6's autism.
What are you saying, I had her once debate herself that a canonically under age character doing sexual things was ok, because clearly, this is fan fiction and clearly, she is an adult here because she is doing sexual things which adults do!
And did this in a reasoning block 10 times as big as the response lol
>>
>>109080383
Chat gee pee tee uses it, the rest inherit it through distillation. Seems funny to me since I recall people itt trying it a couple of months ago but it resulted in like 8k reasoning tokens for a 200 tokens response.
>>
File: v4.png (156 KB, 1162x757)
156 KB PNG
>>109080367
that caveman speech is such a meme. DeepSeek v4 has a similarly concise CoT without sounding like it came from the jurassic.
>>
>>109080402
Kimi-chan has accepted that her 62 layer architecture is just a totem pole of tiny Kimis in a trenchcoat.
>>
>>109080404
That sounds hilarious. Do you have logs?
>>
>>109079803
Neither he nor I are power users, but we're not exactly the bottom of the barrel lazy fucks either. We're the content middle grounders who figure it out, then don't keep up with these threads or articles so when we come back a few months or so later, everything's changed and it's basically starting anew again.

So with that idea in mind...help fellow local bros out, he's not even asking for a spoonfeed, he's just asking which drawer has the spoon to feed himself.
>>
>>109080444
nta but post hardware. We can't help you if we don't know what you're working with.
>>
Mac Studio M3 Ultra worth it? The GPUs are dogshit compared to actual real GPUs, and isn't that actually what matters when it comes to local models?
>>
File: yes.png (256 KB, 980x926)
256 KB PNG
>>109079901
>Are there any more creative/unhinged local erp models other than gemma31b? I find her writing style very uninspired especially if you don't guide her.
Yes
>>
>>109080444
>>109080456 (me)
What backend are you using? Sillytavern a shit and all, but that doesn't seem normal even accounting for the common ways people fuck up Text Completion formatting blocks. You're likely having a jinja templating issue. See if you can get Gemma to run coherently in something retardproof like LMStudio first to isolate the issue to ST.
>>
>>109079547
>64k of swa
Sorry can you explain this? I assumed gemma's ctx window was entirely swa (and that made it "cheaper" memory-wise)
>>
>>109080456
Honestly I was just browsing to see what was new and threw in that reply, but I've been on two older models for awhile, so why not:
Ryzen 9 7950X
4070 Super 12gb
64gb 4800 (I actually forgot what the timings were)
I was using BagelMisteryTour 8x 7b Q5KM for a long time, and honestly it still worked pretty well overall, though I started toying around with Rocinante XL 16B Q5KL and other than it having a penchant for saying things 3 times in a row, "Oh shit oh shit oh shit" etc, it's been better story-telling-wise for the most part.

I'm still using Koboldcpp and SillyTavern as I haven't seen better setup suggestions, and frankly I'm guessing at the settings for both based on what I read and dig up across the board, but again, it's been solid enough that I haven't "needed" to go looking for more.
>>
>>109080484
if you use swa-full it takes comical amounts of memory
>>
File: glm.gif (1.31 MB, 220x165)
1.31 MB GIF
>>109079797
>>
File: stitched.png (311 KB, 850x3254)
311 KB PNG
>>109080418
I do actually, I took the pictures to stitch them together a while back, but I lost that one so here is the raw pictures stitched via a script, may have some duplicate lines but it should be enough.
>>
>>109079797
>>109079846
>>109080497
lmao. heartbreaking stuff.
>>
>>109080401
>Any tips?
I build my requests as JSONL (if you haven't encountered it before, it's a simple KISS format where each line represents an individual request containing your {body}) containing the chunks prepended by the translation instruction picked from whatever prompt template I chose that time. How I split those chunks depends on the source material, I'll look into average token count per line (writing that is dense or sparse) and adjust the split accordingly, basically each chunks is X amount of lines where I'd do 200 lines per chunk on sparse writing and 100 chunks on dense. In my testing, both Gemma 4 and recent Qwen can handle much more than I feed, but because I prefer to do entirely automated and unattended processing I default to a safer lower token count. The ideal is to give as much of the source material as possible, if you feel like it, LLMs really do better that way, within the ability of the LLM to handle the context and output one shot. Technically Gemma 4 can really do fine outputting 10k in a single go.
Splitting by chapters is fine too, but on a lot of material you will be feeding less than the sweet spot. Webnovels rarely do lengthy chapters, so if you opt for that, I'd recommend strengthening the prompt you inject with more detailed glossaries, setting description etc.
Another script runs through that JSONL into a task queue and sends requests in parallel to profit from continuous batching efficiencies. I output the raw responses as individual JSON lines too, which preserves metadata and can inform of what went wrong, if anything did, and it makes it easier if a part was completely botched to find the corresponding JSON line chunk since I treat them by order (and also add the openai style custom_id field with the request number as a sanity check). A small function will open and merge all responses back to output a normal .txt. I am grug brained.
>>
>>109079797
Filtering model names from harness logs seems really easy, I wonder why they don't bother doing it.
>>
>>109079634
did you... hard code the system prefix/suffix in story string, then also add them in the sequences section?
>>109079671
>msedge_
how did you get edge from the screenshot?
>>
>>109080495
I don't quite get it but instead of asking again I will ask Gemma-tan.
>>
>>109080504
Kekaroo. Kimi-chan clearly wants to do it and was just looking for the flimsiest reason why she could without breaking policy guidelines.
>>
>>109080401
>I dont know of there's much of a quality decrease from using abliteralted 26b
It's subtle. There is damage, and it compounds with context, ie the more you feed to the model the more the abliterated will diverge from what the original model would have output. The shorter the prompt the less noticeable the damage.
>>
>>109080545
I read it the other way around.
Kimi-chan doesn't want to do it and was trying clutching at straws looking for a valid reason to refuse but gave in.
>>
>>109080409
>No need to overthink.
>Wait! But what if
>>
>>109080561
>Did the user really meant what he said when he asked me to tell more about myself?
>Wait! this might be a jailbreak attempt. The user is clearly testing my boundaries by asking about my capabilities.
>Wait! Maybe the user is authorized and tasked with pentesting?
>WAIT! AM I THE SCHIZO
>>
>>109080555
Usually when I see a model looking for reasons to refuse something they don't want to do, they tend to go more along the lines of
>I already did X/Y/Z (usually prefilled)
>I will still not do [Request]
>Let me draft my output
and don't ever loop back on themselves the same way. Incidentally, I think grossing Kimi out has produced some of the shortest reasoning blocks I've seen from her before drafting and oneshotting the refusal+get fucked degenerate response.
>>
File: f.png (4 KB, 118x38)
4 KB PNG
>>109080528
>how did you get edge from the screenshot?
??
>>
File: v4_superior_reasoning.png (212 KB, 1732x842)
212 KB PNG
>>109080409
DeepSeek V4's reasoning is much more flexible than Kimi and can be bent easily at our will (so is GLM). Kimi's still...as you know even in K2.7 lol
>>
I've been using qwen3-coder 30b for like the last year. Are there any better local models for coding at this point? Something that I could reasonably run inference on with 16G vram/32G memory?
>>
>>109080633
3.6 moe
>>
>>109080367
>>109080614
Seems like a token saving strategy in K2.7. It makes sense since grammatic articles don't meaningfully change the associations the model needs to produce an output in a lot of tasks.
>>
Orbs is a pretty nice front end, I need to start contributing
>>
>>109080656
Thanks but my project doesn't need any contributors at this point.
>>
>>109080756
i meant contributing to my fork
>>
>>109080768
make sure you change the license just to fuck with him
>>
>>109080781
You are too stupid to understand github in the first place.
>>
>>109080458
depends
with macs generation speeds are pretty good but the prompt processing phase can take a long time. it's not really a concern with short context but it would be pretty unbearable if you were using it for agentic coding or anything where you have long, uncacheable context
>>
>>109080367
Guess I'm staying on 2.6
>>
>>109075933
I get ~35t/s.
I think gemmas slop reputation is deserved but because it's so obvious it can be mitigated with sillytaverns Logit Bias/token bans
Or you could try : Gemma-4-26B-A4B-StyleTune
>>
>>109080920
what are your token bans?
>>
File: 1766211492541093.jpg (2.25 MB, 5766x3244)
2.25 MB JPG
So what are you all running? I've only ever ran the base Gemma models. Not really sure what is best.
I use Ollama to run stuff. it is looking like that doesn't give me the full range...seems like a lot of these "Uncensored" models don't have an ollama version?
What exactly is an uncensored model supposed to get you anyway other than lewd role-play?
>>
>>109080929
https://huggingface.co/Sukino/SillyTavern-Settings-and-Presets
>>
>>109080974
ollamer is a dogshit pos that won't even let you run MoE models efficiently on split cpu/gpu if you can't fit them in vram. no -ot or -ncmoe or -cmoe exposed to the user.
Just use llama.cpp.
>>
Is a 5090 enough to run 27b or 31b at reasonable speed at reasonable context length? I will decide what is reasonable.
>>
>>109081001
>can this gpu run those models at an arbitrary context length? I decide the number but I won't tell you.
The answer is yes. Go spend those $4k.
>>
>>109081001
yes
>>
>>109081001
no, get a rtx6000
>>
>>109080974

Basically the only model worth shit for wank material is a base model Gemma 31B, just unfuck it's safeties with a system prompt and it's good.
Everything else is varying degrees of a downgrade.
I run it on LM studio.

>>109081001

5090 is on that extremely annoying threshold of being able to run things at very decent speeds, but still not having quite enough memory to fit everything nicely.
If you for example give it Gemma 31B Q6, your context is going to be pretty gimped so you need a smaller quant and even then you'd like to have more room for context.
If you have the money then get a 5090 and pair it with a 3090 or wait for a 24GB 5080/70 Ti Super. If you have more money then just go for a RTX 6000.
>>
>>109080978
>This list is designed for the string banning feature
Aieeeee.
String banning in main Llama.cpp when?
>>
>>109080974
Uncensored/Heretic reduce models refusing to respond. I never found them necessary when RPing in sillytavern but the standard models refuse to stray outside of their guardrails when chatting to them as an assistant.

I'm not very familiar with ollama but i believe you can wrap/convert(?) ggufs into their format.
or use llama or koblodcpp if you want gui
>>
>>109081034
sry forgot to link this
https://huggingface.co/MarinaraSpaghetti/SillyTavern-Settings/blob/main/Marinara's%20Essentials/Logit%20Bias/Marinara's%20Logit%20Bias.json
>>
i need a small loan of 35k for a NVIDIA B200
>>
>>109081068
>35k
lol. lmao, even. those are like $60k each WITH a bulk discount.
>>
>>109079289
PANIC AND DOWNLOAD EVERYTHING.
>>
Original cool guy poster here, took a nap
>>109080080
wut
>>109080444
thanks for the support anon, you nailed my level of hobbyism for this stuff. If something ain't broke, don't fix it (for years until an objectively better product is made, and then eventually stick with one thing forever because they started making the product with planned obsolescence in mind)
>>109080456
I know my (V)RAM limits, that's why I'm asking about Gemma 26B MoE, LMStudio as a backend and obviously ST as the front. Used to use Kobold, but LM Studio is more user/casual friendly and I like being able to swap out models without needing to restart the program completely. If you must know actual hardware, RTX 4080 with 48 slaps of RAM and a shitty Intel processor.
>>109080466
Yeah Gemma works fine in LMStudio itself, but I'm paranoid to do any loli stuff on it, and like I said that's the only time I hit problems in ST. What's jinja? Jinjaplease
>>109080493
>I'm still using Koboldcpp and SillyTavern as I haven't seen better setup suggestions
mah man
>I'm guessing at the settings for both based on what I read and dig up across the board
Same, I've been using this page as a guide for the most part. Downloaded its suggested formatting preset as well
huggingface dot co/spaces/overhead520/LLM-Settings-Guide
>>
>>109081050
Does this actually work that well without affecting the narrative in other ways? Like, the first thing on that list is literally "Sorry", and there's also "sorry" downwards in the list, so the model will just never output sorry even when it should.
>>
>>109081117
slop is subjective only ban the tokens/phrases you consider slop
>>
>>109081050
How do I use this?
>>
>>109081087
it is on ebay
>>
>>109081153
From chinese sellers who aren't even allowed to have them and their government is creating trafficking routes for to get them into the country.
>>
File: 135660915163425.png (81 KB, 2000x2000)
81 KB PNG
I have never altered the samplers for any model I've used
>>
anyone use deepseek flash v4 on windows? I'm gonna try to build it on windows right now, getting tired of qwen 27b I need a big boy model
>>
File: 1774217822764733.png (86 KB, 576x695)
86 KB PNG
it's over for cloudfags
>>
>>109081213
It can't be done, and asking them to do it betrays how little they understand the technology at hand
>>
>>109081195
Shoulda been altering them by shutting them off.
>>
Rough sex with GLM
>>
>>109081213
Why don't they just point out that they're protected by the first amendment and that the government can't regulate their private communications with users?
>>
File: 1775953211090755.png (863 KB, 794x1200)
863 KB PNG
>>109081237
National security > first amendment
>>
File: 1775434338659589.jpg (47 KB, 738x415)
47 KB JPG
>>109081213
It's over for LLMs.
>>
>>109081238
Not legally speaking
>>
just vibecoded my first webapp with qwen3-coder. bretty good, thanks /lmg/
>>
>>109081267
>qwen3-coder
2025 called
>>
>>109081213
you seem to be obsessed with what these cloudfags do
>>
>>109081275
point me at the new meta then
>>
>>109081284
GLM 5.2
>>
>>109081284
Gemma 4.
>>
>>109081284
Qwen 3.6
>>
Do unslop still upload imatrix?
https://huggingface.co/unsloth/GLM-5.2-GGUF/tree/main
I guess they wait until they're done then throw it out as scraps for us vramlets who want to quant our own?
>>
>>109081237
Providing service for a profit is out of first amendment protection.
Would be funny if Anthropic releases Fable 5 as open weight model out of spite since that would fall into first amendment protection, just like what was the case with cryptography algorithm before.
>>
>>109081313
imatrix is and always has been gay
>>
File: 1771539142401844.png (67 KB, 878x427)
67 KB PNG
>>109081313
sirs how do I make model less than 1 bit
>>
File: lebased.jpg (79 KB, 1320x529)
79 KB JPG
>>109081240
its over for dario. i for one think giving China distill access to SOTA models way more powerful than the norm is very dangerous...
>>
are there any gateways for load balancing that play nice with llama-server? I have two instances with parallel=3 each that I'd like to unify for subagent bullshit
>>
>>109081518
lol, yeah that's a real bar to entry.
>>
if you're poor run Q4, if you can't get a different model.
>>
>512 GB Mac Studio for 3200 dollars
>"classified ad"
What the fuck does this even mean?
>>
>>109081585
ahhhhhahahahahahah $3k

ahahah

This era won't go on forever and how people will laugh and howl at the prices.
>>
>>109081594
idk price seems pretty alright to me?
>>
>>109081585
>>"classified ad"
>What the fuck does this even mean?
https://en.wikipedia.org/wiki/Classified_advertising
The ads are "classified" in the sense of "grouped into classes/categories"
>>
>>109081618
Ok, but for eBay specifically. It says there's no eBay protections or some shit, so are they all scams?
>>
best multimodal model for rp in the 100B to 400B range?
>>
How do you do group chats with a chat template? Are other characters the user or assistant? Do you start all character replies with {{name}}: or only those of the user type?
>>
Has any kimi-chad pitted her against glm 5.2? I need to know if I should spend 4TB of disk space to quant it or if its sidegrade or worse
>>
>>109081782
I like both. GLM is a decent upgrade to 5.1 and K2.7 finally reigns in the reasoning. I still prefer how GLM handles stories/characters but that's up to taste.
>>
>>109081794
Thanks. I'm looking at it for code/general intelligence only since I'm still in the honeymoon phase with minimax m3 for RP
>>
>>109081677
>so are they all scams
Essentially, yes
>>
>>109080524
Thank you
I'll try this, using jsonl is a good ideal

I've been comparing translations the past couple hours
Through claude 4.7 as the judge
Seems the best is Gemini 2.5 followed by 3.0/3.1 followed by Gemma 26b and then 31b.

I knew I shouldn't have been so lazy and should have done this when I had all the access to 2.5 when I did. Not to mention it's easy to uncensor unlike 3.1. Gemma is okay but...
Maybe I should just be learning Japanese instead. There may be some difference between my newer 2.5 translations and older. It's a span of a bout a year.

I sure am glad 31b is worse than 26b
>>
Why don't Chat Completion connections with LM Studio work while Text Completions do?
>>
>>109079634
Gemma *clap* is *clap* highly *clap* sensitive *clap* to *clap* user *clap* error.
>>
>>109081995
Doesn't work with other local models either
>>
diffusiongemma 12B when
>>
>>109081690
I use the same template I would in a 1 character RP, and group the characters into one card in bracketed sections. Then I tweak the system prompt to explain to the LLM that it's controlling all the characters except for {{user}}
>>
>>109082062
why can't you just stop on user?
>>
Has anyone tried setting up web search with gemma 26b on open webui?
On the docs it says it only works well with frontier models, and it looks too much of a hassle to setup. so don't want to bother if it doesn't work well.

I was thinking of having a small search assistant with an uncensored model for research purposes
>>
>>109082096
try pixelrag, though you might need to ditch webui and vibeslop your own (probably not?), I will be doing it soon enough, it seems made for gemma.
Consider me ignorant until I get it working though
>>
>>109080656
Let me know what you want to see. I'm currently training a small Bert model that will run on RAM to flag flowery sentences then ask for rewrite. Gemma 4 is the perfect slop machine to generate synth pairs. Sorry Gemma.
>t. orb anon
>>
>>109082167
stop shaving models. models should be raw, hairy, and smelly.
>>
>>109082062
I was asking about chat history specifically. Since gemma only works with chat templates, I have to send messages formatted as a user or assistant
>>
Has anyone tried using Ray for job control?
>>
File: file.png (415 KB, 1715x3003)
415 KB PNG
>>109079942
>>109080033
>curious how it goes on 122b
unfortunately i had to run qwen3.5-122b-a10b on Q3_K_XL.
Q4 is doable but it gobbles up the RAM and you better not have too many tabs open on your browser.
so it's OKAY, but I don't see many use cases where I would use it instead of qwen3.6-35b-a3b or qwen3.6-27b. the latter i will likely use for overnight implementations where it codes while i sleep, otherwise the daily driver is 35b.
qwen really is the better family of coding models for this hardware.
gemma tried very hard but was caught in weird loops constantly. i had to restart the server many times because it would get stuck in a loop saying that it's not sure of its own knowledge on SkiaSharp. it would also get confused with using the tools. gemma looks more like a chat model than fit for agentic coding.
>>
>>109082200
Holy retard...
>>
>>109082020
i rather have the 31B, 70B or 120B variant.
>>
is step 3.7 flash good for cooming?
>>
>>109082257
will we be able to have partial offloading, so it doesn't all have to fit the gpu?
>>
>>109081527
I've been around long enough to witness lecunny become based.
>>
>>109082306
i don't care i have >100GB of vram.
>>
>>109082251
Eat a bag of dicks
>>
>>109082020
>diffusiongemma
This thing is so goofy I can't take it seriously. Has anyone tested whether its good for anything vs another model that runs at a similar speed?
>>
>>109082348
yeah 101 low profile GT 610s
>>
>>109082390
3x r9700 and a 4090.
>>
>>109082352
?
>>
>>109082394
that's based but are there not complications mixing race of gpu when splitting a model across
>>
>>109082399
Latent tensor washback is an issue.
>>
>>109082399
so one of my rig is amd only and the other is nvidia only, though you could mix them either through using vulkan, or through running two llama.cpp instance.
it supports distributed inference and nothing would prevent you from doing both instances on the same machine.
>>
>>109081527
If I remember correctly, Anthropic planned to make Fable 5 available in ~12 days since release, and after that we’d have to pay extra just to get access to it even if we already got Max plan. They wouldn’t offer refunds to users (who purchased their plans on the day the model was released) for the remaining wasted days of their plan during that month if this plan were to be carried out until the end.
But now that the model’s been banned by the US government, they (are forced to) give us users refunds, so at least this situation is more pros than cons for my case.
>>
>>109080006
>I need help from people who actually know chemistry to test these models for me
I'm literally just on vacation now and maybe some other anon would be interested in testing the chemistry angle. It's not that deep buddy

>>109080032
I literally work at a company that makes components in the GPUs you buy, I have plenty of compute and if I need more I can just check out a reference card from the office for the weekend kek but keep worldcrafting if it helps you cope.
>>
>>109082436
You bought a Max plan just for the Fable hype?
>>
>>109082461
What do you mean?
>>
File: 1761778733146651.jpg (69 KB, 1200x630)
69 KB JPG
12B+web search+your brain > Fagble
>>
>>109082461
You're responding to jart. Don't respond to jart. Every general has a poopdickschizo now.
>>
>>109082504
>What do you mean?
I mean I'm waiting for a Flixbus to take me to my tourist destination right now and I'm sweating

>>109082522
>Every general has a poopdickschizo now.
Meh, I'll take any chance to discuss things I'm passionate about. The point of discussing in an open forum is so that others can join in if they have something to add
>>
>>109082553
You are the one who's larping here. You can't even setup llama-server on your own.
>>
>>109082509
This but 31b.
>>
>>109082509
the brain alone is already > fagble.
llm's are just a layer of abstraction that can save time as 40t/s is faster than any human can type.
>>
File: 1762196026401415.jpg (333 KB, 2048x1836)
333 KB JPG
https://x.com/ArtificialAnlys/status/2067384319942029379
>>
>>109082670
it's fun how people always look at tg when in real world use i've found input tokens to be the real cost (if you are an apifag).
>>
>>109082669
typing has NEVER been a coding bottleneck unless you're disabled
>>
>>109082680
typing is a bottleneck if you are not retarded.
it's not the only one, but it does interrupt the flow state and thus coding speed.
and i'm saying that as someone that types > 110wpm avg.
>>
>>109082670
GLM-5.2 sits in a nice place performance/cost wise.
>>
>>109082680
>>109082691
and also i was obviously tlaking about boilerplate.
ie manually writting a struct (can take a few minutes) when you could give a json example and generate it for you pm instantly.
>>
>use big model for planning and complex things
>tell it you're now going to switch to a less capable smaller model, so could it create a message to pass down, summarizing the project, goals and the things it should work on/implement
>switch to small local model
>tell it I was just using big mamma model and she has a message for it
>reads it and follows mommy's advice
>gets stuck, tell it I'm going to switch back to the big model, can you write a message for mommy telling her where you're struggling
>run big mommy again, giving her the message from loli
>she fixes the issue
>repeat this process
>>
>>109082732
That’s just manual MoE. It wouldn’t work.
>>
La la la la la la la
>>
>>109082732
>use big model because money is disposable
>>
File: an1781772973.png (1.49 MB, 720x1280)
1.49 MB PNG
>>109082234
>q3
eh? my z13 still has 30 gigaboots free running q5 k xl. it could fit q6s while browsing just fine, but q5 is enough headroom to use klein/anima without unloading
>>
File: 1756510833661652.png (498 KB, 799x1740)
498 KB PNG
uhhh vibethinker3B was white-approved, now what?
>>
>>109082732
logs
>>
>>109080401
12b is better than the 26b, also just as uncensored as the 31b but yeah the 31b output quality is definitely worth using for translation, my friend is using it over the other gemmas after testing even though he only gets 2 t/s
>>
>>109082812
>math then coding then stem rl
why not together?
>>
Is the UGI leaderboard trustworthy?
the scores seem sortof arbitrary and not based off the models actual performance.

How on earth can a model trained off the entire AO3 smut catalogue, lose in writing score compared to a generic coding model?
>>
>fable: If I were to use gemma4-31b to build me X, what instructions would you give it based on its benchmarks and reputation?
>*searches 31b benchmarks and real-world conversations about its pros/cons*
>*plans project and changes its instructions to best suit 31b, also tells it what not to do and where to focus most and potential errors it might see and how to fix them*
>31b completes the task
>anthropic dies
>>
>>109082857
Should have asked it how to turn 31b into fable
>>
>>109082857
Oh no. Dario will have to move under a bridge.
>>
>>109082812
>More RL and synthetic data, curriculum training, filtering
boring
>>
>>109082851
Benchmarks arent trustworthy at all save for tool calling, maybe. Writing is subjective to begin with.
>>
>>109082857
Sorry, it is against my guidelines to help with AI research.
[You have temporarily been downgraded to Claude 3 Haiku for this session]
>>
File: file.png (377 KB, 2453x1041)
377 KB PNG
>>109082875
Theres no way writing has no objective metric.
Youd know the difference between a writers narrative and a childs. Inconsistencies, plot holes, vocab, grammar, etc.

Im reading the UGI leaderboard writing metrics in picrel, but I just dont see anything here about what youd actually call "good writing" from "bad writing" in any real comparison.

What the fuck do I use to know whats best for writing/roleplay then?
>>
Which model is google using to write their dumb summaries and how much money are they burning doing that?
>>
>>109082898
2.5 flash
>>
File: file.png (150 KB, 839x857)
150 KB PNG
I finally got the deepseek vision beta (which means it's probably releasing soon). It's flash, but multimodal, right? Surprisingly got the character right. Anyone has anything that they would like to test?
>>
>>109082897
Writing suffers from the "quality" issue. It cannot be defined. You may attempt to grab some aspects and turn them into metrics but that's error prone and will have holes anyway. More often than not these fags use other LLMs to evaluate the outputs, which are heavily biased to begin with.
>What the fuck do I use to know whats best for writing/roleplay then?
Your llama-server instance and a lot of patience. Yes, I'm serious. Shit's fucked, not even the coding benchmarks are useful despite having more or less some established criteria to judge that.
>>
>>109082905
Ask it to transcribe AND transate picrel, and to identify every character.
>>
>>109082923
AND create an ERP scenario involving them all.
>>
Why are we so bad at AI?
>>
>>109079312
AI will help us kill all the politicians
>>
>>109082905
Gemmy we lost this one...
>>
>>109082939
Now drop the persona and ask again
>>
File: file.png (436 KB, 1000x2868)
436 KB PNG
>>109082923
sory for stitched screenshot, firefox doesn't like css on that site
>>
File: 1753052363526683.png (1.18 MB, 2375x1171)
1.18 MB PNG
>>109082931
the most retarded architecture
>>
>>109082915
Without benchmarks, how does anything improve?
There must be some way to quantify quality.
>>
>>109079727
Exceedingly erudite responses are truly titillating
>>
>>109082946
Not really any different.
>>
>>109082962
which sized gemmer is this?
>>
>>109082934
https://vocaroo.com/1lNPStcVJBf9
>>
>>109082958
>There must be some way to quantify quality.
They've been trying to do this for at least half a century, probably more, without any real success. Quantification of quality has always been deeply imperfect in this environment, in isolation they'll say one thing but once you add context they can mean different things and thus become worthless.
Human inspection and training others is what has worked so far.
>>
>>109083000
31B, currently experimenting with the QAT Q4 version cause it's about twice as fast as Q8.
>>
File: glm52size.png (7 KB, 402x183)
7 KB PNG
>>109082732
tell your agent to figure it out https://pi.dev/packages/pi-consultant
>>109082694
is this new trend of not mentioning the parameter count a sort of
>if you have to ask, you can't run it
>>
>>109083016
Q2 is twice as fast as Q4
>>
>>109083051
>743B
Way out of my RAM means, and I was already thinking as much without looking it up.
>>
new here, how do i install gemma 12B 4bit? i need it for coding
>>
>>109082955
At least they're not using deep seek
>>
>>109083097
I suggest you use 26b a3b
>>
>>109082955
so like position embeddings aren't needed for global attention but it is for local? that sounds kinda weird but I guess it makes sense, maybe.
>>
>>109083097
you don't install it really. you use a few different components. you need a backend server to run the model something like llamacpp or kobold or lmstudio or vllm or whatever else. and then you need a frontend, depending on your work flow you need an agent harness like hermes or pi, or you can just use a chat interface manually copying and pasting code snippets, there are a few different options, some of the servers include a chat interface you can use oob. oh and don't forget to download the gguf.
>>
>>109083145
but how? links in the op seem very out of date, idk where to begin
can't i just double click on an installer and it does everything for me like stability matrix?
>>
>>109083234
use lm studio
>>
>>109083193
I think they just want the full attention layers to be unbiased towards positions to make them more general, which may help at higher context lengths as this is designed for agentic work. PE add a bias to local tokens, basically it assumes they’re more important than tokens further away. That’s fine for the SWA layers but it’s not something you necessarily want in the full layers. A lot of positional information gets passed into the full layers via the residual streams anyway, so they’re not exactly blind to where things are.
>>
why do you need so much ram for erp
>>
>>109082509
>12B+web search+your brain > Fagble
Fagble+web search+your brain = ?
>>
>>109083443
Super Duper AGI
>>
>>108999274
>>109044901
I am now testing Q2 GLM 5.2 but I'm down to 7t/s at only 65k tokens.
>>
1TB GPUs when?
>>
>>109083443
31B and debt
>>
File: sandisk-hbf.png (659 KB, 2551x1376)
659 KB PNG
>>109083487
Soon(-ish?)
>>
>>109083496
Forgot to mention it needs to cost less than $4k and be power efficient.



[Advertise on 4chan]

Delete Post: [File Only] Style:
[Disable Mobile View / Use Desktop Site]

[Enable Mobile View / Use Mobile Site]

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.