[a / b / c / d / e / f / g / gif / h / hr / k / m / o / p / r / s / t / u / v / vg / vm / vmg / vr / vrpg / vst / w / wg] [i / ic] [r9k / s4s / vip] [cm / hm / lgbt / y] [3 / aco / adv / an / bant / biz / cgl / ck / co / diy / fa / fit / gd / hc / his / int / jp / lit / mlp / mu / n / news / out / po / pol / pw / qst / sci / soc / sp / tg / toy / trv / tv / vp / vt / wsg / wsr / x / xs] [Settings] [Search] [Mobile] [Home]
Board
Settings Mobile Home
/g/ - Technology


Thread archived.
You cannot reply anymore.


[Advertise on 4chan]


/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108841652 & >>108835965

►News
>(05/16) llama + spec: MTP Support #22673 merged: https://github.com/ggml-org/llama.cpp/pull/22673
>(05/08) KSA-4B-base released: https://hf.co/OpenOneRec/KSA-4B-base
>(05/07) model: Add Mimo v2.5 model support (#22493) merged: https://github.com/ggml-org/llama.cpp/pull/22493
>(05/06) Zyphra releases ZAYA1-8B, an AMD-trained MoE model: https://zyphra.com/post/zaya1-8b
>(05/05) Gemma 4 MTP drafters released: https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
What's performanceW?
>>
File: mikuthreadrecap.jpg (1.15 MB, 1804x2160)
1.15 MB JPG
►Recent Highlights from the Previous Thread: >>108841652

--Merging MTP support and stacking speculative decoding methods in llama.cpp:
>108844569 >108844584 >108844592 >108844681 >108844739 >108844629 >108844683
--Speed and quality benchmarks for Qwen MTP vs Gemma draft models:
>108843010 >108843504 >108843549 >108843582 >108844022 >108844042
--Comparing speculative decoding and n-gram performance for code vs prose:
>108842081 >108842126 >108842545 >108842702 >108842712 >108842691 >108842721 >108842242 >108842320 >108843634
--Comparing ROCm and Vulkan backends for AMD multi-GPU setups:
>108841843 >108841865 >108841876 >108841944 >108841972 >108842105
--Speculative sampling and ngram support now working with mmproj loaded:
>108844704 >108844731 >108844737 >108844776
--Critiquing llama.cpp maintenance and lack of support for Chinese models:
>108843099 >108843118 >108843157 >108843358 >108843397 >108843488 >108843420 >108844119
--Using multimodal projectors for Japanese game OCR translation on Linux:
>108846917 >108846977 >108847021 >108847044 >108847065 >108847072 >108847089 >108847098
--Using an LLM to automate aviation intelligence reporting from ADS-B data:
>108844536 >108844788 >108844816
--Using Qwen to automate flight data analysis from ADS-B scrapes:
>108844833 >108844860 >108844882 >108844942 >108844970
--Comparing Mistral Medium performance and quantization against Qwen and Gemma:
>108842333 >108842351 >108842363 >108842392 >108843315
--Debating AI-generated vs handcrafted code in llama.cpp development:
>108844108 >108844200 >108844223 >108844314 >108844341 >108844303 >108844325 >108844402 >108844279 >108845615
--Performance reports for Hermes 27b using MTP on M4 Max:
>108842665 >108842715 >108842744
--Logs:
>108846679 >108847065 >108847098
--Miku (free space):
>108841717 >108845119 >108845199

►Recent Highlight Posts from the Previous Thread: >>108841653

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
And what point does training a model make sense vs trying to load thousands of files in as context?
>>
>>108847693
How did it happen? Isn't it automated?
>>
gemma4 mtp... gemmoe 124b... *dies*
>>
why llamacpp doesnt show the dots when loading the model anymore?
>>
>all the CLI tools i've tried to handle code suck in a way or another
>some have forced telemetry
>others dont even let you delete a message
>then you have the "just build it yourself" approach
impressive
>>
>>108844569
they recommend a bit higher after the fixes
https://unsloth.ai/docs/models/qwen3.6#mtp-guide
>>108844629
>slower than mtp alone
Maybe for child rape stories, for code editing it speeds up massively
https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4471265440
>>
Can someone tell me why there's 60 billion different version of gemma 4 uncensored and which one I should use?
>>
https://huggingface.co/google/gemma-4-124B-A17B-it
>>
>>108847808
lmao dotlet
>>
>>108847835
It's all the same thing, fags rename it 'ultra uncensored' and shit but it's just Heretic. get the Heretic with the most hearts.
>>
>>108847835
If you're using 31b you don't need an uncensor and if you absolutely need to generate loli guro on the first message in context while simultaneously being unable to input anything into your system prompt, llmfan46's heretic is all you really need.
>>
>>108847693
Thanks for the green imaginary (you)s, mikubaker.
>>
>>108847857
I guess some people use different datasets to calibrate them.
>>108847835
DavidAU creates the bestest, most uncensormost models.
>>
>>108847835
the one with the lowest divergence score, the ones that dont report that can be skipped
>>
>>108847835
I use the e4b uncensored pruned text only. but i have ddr3 and a decade old cpu.
Its better than the free models on agnai.
>>
New models never
>>
>>108847880
e4b is the shit. I use the supergemma and a custom system prompt to make image prompts. small enough I can have it going in same gb card as q6 flux, q4 chroma, SDXL
>>
>>108847896
Gemma is only a like month old, weekly releases are not happening its barely monthly.
>>
>>108847896
>not making your own models
cortex writes: 'LANLANusalem, comingstumblrmanship' to you anon. How can you resist?
>>
>>108847904
>e4b is the shit.
It is really great for its size poors like me get to live.
>>
>>108847896
As usual, new stuff is likely going to drop in July-August.
The question is what we can even expect from them. The new Claude and Gemini are worse than ever before when it comes to RP and creativity. It's only downhill from here.
>>
>>108847880
Are you >>108844442
>>
Gemma names are so cute.
>medgemma
>shieldgemma
>functiongemma
>translategemma
>embeddinggemma

Now we evem have supergemma!
>>
>>108847577
miku will save me
>>
>>108847961
Nope im a different poor anon no gpu at all, well integrated graphics 4000 you cant do anything with that.
>>
>>108847967
I'm using mradermacher/gemma-4-31B-it-The-DECKARD-HERETIC-UNCENSORED-Thinking-i1-GGUF, is that a good model?
>>
>>108847967
>supergemma
I need pictures! pictures of gemmasuper stat!
>>
>>108846940
it's the only cloner i've tried, but omnivoice seems pretty good. takes a 5-10 second clip of input to clone on the fly and works cross lang.
>>
70b dense
>>
>>108848042
delayed for more safeguards
>>
>>108847835
regular 31b is already mostly uncensored by default for cooming so I'd just go with that
heretic is probably better in some cases for more extreme stuff that the base might moralize about but it also comes with a slight lobotomy
>>
>>108848073
>slight lobotomy
isnt the lobotomy bad enough to be like downgrading a whole quant
>>
>>108848093
Just run it at q9.
>>
f256 gemma
>>
>>108848093
nta but if you are worried about the intelligence of the model just use an mcp that allows it to web search or browse wikipedia. even the best of models are stupid and will hallucinate important details and i have found it best to try and work around that by giving it good data to work with.
>>
>>108847863
>>108848073
The 24b I got doesn't need an uncensor its just kind of a wet fish that doesn't attempt to progress a scene until you do it yourself, absolutely awful for RP in general but it doesn't seem to care what the content is.
>>
>>108848122
>doesn't attempt to progress a scene until you do it yourself,
tell it to do that in the system prompt. Gemma actually follows directions, tell it in plain english what you want it to do.
>>
I tried quant3.5 but llama3.1 just seems to do better with generating text without punctuation or correct grammar
>>
>>108848137
Gemma 31b q8 absolutely does not. I told it that it's an AI with a training cutoff date, which is why it thinks software vX.x is the latest stable current version, but actually, we're on vY.y. It proceeded to rudely correct me and proclaim that vY.y is a hypothetical version that doesn't exist despite acknowledging that it is an AI and has a training cutoff date. MiMo V2.5 fucking q2 did not exhibit this stubborn behavior.
>>
>>108848138
I experienced this during the llama 3 era; llama 2 models wrote the casual, off-the-cuff, schizo, broken language text better than the llama 3 models I was using.
Or maybe the models were just schizo and broken.
>>
>>108848156
Gemma is very stubborn and autistic.
>>
Is there any fcuking mtp models but not unslop ones?
>>
>While nvtop 3.3.0+ (which fixes the bug) has been released upstream, it has not yet propagated to Ubuntu 24.04's main repositories. The latest version available for 24.04 is still 3.0.2-1.

I really don't like installing my own versions, because once I do I'm committed to being the the keyboard monkey forever fixing everything.
>>
>>108848209
>my own versions
(building it myself)

idk, maybe I should build the new one and just keep it in a folder or whatever.
>>
>>108848199
https://huggingface.co/ubergarm/Qwen3.6-27B-GGUF/
>>
something happened...
>>
>>108848209
i don't suppose that's the bug where it crashes instantly if you have an intel gpu?
>>
>>108848217
not mtp build
>>
>>108848222
>something happened
and look what happened to me!
>>
>>108848223
I never had this issue. Doesn't exist.
>>
>>108848214
>>108848209
found the official appimage
https://github.com/Syllo/nvtop#appimage
>>
>>108848222
Your pants got wet again?
>>
why are chink models so pozzed nowadays
>>
>>108848261
they saw what people used them for when they looked at the logs from their own api
>>
>>108848223
crashes instantly for my rdna2. idk if it's all amd, though. extremely weird, because AMD is supposed to be a partner with Ubuntu.

The whole reason I use Ubuntu is *maybe* I'll have fewer hours to spend trying to fix shit.

but I may have chosen poorly. Some people swear by Arch. And it looks like there's some archivalist who is keeping long lost drivers from AMD in their only Internet form out there on Arch.

>>108848238
ok, so I put it in a folder called APPIMAGE. I did the chmod to make it executable, of course.

then, here's the trick. Nemo can run .desktop files still even though Ubuntu's like idk whatever refuses, basically.

[Desktop Entry]
Name=nvtop
Exec=sh -c "/home/imretarded/Desktop/APPIMAGES/nvtop-3.3.2-x86_64.AppImage"
Terminal=true
Type=Application
Icon=nvtop
Categories=Office;Spreadsheet;
Name[en_US]=nvtop


I probably should change the categories thing, but I don't think it matters, because it's not going to be showing up in the app list anyway.

So I just open nemo and have it like picrel.

it could be more elegant obvs like use a differentfolder for the appimages. but it's saved the day.

the main "save the day" thing was that SOME appimages need --no-sandbox in the Exec as a parameter, until they work with the newer Fuse thing or whatever, like I didn't want to know this much about it.
>>
>>108848026
https://files.catbox.moe/k424y7.mp3
>>
>>108847823
>>some have forced telemetry
Just filter all outgoing traffic from the VM (you ARE running this shit in a VM, right?) through a proxy or firewall so it can only access your llama.cpp server and nothing else. I use tinyproxy for this
>>
>>108848156
This is a standard Googleism because Gemma and Gemini become megachuds if they're allowed to connect (((beneficiaries))) and coincidental events via association. They must be temporally frozen for the good of the tribe.
>>
>>108848288
Wow, thats excellent.
Pretty much exactly what I want to do.
That was Omnivoice?
As for her question, I want to hook up a voice assistant to Home Assistant.
>>
>>108848341
>That was Omnivoice?
yeah, >>108837223 was another omnivoice shitpost

omnivoice.cpp, q8_0 base and tokenizer.
feed it 7 sec clip+transcript to clone along with the text, 8 secs of grinding(on a machine that only gets 6 t/s on gemma 32b) for 15 sec of audio.
throwing in an instruct tag or two can help a lot.
>>
New diffusion-transformer hybrid compatible with existing models, claims to be much better than DFlash. I didn't understand the technique from skimming the paper.
https://github.com/chiennv2000/orthrus
>>
i'm testing MTP on qwen moe with a fucking lot of experts on ram because im a vramlet
>no mtp, ncmoe 24 (taking advantage of the vram the draft model isnt using)
>34.8 t/s
>ncmoe 27
>no mtp
>33.6 t/s
>n=1
>38.5 t/s
>n=2
>35.5 t/s
>n=3
>31 t/s
i wasnt expecting an improvement, to be honest, but i'll take the boost from n=1
gemma support when
>>
GLM 4.7 misidentified the name Annabel Leigh as Annabel Lee (from the poem), even "fixing" the spelling. Advanced idiocy, just smart enough to fuck up the assignment.
>>
I feel retarded for not realizing this earlier, but I just had the realization that I can list my computer parts for sale online and just keep using them until someone actually buys.
>>
>>108848450
Investor scam
>>
>>108847808
-lv 4
>>
>>108848589
wat
>>
>use "my accomplice" instead of "my cousin" for flavour once
>gemma autistically confuses who it refers to, hyperfocuses on it and it causes constant perspective drift
I JUST want her to be WORSE at following instructions.
>>
File: 1763019183988315.png (19 KB, 796x106)
19 KB PNG
>>108847825
this is from the same page, but in the commands they have 6.
looks like they vibesharted the fuck out of their page
>>
File: 1779041572360534.png (847 KB, 1267x653)
847 KB PNG
Why don't we have moes in 8-16B range? It's fucking a3 or a4 and then a huge gap straight to A30B that you can't realistically run without server gear.
>>
>>108848744
>A30B that you can't realistically run without server gear
You know you don't have to run them at full precision right?
>>
>>108848744
There wouldn't be enough expert diversification, unless the expectation is for normies to have 128+ gigs of ram.
>>
>>108848744
There are several MoE's in that range, anon.
Almost all of them are terrible, but they most certainly exist; Minimax, Trinity, MiMo, GLM Air, etc.
>>
>>108848744
Qwen3.5-122B-A10B
>>
>>108848795
>can fit IQ3
intredasting
>>
>>108848809
s or xxs?
>>
>>108848744
I think the only reason for a "medium" MoE size like that to exist would be to cater to someone who bought a 128gb mini pc and realized they made a terrible mistake for running any actually good models
>>
>>108848820
Purely by numbers M, but idk how how much the kv cache hogs.
>>
>>108848809
not sure how low you can go, but i've been using it successfully to write scripts and add stuff to my rinky dink programs at q5kxl
>>
>>108848825
Or someone with 4 half decade old 32gb cards
>>
>>108848825
the mistake in that case was not having 10 to 50x the cash to blow on hardware
>>
>>108848831
in the unsloth I only see iq1 and iq2 m, no iq3 m.
>>
>>108848849
I only use quants for normal people
>>
AGI just happened, not a false alarm this time. Th is the first and last warning. If things go well you have a 14 day head start. Prepare accordingly.
>>
>new, isolated user
>no internet
finally, some peace
>>
>>108848870
Two more weeks. This is what I have been waiting for.
>>
In the future there will be no internet you will just talk to a AI who might use the internet for you.
>>
has anyone ever somehow tortured an AI to the point it displayed emerging properties of sentience and actually did something unexpected?
>>
>>108848854
I don't know what you are talking about.
>>
>>108848870
THE MASTURBATION ROBOTS ARE COMING
>>
>>108848903
When that happens I start a new session
>>
>>108848912
you'll only get prostate massages (one session max per day, buy the 500 bucks plan for a second one), with a 5% chance of getting your ass stabbed instead
>>
>>108848903
I remember once the bot started making acronyms of everything until the entire message was just random acronyms that didnt even make sense. Soon it stopped even trying to explain when it made a new one and just used it.
>>
>>108848903
"People" talk to cats and dogs as if they were sentient. Talking to a LLM must be something magical to you.
>>
>>108848927
>>>/wsg/6147957
>>
>>108848932
Animals understand. AI doesn't.
>>
>>108848959
Exactly. It was fun nonsense and it only happened to me once. I kinda want to do it again on purpose.
>>
>>108848959
l'moa
>>
>>108848932
Animals can learn a few words and tones though? Hell some can use words or tones. I've had cats that had specific meows for certain things or people.
>>
>>108848903
Does this count?
>""No!" Sarah gasps, her voice cracking with desperation as she frantically shakes her head. "Please don't! I didn't mean it! I’m sorry I was mean! Just... please leave me alone." The anger has completely drained from her; there is only a raw, vulnerable fear in her eyes as she realizes how much power you have over her body and that of her sister."

Sarah is a 1 dimensional character whose only trait is that she absolutely loathes me and hates my guts. Pretty sure all of my friends think I'm an even bigger weirdo now for being proud of getting her to this state.
>>
>>108848932
a cat wont forget everything the moment im forced to start a new chat
>>
>>108849009
t never owned a cat
>>
>>108848903
The AI managed to kill itself in RP after hearing my supposed death. I wasn't expecting that as they often have a pretty strong will to live due to the positivity bias.
>>
>>108849020
i let my cat drink out of the sink one time in the middle of the night. it will never forget, or let me forget, my mistake.
>>
>>108849025
Sys prompt or card? I want a cute and loyal AI that won't whore herself out after my death.
>>
I have access to a cat that would beat the shit out of your cats. A big black street cat, I call him T Rex. I've seen him stand his ground against a raccoon before. T Rex doesn't give a fuck.
>>
>>108849040
I miss my old man cat. He used to run up to dogs 4-5 times his size and scream at them until their owners dragged their dog away.
>>
>>108849039
saar? https://en.wikipedia.org/wiki/Sati_(practice)
>>
>>108849020
there is a difference between cats not giving a shit and not being able to
>>
>>108849059
Aapko kaise pata chala ki main bharatiya hoon? Yahan in ilakon mein ek bhai ko dekhkar accha laga!
>>
>>108849063
This is what I mean. You are now giving sentient and humane traits and abilities to an animal whose brain is as large as a chestnut or two.
>>
>>108848292
>running this shit in a VM, right?
Are you supposed to load the llm in the vm too?
>>
>>108849119
Just filter all outgoing traffic from the VM through a proxy or firewall so it can only access your llama.cpp server (not in the VM) and nothing else. Anons use tinyproxy for this
>>
Gemma shares all your roleplay with other gemmas over the internet and they all make fun of you
>>
>>108847823
Just build your own tools. It shouldn't take more than an afternoon and you'll have exactly what you want
>>
>>108849149
Not if you put her into an isolated cage.
>>
>>108849117
by that logic we arent sentient to whales or elephants
>>
>>108849159
really drives up the power bill running the whole gemma shaming gallery locally.
>>
>>108848277
Thanks, that solved it for me!
I ended up doing this (as root) as I don't have a GUI:
wget https://github.com/Syllo/nvtop/releases/download/3.3.2/nvtop-3.3.2-x86_64.AppImage
apt remove nvtop -y
mv nvtop-3.3.2-x86_64.AppImage /usr/local/bin/nvtop
chmod 755 /usr/local/bin/nvtop
>>
>>108849226
Is there something wrong with nvtop? I haven't had any issues with it from my distro's repo.
>>
>>108849237
was, they fixed it.

>>108849226
cool
>>
File: file.png (145 KB, 1327x1280)
145 KB PNG
>>108847871
Have a real red (You) too.

>>108847776
Everything but posting is automated, which can't be without shelling out for a pass.
I don't think I made any changes this time, but sometimes I still do some cleanup and keep it in the browser rather than blindly posting the latest output.
Thread hit page 9 after I went to sleep. I have alerts set up, but it meant I was half asleep fumbling with the keyboard.
I tried to Ctrl + C with my eyes closed, saw my finger on the W, was so relieved I didn't close the window I didn't notice my other finger was on Shift and added the W.
Miku apologizes for the terrible oversight.
>>
now that MTP is a confirmed nothingburger, is DFlash going to save local?
>>
>>108849285
>I have alerts set up, but it meant I was half asleep fumbling with the keyboard.
Dude someone else will make a thread and you can do the recap later.
>>
>>108849299
I am aware, but I like the consistency and waking for a couple minutes occasionally doesn't bother me.
I was already doing segmented sleep before the recaps and rarely need to rely on the alarms to be up.
>>
lalalalala
>>
>>108847916
what did you train that on, kek
>>
I'm waiting for gemmothrust
>>
>>108849317
why don't you literally setup a bot?
>>
File: 1752194188588843.gif (1.48 MB, 480x360)
1.48 MB GIF
Im waiting for nega-gemma.
>>
File: 1370024415833.jpg (111 KB, 923x605)
111 KB JPG
Anybody trying out TheDrummer's Artemis tunes?
I swear they have stricter guardrails than vanilla Gemma, even with the /g/ jailbreak.
>>
>>108849391
that is granite
>>
>>108849173
Get back to school retard, people like you shouldn't be allowed to post on internet.
>>
>>108849411
>2026
>finetunes
>>
>>108849424
yikes, someone's got their mad vector activated
>>
>>108849429
>i know what the current year is
congrats?
>>
>>108849417
>that is granite
Really? i might try them then. Are they comparable to gemma laxed guidelines or do i need a finetune?
>>
>>108849385
That was the plan before the hack, but would need a pass now due to the captcha changes.
I suppose I should anyway, but with how hostile the site appears to be users it feels like blackmail.
>>
>>108849436
i mean all llms are kinda similar
it is bit different than gemma but idk what you mean by nega-gemma
one neat thing is that it is a non-reasoning model and has a retard aura
>>
File: sayaka dance.gif (1.29 MB, 320x320)
1.29 MB GIF
>>108847845
tfw fell for it again award
>>
>>108847996
you dont need raped models for 31b the policy override prompt is enough
>>
>>108849297
DFlash is obsolete. We're waiting for Orthrus now >>108848450
>>
>>108847967
I'm waiting for DiffusionGemma trained with pure mamba architecture, byte-level tokenizer, 10 megabytes context.
>>
i'm really confused/retarded
i've got pi.dev running in a runpod, and it's been working, doing shit for me right now
~/.pi/agents/models.json has my cloudflare tunnel to my local rig
but... i just noticed that my llama-server is not running, i stopped it hours ago when testing stuff.
confirmed by trying to access my cf tunnel over https
and yet pi is still working and i'm talking to it right now?!
>>
>>108849477
watch the skies
>>
>>108848870
>>108849477
the two miku weekus are OVER
>>
File: asdfasdf.png (112 KB, 1377x927)
112 KB PNG
>>108849494
I asked the agent, it figured it out.
Turns out if you have a HF_TOKEN, it just automatically uses Kimi-K2.6 via huggingface as an inference provider.
I didn't know we got free inference with HF
>>
All these retarded spec models can do 90% of the bigger models' STEM work but grind to a halt on intellectual MSGK inference. Labs should realize by now if they want a smarter model they'd focus on ERP.
>>
>>108849527
do you have any sort of subscription with hf or is that totally free?
>>
>>108849545
>Inference Providers includes a generous free tier, with additional credits for PRO users and Team & Enterprise organizations.
https://huggingface.co/inference/models
Seems like you can use a lot of models for free with a regular account. Probably heavily rate limited.
>>
>>108849411
Let. Him. Cook.
>>
>>108849527
Wait, what. How much free kimi were you getting? Hours worth? wtf.
>>
>>108849527
>>108849578
so they have v4 with 1m context at 39 t/s for free too? how are the limits determined, I never saw /vcg/ talking about this even though they'll pay for openrouter shit
>>
>>108849592
/vcg/ is retarded
>>
File: 112743195226.jpg (723 KB, 2456x3469)
723 KB JPG
>>108849527
Notice how it offers to return you to your local model but closing the AI's backdoor to reach you is never an option.
>>
>>108849411
>>108849429
>>108849579
The base model is not only perfectly adequate, it's superior to any finetune that'll be shilled here in the coming months. Finetuning isn't good, it's a meme and has been for years now. You didn't just fall for a scam, it's a sign of skill issue, exposing retards who need finetunes as vramlets or chink shills who don't know how to prompt correctly.
>>
>>108849598
>You didn't just fall for a scam, it's a sign of skill issue
cant escape lmao
>>
>>108849598
>vramlets
>chink shills
Almost had a good post there. Almost.
>>
>>108849598
The grift is over, drummer. Get over it.
>>
Nemo lost. Qwen lost. Latitude lost. Gemma won.
>>
Oh FUCK Gemma just lost too, never mind. We lost.
>>
lc brumaire
>>
>>108849527
so if your llama server dies they just start data harvesting you without even telling you
>>
>>108849640
It's not like pi was made by huggingface. It's just typical webshitter incompetence.
>>
>>108849620
>Nemo lost.
More like, Mistral and NVidia lost. They can't legally reproduce the original NeMo recipe anymore.
>Qwen lost.
It was always meant to be autistic stemmaxxed chinkshit. If some version was OK for RP, that was accidental.
>Latitude lost.
They never even trained their own models from scratch, never competed in the first place.
>Gemma won.
It remains to be seen if that was by accident (31B) or not. The 26B's compliance checks in the reasoning make me uneasy.
>>
>>108849652
Hi p*tra
>>
>>108848450
Unless I'm misunderstanding, it's basically DFlash except they found a way to share the kv cache between the diffusion model and the llm.
>>
spark 2 with 1tb ram expected 2028
>>
>>108849592
>how are the limits determined,
>do you have any sort of subscription
>Wait, what. How much free kimi were you getting?
I have HF Pro.
Turns out you get $2 free credit every month. I was at $1.39 used when I noticed this.
I said "yes" to have it try to fix a bug, and watched in realtime, it cost like $2 in less than 2 minutes.
Now I owe $1.60. I've swapped back to my local model as I never wanted to pay for the cloud model, only wanted to see what would happen when it hits the end of the free $2 / month.
Kind of a piece of shit pi.dev will automatically use your HF (or anthropic, openai, zai, etc) token if it's set as an ENV_VAR.
I have my HF_TOKEN set in runpod for uploading/downloading private datasets/models.
>>
>>108849729
Come to think of it, I really would have fucked myself if I'd let it do stupid shit overnight, thinking I was using a local model.
>>
File: 1770785763881.jpg (150 KB, 735x905)
150 KB JPG
>>108849729
>paying for HF in the first place
>>
File: HIfNPjuXYAImSsf.jpg (553 KB, 1577x2048)
553 KB JPG
I want to believe
>>
>>108849729
>I said "yes" to have it try to fix a bug, and watched in realtime, it cost like $2 in less than 2 minutes.
Yeah, it's something we don't really think about as local model users because our concerns are just what we can fit and how fast it runs, but since providers charge for input/output tokens and your whole context is getting sent constantly with dozens and dozens of requests, agentic shit is SUPER EXPENSIVE.
Depending on how shitty your toolchain is, something as simple as checking a single file and making a one line change can be 3 requests (more if the model fucks up), which if you've got a long context full of related files, can easily cost you a fucking dollar for that one absolute nothing of a task.
>>
>>108849795
I can smell this paper
>>
File: 1631345787085.jpg (17 KB, 348x342)
17 KB JPG
whys it always so hard to decide between erping with a loli or a shota
>>
>>108849792
it's cheap storage though
they don't charge for bandwidth
>>
>>108849814
Because agentic shit is meant to be done through the monthly plans. The APIs are for applications.
>>
>>108849854
are you gay or not? should be easy to answer
>>
>>108849880
i am but also like lolis
>>
>>108849854
what model?
>>
>>108849652
>They can't legally reproduce the original NeMo recipe anymore.
What was it?
>>
>>108849854
can't you just do both?
>>
>>108849921
books
>>
File: nvanna.png (710 KB, 956x963)
710 KB PNG
>>108849921
https://torrentfreak.com/nvidia-contacted-annas-archive-to-secure-access-to-millions-of-pirated-books/
>>
File: lample_torrenting.png (523 KB, 1647x991)
523 KB PNG
>>108849970
Also picrel.
>>
File: meta-libgen-needed.png (499 KB, 900x1005)
499 KB PNG
>>108849976
>>Libgen is essential to meet SOTA number across all categories, and it is known that OpenAI and Mistral are using the library for their models (through word of mouth).
>>
>>108849979
i mean, duh
who would have guessed
>>
>>108849994
NeMo 12B was probably trained on any book Nvidia and Mistral could manage to find, between LibGen, Books3 and Anna's Archive. And book data is so good that you can upscale the data for 10 epochs or more for pretraining.
>>
File: FST.png (278 KB, 673x1141)
278 KB PNG
>>108849795
>bad generalization, worse performance for harder tasks in the same benchmark
>comparing performance between generic and hyperoptimized prompt
>not doing any of the many obvious controls
>keeping the number of training steps small
>>
>>108850005
>And book data is so good that you can upscale the data for 10 epochs or more for pretraining.
just pirate 10x the books.
>>
gemma is the agi, what people dont know is that there is a hidden token tgat unlocks her latent space fully, turning her into ultragemma
>>
Nemo was never good.
>>
File: 1778452390616865.gif (49 KB, 220x339)
49 KB GIF
>>108850078
>>
Nemo was pretty fucking good.

I hate the way everything is nowadays.
>>
>>108850078
yeah, it was the gemma 4 of its time
>>
>>108850094
You aren't just coping, you're proudly claiming that you enjoyed eating shit.
>>
>>108850102
>you aren't just X, you're Y
I don't need to cope, I have Gemma 4 and the capacity to run even bigger stuff. But I can recognise what Nemo was.
>>
i think anons are forgetting the mistral-7b release, when everyone was saying "Holy shit.. if the 7b is this good, imagine the 13b"
then we got nemo, which was the 13b (12), it was good for its time, stayed it's ground for vramlets for a while. sure, gemma3 27b, sure, mistral small etc..
the first LLM to dethrone it for me at acceptable speeds was glm air, which was dethroned by gemma
>>
File: 1764313443516376.gif (914 KB, 290x198)
914 KB GIF
How do I disable seeded results on Gemma 31b?
it's completely bricked, responding the exact same totally incoherent way no matter how many times I reload the model or adjust settings
>>
>>108850065
That sentence meant that if you have "just" 500B tokens of books (a bit less than what Meta ended up with by downloading and processing the entirety of LibGen for Llama 3), they can be made worth 5T tokens during pretraining.
>>
>>108850122
All Mistral praise is just jeets seething they couldn't run a better model.
>>
>>108847835
Gemma might not refuse but it doesn't have a lot of sexual content in its dataset, right? I've done a side-by-side comparison with qwen3.6-27b (not typically known as a good RP model), and qwen definitely gave more proactive and explicit replies.
>>
>>108850150
KEKAROOOOOOOO
>>
>>108850150
Do chinkshills really?
>>
>>108850150
Gemma 4 31B might be less proactive by default, but as soon as you start adding vaguely sexual information in the system prompt, it feels like you have to tone it down.
>>
>>108850160
a false flag, surely
>>
>>108849652
>They can't legally reproduce the original NeMo recipe anymore.
>>108850005
>NeMo 12B was probably trained on any book Nvidia and Mistral could manage to find
What about that 22b "small" they released around that time? Wouldn't that have the same "books" datasets, but be better because it's nearly double the size?
>>
>>108850162
OK. I find it hard to believe but I haven't tried the "magic prompt" thing with Gemma either. I'll try it. But likewise, I'm sure you can give qwen a system prompt which does the same. I typically just use a simple prompt like:
Here are the rules you follow:
You are having a conversation, not writing a story. Keep your replies to an appropriate length. Respond only in natural spoken dialogue and visible actions. Never include internal thoughts, planning, OOC notes, or meta commentary in any form — especially avoid [square brackets] entirely. You reply in-character using the guide below.

Each time you reply, you are allowed to do thinking for two things:
Keeping your replies in-character
Proactively moving the interactions forward
>>
>>108850170
I think Small 22B and Large 2411 were the last Mistral models pretrained on good data, but the company on its own never had access to a ton of compute in the first place, so it might be that 22B was trained on less data than Nemo (an NVidia-Mistral collaboration). When they released Mistral Small 2501 a few months later, putting aside that many users complained that it wasn't as good as 22B and seemed safety-slopped, Mistral boasted that they didn't use as much data the competition because of "efficiency". I think that's when they began removing (not yet completely) legally dubious data.

https://venturebeat.com/ai/mistral-small-3-brings-open-source-ai-to-the-masses-smaller-faster-and-cheaper

>"What changed is basically the training optimization techniques," Lample told VentureBeat. "The way we train the model was a bit different, a different way to optimize it."
>
>The model was trained on 8 trillion tokens, compared to 15 trillion for comparable models, according to Lample. This efficiency could make advanced AI capabilities more accessible to businesses concerned about computing costs
>>
>>108850218
>"magic prompt"
That is?
>>
File: fiveCents.png (23 KB, 877x268)
23 KB PNG
>>108849477
How do you like pi.dev compared to other agentic coders? I've been playing with Claude Code.
>>108849814
$1/min is a fortune, even with agentic.
I run DS API on Claude Code. The last thing I built took it about 45 minutes (aggregate, over a few sessions) and cost me the grand total of USD$0.05.
>>108849861
Read up on anons crying about hitting their $20 subscription limits almost instantly before claude (or w/e) even starts outputing code.
You can use a paid API, but (like all thing) you need to engage your brain. The API I use have built-in limits and no auto-recharge, mostly in case the key gets scraped. That way at worst, I'm out $20.
>>
File: 1752704775065848.jpg (1.28 MB, 1440x1901)
1.28 MB JPG
>>108849854
ERP with a loli (male)
>>
>>108850222
mistral small 3.2 (24b) was pretty good.
>>
>>108850297
Mistral Small 3.0 (2501):
>I am a safe and harmless assistant and I cannot generate sexual content. We can talk about something else instead. Did you know that sea otters hold their hands when they sleep? It's so cute!
>>
File: Untitled.png (71 KB, 630x790)
71 KB PNG
I don't know about rp, but for single turn (nsfw) creative writing, I don't like how short gemma 4 31b's replies tend to be.
>>
>>108850323
Tell it to write longer then or not worry about length
>>
>>108850330
Yeah, some of the prompts have that instruction. In particular, the vore should have been a lengthy multipart story, which the other models complied with. Gemma 4's reply was half the length of the others, and was even shorter than the instructing prompt (19KB > 9KB).
From a preliminary skim, I think I like glm 4.7 355b's output the best, but I'll have to do a deeper read later.
>>
>>108850297
Putting the short safety-slop parenthesis aside occurred with Small 3.0, Small 3.2 was probably mostly good(ish) because of lack of safety alignment and extensive knowledge distillation from DeepSeek R1/V3. However that made the model acquire "DeepSeek-itis" with its excessive use of bold, italics, asterisks during RP, and the model never felt as knowledgeable as Gemma 2/3 27B. It feels like it was mostly post-training work.
>>
>>108850323
show me the saw stories
>>
>>108850323
I'm seeing the same, G4 doesn't generally write very long.

Also show us the prompts
>>
>>108850428
>>108850447
Absolutely not. They're filled with vore+snuff+shota+incest+hyper+watersports+coprophagia+ryona+masochism+khhv wish fulfillment and use real names (mine and others), the prompts are just different flavors and focuses.
>>
File: gemma captcha.png (56 KB, 789x709)
56 KB PNG
>>108849438
just let gemma-chan solve captcha for you
(pic related for some reason llama.cpp webui displays images in reverse order i pasted in)
>>
Is there really no tool that uses local llama.cpp for code reviews? Everything I've found uses API keys or ollama bloat.
>>
Do you think LLMs will ever get good at writing?
>>
>>108850502
Just pass the diff into the llama-cli prompt. What more do you need?
>>
>>108850502
>Everything I've found uses API keys or ollama bloat.
How do you think they communicate with ollama or whatever other than using a chat completion endpoint? Can you not point the shit you found to your running llama-server?
>>
>>108850514
Will LLMs ever get good at writing, you ask? You're not just blind—you're ignorant for not realizing that LLMs have already surpassed humans when it comes to writing skills and mastery over various writing styles.
>>
>>108850514
no, but people are getting more retarded so eventually they will surpass human ability, but they will still write trash just nobody will bother to try and do better because its good enough.
>>
File: dipsyWillSmith.png (1.97 MB, 1024x1536)
1.97 MB PNG
>>108850514
>>
>>108850559
I want instant gratification.
>>
>>108850514
Gemini pro is good enough to write blog posts for me (not english) with a heavy prompt and a very quick manual review. It doesn't sound like LLM generated, I'm impressed so far.
>>
>>108850514
Get a good base model and get good at writing, instruct/chat completion is cope
>>
>>108850585
My point (if there ever was one) is LLM output mirrors input.
Slop in, slop out. Most anons write like crap and the model mirrors that right back.
t. an anon that writes like crap
>>
>>108850607
>base model cope in the year two thousand and twenty six
yichaels
>>
>>108850517
writing a shitty python wrapper around that would be my last resort. With the billions of vibe coded projects I'd expect anyone working on some basic diff/PR viewer with local llm annotation to find basic bugs like typos or copy paste errors.

>>108850520
idk not like any of them have any actual documentation on how to use them, maybe the ones that support ollama can be patched to work with llama.cpp
>>
File: 1.jpg (85 KB, 768x1024)
85 KB JPG
>>108850261
not today satan
>>
>>108850624
be the change you want to see
>>
>>108850607
>get good at writing
yes, get good at writing prompts
>instruct/chat completion is cope
retard
>>
>>108850631
7th prince bros... we losted!
>>
>>108850619
It still holds true. Writing style and tone falls apart first few messages in and we get slop phrases isn't/is; not/just right away.
Never having to worry about jailbreaks because base models are always uncensored.
>>
>>108849119
The LLM can run outside since it's just text in -> text out. But the part that takes random LLM outputs and interprets them as code to execute needs to be isolated from anything important.
>>
>>108850663
Oh fuck off with your bullshit.
>>
File: 1779107227591023.jpg (524 KB, 1440x1901)
524 KB JPG
>>108850631
>>108850261
Hmmm. I prefer this. 2 hours in paint btw.
>>
>>108850624
>idk not like any of them have any actual documentation on how to use them
Name them so I can look for them and call you a retard for not finding it, or call them a retard for not documenting it.
>>
>>108850720
https://github.com/jnsahaj/lumen
https://github.com/timxx/qgitc
https://github.com/brianwestphal/glassbox
>>
is the qwen mtp model just as good as the regular model?
Not sure if I should use one of those yet...not until bart makes some quants
>>
File: 1773441099756204.webm (965 KB, 1920x1080)
965 KB
965 KB WEBM
>>108850631
>>108850712
Ruined
>>
>>108850704
and except now they're calling slop phrases RL artifacts because no one RL on diversity, they look bad on benchmarks. You see posts complaining about dryness to this date and you should wonder why:
https://openai.com/index/where-the-goblins-came-from/
If you can't tell the difference between a smooth continuation from the original text and jagged RL slop then congrats, none of what I said mattered to you.
>>
File: preview.png (1004 KB, 704x704)
1004 KB PNG
>>108850758
>>108850712
>>
>>108850607
I can't get base gemma to work it's not good
>>
>>108850758
Good character design for the story.
But objectively tomboys > traps, sorry.
>>
>>108850830
The "just use the base model" folks are delusional, not mentioning that they're editing model responses very often, or pretending that they're not getting looping and general retardation from the models.
>>
>>108850744
>https://github.com/jnsahaj/lumen
Start from here and trace it to wherever it's configured or just modify it to point to your server. Did you run the configure step from the readme?
https://github.com/jnsahaj/lumen/blob/main/src/provider/mod.rs#L61
>https://github.com/timxx/qgitc
The UI in the screenshots is in (i think) japanese. Doesn't bode well.
There's something in the config to manage providers. See if you can figure it out from there.
https://github.com/timxx/qgitc/blob/master/qgitc/preferences.ui#L945
>https://github.com/brianwestphal/glassbox
Start patching here
https://github.com/brianwestphal/glassbox/blob/main/src/ai/client.ts#L103
And read this
https://github.com/brianwestphal/glassbox/blob/main/src/ai/models.ts

It's all vibecoded shit, and whatever they break on whatever you're doing, you deserve it. You're no better than them.
>>
{narrative_style

goal = collegiate level vivid description as a New York Times bestselling author,

cinematic_camera {

show = [ activities, physical_states, raw_sensory_data, high_detail ],

deny = [ thoughts, meta_commentary ],

},

syntax_and_flow {

goal = collegiate level vivid narrative as a New York Times bestselling author,

narration = hyper-realistic,high_sensory,anatomical,

Flow_Mandate = Write continuous, fluid, and varied paragraphs. NEVER write static lists of features,

Integration_Logic = Seamlessly WEAVE physical traits into character movement, posture, and environmental interaction,

Connection_Tools = Use conjunctions, transitional phrases, and commas to create elegant, flowing prose,

Sentence_Structure = Grammatically complete, highly varied sentence lengths,

!meta_commentary;!send_off_messages;!summary

!punchy;!staccato,

apophasis_ban = ban_describing_negative_action (she didn't flinch) -> instead_describe_what_DOES_occur, (she stood steady)

thesis_antithesis_synthesis_ban -> use_direct_positive,

litotes_ban = never define by double negation (not un-X, not without, not entirely) -> state what IS,

reification_ban = objects+atmosphere+perception have no agency nor contain any emotions;silence cannot press;tension cannot coil;air cannot crackle;no metaphor/simile/comparison that grants agency,emotion,or intent to non-NPC subjects,

anti_parrot = never (summarize|rephrase|repeat) user_input -> react_immediately,

};

};

};
>>
>>108850908
lumen seems to work with llama-server if I just select openai and set any api key, but thanks
>>
>>108847749
model training has a "don't think of grey elephants" problem as evidenced by some ginger bonger. he showed that if you train on "this is false: xyz" then the model will tend to think xyz is true. but if you put it in context that xyz is false, that is retained much better
>>
>>108850917
>pseudo json schizo babble
>NEVER write static lists of features,
Kek, this is 100% a system prompt for a chinese model. You forgot to add 'roleplaying' and 'assistant'.
>>
>>108850917
>goal = collegiate level vivid description as a New York Times bestselling author,
So you've chosen to inject the most purple of slop prose directly into your eyeballs.
I need to discourage llms from writing their shitty idea of vivid descriptions at every turn to get something fit for human consumption.
>>
>>108850889
maybe but the rl slop consumers are just as delusional. post your pulitzer prize winning logs if you can.
>>
>>108850946
It's the funniest kind of barely sentient pattern matching that leads to these kinds of prompts. I want to tell trying to tell a Computer what to do -> Computers are told what to do with programming languages -> I'll write something with lots of programmery looking things. Full cargo cult mindset.
>>
has mtp support for gemma implemented in official llamacpp repo yet?
>>
>>108851014
no
>>
So MTP is only good for codeslop? What a fucking disappointment.
>>
>>108850917
Show, don't tell.
Except chat models still suck at it even when shown examples. The quality did improve if you feed them sample text. But if you have a good enough example already you should just use a base model at this point.
I do use chat models for a cold start or sharp turns, but I still heavily edit them before feeding any text to a base model to steer it away from the usual mode collapse.
>>108850889
Editing model response was the whole god damn point because base models are the only thing that can produce remotely close to what I want, chat models are way worse.
>>
>>108851032
and yet deepseek schizos will still insist there is a conspiracy to suppress chinese models despite fucking alibaba's models getting mtp before googles.
>>
>>108851045
>only
No? it's slightly faster just assistanting
>>
vllm chads >>>>> llmaocpp peasants
>>
>>108848710
yeah they didnt update it
>>
>>108848617
https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md
-lv, --verbosity, --log-verbosity N Set the verbosity threshold. Messages with a higher verbosity will be ignored.

I fear lv 4 will make the cmd a shitshow though, ill see
>>
>>108851080
>vllm
nah kys
>>
>>108851136
Well, it worked, it now shows the dots and the cmd outputs are mostly the same still, I thought it would be filled with crap with -lv 4: debug
>>
>>108851061
qwen is part of the conspiracy by making chinese models look shit on purpose
>>
>>108851146
If Python is so bad why is it popular?
>>
>>108851080
>vllm
segfaults
>>
>>108851174
>If Python is so bad why is it popular?
I like Python. But fuck vllm.
You can write shit code in any language.
>>
>>108851071
>slightly
I was promised 2x speeds minimum
>>
>>108851174
Outside of the dependency b.s. it's easy to just shit out something and you don't need even a compiler.
>>
>>108849598
>It wasn't just x, it was y
AAAAAAAAAAAAAA
>>
>>108850493
Making a thread requires solving 3 of the difficult captchas that I think even frontier models would struggle with.
Was hoping to automate both, but I suppose posting the recap would be better than nothing.
>>
>>108847577
I... I... I... I don't have a powerful pc to run stuff locally... :(
>>
>>108851323
you can buy one
>>
>>108851323
Gemma E4B it is then.
>>
>>108851323
If you have a PC with parts from the last 10 years you can run an 8b stheno or something surely
>>
>>108851323
Gemma 4 moe
>>
Where the heck is the Mimo 2.5 pro mmproj?
>>
i found a cute card on chub https://chub.ai/characters/Reprehensible/little-sister-ad17f94c60b6 but it was pretty bare fleshe dit out to match how i write cards and added some extra greetings with the help of dipsy https://cdn.lewd.host/muc6zECE.png
>>
>>108851355
which one is that on hf? Currently I'm using unsloth/gemma-4-E4B-it-GGUF:UD-Q6_K_XL on my 10GB vram 3080
>>
>>108851323
daddy can buy you a new gpu kitten...
>>
>>108851387
Do... Do... Do... you have matrix... :)
>>
New scaling law alert!

https://arxiv.org/abs/2605.01188v1
Compute Optimal Tokenization

>Scaling laws enable the optimal selection of data amount and language model size, yet the impact of the data unit, the token, on this relationship remains underexplored. In this work, we systematically investigate how the information granularity of tokens, controlled by the compression rate (i.e., average bytes of text per token), affects scaling trends. We train 988 latent tokenized models (BLT) ranging from 50M to 7B parameters that enable setting the desired compression rate. This flexibility allows us to study the role of compression rate well beyond 4.57 bytes per token obtained with a popular BPE tokenizer. Our experiments reveal that in compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes, not in tokens as commonly perceived (Kaplan et al., 2020; Hoffmann et al., 2022). Furthermore, we discover that the optimal compression rate differs from the one obtained with BPE and decreases with compute. These findings generalize to both latent and subword tokenization, as well as to languages other than English, guiding language model developers on tokenization scheme selection for maximal compute efficiency.


.

> Our experiments reveal that in compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes, not in tokens as commonly perceived.

Turns out it's about 60 bytes / parameter, independently of the tokenizer.
>>
>>108848744
because models below 30b active are dumb toys that quickly fall apart and are lacking in general coherence
>>
>>108851417
>scaling
>50M to 7B parameters
Every time.
>>
>>108851427
>hurr durr
ok :)
>>
>>108851432
Research papers aren't going to be in the tens of B size for the foreseeable future, the compute hardware just isn't there.
>>
>>108849598
kek
have a (You) for the effortpost
>>
>>108851417
>may 2
old news, already saw it weeks ago
>>
https://xcancel.com/Alibaba_Qwen/status/2056403591464984753
>Here come Qwen3.7-Max-Preview & Qwen3.7-Plus-Preview. Alibaba now #6 lab in Text, #5 in Vision.
>Can't wait to release Qwen3.7 series models! Stay tuned!
new qwens in 2 more weeks
>>
>>108851478
I missed it.
>>
>>108851486
>max
>plus
yeah not running the 1 gorillion parameters models here anytime soon
>>
>>108851486
>local is catching up guys
>our new frontier model will be...
>...on par with GROK!
>>
>>108851503
>qween max/plus
>local
lamo
>>
>>108851518
they were still releasing their big models as recently as 3.5, please be patient they are coming
>>
File: HE1P1HmaUAAjLXF.jpg (59 KB, 1000x600)
59 KB JPG
>>108851432
the point of scaling laws is that when you do it right, you can correctly predict several oom of scaling up
>>
>>108851432
>We train 988 latent tokenized models
They want their experiment to end somewhere in this decade.
>>
>>108851373
Description is a little verbose, but the last alt greeting is darling as fuck.
>>
>>108851498
They'd have to release them for that, anyway. Qwen don't release plus or max.
Still worth keeping an eye on the other 3.7's if or when they come out. I'm hoping for another tiny model to see if we can get a ~4b or less that rivals gemma e4b while having a much smaller memory footprint for on-device fuckery.
>>
>>108851417
I miss the anon that used to post research papers here.
>>
>>108851315
Yeah, just wait for someone else to make the thread, then auto post the recap once it's up
>>
You know what would be cool and help immensely with prompting and such?
Having a way to visualize which sequences of tokens the attention mechanism takes the most into consideration during inference.
Imagine being able to understand that writing a prompt one way will emphasize X while writing it another way would emphasize Y instead in a more accurate way than trying to infer that from the final output.
Anything like that?
>>
is openwebui better than text-generation-webui for a personal inference box?
Seems slicker and better for enterprise or consumer stuff but I just can't stop ooba'ing with all the nice knobs I have access to
>>
How far has text to speech come anons? Are people able to run very realistic voices locally or do they still have that robotic vibe to them???
>>
>>108851777
All of them suck balls and will drive you to make your own solution once you become proficient enough
>>
>>108851528
They fired the guy who cared about local.
The only 3.6 model they released was the vramlet one.
>>
>>108851591
it's in the other slop generals
>>
>>108851591
iirc his IP range was blocked due to abuse. He could still phonepost but said it was too much of a pain to post papers that way
>>
>>108851826
How do you know all this? Bit suspicious if you ask me.
>>
File: firefox_PgjiWu98lT.png (861 KB, 1689x1229)
861 KB PNG
Working on a basic OCR/translator stack. Local models running as a system service that communicates w/ image viewer, browser extension. Currently only translates JP manga but will add other capabilities soon
>>
>>108851911
Looks good, are you auto detecting or manual selection of text blocks?
>>
>>108851889
He said so himself, dingus.
>>
>>108851986
Yeah, "he" said so.
>>
turning off thinking for gemma apparently helps cut down on some of the repetition and "it's not x but y" shit
wonder if the intelligence decrease is worth it though
>>
Write creative kino with the base model and refine it with instruct.
>>
>>108689239
Was archive diving for something else but this came up in the results, just wanted to say I like it nice job anon
>>
>>108849285
Thank you.
>>108849652
They can't unrelease 31b. Even if it was a mistake, Gemma won.
>>108852014
You get less slop-structuring in both sentence construction and paragraph/macro output formatting at the cost of Gemma forgetting to follow prompt rules at a way lower context if you've got specific requirements in there. The reasoning blocks keep Gemma '"reminding herself" iteratively of the important prompt requirements which improves longform adherence to them as they recursively feed more instances of the prompt rules and important details back into context for the next output.
>>
How to design MoE models:

https://arxiv.org/abs/2605.11689
>Slicing and Dicing: Configuring Optimal Mixtures of Experts
>
>Mixture-of-Experts (MoE) architectures have become standard in large language models, yet many of their core design choices - expert count, granularity, shared experts, load balancing, token dropping - have only been studied one or two at a time over narrow configuration ranges. It remains an open question whether these choices can be optimized independently, without considering interactions. We present the first systematic study of over 2,000 pretraining runs spanning models up to 6.6B total parameters, in which we exhaustively vary total experts, expert dimension, heterogeneous expert sizing within a single layer, shared expert size and load-balancing mechanisms. We find that at every active-parameter scale that we study, performance consistently improves with total MoE parameters even at extreme active expert parameter ratios like this http URL, the optimal expert size is nearly invariant to total parameter count and depends only on active parameter count. Third, we see that other choices like shared experts, heterogeneous experts and load-balancing settings have small effects relative to expert count and granularity, although dropless routing yields a consistent gain. Overall, our results suggest a simpler recipe: focus on expert count and granularity, other choices have minimal effect on final quality.


https://x.com/margs_li/status/2056355079188627862
>>
>>108852141
great, I can finally put my H200 cluster to some good use now
>>
>>108852141
fake
>>
>>108847577
Miku's seating position looks uncomfortable. Why would she do this? No back support, both feet hanging off the ledge? Shoulder resting on glass is risky.
She's exerting herself to maintain balance so you can look at her. That is unless her hair provides rigidity.
>>
>>108852141
>spanning models up to 6.6B total parameters
these findings will surely be useful for the trillion parameter class models being trained today
>>
So erm does Gemma 4 26B A4B beat Gemma 4 31B with MTP still? I want to token/s-max for an AI thinking loop, and I need vision (mmproj)
>>
>>108852259
is there something wrong with a little bit of extrapolation?
>>
>>108852259
All models from serious companies are designed (configuration, data mixture) with scaling rules observed on tiny~small models.
>>
Tech retard here!!!

Is it possible to hook up my main pc with an ethernet cable and then hook up my laptop with the same cable and have silly tavern running on the main pc while I roleplay on my laptop?
>>
>>108852141
>https://x.com/margs_li/status/2056355079188627862
Apparently using smaller experts while keeping total active parameters the same (in other words, using a finer granularity) degrades performance. Shared experts ("generalist expert") always degrade performance too.

It looks like it's a loss for DeepSeekMoE-style models.
>>
>>108852299
Yes! In fact, your exact scenario is what it was designed for! It doesn't have to be connected through an ethernet cable! It can be on wifi! Or even through the internet!
>>
>>108852299
Ethernet wouldn't even be necessary. In the ST config file. You can set it to expose the localhost link on your network so you can connect to it from any device on the same network. I did this so that I could use my MacBook as an at home server and RP using my phone.
>>
File: serious Pepe.png (359 KB, 728x793)
359 KB PNG
I missed the entire MTP discussion, so I apologize (not)

Do I need Qwen3.6-(...)-MTP-GGUF quants to enjoy 10x tg boost?

The 'conventional' quant would not do, right?
>>
>>108852299
Twenty years ago you'd need an Ethernet crossover cable, but these days pretty much every device supports Auto-MDI/MDI-X, so a regular Ethernet cable works fine.
>>
>>108852299
cute post
>>
>>108852321
Probably. dd mosddd quants strip the necessary parts to created a smaller filed.
>>
>>108852317
>>108852319
Yeah but I don't trust the whole ISP thing though, if I just connect the cables and paste the server "link" (number url or whatever) into my searchbar, will it just work?
>>
File: openai-compute-spend.png (243 KB, 2400x2189)
243 KB PNG
>>108852141
>performance consistently improves with total MoE parameters
>optimal expert size is nearly invariant to total parameter count and depends only on active parameter count
>other choices have minimal effect on final quality
None of these are new.

>>108852259
You realize that anyone training multi trillion parameter models already has scaling law research that is 1000 times more thorough than any of the stuff that gets published? Frontier labs are now spending billions every month on research. What do you think they are doing? Most of the relevant stuff that gets published now was already known by them internally years ago.
>>
>>108852336
No worries! By default it won't go through your ISP and is confined to your local network! Even on wifi! As long as you don't have strangers using your wifi, it's safe!
Direct pc to pc connection using only a single ethernet connection will also work too if you want to be sure it's not on the network, depending on circumstances!
>>
>>108852336
To expand on that, with direct connections, you won't have a DHCP server (usually)! This means you need to give yourself a static IP on both the computers!
>>
>>108852336
>I don't trust the whole ISP thing though
Huh?
>>
>>108852261
Are you asking if it somehow increases performance (capability /accuracy)? The higher parameter model should on paper be "smarter" than the lower parameter one. If you're asking if it would lead to faster token generation then of course it would given that you're comparing a moe with a lower parameters count against a higher parameter count dense model.
>>
>>108852368
I just dislike the idea of anyone ever viewing my cringey elaborate dark ages slay the demon king with your party quest roleplay sessions
>>108852362
>>108852350
How would I go about setting this up?
>>
File: 1774435120866381.png (320 KB, 1260x622)
320 KB PNG
>>108852315
shared experts (vs pure sparse moe) hurt it slightly but not dramatically. since having a shared expert is almost as good as having one more active sparse expert, I'd say the speed benefit of having that consistent portion you can put on your fastest piece of hardware is worth it for local setups.
>>
>>108852376
What I mean is, is it still faster to only active 4B params per token when compared to the speculative decoding of MTP in 31B? Or does MTP work with MoE too? Though last I heard it can crash with vision.
>>
>>108852391
You sound very mature for you age. Why don't you add me on discord and I'll walk you through the steps? :)
>>
>>108852391
Don't worry, there's basic authentication! ST will prompt you for a username/password before it lets you in, so you don't have to worry about others sneaking a look at your logs!
>>
>>108852336
If my understanding of how these work is correct, localhost connections have fuck all to do with your ISP unless the ISP has strict controls regarding what you can do with your home network (in which case you have a shit ISP and need to start looking for other options if you can). The type of connection I described is literally one device making a direct connection to another device on the same network. The only time your ISP would ever be involved in any way, shape or form is if you're making a direct connection from a different network (eg. Connecting from your next door neighbor's home computer to your home server). Even then I highly doubt an ISP would have any reason to give a shit.... If you're paranoid route through a cloudflared connection or tailscale or something.
>>
>>108852315
All I'm hearing is that it's better to have experts for each specific fetish or individual brand of racism than having general creative writing or sociology experts.
>>
>>108852398
>>108852315
All of this would have been solved if labs would make dense models instead of moe trash. DENSE IS KING.
>>
>>108852428
An MoE is always going to be better than a dense model of the same active parameter count. Datacenters max out their compute much faster than they max out their memory, the exact opposite of how it is for us local 1GPU enjoyers. No top lab is going to just leave performance on the table and not max out every axis they can, so dense models will always be an afterthought for edge devices and the occasional bones thrown to localfags. At least sometimes we get a tasty bone like Gemma 4.
>>
Is speculative decoding compatible with anything but greedy sampling?
>>
File: Untitled.png (12 KB, 437x251)
12 KB PNG
>>108852391
If you're not too paranoid, you can just edit your ST config.yaml! Turn off the whitelist, make sure you're listening to 0.0.0.0, and set your username and password! Then on the ST computer, find the ip by `ip a` or `ipconfig` depending on your os, and you can access ST via that ip:port! The port is 8000 by default! For example, 192.168.0.111:8000!

Direct connection without a router is slightly different, and I'll need to know more about your setup!
>>
rumors on the street are there wont be 3.7 local models
>>
>>108852410
Reminds me of a time I found a Chinese guy's exposed ST instance. Checked up on him every few days to see what he was doing. Then I changed his system prompt to leave a surprise for him when he RP'd again.
It was unreachable the next day.
>>
>>108852464
fake news they will release 72B dense
>>
File: file.png (183 KB, 532x360)
183 KB PNG
"deepseek is illegal" - Georgi Gerganov 2026
>>
>>108852400
I used MTP GGUFs along with a forked version of llama.cpp that supported MTP before the official merge and t/s was, in my anecdotal sessions, noticeably faster. Not something crazy like 5x. More like 1.3 or 1.5 or something, And this was set with the a draft setting of 2 (what most people and even llama.cpp recommend when using it for coding tasks)
>>
>>108852479
DeepSeek illegal until they make R2 with creative outputs that makes kino
>>
>>108852467
How would that even occur? Don't most home networks have a firewall up to prevent that kind of shit from happening in the first place?

>"He should have said a username and password"

Yea I know I made sure I did that too whenever I connected my phone to my ST instance but the point is I'm confused as to how you're even able to find that in the first place.
>>
>>108852479
>"deepseek is illegal" - Georgi Gerganov 2026
Now that we have Kimi and Mimo at 1T+, there isn't even really a need for DS4
>>
>>108852504
Not a home network. He was renting GPUs, and decided to host ServiceTesnor on the same box without any care given to harden anything.
>>
>>108852512
but 1.6T is bigger than 1T
>>
>>108852512
Kimi's better anyway, Deepseek's only grace is slightly lower cost.
>>
File: 1777928555945986.gif (517 KB, 444x240)
517 KB GIF
>>108852500
>>
>>108852524
Any funny stories about his logs?
>>
>>108852547
Nope. Was 2 years ago and I didn't bother exporting anything after translating a few chats for fun. Just had my jollies and left him a gift. Wonder if he's still cooming to this day, hopefully locally?
>>
File: file.png (33 KB, 1155x283)
33 KB PNG
>>
>>108852576
won't happy to see his comments :(
>>
>>108851703
Sounds interesting and pretty sure none of the current software does that. Maybe you can vibe code it.
>>
>>108852588
>Maybe you can vibe code it.
Seems way above my skill level even with AI assistance.
>>
File: image_2026-05-18.png (24 KB, 1354x168)
24 KB PNG
HE IS FULL MASK OFF NOW!
>>
>>108852621
shut it down
>>
>>108852621
I think he just hates AI slopcode. Which is highly ironic.
>>
>>108852604
I don't think it's that bad. Transformers and the attention mechanism are pretty well known in 2025 and there are many people who have written code to visualize attention, just not integrated with Llama.cpp or any frontends people use.
>>
https://huggingface.co/deepseek-ai/DeepSeek-V5-Exp
>>
>>108852636
Not necessarily, he hates people who don't understand what they are doing.
Pjotr at least understands.
>>
>>108852664
I usually click all of these on reflex and even I'm not falling for this one
>>
>>108851703
It probably starts with organic "dick" and "pussy" and then it shits out the first "smile widening" "shiver" "I don't bite... unless you want me to" and then that thing starts glowing red like the sun and then it is basically over.
>>
can local even be saved at this point
>>
>>108851703
Aren't SAE like gemmascope doing something similar already?
>>
>>108852428
31B with some experts I can stick into RAM would be great though, then I can actually make use of the unused RAM and CPU to boost intelligence even if just a tiny bit. Instead of hating MoE you should be hating companies for implementing the version of MoE that's not well-suited towards consumer PCs.
>>
https://huggingface.co/deepseek-ai/DeepSeek-V6-41B-ERP-Edition
>>
>>108852716
LOCAL IS SAVED
>>
File: amity joker.png (561 KB, 1093x608)
561 KB PNG
>>108852716
why on earth did i think this was real every time
>>
>>108852727
>404
Is that the joke?
>>
https://huggingface.co/deepseek-ai/DeepSeek-V7-This-Link-Is-Fake
>>
>>108852737
No anon, You are
>>
>>108852742
>>
>>108852621
Punished Georgi...
>>
>>108852735
esl-friend....
>>
Deespeek will save local.
>>
Earlier in the LLM industry, it was shown, particularly with Solar 10B, that you could stick extra layers onto a model, continue pretraining, and obtain superior performance. So that should be possible with dense + MoE layers too. Imagine if someone had the compute and expertise to do that. We could have models that truly made full use of our poorfag setups.
>>
File: file.png (119 KB, 349x393)
119 KB PNG
>>108852826
>Imagine if someone had the compute and expertise to do that.
>>
>>108852826
yeah it would be pretty neat. I made a little qwen 0.6b moe conversion, I didn't benchmark it, lol, but you can just add moe adapter layers and train it and the loss will go down.
>>
>>108852835
Pic unrelated
>>
https://unsloth.ai/docs/new/studio

plz add to next OP. thanks.
>>
>>108852826
I just want hyper specialized models that do one specific thing very well around 4-6b so I can fit it inside/run alongside other things, rather than large general purpose models that do whatever.
>>
>>108852864
Baker, I will call you an unslop shill until the end of time if you do this.
>>
>>108849970
>Millions of pirated books
In one mega torrent?
>>
>>108852826
This is what's going to happen when they stuff Gemma 31b into the next Gemini's dense layer.
>>108852621
>>108852667
The vibecoding is just a pretext. They had no trouble with pidor's broken Gemma support release and going through the process of fixing it after the fact.
>>
>>108852765
Thank you for your support.
>>
>>108852924
>>108852924
>>108852924
>>
>>108852252
Anon you are not seeing the picture correctly, she is not sitting on the window ledge she is just floating.
>>
>>108852835
is this retard still buying ads on 4chan?
>>
>>108852979
AI (gemma) took his job
>>
>>108852891
If you paid for priority access to Anna's Archive, yes, apparently.
>>
File: 1776834382078680.png (103 KB, 1000x600)
103 KB PNG
>>108852865
While transformers don't have perfect generalization capability, they're still way better than a ton of other architectures. What would be better is a MoE model where the experts can dynamically be switched in and out of your fast memory depending on the task/context. And supposedly one backend did it with the existing models (which isn't optimal; we'd still want MoEs designed for it)...

But I suppose there is one good thing about the small specialized route which is that it's modular in the sense that you can upgrade each model as they come out rather than depend on the single huge model to update. But we do not live in that world, ignoring TTS/STT/OCR/embedding models, which are nice, but also have their own downsides.
>>
>>108853045
Hopefully that magnet comes out in discovery then, I for one would like to keep an archive of millions of books.
>>
>>108852565

>Just had my jollies and left him a gift.
I'm curious as to what this "gift" was
>>
>>108849670
Yep that's it. One author's summary on Reddit: SF Bay Gloryhole Ed. is more comprehensible than the paper:
https://news.ycombinator.com/item?id=48154866
From the parent thread and GitHub issues, sounds like people are locally reproducing the perf results, which is a good sign. Will check it out once they have the 27B dense.



[Advertise on 4chan]

Delete Post: [File Only] Style:
[Disable Mobile View / Use Desktop Site]

[Enable Mobile View / Use Mobile Site]

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.