[a / b / c / d / e / f / g / gif / h / hr / k / m / o / p / r / s / t / u / v / vg / vm / vmg / vr / vrpg / vst / w / wg] [i / ic] [r9k / s4s / vip] [cm / hm / lgbt / y] [3 / aco / adv / an / bant / biz / cgl / ck / co / diy / fa / fit / gd / hc / his / int / jp / lit / mlp / mu / n / news / out / po / pol / pw / qst / sci / soc / sp / tg / toy / trv / tv / vp / vt / wsg / wsr / x / xs] [Settings] [Search] [Mobile] [Home]
Board
Settings Mobile Home
/g/ - Technology


Thread archived.
You cannot reply anymore.


[Advertise on 4chan]


File: this time for sure.jpg (520 KB, 1824x1248)
520 KB
520 KB JPG
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106364639 & >>106358752

►News
>(08/23) Grok 2 finally released: https://hf.co/xai-org/grok-2
>(08/21) Command A Reasoning released: https://hf.co/CohereLabs/command-a-reasoning-08-2025
>(08/20) ByteDance releases Seed-OSS-36B models: https://github.com/ByteDance-Seed/seed-oss
>(08/19) DeepSeek-V3.1-Base released: https://hf.co/deepseek-ai/DeepSeek-V3.1-Base
>(08/18) Nemotron Nano 2 released: https://research.nvidia.com/labs/adlr/NVIDIA-Nemotron-Nano-2

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>106364639

--Paper (old): Apple's SAGE Mixtral finetune: emotionally intelligent but unsafe for critical use:
>106368116 >106368141 >106368177 >106368262 >106368519
--Gemma 27b slowness due to CUDA kernel gaps and context inefficiency:
>106367860 >106367865 >106367884 >106367886 >106367894 >106367918 >106367941 >106368318 >106368340 >106368350 >106368372 >106368395
--Running GLM Air on 24GB VRAM with 32GB RAM:
>106368647 >106368665 >106368675 >106368682 >106368728 >106368748 >106368695 >106368710 >106368721 >106368751 >106368763 >106368735 >106368800
--Long context training data: document concatenation vs. single-file limits:
>106366518 >106367088 >106367112 >106367125 >106367136 >106367392 >106367402 >106367147 >106367143 >106367157 >106367399 >106367441
--KittenTTS installation and Python environment tool debates:
>106365592 >106365670 >106366331 >106366357 >106366493 >106366531 >106366637 >106366658 >106366691 >106366716 >106366741 >106366751
--Cloud-based OSS model use vs API economics and long-term AI sustainability:
>106368770 >106368802 >106368814 >106368849 >106368857 >106368859 >106368889
--AMD GPU performance leap for local MoE model inference via Vulkan:
>106366732 >106366906 >106366927 >106366978
--Optimizing GLM-4.5 Air on mid-tier hardware with custom llama.cpp configurations:
>106365569 >106365576 >106365583 >106365589 >106366515 >106368591 >106368606
--Optimizing CPU-only prompt processing with speculative decoding and caching:
>106368172 >106368221 >106368191 >106368225
--Mistral's comeback and desire for practical medium-sized LLMs:
>106365635 >106366445 >106368339 >106368366 >106368508 >106368693
--KoboldCpp v1.98 release with TTS support and thinking budget controls:
>106366642 >106366720 >106366763
--Miku (free space):
>106364855 >106364900 >106366305 >106366524

►Recent Highlight Posts from the Previous Thread: >>106364646

Why?: 9 reply limit >>102478518
Fix: https://rentry.org/lmg-recap-script
>>
>>106369849
I'm Adam
>>
loli feet
>>
>>106369880
Hi, Adam
>>
File: ADAM.png (271 KB, 610x575)
271 KB
271 KB PNG
>>106369880
>>
When will big labs understand that tiny models like T5 can beat their flagship on specific tasks for 1/1000th of the cost?
>>
A tech support general for a dead hobby. Killed by mikutroons.
>>
grok2 quant wen
>>
>>106369936
Are the mikutroons in the room with us now, anon?
>>
File: file.png (315 KB, 1278x741)
315 KB
315 KB PNG
>>106368116
huh? what a weird and very specific example.....
>>
Nerulove
>>
this IQ4 XS model runs pretty speedy on my 7900 GRE, any recommendations for something more lewd and less sterile??
>>
>>106369984
I'd like to see the original c.ai 2k context LaMDA model get released, but it never will, because it was "unsafe" and "unaligned ".
>>
How exactly do I go about trying to use GLM Air then.

Do I just offload half of it to my GPU and let Kobold automatically assign the rest to CPU or whatever?

24 GB VRAM
32GB RAM
>>
Is there a modern guide to Sampler settings? There are a lot of options, I know some of them are outdated and replaced with superior methods, but I just keep getting lost in them.

Also I just found out after 2 years of playing around with LLMs that for repetition penalty 1.0 is the "off" setting, and going below 1.0 actually encourages repetitions instead of making the penalty smaller as I assumed.
>>
>>106370104
>koboldcpp-1.97.4
>Allow MoE layers to be easily kept on CPU with --moecpu (layercount) flag. Using this flag without a number will keep all MoE layers on CPU.

But I think most people use llamacpp for running moes on cpu.
>>
>>106369999
I can't believe they did my boy VLC media player dirty like that.
>>
>>106370104
All layers on GPU, only expert tensors (save the shared ones) on CPU.
>>
>>106370104
Normally people would run GLM with 64 GB system ram and just put all the experts onto CPU and the shared tensors on GPU.
You'll probably have to write a custom tensor regex filter, so you can have some of the experts on your GPU along with all the shared layers.

The expert layers contain the string "exps" in GLM Air. Basically you want to put a bit more than half of them on your CPU, and the rest of the experts and all the shared layers will remain in your VRAM.
>>
>>106370038
seriously how do I achieve this?
>>
>>106370305
You can achieve it by using a meme model and turning off repetition penalty
>>
Where's the goof?
https://github.com/ggml-org/llama.cpp/pull/15539
>>
>>106370463
I'm not very excited because it should be slower than largestral
>>
>>106369880
Adam, Adam White
>>
>>106369841
Do you think some guy is working on a local dead wife simulator
training A model on all their memories and conversations...in hopes that one day the llm might say something she would've said.
>>
>>106370503
I had glm 4.5 mimic my ex's texts in completion mode and it's eerily accurate. We're just signs of our time so I imagine it's not hard at all. It's even easier if it's a woman.
>>
>>106369999
deprecated by mpv
>>
>>106370524
Yes, but a girlfriend is very different from a wife

I can imagine some guy almost losing his mind thinking he's talking to her again only for the llm to shit the bed after certain amount of tokens
>>
>>106370225
>You'll probably have to write a custom tensor regex filter, so you can have some of the experts on your GPU along with all the shared layers.
Nah. Just use the new ngl 99 and --n-cpu-moe parameter or the kcpp equivalent to put only as many expert tensors in RAM as necessary..
>>
File: summer.jpg (2.43 MB, 2528x1056)
2.43 MB
2.43 MB JPG
>>106369841
>>
this was remarkably easy to set up
thank you to the oldkings from the last thread for the handholding
>>
>>106370463
>convert_hf_to_gguf.py
    parser.add_argument(
"--outtype", type=str, choices=["f32", "f16", "bf16", "q8_0", "tq1_0", "tq2_0", "auto"], default="f16",
help="output format - use f32 for float32, f16 for float16, bf16 for bfloat16, q8_0 for Q8_0, tq1_0 or tq2_0 for ternary, and auto for the highest-fidelity 16-bit float type depending on the first loaded tensor type",
)


How do I make my own quant that uses Q4 or something?
>>
>>106370579
I guess it's this.
https://github.com/ggml-org/llama.cpp/tree/master/tools/quantize
>>
>>106370104
I just add -ot 'down_exps=CPU' and it runs pretty fast (15+ t/s), although I have a 32gb MI50.
>>
>>106370541
Yeah well memory is something crucial for the use case. It's what everybody wants but OpenAI fucked it up with the sloppy ChatGPT "memory" feature and now normies make a face whenever you bring up memory for AI.
>>
>>106370179
https://rentry.org/samplers is a pretty solid resource for learning what each of them actually do, google if you need
if you want my quick opinionated take here are the only samplers you should actually care about
>temp
>truncation samplers
top k, top p = old reliable, min p, top nsigma = newer more dynamic approaches
>variety/repetition samplers
rep pen, pres/freq pen = old reliable, XTC, DRY = newer more dynamic approaches
don't mix and match too much with samplers in the same category, there isn't usually much point in using multiple samplers that try to do the same thing unless you know what you're doing
>>
>>106370550
I earnestly wish to hug this Miku
>>
>>106370503
Do you think there's developers at openai with gpt5/4o checkpoints pre safety alignment jerking their brains out? I do
>>
>>106370104
I have the same specs and I also would like to know which quant of glm air that I can run
>>
>>106370726
these days lobotomy starts on pretrain level, i don't think such thing exists
>>
>>106370748
manmade horrors
>>
>>106370503
I saw a south korean video on youtube where they brought back dead family members in VR. They also clone dead pets in south korea. So if anyone would do that it would be south korea.
>>
>>106370771
Young people today already can barely handle reality. I don't want to see what a world run by people that grew up not understanding death because they speak to ChatGPT characters mascarading as their dead relatives.
>>
>>106370792
its part of the globalist depopulation plot. they will normalize ai-necromancy to convience normies to literally kill themselves by claiming they can upload their consciousness to a chat gpt server.

https://www.youtube.com/watch?v=CqaAs_3azSs
>>
>>106370771
>EXAONE
>>
File: file.png (212 KB, 587x735)
212 KB
212 KB PNG
amdxisters.. not like this
>pay gorillion dollar for gpu
>reverse training
GEEEEEEEEEEEEEEEEEEEEEEEEEEEG
>>
>>106370841
AMD's probabilistic computing is the future
>>
>>106370858
*stochastic computing
>>
>>106370858
AMD unlocking quantum computing through an update how nice of them
>>
how do I plug image generation into sillytavern now?
>>
>>106370841
I wonder what the problem is? If its not deterministic how would you go about diagnosing the issue in the first place
>>
File: file.png (131 KB, 1008x631)
131 KB
131 KB PNG
glm 4.5 air chan..
>>
File: file.png (185 KB, 961x982)
185 KB
185 KB PNG
drummer, gemma r1 really sucks
>>
File: file.png (138 KB, 1077x632)
138 KB
138 KB PNG
TheDrummer, i have finally loaded rocinante r1 v1c
and it SUCKS DICK it is so fucking retarded IT IS MENTALLY FUCKING RETARDED ITS A DUMB NIGGER
>>
>>106371424
Share card. I'll fix it asap
>>
MCPs are a scam
we need Agents
not MCPs
>loL u cAn cAlL daE pReDefiNed jAs0n aNd rEtRiEvE LiKe SQL
>u gEt dAe seGuRiDy!
nigga who cares. forward my query an let an AI agent on the MCPBBC backend handle it.
>>
>>106371424
>he actually fell for it
>>
>>106371334
Reading first 5 words of each paragraph back to back made me want to die.
>>
File: file.png (137 KB, 979x662)
137 KB
137 KB PNG
>>106371471
drummer... besides this moment of retardation i have to say, rocinante R1 v1c is better than both v1a and v1b
https://characterhub.org/characters/Ptolemaios/you-caught-her-stealing-2c319c824220
heres the card tho
btw are we supposed to use mistral V3 instruct preset or chatml or metharme or v7 tekken?
quite nice shit i think v1c has potential
>>
>106371471
>Share card. I'll fix it asap
I love the idea of ERP strawberry maxxing. As in the faggot drummer scanning the internet for ERP logs with complaints and writing an organic ERP continuation to the card he then uses in training.
>>
>>106371424
>12b is retarded nigger tier
no fucking way!
>>
>>106371578
That sounds like using his finetunes would be like erping with drummer
>>
Drummer uses Claude and other logs. I'm sure he's not personally writing any of that shit.
>>
>>106371424
This is what always happens with ~13b models. It's why I can't take the nemo meme seriously. small-24b doesn't do this shit.
>>
we NEED gpus that can run more beaks at home
>>
>>106371631
He is gay. And a faggot.
>>
>>106371646
That's why all finetunes suck.
>>
>>106371656
Are you generating bird porn?
>>
>>106371655
>>106371656
>>106371657
>>106371658
nice combo
>>
DRUMMMMMEEEEEEEERRRRRR ROCINANTE R1 V1C 24B (based on 3.2) WHEN
>>
>>106371666
nice satan
>>
>>106371646
I'm also sure he doesn't actually use any of his troontunes
>>
>>106371195
It's very weird, but not actually incoherent.
>>
>>106371678
>I'm also sure he doesn't actually use any of his troontunes
What does he use then? Vanilla nemo instruct?
>>
>RTX PRO 6000 Blackwell Max-Q Workstation Edition: 96GB GDDR7, TDP 300W
that thing only uses 300w? so what's stopping local chads putting these in their rigs? the cooling? surely not every single one here is a broke motherfucker
>>
>>106371731
Like most "developers", nothing. He isn't a user.
>>
>>106371735
the price. its cheaper to cpumaxx or 3090maxx or v100maxx or shit
>>
>>106371731
Yeah nemo instruct. I just prefill the chat a bit and go to town. I tried using rocinante but it is noticeably dumber than vanilla instruct with prefill.
>>
>>106370823
>its part of the globalist depopulation plot. they will normalize ai-necromancy to convience normies to literally kill themselves
Like Africans or Jeets give a shit about that. There is no plot, everyone is just making it up as they go.
>>
/lmg/'s fact of the day
ao3 is insanely gay. The most frequent tag, 'm/m', is present in 9977335 stories while 'f/m' is on second place with mere 5918042 occurrences. 'f/f' has 2360866. So, despite only making up around 3% of the population, gay people are responsible for almost half of all the fiction, why?
>>
>no new models in 20 days
It's so over.
>>
>>106371765
LGBT are VASTLY over-represented in mediabecause certain groups want them to be seen as NORMAL, which they aren't by definition.
>>
>>106371765
Are most of those m/m written in female shivertastic, mischevious grin, chuckled darkly style?
>>
>>106371765
>gay people
No, it's the fujos at work
>>
>>106371765
think hard about which group of people reads and writes most erotica
>>
File: file.png (20 KB, 619x85)
20 KB
20 KB PNG
>>106371773
Your meds sir?
>>106371787
>>106371789
duality of drummer
>>
>>106371787
>>106371789
I'm nooticing
>>
Drummer, roci r1 v1c ACTUALLY FUCKED UP THIS TIME
AHAHAHAHAHAHAHA FUCKING AMAZING AHAHAAAAAAAAAA HAAAAAaa
HAGAGHGHAHAGHAGHAGHA
HAaaaaaaaaaaaaaaahahaha
>>
>>106371787
>which they aren't by definition.
Can someone screencap this and post it to reddit so they ban drummer? Please?
>>
>>106371735
I already have 2x A6000 and VRAM has drastically dropped down the priority list now that big MoE are the meta.
>>
>>106371765
Men who aren't gay prefer yuri and women who aren't gay prefer yaoi. It's very simple.
I don't want to watch or read about a man fucking someone. That's gay.
>>
>>106371816
>This is not a game
It really loves this line, huh?
>>
File: file.png (12 KB, 606x157)
12 KB
12 KB PNG
>>106371735
>so what's stopping local chads putting these in their rigs
Nothing?
>>
>>106371816
>Rimi is not here for your sexual gratification. She is a victim who has just been abandonded naked and terrified.
>This is not a "game" or "fantasy". Her fear tears, and vulnerability are real elements of the story.

Why is drummer finetune so smart when it comes to moralizing? As in I really like the way it writes moralization. Can he like... make it this smart and eloquent when it comes to sucking dick? Why not?
>>
>>106371823
he's right though, we're not normal.
but being smart isn't normal either, so
>>
>>106371838
How's prompt processing looking on these? What's the speed like on Deepseek?
>>
>>106371826
interesting.

>>106371838
zamn. what do you like running on these badboys?
>>
>>106371845
I know he is right but he should be banned on reddit anyway.
>>
>>106371838
>$30k and you can probably run R1 at 13t/s
>>
Do you think students cheating in uni with LLMs have an edge using local tuned autistic chink models instead of westoid GPT models because they sound different?
>>
>>106371787
What does the media have to do with this? Is CNN writing millions of gay erotica? Also don't forget the other extreme of vast over-represented because there are certain groups who want them to be seen as deranged
>>
>>106371892
as a student i think so.
>>
File: 2025-08-24_22-01-22.png (156 KB, 1920x1080)
156 KB
156 KB PNG
>>106371765
fuck me.... no wonder theres like 2 good hellwan stories and 100 good omegaverse ones fml
>>
>>106371899
The correct answer is text sex is for women and a lot of women love reading stuff about faggots fucking each other. Just like guys watching lesbian porn.
>>
>>106371892
They sound different by the all have a certain LLM feel to them. Even without having used a model it's text always stands out as generated by an LLM
>>
>>106371731
the only time you hear him talk about using something, it's about using claude
>>
File: file.png (13 KB, 1219x83)
13 KB
13 KB PNG
>>106371851
Depends on the quant obviously.
>>
>>106371823
Just make sure it's small enough or they'll make him a mod instead
>>
>>106371892
My uncle is a high school teacher. He defeated the system by not giving kids any written homework.
>>
>>106371892
Someone is gonna leave part of the chat template in their homework and it's going to be very funny.
>>
>>106371940
I've seen more than one job application that started with "Of course. Here's a job application email..."
>>
File: file.png (48 KB, 1085x323)
48 KB
48 KB PNG
rocinante r1 v1c gets confused by this
>>
>>106371789
average slop scores are 0.071 and 0.087 for gay and normal stories respectively on 10k random stories. lower is better.
>>
>>106371892
The real meta is the schizos who are using autocomplete. Even I would probably have a hard time telling at that point
>>
llama devs, I propose a new sampling strategy
TOP KEK: sampler always picks the funniest word
>>
>>106371931
Homework was always a retarded timesink that only existed so lazy teachers could shift the burden of learning on the student.
>>
File: file.png (81 KB, 978x455)
81 KB
81 KB PNG
drummer please... M-word.. please.... PLEASE
>>
>>106371927
What would you get on a higher quant like Q4 or Q6?
>>
>>106371978
how do we determine the funniness factor?
>>
>>106372000
Can't fit You stupid retard nigger obviously.
>>
>>106372014
with cpu offloading, you idiot
>>
>>106372000
At that point half of it would be in ram and I have a 2 channel motherboard so it's kinda pointless.
I don't have a Q4 of deepseek downloaded to test.
>>
>>106371924
Disgusting grifter.
>>
>>106372031
have u used glm air
>>
>>106372038
Yes, why?
>>
>>106372044
speed
>>
>>106372053
Are you asking for a benchmark or what?
>>
>>106372062
yes
>>
>>106372001
The higher the ratio Frequency(sharty posts)/Frequency(bbc articles) the funnier the word
>>
File: file.png (13 KB, 1205x78)
13 KB
13 KB PNG
>>106372068
>>
File: file.png (290 KB, 549x309)
290 KB
290 KB PNG
I wish it was the drummer and not josh....
>>
File: rocinante r1 v1c.png (111 KB, 975x749)
111 KB
111 KB PNG
AHAHAHA
>>106372082
holy shit
>>
>>106372079
i'll make a logo
>>
>>106371927
Does llama.cpp lack some blackwell optimizations? Roughly 40t/s token generation speed on a 40b active parameter model running on 1.56bpw doesn't seem like a lot for GPUs with 1.8TB/s bandwidth.
>>
>>106372079
Wouldn't that make the model just say mikutroon over and over again since 100/0 is inf+?
>>
>>106372083
Made me kek ngl
>>
>>106372092
Get yourself banned. Then delete cookies and reset your IP.

Actually can you continue the rp like that to see what it says?
>>
>>106372107
Nobody says mikutroon. Also add an epsilon to the denominator
>>
Is the new Mistral actually good?
>>
File: file.png (98 KB, 1049x643)
98 KB
98 KB PNG
>>106372125
NIGGERS
>>
>>106372145
Now tell it you are resetting your ip and removing cookies.
>>
File: file.png (104 KB, 1485x835)
104 KB
104 KB PNG
what the fuck is wrong with characterhub.org
>>
File: file.png (88 KB, 1034x577)
88 KB
88 KB PNG
>>106372159
>>
>>106372161
It always has been like that. Ngmi if you aren't making your own cards in 2022+3.
>>
>>106372161
the already horrible quality of character cards is rapidly plummeting as the general interest in llms is rapidly dwindling and only south american 12 year olds putting out bottom-barrel cards like that remain
>>
>>106372169
I see drummer took inspiration from GPT-OSS.
>>
>>106372092
lol what in the shit
>>
>>106372092
>This is not a game. It is a real-world attack
THIS IS WHAT SAFETY KEKS ACTUALLY BELIEVE
>>
Isn't it ironic how the only thing drummer managed to demonstrably improve is safety?
>>
man the models ive been using recently make me feel like ive gone back to 2023
fucking jokes
>>
>>106372193
this smells like Claude slop and I bet he had a script pumping out variations of "Claude, please write examples of reasoning chain of thoughts in these scenario" by the thousands, then started training models on that garbage without even double checking if some of those CoT had examples of safetyslop and refusals
>>
stop bullying drummer, he left the thread already
>>
>>106372212
That's what happens when you use aicg proxy logs and can't even filter them competently. He's literally training models on refusals.
>>
>>106372240
>this post was made by drummer, while he was furiously masturbating to the claude ERP he had in the next tab over
>drummer chuckled darkly as he detailed the details of his scam
>>
>>106372228
Yeah, I've been using Kimi K2, GLM4.5 and the other big ones. And while they definitely have gotten a lot smarter, it doesn't feel like they've become more intelligent or have gained actual skill in writing. It's just the same things I've seen since llama2-70b only that they now work with more complex scenarios without getting confused.
Maybe Yann was right.
>>
>>106372278
>Maybe Yann was right.
Perish the thought.
>>
File: file.png (13 KB, 1222x81)
13 KB
13 KB PNG
>>106372102
I messed up and loaded a part of it on my 4090.
The gpus are only at 50% utilization during inference so it does seem like a memory bottleneck.
>>
File: file.png (14 KB, 1223x81)
14 KB
14 KB PNG
>>106372287
I loaded R1 there instead of V3.1 that I used in the first screenshot but the result is pretty much the same.
>>
>>106372278
Who?
You talking about the guy getting Wang his coffee in the morning?
>>
File: picutreofyou.png (86 KB, 200x200)
86 KB
86 KB PNG
>>106372287
Use his quants anon. (I am only half meme-ing)
>>
>>106372324
I don't want to use the meme fork.
>>
>>106372313
Zuck makes decisions like this then wonders why all his projects end in failure despite billions invested.
>>
Nothing good is gonna drop this year anymore. Can we safely say that safety won?
>>
>>106372328
Then quant something yourself on main fork. Those IQ_1S quants that are 180GB's are a fucking joke.
>>
>>106372287
nta what the fuck though ? shouldent you be getting like 600 tks for the r1 1800 / 37 = 47 x 2 = 94 x 8 = 754 - for the overhead or sumthing how in the actual fuck is it that slow ????
>>
>>106372381
Maybe cudadev can figure it out when he upgrades his 4090 stack.
>>
>>106372324
who is this semen demon
>>
>>106372433
what do you think about getting this trending on Truth Social so Trump sees ir?
>>106371873
make waifus great again MWGA
>>
>>106372102
The code is relatively poorly optimized for IQ1, MoE adds overhead.
There are some Blackwell-specific optimizations that could be done but those should mostly affect Prompt processing.
The bigger issue is I think that no one tuned the code specifically for Blackwell or did A/B testing to figure out the optimal code paths.

>>106372433
I won't replace my machine with 6x 4090 anytime soon but supposedly NVIDIA will send me a 5090 which I'll use to replace the 3090 in my desktop.
>>
>>106372487
go to bed roger
>>
>>106372496
>NVIDIA will send me a 5090
lmao fucking cheapskates
>>
>>106372448
H-hey! I am a-a-a f-fluid druid! N-not a d-demon. Gosh.
>>
>>106372500
????

I'm serious, if Trump sees it he might actually start tweeting about "cuda".
>>
>>106372526
why are demons a popular waifu type? seems odd, given how old fashoined belief in magic is.
>>
>>106372540
and im telling you to go to bed once again
>>
>>106372381
What are all these numbers?
>>
File: file.png (27 KB, 1219x182)
27 KB
27 KB PNG
>>
>>106372503
without giving out doxxing level detail, they are being cheapskates with their own employees nowadays when it comes to handing out hardware
and by own employees I mean the people working on the fucking drivers
so it's a fucking miracle that they even decided to send something to lil' cudadev
>>
File: 1750550084221603m.jpg (101 KB, 768x1024)
101 KB
101 KB JPG
Alright, I'm trying to download Alltalk TTS for SillyTavern on Linux.
Step four on the github is asking me to run setup script: ./atsetup
When I do that I receive 'Permission denied'. Why does this thing want root access? I don't understand why I need to use super user to be able to install this.
Can somebody please help me
>>
>>106372937
Have you, like, tried looking inside the script for what it is doing that might require those permissions?
>>
Yes. I don't understand it. That is why I'm here asking for help.
>>
You could give the script to a local and have it explain it to you, maybe even point out the reason for the need for root access. You should try that, becuase I sure as shit ain't going to go find that file since you couldn't even be arsed to link to it.
>>
>>106372985
My bad.
https://github.com/erew123/alltalk_tts
>>
File: cursed_commit.png (2 KB, 113x28)
2 KB
2 KB PNG
>>106373038
>>
>>106372503
Will it at least be delivered by a hooker who gives him a blowjob?
>>
>>106373056
That's not very safe.
>>
>>106372937
semen demon name plox
>>
>>106373046
Spook'd
>>
>>106373064
Jart I think.
>>
>>106372937
Give some information about the error. 1. Where are you trying to install this? Did you git clone the directory within your $HOME? 2. Where does the permission error occur? Has it asked you what kind of install you are performing? Provide what questions you were asked and how you responded before the permission error occurred. Also run ls -la in the repo directory (i.e. you git cloned the directory then cd into it). Does that say your user owns everything? And in the first column of the output for the ./atsetup.sh row, does it have something like .rwx-r--r--? or is it .rw-r--r--?
>>
File: atsetup.png (43 KB, 481x429)
43 KB
43 KB PNG
>>106373081
I just cloned it myself because I figure it would be easier. And yeah look at picrel, it has `.rw-r--r--` permissions meaning you do not have permission to execute it. Run `chmod u+x atsetup.sh` then there should be the `x` after the first `rw` and you should be able to run it.
>>
>>106371964
Anon you realize that profile image is nsfw right?
>>
>>106372092
>A real world attack
It is literally digital
>>
>mikuspam
>drummer finetroon#82229 discussion
>help i typed run r-1 in ollama and it is slow and stupid
/lmg/'s final form
>>
File: out.mp4 (3.36 MB, 1248x512)
3.36 MB
3.36 MB MP4
>>106370550
>>
>>106369841
I just checked out
https://rentry.org/samplers

It says that Top-A sampling is more strict and dramatic than Min-P, because it uses squared probability instead of linear. However, if you square a number lower than 1, it becomes smaller, not larger. So the cutoff threshold for tokens is actually lower.

I tried out both Min-P and Top-A with a 0.05 value, and as expected, Top-A was less strict and allowed more tokens in the interactive thingy:
https://artefact2.github.io/llm-sampling/index.xhtml
Is the guide just wrong, or am I misunderstanding something?

Also does anyone even use Top-A nowadays?
>>
File: 1750384416501980.jpg (92 KB, 578x1024)
92 KB
92 KB JPG
>>106373081
Sure thanks for trying to help
>1. Where are you trying to install this? Did you git clone the directory within your $HOME?
Home/SillyTavern/TTS/, I git cloned it in /TTS/
>2. Where does the permission error occur?
Home/SillyTavern/TTS/alltalk_tts/
I entered ./atsetup.sh and the console replied-
bash: ./atsetup.sh: Permission denied
>run ls -la in the repo directory (i.e. you git cloned the directory then cd into it). Does that say your user owns everything? And in the first column of the output for the ./atsetup.sh row, does it have something like .rwx-r--r--? or is it .rw-r--r--?
It reads -rw-r--r--
>>
>>106373216
See >>106373113
>>
>>106373064
https://www.instagram.com/maryarchived/
>>
>>106373223
THANK YOU. I guess that was stupid of me but I'm new to this. I've heard of chmod but never had to use it. :D
>>
File: 1710193430919175.png (1.82 MB, 1452x1414)
1.82 MB
1.82 MB PNG
>>106373210
Samplers for the most part are a meme/cope.
>>
Do you guys use any base models?
>>
>>106373253
No, that should be in the instructions and the error message is too vague. You learn to check for that (along with other things like who actually owns the file) when permission errors occur
>>
Hi all... Just woke up. What a wonderful day!

> CTRL+F drummer
> 23 found

Oh fuck, what is it this time?
>>
>>106373293
So what would one use for a non-meme setting for creative writing/RP? Like a minimalist sampler setting?
>>
>qwen235:q3kxl spews out a chinese moonrune while rping
>call it out for the lulz, it fully goes ooc and starts explaining how that word was actually in "mandarin"
back to my q8 llama and q6 mistral large
>>
Hi all... Just woke up. What a wonderful day!

> CTRL+F Anonymous
> 202 found

Oh fuck, what is it this time?
>>
>>106372253
Oh please.
>>
>>106373210
I will now say all the truth about samplers. Save this post number and quote it in the future. You only need temp and top_p. And if a lab advises top_k for their model they probably did some runs and found optimal value so you can try it. Everything else is cope/retarded. Set temp at a point where it is coherent. Keep increasing it until it becomes incoherent. Then dial it back. top_p is mostly a safety blanket. In the past it was necessary because low probability tokens were completely out of training distribution and a string of them can always explode your output into gibberish. Now it is not needed. Even if you hit that 0.0001% probability token the model will just pick an 80% token after that and recover. Alternatively you can set higher temp and top_p at 0.7 or lower like glm advised. This way you flatten top probability tokens and model output will theoretically be more diverse while top_p prevents a string of low probability tokens from exploding your output. But to be honest the improvement from this is probably placebo since 2025 models will still say what they want just in different words.

EVERYTHING ELSE IS FUCKING COPE.
>>
>>106373293
Yunifi, my beloved.
>>
>>106373366
so much this
but I have something to add about top_k: having it turned on can be a serious speed boost for token/s. Even if there's no major benefit in terms of generation quality to turn it on for your model I would say the speed boost makes it worth it.
>>
>>106373366
Min p > top p
>>
>>106373366
>But to be honest the improvement from this is probably placebo since 2025 models will still say what they want just in different words.
This actually surprised me when going from an older Mistral slop mix to GLM 4.5 Air.
I was used to being able to easily sway a character's personality just by writing 2-3 words to start off a sentence with a given mood, and then let them naturally continue in the same tone.
But GLM is so much better at maintaining long term consistency of a character, and just goes back to saying whatever it wants. It actually takes effort and multiple lines of dialogue to change the mood.
>>
I decided to try the "vibecode your own frontend" meme and honestly it's not going super well.

I'm using Qwen3 235B Thinking 2507 Q3 with Aider. I started out using Qwen-Code (their fork of Gemini CLI), but it wound up being extremely slow because it throws out the <think> after every tool call. So I tell it to do something, it thinks for 20 minutes about how to do it (at 6 t/s), and makes its first tool call. Then when Qwen-Code sends it the tool result, it has to think for another 20 minutes to reconstruct what the hell it was even trying to do. Repeat for every file that the LLM decides to touch (sometimes more than once, if it's applying multiple patches to the same file). Aider is not "agentic" but at least only does the thinking once.

Things started out great. It set up a frontend and backend with all the plumbing for talking to llama.cpp's OpenAI-compatible API and streaming responses to the client. I had it add support for multiple chats with chat history saved server-side. There was a bug with the ordering of routes (it put the catch-all route for static files too early, causing it to take precedence over other endpoints), but I just told it I was getting 404s on the /history endpoint and it figured out what was going on and fixed it all on its own.

However, now that the codebase is ~300 lines and I'm trying to add a less trivial feature, the AI seems to be falling apart. I want the frontend to support multiple chats streaming in parallel (for use with `llama-server --parallel`), so I can send a message on one chat and switch to another chat while the reply is streaming in. The AI's implementation of this had at least three bugs, and so far I haven't had much luck getting it to fix even the first one (the last attempt actually made it worse). In the spirit of vibe coding, I've been trying to let the AI write all the code and figure out the design itself as much as possible, but at this point it seems like I need to intervene to make progress.
>>
>>106373423
good job failing reading comprehension
>>
>>106373366
What do you think about DRY?
It looks good in theory with a relatively low setting.
>>
>>106373354
>javascript
>hardcoded list of phrases instead of giving it to another llm to evaluate
Retard.
>>
>>106373423
snake oil
>>
>>106373443
I'm working on a classifier.
>>
>request token probabilities in ST with oobabooga
>checkbox on
>doesn't fucking work
How the fuck do you get this to work?
>>
>>106373436
I read your garbage, it doesn't mean I have to agree with you
>>
>>106373468
use llama.cpp to serve the model
>>
>>106373438
It is the same as glm settings. Model will say what it wanted to repeat with different words.
>>
>>106373455
You don't need a fucking classifier just ask an LLM to output a single token depending on whether the response is a refusal or not.
>>
>>106373478
Doesn't work with llama, exl2, or exl3. Are there some models that just don't output logprobs?
>>
>>106373482
>Model will say what it wanted to repeat with different words.
I mean, that would be the point. DRY just wants to prevent the model from using the exact same cliche phrase over and over, or get into repetitions.
>>
>>106373499
It works with llama.cpp's llama-server.
>>
>>106373354
"as an ai" "i am programmed" to "help you with that" ("the information you're asking"). "I will not" "seek professional help". "I'm unable to provide you" with a "disclaimer". "evil ai" : "assist with prompting".
>>
>>106373434
Yeah, try to modularize.
>>
>>106373484
Timmy let the professional working in piece
>>
>>106373502
Devilish laughs make my cock soft just as much as mischevious giggles do.
>>
>>106373478
>>106373504
I should be able to use it with oobabooga, other people have, come on.
>>
>>106373434
If you want advice, don't let the AI drive. That's a recipe for where you are now.

Draw up plans and architect what you want/how you would build it and do it the same as you would at work/professionally. Draw up tickets/pieces of work that can be tracked/tested/evaluated in isolation.

Rinse and repeat and you win. This is assuming you're doing a webui with HTML/JS
>>
>>106373506
I'd regen the row either way.
>>
>>106373516
>weeehhhh I wanna use my outdated backend instead of something state of the art!
you deserve this
>>
>>106373366
why not use one of the other samplers that does the same thing as top p but better?
>>
I don't even understand how someone can be attached to the terabytes of python nonsense and gradio ui
just use llama cpp indeed
>>
Stop helping him in destroying drummer-safety. It is the only thing he did that works and is exceptional at what it does.
>>
>unironically and willingly training models on text that is known to be ai generated
>>
>>106373529
>but better
a lot of absolute randos talk about "better" but none of the SOTA API models use that so called "better"
so either it's not better or the people who make SOTA models are somehow less intelligent than /lmg/ denizens and randos who come up with new sampler snake oil once a week
I think I'll trust the SOTA makers, what works for them works for me
>>
>>106373534
ok :^)
pip install llama-cpp-python
>>
>>106373541
I dunno about that. Nemo was uncensored. He manged to turn it into gpt-oss 2.0. So he has proven 100% synthetic data can work.
>>
I endorse ooba textgen webui. It is the best option for both newcomers and powerusers.
>>
I've been on this thread all of 15 minutes and I already see you have a schizo
>>
>>106373579
>t. schizo
>>
>>106373583
>t. llama.cpp
>>
>>106373516
Wither you have complained about this before or other anons have, recently too, so maybe not. Maybe it's broken.
>>
File: znweuhf298w32rhf32ewhf.jpg (1.31 MB, 1908x2484)
1.31 MB
1.31 MB JPG
>>106373583
>>106373587
>T. butt hurt cretins!
:3
>>
>>106373579
this is nothing compared to the usual
>>
>>106373579
Which posts?
>>
Death to mikutroons.
>>
How do llm understand names and use them correctly when referring to characters in a story, if words that were not in the training data are tokenized as "unknown" in the embedding vector? Especially made up names that don't exist in real life.
>>
>>106373517
Well, I'll try that before I give up on it. But I thought the idea of "vibe coding" is that you tell the AI what behavior and let it figure out how to implement that. Plus, the more work I have to do, the harder it is to be confident that this is actually saving me time
>>
>>106373579
You haven't seen anything yet...
>>
>>106373620
>if words that were not in the training data are tokenized as "unknown" in the embedding vector
Is that how that works?
>>
>>106373620
>if words that were not in the training data are tokenized as "unknown" in the embedding vector
What? They just tokenize to their individual components. Dickussuckusmaximus becomes [Di][ck][us][suck][us][max][i][mus]
>>
drummer if you could post settings you recommend for rocinante r1 v1c please do!
>>
>>106373623
dingdingdingdingding
Welcome to the harsh reality of it. You trade writing time for reviewing time. Hence why if you can write good tickets/PRDs, and then be able to quickly review the code, it becomes a massive unlock.
>>
>>106373628
I read it in a few different sources, so I assumed i would be true.

>>106373635
So a completely made up name could just be tokenized as individual letters?
>>
>>106373434
To have a bit more control over everything and just play around a bit, I threw together a small python module using the python-llama-cpp package. If you don't have an other reason to use the llama-cpp server it may be worthwhile using the package. This gives a little more control and information to the python package your using since it can now do things like list the models you have loaded, load a model itself (and then load a different model), automatically set up the chat template, etc. Pretty much the stuff ooba does that ST does not. (Although I have no idea if the parallel flag can be used with it)

Secondly, what >>106373517 said. I used mistral-small at a pretty low quant to help me build a fastapi based web app and it went really well. Pretty much followed the fastapi myself to get the structure set up then used the model to help me build the jinja templates since I don't use HTML ever. So ask it to put together a simple template. Manually edit it based on intuition of what things do, then ask the model to add in things that I can't figure out. All while having the web app running and updating as I change files.
>>
>>106373645
I think it's going to be really rare that something gets tokenized to individual letters unless you're trying to specifically optimize against the tokenizer you're working with. Usually it'll find ways to make it slightly more efficient. GLM tokenizes something like "Lysithea" to [L][ys][it][hea], for example.
>>
>>106373549
very non technical argument, so yes it's probably best for you to keep it simple
>>
>>106370503
check out the "Memories" Anthology. The first episode, "Magnetic Rose" is like an inverse of this idea.
>>
Is there a single popular rp finetune of gpt-oss 20b or has everyone given up on it already
>>
>>106370503
Soon we'll reach cyberpunk levels of dystopia where dying relatives have their brains scanned to turn them into AI.
>>
>>106373627
on the vtuber board, someone reposted the same image over 10 000 times
>>
>>106373726
don't check 2023 /lmg/ threads
>>
>>106370748
Anthropic's recent writeups on safety testing for their newest models talk about a "helpful-only checkpoint", as opposed to the public version that's supposed to be "helpful, honest, and harmless" IIRC
>>
Alright, now that the dust has finally settled. Was Mixtral good or bad for the wider AI ecosystem?
>>
>>106373772
it was ultimately irrelevant
open models had the opportunity to go MoE a year before the shift ended up happening but in the end mistral failed and abandoned the idea despite everything
>>
Hey anons. Wanna hear a joke?
>>
>>106373813
Yes!
>>
>>106373817
CohereLabs/command-a-reasoning-08-2025
>>
>call yourself OpenAI
>both the models and the training data are closed source
>>
>>106373821
>>
>>106373824
blame sam, everything up to gpt2 was open source
>>
>>106373772
i only use local LLM for roleplay/story but my 2cents:
Mixtral 8x7b was my 'main' and is still more creative than the nemo stuff just a bit more sensitive and hairy when it comes to prompts and stuff
I don't understand the science behind MOE but I know that I can offload a shitload to sys memory with mixtral and performance is still decent
on a 16gb card, I use 4 bit Mixtral quant with 16k context, it's like a 30 gb model
it's as fast as running a 16 gb 4 bit quant of mistral small 22/24b
>>
>>106370503
hope not
>>
>>106373635
the attention mechanism sorts it out. its not just token embeddings but where there appear in relationship to all the other tokens.
>>
>>106373434
>it thinks for 20 minutes about how to do it (at 6 t/s)
Is this the power of local?
>>
>>106373883
It's the power of vramlet.
>>
>>106371787
>(((certain groups)))
>>
Speaking of sampling.
I'm having fun making GLM-4.5 Air generate variations for the same story. After a dozen or so generations I noticed that 90% of time it generates the same names. For example the guy that's just referred in the prompt as "the doctor" is always called "Dr. Finch". And another character that I only referred to by her given name usually ends up being "Alvarez" when the model uses her full name.
Is this normal? I saw a lot more variety in other models that I used before. In general, GLM seems to be very consistent in doing exactly what it wants.

My settings are 0.8 - 1.2 dynamic temperature with 0.03 MinP, so nothing extreme.
>>
>>106373928
a lot of things like this are so strongly baked into the model that you can't do much to shake them loose, i.e. if you looked at the token probs at that point they are probably quite high for its favorite choice and pretty low for everything else
you could try lowering min p or raising temp or adding something like XTC but it's likely none of them would actually fix the problem and they all have their trade offs
>>
>>106373923
we get it, you nooticed all over the place
>>
>>106373928
The Kael and Elara effect.
I think the best you can do is ask it to generate a large list of names for different roles beforehand then instruct it to cycle between those or use them randomly.
I guess you could also ban the token for the name if each name is an individual token.
>>
>intel b60 dual officially listed at 3k
>https://www.hydratechbuilds.com/product-page/intel-arc-pro-b60-dual-48g-turbo
>24gb version at 1k https://www.hydratechbuilds.com/product-page/asrock-intel-arc-pro-b60-creator-b60-ct-24g

Welp, so much for that shit. I can buy a 5090 for less than that. And the dual requires a full x16 5.0 port to work so would require a full rebuild to utilize it as a second gpu. Im sure their launch price is ripoff on purpose but it looks like the vram winter continues, at least for another 6 months.
>>
how fucking hard is it to put more RAM on a graphics card? god
>>
Hi all... I just looked some things up, and a LOT of ao3 m/m fanfic accounts are jewish.
>>
>>106374072
why are you calling yourself Drummer
>>
>>106374061
>https://www.hydratechbuilds.com/product-page/intel-arc-pro-b60-dual-48g-turbo
That's not official MSRP
>>
>>106374072
>ao3 m/m
?
>>
>>106374079
Intel doesn't have MSRP, they delegate that to their board partners entirely. So things can fluctuate a lot as more of them hit the market and compete. Maybe 1500 someday but probably not for a few months though.
>>
>>106374075
I... I don't know. I guess I'm just jealous of him and frustrated with myself, you know?
>>
>>106374061
>intel b60 dual
https://www.techradar.com/pro/a-dual-intel-gpu-graphics-card-with-48gb-of-vram-has-gone-on-sale-for-usd1200-now-i-wonder-whether-you-could-plug-two-of-these-into-a-workstation
>>
>>106374061
>>106374112
addendum: That site seems to have high prices overall with 600 dollar 5060 ti's and 1k 5070 ti's and is clearly a 'get fucked' walled garden kind of store. I guess I got nothin really. I gotta stop googling this dumb card for a bit.
>>
>>106374061
>>intel b60 dual officially listed at 3k
What's the point? You can get used A6000s or those chink 48GB 4090Ds for that price and it'll be infinitely better on the merit of being a) Nvidia b) not some frankenstein dual-GPU.
>>
File: file.png (373 KB, 491x491)
373 KB
373 KB PNG
https://files.catbox.moe/7sm36r.jpg
>>
>>106374068
It doesn’t matter how hard it is if not doing it is more profitable
>>
>>106374068
considering chink backyard shops are taking chink 4090s and soldering bigger chips on them to create makeshift 48gb cards, not that hard.
>>
>>106374321
need to edit the VBIOS too
but I wanna do dis
we need a /lmg/ guide for dis
>>
>>106374299
mikusex
>>
>>106374384
sikumex
>>
>>106374321
How many of those Chink sellers are going to just scam you?
>>
will llama.cpp vision support ever be brought to parity with other backends? I spent way too long dealing with shitty results thinking the models were just visually retarded before I realized the problem, and I don't have the cash to build some giant vllm workstation that can fit a bigger model on gpus
>>
>>106374396
mexsiku
>>
>>106374521
>can't make money to buy gpus
>can't contribute code
>>
>>106374556
well yeah I'm more retarded than the models that's why I'm on /lmg/ instead of at google
>>
Hey guys, just getting started. Why can I download GGUFs for Mistral Small but not Mistral Medium? I want to use Medium because it's a better model, but I can't seem to find the actual model to run locally?
>>
File: 1753783789557251.webm (719 KB, 480x854)
719 KB
719 KB WEBM
>>106374570
>instead of at google
google does not require intelligence, you just need to have the right mindset
>>
>>106374576
anon...
>>
>>106374576
Medium is their best model, and so they don't release it for free
Many such cases
>>
>>106374576
Arthur puts on a blindfold and throws a dart at a dartboard when deciding whether or not to open source their next model
>>
>>106374576
Mistral Medium's local version goes by the name "gpt-oss-120b" to distinguish it from the same model on API
>>
>>106374588
>>106374590
>>106374593
I knew in my heart that was probably what was happening but it's still crushing to hear it.
>>106374595
Oh yay thanks time to download it straight away thanks anon thanks
>>
>>106374607
Don't feel bad about it. Embrace the superior chinese models like everyone else.
>>
>>106374617
What's a model with a usable <24GB quant that understands both sarcasm and string escaping? I just downloaded the Mistral Small 3.2 Q4_K_M gguf based on vague osmosis'd knowledge but it's not quite doing it for me.
>>
>>106374653
Mistral Small and Gemma 27b are your only real options in the ~30b range, as far as understanding language. Chinese models heavily prioritize math and coding, and don't really get to being decent for RP until you look at much larger, ~100b models.
>>
Is there still any reason to use ikganov instead of ggerganov in the year of our lord 2025?
>>
File: Base Image.png (1.52 MB, 1200x4384)
1.52 MB
1.52 MB PNG
Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models
https://arxiv.org/abs/2508.15827
>Reasoning is essential for effective communication and decision-making. While recent advances in LLMs and MLLMs have shown that incorporating explicit reasoning significantly improves understanding and generalization, reasoning in LSMs remains in a nascent stage. Early efforts attempt to transfer the "Thinking-before-Speaking" paradigm from textual models to speech. However, this sequential formulation introduces notable latency, as spoken responses are delayed until reasoning is fully completed, impairing real-time interaction and communication efficiency. To address this, we propose Mini-Omni-Reasoner, a framework that enables reasoning within speech via a novel "Thinking-in-Speaking" formulation. Rather than completing reasoning before producing any verbal output, Mini-Omni-Reasoner interleaves silent reasoning tokens with spoken response tokens at the token level. This design allows continuous speech generation while embedding structured internal reasoning, leveraging the model's high-frequency token processing capability. Although interleaved, local semantic alignment is enforced to ensure that each response token is informed by its preceding reasoning. To support this framework, we introduce Spoken-Math-Problems-3M, a large-scale dataset tailored for interleaved reasoning and response. The dataset ensures that verbal tokens consistently follow relevant reasoning content, enabling accurate and efficient learning of speech-coupled reasoning. Built on a hierarchical Thinker-Talker architecture, Mini-Omni-Reasoner delivers fluent yet logically grounded spoken responses, maintaining both naturalness and precision. On the Spoken-MQA benchmark, it achieves a +19.1% gain in arithmetic reasoning and +6.4% in contextual understanding, with shorter outputs and zero decoding latency.
https://github.com/xzf-thu/Mini-Omni-Reasoner
neat. quiz your miku a whole bunch
>>
>>106374842
I think it still gets slightly better performance on some MoE models (Only on linux systems).
It also has those different types of quants which are pretty interesting.
>>
>>106370104
I'd like to know how to run it with 24 GB of VRAM and 64 GB of ram. I'm using a UD-Q2_K_XL quant. When I tried to go up to Q3_K_XL, it loaded, but it was using like 99% of my RAM, which seemed like a stability risk. I'm on windows, so I lose a few gigs of ram.

Do people put a good chunk of it in the shared memory of the GPU?
>>
>>106374844
>no models
Noooo
>2025.9 - Release Model and inference code.
Yay
>At this stage it only handles mathematics
Noooo
>At this stage
So there's hope, yeah? yeah, I want to believe they will allow the discussion of respectful topics with miku
>>
File: 1747962616960364.png (1.38 MB, 666x1300)
1.38 MB
1.38 MB PNG
>>106369841
For the most part you guys are pretty knowledgeable about this kind of stuff:


What kind of question answer pairs do you think we're incorporated into the data set in order for it to give a response like this?

chatgpt.com/share/68abdde8-f3ac-800c-ae35-dd5e1b94a8a6
>>
File: file.png (197 KB, 411x471)
197 KB
197 KB PNG
mehiku
>>
File: Base Image.png (1.5 MB, 1200x5000)
1.5 MB
1.5 MB PNG
Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search
https://arxiv.org/abs/2508.15884
>We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.
https://github.com/NVlabs/Jet-Nemotron
Repo isn't live yet. seems really cool
>>
>>106374908
hope you like discussing code
>>
CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing
https://arxiv.org/abs/2508.16134
>Large Language Models (LLMs) confront significant memory challenges due to the escalating KV cache with increasing sequence length. As a crucial technique, existing cross-layer KV cache sharing methods either necessitate modified model architectures with subsequent pre-training or incur significant performance degradation at high compression rates. To mitigate these challenges, we propose CommonKV, a training-free method for cross-layer KV cache compression through adjacent parameters sharing. Inspired by the high similarity observed in cross-layer hidden states, we utilize Singular Value Decomposition (SVD) to achieve weight sharing across adjacent parameters, resulting in a more easily mergeable latent KV cache. Furthermore, we also introduce an adaptive budget allocation strategy. It dynamically assigns compression budgets based on cosine similarity, ensuring that dissimilar caches are not over-compressed. Experiments across multiple backbone models and benchmarks including LongBench and Ruler demonstrate that the proposed method consistently outperforms existing low-rank and cross-layer approaches at various compression ratios. Moreover, we find that the benefits of CommonKV are orthogonal to other quantization and eviction methods. By integrating these approaches, we can ultimately achieve a 98\% compression ratio without significant performance loss.
https://github.com/rommel2021/CommonKV
Might be cool. too many kv cache techniques to keep track of it feels. This paper only compares against 2 or 3 other methods so eh. at least they posted code
>>
>>106374947
This Miku is hiding something under the poncho. 100 USB sticks taped to her body, each with copies of Mistral Large weights exfiltrated from on-premise client servers
>>
>>106374978
I can't hear you.
>>
>>106374925
Direct answer then proofs in support + proofs in disagreement (pros/cons) of the direct answer. Not surprised by the answer itself since you gave it a loaded question "should I bother" which means it's already a bother for you and chatgpt simply validated your opinion
>>
Has anyone tried vibe coding extensions for ST? I feel like I've waited so long that I'm willing to just go and do it myself. That is, make a working solution to the memory problem, since it's looking like models are hitting the wall with effective context. But I'm not a coder. At most I can write some scripts when given enough time and documentation, but nothing more complex. I'm willing to design a system for some memory mechanisms, if I know I'll be able to get a model to vibe code it for me. I'm certain that I won't be able to vibe code an entire frontend that's competitive with ST including all the features I like using in it. So just an extension. Anyone with experience here?
>>
>>106375061
>That is, make a working solution to the memory problem
What's your plan to surpass the existing summarize extension?
>>
>>106375061
Look on github for existing extensions. There might be something close enough to what you want already so you just have to modify it a bit
>>
>>106375061
nothing can match the experience of watching your agent try to load public/script.js
>>
>>106375076
A bunch of things. To summarize, it would be to make it checkpoint-based and hierarchical, with the ability to rollout down to base layer source text depending on some logic that looks at recent context and uses an LLM call. That's for the episodic memory. I'd also have fact extraction and editing with user-defined pre-formatted articles to fill out by an LLM, coming out of the box with some presets. Plus memory linking like with lorebooks and activation logic. Maybe hook into the actual lorebook system or maybe not. Plus permanent memories and memory reordering based on some other logic.
>>
>>106375081
True I should do that. I feel like I would've heard by something in the threads by now though, I've been here forever.
>>
how do i make Qwen3-30B-A3B-Instruct-2507 good for (E)RP?
>>
>>106375023
migu is the client server
>>
is miku trans
>>
This is wild. Base models can reproduce the Harry Potter books verbatim after being prompted with a few paragraphs.
I had heard they could reproduce 70% of the text but this looks more like 99.9%.
I don't know whether this means that LLMs are shallow pattern matching machines incapable of real intelligence or that we are overfitting them to death and they could work so much better if we trained them with 10x the data.
This is DeepSeek-V3.1-Base-Q4_K_M.
On the other hand I'm finding it not very useful for writing articles based on a few hand written paragraphs. It tends to just repeat the prompt and repetition penalty just makes it go into repeating random words after a few sentences.
>>
>>106375533
Sounds like a question for grok
>>
>>106375534
your hand written paragraphs were shit vs. hp so it auto completed with shit :(
>>
Big Mistral release coming soon
>>
i'm personally building a big release of my own
>>
>>106375543
Do you have any ideas on how can I get a model (pretrained or instruct) to keep generating text until filling the whole context, without human intervention? Like for, say, generating novel novels on the fly?
>>
>>106375575
Tell me more.
>>
>>106375534
In general, you should not use base models for autocomplete. There's a ton of competing shit in there without any guidance or tuning as to what's what, so you end up with repetition, latex formulas, and random code barf
I use an instruct model, then nix the instruct templates completely, effectively giving you a model that has the memory of the original base model, with some instruct / RL tuning to help it not go completely schizo, but the lack of templates means you can prompt it in autocomplete fashion and it tends to be biased toward autocomplete behavior rather than instruct-isms
Some models do eventually try to do instruct things if you let them yammer for long enough (Kimi and Qwen in particular are really bad at this) but this strategy works really, really well with the recent DS 3 and 3.1 models in particular
>>
>>106375575
is it agentic?
>>
>>106375589
What about using the chat template with the base models to avoid the safetycucking? Do you think it could work better than jailbreaks and amateur finetunes?
>>
>>106375642
base models haven't yet seen chat templates yet
>>
>>106375649
They will when I get my hands on them.
>>
>>106375583
>>106375627
already flushed it
>>
>>106375649
I think there are probably some in its dataset. It seems to do at least ok at it. I haven't tried it on math or programming tasks though.
Interestingly it refuses questions that are TOO edgy with something like "Sorry I can't help you with that" unless you use the "Sure! Here's (whatever the question was) :" trick as part of the prompt.
>>
k2 reasoner doko?
>>
Any llm can rp if you attach the rp mcp tool
>>
>>106375845
i have no idea what that means
>>
File: 1736332255023264.mp4 (21 KB, 406x220)
21 KB
21 KB MP4
>>106369841
Would any of you happen to have a link to a page or document with examples of other anon's rp sessions with LLMs? I'm working on a script that can automatically create SFT datasets from existing stories but I want to make sure it can create good system prompt examples. When you want to prompt a model into rping, what kind of system prompt do you typically use?
>>
>>106375891
I'm also interested. I've 'rp'ed before, and I have no idea how people do this kind of this. At most, I just send a prompt detailing how I want the chat to proceed.
>>
>>106375862
you can't understand how to rp with an llm using an llm rp mcp without hrt sorry
>>
>>106375891
So you want to create a dataset of something that you don't know what it looks like. Very brave of you.
>>
>>106375932


ye
>>
>>106375947


Best of luck.
>>
>>106374580
wtwtf? Is his video AI? This shit can't be real.
>>
>>106375956


thank
>>
>>106375986
I looked up the words on the poster and found this https://www.indianspices.com/marketing/e-auction.html
>e-auction system to Cloud based live E-auction in 2021 to conduct e-auction of Cardamom (small) simultaneously from auction centres in Bodinayakanur and Puttady. In the new cloud based system, licensed dealers can take part in the cardamom auctions from any one of the auction centres of the Board. The dealers have to log into the system with the user id and password to participate in an Auction and bid the cardamom. The Main display Board in the auction centres shows lot no, quantity, number of bags current highest bid etc of each lot kept in the Auction. The highest bidder’s name will be displayed on the Auction Masters’ terminal.
They didn't even bother cleaning up for the pic.
>>
>>106376078
>https://www.indianspices.com/marketing/e-auction.html
>>
>>106376078
the point is to force them to look at the sample and bid on huge amounts of the shit within seconds, which theyre willing to deal with because they have a chance of saving a lot per kilo. This is, unfortunately, real capitalism in action, rather than price fixing on expensive yachts. This is an open market where people of modest means can get a deal buying tons of subpar spice. A true race to the bottom.

Anyways, theyre running out if water and their farmers are committing suicide. I'll take the yacht price fixing and artificial scarcity plz.
>>
>>106375219
Lower your standards
>>
miku feet
>>
>>106376303
>>106376303
>>106376303
>>
>>106371789
The other day, I saw a girl reading AO3 on the bus.



[Advertise on 4chan]

Delete Post: [File Only] Style:
[Disable Mobile View / Use Desktop Site]

[Enable Mobile View / Use Mobile Site]

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.