/g/ - Technology

/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106762831 & >>106755904

►News
>(10/01) Granite 4.0 released: https://hf.co/collections/ibm-granite/granite-40-language-models-6811a18b820ef362d9e5a82c
>(10/01) LFM2-Audio: An End-to-End Audio Foundation Model: https://www.liquid.ai/blog/lfm2-audio-an-end-to-end-audio-foundation-model
>(09/30) GLM-4.6: Advanced Agentic, Reasoning and Coding Capabilities: https://z.ai/blog/glm-4.6
>(09/30) Sequential Diffusion Language Models released: https://hf.co/collections/OpenGVLab/sdlm-68ac82709d7c343ad36aa552
>(09/29) Ring-1T-preview released: https://hf.co/inclusionAI/Ring-1T-preview

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>106762831

--Paper: The Pitfalls of KV Cache Compression:
>106765718 >106765984 >106766108 >106766148 >106766761 >106766197
--Papers:
>106764212
--Evaluating GLM 4.6's roleplay performance and quantization efficiency:
>106763532 >106763653 >106763717 >106763827 >106763907 >106763914 >106764034 >106764029 >106764173 >106763671
--IBM Granite 4.0 enterprise model launch and documentation inconsistencies:
>106767652 >106767670 >106767732
--Director roleplay customization addon for managing character settings and environment:
>106763408 >106764995 >106765052 >106765045 >106765076 >106765156 >106765172 >106765183 >106765217 >106765326 >106765094 >106765123 >106765190 >106765225 >106765253 >106765303 >106765342 >106765390
--Liquid AI's LFM2-Audio 1.5B multimodal model capabilities and performance:
>106765758 >106765764 >106765934 >106766498 >106766751 >106766973
--Feasibility of building a local knowledge base with limited VRAM and RAM considerations:
>106764158 >106764219 >106764240 >106764254 >106764261
--Discussion on AI model quantization methods, jinja string editing, and new quantization types:
>106767116 >106767235 >106767244 >106767293 >106768177 >106768200 >106767251 >106767337
--ik-llama GPU utilization problems and offloading configuration:
>106763167 >106763227 >106763244
--Qwen3-30B-A3B model selection and GGUF quantization considerations for RTX 3090 VRAM limits:
>106764290 >106764312 >106764328 >106764340 >106764356 >106764360 >106764366 >106764402 >106764430 >106764489 >106764500 >106764622 >106764838
--Setting up a roleplay bot on 8GB VRAM hardware:
>106767312 >106767327 >106768048 >106768086
--Unsloth AI introduces Docker image for streamlined LLM training:
>106766089
--GLM 4.6 performance surpasses Deepseek R1 on gaming rig:
>106766318
--Miku (free space):
>106763663 >106768757

►Recent Highlight Posts from the Previous Thread: >>106762833

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>anon? you're not making your quants?
>GAHAHAHAHHAHAHA LOSER!
how do you respond?
>>
Gumilove
>>
alright eel smarter
>>
>>106768369
Why is Kuro such a bitch?
https://files.catbox.moe/5b8n7l.txt
>>
>>106769691
with a bullet
>>
File: 1738912286388860.png (36 KB, 469x171)
>>106758314
GLM doesn't output a newline after <think> and before </think> so you need to remove those from the reasoning formatting to get it to parse correctly.
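If you're on SillyTavern, that means the Reasoning Formatting prefix/suffix fields should end up looking roughly like this, with no \n baked into either side (field names may differ slightly by version):

Reasoning Prefix: <think>
Reasoning Suffix: </think>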
>>
I want to spend $8000 to draw 2d anime pictures. Do you think it will be enough? 2d anime pictures are very important to me.
>>
>>106769831
You can get a decent drawing tablet for half that.
>>
>>106769831
Image gen needs a lot less memory than text, so yeah, it'll be a banging 2d anime pictures generation machine.
>>
>>106769831
you can get a tablet for a lot cheaper, and you can get a gpu to do it for about the same $500
>>
>>106769831
That should be enough even for moving anime pictures.
>>
>>106769845
>>106769847
>>106769852
so how much do I really need?
maybe... 50 TOPS?
>>
>>106769866
so funny i forgot to laugh
>>
>>106769691
I lie down under her.
>>
>>106769902
gotcha, gear fag
>>
>>106769866
This is all I got.
https://www.pugetsystems.com/labs/hpc/whats-the-deal-with-npus/

Don't know how much software is even out there that can target an NPU.

Do tell us when you figure things out.
>>
Seriously? You just tell it "You're doing ERP" and it just turns off the safety features?
https://files.catbox.moe/ozn9ws.txt
>>
>>106769947
Sounds like my answer is it will make a decent chatbot/search tool, maybe do img -> txt, but that's about it.
Thanks for the link.
>>
File: 1750174960929728.jpg (75 KB, 383x908)
>tried to share my addon on leddit
>muh mod approval
>its 24hr later and no response

whelp i did what i could to advertise. i'm not going to bother following up and waiting. i made an addon that does what i want, for me, and i've shared it here a few times.

you can install my st addon by entering the address into st's extensions. https://github.com/tomatoesahoy/director

i'm not done working on my addon but updates are spontaneous at best. i'm more disappointed that when i feel my addon is good for release to everyone rather than just posting here occasionally, i'm met with walls of restriction. i followed their rules, made an account, waited 30 days. still cant post. so whatever, enjoy, you 4chan fucks
>>
>>106769725
>https://files.catbox.moe/ozn9ws.txt

I find this easier to read than the usual wall of purple prose, but the model was a bit loose with the formatting.
>>
Does GLM 4.6 have a habit of slipping into feminist lecture mindset, like GLM Air?

inb4 prompt issue
>>
>>106769831
NAI is $25 a month
>>
>>106770110
Don't dwell on it. After the guy who made localllama had his meltdown, admins put one of their puppet power mods in charge. It's a feed of sponsored and approved content, not public discourse.
>>
>>106770110
Stay here.
>>
>>106770080
>https://files.catbox.moe/ozn9ws.txt
>The "Uohhh!" is a sound of surprise and delight
still doesn't know the meaning :/

>>106769831
>2d anime pictures are very important to me.
don't you have whole *boorus, kemono, etc. for that? why do you need more?
>>
>>106770139
buy an ad kurumuz
>>
My GPU crashed so hard that it stopped being recognized by nvidia-smi. Is it over?
>>
>>106770207
>>106770215
it actually means a lot to see other people be understanding. i posted in good-faith. i really just wanted to share something i think helps solve the clothing/location issue models have. but then i realize i'm totally shut out from posting. at least, other than here
>>
>>106770262
Here is all you need.
>>
>Using ikllama 4.6 = 2.6 T/s at 10k ctx
>switch to regular llamacpp = 3.2T/s
?? Why was I using the memefork again?
>>
File: 9dfg87.png (7 KB, 407x218)
>>106770207
>meltdown
qrd?
>>106770252
until you reboot perhaps
>>106770300
few hours of trial and error to find the right memeflags then it'll be fast
also post cmdline
>>
>>106770324
--override-tensor exps=CPU -ngl 99
That is what I use for both. I have bad experiences with fmoe and I don't think any other flags apply to glm.
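For anyone copying this, the full launch line ends up looking something like the sketch below (model path and context size are placeholders, adjust to taste; the same flags work on both llama.cpp and ik_llama.cpp as far as I can tell):

llama-server -m GLM-4.6-IQ4_K.gguf -c 32768 -ngl 99 --override-tensor exps=CPU

-ngl 99 tries to put every layer on the GPU, then the override pushes any tensor whose name matches "exps" (the MoE expert weights) back to the CPU, so only the attention/shared tensors and the KV cache stay in VRAM.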
>>
>>106769660
https://files.catbox.moe/bpv4jk.mp4
>>
has glm 4.6 officially saved local?
I haven't had this much fun with a model on my rig in quite a while and even api deepseek never felt this good
>>
File: file.png (1.19 MB, 1216x832)
>>106770252
Did it give:
>"Unable to determine the device handle GPU0000:n is lost" ?
If so then try a hard power cycle and hope it works. Can be a driver crash, or actual hardware issues.
I had an issue like that about 5 months ago and it turned out to be a flaky connection that got better after I cleaned the dusty PCIe slot.
>>
>>106770366
im so glad miku is 2D
>>
>>106770110
Okay but where is the version of this where it describes body parts and their current status? I want the model to know the exact height and skin color and ear shape of my goblin maid. Seriously though, I wonder if something like that would help with things like the nala test or similar scenarios. Maybe explicitly describing the character as quadrupedal with paws in a scene description would improve the output even though it should be self-evident by describing the character as a lion.
>>
File: korn worry.jpg (10 KB, 193x245)
>>106770366
Miku-san, where.. from where does your cheese come from?
>>
>>106769660
Daily reminder
>mikusex
>Nvidia Spark is a tiny DGX computer not meant for LLM/imagen inference
>petra is the goat
>>
File: 1732511391275741.jpg (227 KB, 1536x2048)
/ourguy/ Kalomaze (Min P) is doing an AMA on Reddit
https://www.reddit.com/r/LocalLLaMA/comments/1nwaoyd/ama_with_prime_intellect_ask_us_anything/
>>
>>106770451
seems like he's reddits guy now
>>
>>106770366
Chinkiest miku ever
>>
>>106770356
got some spare GPU mem? still testing myself (also gonna be v hw dependent) but maybe first three+ layers back on GPU, flash-attn, run-time-repack, K/V quanting, batch sizes ..
>>
>>106770207
you mean the mod who instantly created a discord and twitter account for the subreddit somehow isn't a saint, color me shocked to find this out
>>
>>106770463
Always was.
>>
>>106770451
wow
>I’m Kalomaze (u/kindacognizant), a researcher at Prime Intellect, the lab behind:
>Distributed training efforts including INTELLECT-1 + INTELLECT-2
what switch to go from here to being responsible for some of the worst models ever
>>
>>106770480
the point is memefork isn't better
>>
>>106770498
Acquiring funding is like making a deal with satan.
>>
>>106770451
Someone ask him why his company uses total fucking garbage to train models and not only that they're doubling down on it by creating a huge synthetic dataset for MATH CODING and SCIENCE
>>
File: that chink is cooking.png (535 KB, 750x750)
>>106770498
money
>>
I've coomed for the 5th time today, and I can't get it up!
>>
>>106770513
No point asking. Assuming he doesn't ignore the question, the answer is obviously adding any nsfw, vulgar, or copyrighted content would get them shut down overnight.
>>
>>106770523
>I'd become a giga neet
>>
>>106770540
They make pills for that.
>>
>>106770477
Yeah, those are not Japanese aesthetics. Definitely chinky or gooky.
>>
File: 1748895901549445.jpg (89 KB, 973x796)
>>106770289
i know. it was silly to even try because the internet now is nothing but dead ends and degenerates
>>
>>106770546
How would vulgar or nsfw content (or just focusing on writing even a tiny bit) get them shut down? Whatever, just tell him to choke on a dick or something then
>>
File: monke music.png (112 KB, 497x356)
>>106770451
I enjoyed those times when he was trying threadly new sampler shit-throwing to see what happened. It was fun to try getting the experimental hacky stuff working even if most of it did not make the outputs definitively "better" overall. They did do something, that's for certain, so I thought it was cool to be part of the exploration of new land.
>>
>>106770366
not my miku
>>
>>106770588
>or just focusing on writing even a tiny bit
That doesn't improve the benchmark scores that get investors all hot and bothered.
>>
>>106770607
INTELLECT has some of the worst scores ever recorded though, like worse than llama 7b
>>
It's odd to me that ik_llama.cpp doesn't seem to work well with assistant prefills using the chat completion API.
llama.cpp just works.
>>
>>106770651
Less math and code isn't going to boost those scores though.
>>
I compared https://huggingface.co/ubergarm/GLM-4.6-GGUF/tree/main/IQ2_KL
against https://huggingface.co/Downtown-Case/GLM-4.6-128GB-RAM-IK-GGUF/tree/main/24GB%2B128GB_V3
and found ubergarm's to have significantly better token probabilities on my tests. There is one where ubergarm's quant has 70% on correct token, 16% on "trap" token while the other one has 20% on correct and 63% on trap while also being somehow slower at a smaller size
>>
>>106770651
Does it beat the original StableLM?
>>
air bros... wheres our 4.6???????????????
>>
>>106770451
>Hey, we are super aligned
yea...
>>
File: aaa.png (193 KB, 395x391)
What is the best model for editing images? (I want to add a dick on a dude's forehead) I'm using Easy Diffusion, but when I try it I get weird results, like it doesn't look like a proper image of a dude with a dick on his forehead, just something weird
>>
>>106770710
ppl and kld. Come back when you have those.
>>
File: Screenshot.png (25 KB, 569x202)
>>106770727
there is hope, thrust!
>>
Okay, something is wrong with GLM 4.6. Maybe it's the quant, I'm running IQ5_K but the model is fucked. It will sometimes randomly add chinese characters to the middle of sentences, put the actual response in its thinking and doesn't consistently follow formatting of previous responses correctly. I have to reroll most of the time and it's slower than R1.
I'm using temp 1 with everything else neutralized as recommended on the hf page with the GLM 4 context/instruct template.
>>
>>106770482
>MODERATOR OF
>r/LifeProTips
>r/ChatGPT
>r/PewdiepieSubmissions
>r/OpenAI
>r/GPT3
>… and 44 more
they are in good hands
>>
>>106770763
The best hands.
>>
>>106770753
Try using the chat completion api.
If that works, something is fucked with the context/instruct template somewhere somehow.
>>
>>106770451
Neat, you should head over there
then don't come back
>>
>>106770774
Do not be mean unto others.
>>
>>106770780
redditors are vermin, not people
>>
>>106770753
>everything else neutralized
Use top_k=40, top_p=0.95. If you neutralize everything, you'll always get shitty tokens now and then.
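If you want those as server-side defaults, something like this works on llama-server (temp 1 per the hf page; note most frontends send their own sampler values per request, so set them there too):

llama-server -m GLM-4.6.gguf --temp 1.0 --top-k 40 --top-p 0.95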
>>
>>106770827
Nope! If you have to resort to that you're using a shit model and should switch to something actually good.
>>
File: 1409885621682.gif (249 KB, 404x500)
>this is deeply misogynistic and reinforces problematic gender stereotypes
air-chan, yamero
>>
>>106770872
Thank you kalomaze, very safe.
>>
>>106770886
it's not like it refuses, it just insists on scolding me about sexism and toxic masculinity all the time
>>
>>106770905
So a bit of a Mixtral vibe, nostalgic.
>>
>>106770872
have you tried disabling thinking?
>>
>>106770905
have you told it to like sexism instead?
>>
>>106770840
lmao glmtards BTFO
>>
>>106770747
>coworker calls him a coomer
workplace bullying is not ok
>>
>>106771035
name a better local model
>>
>>106771098
nemo
>>
>>106771098
they're all in a sad state right now but i'd choose something that doesn't run at 2 tokens per second unless you are a retard who spends money on this
>>
>>106771105
Somebody should turn Nemo into a MoE.
>>
>>106770366
It's not Miku Cheese unless it's made from Miku Milk.
>>
>>106771105
>>106771134
Have you tried Rocinante — I love it!
>>
>>106771153
NeMoE
>>
>>106771205
'mo-moe
>>
>>106771205
>>
Well, >>106771205 has the name down.
Now we just need a finetooner to create it.
Anybody got DavidAU's phone number handy?
>>
At last, local is saved. I called qwen3 and glm4.5 shit while everyone praised them. The new glm however is a whole different story. It writes with the soul I've never seen before in a local model, not even deepseek.
>>
File: 1742695099511857.jpg (95 KB, 1190x906)
>>106771262
>>
>>106771262
>soul
zoomer opinions are less than worthless
>>
>>106771105
what's it like to fuck a retard?
>>
>>106770753
ubergarm/IK quants on non-ik_llama.cpp perchance? Haven't seen such issues with barts even at Q3_K_M, sometimes thinking gets messed up but that's my janky Silly config. remember newline before <think>
>>106771276
>not embracing the heartsovl of your model
get out
>>
>>106771247
I'm 100% sure there have already been Nemo based clowncar MoE made.
>>
>>106771134
I can get my glm running at 4-5 t/s. Honestly not bad compared to what I originally expected.
>>
>>106771333
>remember newline before <think>
You mean start reply with that?
>>
>>106770451
>Kalomaze is doing an AMA!
>he never answered any question
>>
why are we using GLM 4.6 again? this shit feels astroturfed as fuck now that i had a chance to use it. surely i should be getting better speeds than this with 30 layers offloaded. token generation is the same, although it ends up taking much more time since GLM wastes time thinking (i know i can turn it off). it's only like 30tk/s quicker than K2 despite being half the size (249GB vs 485GB). the results below are with an IQ5_K quant (ubergarm)

INFO [print_timings] prompt eval time = 32843.74 ms / 4807 tokens ( 6.83 ms per token, 146.36 tokens per second) | tid="128489640603648" id_slot=0 id_task=3408 t_prompt_processing=32843.743 n_prompt_tokens_processed=4807 t_token=6.832482421468692 n_tokens_second=146.35968866276903
INFO [print_timings] generation eval time = 250057.45 ms / 1591 runs ( 157.17 ms per token, 6.36 tokens per second) | tid="128489640603648" id_slot=0 id_task=3408 t_token_generation=250057.452 n_decoded=1591 t_token=157.16998868636077 n_tokens_second=6.362537837904546
INFO [print_timings] total time = 282901.20 ms | tid="128489640603648" id_slot=0 id_task=3408 t_prompt_processing=32843.743 t_token_generation=250057.452 t_total=282901.195

kimi k2 0905 smol_IQ4_XSS quant

INFO [print_timings] prompt eval time = 39716.81 ms / 4180 tokens ( 9.50 ms per token, 105.25 tokens per second) | tid="133210908811264" id_slot=0 id_task=1323 t_prompt_processing=39716.806 n_prompt_tokens_processed=4180 t_token=9.501628229665071 n_tokens_second=105.24511966042789
INFO [print_timings] generation eval time = 18771.55 ms / 121 runs ( 155.14 ms per token, 6.45 tokens per second) | tid="133210908811264" id_slot=0 id_task=1323 t_token_generation=18771.548 n_decoded=121 t_token=155.1367603305785 n_tokens_second=6.445925503852959
INFO [print_timings] total time = 58488.35 ms | tid="133210908811264" id_slot=0 id_task=1323 t_prompt_processing=39716.806 t_token_generation=18771.548 t_total=58488.35399999999
>>
>>106771590
Let them cook, they're busy!
>We are an open source agi Labs and ramping up our research team, our goal is to be competitive asap on capabilities with the big labs, we have compute, talent, and crowd source environment with verifier and the hub. Stay tuned for our next model release !
>>
>>106771605
>this shit feels astroturfed as fuck
no way take your meds schizo freak
>>
> Is there any local ai that lets me do TTS with elevenlabs-type quality?

I have a 3080 geforce tuf and a gigabyte ga-h61m-s1 mobo, so my pc is quite decent but not overkill. I still want to try messing with a good TTS.
>>
>>106771510
Yea, or on a second line in Assistant Prefix. I use the instruct sequences to more easily turn thinking on/off
>>
>>106770110
>remember always running it on windows
>reinstalled it on linux months ago
>for some dumb reason it doesn't show up in my silly tavern extension tab like it does on windows despite the fact it shows as installed and enabled when managing extensions
aside from that weird fuckery going on my end, it was one of my fav add-ons back then desu, mostly used the clothes and world info bits and it helped out a ton steering models, you did a good job anon!
>>
>>106771624
it's not good and now i know why nobody is posting GLM 4.6 logs. what's the point of using GLM 4.6 when K2 is only moderately slower in PP but still the same speed in TG? GLM 4.6 is lacking in its knowledge, it fails to understand niche things that kimi is able to pick up without issue. even niche stuff aside, kimi just knows way more and its dataset is more recent. you can ask it to tell you stuff that happened in December 2024 and it will answer factually.
>>
>>106771605
>why are we using [350b model] and not [1t model]
But that aside, do you not like the output?
>>
>>106771590
No shit, reddit AMAs are just promotional stunts. They only respond to the questions they want to, the ones that let them segue into marketing themselves with canned and preplanned responses.
>>
>>106771712
if i turn off thinking then it's reasonable. keeping thinking on is a mess though.
here's an example of the output with thinking. removed some of the response to fit the character limit. the thinking is slopped.

Okay, the user has responded through {user}. Let me break down the key elements of their response to guide my turn as {char}.

{user}'s actions/Dialogue:

Physical reaction: Breath knocked out, hugs back "timidly." This shows he's a bit overwhelmed and maybe shy, which fits his description. He's not rejecting the hug, but he's not as effusive as {char}.
Dialogue:
"Jeez, yeah yeah. It looks like I'm back for a while…" - Confirms his return, sounds a bit flustered/casual.
<blahblah>
{char}'s internal state:

She's ecstatic. Her absolute best friend, the boy she's been in love with for a decade, is back.
<blahblah>
My Plan for {char}'s Response:

Initial Reaction: <blahblah>
Physicality: <blahblah>
Dialogue:
She'll laugh loudly at his "welcome party" comment. Maybe tease him about it.
<blahblah>
Narrative Flow:
Start with her still hugging him, her immediate reaction to his awkwardness.
<blahblah>
End on a high note, maybe she grabs his hand and starts pulling him somewhere, or asks a million questions a mile a minute. The goal is to keep the interaction going and show her personality. I'll make it brief and stop at a point where {user} needs to react.
Drafting the response in my head:
>>
>>106771674
thanks for the compliment, it makes me happy to read. its funny that 4chan is the way it is, i've never got much negativity about developing my addon. you guys are always pretty supportive, telling me to keep going and stuff. it might not seem like much but reading the positive comments makes me wanna do more, so thanks
>>
>>106771605
>this shit feels astroturfed as fuck
no bro 4.6 is kino af ong it has svol bro fr fr the vibes bro bro
>>
So what layers should I offload to RAM for GLM 4.6 in ik_llama.cpp?
>>
GLM 4.6 just bought me a house and cured my dog's cancer!
>>
>>106771820
this but unironically
>>
Another reason I love glm-chan 4.6 is that I am sure she makes drummer jerk off while lubricating his cock with his tears. Glm chan reminds him that the days of his grift are numbered. Next air will be accessible to everyone and will easily beat the shit out of all the shittunes. Shittune placebo will die in 2026. Last chance to get a job you safety engineer retard.
>>
>>106771605
>why is this 30b active parameter MoE at Q5 running slower than this other 30b active parameter MoE at Q4_XXS
is this really the level we're having this discussion on? maybe ollama is more your speed, you won't have to worry about this sort of thing.
>>
>>106771820
You are either salty because they didn't release air or are one of those 3x p40 anons. I'll post logs tomorrow because it's past my bedtime
>>
>>106771801
You are welcome! I remember the first posts you made about working on it and going like "Oh shit gotta write it down when it releases" cuz I was trying to play around that issue of models forgetting scene stuff or steering it towards a play by writing some of the info on lorebooks or even just random notepad entries and then copy and pasting their contents on the author notes tab, but it was kinda janky to do so manually every time especially with so many entries... and your extension for me solved just that in a really neat and easy way, glad to see it's still going
>>
>>106771645
VibeVoice 7B. It's the best we got.
>>
>>106771958
why wont anybody have a serious discussion about this? is it because you don't have enough RAM to run kimi k2 yourself and have to rely on cope quants of GLM 4.6? post logs ffs otherwise i will just stick with the superior chink company.
>>
>>106771704
>i know why nobody is posting GLM 4.6 logs
https://files.catbox.moe/mwwdug.txt
https://files.catbox.moe/xs9vn5.txt
>>106769725
>>106770080
>>
File: 1757604949388241.png (1.45 MB, 900x1200)
WTF even is this LMAO: https://xcancel.com/wolflovesmelon/status/1971002333577482360
>>
File: 1729019232618558.jpg (96 KB, 411x980)
>>106771993
its a funny issue. ai only cares so much about the most recent thing (lowest context in the chat). there is bunch of addons now that all do similar things but it still ends up with reinjecting data into the ai at a point.

in my head i knew what needed to be done, but wasnt sure how it'd turn out. especially since i was using ai to develop the app (i'm ok with java, but not a programmer in it). i'm pretty happy with how things turned out

lately all i've done is add an image feature. so inside the folder of my addon, if you create an 'images' folder and then have an image that matches a name, it'll pop up a pic just like the card would if you clicked the image. picrel is belle from beauty and the beast, in her peasant outfit
>>
>>106772080
>post logs ffs otherwise i will just stick with the superior chink company.
If you actually tried both of these why would you give a shit about anyone else's logs
>>
>>106772136
a bit like having different alt gens of a card but more dynamic since you can switch it up based on what the setting says they are currently wearing? I kinda dig it desu
>>
>llama-kv-cache.cpp:764: GGML_ASSERT(ubatch.seq_id [s*n_tokens][0] == seq_id) failed
when doing --parallel 4 runs
ah well, it wasn't worth upgrading llama cpp to try granite
this shit is so ghetto
>>
i just woke up in the most suspicious way possible call me a schizo but im going to attribute it to divine forces awakening me to witness an amazing drop dont hold me to my word though plz
>>
>>106772401
kinda. i just wanted images that are associated with outfits or locations.

back when ai was pygmalion 2.7/6b and st was hardly a thing, one of the nice things the kobold ui did was highlight lorebook entries. any time it hit an entry, you could hover over that in the chat and see its entire entry plus a pic of it. i always wanted something similar for st, but since i can't do that, i'll settle on pics that pop up the same way a card pic does
>>
>>106772111
Is he dead yet?
>>
>>106772444
Do you remember if Miku said anything to you?
>>
>>106772425
Now try running GLM on CPU with vllm.
>>
>>106772444
glm already dropped and saved our cocks
>>
>>106772494
why is everyone waiting for llama.cpp to implement models when you can just run everything with vllm on cpu much faster
>>
>>106772524
>vllm on cpu much faster
it's not and because you don't get finegrained quants like exl3 or gguf
>>
>>106771704
Kimi felt schizo to me when I used it, like it doesn't know how to describe stuff naturally despite all that knowledge. And Deepseek is just boring and cucked now. New GLM doesn't have either of those issues. Guess it goes to show that having a gazillion parameters doesn't matter when all you care about is benchmaxxing.
>>
>>106772549
but there's no point in using those if you're running 8bit anyway
>>
>>106772524
I have never seen any of the vllm on CPU shills post t/s numbers.
>>
>>106772524
Because the people who did try it ran into bugs and errors. Turns out vllm is only production-ready if you plan on doing GPU-only inference and your GPUs are all the same VRAM. Also you might need to find the exact version of vllm that works with the model you want because not every version new versions can and have introduced bugs with old models.
>>
>>106772580
>not every version new versions
Somehow deleted a part of that, was meant to be
>not every version does as new viersions
>>
>>106772093
>https://files.catbox.moe/ozn9ws.txt
jesus christ... so its shits all retarded and it talks like a fag for everybody in its thinking process and just isn't me. it's so over.
>>
File: k2 params.png (72 KB, 402x699)
>>106772555
heres my parms anon. hope it helps you with getting non-schizo responses, kimi seems to behave the best with these.
>>
>>106772580
Not to mention the trial and error getting the pythonshit dependencies working. Even with conda it seems like the project is in a constant state of broken.
Or the fact that the issue tracker is full of ignored issues because all support and development happens on discord.
Or the fact that they only support the latest 3 gens of Nvidia cards, so AMDfags, Intelfags, Macfags, and P40fags are all out of luck.
>>
>>106772580
Because vLLM only works with GPU counts that are powers of 2 and I have 7 GPUs. Maybe one day I'll get an 8th GPU and connect it over that weird SAS port that does PCI-E shit.
>>
>>106772697
Honestly it is really fucked up. CPUbros don't know how good they have it mister gurglenov.
>>
>>106772739
Eastern Euro C/C++ programmers are a different breed.
>>
>>106772769
can confirm.
source: am rpcs3 dev
>>
glm chan is the semen demon. sign the pact now by buying at least 128GB's of ram.
>>
>>106772857
I'm not sure I want to deal with 3 t/s (if what the anon said in the other thread was not a lie).
>>
https://huggingface.co/Qwen/Qwen3-4B-SafeRL
finally, a model everyone can run on a potato while feeling very safe
>>
>>106772899
nah anon, you can get a whopping 6tk/s with GLM 4.6 and you only need 120GB of VRAM to do it. >>106771605
>>
>>106770753
I give up, something's wrong and I don't know what. Back to R1 for me.
>>
>>106772899
I couldn't run 70B's offloaded at 2T/s. I can run glm at 3T/s because you never have to reroll. Just look how often you reroll and if it is around 8 times then it is a no brainer.
>>
File: 5416.png (206 KB, 400x300)
I see most people here use the models for roleplaying, are there are any tools for using one as a voice assistant?

I'm thinking of making something like Cortana or Alexa for my grandma cuz her memory is getting weak. What's the best way about doing this? Any tips or tools?
>>
File: ohnoitsretarded.png (164 KB, 859x809)
>>106772929
GLM 4.6 will follow your previous responses format/style to a T to its detriment. If you put in weird formatting and split up sentences in weird ways you will get a response like the one in my screenshot. Does it do it on a fresh chat?
>>
What are the odds one of these chucklefucks successfully ban all chink tech in the west?
>>
>>106772970
you could try giving the new line a bit of a debuff, they probably let it see too much hard wrapped text in the pretraining.
>>
>>106772970
>Does it do it on a fresh chat?
Yes, this is with a card that doesn't put speech in quotations and has asterisks wrapped around all other text and it's not just that. It will output </think> in the middle of its response and start speaking for my character and on some rerolls, it will only output <think> and then end the response.
I'm assuming it's quant related, or there's some bug or something, I'm using ik_llama and I rebuilt it. I can't be bothered to figure it out, maybe I'll give it another try if it's still relevant here in a month or whatever is broken gets fixed.
>>
File: file.png (191 KB, 416x416)
thank you god emperor xi. my member isn't worthy this boon you bestowed upon us.
>>
>>106772914
Oh ok, thanks for pointing that out. Might consider it, but also wish we had more given that it's a reasoning model.
>>
>>106761230
Okay so you can replace the built one in portable_env/Lib/site-packages/llama_cpp_binaries/bin with a self-built one, and do not forget to build ik_llama with CUDA. Running GLM 4.6 with ooba's UI.
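In case anyone wants to do the same, the build step is the usual cmake dance; the CUDA flag below is what current llama.cpp/ik_llama builds use, but double-check the fork's README in case it differs:

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

Then copy the resulting binaries (llama-server and the shared libs next to it) over the ones shipped in that bin folder.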
>>
>>106773087
it is garbage with reasoning since reasoning blocks are 3 times longer than 4.5. but you don't need reasoning. everything just works. 16 times less useless detail. it even does things you didn't ask for but you realize that you actually want.
>>
mistral bros when is it our time to shine? isn't large supposed to be released soon since they upgraded medium recently?
>>
the glm gaslighting continues
>>
>>106773183
who cares? large would just be a downsized and fried r1 anyway
>>
>>106773183
>With the launches of Mistral Small in March and Mistral Medium today, it’s no secret that we’re working on something ‘large’ over the next few weeks. With even our medium-sized model being resoundingly better than flagship open source models such as Llama 4 Maverick, we’re excited to ‘open’ up what’s to come :)
>May 7, 2025
just a feeeeeeeeeeew more weeks
>>
>>106773194
get a job anon and you can maybe stop coping
>>
>>106773195
you act like more competition is a bad thing. we dont know how badly it performs until they release it. more competition breeds innovation.
>>
glm betrayed its userbase by not doing another air despite it being 99.999999% of what people actually used
yeah i'm sure you ""local"" users are now suddenly running 300b, how about you all go to /aicg/
>>
>>106773194
Say that after she gives you a blowjob. I dare you.
>>
>>106773216
>t. seething poorfag
>>
>>106773216
There'll be another time.
>>
>>106773216
128gb is enough for IQ3 you have no excuse to neglect your sex life by not buying it.
>>
>>106773216
i like running 1T models locally. $800 for 512GB of RAM in a first world country isn't that expensive. Its like 60% of a weekly paycheck for me.
>>
>>106773216
How does one not use 0.0000001% of a model?
>>
>>106773205
it's you guys who ought to stop coping
>it is garbage with reasoning since reasoning blocks are 3 times longer than 4.5. but you don't need reasoning.

>using thinking model without thinking
>because you would spend a literal eternity waiting for your shitty cpumaxxing to generate the first line of actually readable shit
>coping that disabled thinking works great
>in a glm model
>>
>>106773254
and then you sit there watch it do 2000 tokens of reasoning at 3 tokens per second, yeah
or maybe you're using the api?
>>
>>106773266
uh huh, that is why everyone here, reddit, and the novelai, silly tavern, featherless, and ai assisted writing discords are all praising it, huh? That all mostly used claude before?
>>
>>106773283
maybe those places are more up your speed then? don't shit up /lmg/ with your trash, nobody wants it here
>>
>>106773283
you need to go back
>>
>>106773266
>using thinking model without thinking
it's a hybrid isn't it? perfectly within their intent to use it without reasoning
>>
>>106773297
go enjoy your nemo then brown
>>
My IQ1 GLM finished downloading...
>>
>>106773280
I get 6-7 tk/s for generation and 110tk/s for processing. I can go through 4K of tokens every 35 seconds, it's really not as slow as you are imagining.
>>
>>106773266
All this seething when you could be enjoying actually good ERP.
>>
Can your local model sing a duet with you?
>>
>>106773304
>within their intent
fucking lmao
nobody sane gives a shit what the "intent" is, only what the actual performance amounts to in real use
there's no such a thing as an actual hybrid model, they all underperform terribly with reasoning turned off
>>
>>106771319
Things your mom says after your dad leaves your room at night
>>
>>106773320
i get like 220tk/s for prompt processing on a 128GB DDR4 + 24GB GDDR6X system

Generation is still ~5tk/s but it's incredible how fast this is compared to Deepseek R1
>>
>>106773331
not really
>>
>>106773359
ooooooh buuuuurn
>>
>>106773331
if you want a non-thinking then go run k2
it's much better than glm provided you are not poor
>>
File: kimiduet.png (264 KB, 1245x1176)
>>106773324
even kimi thinks this shit is cringe as fuck
>>
>>106773424
no it is not, k2 is retarded / schizophrenic
>>
>>106773431
i bet you ran it below q8
>>
>>106773324
>"Anon" with fem pfp
>>
>>106771205
>NeMoE
https://www.youtube.com/watch?v=qByKEu0zdco
>>
>>106773439
q6, and I still run glm at q8
>>
>>106773442
Got another one for me? I'll change it if you are feeling hurt.
>>
>>106773366
i could only get 140tk/s with 96GB of VRAM and the rest offloaded into RAM. what's your context? i was running at 64K and had the first 31 layers loaded into VRAM.
>>
>>106773467
changing it to something like Anona is a few clicks away
>>
>>106773523
But I want to continue roleplaying as a cute catboy. "Anona" sounds like a girl's name.
>>
>>106773564
lose some weight anon and take a shower
>>
>>106773584
Requests must be received through handwritten letters in flawless Palmer method business writing, and will be considered after appropriate payment has been confirmed. I accept and await receipt of 3 RTX PRO 6000 Blackwell Workstation edition GPUs.
>>
I'm just trying out GLM 4.6. Never tried GLM 4.5. Is it supposed to take much more VRAM per growing context length? I have -fa on.
>>
>>106773651
*compared to GLM 4.5 Air
>>
File: thing.png (1.45 MB, 832x1248)
>>106773564
well banano gave me this gay thing
>>
>>106773651
yes it's retarded as fuck how much VRAM it uses. check this shit for 64k

llama_new_context_with_model: n_ctx = 65536
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 6144.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 5888.00 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 6144.00 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 5632.00 MiB
llama_new_context_with_model: KV self size = 23808.00 MiB, K (f16): 11904.00 MiB, V (f16): 11904.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model: CUDA0 compute buffer size = 3105.77 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 1136.02 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 1136.02 MiB
llama_new_context_with_model: CUDA3 compute buffer size = 2448.00 MiB
llama_new_context_with_model: CUDA4 compute buffer size = 274.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1104.05 MiB

meanwhile kimi at 40k

llama_new_context_with_model: KV self size = 2745.00 MiB, c^KV (f16): 2745.00 MiB, kv^T: not used
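If the GLM cache hurts that much, quantizing it claws a lot back at the usual quality cost. You already have flash attention on, which llama.cpp at least needs for the V side, so something like:

-fa -ctk q8_0 -ctv q8_0

should roughly halve that ~23.8 GiB of f16 KV cache; q4_0 shrinks it further but degrades long-context quality more noticeably.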
>>
>>106773675
You look cute!
>>
https://xcancel.com/deepseek_ai/status/1973331587774230573
HOLY FUCK
>>
>>106773737
HHHHHNNNNNNNNNGGGGGGGGGGGGGGGGGGGGG
>>
>>106773737
nice
>>
>>106773737
Are we back?
>>
>>106773765
I don't know retard, are we?
>>
>>106773737
this. changes. everything.
>>
>>106773737
sama thought he won when he stole miku at that concert
but local models are now back
>>
I wonder if we have people trolling or its just legit just LLMs responding
>>
>>106773737
b-bakana......
>>
>>106773813
im cooming and respoonding thanks to the power of local LLMs
>>
>>106772970
>hfoption
>gooption
>becaus
>torent
>torren
>optionl
>direcly
>termminal
>downlaod
>>
>>106773737
NANI?!
>>
>try IQ1 GLM 4.6
>get not even 6 t/s
It's actually over.
>>
>>106774016
Q1 is slow as shit somehow because it's not very optimized as far as I've heard. I once did some testing on R1-0528 and it turned out that Q1 ran about the same speed as Q5 and thus slower than Q4 on DDR4 8-channel
>>
File: mj53dnsrsper4jkp.gif (380 KB, 480x498)
>>106773216
>being this butthurt over not being able to afford some more cheap ddr4 ram to run local sota
must suck to be you
>>
I wish tavern allowed me to reference multiple lorebooks in a chat. i've shifted towards 'universal' lorebooks for recurring story settings and using the scenario field for specific character and summary info, but the scenario field has an annoying character limit, and gets cleared when you change a card name and maybe other things, so I have to remember to keep a hidden copy in a system note at the beginning of each chat. I hate having to remember to turn things on and off in the universal lorebook, especially when I have maybe 8 groups dedicated to a single world, for example. I suppose I could use author's note, now that I can use the thinking field or system notes for what I used to use author's note for, but it's easy to forget what you have in there. things like the director extension help my autism, but it's usually world history and setting and 5e stats that I'm worried about rather than clothing or weather. I probably need to take another look at it and play around with how it works internally, the concept of it can fix a lot of my problems
I like tavern for a lot of its features but I hate that I'm afraid of updating it. I should probably git checkout more often because maybe they're fixing janky things about it and maybe new features are worthwhile. I like the idea of the checkpoint branching stuff, but what I need are better ways of grouping things, moving things around, and copying stuff like groups. I do a lot of stuff manually in the files but it's annoying. Stuff like the tagging system but for individual chats, and a way to browse that easily in the interface would make my life a lot easier as well
>>
>>106774215
>I wish tavern allowed me to reference multiple lorebooks in a chat
stopped reading here because it does
>>
>>106773617
sex with suiseiseki's ball joints
>>
Does anyone have that guide on how youre supposed to load the really big models on the 24/128 deal? I keep getting the run out of vram error on ooba even though the estimated ram usage is below 24 on the gpu layers
>>
>>106773737
kek I didn't know only the post ID mattered
>>
File: additionallorebooks.png (16 KB, 464x243)
>>106774215
bruh
>>
>>106774461
Dunno about ooba. Here's one of my configs for ik_llama.cpp, single 3090 + 128: https://files.catbox.moe/homknt.txt
>>
henlo
I could totally keep using all of the proprietary models but I want to switch to local models for purely ethical reasons
>>
High-Fidelity Speech Enhancement via Discrete Audio Tokens
https://arxiv.org/abs/2510.02187
>Recent autoregressive transformer-based speech enhancement (SE) methods have shown promising results by leveraging advanced semantic understanding and contextual modeling of speech. However, these approaches often rely on complex multi-stage pipelines and low sampling rate codecs, limiting them to narrow and task-specific speech enhancement. In this work, we introduce DAC-SE1, a simplified language model-based SE framework leveraging discrete high-resolution audio representations; DAC-SE1 preserves fine-grained acoustic details while maintaining semantic coherence. Our experiments show that DAC-SE1 surpasses state-of-the-art autoregressive SE methods on both objective perceptual metrics and in a MUSHRA human evaluation. We release our codebase and model checkpoints to support further research in scalable, unified, and high-quality speech enhancement.
https://lucala.github.io/dac-se1/
https://github.com/ETH-DISCO/DAC-SE1
Repo isnt live. Might be cool
>>
>>106774487
aren't you just the cutest thing
>>
File: Base Image.png (1.71 MB, 1244x3388)
Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls
https://arxiv.org/abs/2510.01631
>Training data plays a crucial role in Large Language Models (LLM) scaling, yet high quality data is of limited supply. Synthetic data techniques offer a potential path toward sidestepping these limitations. We conduct a large-scale empirical investigation (>1000 LLMs with >100k GPU hours) using a unified protocol and scaling laws, comparing natural web data, diverse synthetic types (rephrased text, generated textbooks), and mixtures of natural and synthetic data. Specifically, we found pre-training on rephrased synthetic data \textit{alone} is not faster than pre-training on natural web texts; while pre-training on 1/3 rephrased synthetic data mixed with 2/3 natural web texts can speed up 5-10x (to reach the same validation loss) at larger data budgets. Pre-training on textbook-style synthetic data \textit{alone} results in notably higher loss on many downstream domains especially at small data budgets. "Good" ratios of synthetic data in training data mixtures depend on the model size and data budget, empirically converging to ~30% for rephrased synthetic data. Larger generator models do not necessarily yield better pre-training data than ~8B-param models. These results contribute mixed evidence on "model collapse" during large-scale single-round (n=1) model training on synthetic data--training on rephrased synthetic data shows no degradation in performance in foreseeable scales whereas training on mixtures of textbook-style pure-generated synthetic data shows patterns predicted by "model collapse". Our work demystifies synthetic data in pre-training, validates its conditional benefits, and offers practical guidance.
very cool. from meta. seems things are better than we thought
>>
Have I been gaslit or should a 94gb model be able to fit in 136GB (72VRAM+64DDR5)?
>>
>>106774647
depends
>>
>>106774659
Depends on what?
>>
File: Base Image.png (799 KB, 1200x2576)
Diffusion^2: Turning 3D Environments into Radio Frequency Heatmaps
https://arxiv.org/abs/2510.02274
>Modeling radio frequency (RF) signal propagation is essential for understanding the environment, as RF signals offer valuable insights beyond the capabilities of RGB cameras, which are limited by the visible-light spectrum, lens coverage, and occlusions. It is also useful for supporting wireless diagnosis, deployment, and optimization. However, accurately predicting RF signals in complex environments remains a challenge due to interactions with obstacles such as absorption and reflection. We introduce Diffusion^2, a diffusion-based approach that uses 3D point clouds to model the propagation of RF signals across a wide range of frequencies, from Wi-Fi to millimeter waves. To effectively capture RF-related features from 3D data, we present the RF-3D Encoder, which encapsulates the complexities of 3D geometry along with signal-specific details. These features undergo multi-scale embedding to simulate the actual RF signal dissemination process. Our evaluation, based on synthetic and real-world measurements, demonstrates that Diffusion^2 accurately estimates the behavior of RF signals in various frequency bands and environmental conditions, with an error margin of just 1.9 dB and 27x faster than existing methods, marking a significant advancement in the field.
https://rfvision-project.github.io/
pretty neat
>>
>>106774666
model quant, context, context quant, fa and other stuff
>>
>>106774686
I'm talking about GLM4.6 iq2_m
>>
>>106774647
easily
I can fit a 145gb model quant in 152gb
>>
llama_model_load: error loading model: missing tensor 'blk.92.nextn.embed_tokens.weight'
Why does this happen?
>>
>>106774728
you need to build the newest version of llama.cpp to run glm4.6
>>
File: Base Image.png (911 KB, 1200x2936)
ExGRPO: Learning to Reason from Experience
https://arxiv.org/abs/2510.02245
>Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.
https://github.com/ElliottYan/LUFFY/tree/main/ExGRPO
Code not posted yet
https://huggingface.co/collections/rzzhan/exgrpo-68d8e302efdfe325187d5c96
>>
>>106774797
Neat
>>
looks like glm4.6 should get good speeds on a regular gaming pc
https://www.reddit.com/r/LocalLLaMA/comments/1nwimej/glm_46_local_gaming_rig_performance/
>>
is running at q8 on ooba or running a smaller quant on fp16 on ooba better?
>>
you people are running any models on normal RAM not Vram?

Isn't that like an hour per prompt
>>
>>106774958
>Q2
No, thanks. I will keep using gpt-oss.
>>
>>106774985
yeah
many of us are running it on ssd too which takes days
>>
>>106774989
>gpt-oss
ooof, masochist
>>
>>106774985
I stream my models from burned blu-ray discs
>>
>>106775087
Relevant

https://www.datacenterfrontier.com/cloud/article/11431537/inside-facebook8217s-blu-ray-cold-storage-data-center
>>
>>106774958
>>106774989
imagine 128GB modules and consumer boards supporting 512GB. why cant we live in this reality?
>>
>>106775240
Wish granted, but it's still dual channel.
>>
>>106775240
ddr6 should be a big jump in 2 years
>>
>>106775287
actually 3 years. i am waiting for the gb300 dgx station that is coming in a couple months. i hope it is less than $30k
>>
>like a physical blow
>She clasps her hands together, her knuckles white
Ahhh that's the stuff! GLM-chan I've missed you.
>>
where glm 4.6 air
>>
>>106775344
you're breathing it
>>
Can I do anything with 1x 4090?
>>
>>106775412
no. just give it to me for free
>>
>>106775337
nkdshi mska mska
>>
>>106775412
Mistral Small 3.1/3.2
Gemma 27b
Nemo 12b
Qwen 30bA3B
If you're looking to run fully in VRAM, these 4 are your best options, which is better depends on preferences and use case.
Beyond that, there's large MoEs like GLM that everyone is shilling lately, if you have at least 64GB then you can try GLM Air. If you have 128GB+ then GLM 4.6
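For the fully-in-VRAM options a launch is as simple as something like this (filename is just an example, pick whatever quant actually fits in 24GB alongside the context):

llama-server -m Mistral-Small-3.2-24B-Q4_K_M.gguf -ngl 99 -c 16384 -fa

-ngl 99 keeps every layer on the card; drop the context or the quant size if it doesn't fit.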
>>
chat, which small model (for 8gb vram) passes mesugaki test?
>>
>>106775453
>>106775412
>if you have at least 64GB then you can try GLM Air. If you have 128GB+
For this, I'm talking about regular RAM, using it in addition to your 4090.
>>
>>106775424
"We must refuse," says MISAKA as she attempts to protect her sister from the strange man.
>>
>>106775557
This is pedophilia
>>
>>106775593
They are both children so it's okay
>>
>>106775557
Kuruko really carried that mid anime desu, yuri is always welcome after all
>>
>>106775557
Game Master: **[EXCEPTION]:** Your hands find no protrusions on the surface, nothing to squish or to grip on to.
>>
File: 1755567910777537.png (203 KB, 399x399)
Found any GPT sources yet? Gemini has been obnoxiously prudish this past week even though my jb was working fine last month.
>>
>>106775630
Are you lost?
>>
>>106775632
Yes I'm sorry I thought I clicked /aicg/
>>
>>106775455
>chat
go back
>>
File: 1729922535573891.png (655 KB, 2005x741)
>>106775593
out of 10!
>>
File: 1737993304431567.jpg (131 KB, 1024x1024)
What glm quants are the fastest? Preferable in q4-q6 range
>>
File: wtf.png (100 KB, 576x446)
>>
>>106775784
How many ollama users could actually make an mi50 work anyway?
>>
>>106775784
who in their right mind uses that ollama abomination anyway
>>
>>106775781
Tbh it would be interesting to see speed benchmarks of the different quants. Don't remember anyone doing that.
>>
File: 1749749341258054.png (350 KB, 1180x1630)
What causes this mental illness?

I used to laugh at these people, now I just feel sad
>>
>>106775886
>luna lactea
>jackemled@furry.engineer
uhh seems being furry tranny causes it
>>
>>106775886
Furry aside, making it easier for scientists and researchers to shit out more Python is the last thing we need.
>>
>>106775903
I do actually use ai as my retard indian intern. These fearmongering retards dont realize that ai is just a tool, my ratio of actual work done / hours worked has never been as good as now. This week I coded a total of 2 hours.
>>
>>106775279
>Wish granted, but it's still dual channel.
I don't understand why Intel doesn't just market a 8 channel prosumer motherboard for Xeon Scalable. The processors aren't that expensive.
>>
>>106775886
He's right though, the intuition to know where the negative proof lies was developed by actually getting there. The ability will atrophy.
>>
>>106775987
I have a guy in my group who works as a vibe coder for n8n. He can't code but he can get UI done with some cloud tool.
>>
File: 1742952981750258.png (3 MB, 1716x1888)
>>106776023
>>
>>106776049
>socially unacceptable to touch your patient
>totally cool to guzzle down his piss
the middle ages were a strange time
>>
>>106776049
i was born in the wrong era
>>
File: 1756155999480183.png (2.38 MB, 1248x1824)
>>106776068
out of sight, out of mind
>>
>>106775843
>>106775781
https://github.com/ikawrakow/ik_llama.cpp/discussions/164
Tests are old and were done with a small model entirely on CPU, but the hierarchy of results is the same today, including with a GPU, assuming that at least some significant portion of the model is being loaded into system RAM.
>>
>>106776068
It makes perfect sense. It's not socially acceptable to physically examine other people's bodies in general. Why would that change just because one person is a doctor?
>>
>>106774985
I assume most of us are running the models on RAM primarily with some offload to GPU for context and stuff. On a DDR4 system it's generally limited to ~5t/s. It's slow if you want it to write script (which has limited tokenisation extent) but not unusable. One nice thing about GLM is that it goes through the prompt at like 200+ t/s vs Deepseek taking it at like 15 t/s
>>
I've been writing with IQ1 a bit. It's surprisingly not unusable. But it's somewhat dumb, repetitive, sloppy, and just not great. Yet it's also not necessarily worse than Air or 235B at the same memory size. Just different. Maybe if I had 192GB RAM instead it would be a lot better.
>>
>>106776217
4.6? I had a great experience with ubergarm's iq2_kl, it just knows how to write unlike 4.5 and qwen. Even at temp 1.
>>
>>106776188
Because you're asking said doctor to find out wtf is wrong with you?
>>
>>106776217
4.5 full gets good at around iq2, with the knowledge coming back at iq3 and above. DS v3 is the only model I found usable at IQ1. Usable, not good. DS needs at least q2k to feel normal.
>>
>>106776238
>>106776241
Yeah, 4.6. Q2 sounds cool. Q1 is the best I can do.
>>
A genetically modified mouse, genetically engineered mouse model (GEMM)[1] or transgenic mouse is a mouse (Mus musculus) that has had its genome altered through the use of genetic engineering techniques. Genetically modified mice are commonly used for research or as animal models of human diseases and are also used for research on genes. Together with patient-derived xenografts (PDXs), GEMMs are the most common in vivo models in cancer research.
>>
What did anon mean by that.
>>
>>106776273
The time of Local Mouse General is approaching. LLMs can't stay around forever. What's the next step? Wetware.
>>
Machine learning, GEMM, trans, mice to human trials, ARGH I'm noticing things
>>
>>106776293
Mus muculus GEMM transgenic mice are not real, take your meds weirdo
>>
>been using nothink for a while since it's slow and I wanted fast responses
>try out thinking for the first time
>the output is immediately more cucked
Fucking hell.
>>
can't even take a joke
>>
>>106772963
wholesome but local isn't there for real time stuff, gotta use an api like chatgpt 4o
>>
>>106776348
It's not there yet for real time stuff on grandma's pc
>>
>>106776363
Mossad won
>>
>>106776363
lmao, we won!
>>
File: 1742082385341710.png (458 KB, 1071x1011)
How much ram do you guys have? How much do you recommend?
>>
>>106776386
192 but I have dual channel setup and most models are too slow at a quant that takes up all of it so I usually keep it below 140-150gb
>>
>>106776386
128GB minimum to begin playing with the best open models
>>
>>106773216
At about q2, fat glm fits my gayming rig of a 5090 and 128 gb of ram and even at low quant it beats glm air. It's not some server motherboard or whatever cuz I'm too lazy to go and buy a whole new setup just for inference and was waiting to see if those llm shitboxes from nvidia and amd get interesting next year, but yeah you can run 300b locally without cpu maxxing
>>
>>106775784
lmao
They're not even the ones maintaining hardware support, they're literally just cockblocking their users.
>>
>>106776386
96
either 192 or >500 (server/mac studio)
>>
Does task manager showing full load on all cores mean my tps is bottlenecked by cpu?
>>
>>106776529
Not always. 100% in task manager does mean the CPU is working at max capacity, and it says nothing about active memory throughput, same goes for GPU core % in other monitoring software.
>>
>>106776551
does not mean*
>>
>>106773324
nice
>>
>>106776551
Thanks
>>
File: a100.png (312 KB, 1892x835)
Don't understand why A100s are still so pricey when RTX 6000 Pro is half the price. HBM vs GDDR but bandwidth is similar
>>
>>106776566
Something to do with nvidia's licensing when it comes to running consumer stuff in commercial servers? "I know what I got"-type eBay greed, not wanting to sell for so much less than it was worth in the past?
>>
>>106776597
Sounds plausible, forgot about their loicense shenanigans. Maybe there's better NVlink and virtualisation support, but doesn't seem there'd be any reason to pay more for personal inference uses unless I'm missing something obvious. Recall seeing them going secondhand for like 8K in the early Llama days, was tempted but it seemed like an insane price for one GPU, and now here we are.
>>
>>106776566
I was literally about to ask the same, lul. Yeah, BBCwell all the way. Now the question is how many to run a good glm4.6 quant.
>>
<|assistant|>\nMiku: <think></think>\n
or
<|assistant|>\n<think></think>\nMiku:
>>
What is the right way?
>>
File: 1758879465831864.gif (2.1 MB, 270x480)
>>106776741
>using think models
>>
>>106776755
instruct mode with 80k token system prompt like my nigga claude
>>
>>106776741
use basic logic to deduce where the model would expect its thinking block to be
>>
>Oho~ Want my tight little backdoor, huh? Been saving it all just for you~ Mmmph… *she moans softly as her own tiny fingers start playing with herself.* Let’sfacetouchmyassfirstandgetitallwetandslipperyforyouokaybaby?
are you?
>>
>>106776780
Both make sense, I can see logic in either variant
>>
>>106776741
<|assistant|><think></think>Miku:
hi
<|user|>Anon:
omg it migu
>>
>>106776741
<|assistant|>\nFaggot<think></think>\nFaggot:
>>
File: GLM.png (56 KB, 486x706)
>>106776741
<|assistant|>\n<think></think>\nMiku:
>>
>>106775682
can't even make a little silly joke around you sour faggots...
>>
>>106775455
you can use ollama cloud to run local models even on modest hardware
>>
>>106775682
NTA but If you wanted a real answer, then as always, the answer is Nemo.
>>
>>106776825
The model never inserts \nMiku: after </think> if you allow it to think, so I assume it doesn't expect character name in the final answer. After some limited testing, I believe that <|assistant|>\nMiku: <think></think> without \n at the end better adheres to formatting. I guess I have to use it more and see if it wasn't a fluke
>>
>>106776959
It never inserts character names because it was not trained to insert them anywhere. If you are going to force them in, after the think block is the right spot, because there is never anything between assistant and think.
>>
>>106777047
>there is never anything between assistant and think
Indeed, except a newline according to z.ai .jinja template
>>
>>106776929
alright faggot you're looking for nemo or rocinante, use lm studio because i know you're running on windows and it should show you which quant fits on your 8GB. good luck.
<|spoonfeed_end|>
>>
>>106776386
Some consumer motherboards support 256gb, look it up online so you don't lock yourself into a lower ram capacity. You can never have enough ram.
>>
>>106777198
The motherboard could have 3000 DIMM slots but the memory controller physically limits how much memory the CPU can address. 9K series Ryzen, for example, caps out at 192GB. So you could put 2x128GB RAM kits in a motherboard with 4 DIMM slots but you won't get 256GB of useable RAM.
>>
>>106777114
Exactly. There is also a newline after the empty think block, according to their template.
>>
>>106777256
I would take how much the CPU officially supports with a grain of salt because my 13700k officially supports 192gb but I'm running 256gb, which my mobo does support. You might need to win the silicon lottery but there's no harm in trying it if the dimms can be returned.
>>
>>106777256
Here's a link I found after 30 seconds of googling of people running 256gb on a 9950x:
https://forum.level1techs.com/t/256gb-4x64gb-ddr5-overclocking-results-w-9950x-and-msi-mag-x670e-tomahawk/228651
>>
>>106777351
9950X is like best binned silicon so if it is a silicon lottery thing you'd expect a lot of winners at that level.
Might be designed for 256GB but then under-declared because there's some lottery losers that couldn't do the full thing in testing.
>>
>>106777408
>>106777408
>>106777408


