>>106763907
Please do. Q4 working really well might just be me deluding myself, but I really haven't noticed any instance where 4.5 at Q4 slipped up and Q8 didn't.
>>106763914
I'm currently using a really basic one on standard llama.cpp server. I'm not even making full use of my A6000 right now.
./llama-server --model ./zai-org_GLM-4.6-Q4_K_M-00001-of-00006.gguf --n-gpu-layers 99 -b 4096 -ub 4096 --override-tensor exps=CPU --parallel 1 --ctx-size 32000 -ctk f16 -ctv f16 -fa on --no-mmap --threads 32 --host 0.0.0.0 --port 5001
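If anyone wants to poke at it once it's up, llama-server exposes an OpenAI-compatible endpoint, so something like this works (a sketch assuming the server above is running on port 5001; the model/prompt values are just placeholders):

```shell
# JSON body for a chat completion request; llama-server only has the one
# loaded model, so the prompt/params are all that really matter here
BODY='{"messages":[{"role":"user","content":"hello"}],"max_tokens":64}'

# POST it to the OpenAI-compatible chat endpoint the server exposes
curl -s http://localhost:5001/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "$BODY"
```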