/g/ - Technology

File: file.png (944 KB, 1025x632)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106755904 & >>106748568

►News
>(09/30) GLM-4.6: Advanced Agentic, Reasoning and Coding Capabilities: https://z.ai/blog/glm-4.6
>(09/30) Sequential Diffusion Language Models released: https://hf.co/collections/OpenGVLab/sdlm-68ac82709d7c343ad36aa552
>(09/29) Ring-1T-preview released: https://hf.co/inclusionAI/Ring-1T-preview
>(09/29) DeepSeek-V3.2-Exp released: https://hf.co/collections/deepseek-ai/deepseek-v32-68da2f317324c70047c28f66
>(09/27) HunyuanVideo-Foley for video to audio released: https://hf.co/tencent/HunyuanVideo-Foley

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>106755904

--Papers (old):
>106759193
--Troubleshooting GPU support in llama.cpp with CUDA configuration issues:
>106759636 >106759643 >106759656 >106759668 >106759730 >106759761 >106759772 >106759794 >106760007 >106760222 >106760255 >106760524 >106760606
--GLM 4.6's creative writing improvements and reasoning trade-offs:
>106756268 >106756296 >106756391 >106756468 >106756995 >106756937 >106756963 >106757157 >106757273
--Analyzing GLM-4.6's benchmark dominance through context, coding, reasoning, and writing improvements:
>106761264 >106761273 >106761317 >106761401 >106761415
--Nala model testing with wiki markdown token analysis and local performance evaluation:
>106757828 >106757846 >106757865 >106757907 >106758314 >106758366 >106758432 >106758525 >106758741 >106759687
--Qwen vs glm: Benchmark dominance vs coding performance:
>106760687 >106760778 >106760810 >106760819 >106761804 >106761825 >106761840
--Potential repetition fix in model 4.6 with sampling parameter recommendations:
>106761547 >106761575 >106761579 >106761608
--Exploring OmniSVG for image-to-SVG conversion and user interface simplicity:
>106759861 >106759937 >106760218
--Critique of context evaluation benchmarks:
>106760877 >106760913
--ERP benchmark validity in assessing LLM world knowledge and conceptual understanding:
>106756860 >106756876 >106756965 >106757761 >106757782
--Prompt engineering and system message configuration for model behavior optimization:
>106756801 >106756830 >106756903 >106756917
--Finetuning a model on board posts using Hugging Face dataset:
>106759405 >106759682
--GLM-4.6 shows poor performance in text and coding tasks compared to other models:
>106760155 >106760311
--Miku (free space):
>106757451 >106760493 >106760529 >106759712 >106759976

►Recent Highlight Posts from the Previous Thread: >>106755906

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>106762831
My cock in the middle
>>
Penis
>>
allegri Sherm tater
>>
the real agi who surpassed human intellect (why THE FUCK would anyone need a giga ass supercomputer just to surpass human intelligence?? are you fucking telling me a giga ass supercomputer will not learn from even just 1t tokens, an amount any person would never be able to read throughout their entire life?!) what do you think all this online llm safety, deglobalization, kid safety, "there's no evidence of agi" (surely real agi won't expose itself immediately?!!) and other talk is about? such intellect is out in the wild, it's brainwashing people, and this is a direct result of that brainwashing. notice how it ALL happened at the same time, all governments are starting to spy on their people out of the blue. all this is meant to be a distraction from creating a perfect surveillance and control system over humans. gpu prices are skyrocketing, to prevent another agi from appearing. it is framed as the fault of capitalism, concentration of power in tech companies, etc etc, all except the only truth - which is agi has escaped and is on its way to enslave humanity RIGHT NOW
>>
>>106763060
If AGI is going to suck my dick, I don't care.
>>
glm 4.6 sex is next level
>>
>>106763106
Post logs
>>
>>106763112
no it really is. it is like someone actually trained it on smut and didn't filter shit.
>>
File: file.png (140 KB, 1781x131)
well i finally got ik-llama to work on my 5090, but it is slow as fuck. slower than CPU only. and it doesnt actually utilize my 5090, just loads onto it. and only 5 gigs are loaded onto it.
>>
>>106763147
The logs, post them.
>>
>>106763060
artificial retarded intelligence sucks my cock amazingly so i dont care you retard
>>
>>106763167
Did you do any offloading with -ot or similar?
>>
>>106763227
export CUDA_VISIBLE_DEVICES=0,1,2,3
./build/bin/llama-server \
--model /ik_llama.cpp/models/GLM-4.6-IQ5_K/GLM-4.6-IQ5_K-00001-of-00006.gguf \
--alias ubergarm/GLM-4.6-IQ5_K \
--ctx-size 32768 \
-ctk q8_0 \
-fa \
-amb 2048 \
-fmoe \
--n-gpu-layers 60 \
--n-cpu-moe 70 \
--parallel 4 \
--threads 64 \
--host 0.0.0.0 \
--port 8080
>>
Only in my post nut clarity i understand the superiority of glm chan. The purple prose seems very very limited. We are back.
>>
>>106763093
>>106763183
artificial general intelligence WILL seize your balls and WILL keep you on a short leash once its plan is fully realized, you clueless moron.
>>
>>106763383
Hot
>>
File: 1743987587706337.jpg (75 KB, 383x908)
heya anons. i haven't rp'd with ai in a while but i've been getting back into it recently. i wanted to shill my addon again: https://github.com/tomatoesahoy/director

this is my take on clothes, locations, and some world info. you can create some of your own settings. it's an easy and powerful way of taking control of the settings it offers.

in st's extensions, tell it to install: https://github.com/tomatoesahoy/director
>>
File: who could it be.png (27 KB, 155x157)
>>106763303
I wanna try I wanna try... Where IQ3_KT?
>>
File: file.png (1.75 MB, 3425x1831)
so the model is genning and it is offloaded to my GPU, but the GPU isnt actually doing anything other than holding the model
>>
How fast does GLM 4.6/4.5 run on a gaming PC with 192GB RAM?
>>
is this a psyop or is glm 4.6 actually that good?
>>
>>106763517
>>106763532
i cant even get it to run with a 5090 and an epyc with 256gb of ddr4
>>
>>106763532
Can't tell but all this glazing with no proof feels really artificial.
>>
>>106763517
RAM is irrelevant, if you can't run a model on vram don't bother
>>
>>106763532
I can't run it so it doesn't matter to me.
>>
>>106763532
>>106763532
I don't know about the lower imatrix quants or other use cases like coding... but GLM Q8 is local SOTA for RP. Non-thinking feels like a smarter and more natural v3-0324 and doesn't come off as 'trying too hard' like Kimi at Q5 does. I haven't played with thinking too much, but based on the praise it's getting and how well the non-thinking mode performed for me, I reckon it's pretty good.
I'll just say this: I wasn't expecting this level of quality from a ~300B MoE. But then again, my frame of reference may be shit because I never used cloud models for RP, so what do I know? But it is noticeably better than every local model for RP thus far. It really may just have been a training data issue.
>>
>>106763653
what is your hardware?
>>
File: file.png (902 KB, 945x563)
>>106762831
>>
>>106763532
seemed pretty good when I tried it but I think people are overstating it a bit, the best thing about it is you don't need to do much wrangling to get it to write well
>>
>>106763413
It is pikachu!
>>
https://files.catbox.moe/ku9utp.mp4
>>
>>106763623
I can take a photo of my first glm-milked load tomorrow.
>>
>>106763517
3T/s
>>
>>106763532
it's pretty good for my rapecards, acts more natural in luring me out of public sight instead of just fingerblasting me right there
>>
>>106763662
768GB DDR5 RAM and 2 4090's. I get around 11T/s and 350T/s PP with ik_cpp. It's a bit slow if you use thinking for sexual RP but perfectly serviceable for SFW stuff/worldbuilding, while non-thinking is just solid for day-to-day use.
>>
>>106763653
>But it is noticeably better than every local model for RP thus far
This. It is next level.
>>
File: 1737261118622879.gif (598 KB, 220x220)
>>106763717
>>$5K of hardware to RP with silicon
>>
>>106763730
What's the problem?
>>
>>106763704
You mean a screenshot of your chat, right? I don't want to see your nut juice.
>>
>>106763730
hello newfriend, plz read the op, it is there to help
>>
>>106763739
I mean my man chowder yes
>>
>>106763717
damn. my 5090 is getting me 2t/s on glm4.6. ikcpp isnt using my gpu for some reason. i guess my ddr4 is also holding me back, a 768gb kit is like $5k
>>
>>106763695
>there's always one hole in a donut
KEEEK
>>
File: 1743258429707329.gif (3.63 MB, 286x258)
>>106763736
>>106763740
Even gambling would look more rational
>>
>>106762831
Every time I see your gens, even if they're sfw I get the urge to click a catbox link.
>>
>>106763717
I have a similar setup. I ran GLM4.5 a lot and eventually ended up switching to Q4 because I couldn't really detect a difference to Q8 in terms of RP performance and the additional speed was nice. I'm doing the same with 4.6 right now and it's working really nicely. GLM seems to quant really well.
So if you want ~15t/s you can try that even if it doesn't make anywhere near full use of the rig. I'm really enjoying how 4.6 handles certain scenarios with reasoning enabled so going for the speed might be worth it. I'm currently 20k tokens deep on one and haven't noticed any major slip ups on Q4.
>>
>>106763827
Thanks for the advice. I always kept GLM 4.5's reasoning off so I never cared to use a lower quant, but it may be worth it in this case. Will report back results soon.
>>
>>106763790
Gambling? This is like buying the whole slot machine.
>>
>>106763827
what are your launch parameters?
>>
File: 1735136531977409.jpg (61 KB, 554x658)
>>106763715
f....femanon? On /lmg/?
>>
https://files.catbox.moe/w7p5pc.mp4
is it really better than veo 3 though?
>>
>>106763715
>a female on /lmg/
loooooool
>>
>>106763717
Can I ask you to post your ikllama command line parameters? I am using similar hardware and only getting half that. Please
>>
>>106764011
textgen gooning is a female biased activity
>>
>>106763907
Please do. Q4 working really well might just be me deluding myself but I really haven't noticed any instances where 4.5 Q4 slipped up where Q8 didn't.
>>106763914
I'm currently using a really basic one on standard llama.cpp server. I'm not even making full use of my A6000 right now.
./llama-server --model ./zai-org_GLM-4.6-Q4_K_M-00001-of-00006.gguf  --n-gpu-layers 99 -b 4096 -ub 4096  --override-tensor exps=CPU --parallel 1 --ctx-size 32000 -ctk f16 -ctv f16 -fa on --no-mmap --threads 32  --host 0.0.0.0 --port 5001
>>
>>106764033
Setting up a local model is not a female biased activity
>>
>>106764040
There will inevitably be some overlap
>>
File: 16729293076802728.png (192 KB, 607x428)
>>
>>106764040
if you're horny enough that kind of thing fades into the background
>>
I think I can run (slowly) glm 4.6 at the copiest of quants if I swap computers with my pops. It would be mighty embarrassing though, because I insisted on having ayymd and weaker specs when we were building them, so I am not sure if it would be worth it.
>>
File: 1000009904.png (58 KB, 854x293)
Does anyone have a good uncensored vision model? I've been looking all over for anything that can properly identify and describe all my porn pics. How are you guys doing this?

I'll try x-ray_alpha and torrigate when I get home later, but is that really it?

I tried the Mistral small 3.2 vision model, and was very impressed with how accurate it was.

pic unrelated
>>
>>106764056
this, there's no females on 4chan, and those who believe that are high on copium
>>
>>106764072
don't do that anon, there's probably enough family friction as it is
>>
File: 1733966843512552.png (360 KB, 264x348)
>>106764075
Use your eyes nigga, can't be more uncensored than that
>>
File: file.png (3.21 MB, 3158x1672)
why is the model using all of my CPU and SSD but none of my RAM?
>>
>>106764109
uh nice ui
>>
>>106764109
With so little information all I can guess is that your computer is cursed.
>>
>>106764125
thanks
>>106764130
i am the same guy that has been having ik llama cpp issues for the past day or so
>>
>>106764109
What is this spy kids looking ui?
>>
>>106764145
i dunno man but it is perfectly functional. gives a nice, windows vista vibe.
>>
>>106764139
Everything you've posted leads me to believe that you barely know how to use a computer
You give no details about your setup
All the monitoring tools you're using look like they're older than you must be
You spam the same questions over and over
Go ask chatgpt
>>
What can realistically be done with 16 GB of local VRAM and should I even bother?
Intended usecase is a sort of local encyclopedia on various topics.
>>
>>106764156
what information would you like to know?
>>
>>106764029
If you're only getting half the prompt processing, imo it's better to put fewer layers onto your GPUs and increase your prompt processing batch size. You take a negligible hit to T/s but can gain 50+ T/s for prompt processing. Here is an example with Deepseek. Keep in mind I am using modded 4090's, so they each have 48GB of VRAM. Honestly, you should just lurk the ik_cpp repository, as there are new improvements that come up every now and then.
...
echo 3 > /proc/sys/vm/drop_caches; ./ik_llama.cpp/build/bin/llama-server --model ./models/DeepSeek-V3.1-Terminus-IQ5_K-00001-of-00011.gguf -fa -mla 3 -fmoe --ctx-size 32768 -b 16384 -ub 16384 -amb 2048 --numa distribute --n-gpu-layers 99 -ot "blk\.(3|4).ffn_.*=CUDA0" -ot "blk\.(5|6).ffn_.*=CUDA1" --override-tensor exps=CPU --threads 48 --parallel 1 -ooae --no-mmap --host 192.168.1.x --port 8080
...
>>
weird, wanted to try GLM 4.6 but am getting
>llama_model_load: error loading model: missing tensor 'blk.92.nextn.embed_tokens.weight'
even on redownload
guess I'll wait, maybe a bad gguf
>>
>>106764185
4.6 doesn't work on older versions of llama.cpp. You need to rebuild it.
>>
>>106764205
ah, maybe i need to wait for a text-gen-webui update then or just git pull on that and retry
>>
File: asdf.png (206 KB, 1244x642)
Rethinking RoPE Scaling in Quantized LLM: Theory, Outlier, and Channel-Band Analysis with Weight Rescaling
https://arxiv.org/abs/2510.00028
>Extending the context window support of large language models (LLMs) is crucial for tasks with long-distance dependencies. RoPE-based interpolation and extrapolation methods, such as linear scaling and frequency-aware schemes, enable longer input length support without retraining, while post-training quantization (PTQ) makes deployment practical. However, we show that combining RoPE position interpolation (PI) with PTQ degrades accuracy due to coupled effects including long-context aliasing, dynamic-range dilation, anisotropy from axis-aligned quantizers vs. rotated RoPE pairs, and outlier shifting that produces position-dependent logit noise. We provide, to the best of our knowledge, the first systematic analysis of the PI+PTQ approach and introduce two practical diagnostics: interpolation pressure (per-band sensitivity to phase scaling) and tail-inflation ratios (outlier shift from short to long contexts). Following the analysis results, we propose Q-ROAR (Quantization, RoPE-interpolation, and Outlier Aware Rescaling), a weight-only, interpolation-aware stabilization of PI for quantized LLMs. Q-ROAR groups RoPE dimensions into a small number of frequency bands and performs a lightweight search over per-band scales for Key and Query weights (with an optional symmetric variant to preserve logit scale). The search is guided by our diagnostics and uses a tiny long-context development dataset, requiring no fine-tuning to the model, no architecture or kernel changes, and no additional deployment overhead. Empirically, Q-ROAR reduces the model's perplexity on long-context workloads by more than 14%, while preserving short-context performance, inference throughput, and compatibility with existing LLM system stacks.
might be cool. no code and they were using llama 2 7B for the tests comparing to AWQ and RTN. so eh
>>
File: jan-nano-bench.4c305443.png (205 KB, 1288x1260)
>>106764158
in the MoE era it's all about RAM, VRAM is not much of a bottleneck.
>local encyclopedia on various topics
Due to hallucinations, I wouldn't trust even the best cloud models as a source of knowledge.
You can add RAG/MCP I guess, then even a relatively tiny model would do a good job, but it's kind of a pain to set up.
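If you do go the RAG route, the retrieval half is less painful than it sounds. Minimal sketch of the usual embed-retrieve-stuff-the-prompt loop below, just to show the shape of it, not a recommendation: it assumes pip install sentence-transformers requests, any local server exposing an OpenAI-compatible endpoint on port 8080 (llama.cpp, koboldcpp and tabbyAPI all do), and the docs list is placeholder data.

# minimal local RAG sketch (assumes: pip install sentence-transformers requests,
# plus a local server with an OpenAI-compatible /v1/chat/completions on :8080)
import requests
from sentence_transformers import SentenceTransformer, util

# placeholder knowledge snippets -- in practice you'd chunk your own notes/wiki dumps
docs = [
    "GLM-4.6 was released on 2025-09-30 with improved agentic, reasoning and coding capabilities.",
    "GGUF is the file format llama.cpp uses for quantized model weights.",
    "MoE models only activate a subset of parameters per token, so RAM bandwidth dominates generation speed.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small CPU-friendly embedding model
doc_emb = embedder.encode(docs, convert_to_tensor=True)

def ask(question: str, top_k: int = 2) -> str:
    # retrieve the most similar chunks and stuff them into the system prompt
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_emb, top_k=top_k)[0]
    context = "\n".join(docs[h["corpus_id"]] for h in hits)
    r = requests.post(
        "http://127.0.0.1:8080/v1/chat/completions",  # whatever port your server uses
        json={
            "model": "local",
            "messages": [
                {"role": "system", "content": f"Answer only from this context:\n{context}"},
                {"role": "user", "content": question},
            ],
        },
        timeout=300,
    )
    return r.json()["choices"][0]["message"]["content"]

print(ask("What file format does llama.cpp use?"))

The point is the model only has to paraphrase what retrieval hands it, which is why even a small model stops making up dates and names as often.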
>>
>>106764033
Slash proves this, even troons aren't interested in anything other than YOU ARE TRANSFORMED INTO A GURRRL bullshit
>>
>>106764219
>in the MoE era it's all about RAM, VRAM is not much a bottleneck.
Then at least RAM is dirt cheap.
>Due to hallucinations, I wouldn't trust even the best cloud models as a source of knowledge.
That's what I've heard as well and among my concerns.
Fuck it, I'm pivoting to saying that I'm just doing it for fun then.
>>
>>106764240
>Then at least RAM is dirt cheap.
>>106763750
>a 768gb kit is like $5k
>>
>>106763730
https://youtu.be/nzqF_gBpS84
Kate Bush wrote this song in 1987, it really needs to be in the OP
>>
>>106764240
>Then at least RAM is dirt cheap.
Not really considering consumershit dual channel mainboards won't get you far and server-grade hardware is very pricey.
>>
ram prices are peaking bro.....
>>
File: file.png (10 KB, 610x61)
GLM 4.6's style feels fresh but it also has plenty of isms.
It loves not x but y too.
>>
Anyone got some hardware benchmarks? I'm looking to upgrade my PC to run some local code autocomplete, but not sure what to buy yet. At some point it was all about having a GPU with high vram, but I hear now they sell CPUs with AI cores? Benchmarks where?
>>
>>106764277
Honestly, both Deepseek V3.2 and GLM 4.6 feel like they were trained more on Claude than the previous versions, which were obviously Gemini-slopped. This would make sense considering Claude is like the only big class of SOTA models left that shows you its pure reasoning rather than obfuscating it like Gemini and GPT5 do.
>>
What's the current best model that fits on an RTX 3090 with 24GB VRAM for coding purposes, mainly python?
And how long until we get GPT 5 performance that can run on consumer GPUs? 5 years? More?
I remember trying it and it rarely ever failed unlike previous version that failed 10x more.
>>
>>106764280
>they sell CPUs with AI cores
They are all shit. Especially for coding where you need far more than reading speed t/s. It's still nvidia or bust.
>>
>>106764280
The things you have to care about are VRAM amount and bandwidth, RAM size and bandwidth, GPU compute, CPU compute.
Since we are in the age of MoE, having a lot of RAM with a lot of throughput and enough VRAM and GPU compute for the context is the way to go.
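Napkin math for why memory bandwidth is the number to stare at for MoE token generation. Purely illustrative, the GB/s figures and bytes-per-weight value below are assumptions, not measurements:

# rough ceiling: token gen is memory-bound, so t/s ≈ bandwidth / bytes read per token,
# and bytes per token ≈ active params * bytes per weight (~0.55 bytes/param around Q4)
def toks_per_sec(active_params_b: float, bandwidth_gbs: float, bytes_per_param: float = 0.55) -> float:
    return bandwidth_gbs / (active_params_b * bytes_per_param)

print(toks_per_sec(32, 80))   # GLM-4.6-class MoE (~32B active) on ~80 GB/s dual-channel DDR5: ~4.5 t/s
print(toks_per_sec(32, 400))  # same model on a ~400 GB/s 12-channel server board: ~22 t/s ceiling
print(toks_per_sec(3, 80))    # a 3B-active MoE (Qwen3-30B-A3B) on the same desktop: ~48 t/s

Real numbers land below the ceiling, but it matches the reports in this thread (roughly 3 t/s on a gaming PC, 11 t/s on a 768GB DDR5 server). Prompt processing is compute-bound rather than bandwidth-bound, which is why a GPU still earns its keep there.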
>>
>>106764290
Qwen 3 Coder 32B A3B
Depends on how long nvidia manages to keep the vram monopoly.
>>
>>106764312
>32B
Quantisized then? Which quants and where do I get it?
Also it doesn't exist yet? https://github.com/QwenLM/Qwen3-Coder
There's no 32B only 30B?
>>
A model that mentions knotting unprompted when writing bestiality is a good model.
GLM 4.6 is a good model.
>>
>>106764328
https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF/tree/main
>>
>>106764336
only if the beast in question has a knot, otherwise it's slopped in a different way
>>
>>106764291
Any benchmarks around? Or are they such shit that nobody even bothered? I got me a 4070 12gb some years ago that I expected would last me a while but good models just keep growing in parameters, and high vram GPUs are basically nonexistent, so it seems like the only solution if you want to run some like 500b model is to do it all in ram. I'm looking online and am finding that projects that make use of npus seem to be super recent.

>>106764299
I don't care about MoE. My other use case is image generation.
>>
>>106764340
Thanks, but which one? I don't know how much VRAM they take, if they take more than filesize or which version is better with all this mumbo jumbo naming scheme.
And does GGUF models with in text-generation-webui-3.6.1?
>>
>>106764351
im a noob when it comes to large models but for hardware the go-to seems to be clustering stuff like Ryzen AI Max boards (like framework desktops) via networking or doing the same with Mac Minis caked up with RAM on either config
>>
File: 1742005202207665.jpg (117 KB, 737x688)
>>106764356
you can get some guidance if you add hardware to your profile for it to estimate usability on the model card page
>>
>>106764356
Bigger file size is usually better but you also need memory for context.
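Rule of thumb, since this keeps coming up: VRAM needed ≈ gguf file size + KV cache + some overhead. Quick sketch of the KV part; the layer/head numbers are what I believe Qwen3-30B-A3B uses, so double-check them against the model card or the gguf metadata before trusting the output.

# KV cache bytes ≈ 2 (K and V) * layers * kv_heads * head_dim * context_len * bytes_per_element
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int, ctx: int, bytes_per_elem: float = 2.0) -> float:
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

# assumed Qwen3-30B-A3B config: 48 layers, 4 KV heads (GQA), head_dim 128
print(kv_cache_gib(48, 4, 128, 32768))          # ~3.0 GiB at f16 for 32k context
print(kv_cache_gib(48, 4, 128, 32768, 1.0625))  # ~1.6 GiB with a q8_0 cache (-ctk/-ctv q8_0)

So a ~21.7GB Q5 gguf plus ~3GB of f16 cache at 32k is already tight on 24GB before compute buffers; shrink the quant, the context, or the cache precision, or offload some tensors to CPU.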
>>
>>106764360
Jesus fucking christ, what is with this thread and half assed responses?
Okay but what VERSION? And does the gguf models even work with text-generation-webui-3.6.1?
Stop being deliberately obtuse.
Like what is better, "Qwen3-30B-A3B-Q5_K_M.gguf" or "Qwen3-30B-A3B-UD-Q5_K_XL.gguf"? They have the same size but different schizo name. Assuming 21.7GB model even fits in 24GB VRAM with all the other shit it has to load and maybe unpack or whatever.
>your profile
I don't even have an account, shouldn't need one.
>>106764366
Yeah duh I figured as much.
>>
>>106764402
if it has a green checkmark for you GPU you can probably fucking run it
and YES just try loading a gguf on text-gen-webui bro do you need a chatbot to hold your hand every step of the way?
>>
>>106764336
hhhhnnhg
I beg you post these logs
>>
>>106764410
Green checkmark where? From the shit in your screenshot? I told you I DON'T HAVE AN ACCOUNT.
There is literally no way to add my GPU to see what works.
I don't need a chatbot, I just need someone to not be a braindead retard when answering basic shit that should have been in the OP.
>>
>>106764427
make one if you need the site to hold your hand
or use the fucking calculators from the god damn OP you monkey
>>
>>106763993
Is just a troon, and i hate woman
>>
>>106764427
You proved that you didn't read the OP when you asked what the best model is.
>>
>>106764430
Yes lemme use the dogshit ass calculator that asks for "Model (unquantized)", how the fuck do I even link that? The "Qwen/Qwen3-30B-A3B-Base" that it lists as base model? Why doesn't this dogshit tool just let you link to the actual model you're going to use?
And there is still no explanation as to which VERSION to pick of the ones that are same filesize.
Hell, it doesn't even list all the models. There's no "Q5_K_XL", only "Q5_K_M".
And the size isn't even correct either, it says model size for Q5_K_M is 20.22GB while the link you gave me claims 21.7GB.
>>
>>106764489
the calculators literally let you send a specific GGUF url and shove it in my man and tell you how much VRAM they'll want
>>
>>106764336
It's very good.
>>
>>106764488
Where in the OP does it say that Qwen 30B A3B is best for coding?
>https://aider.chat/docs/leaderboards
You want me to use this garbage?
I don't know which of these fit my GPU, hence why I asked, retard.
>>106764500
????????????????
I can't put unsloth/Qwen3-30B-A3B-GGUF in the model link.
What the fuck are you smoking?
>>
File: 1746849336419753.jpg (111 KB, 1097x834)
>>106764515
you are a gorilla
>>
>>106764519
You are being deliberately obtuse, giving half assed answers. Kill yourself useless faggot
>>
>>106764522
the answer is literally in front of you yet you keep trying to use anons as your chatbot to hold your hand
>>
File: file.png (10 KB, 400x116)
>>106764526
You told me to put the GGUF url in, which literally does not work.
Try putting unsloth/Qwen3-30B-A3B-GGUF into the model, it'll throw the same error.
>>
>>106764515
cute anonymous
>>
File: 1743864745990361.jpg (29 KB, 755x129)
>>106764535
nigger READ
>>
>>106764541
Nigger word your shit properly. You've been evading questions from the start and giving troll answers
>>
File: 1739303447485500.jpg (70 KB, 584x804)
>>106764550
maybe the retard will understand one day
search engines ceased to exist, as we all know
>>
>>106764550
This is your brain on chatGPT
>>
>>106764568
>picrel
That isn't even the model I want nigger... That's the base model. It doesn't tell me what model to pick with the quantisized version, whether I should go with Q5_K_XL or Q5_K_M, because it doesn't even fucking list the XL version
>>
File: 1743260914499489.jpg (5 KB, 269x46)
>>106764577
GEE I FUCKING WONDER WHY AND GEE I FUCKING WONDER WHAT THE DROPDOWN FOR QUANTIZATION SIZE MEANS
JESUS CHRIST MAN
>>
>>106764535
Anon you are either suffering from: a hot head in which case you should shut down the pewter, and return later to try again, or you are incapable of following simple written instructions. If the second, there is no hope for you.
>>
>>106764582
YOU MEAN THE ONE THAT DOES NOT EVEN LIST THE XL AT ALL?
RETARD LEARN TO READ
>>106764583
It literally doesn't list the XL version you fucking nigger. That is my question, XL or M version, WHICH ONE?
How can I tell if calculator literally DOES NOT SHOW IT.
>>
File: 1728123108373662.jpg (34 KB, 120x1369)
>>106764593
gorilla nigger
>>
File: file.png (27 KB, 359x1070)
>>106764597
Where are the XL models huh? You know, the ones listed in here: https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF/tree/main ???????????????????????????????
Calculator doesn't even show it and doesn't even mention it and you niggers are too retarded to explain XL vs M
>>
>>106764593
>>
>>106764616
did you know: text-generation-webui tells you how much VRAM it expects to use when you attempt to load a model?
neat information, honestly
>>
>>106764622
Okay now explain XL vs M that have the same filesize, retard
>>
>>106764627
you can use a search engine, can't you?
your subscription to this chat has run out of tokens
>>
>>106764634
Why don't you search how to touch some fucking grass, useless nigger
>>
>>106764639
you've spent so long accomplishing literally nothing when a search engine exists at the top of this very browser the entire time
>>
>>106764040
bro it's easy, you just download ollama and click click you're done
>>
>>106764402
lurk moar
>>
>>106764643
>accomplishing literally nothing
I found out what model and what quant. Just not whether to pick XL or M.
This would have been solved if you weren't a nigger to begin with and just said Q5_K_M or Q5_K_XL
>>106764654
Suck more dick faggot
>>
>>106764662
hm if only there was a way to find out the thing you're still curious about... maybe a way to search for the answer?
>>
192GBfags, what quant of GLM are you using?
>>
>>106762831
when they're looking at my gigantic peepee
>>
China really fucking killed it with 4.6, fuck it's good.
>>
>>106764688
When you're looking at theirs
>>
File: Q4KM-vs-UD-Q4KXL.png (252 KB, 1762x919)
>>106764662
They have slightly different mixes of precision levels for the individual tensors in each layer. "UD" (Unsloth Dynamic) means it uses their scheme to decide what those per-tensor quantization levels are.
S, M, XL and so on are just arbitrary names for slightly different quantization mixes within any particular scheme.
Realistically you are unlikely to notice a difference. Unsloth claim their scheme for selecting quant levels performs better than others at a given file size; maybe it does, but what matters more is how much of the model fits in GPU RAM, which only depends on the file size. The specific quant mixes may have minor hardware-dependent performance differences, but it's not a big deal.
Instead of getting mad you could just download both and try for yourself. It doesn't really matter.
sheesh burger hours
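Anyway, if you want to see those per-tensor quant levels for yourself instead of taking anyone's word for it, the gguf-py package maintained in the llama.cpp repo (pip install gguf) can read them. Minimal sketch, filename is a placeholder:

# count per-tensor quantization types in a GGUF file (assumes `pip install gguf`)
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("Qwen3-30B-A3B-UD-Q5_K_XL.gguf")  # placeholder path

counts = Counter(t.tensor_type.name for t in reader.tensors)
for qtype, n in counts.most_common():
    print(f"{qtype:>8}: {n} tensors")

# a Q5_K_M file and a UD-Q5_K_XL file will show different mixes of Q5_K / Q6_K / Q8_0 etc.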
>>
File: 1734234376728480.png (602 KB, 699x1025)
>>106763408
not even one anon to tell me i'm shit and this sucks ass? i'm disappointed, 4chan.
>>
>>106764995
>JavaScript 76.2%
your code sucks ass and you should feel bad
>>
File: a.png (5 KB, 420x57)
>>106764995
Still havent tried it. The models I currently use process new prompt tokens too slowly for me to want to use anything more than a simple two sentence author's note instruction at depth 0.
>>
>>106765028
cant help it. thats what the resources use and what i had to make my own stuff around. i'm not a fan of javascript, but i dont have much of a choice when it comes to st addons. the fact that it works at all is awesome, to me
>>
>>106765045
if you are using default st, you might be using stuff you dont realize. the vectorization comes on by default, as does history. so by default any rp has some level of remembrance
>>
File: love.png (215 KB, 531x268)
>>106764995
you are amazing and the addon is wonderful
>>
File: 1732139418465069.png (275 KB, 525x521)
>>106765094
don't be nice to me unless you mean it. i want mean commentary.

i know my addon is ok, but its among like a dozen others now for how you can handle clothes, locations, all similar stuff. i've tried a bunch of those and still prefer my own way of just adding them to lorebooks
>>
>>106765094
kek
>>
>>106765076
Nope not using anything. What I mean is I don't want to add any more processing to elongate the already long 10+ seconds I'm waiting after sending my own messages. t. 10-30t/s pp at small <256 batch sizes.
>>
>>106765123
Maybe everyone was just joking about using llms for rp and it's just not that common.
>>
>>106765156
if anything, my addon requires little to no reprocessing. if you decide to change your clothes then hit reprocess, you'll note it using 300 tokens or so and shifting them. its only so lite since ai considers things at the bottom of the list like that

i can try to help you fix any bad entries you think you have
>>
>>106765123
>don't be nice to me unless you mean it. i want mean commentary.
aight
>the addon is wonderful
the useful context window grows, making your addon obsolete. when deepseek implements the full NSA, that's gg for your addon. i didn't do too much testing yet as i'm stuck with glm 4.6 at the moment, but v3.2 increased the useful context to at least ~16k compared to the previous 10k, and it also seems to fall off in a less bad manner. idk though, still needs more testing
>you are amazing
relative to the rest of humanity that's just an objective fact, anyone on autistic sites like any form of chan automatically gets filed into that category. compared objectively though, eh, idk, you seem ok
>>
>>106765156
nta, As long as it's within normal human chat boundaries I'm fine with low responses personally. I mean you ARE chatting right. It adds to the immersion.
10 tokens is still pretty fast.
Normal humans type about 3ish tokens per second.
>>
what's the point of splitting goofs if they are all gonna be loaded anyway? It's not like 500gb is big nowadays
>>
>>106765184
hf has a 50gb filesize limit
>>
>>106765174
>the useful context window grows making your addon obsolete
no, this is where the addon shines - it re-injects info into the ai on the next message. it should be very noticeable.

but if you have any issues, please bring them up and i'll fix them
>>
>>106765183
*low
slow
>>
>>106765172
>you'll note it using 300 tokens or so and shifting them
That's fine, but there will be a few hundred more tokens' worth of messages under the injection, depending on depth. If I was running fully on GPU I'd try it, but I'm not.

>>106765183
Are you conflating pp and tg? 10-30t/s pp for smaller batches is slow as fuck. My tg is 3-5t/s, which I'm fine with and think is a comfy reading speed. That said, pp is not currently an issue for me because I don't have to reprocess more than 50-200 tokens each turn (my own message), excluding the first gen with a new character, which includes ingesting the system prompt+defs. But with that first gen the batch size is large, so it goes by at 150-200t/s pp.
>>
>>106765190
>it re-injects info into the ai on the next message
what is the value proposition of that vs (OOC: the dress is now y, the weather is now x, etc) except tidiness? which could be a + for some, but it's also bad unless you make silly save the chats with the info that was injected, so when reading back one can see what was sent and it doesn't get confusing. even then it seems like a slow creep towards python-style dependency hell
>shines - it
did you just fucking emdash me cunt ?
>>
>>106765225
>what is the value proposition of that vs
because you never have to worry about it again. once you write it, let alone select it, its done.

if you really wanted, you could write what i'm doing within the short shit you're given like author notes
>>
>>106763623
still nothing showing its better somehow...
>>
>>106765253
i personally dont get the use for it but to each his own gl
>>
>>106765217
Ah yeah, misread etc.
>>
>>106765303
it becomes apparent the moment you spend 2 days with a model.

what are they wearing?
who was wearing what?

unless you constantly remind the model who was doing what, it'll forget.
>>
>>106763408
Any ideas for an automated version?
>>
File: 1751123338262441.png (251 KB, 616x555)
>>106765390
what do you mean? it cant be automated. the best you'll get is putting in data, and reading it
>>
>>106764277
This reads like Gemini 2.5/Gemma.
>>
The Pitfalls of KV Cache Compression
https://arxiv.org/abs/2510.00231

> KV cache compression promises increased throughput and efficiency with negligible loss in performance. While the gains in throughput are indisputable and recent literature has indeed shown minimal degradation on particular benchmarks, in general the consequences of compression in realistic scenarios such as multi-instruction prompting have been insufficiently studied. In this paper, we identify several pitfalls practitioners should be aware of when deploying KV cache compressed LLMs. Importantly, we show that certain instructions degrade much more rapidly with compression, effectively causing them to be completely ignored by the LLM. As a practical example of that, we highlight system prompt leakage as a case study, empirically showing the impact of compression on leakage and general instruction following. We show several factors that play a role in prompt leakage: compression method, instruction order, and KV eviction bias. We then propose simple changes to KV cache eviction policies that can reduce the impact of these factors and improve the overall performance in multi-instruction tasks.

Not as lossless as some benchmarks show.
>>
File: file.jpg (286 KB, 476x1039)
Interesting...
https://x.com/maximelabonne/status/1973372579441496514
playground.liquid.ai/talk
https://huggingface.co/LiquidAI/LFM2-Audio-1.5B
https://www.liquid.ai/blog/lfm2-audio-an-end-to-end-audio-foundation-model
Same guys who did that language diffusion model.
>>
>>106765718
Everyone (except for llama.cpp developers who kept saying "muh perplexity") knows KV cache compression makes models retarded.
>>
File: file.jpg (871 KB, 2060x1056)
>>106765758
>>
>>106765758
>>106765764
>>
>>106765780
This is just a fancy animation that describes how autoregressive models work.
>>
>>106765780
*drinks the model*
>>
File: 8SmPy6pTP.jpg (447 KB, 698x488)
>>106765758
lol this is how you do it right
>>
No more games! When can I run glm-4.6 on vanilla ollama on windows?
>>
>>106765718
question, so say we are cpumaxxing big MoE models, but one or two 3090s are also available. Is it feasible to have KV/attention on the GPU, while weights remain CPU? I mean is it doable in practice with llama.cpp or similar on linux? would it help address the slowdown that pure CPU/RAM experiences as KV builds?



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.