/g/ - Technology


Thread archived.
You cannot reply anymore.




File: 1743052378919146.jpg (186 KB, 768x1024)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106351514 & >>106345562

►News
>(08/21) Command A Reasoning released: https://hf.co/CohereLabs/command-a-reasoning-08-2025
>(08/20) ByteDance releases Seed-OSS-36B models: https://github.com/ByteDance-Seed/seed-oss
>(08/19) DeepSeek-V3.1-Base released: https://hf.co/deepseek-ai/DeepSeek-V3.1-Base
>(08/18) Nemotron Nano 2 released: https://research.nvidia.com/labs/adlr/NVIDIA-Nemotron-Nano-2
>(08/15) Ovis2.5 MLLMs released: https://huggingface.co/collections/AIDC-AI/ovis25-689ec1474633b2aab8809335

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>106351514

--Qwen VL blocks Mao commemorative tea image due to political content moderation:
>106352603 >106352638 >106352653 >106352678 >106352695 >106352729 >106352741 >106352766 >106352778 >106352788 >106352794 >106352824 >106354537 >106354560
--GPU frequency locking affects code path performance and can't be queried:
>106351737 >106351762 >106351867 >106351875 >106351889 >106351911
--Frontend differences affecting token generation speed on same backend:
>106353506 >106353548 >106353898 >106354113 >106353905 >106354026
--Reasoning pre-fill exploits model trust bias for stronger output control:
>106354146 >106354174 >106354426 >106354778 >106354793 >106354614
--Meta partners with Midjourney, sparking criticism and speculation:
>106352643 >106352648 >106352649 >106354887 >106355765
--Avoid FP16 CUDA flags to prevent numerical overflow in quantized models:
>106356396 >106356788
--Qwen models overusing "not x but y" phrasing:
>106353981 >106353997 >106354008 >106354031 >106354058 >106354159 >106354182 >106356075
--GPU memory fault due to excessive GPU offload layers and poor memory management:
>106352359 >106352374 >106352413 >106352428 >106352463 >106352578 >106352673
--Maximize VRAM usage during fine-tuning for optimal throughput:
>106355943 >106356138 >106356180 >106356282
--Anons deploy local LLMs for gaming, finance, automation, and adult content:
>106354780 >106354986 >106355189 >106355209 >106355240
--OpenAI's India expansion mirrors past tech offshoring trends:
>106353105 >106353224 >106353263
--Seed 36B model support merged:
>106354673 >106355049 >106357911
--Illegal GPU memory access likely caused by index calculation bugs, not VRAM capacity:
>106352021 >106352040
--Copyright lawsuit accuses Meta of using pirated adult films for AI training:
>106352956
--Miku (free space):


►Recent Highlight Posts from the Previous Thread: >>106351520

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Local AI is as good as dead if we don't get a local equivalent to Genie 3 by the end of this year.
>>
>>106358752
>Command A Reasoning released
How is it?
>>
>>106358772
whats the plan then genius
>>
loli feet
>>
>>106358772
How have you pushed local models in order to realize this claim?
>>
>>106358780
Cohere has completely committed to slopping up and safety cucking their shit.
>>
How come ST doesn't have some simple tool calling yet that lets the model roll a die or something dynamically? Why are local models so far behind?
>>
>>106358780
It is absolutely safe.
>>
>>106358780
It competes with gpt-oss
>>
>>106358832
You are absolutely right! Bringing safety to all is a part of my core programming.
>>
>>106355818
Hoping someone could patch command-a-reasoning-08-2025 into ST. Model works over the API trial key.
"thinking": {
"type": "disabled", # enabled by default
"token_budget": 500 # no error on disabled, no max, unlimited when not specified
}

"message": {
"role": "assistant",
"content": [
{
"type": "thinking",
"thinking": "stuff here"
},
{
"type": "text",
"text": "final response here"
}
]
}
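For reference, a rough sketch of hitting it directly from Python with requests, assuming Cohere's v2 chat endpoint and the shapes above (the endpoint URL is an assumption, untested, adjust as needed):

import requests

resp = requests.post(
    "https://api.cohere.com/v2/chat",
    headers={"Authorization": "Bearer <TRIAL_KEY>"},
    json={
        "model": "command-a-reasoning-08-2025",
        "messages": [{"role": "user", "content": "Hello"}],
        "thinking": {"type": "enabled", "token_budget": 500},  # or {"type": "disabled"}
    },
)
# The reply comes back as a list of typed blocks, per the structure above.
for block in resp.json()["message"]["content"]:
    print(block["type"], "->", block.get("thinking") or block.get("text"))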
>>
>>106358831
SillyTavern is not a local model, it's a user interface.
>>
File: 1754493464792375.png (1.4 MB, 1664x928)
>>
>>106358831
The bloated, broken mess that is ServiceTensor is single-handedly holding back local.
>>
File: LLM-history-fancy.png (1.28 MB, 7279x3166)
Little update
>>
>>106358831
>How come ST doesn't have some simple tool calling yet that lets the model roll a die or something dynamically?
It literally does. Ask gpt to look up the documentation
>>
>>106358892
Seems like you ran out of colours. Mentioning individual dev is also nasty and irrelevant.
>>
>>106358922
>Mentioning individual dev is also nasty and irrelevant.
Which dev?
>>
command-a-reasoning really was the final punch in the dick of densesissies
>>
>>106358959
are you incapable of reading? you even quoted the name
>>
>>106358980
moetards are really trying to play up a cohere model failing as a win for themselves?
oh no no no
>>
>>106358980
What? Bro, how many B is your brain and at what quant is it running?
>>
>>106359017
You are absolutely right to question this
>>
>>106358892
DS V3 0324 so good it was mentioned twice
>>
Best way to make a disappointment build?
>>
>>106359090
Buy a prebuilt
>>
>>106358892
Retard
>>
crazy how we're still stuck with sillytavern in 2025 when it's essentially stuck as a cobbled together piece of shit from 2023 for all eternity
>>
>>106358892
Thanks
>>
>>106359090
Buy a premade HP and realise it has only two RAM sockets.
>>
>>106359104
this but llms in general
>>
>>106359090
Buy enough RAM to run deepseek and realize it's slower than the slowest cloud provider
>>
>>106359104
Try doing better
>>
Cloud will always be cheaper and faster than local because you aren't running your local model 24/7
>>
>>106359104
Vibe code your own interface. You're sending formatted strings to the model and back. All you need to know is how to implement the tags for each model you are using and how to keep every string in order. It's that simple.
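A rough sketch of the idea against a llama.cpp server's OpenAI-compatible endpoint (default port assumed). Here the server applies the model's chat template for you; if you hit the raw /completion endpoint instead, you format the tags yourself, which is the part anon is talking about:

import requests

history = []  # the entire "frontend" state: an ordered list of messages

def chat(user_text, url="http://127.0.0.1:8080/v1/chat/completions"):
    history.append({"role": "user", "content": user_text})
    r = requests.post(url, json={"messages": history, "temperature": 0.7})
    reply = r.json()["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat("You are a grumpy innkeeper. Greet me."))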
>>
>>106359090
Forgo the build and give away your privacy, autonomy and personal information to use the cloud instead.
>>
People who still use shit cards from 2023 think they're within their rights to criticize 2025 models
>>
>>106359090
>RTX 5070 12 GB
>slow 16GB DDR4 RAM
>Intel i3-14100
>>
>>106358892
Jamba sisters...
>>
>>106359148
bro I haven't touched a card with less than 3k tokens in a year and a half
>>
>>106359090
spend about $15k on hardware and run the best local model you can find
>>
>>536373993
>>536373993
>>536373993
Apologize to rentry
>>
>>106359070
Will correct it in the next version.

>>106359184
Will add as a note in summer flood.
>>
>>106359234
>>>/vg/536373993
>>>/vg/536373993
>>>/vg/536373993
oops
>>
File: GooH6mwWIAEGDaG.jpg (81 KB, 1000x707)
>Try Qwen +200b
>Purple prose schizo
>GLM, and Deepseek are MoEs
>Kimi too big for local
I await my Magnum v5.
>>
>>106359256
This brings back memories. Funny how that image slipped past the filters.
>>
>>106359256
kimi and qwen are also moes
>>
>>106359256
Qwen is a MoE too...
>>
>>106359263
>>106359275
Fuck me, I saw instruct and thought it wasn't. This explains everything.
>>
If dense is so good
Why aren't more people training them
>>
>>106359290
because expensive, and as always once something becomes anywhere big it's race to the bottom time
>>
>>106359299
Why don't people that want dense models train their own?
>>
>>106359311
refer to >>106359299
>because expensive
>>
>>106359290
Everyone wants to be the next deepseek now
>>
>>106358892
Is this a joke? DeepSeek was never good.
>>
>>106359351
Fuck off :D
>>
>>106359351
just let it go sam
>>
>>106359351
Sam, it's been almost nine months. Please settle down.
>>
>>106359315
wtf are you poor?
>>
I was drunk last night and downloaded GPT Ass. Jesus, I promptly deleted it today.
>>
>>106358892
hybrid reasoners work fine, look at GLM
>>
Is there a trick to prompting moes that I'm not aware of? GLM, Deepseek, and Qwen3 are all schizo when I use them.
>>
>>106359351
Openai was never good
>>
>>106358831
You need to write an extension to give it tools.
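For the dice example, the tool itself is tiny. A hedged sketch in the OpenAI function-calling format; whether ST's extension API or your backend actually wires this up for your model is the part you have to check:

import json, random

# Hypothetical tool schema the model can be offered.
roll_die_tool = {
    "type": "function",
    "function": {
        "name": "roll_die",
        "description": "Roll a die with the given number of sides and return the result.",
        "parameters": {
            "type": "object",
            "properties": {"sides": {"type": "integer", "minimum": 2}},
            "required": ["sides"],
        },
    },
}

def roll_die(sides: int) -> str:
    # The frontend runs this when the model emits a roll_die call,
    # then feeds the JSON result back as a tool message.
    return json.dumps({"result": random.randint(1, sides)})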
>>
File: Untitled.jpg (331 KB, 1637x817)
WTF cheater
>>
>>106359450
o3 was good. Was expensive too. But good.
>>
>compute_imatrix: 1500.86 seconds per pass - ETA 973 hours 53.38 minutes
O-oh...
>>
playing games with reasoning models sure is time consuming
>>
>>106359595
Hybrid reasoners are perfect for that.
>>
>>106359595
Prefill the reasoning with relevant information.
Hell, inject lorebook entries in the reasoning block even.
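A minimal sketch of what that looks like against a llama.cpp-style /completion endpoint. The chat tags and the <think> marker are placeholders (they depend on the model's template), and the lorebook line is made up:

import requests

lorebook = "The ruined keep is haunted by the previous owner's daughter."
history = "<|user|>\nWe approach the keep at dusk.\n<|assistant|>\n"  # placeholder tags, model-dependent

# End the prompt inside an opened, partially written reasoning block so the
# model continues the thinking we seeded instead of starting its own.
prefill = "<think>\nFacts I must keep in mind: " + lorebook + "\n"

r = requests.post("http://127.0.0.1:8080/completion", json={
    "prompt": history + prefill,
    "temperature": 0.7,
})
print(prefill + r.json()["content"])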
>>
Is it just me or are inline latex single dollar signs not rendering on DS webapp
>>
I notice the same model takes twice as long responding to my prompt on Open WebUI as in the Ollama interface. Most of the time is loading time as I wait for the first word to appear. This happens even if I run the same prompt back-to-back in a different chat, so it's not Open WebUI loading the model for the first time. I know Open WebUI adds overhead, but this is suspicious. Anything I can check in my settings?
>>
>>106359709
stop using ollama
>>
>>106359723
Suggestions for alternatives? I want something with features like chat history, markdown, etc. and not just a command terminal.
>>
>>106359750
llama.cpp has its own embedded webui.
>>
>>106358892
>Next up - the AI ice age
>>
>>106359750
troonkupad
>>
Is Miku trans?
>>
>>106359808
Yes
>>
>>106359750
llama.cpp server and any frontend that works for (you). llama has its own webchat but that's very bare. SillyTavern or whatever else is out there works well.
>>
>>106359808
Absolutely
>>
Anyone using any good Mistral Small models? I've been pretty much exclusively using Magnum Diamond (Cydonia is meh, decent, but I think there are better Mistral Small models).

I really wanna try out the Qwen shit but I can never really get it to work well. Feels like it's really poor at RP (probably prompt issue or some shit). Got 24GB VRAM, 32GB RAM so i'm pretty limited on the shit I can run
>>
So now that the dust has settled
What went wrong with DeepSeek 3.1?
>>
>>106359808
She is
>>
>>106359844
Lack of sex modality
>>
Wtf is going on in /aicg/? I come back after a few hours and every post is deleted.
>>
File: file.png (12 KB, 638x56)
muh blackx rights
>>
>>106359808
Stop replying to yourself
>>
>>106359844
People expecting to RP with it using schizo cards.
>>
>>106359837
Devatral
>>
>>106359837
Qwen is not that great at writing. Mistral 3.2 is ok. Cydonia is somewhat strange, though it's not bad.
Try Gemma 3, or Gemma 3 Glitter specifically. I really like its output (relatively) but it's annoying if you are pushing its censorship limits. That works too, but you need to groom it first; you can't just blurt something out or it'll display a suicide hotline disclaimer with phone numbers lel
>>
File: 1738785005961858.png (308 KB, 1683x1353)
Newfag here, I'm trying to build fast local models for ERP conversations. What are some models that are on par with qwen-flash's speed? Those 1-3s delays in most LLMs are a huge turn-off for me. We are talking about around 3-500ms with like 100-300 input tokens.

Also in picrel the numbers of gigabytes in parentheses are the memory needed right? How tf are you supposed to have 200GB in your local machine?
>>
>>106359448
https://docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tune
Recommended settings.
>>
Smell will be the next big modality
>>
>>106359914
>How tf are you supposed to have 200GB in your local machine?
A lot of money
Personally I'm eyeing a 96gb DDR5 kit
>>
>>106359914
>How tf are you supposed to have 200GB in your local machine?
people offload the model to system RAM out of desperation. I suppose you could stack some of those workstation cards like the Blackwell RTX 6000, but it would get cost-prohibitive pretty quickly.
>>
https://rentry.org/lmg-lazy-spoonfeed-guide
?
>>
>>106359914
>even Q1_S is extremely capable
does that mean 1 bit?
>>
>>106359949
what would you run with that?
>>
>>106359914
>How tf are you supposed to have 200GB in your local machine?
disk space + RAM + VRAM ≥ 250GB
Most of lmg ERPs at 1 t/s.
>>
>>106359914
>How tf are you supposed to have 200GB in your local machine?
https://www.amazon.com/Crucial-5600MHz-5200MHz-Compatible-CP2K64G56C46U5/dp/B0DSR5P84D
https://www.amazon.com/G-SKILL-4x64GB-CL36-44-44-96-Desktop-Computer/dp/B0FFKFCLLL
Like this.
>>
>>106359984
Cope quants of bigger models than glm45 air
>>
>>106359989
c'mon, you need at least 5t/s for it to be halfway enjoyable. I can run larger models at 1t/s but I never would, what do you do while it's spitting out the response?
>>
>>106360005
>what do you do while it's spitting out the response?
Masturbating and shit posting.
>>
>>106360005
I get 3t/s and I am sure something is fucked with my config + I am on windows.
>>
>>106359976
>Pub: 23 Aug 2025 18:04 UTC
>Views: 0
>wixmp.com
Did it have anything useful before?
>>
>>106360005
i just switch tabs and do something else
>>
>>106359976
recommending ooba as a first thing to start with is diabolical (no one is going to call this shit text gen ui, fuck off)
llama.cpp has release builds every hour or so anyway if you are running windows, and if you are running linux and can't compile a program then this might not be a hobby for you anyway
overall pretty dogshit guide, if it were really spoonfeeding then it would go from a to z through every part but it's not even half assed, more like quarter assed
a list of recommended programs would be better + a small glossary and that's it
>>
>>106360038
It's just the old spoonfeed guide (https://rentry.org/lmg-spoonfeed-guide) with model recommendations replaced with a link to the rentry in the OP.
>>
>>106360066
>a list of recommended programs would be better + a small glossary and that's it
That's already covered by the links in the OP template.

>recommending ooba as a first thing to start with is diabolical
Ooba is still used, and rewriting the guide to walk through llama.cpp instead is too much for a lazy guide.

>overall pretty dogshit guide, if it were really spoonfeeding then it would go from a to z through every part but it's not even half assed, more like quarter assed
It touches on and mentions most things someone starting out will need to know and gets them running. People have to put in some effort themselves too. If someone asks how to run local and you give them a 30 page document, they won't even bother.

It's better than the current getting started guide, no?
>>
>>106359978
Allowed quantization types:
2 or Q4_0 : 4.34G, +0.4685 ppl @ Llama-3-8B
3 or Q4_1 : 4.78G, +0.4511 ppl @ Llama-3-8B
38 or MXFP4_MOE : MXFP4 MoE
8 or Q5_0 : 5.21G, +0.1316 ppl @ Llama-3-8B
9 or Q5_1 : 5.65G, +0.1062 ppl @ Llama-3-8B
19 or IQ2_XXS : 2.06 bpw quantization
20 or IQ2_XS : 2.31 bpw quantization
28 or IQ2_S : 2.5 bpw quantization
29 or IQ2_M : 2.7 bpw quantization
24 or IQ1_S : 1.56 bpw quantization
31 or IQ1_M : 1.75 bpw quantization
36 or TQ1_0 : 1.69 bpw ternarization
37 or TQ2_0 : 2.06 bpw ternarization
10 or Q2_K : 2.96G, +3.5199 ppl @ Llama-3-8B
21 or Q2_K_S : 2.96G, +3.1836 ppl @ Llama-3-8B
23 or IQ3_XXS : 3.06 bpw quantization
26 or IQ3_S : 3.44 bpw quantization
27 or IQ3_M : 3.66 bpw quantization mix
12 or Q3_K : alias for Q3_K_M
22 or IQ3_XS : 3.3 bpw quantization
11 or Q3_K_S : 3.41G, +1.6321 ppl @ Llama-3-8B
12 or Q3_K_M : 3.74G, +0.6569 ppl @ Llama-3-8B
13 or Q3_K_L : 4.03G, +0.5562 ppl @ Llama-3-8B
25 or IQ4_NL : 4.50 bpw non-linear quantization
30 or IQ4_XS : 4.25 bpw non-linear quantization
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 4.37G, +0.2689 ppl @ Llama-3-8B
15 or Q4_K_M : 4.58G, +0.1754 ppl @ Llama-3-8B
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 5.21G, +0.1049 ppl @ Llama-3-8B
17 or Q5_K_M : 5.33G, +0.0569 ppl @ Llama-3-8B
18 or Q6_K : 6.14G, +0.0217 ppl @ Llama-3-8B
7 or Q8_0 : 7.96G, +0.0026 ppl @ Llama-3-8B
1 or F16 : 14.00G, +0.0020 ppl @ Mistral-7B
32 or BF16 : 14.00G, -0.0050 ppl @ Mistral-7B
0 or F32 : 26.00G @ 7B
COPY : only copy tensors, no quantizing
>>
>>106360195
>we have bitnet at home
>>
File: file.png (16 KB, 177x937)
i am running a chroot inside a chroot, and i am running things off of different partitions
HOW THE FUCK DO I HIDE THIS SHIT INSIDE MY FILE PICKER AND INSIDE MY FILE MANAGER
HOW TO FUCKING HIDE IT FUCK FUCK FUCK FUCK FUCK!!!!!!!! FUUUUUUUCCCCCCCCCCCCCKKKKKKKKKKKKKKKKKKKKK
>>
>>106360195
Thanks, I found a 2 year old version of that chart.
Q1 sounds like there's no way it can be good desu
>>
>>106360219
Not quite. Bitnet needs the training to be quantization aware to be near-lossless as they claim. This is not it.
>>
>>106360173
>It's better than the current getting started guide, no?
not really
i could try my hand at writing a guide, but i've been here since llama2 released, so i'm not sure what the pain points are for new people of various literacy levels
this shit isn't really rocket science though, i'm sure that a moderately non-retarded person could figure it out in an afternoon or two on their own
you can't really save the lowest common denominator from their own stupidity
>>
File: llama_quantize.png (16 KB, 1023x720)
>>106360238
It's the output of llama-quantize without parameters. You should have it on your pc already.
>Q1 sounds like there's no way it can be good desu
Some people are desperate and will do it anyway.
>>
>>106360238
>>106360238
Q1 can be good for HUGE HUGE HUGE models, like deepseek R1, ymmv
>>
>>106360238
Any Q1 Deepseek is better than any dense model you have ever tried. The problem with Q1 is that it is essentially enforced greedy sampling. All rerolls are almost the same.
>>
>>106360298
Interesting. So it's hard baked. Hardtack.
>>
>>106359448
to some degree the model is going to act the way it wants to act no matter what, but imo those models need a more restrained prompt than the ones that people used to use for roleplay with dry models, e.g. you don't really want to be encouraging them to use a flashy personality-maxxed hentai writing style or telling them to be extremely creative and unpredictable etc. they do much better with a more neutral prompt
>>
>>106358752
Do any of you use TTS programs? I'm not looking for the best - I'm looking for fast and low vram, because I want as much of my vram as possible to be dedicated to the LLM, not to the TTS.
>>
>>106360462
I used piper for a bit. Had to make the glue between my editor and piper, but it worked. It's stupid fast. Kokorotts is fast too. Not as fast, but I think it sounds a little better. Haven't tried kittentts. It has the smallest models of all three, so it should be faster than piper. One of these days i'll integrate it in my stuff.
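The "glue" can be as thin as a subprocess pipe. A sketch assuming the piper CLI is on PATH and a voice model is downloaded (flag names as in the piper-tts CLI, double-check against your install); it runs on CPU, so no VRAM is taken from the LLM:

import subprocess

def speak(text, voice="en_US-lessac-medium.onnx", out="out.wav"):
    # Pipe the text into piper on stdin; it writes a wav file.
    subprocess.run(
        ["piper", "--model", voice, "--output_file", out],
        input=text.encode("utf-8"),
        check=True,
    )

speak("It is stupid fast.")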
>>
>>106359976
I agree with the other anon that recommending ooba is dumb.
In my opinion LM Studio (not open source I know) is the easiest to get started with because it uses llama.cpp and it doesn't require any python stuff that filters so many people. Kobold is a distant second.
Anyone serious about this uses llama.cpp and it's not even mentioned.
>>
>>106360527
>Anyone serious about this uses llama.cpp and it's not even mentioned.
It's a spoonfeed guide. Whoever needs to read that, is not serious yet. llama.cpp is mentioned in the OP that retards cannot be bothered to read, so whatever.
>>
>>106358892
real nice
>>
>>106360524
Nice, thanks for the suggestions
>>
I think I moved on from ooba when I got a bug where you couldn't interrupt generation and had to wait for it to finish. I also still remember how the retard forcibly changed the API to the OpenAI API and removed the old one. And the OpenAI implementation was obviously bugged, so there was no way to use it.
>>
So, meta is not releasing Llama 4 Behemoth, right?
>>
>>106360573
Right after Grok 2
>>
>>106360573
they realized that drummer already has tunes that are named behemoth so they decided to scrap the model to not cause any confusion
>>
>>106360588
damn this drummer guy is pretty badass
>>
hi shitfuckers
just wanted to tell you all that me and sam-chan's plan to hollow meta from the inside out has been going smoothly
convincing meta to abandon open source? easy
making lecunny ledone? yawn
convincing zuck to use gpt-5 after spending millions on his "superintelligence" (lol) team? well let's just say zuck is even more of a bottom than sam is
i've also been putting the plans in motion to get that chink ban underway. enjoy your deepsneed while it lasts, because when i'm done the only chinese letters you'll see will be the digits after a dollar sign
lol. good luck, and for those of you who are interested in resources about openness, remember that by reading this you have acknowledged that the wang-sama foundation legally owns your car and your daughter's virginity
>>
>>106359908
Isn't Gemma meant to be super retarded when it comes to roleplaying though (misremembering basic details etc)

I remember trying it before, Drummers one and the Abilerated one or something? Both sucked.
>>
>>106359104
this but also cumfartui for diffusion
>>
>>106360524
Nta but thanks, I'll test kittentts and will integrate that to my client. Todo list grows but no work gets done lol
>>
File: file.png (195 KB, 896x900)
drummer why is gemma r1 12b so shit? i swear to god i pulled out a steam deck and then it started doing this 2 messages later when i told it to suck my dick
glm4.5 air chan would never do this
>>
>>106360601
buy an ad
>>
>>106360623
No, in my experience even 12b gemma 3 excels and is comparable to larger 24b mistral. I mean I use it for d&d rp and it can cite how much gold my partner has etc. I don't have any complex rules and have tried to make every system prompt rule as concise as possible.
Retardation comes more from its censorship of sexual content, but this can be avoided with a jailbreak, plus gemma 3 glitter is somewhat better in this sense.
try it out, and if you don't like it, into the trash it goes
>>
https://youtu.be/mjB6HDot1Uk?t=428
>>
>>106360692
youtube slop
>>
>>106360601
>sam-chan
>not sama-chama
anon...
>>
>>106360601
>lecunny ledone
qrd on this?
>>
I have 128 GB RAM and 24 GB VRAM. What model that fits is the best for making small scripts? I could run Qwen 235B at like Q2 to Q3, or GLM 4 Air at Q6. Does the quantization hurt Qwen too much for coding or is it still the best even when lobotomized?
>>
>>106360964
Devatral FP16
>>
>>106361016
That would be really slow though since no more than half of the model could fit in VRAM.
>>
File: 1736744801054204.png (29 KB, 1022x271)
>>106360889
>>
>>106361058
it know
>>
has lecun made a statement on genie3 yet? google just went ahead and did what he was dreaming of with that
>>
>>106360889
He reports to Wang now
>>
File: onmeth.png (21 KB, 701x27)
>Deepseek R1 400b
>prompt it with: Write uniquely to the tone of {{char}}'s personality.
>Magical shit like this happens
>>
>>106361174
400b?
>>
>>106361174
>Deepseek R1 400b
is this some pruned shit?
>>
>>106361151
Do we even know what architecture it uses and what methods they used to train it?
>>
>>106361204
No, closed source has finally achieved its moat. All it needed was to fully abandon LLMs.
>>
File: file.png (15 KB, 566x247)
DEEPSEEK WHY WHY!?!?!?!
>>106361207
HOLY SHIT
HE DELIVERED
THANK YOU MUSK SAMA
I APOLOGIZE
>>
>>106360583
where is behemoth?
https://huggingface.co/xai-org/grok-2
>>
>>106361215
>500GB
I sleep
>>
>>106361208
If the details are that light then I imagine Lecunny would be incentivized to not make a post about it since he either knows too much and would get into trouble, or he'd have to "speculate".
>>
File: 1732453400163950.png (242 KB, 655x1223)
>>106361215
The fuck is this structure?
>>
>>106361215
Use the command below to launch an inference server. This checkpoint is TP=8, so you will need 8 GPUs (each with > 40GB of memory).
does this mean its 8bit?
500gb but 8bit? that means.. 1trillion parameters? BROS BROS?!??!?! BROS!@#?%!#%^)!@$*^()!#@$&*^!)($^ BROS FUCKING BROS HOLY SHIT !!!!
>>
Is grok 2 potentially fuckable? As in is it worth it over r1 and glm4.5 for sex?
>>
>>106361215
>>
>>106361253
Nah. Grok 3 was okay at erotic creative writing though.
>>
>>106361253
Ask the cloudcucks instead
>>
>>106361215
>it's real
Well, /lmg/? Your apology to Elon sir?
>>
>>106361256
I recognize that profile picture
>>
>>106361253
no, it was hardly even a good option at the time
>>
>>106361270
He was late by a few days, but it could be the unpaid intern's fault
>>
>>106361270
I still think Elon is a despicable piece of shit. This model is late, 2 generations behind and quite frankly worthless. I would only apologize if it made me cum but that is not gonna happen.
>>
Grok2 had native image editing, didn't it?
>>
>>106360964
235B at Q2 or Q3 is less quantized than R1 at Q1, and even then it's still better than dense 70B models and more than capable of making small scripts.
>>
>>106361293
I don't think so. Didn't it call flux on the backend?
>>
>>106361293
Grok2 called out to Flux iirc
>>
>>106361294
What about Air though?
>>
>>106361310
Never tried it. But you can download both and ask them to do some script and keep the one that does better.
>>
>>106361276
So there is zero reason to use it over Deepseek. No one sane is going to actually use it. And we will now get some loud obnoxious saars running around the internet saying that Elon is a friend of open source because of it.

I hope Elon gets cancer soon.
>>
>>106361340
go cry on blue cry, more open weights is always a good thing
>>
>>106361310
air is good
>>
>>106361215
>b. Restrictions:
>You may not use the Materials, derivatives, or outputs (including generated data) to train, create, or improve any foundational, large language, or general-purpose AI models, except for modifications or fine-tuning of Grok 2 permitted under and in accordance with the terms of this Agreement.
Ewww
>>
>>106361215
So it's basically mixtral but 500gb.
"hidden_size": 8192,
"intermediate_size": 32768,
"moe_intermediate_size": 16384,
"num_experts_per_tok": 2,
"num_local_experts": 8,
"num_hidden_layers": 64,
>>
>>106361355
No one should want to anyway kek.
>>
>>106361348
Post output
>>
ollama run grok 2
>>
ollama run you're mum
>>
>>106361348
Indeed, only because then losing to Grok in the benchmarks is more embarrassing.
For being fuckable? You'd be stupid not to prefer the fucking wildly hallucinating Gemma mini models over Grok.
>>
>>106361361
But if Grok 3 and 4 ever get released, they'll likely have the same license.
>>
>>106361382
Are Grok 3 and 4 so good as to warrant distilling them though?
>>
>>106361253
Grok 2 was the one that had the engineers on twitter complaining about how much positivity bias leaked into it from contaminated training data.

>>106361340
If he really wanted to show up Altman, he could have easily released both Grok 2 and 3, and even a gpt-oss-sized distill just to rub it in. They probably could have knocked out the distills in a week.
>>
Grok 2 saved local
>>
>>106361460
*safed
>>
>>106361422
Sir they are working on Grok 5 AGI Companions, they are rightly focusing their attention where it's needed.
>>
>>106361251
Weird if true. HF says the tensors are at BF16.
>>
>>106361491
dont trust HF autodetect for anything, its always wrong
if its BF16 even better, only 250b model thats nice
>>
>>106361469
If they dumped both 2 and 3 at the same time, they wouldn't have people nagging them to do another release in 6 months because they would have already gotten it out of the way.
>>
>>106361527
Please understand, safety checking needs long time.
>>
>>106361396
Grok 4 is a SOTA model. Grok 1 and Grok 2 were them just dipping their toes in the water. 3 is when they really started doing decent.
>>
>>106361527
Elon dumps something when his ego needs a stroke, so my guess is it'll probably come when / if OpenAI does another "open" release
>>
File: wahaha cry.jpg (64 KB, 1280x720)
Petra why are you bullying the facehuggers, you know they're sensitive
>>
>>106358189
I like how these faggots are acting all uppity as if markdown rendering is some arcane secret only they control. Anyone could vibecode a clone over a weekend these days
>>
>>106361648
It's really funny
>>
>>106361215
>Usage: Serving with SGLang
https://github.com/sgl-project/sglang/pull/9532/files
Is 'xai_temperature' something like dynamic temperature?
>>
>>106361657
The issue isn't rendering an alternative but hosting that shit
>>
>>106361648
How does he not get banned anyway?
>>
File: petra.png (93 KB, 636x667)
>>106361648
You harbour sin brother
>>
File: file.png (29 KB, 1520x389)
>>106361681
hf jannies are based
picrel is from when gpt oss released and i dropped gamer word and whatever else
>>
File: file.png (12 KB, 682x131)
HAPPENING! CONFIRMED TO BE 260B/A30B
QWEN3 235B BUT SEX AND STUPED
BASED BASED BASED
>>
>>106361648
if the companies who create new models and publish them on huggingface realize that the main userbase of open models uses them for porn and obscene purposes, they're more likely to try to pander to us in the future
>>
>>106361743
I'd say the opposite is far more likely, which he would like since he's been trying to kill the thread for a while now.
>>
>>106361740
8 experts 2 active. Therefore slow as shit
>>
>>106361783
what about the common/shared ones?
>>
Hey /lmg/ scholars, what makes LLMs so sensitive to quantization degradation? I'm quantizing small transformers models (T5, ViT, Bert...) to UINT8 ONNX and get literally 0 degradation over the full FP32 safetensors (and sometimes a very small improvement due to regularization). Why is that so hard to achieve with LLMs?
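For reference, the step I mean is plain dynamic weight quantization with onnxruntime, something like this (paths are placeholders):

from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization: weights stored as uint8, activations kept in float
# and quantized on the fly at runtime.
quantize_dynamic(
    model_input="t5-encoder.onnx",          # placeholder path
    model_output="t5-encoder.uint8.onnx",
    weight_type=QuantType.QUInt8,
)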
>>
saw the shitposts in that thread and fucking knew it came from here lol

hf admins literally posting as well there so ill await my ban
>>
File: dither-3596767975.jpg (240 KB, 1200x755)
>>106360238
>Q1 sounds like there's no way it can be good desu
's fine
>>
>>106361801
Look at the config. >>106361358
I don't think it has shared experts
>>
>>106361802
depends on usecase
8-bit int is barely quanted; anons itt are nutting to Q1
masturbation requires novelty = nuanced token distribution
experiment with QAT
>>
>>106361864
You got a point. I forgot you guys run sub-Q8 models
>>
>some Ukrainian guy who cited my paper died in Russian strikes last month
welp
>>
>>106361925
big-MoE changed the computational game a bit but I'd say 4-6bit quants are most widely used
>>
>>106361990
lmao that sounds funny, post more info please
>>
>>106361990
that's horrible to hear, alpindale
>>
>>106361990
What's the point of writing papers?
>>
>>106362027
To get citations
>>
>>106362027
Showing your peers the size of your dick
>>
>>106362027
publish or perish
>>
>>106362027
publishing papers makes you a "researcher" and eligible for free money from universities if you're part of their sekrit club of academics
this way you can live off your degree without getting a real job or doing anything productive
>>
>>106362102
>eligible for free money from universities if you're part of their sekrit club of academics
Lmg, please elaborate
>>
>>106361807
>lmao what a good meme
>that will be $0.16
Do cloudcucks really?
>>
>>106362129
put it another way, if I'm paying $0.16 for cloudshit, I do want my dick sucked, metaphorically or not.
>>
>>106360238
>Q1 sounds like there's no way it can be good desu
all the people who are positive about q1 are hard copers
>>
>>106360238
q1 is cope, you need at least dynamic q2 for a close to lossless experience with big moe models such as deepseek r1 0528
>>
>>106362129
Wait till they start asking for tips.
>>
>>106360238

Listen to what this anon said >>106362206
>>
>>106359148
What has changed with cards?
>>
>>106362246
Models are now powerful enough to take all the schizo ramblings in your card literally
>>
>>106362162
Unfortunately they make it reluctant to suck my dick when I want it to (ERP) and overly eager to suck my dick when I don't want it to (coding).
>>
>>106362129
its not opus and im not spending that much - its the new 235b qwen3 with the cost multiplied by a random big number

retard discord users tend to like the responses more if they see it costs money and its claude - human psychology
>>
>>106362266
>Models are now powerful enough to take all the schizo ramblings in your card literally
So whats the effective limit for tokens now?
>>
>>106362289
it's not about the amount of tokens, it's what you do with them
>>
>>106359993
>2 memory channels
LOL
>>
>>106362315
>it's what you do with them
Yeah? whats the smallest card you've seen work well? how many tokens?
>>
>>106361215
the discussion thread is lol :)
>>
>>106362282
based
>>
>>106359766
>>106359787
>>106359824
Thanks, llama.cpp + OpenWebUI is way faster. Maybe I'll check out other frontends later. I'm new at this and just used ollama + OpenWebUI because that's the advice that seemed most common online.
>>
>>106362282
kek
>>
File: Grok.jpg (52 KB, 590x595)
Elon claims Grok 3 will be open sourced "in about 6 months."
https://x.com/elonmusk/status/1959379349322313920
>>
What is the use case for grok 2 when deepseek and qwen3 coder 480b exist?
>>
>>106362417
Seems like you need to x2 every timeframe he gives.
>>
>>106362417
I hereby formally apologize to Elon.
>>
>>106362417
Would be nice. Grok 3 is an okay creative writing model
>>
>>106362417
When will Ani be opensourced?
>>
>>106362417
i kneel. fucking BASED
>>
>>106362476
I will open her source
>>
>>106360243
>Bitnet needs the training to be quantization aware to be near-lossless as they claim.
Pre-training.

What was QAT is now QApT. QAT is now trash thanks to Google and Gemma 3 poisoning the well.
>>
>>106358752
Reminder Miku is canonically skinny and has a flat chest
>>
File: Grok2.png (460 KB, 1188x937)
>>106362380
You're not kidding
>>
>>106362541
>QAT is now trash thanks to Google and Gemma 3 poisoning the well.

what happened?
>>
>>106359808
Miku as character? No.
mikuspammers definitely are, though
>>
>>106362417
Elon sir delivered!
>>
>>106362541
>Pre-training.
Distinction without a difference and a stupid naming convention. It's training.
>>
https://huggingface.co/ubergarm/DeepSeek-V3.1-GGUF/discussions/2#68a9cfca361af4a168b42b74
In case anyone else tried to make DS 3.1 reasoning work with ST chat completion.
>>
>>106362644
So, is it worth it to make the jump from R1/V3?
>>
>>106362644
>ubergarm
I have seen this name somewhere...
>>
>>106362661
It can deal better with longer context, but it's more autistic so you have to be more explicit about what you want it to do.
>>
File: u.png (173 KB, 460x460)
>>
>>106362661
Not if your primary use case is Vocaloid/UTAU birthday asking at IQ1KT. V3 0324 is better here
>>
/pol/'s favorite celebrity did something, now the serbian is going to be like a kid on a sugar rush all weekend
>>
>>106362735
As one of the prime shitposters I can confirm that I am not feeling like shitposting that much now.
>>
>>106362694
will he quant the grok?
>>
File: 1589887234978.gif (1.54 MB, 230x230)
>>106361864
>experiment with QAT
Was MXFP4 really a mistake?
>>
>qwen3-30b-a3b-thinking-2507-q8 slower on ollama but thinks efficiently
>qwen3-30b-a3b-thinking-2507-q8 faster on llama.cpp but keeps repeating itself in the thinking block so the speed gains are negated
What's going on? Why is the same model with the same quantization behaving differently on ollama and llama.cpp? What should I tweak to make the llama.cpp model behave more like the ollama model and reduce overthinking to actually benefit from the faster inference?
>>
>>106362795
ollama is slow trash and you didn't set the samplers correctly on llama.cpp, causing repetition
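e.g. if you hit the server directly, pass the samplers per request instead of relying on defaults. A sketch against llama.cpp's /completion endpoint using the commonly cited Qwen3 thinking-mode settings (treat the exact numbers as a starting point):

import requests

prompt = "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n"  # Qwen-style ChatML tags

r = requests.post("http://127.0.0.1:8080/completion", json={
    "prompt": prompt,
    "temperature": 0.6,   # Qwen3 thinking-mode recommendations
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "n_predict": 2048,
})
print(r.json()["content"])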
>>
>>106362795
You need to inspect the prompt, the hyperparameters, and the launch parameters the backends are getting and compare them.
>>
>>106362795
Get a grip on your inference infrastructure and understand what it's actually doing under the hood. Log the raw text input and diff it
>>
Btw I heard that ikllama KT quants are bad at the moment. But what is the problem with them specifically? I wanted to do an R1 IQ2_KT quant since I tried exl3 70B like that and I was surprised how good it was.
>>
>>106362566
>what happened?
Google says they use QAT for Gemma 3, when it's just quantization aware fine-tuning.
>>
>>106362795
Does the llama.cpp log have a warning about a double BOS token?
The GGUF file can define that one should be added and if you then also add one in your prompt you can end up with two.
>>
>>106362351
>how many tokens
Seven
>>
>>106362351
>You're X.
You don't need more
>>
>>106362845
We have sex. You are a pony.
>>
>>106359837
Broken-tutu-24b, turn off all samplers, they adversely affect output, causing bad repetition.
>>
>>106362541
>distillation
>omni
>QAT
They keep watering down established terms.
>>
File: Timeline.png (688 KB, 6277x1302)
>>106358892
>Adding cuck and shot to the timeline.
A bit vulgar and unneeded, but more importantly it wasn't there initially. Why even include them on the timeline?
>>
>>106359908
>>106360682
do you use 27b?

It's so fucking slow, even compared to 32b models for me.
>>
File: file.jpg (71 KB, 1081x137)
>>106362417
kek at how fast trannies itt change their flip-flops
>>
>>106361292
>I still think Elon is a despicable piece of shit
no truer words have ever been spoken.
>>
Can someone explain quants to me?

Is it true Q4_K_M is all you really need? I usually go for the highest that my GPU can handle, but I literally can't see a difference between Q4_K_M and Q5_K_M; then again, I've not tested long enough to know.
>>
File: 1744388702202956.png (705 KB, 1896x1055)
song of the day featuring miku and teto, tenntekomai girl
https://www.nicovideo.jp/watch/sm45323744
>>
>>106362417
safetykeks btfo
I hope he buys meta too and fixes the shit out of their models
>>
>>106361270
I will always be glad that Elon kick-started the space industry after it was stagnant for decades. But my god he can't help himself from burning bridges and flying off the handle for no good reason. Hopefully he someday learns how to keep his shit together because eventually he will run out of bridges to burn.
>>
>>106363235
Is this Japanese youtube?
>>
File: mmlu_vs_quants.png (336 KB, 3000x2100)
>>106363201
The smaller the model, the more degradation, generally.
Basically, since you are losing numerical precision in the numbers being used in the calculations, each "internal nudge" towards the final output is that little bit more different ("inaccurate") compared to full precision.
Something like that.
How much the degradation is noticeable or matters will depend on a lot.
The heuristic is: use the largest bpw (correlated with file size) that you can run at speeds you are comfortable with, at the context size you need.
>>
>>106363245
yep! and if you've ever seen those videos where viewers' comments are scrolling from right to left, across the main screen of the video, that's where it comes from
>>
>>106363201
>Can someone explain quants to me?
Accuracy goes down as the quantization becomes more aggressive. Generally, bigger models handle low bit quants better than smaller ones. That's it.
>Is it true Q4 K_M is all you really need?
If q4km is good enough for you, use that. If you can manage to run something bigger and tolerate the speed, use that instead. If you need more memory, use a lower bit quant.
It depends on the problem you're trying to solve and your expectations. This is not me asking what the problem is nor what your expectations are. It's something you have to evaluate yourself.
>>
>>106363235
>>
>>106363235
When is crypton going to give up and let Synth V make a Miku voicebank? She sounds awful compared to Teto.
>>
>>106363258
>>106363281
nah I get that stuff, I just read that most graphs show for the standard models (not speaking the crazier sized ones, moreso in the sub 34b range) that Q4_M is sort of the sweet spot or some shit but I have no idea so figured one of you guys may know more.

I'll stick to Q4s for a while, see how they feel.
>>
>>106363245
unironically better for my eyes than american jewtube
now if only I knew more japanese...
>>
>>106363201
quantization is a mapping of the model's weights down to a smaller size.
weights are basically floating point numbers.
basically it is like images: the fewer bits you can store for the image, the less accurate the picture will be compared to the original.
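A toy version of that mapping to make it concrete: symmetric round-to-nearest 4-bit quantization of a small weight vector (numpy only; the real GGUF formats add per-block scales and smarter rounding on top of this):

import numpy as np

w = np.array([0.12, -0.53, 0.97, -0.08, 0.41], dtype=np.float32)

# 4-bit symmetric quantization: map the floats onto the signed levels [-7, 7].
scale = np.abs(w).max() / 7.0
q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)  # what gets stored (plus the scale)
w_hat = q.astype(np.float32) * scale                     # what the model actually computes with

print(q)          # [ 1 -4  7 -1  3]
print(w_hat - w)  # the rounding error: the "less accurate picture"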
>>
>>106363371
We never needed more than 4bit color.
>>
>>106363415
I like to give my models 6 bits, as a treat.
>>
>>106363415
the greens got duller
>>
>>106363134
I've never gotten Gemma to work that well for me either, for similar reasons. I prefer its writing style to Mistral Small, but having a 24GB card I struggle to find a reason to pick Gemma over Mistral when one is so much quicker.
>>
>>106363423
You noticed that but not the blues turning gray?
>>
File: tetter.jpg (122 KB, 634x357)
>>106363294
Most SV Tetos, while vastly more natural sounding compared to the old sovl UTAU and Vocaloid ones, sound the same. Might be that few producers make an effort with tuning to make her sound different.
Gets boring. Kino exception: https://www.youtube.com/watch?v=ekrAP7mzKa0
New vocaloid versions are meh imo, trying too hard to sound "real". Vocaloid V2-4 Miku variations sound different and have the soul of imperfection.
https://www.youtube.com/watch?v=rQRlSJJ0OrI
>>
File: 63643.jpg (72 KB, 960x540)
GPT OSS VS Grok 2 VS maverick
Who is the king of local?
>>
>>106361058
>Of course! This is a great question that gets to the heart of
you edited this right? I hope so
>>
>>106363452
me :D
>>
>>106363452
GPT OSS
>>
>>106363452
Drummer
>>
>>106363479
I didn't vote for you.
>>
>>106363518
Alright, rank them.
1. GPT OSS
2. Grok 2
3. Llama 4 Maverick
4. meee
>>
>>106363525
(You) > Grok 2 > Llama 4 Maverick > GPT OSS
Omitting any Chinese options is cheating though.
>>
>>106363452
glm 4.5 air
>>
>>106363540
i know my worth (less than chinese models)
>>
>being excited for >this
>>
>>106363757
He also said grok 3 in six months, which means this is just precedent-setting. I'm more excited for OpenAI getting their lunch eaten from all angles than the actual releases themselves.
>>
>>106363767
Dropping a model not a single person will ever use isn't eating anyone's lunch.
>>
>>106362206
>dynamic q2 for a close to lossless experience
Really? Q2? How much better is that than Q1?
>>
>>106363777
Why won't anyone use it?
>>
>>106363781
twice as many Qs bro
>>
>>106363245
No. YouTube is western nicodou
>>
>>106363794
It's old, big, and dumb. Much like your mother. They didn't even have the decency to release the base model.
>>
>>106363850
That sucks
>>
how much ram does mistral small 24b take up at 128k context?
>>
>>106363777
It makes OpenAI's eventual retirement of 4o instead of open-sourcing it look very weak, especially after their last open source shitshow. It causes a public loss of confidence, which is a useful antidote to their arrogance.
>>
>>106363861
They'll say it's too dangerous to release, and for all I know they actually believe that.
>>
>>106363868
The engineers and safety researchers might believe it, but I don't believe for a minute that sam does.
>>
>>106363903
Well, companies only open source last gen technology. And GPT-4o is still current gen for OpenAI :)
>>
>>106362417
Gotta hand it to him,
he delivers (eventually).
>>
grok-2 gguf status
>>
>>106364011
sir sglang is all you need sir
>>
>>106358772
With a 5090 rtx what could i make as far as video?
>>
File: wtf.jpg (33 KB, 447x315)
Can anyone explain to me how koboldcpp works with the offloading shit?

Why does it automatically start reducing the layers when I take something easy, like say a 24b Mistral Small model, up to 24k context? Does that mean my VRAM isn't enough or something? Because when I just manually set it to 43/43 it works fine, even quicker I think.

Should I ignore that Auto Offload Layer shit entirely?
>>
>>106363903
They kneecapped Toss, there's no way they'll ever release 4o
>>
Did anyone manage to get any TTS models working with RDNA3/ROCm on Arch? I need someone to explain it like I'm a fucking retard. Every attempt I've made has failed despite using the rocm torch packages, onnx and whatever else; I always end up with dependency conflicts. I fucking hate python environments and pip packages so much
>>
>>106363903
>but I don't believe for a minute that sam does.
The fact that Sam released the stinking pile of shit that was OSS makes me 50/50 whether he was trying to poison future open models from other companies with the approach that takes a sledgehammer to intelligence in the name of "safety", or whether he's genuinely schizo and believes peasants don't deserve what amounts to private internet access
>>
>>106364271
>Auto Offload
jank; ignore. Use that which makes it go faster through trial and error, then save the good config.
>>
File: 1750864005822845.jpg (44 KB, 706x692)
>>106358757
>Copyright lawsuit accuses Meta of using pirated adult films for AI training:
Kek
>>
I'd use Grok 2 as my Ani's brain. SADly I'm vramlet and ramlet
>>
>>106364011
not needed
>>
>>106364639
>>106364639
>>106364639


