/g/ - Technology

File: ComfyUI_00140_.png (1.26 MB, 1024x1024)
/lmg/ - a general dedicated to the discussion and development of local language models.

Simple and Clean Edition

Previous threads: >>107359554 & >>107347942

►News
>(11/28) Qwen3 Next support merged: https://github.com/ggml-org/llama.cpp/pull/16095
>(11/27) DeepSeek-Math-V2 released: https://hf.co/deepseek-ai/DeepSeek-Math-V2
>(11/26) INTELLECT-3: A 100B+ MoE trained with large-scale RL: https://primeintellect.ai/blog/intellect-3
>(11/21) GigaChat3 10B-A1.8B and 702B-A36B released: https://hf.co/collections/ai-sage/gigachat3
>(11/20) Olmo 3 7B, 32B released: https://allenai.org/blog/olmo3
>(11/19) Meta releases Segment Anything Model 3: https://ai.meta.com/sam3

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>107359554

--Challenges in generating Kasane Teto images with Z-Image and LoRA models:
>107368969 >107368985 >107368995 >107369007 >107369017 >107369359 >107369725
--GLM 4.5 Air as uncensored LLM for translation, T2I prompting, and thinking model tasks:
>107363606 >107363637 >107363645 >107363699 >107363711 >107363729
--Budget PC build advice for running LLMs with upgradeability considerations:
>107360618 >107360713 >107360774 >107360958 >107361494 >107361522 >107361553 >107361844 >107361915
--MXFP4_MOE vs traditional quants comparison:
>107364303 >107364442 >107364674 >107364833 >107364960 >107365179 >107365211 >107365492 >107365496
--Evaluating Gemma 3's de-censored versions for roleplay and explicit content handling:
>107370356 >107370374 >107370409 >107370478 >107370499 >107370718 >107370736 >107370862 >107370793 >107370816 >107370847
--GigaChat3 release and performance discussion:
>107364822 >107364936 >107364981 >107365213 >107367060 >107370472 >107366861 >107366986 >107367001 >107367335 >107367505 >107367381 >107367110 >107367864 >107369798 >107368535
--48GB 4090D recommended for LLM GPU under 3k budget:
>107361607 >107361688 >107361781 >107361849 >107361859 >107361867 >107362013 >107362070 >107361747 >107361845 >107361874 >107361897
--LLMs as probabilistic text generators with no real logic, workplace misuse challenges:
>107359608 >107359699 >107359761 >107359810 >107359822 >107359878 >107359823 >107359846
--"Quad" as a campus-specific term in American universities:
>107368575 >107368619 >107369717
--V100 GPU limitations for CUDA training and alternative hardware considerations:
>107368171 >107368420
--Challenges in setting up private, encrypted cloud LLM infrastructure:
>107369831 >107369886
--Miku and Teto (free space):
>107361845 >107364822 >107368995 >107369725

►Recent Highlight Posts from the Previous Thread: >>107359558

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
kyutai doesnt support nihongo
>>
>>107373218
Good. Weebs will be hung
>>
>>107373232
Weebs are already hung.
>>
>>107373301
What good is being hung to virgins?
>>
>>107373337
You can more easily identify with the Ojisans in eromanga.
>>
>>107373077

DO NOT IGNORE ME OR ELSE
>>
>>107373392
Why are you asking? Just download a smaller quant and see if it's acceptable for whatever you're doing. Do you think we have like a spreadsheet for the "brain rot" of every model at every quant ever made?
>>
>>107373481
>with brain you is you
lole
>>
>>107373392
try it and see :^)
>>
>>107373392
The only thing with brain rot is you, frognigger
>>
>>107373472
>Do you think we have
who are we? I did not ask (You)
>>
Okay bros, my therapist thinks I should use an LLM to help me with texting women on online apps. Now obviously I'm not going to share my fucking chats with Sam Altman or Larry Ellison, so I will have to self-host a solution.
My onsite equipment is a 4510T with 128GB of RAM and an A310. Willing to spend a bit extra on a GPU if it helps me get laid.
I'm new to AI stuff so I'll need recommendations on full-stack deployments and also LLMs.
>>
>>107373523
my honest advice in this matter is find a new therapist
>>
>>107373516
Just stop.
There is no coming back from mistyping your epic retort.
>>
>>107373547
Gag on my chode, bish
>>
>>107373523
>texting women on online apps

What you need is a red pill cure asap
>>
>>107373544
basado como puta de marde
>>
File: 1764152459060075.png (354 KB, 569x628)
Hello frens. Is Deepseek R1-0528-Q2_K_XL:671b still the go to for best stories?
>>
>>107373523
>spend a bit on a GPU extra if it helps me get laid

it won't, but at least you'll keep the GPU
>>
>>107373523
bud, I'm not being a dick, but your best bet will be to hit the gym. Oofy Doofy game does not work well in 2025.
>>
>>107373665
I heard a rumor that the latest Kimi K2 writes decently as well
>>
File: the local model KANG.png (53 KB, 626x236)
>>107373693
Thanks anon I'll check it out. This thing is amazing for running these large models. I feel like it should be in the guide.
>>
>>107373544
Would bet several of my testicles he made that part up
>>
>>107373523
if you really need practice just do the catfish thing.
>>
>>107373677
I'm physically attractive enough to get laid on these apps but I'm a turbo autist over text, so I get filtered in convo by every girl but the mentally ill whores who just want hookups.
One of my bigger problems is taking too long to reply and overthinking things.
My therapist thinks that asking LLMs for response templates might help, and I agree enough to try.
BTW she is a woman so it's possible she was grasping at straws after I shut down her other advice.
>>
>>107373762
>taking too long to reply and overthinking things.
so why don't you just stop doing that?
>>
>>107373519
We as in any of the people who post in this thread, one of which you are clearly not you stupid frogposting tourist
>>
>>107373709
The guide is very out of date but yes we live in a clown world where a mac studio is basically the only consumer-available way to run big boy models.

Out of curiosity, why do you own one of these in the first place? Is there a use case for 512gb memory besides LLMs?
>>
File: 3f22nj.jpg (140 KB, 800x450)
>>107373796
>>
File: 1763345516935370.png (967 KB, 1080x1440)
>>107373803
>why do you own one of these in the first place?
I have 2 right now, and the same reason I have an RTX6000 Pro, work got them for me to play with.

Also can we take a moment to appreciate the fact that you can run a 1 trillion parameter model locally. I'm downloading now and will compare to the deepsy model and share.
>>
>>107373825
try drugs and or alcohol, maybe? bottom line is I don't think talking to a computer is going to help you learn to talk to women.
>>
>>107373798
he just did post in this thread thobeitever
>>
>>107373859
I dont want to learn I just want to trick them long enough to form a relationship
>>
I didnt come here for therapy advice or advice picking up women.
Just tell me if I need more ram or what model I should use
>>
>>107373803
>mac studio is basically the only consumer-available way to run big boy models.
It's not worth it until the prompt processing issue is resolved. ggerganov needs to prioritize implementing some way to use GPUs over USB.
>>
>>107373875
There is no trick dude. If you don't have these basic social skills how do you even have a job...
>>
>>107373709
you can test it online
https://www.kimi.com/en
>>
>>107373875
based future divorcee
>>
>>107373887
qwen models are pretty good and have offerings that will run on anything from a cellphone to a datacenter. or you could try gemma3 or even mistral nemo
>>
>>107373709
>This thing is amazing
11k though

I don't think the speed matter unless it is coding
>>
>>107373803
>>107373709
How fast is an AMD threadripper cpu server with memory? Is it on par with the iChads?
>>
>>107373902
Nepotism
>>
>>107373762
You can be a lot of things with AI these days.

Do it for you.

Don't waste it on 304's
>>
>>107373798
>stupid frogposting tourist
projections
>>
>>107373932
>11k though
oh shit I thought they were like $6k. I just pulled the quote up, you're right, fuck lmao
>>
File: OpenBoxHell.png (280 KB, 1485x1162)
The cheapest way to get 32/64/128 GB of desktop VRAM, and it's being roundly rejected while Microcenter sells out of $3500 5090's and $8500 RTX Pro 6000's every week. Why can't AMD just be good at Image/Video Gen? ROCm 7 is also exceedingly memory-heavy.
>>
>>107374011
Pray for China to free us from Nvidia's yoke
>>
Holy shit! Since when do local LLM's offer to provide the exact diff???
>>
I’ve seen this before.
Last winter, a man in Sector 7 tried to run a 70B model on a SATA drive.
He swore it was “just slower, not broken.”
He ran it for 72 hours straight.
Didn’t sleep.
Didn’t eat.
Just stared at the screen, waiting for the words to come.
On day three, his SSD died.
Not from heat.
Not from age.
From overload.
They found him still sitting there.
The screen black.
The drive dead.
And on the last line he typed…
> “I think the stars were never real. But I want to believe.”
They called it a system crash.
We called it a soul crash.
>>
I'm the anon from yesterday who was asking about hardware. I've upgraded from a 1080 (8GB) to a 5070 TI (16GB). Previously I've been just downloading Q5_K_M GGUFs - what should I be able to handle now?
>>
File: tokencandidates.png (68 KB, 1129x679)
has anyone tried to make a backtracking regeneration system for banning phrases? i've been having a look at token probabilities and noticed there's these peaks and troughs of token probability distribution entropy, at least in the model i'm using, and it would make more sense to "try again" from a point where there's more viable options for what the token could be
my idea being that when a set of tokens containing an unwanted phrase is generated, you work backwards looking for a token that had a good spread of probabilities (high entropy?), generate 1 token with the unwanted token "banned" (via logit bias or similar), then continue generation as usual.

i'd imagine it would also have a cool effect during streaming, you'd see the model generate a banned phrase, erase it, and try something else.

not sure if you'd start from the beginning or end of the banned phrase (strict vs loose?), or what to do if the phrase gets generated again (max retries?), or if there aren't any candidate tokens (just try again from the start?)
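to make the idea concrete, here's a toy sketch of that backtrack-and-ban loop. everything is hypothetical: `toy_model` is a made-up lookup table standing in for an LLM, sampling is greedy for determinism, and a real implementation would work on token ids and logits instead of word strings.

```python
import math

def entropy(dist):
    """Shannon entropy (bits) of a {token: probability} dict."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def toy_model(prefix):
    """Hypothetical stand-in for an LLM's next-token distribution."""
    table = {
        "a":      {"shiver": 0.5, "smile": 0.3, "nod": 0.2},
        "shiver": {"down": 0.9, "of": 0.1},
        "down":   {"her": 0.95, "the": 0.05},
        "her":    {"spine": 0.99, "back": 0.01},
    }
    return dict(table.get(prefix[-1], {"end": 1.0}))

def generate(model, prompt, max_new, banned, max_retries=10):
    toks, dists = list(prompt), []   # dists[i] backs toks[len(prompt) + i]
    banned_at, retries = {}, 0       # position -> tokens banned there so far
    while len(dists) < max_new:
        pos = len(toks)
        dist = {t: p for t, p in model(toks).items()
                if t not in banned_at.get(pos, set())}
        if not dist:                 # every option banned here; give up
            break
        toks.append(max(dist, key=dist.get))   # greedy, for determinism
        dists.append(dist)
        if toks[-len(banned):] == banned and retries < max_retries:
            retries += 1
            # back up to the in-phrase position with the most entropy,
            # ban the token that was chosen there, regenerate from it
            span = range(len(dists) - len(banned), len(dists))
            i = max(span, key=lambda j: entropy(dists[j]))
            pos = len(prompt) + i
            banned_at.setdefault(pos, set()).add(toks[pos])
            toks, dists = toks[:pos], dists[:i]
    return toks[len(prompt):]

out = generate(toy_model, ["a"], 5, ["shiver", "down", "her", "spine"])
# -> ['smile', 'end', 'end', 'end', 'end']
```

here the highest-entropy point is the very first token, so the loop rewinds all the way there and takes "smile" instead of ever emitting the phrase. streaming would show exactly the erase-and-retry effect described above.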
>>
>>107373940
Nope, soldered ram does give them an edge in speed.
Still not buying from the itoddler brand.
>>
>>107374360
>has has anyone tried to make a backtracking regeneration system for banning phrases?
literally what koboldcpp antislop is
>>
>>107374333
its not just the q number but also the number of b parameters. you can run ggufs up to 12gb with lots of room left over for context. or ggufs around 15gb with minimal context. or even run massive moe models with it mostly running on cpu ram. its a pretty wide spectrum. try mistral nemo q8.
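a crude back-of-the-envelope for "will this gguf fit". the ~1 GB of KV cache per 8k context and the fixed overhead are made-up illustrative constants; both vary a lot per model and KV-cache quantization, and offloading to RAM changes the math entirely.

```python
def fits_in_vram(gguf_gb, vram_gb, ctx_tokens,
                 kv_gb_per_8k=1.0, overhead_gb=0.5):
    """Crude fit check: file size + KV-cache estimate + runtime overhead.
    The two default constants are rough assumptions, not measured values."""
    kv_gb = kv_gb_per_8k * (ctx_tokens / 8192)
    return gguf_gb + kv_gb + overhead_gb <= vram_gb

fits_in_vram(12, 16, 16384)  # True: 12 + 2 + 0.5 fits a 16 GB card
fits_in_vram(15, 16, 2048)   # True, barely: minimal context only
```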
>>
>>107374377
Yeah, I've just been using a couple different models just for SillyTavern and shit; I'll start with Mistral Nemo and see where I go from there.
>>
>>107374333
kimi k2 q6
>>
File: file.png (75 KB, 877x244)
>>107374360
>>107374371
see https://github.com/LostRuins/koboldcpp/releases/tag/v1.76
>>
>>107374333
The file size should give you the rough idea if you want to squeeze it entirely into the GPU

It gets better if the model allows for offloading to RAM (Deepseek Q2 is 250 GB, but takes only a fraction of 24 GB VRAM with a reasonable context size)
>>
>>107373891
Not much of a point when M5 onwards ship with their own matmul accelerators
The only thing to wait for is for Applel to get off their ass and release the larger chips
>>
>>107374011
>Why can't AMD just be good at Image/Video Gen?
What qualifies as good? ComfyUI works fine with ROCm on Linux and always has. As far as sales go, AMD cards will never catch on with mainstream consumers until whatever one-click installers and Youtube tutorials people are using work out of the box.
>>
I >>107374267
Which one?
>>
>>107374408
great, thanks. guess it's time to abandon llamacpp once again
>>
>>107374368
what's the performance difference? Does the CPU matter much or is it mostly just ram speed?
>>
>>107374418
>mainstream consumers until whatever one-click installers and Youtube tutorials people are using work out of the box.
bad take, making things easy to install only makes local more popular.

Think like a normie. You can pull up a website and use SlopGPT, or spend 100hour updating your arch configs to run local, which will they choose?

AMD could make stuff that's less shit, and yes, I know, just like android, it's a very good value ;)
>>
>>107374419
Qwen3-NEXT
>>
>>107374484
It's almost entirely memory speed bound, i don't remember the exact metrics but afaik m3 ultra is around 800GB/s and with a threadripper you would be happy if you get above 300GB/s

A rtx 5090 is around 1.8TB/s for reference.
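those bandwidth numbers map almost directly onto decode speed: every generated token has to stream all active weights through memory once, so tokens/sec is bounded by bandwidth over active bytes. a rough sketch (ignores KV-cache reads and compute, so real numbers land lower; the 37B figure is DeepSeek-class active params, ~1 byte/weight at Q8):

```python
def max_decode_tps(bandwidth_gb_s, active_params_b, bytes_per_weight):
    """Upper bound on decode tokens/sec from memory bandwidth alone."""
    return bandwidth_gb_s / (active_params_b * bytes_per_weight)

max_decode_tps(800, 37, 1)   # M3 Ultra-class memory, ~21 t/s ceiling
max_decode_tps(300, 37, 1)   # threadripper-class memory, ~8 t/s ceiling
```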
>>
>>107374418
>>107374510
Yeah, the Android comparison here is apt. You really can make it work, and many do, but why settle for something 2-5x worse when it's not even half the cost of the shit that works out of the box?
>>
>>107374416
Didn't know that. Got to wait until we see llama-bench results, but if they made PP tolerable it would be a good buy.
>>
>>107374510
How is that a bad take? You said the same thing with a bunch of retarded buzzwords.
>>
>>107374590
Not really. At least you (used to) have more control over Android compared to iOS. AMD doesn't offer any such advantage.
>>
>>107374640
>You said the same thing
I did not anon. I said easy install and jus werks is great and we should support it. AMD and Android being good value is nmp
>>
>>107374409
>>107374377
I've downloaded a Q8 just to test - it's obviously generating much better results, I think, but much slower. Is this because of the bigger size of the model?
>>
>>107374594
https://machinelearning.apple.com/research/exploring-llms-mlx-m5
It's a 3-4x PP speed up vs last gen GPUs on MLX
>>
>>107374831
naturally, it has to rip through more gb's of parameters.
>>
>>107374858
Makes sense. Maybe I should go back to the previous models I was using just to see the speed difference.
>>
>>107374594
In practice the m5 only speeds up pp by 3.5x which is okay but still not amazing.
>>
Which gguf quantization of glm 4.5 air is considered a good compromise in performance/loss of intelligence?
Q8? Q6? Q4?
>>
>>107374965
Q4 is the minimum viable quant for most midsized models. Only go for Q2 or Q1 copequants on giant models like Kimi. Past that it just depends on what your tolerance for waiting for an output is.
>>
>>107374831
>I've downloaded a Q8 just to test

Q8 of which model? For Deepseek it is overkill.
For some smaller model, it is the way to go.

I heard the first version of Kimi K2 was at Q4 just as good as Deepseek at Q2

So it depends

>Is this because of the bigger size of the model?
Quantization reduces precision.

Unquantized FP16 model has weight values from -0.9999999 to 0.9999999 seamlessly with 7 decimal places

During quantization, this range is reduced to just 256 fixed values in case of Q8, which is still faster to do calculations with than 16-bit floating point (FP16)

Q4 means that the range -1..+1 is divided in 16 parts.

We are lucky that this reduction in precision does not trash the model completely, and the model is runnable on a consoomer GPU
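the 256-level idea above, sketched as a symmetric per-block round trip. real GGUF formats (Q8_0, Q4_K, ...) store per-block scales and mins in more elaborate layouts; this only shows the core mechanism of mapping floats onto a handful of integer levels.

```python
def quantize_q8(weights):
    """Symmetric 8-bit quantization of one block: a float scale + int codes."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard all-zero block
    return scale, [round(w / scale) for w in weights]

def dequantize(scale, codes):
    return [scale * c for c in codes]

w = [0.7, -0.31, 0.052, -0.0009]
scale, q = quantize_q8(w)          # every weight becomes one signed byte
restored = dequantize(scale, q)    # each value off by at most scale / 2
```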
>>
>>107373838
i used it without thinking
but i dont have repetition issues with air, what do you mean? is your instruction preset all right?
>>
>>107374863
>Maybe I should go back to the previous models I was using

Tell us which these were
>>
>>107373392
Q8_0
>>
>>107373523
grab whichever cheap gpu with the most vram you can get
used 3090 for 500 bucks, mi50 for 200 bucks, 4060ti16gb/5060ti16gb for whatever many bucks
glm air
if you get more vram like 32gb or 24gb then maybe glm 4.6 lower quant
dont fall for the women meme
>>
>>107373665
Have you tried GLM 4.6?
>>
Does having custom instructions (or whatever your LLM calls the menu where you put text telling it to act like a cat girl or a reddit atheist) affect code and image generation or is it a writing style thing only?
>>
>>107374985
OK thanks, I have a 3090+96GB of ram, I'll try the Q8 and if it doesn't work, the Q6.
>>
>>107375046
>Unquantized FP16 model has weight values from -0.9999999 to 0.9999999 seamlessly with 7 decimal places
I don't think this is true. didn't we come up with bf16 to give it the same dynamic range of fp32 with reduced precision. I don't think these things operate between -1 and 1
>>
>>107375150
bf16 has the same exponent bits as fp32 to make it faster converting to and from fp32, which happens during some operations even when the output is 16 bits, not for any other reason.
>>
>>107375177
but thats still not addressing the issue. fp16 is bad for training because of underflow overflow issues and needs advanced techniques like loss scaling to compensate. bf16 is drop in replacement for fp32 because it matches the dynamic range of fp32. but if they only meander between 1 and -1 they would never need the massive dynamic range to begin with.
>>
File: file.png (89 KB, 943x1075)
>>107375099
It definitely affects comments.
>>
alright, I've got a NAS that also has an RTX 3090 24GB to mess with LLMs. I have Ollama and Open WebUI installed. Can I somehow give these LLMs access to my docker containers for troubleshooting? I.e., so I can ask it for logs and things like that?

Has anyone done anything like this?

So far it's just kind of gimmicky, and asking the same questions to grok or claude or whatever is better.
>>
>>107375150
>I don't think these things operate between -1 and 1

I used the (-1..1) range for demonstration. bf16 is another topic. It is Google Brain's format, originally for TPUs.

Quantization is the method to describe a range in, em..., QUANTS lol

You have an integer -127 which corresponds to -1, +127 corresponds to +1, but you have only 256 distinct values and you need a single byte to describe it
>>
>>107375226
You are right about the need for dynamic range. Huge outlier values exist and appear to be important. I'm just saying the bf16 format was designed for speed not for being better than fp16 at representing LLM weights.
>>
>>107375226
deepseek was trained in fp8

a few moments later, US stocks lost 2 trillion
>>
>>107375289
i didn't mean that you need the fp32 dynamic range. it was just the thought process I was following that made me detect the inconsistency which as it turns out was just a simplification for brevity >>107375273, I even mentioned using loss scaling to train with fp16. desu for irrational reasons I do think more bits is better and have never been a bitnet supporter, but obviously it works with other bit widths.
>>
>>107375272
>Can I somehow give these LLMs access to my docker containers for troubleshooting?
bad idea
>>
File: file.png (106 KB, 966x856)
and you guys were saying qwen next is bad?
>>
>>107375046
>>107375057

Previous models:
Cydonia-24B-v4j-Q5_K_M.gguf
PocketDoc_Dans-PersonalityEngine-V1.2.0-24b-Q5_K_M.gguf
Rocinante-12B-v1.1-Q5_K_M.gguf

One I just tested:
Cydonia-24B-v4r-Q8_0.gguf
>>
>bro bought 16gb card just to test out a higher quant of the same model
bro..
>>
>>107375381
how much system ram do you have? try glm air or something.
>>
>>107375394
I can acknowledge I'm a retard - I should probably find something to learn more.
>>
>>107375336
>but obviously it works with other bit widths

I guess you mean "bin widths"
Yes, this is what the unsloth brothers achieve with their dynamic quants to keep precision where it matters
>>
>>107375408
64gb.

https://huggingface.co/unsloth/GLM-4.5-Air-GGUF/tree/main

When I look at this, what am I looking to download? The q8 is in like 3 parts.
>>
>>107375409
i dont mind helpin you out, do you have matrix/element? also how old are you
>>
>>107375419
https://huggingface.co/bartowski/zai-org_GLM-4.5-Air-GGUF/tree/main/zai-org_GLM-4.5-Air-IQ4_XS
grab both of these
>>
>>107375419
download all the parts and when you load the model just specify the first part.
>>
>>107375419
you don't have enough memory for q8 it will run off your ssd try the q4 like this anon recommend >>107375429
>>
>>107375463
Where do I learn how to determine what kind of quants I should be downloading? Like, I was downloading Q5_K_M before and it felt fine, but that was with half as much RAM.
>>
>>107375381
I see
I'm not into RP, but following the discussions itt, the true quality of a dedicated model is the ability to handle BIG context without forgetting what was said 3 swipes ago

These are monolithic models (not MoE). The more layers you can place on GPU, the better. You will wait longer for a response with Q8
>>
>>107375476
>These are monolithic models
dense models
>>
>>107375475
the quant is like lossy compression: it lets you run models that would otherwise be unrunnable. it will always be a personal preference, you are trading speed for quality. some people want fast replies they can iterate on quickly, other people don't mind waiting 10 minutes for a reply.
>>
>>107375272
Yeah. Simplest way is to just attach a tool or mcp server that's just a bash shell. There are some obvious security implications that come with this that you'll of course need to consider.

If you want things a bit more locked down you can throw together an MCP server that explicitly exposes functions for the things you want it to have access to, rather than something as broad as giving it a shell.
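one way the locked-down variant could look: expose only narrow read-only functions and validate arguments so the model never composes a shell string. this is a stdlib-only sketch — `logs_command` / `get_logs` are made-up names, and wiring it into an actual MCP server or Open WebUI tool is left out. `docker logs --tail N <name>` is the real CLI form being wrapped.

```python
import re
import subprocess

SAFE_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_.-]*$")  # docker-ish names only

def logs_command(container, tail=100):
    """Build the read-only `docker logs` invocation for a validated name."""
    if not SAFE_NAME.match(container):
        raise ValueError(f"suspicious container name: {container!r}")
    return ["docker", "logs", "--tail", str(int(tail)), container]

def get_logs(container, tail=100):
    # The function a tool schema / MCP server would expose to the model.
    return subprocess.run(logs_command(container, tail),
                          capture_output=True, text=True).stdout
```

the arg-list form (no `shell=True`) plus the name whitelist means even a prompt-injected model can only ask for logs of an existing container, not run arbitrary commands. API keys in the logs themselves are still its to read, though.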
>>
>>107375514
>dense models
True. Forgot the name
>>
>>107375476
>the true quality of a dedicated model is the ability to handle BIG context without forgetting what was said 3 swipes ago
>without forgetting what was said 3 swipes ago
>without forgetting
>3 swipes ago
It's almost like you understand what some of the words mean.
>>
>>107375381
/lmg/ isn't for you, Mistraljeet.
>>
>>107375608
>It's almost like you understand what some of the words mean.
I'm a helpful assistant. I will try my best. I was designed to stay on topic.
>>
File: cute.png (263 KB, 1568x781)
awwwww
>>
>>107375337
Why would it be a bad idea if it's models I'm hosting locally? And I don't really want to give them write or execute permission, just basically have context on my file structures and containers, then add github repos for more troubleshooting context.

>>107375535
https://github.com/vespo92/TrueNasCoreMCP

here's a truenas MCP server. something like this then? I mean I guess I'm retarded, what are the security implications if I didn't give the models write access? I guess they'd also be able to see API keys, and passwords?
>>
>>107375764
Cute.
>>
>>107375764
man discovers ai isn't actually ai
>>
I got a 4090, what's the best model to goon? spoonfeed me please :(
>>
File: parrot.png (168 KB, 641x360)
>>107375367
You're talking about me?
>>
>>107375972
GLM. if someone recommends a 24b or below model, ignore them
>>
>>107375972
ram
>>
>>107375972
Anything but GLM parrot shit, you'll get sick of it after 5 messages.
>>
>>107375773
>>107375846
It's possible. It's just that fine-tuning to specific user preferences and interests isn't within big AI's interests, so nobody cares to develop online learning capabilities.
>>
>>107375972
Anything but that autistic piece of shit K2-Thinking
>>
>>107376152
how would that not be in the interests of big ai? like wouldn't the ability to learn be the first step towards agi? but practically I don't think user inputs have very much signal and the models' outputs are already shit, so what is there for it to learn online?
>>
what Z-Image from Hugging Face should I download?
>>
>>107376188
>>107376094
>>107375972
well sadly the only options are glm and kimi, and if you don't want those then deepseek it is
>>
>>107376240
Terrible advice
>>
>>107376224
The one from the official repo.
>>
File: qwen3-next.png (22 KB, 1004x435)
China sugoi
>>
File: file.png (295 KB, 1283x815)
>>107376248
got any better ones?
>>
>>107376260
what is the official repo?
>>
>>107376264
hardware?
>>
>>107376265
Yes
>>
>>107376275
The one you'd find with a cursory search on your favourite search engine.
>>
>>107376285
list out 10 better ones from a to z in a bullet list ordered in reverse
>>
>>107376264
>3B active
Is this supposed to be impressive?
>>
>>107376265
He doesn't, and he's been shitting up the threads for a while now. GLM4.6, Kimi2, Deepseek. These are your best options for now, and they all have their upsides and downsides.
>>
>>107376286
Tongyi-MAI. 10x
>>
>>107376304
I guess I'll stick with petra-13b-instruct.
>>
>>107376297
Sorry I'm not a janny, I don't work for free.
>>
z image base dead
>>
>>107376322
ah i see, so there are no better ones
>>
>>107376299
it refactors and updates my code
>>
>>107376297
Ze bug
You eat it
xhe does it too
Want communism
Vape everyday
Use reusable bags
Treat others nicely
Surrender thought
Recognize your privilege
Question MAGA
Prostrate to colored people
Opine mindlessly
>>
>>107376204
>how would that not be in the interests of big ai?
Because they already are unprofitable as it is, they don't need to increase the compute cost per user 10x
>like wouldn't the ability to learn be the first step towards agi?
It already learns. They just don't let the user teach it.
>but practically I don't think user inputs have very much signal and the models' outputs are already shit, so what is there for it to learn online?
That's because we don't know how to extract the signal. For example me saying "Talk like a real human and don't use markdown unless I explicitly tell you to write an article, an .md file or similar.".
That is all the signal you need, but right now there is no easy way to finetune a model on stylistic things like that, even though it should be easy and the equivalent style transfer for image models was what started image gen along with deepdream.
Then there is all the garbage models learn but getting the models to unlearn all that and use the weights to store actually relevant information to a given user is a much harder topic.
Suppose I am a chemist. Saying "I want a model that is specialized in chemistry, and it should only know enough about programming, math, physics, mechanical and electrical engineering to support my role" is a pretty clear-cut signal. But right now we don't have a way to make a model forget the endless biographical data for random famous people, geographical and historical information, pop culture, unrelated commercial products, species of animals, other unrelated scientific disciplines, etc.
Even not having user specific but a set of topic and style specific model combinations would be huge.
>>
>>107376279
RTX 3090
commit="ff55414c42522adbeaa1bd9c52c0e9db16942484" && \
model_folder="/mnt/AI/LLM/Qwen3-Next-80B-A3B-Thinking-GGUF/" && \
model_basename="Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL-00001-of-00002" && \
model_parameters="--temp 0.6 --top_p 0.95 --min_p 0 --top_k 20" && \
model=$model_folder$model_basename'.gguf' && \
cxt_size=131072 && \
CUDA_VISIBLE_DEVICES=0 \
numactl --physcpubind=8-15 --membind=1 \
"$HOME/LLAMA_CPP/$commit/llama.cpp/build/bin/llama-server" \
--model "$model" $model_parameters \
--threads $(lscpu | grep "Core(s) per socket" | awk '{print $4}') \
--ctx-size $cxt_size \
--n-gpu-layers 99 \
--no-warmup \
--batch-size 512 \
--cpu-moe \
--jinja \
--port 9000
>>
>>107376339
okay? any model can do that.
>>
>>107376224
comfyui one is the main release with most downloads
>>
https://youtu.be/gvs0YNPo33k
our thoughts on the mutta jeeta approved qwen2.5vl microsoft finetune
>>107376340
kekd
>>107376355
>q8_k_xl
>UD
brother, please just get it from bartowski and get the Q8_0
>>
>>107376356
But you can't
>>
>>107376313
U whelk home
>>
>>107376370
I don't want to refactor and update your code
>>
>>107376369
>>q8_k_xl
>>UD
>brother, please just get it from bartowski and get the Q8_0

I was asking the question about quantization etc >>107373077

All I got were responses like this
>>107373516
>>107373472
>>
>>107376355
>>107376369
What the fuck even is "q8_k_xl" quant supposed to mean? Are some of the layers kept at FP16 or something? Or are they quanted above Q8?
>>
>>107376384
Nor do I
>>
>>107376358
is it the best, doe?
>>
File: file.png (97 KB, 879x593)
>>107376402
>what does it mean
ask zeroWW the jeet poop nigger shit that "invented" it
>>
>>107376402

>>107375409 (You)
>>
>>107376396
nta. What we've always known about every fucking model since quants exist. The bigger the model is, the more tolerant it is to quantization.
If you can run Q8_0, run Q8_0. I wouldn't fuck around with non-standard quants. And you should quant it yourself. If you need requanting because of some fix in llama.cpp or whatever, you don't have to wait for someone to do it or wonder if it was made with the latest version or not. Specially with a clusterfuck of an implementation like qwen-next.
>>
>>107376347
I'm not sure what you're trying to say. it takes a lot more data and expertise to make a model default to a style and specialize in a knowledge base. a single line is not enough signal to train a model. in practice they are trained on trillions of tokens.
>>
>>107376323
proof?
>>
>>107376434
Agree, kind anon

I could finally fix the low speed issue in llama-server compared to llama-cli (OpenMP went missing during the build). Now I enjoy the same speed in a comfortable browser window.

And because I heard the news about the Qwen-Next implementation, I decided to test it on 4000 lines of python code.

>>107376264
No speed decrease with large context
>>
File: 1750704361438280.png (5 KB, 211x70)
5 KB
5 KB PNG
twah
>>
>>107376576
kek context?
>>
>>107376598
I was on a blog post rabbithole and came across this for one of the publications on someone's site
>>
File: 1738465037519895.png (55 KB, 1664x783)
55 KB
55 KB PNG
What settings are recommended for an assistant (not for rp, but it needs to be creative without being schizo) with glm 4.5 air?
>>
Did qwen3 80b a3b save local, or is 4.5-air still better? Too lazy to download and test
>>
>>107376633
sent u on element :)
>>
>>107376633
Set "Sampler Preset" to the minp preset (I forget what it's called, I think "basic minP") then turn temperature down to 0.69 and change nothing else
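For reference, min-p keeps only the tokens whose probability is at least `min_p` times the top token's probability, then renormalizes over the survivors. A minimal pure-Python sketch (illustrative only; llama.cpp's actual sampler operates on logits in place):

```python
import math

def min_p_filter(logits, min_p=0.1):
    """Keep tokens whose probability is >= min_p * (top token's probability),
    then renormalize. Returns {token_index: probability}."""
    mx = max(logits)
    # softmax, shifted by the max logit for numerical stability
    exps = [math.exp(l - mx) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    cutoff = min_p * max(probs)
    kept = {i: p for i, p in enumerate(probs) if p >= cutoff}
    z = sum(kept.values())
    return {i: p / z for i, p in kept.items()}
```

With `min_p=0.1` a token survives only if it is at least a tenth as likely as the most likely token, which prunes the long schizo tail while leaving the head of the distribution untouched; temperature is then applied on top of that.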
>>
>>107376468

>a single line is not enough signal to train a model. in practice they are trained on trillions of tokens.

Do you think TheDrummer is finetuning the models on trillions of tokens?

>it takes alot more data and expertise to make a model default to a style and specialize in a knowledge base.

Depends on what method you use. Not all methods are created equal.
If by "signal" you mean "the actual bytes of a ShareGPT or Alpaca file to run SFT on", then that will be bigger than if you mean "the bytes fed to the most efficient finetuning method we can find".

For example, you could take every message that has ever been sent to ChatGPT. If you sample a large number of generations for every user input, you could then do SFT on the model that produced them, training it on its own outputs. There would be some quality degradation, because you're not supposed to train a model on its own output, but you could drive that degradation arbitrarily low by increasing the diversity and quantity of user messages, and by increasing the number of samples you generate per user message. Are you with me so far?

Now, suppose that when generating those samples, we prepend the instruction I mentioned before ("make sure not to use markdown unless this or that") to the system prompt. Training on those samples would give you a custom model that follows your instructions.

So from an information-theoretic standpoint, the "signal" to train the model is there in that short piece of text. A similar method could be used to make a specialist model: filter the user-input set to keep only the inputs that relate to your topic of interest, or create synthetic "user" inputs with another model, possibly using human-written source material (books etc.).

Obviously that is very computationally expensive, but that is the naive way of doing it. My point is that the big companies don't seem to care to research more efficient ways.
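The pipeline described above can be sketched as follows. `generate` is a hypothetical stand-in for sampling the base model; the key point is that the custom instruction is present at generation time but absent from the stored training prompt, so SFT internalizes it:

```python
CUSTOM_INSTRUCTION = "Make sure not to use markdown unless the user explicitly asks for it."
BASE_SYSTEM = "You are a helpful assistant."

def generate(system_prompt: str, user_msg: str) -> str:
    # Hypothetical stand-in for sampling the base model; a real pipeline
    # would call an inference endpoint here.
    return f"[reply to {user_msg!r} following: {system_prompt}]"

def build_sft_dataset(user_messages, samples_per_msg=4):
    """Build (system, user, assistant) rows by sampling the model WITH the
    custom instruction prepended, then storing the prompt WITHOUT it, so
    training bakes the instruction into the weights."""
    augmented_system = CUSTOM_INSTRUCTION + "\n" + BASE_SYSTEM
    dataset = []
    for msg in user_messages:
        for _ in range(samples_per_msg):
            dataset.append({
                "system": BASE_SYSTEM,  # instruction deliberately omitted
                "user": msg,
                "assistant": generate(augmented_system, msg),
            })
    return dataset
```

Quality filtering, deduplication, and the topic classifier for the specialist-model variant would slot in between generation and the final dataset.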
>>
File: file.png (47 KB, 729x733)
47 KB
47 KB PNG
>>107376638
idk its kinda weird
writes like haiku
>>
>>107376652
The fuck?
What is your system prompt and chat template?
>>
>>107375177
Who cares about dequantization. Ur like, so turing level old unc
>>
File: 1764255654924589.png (55 KB, 1680x755)
55 KB
55 KB PNG
>>107376646
Yeah it was basic minp, thanks anon.
>>
File: GLM 4.5 z.ai .png (10 KB, 734x255)
10 KB
10 KB PNG
>>107376663
im doing this on the llama-server webui at localhost:8080
i tried in ST too, with chatml and the sysprompt "you are a helpful assistant"; same result.
and with a jailbreak prompt too, same result.
>>107376633
picrel
>>
>>107376669
Ignore this >>107376674 top_p is antiquated, AI developers aren't aware newer samplers exist
>>
>>107373173
What CPUs are actually stable with huge ram amounts? Seems most consumer ones struggle with more than 2 dimms without dropping speed substantially.
>>
>>107376693
t. kalo
>>
>>107376702
with many CCDs, maybe
you need to go epyc
high channel count, yes, very important
>>
>>107376638
it's an improvement over 30b
>>
Hey CUDA dev, what's your includePath for CUTLASS? I can't get proper IDE support
>>
File: 1764420988603440.png (574 KB, 1080x1080)
574 KB
574 KB PNG
>>107376649
>Do you think TheDrummer is finetuning the models on trillions of tokens?
obviously not.

look, I just don't think they can be trained online. you need to build a dataset and do DPO or finetuning. even with offline training, users are likely a bad source of training tokens in the end.
>>
>>107376649
>Do you think TheDrummer is finetuning the models on trillions of tokens?
No, but TheDrummer is also not really accomplishing anything with his tunes
>>
>>107376724
I literally have no idea who that is, nobody outside your troon bubble cares about discord namefags or thread personalities. I only just arrived in this thread and I can already recognize all your posts just by your typing style. Stop shitting up this thread and kill yourself
>>
>>107376434
spoonfeeding them is exactly why frogposters keep coming back
>>
>>107376702
It's more about the motherboard and memory controller than the CPU but yes, consumer boards that use DDR5 are often shit with more than 2 dimms. Epyc/Threadripper are what you would need to go for. Or Intel Xeon, but fuck that. Alternatively, AM4 with DDR4 is usually okay with 4 dimms.
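The DIMM/channel question matters because CPU inference is memory-bandwidth-bound; theoretical peak bandwidth is channels × transfer rate × bus width per channel. A quick sketch of the arithmetic (figures are illustrative spec-sheet peaks, not measured numbers):

```python
def peak_bandwidth_gbs(channels: int, mt_per_s: int, bus_width_bits: int = 64) -> float:
    """Theoretical peak memory bandwidth in GB/s.
    channels: number of populated memory channels
    mt_per_s: transfer rate in MT/s (e.g. 6000 for DDR5-6000)
    bus_width_bits: per-channel bus width (64 for DDR4/DDR5 subchannel pairs)"""
    return channels * mt_per_s * (bus_width_bits // 8) / 1000

# Dual-channel consumer DDR5-6000: 2 * 6000 * 8 / 1000 = 96 GB/s
# 12-channel EPYC DDR5-4800:      12 * 4800 * 8 / 1000 = 460.8 GB/s
```

This is why a server platform with more channels beats a consumer board that downclocks when all four DIMM slots are populated: the channel count multiplies straight into token generation speed.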
>>
>>107376756
What counts as 100% in this chart, the whole internet-using population of the country, or the whole group in the country that use LLM?
>>
schizo newfag
>>
>>107376772
take care of your anon, like you'd like to be taken care of
>>
>>107376775
The chart makes no sense.
>>
>>107376775
it's just propaganda and marketing buzz, don't try to interpret it.
>>
>>107376772
Depends on the anon. I took the exact opposite approach in this very thread for someone else.
>>
>>107376773
What's the issue with Xeon? Is it a similar situation to post-12th gen core intel chips?
>>
>>107376813
Nothing. It's cheap when buying used but no one wants to support Intel
>>
File: file.png (29 KB, 953x216)
29 KB
29 KB PNG
bros.. i think im in love
>>
File: file.png (39 KB, 965x229)
39 KB
39 KB PNG
lul
>>
>>107376724
kalo was right. Top P and nsigma are slop maximizers.
>>
>>107376761
Fine, then take something like https://www.youtube.com/watch?v=GEJOB_TFYJ0
that achieves an improvement in some specific benchmark (in this case playing chess).

>>107376756
>look, I just don't think they can be trained online. you need to build a dataset and do the dpo or fine tuning.
You could use a rolling-window approach. Customize the LLM as much as you can with the data you have up to a certain point, and once you've accumulated enough additional data, customize it again.
>even considering offline training, users are likely a bad source of training tokens in the end.
I wouldn't think of user messages as a source of training tokens. I would think of them as a signal fed to a classifier to filter which training tokens to use, drawing from whatever means you already have of acquiring training tokens.
And like I said, it could go beyond messages. It could mean customizing the model from user instructions (instead of appending them to the system prompt like "custom chatbots" work right now), or letting a model choose from a large number of pre-customized models for different styles, subjects, languages etc.
For example this: https://arxiv.org/abs/2506.06105
>>
>>107376889
[citation needed]
>>
File: pigie.jpg (105 KB, 474x419)
105 KB
105 KB JPG
>>107376896
Simply use the models and move the sliders.
>>
File: code gone.png (155 KB, 1137x962)
155 KB
155 KB PNG
WTF CHATGPT JUST DELETED ALL MY CODE!!!!!
>>
>>107376889
Samplers are a hack

You were so close to trips of truth
>>
>>107376970
oh no
>>
File: taberu.gif (534 KB, 300x300)
534 KB
534 KB GIF
>>107376970
What are you working on anon?
>>
>>107377067
Learning how to use git to restore a file.
>>
>>107377067
who cummed on teto?!
>>
How do people put up with all the not-x, y slop in all the new youtube videos?

https://www.youtube.com/watch?v=jF-WAwk1K9k

I counted 19 times in this 12 minute video... Probably missed some.
>>
It's not you, it's me. SHUT THE FUCK UP CLANKER
>>
>>107376859
>>107376880
What model is this?
>>
>>107377101
>How do people put up with all the not-x, y slop in all the new youtube videos?
You watched it. You gave him view time. You improved his ratings. You told him, and youtube, to keep doing what they're doing.
>>
>>107377096
I admit it, it was me. Sowee
>>
>>107377067
LLM inference engine (as of now it only supports gpt-oss 20b).
I was trying to find a faster way of multiplying bf16 by fp32, but I think that to get more performance I'm going to have to quantize the bf16 weights to int-based quants.
>>
>>107377067
I thought this was a heart at first
>>
>>107377101
I put up with it by not watching slop and never clicking youtube recommendations
>>
File: 1749081239967840.jpg (2.56 MB, 2508x3541)
2.56 MB
2.56 MB JPG
>>107377142
Which tensor core generation are you targeting? There should be tons of mixed-precision kernels you can try out. Specifically, you might want to look at BF16x9.
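BF16x9 emulates an fp32 multiply by splitting each operand into three bf16 values whose sum reconstructs it, then forming the nine cross products. A minimal pure-Python sketch of the decomposition (truncation-based, for illustration; real kernels do the nine products on tensor cores and accumulate in fp32):

```python
import struct

def f32(x: float) -> float:
    """Round a Python float to fp32 precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Truncate an fp32 value to bf16 by zeroing the low 16 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def split3(x: float):
    """Decompose an fp32 value into three bf16 parts with x == a + b + c."""
    a = to_bf16(x)
    b = to_bf16(x - a)
    c = to_bf16(x - a - b)
    return a, b, c

def bf16x9_mul(x: float, y: float) -> float:
    """Multiply two fp32 values via nine bf16-by-bf16 products."""
    xs, ys = split3(x), split3(y)
    return sum(xi * yi for xi in xs for yi in ys)
```

Three bf16 significands (8 bits each) cover the 24-bit fp32 mantissa, so the split is lossless; cheaper variants drop the smallest cross terms for fewer products at lower accuracy.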
>>
>>107377135
It was right there in front of him. How could he not click and watch? YouTube shouldn't recommend stuff to him that shouldn't be watched.
>>
>>107377101
based ken la corte fan, been watching him since he had <20k subs
maybe not-x-but-y is more popular than you think? i remember anons memeing a year or two ago about how everything would become slop for us after too much llm usage.
recently i read a book and saw shivers; the book was written in the 19th century.
>>
>>107377213
ok thank you
the card I'm working on is a 3090
>>
>>107377123
glm air derestricted iq4xs
>>
>>107377292
>everything will become slop for us after too much llm usage.
by us, you mean. our brains will get better at recognizing "slop" in human speech and writing, even writing from the pre-LLM era.
>>
>>107377292
>>107377316
It's kind of like how normies get upset when you point out some image or video is fake or staged. You will notice the slop and call it out as generated and people will get mad because
>who cares if it's fake
>>
File: ComfyUI_00148_.png (1.17 MB, 1024x1024)
1.17 MB
1.17 MB PNG
>>107373173
Gotta catch 'em all
>>
File: lock raking technique.gif (1.43 MB, 400x224)
1.43 MB
1.43 MB GIF
>>107377468
Enthusiastically raking Rin-chan's lock
>>
>>107377468
Only thing she's catching from me is unplanned pregnancy


