/g/ - Technology






File: Miku-09.jpg (131 KB, 512x768)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106919198 & >>106904820

►News
>(10/17) LlamaBarn released for Mac: https://github.com/ggml-org/LlamaBarn
>(10/14) Qwen3-VL 4B and 8B released: https://hf.co/Qwen/Qwen3-VL-8B-Thinking
>(10/11) koboldcpp-1.100.1 prebuilt released with Wan video generation support: https://github.com/LostRuins/koboldcpp/releases/tag/v1.100.1
>(10/10) KAT-Dev-72B-Exp released: https://hf.co/Kwaipilot/KAT-Dev-72B-Exp
>(10/09) RND1: Simple, Scalable AR-to-Diffusion Conversion: https://radicalnumerics.ai/blog/rnd1

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: miguu.jpg (74 KB, 600x648)
►Recent Highlights from the Previous Thread: >>106919198

--Paper: Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity:
>106921354 >106921377 >106921488
--Paper: REAP the Experts: Why Pruning Prevails for One-Shot MoE compression:
>106926865 >106926930 >106926935 >106926946 >106926973 >106926999
--Cost-performance analysis of AMD 3950X 128GB vs custom server for LLM/home server/gaming:
>106919401 >106919472 >106919477 >106920242
--Synthetic data and conversational CoT dataset generation for LLM training:
>106919615 >106919833
--glm-chan model behavior and prompt optimization challenges:
>106919852 >106919884 >106919974 >106920107
--Defining uncensored models through role adaptability vs unpredictable behavior:
>106919886 >106920057 >106920564 >106920631 >106920777
--Limitations and workarounds for training LoRA on quantized models:
>106920664 >106920700 >106920848 >106921079 >106921407
--Sparse model scaling advantages over dense architectures:
>106920856 >106920874 >106920885 >106920916 >106920998 >106921046 >106921100 >106921142 >106921007
--Adding Metal4 tensor support to llama.cpp:
>106920993
--Proprietary GGUF format criticisms:
>106921215 >106923524 >106923584 >106923681 >106923793
--Struggles with AWQ model conversion and vLLM optimization:
>106922104 >106922122 >106922147
--AI/ML education vs practical skills and networking for job prospects:
>106922370 >106922549 >106922690 >106922736
--Valve devs improve Vulkan for llama.cpp AI:
>106930141
--LlamaBarn project announcement and real platform inquiry:
>106928231 >106928236
--Designing a multi-agent AI RPG with state management and narrative consistency:
>106930493 >106930613 >106931198 >106930663
--Challenges in RAG systems for base knowledge integration:
>106931465 >106931513
--Miku (free space):
>106924924 >106930166 >106930227 >106930335 >106930569

►Recent Highlight Posts from the Previous Thread: >>106919206

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>106931370
Thanks, but I usually use other people's checkpoints, and with them I don't see much difference between temperatures. The quality is better than I remember, though.
>>
>>106931562
>const static std::string pattern_moe_all = "blk\\.\\d+\\.ffn_(up|down|gate)_(ch|)exps";
Okay, I shouldn't need to set -ot myself at all then.
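For reference, a quick sketch of what that pattern catches, using Python's re with made-up tensor names (real GGUF names vary by architecture):

import re

# The pattern quoted above, with the C++ string escaping removed. The (ch|)
# alternation also covers the fused "_chexps" variants.
pattern_moe_all = re.compile(r"blk\.\d+\.ffn_(up|down|gate)_(ch|)exps")

# Hypothetical tensor names, purely for illustration.
names = [
    "blk.0.ffn_up_exps.weight",
    "blk.17.ffn_gate_chexps.weight",
    "blk.3.ffn_down.weight",   # dense FFN tensor, not an expert tensor
    "blk.5.attn_q.weight",
]

for name in names:
    print(name, "->", bool(pattern_moe_all.search(name)))
# blk.0.ffn_up_exps.weight -> True
# blk.17.ffn_gate_chexps.weight -> True
# blk.3.ffn_down.weight -> False
# blk.5.attn_q.weight -> False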
>>
https://github.com/ggml-org/llama.cpp/pull/16653
Posting again since it went up right at the end of the last thread, but proper automatic GPU memory allocation is finally coming.
>>
Elsa Hitler Margret
>>
Instead of using zram, zswap is the more performant option these days. Feels snappier. Even if you have a shit ton of RAM, zswap is still useful because it stabilizes system paging.
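If you want to check what zswap is currently doing on your box, the knobs live under sysfs; a minimal read-only sketch (Linux only, assumes zswap is built into your kernel):

from pathlib import Path

# Dump the current zswap parameters exposed by the kernel module.
params = Path("/sys/module/zswap/parameters")
for p in sorted(params.iterdir()):
    print(p.name, "=", p.read_text().strip())

# Persistent enablement is usually done with zswap.enabled=1 on the kernel
# command line rather than by poking these files at runtime.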
>>
File: rip.png (87 KB, 556x512)
bros, hf is after /ourguy/
>>
>>106931969
good riddance
>>
File: file.png (30 KB, 519x134)
>>106931980
>t.
>>
>>106931969
always were
>>
>>106931969
I lost like 40GB of private storage lately. Seems like they're trying to free as much space as they can
>>
>>106932013
b-but when it was discussed lately anons said it was le nothingburger and nothing would change?
>>
>>106931969
what about the guy that had like 20k merges
>>
>>106932026
Well, SaaS is always trash, nothing new here. I'm saving for a 20+TB drive now.
>>
File: file.png (11 KB, 211x148)
>>106932061
>*26K quants
thank you very much, and he seems fine, though he uploads to the team mradermacher account.
I wonder how much space they use.
For comparison, picrel: davidau has ~300 models and drummer is only approaching 200.
>>
>>106931969
Scammer lost. People won.
>>
If I were him I would only keep the latest few tunes on HF and park the older stuff somewhere else.
>>
>>106932168
but you're not him, and never will be
>>
Thankfully the userbase has developed enough to realize slop tunes are all placebo, and it's entirely a skill issue, or being too lazy to prefill a prude model.
>>
>>106932185
What do you mean?
>>
>>106932201
>skill issue
Only if earning money is the skill we are talking about.
>>
>>106932201
true, just use Gemma with the response you want pre-written by you in the sys prompt, it works 99% of the time better than nemo!
>>
>>106932210
Despite richfags constantly dunking on vramlets in the thread, they never post side-by-sides of the supposed retard vramlet model and their patrician richfag model, because they know in their heart of hearts that for the purpose of ERP you really don't need that many parameters.
>>
>>106932235
4B ought to be enough, if only they stopped trying to shove the entire internet in there.
>>
I am Drummer.
>>
File: imatrix.png (970 KB, 3110x1315)
>These quants provide best in class perplexity for the given memory footprint.
What am I missing?
IQ quants seem to be the meme I suspected them to be. All other inference params are identical; min_p=0.04 is the only sampler.
>>
File: file.png (1.05 MB, 871x796)
>>106932235
>for the purpose of ERP you really don't need that many parameters
>>
>>106931969
Open source work?
>>
>>106932258
No you're not.
>>
>>106932258
You only suck penises like me. But you aren't me.
>>
>>106932264
Well anon, come on then: post your favorite card with vramlet Nemo or Gemma and then with GLM. I'm sure we'll be able to see $3000 worth of prose improvement.
>>
>>106932290
Nemo and Gemma gave me ED. Glm-chan gave me PE.
>>
>>106932264
Air when? They're scammier than the drummer at this point.
>>
David won?
>>
>>106932106
damn, would really like to see the exact storage usage numbers
>>
>>106932340
schizotunes bros.... WE WON!!!
>>
genuine advice to drummer: make an AGPL llama.cpp fork with LoRA support, then upload only LoRAs.
i doubt you did a FFT of GLM Air, right? and for models that bartowski made quants of, you could delete your own quants to save space, just keep the original models.. i'd like you to publicly announce what you're gonna do before you start deleting models so we can archive some of your stuff, maybe. at least i know i'd like to.
good luck drumdrum, i still like trying your sloptunes no matter what anons say.
also instead of paying $200/month you could rent a seedbox and host models there or something..
>>
>>106932373
>schizobabble
Try running that through an LLM next time zoomie
>>
>>106932373
Question : >>106932363
>>
>>106931969
>open source work
The only thing he ever did was fill up their hard drives with shit models and there wasn't even anything open source about it.
>>
>>106931969
drummer, start an OF. I'll support you. show off that bussy while you do those 'toons baby
>>
>>106932373
Great advice. He should totally do that.
>>
>>106932395
He's retarded. LoRAs have always worked, but Drummer and the mouthbreathers that use his models probably wouldn't know how to load a LoRA. He also can't simply take llama.cpp and relicense it.
>>
>>106932433
loras are a pain in the ass to use with quanted shit
>>
>>106932433
I do remember LoRA not working in some specific circumstances (multi GPU?), but yeah.
As far as I know, people could release their LoRAs instead of just the final merged model.
I don't know how LoRA interacts with quantization, however: whether there's something specific you need to do for a given quant, or if it only works with the unquantized model in GGUF format, etc.
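For what it's worth, here's a rough sketch of what loading a separate LoRA over a quantized GGUF can look like with llama-cpp-python; the paths are placeholders, the adapter is assumed to already be converted to llama.cpp's GGUF LoRA format, and how well it behaves on top of a heavily quantized base is exactly the open question above:

from llama_cpp import Llama

# Quantized base model plus a separately shipped LoRA adapter, applied at load time.
llm = Llama(
    model_path="models/base-Q4_K_M.gguf",   # placeholder path to a quantized base
    lora_path="loras/my-tune-lora.gguf",    # placeholder path to a GGUF LoRA adapter
    n_gpu_layers=-1,
    n_ctx=8192,
)

out = llm("### Instruction: say hi\n### Response:", max_tokens=32)
print(out["choices"][0]["text"])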
>>
What's a lora?
>>
File: muah.jpg (1.08 MB, 3840x2160)
>>106931969
What does El Drummer actually do?
My assumption was that he does model merging / raw fucking around with the tensor data for no good reason. But if he's actually tuning model weights and people enjoy them, then respect.
>>
>106932201
this is who is now pushing for lora bs by the by
>>
>>106932492
He does tune, but it's all pretty half-assed and not very interesting. Most attempts are big flops that contribute absolutely nothing.
>>
>>106932459
It's like qlora but without quantization
>>
>>106932492
he does indeed do tunes, david is the one that's mainly merges
>>
>>106932501
>>106932509
We've moved beyond entertaining the concept of somehow merging trained model weights, right? huzzah
>>
>>106932509
I love that David exists.
Where else would you get
>DavidAU/Qwen3-MOE-2x8B-TNG-Deckard-Beta-16B
>>
>>106932577
Did anyone actually try any of these turds? Does David actually do anything to the weights, or does he just slap that shit together in mergekit and call it a day?
>>
>>106932577
I mean, look at this shit
>This is MOE model config of TWO "DND" (double neuron density) 8B models.
>The first model is trained on the TNG/Star Trek Universe (2 datasets) via Unsloth.
>The second model is trained on the Deckard/PK Dick Universe (5 datasets) via Unsloth.
>Both models use a BASE of Jan V1 4B + Brainstorm 40x (4B+ 40x => 8B parameters.)
>The MOE - mixture of experts - config is 2x8B - 16B parameters. With compression this creates a model of 13B - all the power of 16B in 13B package.
>This MOE drastically upscales the BOTH expert models in this MOE model.
>This model can also be used for Role play, Star Treking, Science Fiction, writing, story generation etc etc.
>The BASE model is (a 4B model + Brainstorm 40x adapter):
This is amazing.
>>
>>106932589
that's not the point, it's drummer that needs to be stopped, david is a wholesome bean.
>>
>>106932589
I've tried a couple and they were all, without fail, schizo out of the box.
Or exactly like Llama 3 8B with high temp, but taking 2x to 4x the memory.
>>
>>106932601
They're both retards wasting space and compute.
>>
>>106932601
if beans could be schizophrenic....
>>
File: file.png (149 KB, 603x920)
>>106932603
what class of model did you try doe?
>>
>>106932601
drummer bought an ad on 4chan. He is /ourguy/ regardless of any other factor.
>>
>>106931969
>scammer is no longer able to waste bandwidth advertising recycled toys over and over again
based
>>
>>106932616
I love the schizo ass model cards.
Seriously, it's pure ML voodoo.
It's great.
>>
>>106932631
Forgot the image.
>>
>>106932631
I assume you did consult the required reading material, right? https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters
>>
>>106932642
Of course.
Although you also have to keep the caveats of the individual models in mind, such as
>https://huggingface.co/DavidAU/Llama-3.2-8X3B-MOE-Dark-Champion-Instruct-uncensored-abliterated-18.4B-GGUF
>>
>>106932623
fuck off
>>
>>106932623
I forgot about that. Maybe drummer isn't a terrorist, he's just misguided.
>>
>>106932681
it'll be okay Icy-Helicopter-kyun do not worries
>>
Getting the MI50 to work on windows was such a pain in the ass.
>>
>>106932749
Do tell.
Just drivers? Some sort of incompatibility?
RocM issues?
>>
>>106932759
nta but amd never bothered with official windows support for the instinct cards
>>
File: cursed-nagatoro.png (3.43 MB, 1920x1782)
>>106932749
On Linux it worked out of the box.
>>
>>106932749
im flabbergasted you even got it to work on windows at all
>>
>>106932759
>>106932987
I had to flash the Radeon Pro VII vbios and use the Boot Camp drivers, because the normal drivers wouldn't work for some reason even though they are available on the AMD page.
Also, for some reason the flashing tool refused to work on Windows, so I had to flash it on Linux.
For anyone interested in replacing the original vbios with a better one, here is the page I used: https://gist.github.com/evilJazz/14a4c82a67f2c52a6bb5f9cea02f5e13
>>
>>106933044
that's a lot of fucking around for having it gimped by windows anyway
>>
>>106931969
Why make this public so soon? I don't think "Thank you HF for giving my popular coomer tunes special treatment!" is good PR for HF.
>>
>>106933044
Cool shit.
Thank you for sharing the link too.
>>
>>106932492
>What does El Drummer actually do?
His biggest success was making a model's thinking process push the model into safe-refusal mode. I think it was Nemo, so he made an unsafe model safe.
>>
>>106932749
What is this mindset? You are buying some used server gpus and are like "NONONO IT MUST WORK ON MY W10 MACHINE".
>>
>>106932593
>double neuron density
Wow. I think he actually isn't a shyster like drummer. He is just that sovlful.
>>
>>106932623
suck his dick and get HIV
>>
Is it david_AU cause he is golden?
>>
>NEVER
[4031, 3848]
>never
[37593]
>EXCLUSIVELY
[3337, 38953, 3166, 50309]
>exclusively
[327, 4256, 3210]
fuck you zuck
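This is also why single-token bans are fiddly: the cased and leading-space variants all tokenize to different IDs. A quick sketch for inspecting them with transformers (the model name is just an example):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Compare how cased and space-prefixed variants split into token IDs.
for word in ["NEVER", "never", " NEVER", " never"]:
    ids = tok.encode(word, add_special_tokens=False)
    print(repr(word), "->", ids, [tok.decode([i]) for i in ids])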
>>
>>106931567
Okay, I know there is stuff you can do to make text models send prompts to SD based on the contents of the chat.
How do you do that?

I've got Forge SD WebUI up and running.
I haven't done anything with locally hosting LLMs yet, only messed around with Janitor and Venus, but I can probably pretty easily get KoboldCPP+SillyTavern+Mistral up and running.
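The usual route is SillyTavern's Image Generation extension pointed at Forge's API (start Forge with --api) and a prompt that asks the LLM to describe the scene as tags. A rough sketch of the same plumbing done by hand, assuming default ports and an OpenAI-compatible endpoint from KoboldCpp or llama-server:

import base64, requests

LLM_URL = "http://127.0.0.1:5001/v1/chat/completions"   # KoboldCpp default port
SD_URL = "http://127.0.0.1:7860/sdapi/v1/txt2img"       # Forge/A1111 API, needs --api

chat_excerpt = "The two of them step out onto a rainy neon-lit street at night."

# 1) Ask the text model to turn the latest scene into a tag-style SD prompt.
r = requests.post(LLM_URL, json={
    "messages": [{"role": "user", "content":
        "Write a single comma-separated Stable Diffusion prompt for this scene, "
        "tags only, no prose:\n" + chat_excerpt}],
    "max_tokens": 120,
})
sd_prompt = r.json()["choices"][0]["message"]["content"].strip()

# 2) Send that prompt to Forge's txt2img endpoint and save the first image.
img = requests.post(SD_URL, json={
    "prompt": sd_prompt,
    "negative_prompt": "lowres, blurry",
    "steps": 25, "width": 512, "height": 768,
}).json()

with open("scene.png", "wb") as f:
    f.write(base64.b64decode(img["images"][0]))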
>>
>>106933185
get tokened, faggot
>>
>>106932428
I'm honestly surprised more people don't have OFs starring AI starlets.

The image technology is there.
The video technology is there.
The voice technology is there if you want to go that far.
The text generation technology is there.

Seems like Digital Pimp is a major career possibility.
>>
>>106933141
I'm traveling for some months so I'm lending my pc to my normie cousin because he wants to play some racing games and other stuff.
>>
Can any anons that use ik_llama.cpp sanity check me on my llama-bench.exe setup?

I can run llama-server without issue, but I can't seem to get bench to load. Using an IQ2 GLM-4.6 on 128+24GB.

I'm mainly getting "main: error: failed to load model 'model_path'"

Here is the PS script I'm using to run the bench - I've got -ngl 1 just to test, as my issues started when I tried to load any of the model onto the GPU.


# Change to the directory this script is in
Set-Location -Path $PSScriptRoot

# === Full path to your GLM-4.6 model ===
$MODEL = "G:\LLM\Models\GLM-4.6-IQ2_KL\GLM-4.6-IQ2_KL-00001-of-00003.gguf"

# === Launch llama-bench with recommended GLM-4.6 settings ===
& .\llama-bench.exe `
-m "$MODEL" `
-mmp 0 `
-ngl 1 `
-fa 1 `
-fmoe 1 `
-ctk q4_0 -ctv q4_0 `
-ot exps=CPU `
-t 20

Pause

>>
>>106933415
>-m "$MODEL" `
lol, lmao even. use your brain
>>
>>106933437
It's not very big ok

I have "$MODEL" on my scripts for loading it in llama-server.exe and it doesn't give me a hard time.
>>
>>106931969
The only decent model he put out was Unslopnemo 3.0 (4.0 is braindead; 4.1 is okay but its writing style is less fun). Everything else I tried is either mega slop or just bad.
>>
downloading ling to see if it can replace kimi sex
>>
>>106933415
I've got it working now.

For some reason, only ~13 of my 24GB of VRAM is used during these benches. Is that normal, or should I be looking to fully saturate that?

$MODEL = "G:\LLM\Models\GLM-4.6-smol-IQ2_KS\GLM-4.6-smol-IQ2_KS-00001-of-00003.gguf"

# === Launch llama-bench with recommended GLM-4.6 settings ===
& .\llama-bench.exe `
-m $MODEL `
-mmp 0 `
-ngl 999 `
-p 128,512 `
-n 128,512 `
-b 4096 `
-ub 4096 `
-fa 1 `
-fmoe 1 `
-ctk q8_0 -ctv q8_0 `
-ot exps=CPU `
-t 20

Pause
>>
>>106933766
>For some reason, only ~13 of my 24GB of VRAM is used
> -ot exps=CPU `
You can write an enormous -ot expression or you can wait for >>106931647
>>
Why didn't drummer do his own mememark? Big corpos lie about mememarks all the time. Why not make some fake bars himself?
>>
>>106933766
Do all of those arguments work on llama-bench?
I can't remember what it was exactly, but some stuff that worked in llama-server wasn't implemented in llama-bench IIRC.
Maybe it's the cache quantization, I dunno.
>>
>>106933790
Stop bullying Drummer... He might be a bit simple, but he doesn't mean any harm to anyone.
>>
>>106933802
if you run llama-bench.exe -h it'll give you a list of what it accepts.

>>106933782
What would the enormous -ot expression look like? All I've ever seen is exps=CPU and ".ffn_.*_exps.=CPU"
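Nobody posted one, but it's just a comma-separated list of regex=device pairs, so you can generate it. A sketch that keeps the first few layers' experts in VRAM and pushes the rest to CPU (layer count and device name are assumptions for your setup; check how your build resolves overlapping patterns):

# Build a long --override-tensor (-ot) argument programmatically.
gpu_layers = 8   # how many layers' expert tensors to keep on the GPU; the rest go to CPU

gpu_part = "|".join(str(i) for i in range(gpu_layers))
ot = f"blk\\.({gpu_part})\\.ffn_.*_exps\\.=CUDA0," + "exps=CPU"
print(f'-ot "{ot}"')
# -ot "blk\.(0|1|2|3|4|5|6|7)\.ffn_.*_exps\.=CUDA0,exps=CPU"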
>>
Why isn't there an LFM2-VL 2.6B yet? Also abliterated/uncensored?

Need to caption 300k images; all the LLMs suck and I still have to stick with Florence-2...
>>
>>106933822
That is Undi and Davidau.
>>
>>106933185
>token banning in 2025
we have string bans now gramps
>>
File: screenshot0240.png (1.99 MB, 2202x1238)
>>106932264
i agree with this image
>>
>>106934099
I can't read those bent letters
>>
>>106934183
>t. qwen vl
>>
File: niggers cant read this.jpg (66 KB, 1080x817)
>>106934183
are you black?
>>
>>106934190
>t. moniqwen
>>
>>106934211
They can't?
>>
>>106934216
They can barely read normal words, and let's not talk about reading comprehension...
>>
Minimum specs to run GLM 4.6?
>>
>>106934263
24GB VRAM + 128GB RAM (maybe 96?)
>>
>>106934263
128gb ram + 24gb vram
>>
>>106932235
Because it's not about the quality of the output, it's about how much you need to reroll and tard-wrangle a small model until you get what you want, whereas a large model just gets it. The quality of a small model's output may even be better in a side-by-side comparison; the difference is that with a small model you struggle to make it output what you have in mind, while with a large model you're balls deep in actual RP.
>>
>>106934269
>>106934271
>32GB 5090 with 64GB RAM
So close and yet so far.
>>
>>106934283
ram is cheap though
>>
>>106934277
You can reroll a small model a hundred times by the time your offloaded large model finishes its first gen.
>>
File: file.png (250 KB, 1303x760)
>>106934288
was*
>>
GLM 4.6/Deepsex for the first 20k tokens or so followed by Qwen 235B/22A thinking up to 60k context is the KINO setup for long-form lorebook RP, prove me wrong.
>>
>>106934288
I'm unfortunately a 2 slotkek.
>>
>>106934316
2x64gb sticks exist
>>
>>106934305
Does it actually get better and not worse with 20k prefill?
...
...
I actually never tried Qwen above 10k tokens, and I was using it for at least two months. With glm-chan I'm hitting 10k almost every day. Can't just be the tokenizer, right? Weird.
>>
What are the differences between GLM 4.6 and 4.5 for practical use? I don't give a shit about benchmark faggotry.
>>
>>106934288
NTA but my cpu/mobo doesn't boot if I try to use more than 2 32gb ram sticks
>>
>>106934295
Waiting is fine, reading garbage output ruins immersion
>>
>>106934341
From what I noticed, the 2507 thinking version at Q8 does keep the same syntax as the previous context. I also use a user prefill regarding the paragraph formatting so it doesn't devolve into one-word sentences, and it seems to hold together.
I'm using it because it has the best high-context performance out of all local models outside of Deepsex v3.2 at the moment (which isn't even implemented yet), while still being pretty damn fast even at high context. Again, it's meh when used at 0 context due to its quirks and lack of world knowledge, but with 20k+ of context filled in, it's acceptable. Give it a try if you have the RAM.
>>
>>106934381
Have you tried updating your bios?
>>
MTP support soon inshallah
>>
>>106934635
And Qwen 80b!
>>
File: 1704599385920673.png (354 KB, 488x651)
can any anons recommend the current best general knowledge model? something encyclopedic on science, medicine, history, coding.
I don't care about roleplay or artistic output, just something to answer my inane comments. I am currently using gemini 3 27b.
>>
>>106934656
In general, the larger the model, the more knowledge.
So something like kimi I guess.
Of course, Gemma models for example know a lot more than any other model in their weight range, but they are relatively small.
>>
>>106934635
Be the change you want to see bro
>>
>>106934635
Vibe coders will save us.
>>
>>106934381
Pretty sure for AMD there were a bunch of BIOS updates in June or earlier that enabled 64GB DIMM support for many vendors.
>>
>>106934381
>>106934707
Disregard that, I thought you were trying to put 64GB sticks in there.
>>
>>106931969
Based. /lmg/ shills BTFO.
>>
File: 1521254484420.jpg (625 KB, 2048x1365)
>>106934669
thanks anon, sorry I made typo with gemini, I was indeed using gemma 3 27b. I'll try kimi
>>
>>106934721
>cheetah
They want to be our pets SO BAD.
>>
>>106933658
the fuck kind of rig do you have?
>>
>>106934635
https://voca.ro/1915MlAOFtMx
>>
https://www.tomshardware.com/tech-industry/jensen-huang-says-nvidia-china-market-share-has-fallen-to-zero
>Jensen says Nvidia’s China AI GPU market share has plummeted from 95% to zero
lol, lmao even?
>>
File: 1751552205845476.png (210 KB, 498x529)
>>106931567
Question for anons who RP with LLMs: do you typically set a specific max output tokens setting, or do you usually stick with whatever default your inference engine/webui uses? Sometimes I'll enter a prompt and the output from the "person" I'm role-playing with (I typically have a system prompt that tells the LLM to act as a specific person or persona) is only a sentence or two, and other times it outputs an entire paragraph. Which output length I get seems to be completely random. Sometimes I'll do a particular prompt and it shits out a paragraph or two worth of text; I'll restart the engine and input that exact prompt, and this time it'll only be a couple of sentences. Is it better to set your own max token output? I don't RP with it that often, so I'm not really sure what counts as "too much", "too little", "good", or "bad".
>>
>>106934774
>Which output length it does seems to be completely random
If the model's output ends before the token limit, then it's done saying what it meant to say. If it reaches the token limit, the reply will get truncated.
>Is it better to set your own Max token output?
It may truncate the reply. But you can.
>I'm not really sure what counts as "too much", "too little", "good", or " bad"
Whatever you prefer. You can nudge the model by just instructing it to give short or long replies. Results may vary.

The model has no idea what the token limit is, nor does it know how many tokens it generated already. It generates tokens until "it's done" (by generating an EOS token). The token limit is just a setting for the inference program or the client, not the model.
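Concretely, the client-side limit maps to n_predict in a raw llama-server /completion request; a minimal sketch (default port, and response field names vary a bit between versions):

import requests

resp = requests.post("http://127.0.0.1:8080/completion", json={
    "prompt": "Write a short scene in a tavern.\n",
    "n_predict": 200,      # hard cap on generated tokens for this request
    "temperature": 0.8,
}).json()

print(resp["content"])
# Stop-related fields in the response show whether generation ended on EOS
# or hit the n_predict cap (e.g. stopped_eos / stopped_limit on older builds).
print({k: v for k, v in resp.items() if "stop" in k or "truncated" in k})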
>>
>>106934774
It isn't what you think it is. All it does is reduce the model’s context size by max output tokens, so the response will fit within the context
>>
>>106934851
>All it does is reduce the model’s context size by max output tokens
No. It stops generating once the output limit is reached in the current gen request. It's equivalent to the n_predict setting in a gen request to llama-server.
>so the response will fit within the context
No. It's to prevent run-away generation or to just generate in chunks with [continue]. The reply will not necessarily fit in the context as it can get truncated.
>>
Llama.cpp is refusing to load a finetune converted to gguf from an axolotl checkpoint, which according to Grok is because the rank of the lora_a is 256 while the rank of the lora_b is 128. The rank of the lora was supposed to be 128. Any ideas?
>>
>>106934886
>The rank of the lora was supposed to be 128.
Well? Is it 128? 100% sure?
>>
>>106934774
I set the output length to the maximum supported length when using instruct mode.
The model will generate as much text as it deems necessary. There's no point in cutting it short, especially when reasoning is enabled.

For text completion though I'll put the gen limit at 512 tokens, since text completion will just keep generating text until you stop it.
512 tokens is enough to write a paragraph or two, and gives me a chance to make edits or steer the model before continuing.
>>
>>106934774
For me, 550 is about as much as I allow for non-thinking models, for a more book-like experience with three paragraphs.
For reasoning models you have to set it higher, because the reasoning process uses those tokens.
Are you using SillyTavern?
>>
File: 2025 dram market.png (88 KB, 1280x720)
dam
>>
>>106934875
Yes. It's what happens in any practical situation
https://github.com/SillyTavern/SillyTavern/blob/74c158bd2e98b8b4dc54d2bb0d088c5a5e918826/public/script.js#L5084
If you set a 4K max response with 16K context, you are only getting 12K tokens for your prompt and chat history
>The reply will not necessarily fit in the context
Wrong. Prompt + response can't be longer than the context
>to just generate in chunks with [continue]
The sole reason for generating in chunks is to provide the model with as much context as possible at the start of a reply
>>
>>106934774
You need to tell the model to keep its replies under 200 tokens unless asked to provide a long answer, for example.
I'll keep output length at infinite, it doesn't do that much.
>>
>>106935020
>The reply will not necessarily fit in the context
>Prompt + response can't be longer than the context
Yes. Meant to say "The reply will not necessarily fit in the token limit".
>All it does is reduce the model’s context size
It does not reduce the context size. It's set at launch. But it does reduce the gen limit so that it doesn't go over the context size.
>so the response will fit within the context
The *generated tokens* will fit in the context, not necessarily the entire reply. That cannot be guaranteed.
>>
>>106934898
When I downloaded another finetune from HF it had the same error so I think there must be some issue with the trainer or Grok was just wrong.
>>
>>106935099
Show your llama-server output where you get the error. Post the fucking models you tried, at least. Can you load other models?
>>
>>106935091
It reduces the available context size for the prompt and chat history. At this point, I refuse to participate in the nitpicking contest
>>
>>106935162
It cannot reduce the context size. That's set at launch. It reduces the gen limit so that it doesn't go over the context size.
>>
>>106935129
Ok, gimme a minute.
>>
>>106935187
You're either retarded or can't read
>>
>>106935187
Ring Attention exists.
>>
File: G3jKisDWsAAHAWW.jpg (506 KB, 2048x1536)
>>
>>106931969
this guy is such an e-begging piece of shit. his troon tunes, all of them, are worse than the originals
>>106931997
i mean, the drummer sucks, but this is the kind of corporate bootlicking you only see on r*ddit. literally the worst humans on the planet
>>
>>106932264
this
>>
>>106935564
>when she talks while deepthroating your dick as you fuck her ass
>>
>>106933196

You don't need Forge; KoboldCpp has image generation support too and the same models work in it. It even supports models Forge doesn't, like WAN, Qwen Image and Kontext.
>>
>>106934211
to be fair, that's very bad cursive.

also, it's anyone under the age of 30.
>>
>>106934945
>lpddr
>ai
>>
I wish there was a balance between GLM-4.6 and K2-0905. It feels like GLM-4.6 is a bit too clean and K2-0905 has too much of a slop tendency. If they were blended together it would be the perfect model.



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.