/g/ - Technology


Thread archived.
You cannot reply anymore.




File: Miku-09.jpg (131 KB, 512x768)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106919198 & >>106904820

►News
>(10/17) LlamaBarn released for Mac: https://github.com/ggml-org/LlamaBarn
>(10/14) Qwen3-VL 4B and 8B released: https://hf.co/Qwen/Qwen3-VL-8B-Thinking
>(10/11) koboldcpp-1.100.1 prebuilt released with Wan video generation support: https://github.com/LostRuins/koboldcpp/releases/tag/v1.100.1
>(10/10) KAT-Dev-72B-Exp released: https://hf.co/Kwaipilot/KAT-Dev-72B-Exp
>(10/09) RND1: Simple, Scalable AR-to-Diffusion Conversion: https://radicalnumerics.ai/blog/rnd1

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: miguu.jpg (74 KB, 600x648)
►Recent Highlights from the Previous Thread: >>106919198

--Paper: Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity:
>106921354 >106921377 >106921488
--Paper: REAP the Experts: Why Pruning Prevails for One-Shot MoE compression:
>106926865 >106926930 >106926935 >106926946 >106926973 >106926999
--Cost-performance analysis of AMD 3950X 128GB vs custom server for LLM/home server/gaming:
>106919401 >106919472 >106919477 >106920242
--Synthetic data and conversational CoT dataset generation for LLM training:
>106919615 >106919833
--glm-chan model behavior and prompt optimization challenges:
>106919852 >106919884 >106919974 >106920107
--Defining uncensored models through role adaptability vs unpredictable behavior:
>106919886 >106920057 >106920564 >106920631 >106920777
--Limitations and workarounds for training LoRA on quantized models:
>106920664 >106920700 >106920848 >106921079 >106921407
--Sparse model scaling advantages over dense architectures:
>106920856 >106920874 >106920885 >106920916 >106920998 >106921046 >106921100 >106921142 >106921007
--Adding Metal4 tensor support to llama.cpp:
>106920993
--Proprietary GGUF format criticisms:
>106921215 >106923524 >106923584 >106923681 >106923793
--Struggles with AWQ model conversion and vLLM optimization:
>106922104 >106922122 >106922147
--AI/ML education vs practical skills and networking for job prospects:
>106922370 >106922549 >106922690 >106922736
--Valve devs improve Vulkan for llama.cpp AI:
>106930141
--LlamaBarn project announcement and real platform inquiry:
>106928231 >106928236
--Designing a multi-agent AI RPG with state management and narrative consistency:
>106930493 >106930613 >106931198 >106930663
--Challenges in RAG systems for base knowledge integration:
>106931465 >106931513
--Miku (free space):
>106924924 >106930166 >106930227 >106930335 >106930569

►Recent Highlight Posts from the Previous Thread: >>106919206

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>106931370
Thanks but I usually use other people's checkpoints and with them I don't see much difference between temperatures, but the quality is better than I remember
>>
>>106931562
>const static std::string pattern_moe_all = "blk\\.\\d+\\.ffn_(up|down|gate)_(ch|)exps";
Okay I shouldn't need to do set -ot myself at all then.
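For reference, doing it by hand would look roughly like this (a sketch, assuming mainline llama-server's -ot/--override-tensor flag and the usual expert tensor naming):

llama-server -m model.gguf -ngl 99 -ot "blk\.\d+\.ffn_(up|down|gate)_exps=CPU"

The regex pins every MoE expert tensor to the CPU buffer while -ngl 99 sends the rest of the layers to the GPU; the pattern quoted above is basically that logic baked into the code.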
>>
https://github.com/ggml-org/llama.cpp/pull/16653
posting again since it got posted just at the end of the thread, but proper auto gpu memory allocation is finally coming
>>
Elsa Hitler Margret
>>
Instead of using zram, zswap is the more performant option these days. Feels snappier. Even if you have shit tons of RAM, zswap is still useful because it stabilizes system paging.
>>
File: rip.png (87 KB, 556x512)
bros, hf is after /ourguy/
>>
>>106931969
good riddance
>>
File: file.png (30 KB, 519x134)
>>106931980
>t.
>>
>>106931969
always were
>>
>>106931969
I lost like 40GB of private storage lately. Seems like they're trying to free as much space as they can
>>
>>106932013
b-but when it was discussed lately anons said it was le nothingburger and nothing would change?
>>
>>106931969
what about the guy that had like 20k merges
>>
>>106932026
Well, SaaS is always trash, nothing new here. I'm saving for a 20+TB drive now
>>
File: file.png (11 KB, 211x148)
>>106932061
>*26K quants
thank you very much, and he seems fine, though he uploads to the team mradermacher account,
I wonder how much space they use.
in comparison to picrel davidau has ~300 models and drummer is only approaching 200
>>
>>106931969
Scammer lost. People won.
>>
If I was him I would only keep the few latest tunes in HF and deposit older stuff to somewhere else.
>>
>>106932168
but you're not him, and never will be
>>
Thankfully the userbase has developed enough to realize slop tunes are all placebo and it is entirely a skill issue or being too lazy to prefill a prude model
>>
>>106932185
What do you mean?
>>
>>106932201
>skill issue
Only if earning money is the skill we are talking about.
>>
>>106932201
true, just use Gemma with the response you want pre-written by you in the sys prompt, it works 99% of the time better than nemo!
>>
>>106932210
despite richfags constantly dunking on vramlets in the thread, they never post side by sides of the supposed retard vramlet model and their patrician richfag model, because they know in their heart of hearts that for the purpose of ERP you really don't need that many parameters.
>>
>>106932235
4B ought to be enough, if only they stopped trying to shove the entire internet in there.
>>
I am Drummer.
>>
File: imatrix.png (970 KB, 3110x1315)
>These quants provide best in class perplexity for the given memory footprint.
What am I missing?
IQ quants seem to be the meme I suspected them to be. All other inference params identical min_p=0.04 sampler only
>>
File: file.png (1.05 MB, 871x796)
>>106932235
>for the purpose of ERP you really don't need that many parameters
>>
>>106931969
Open source work?
>>
>>106932258
No you're not.
>>
>>106932258
You only suck penises like me. But you aren't me.
>>
>>106932264
Well anon come on then, post your favorite card with vramlet nemo or gemma and with GLM i'm sure we'll be able to see 3000$ worth of prose improvement.
>>
>>106932290
Nemo and Gemma gave me ED. Glm-chan gave me PE.
>>
>>106932264
Air when? They're scammier than the drummer at this point.
>>
David won?
>>
>>106932106
damn, would really like to see the exact storage usage numbers
>>
>>106932340
schizotunes bros.... WE WON!!!
>>
genuine advice to drummer: make llamacpp agpl fork with lora support, then upload loras only
i doubt u did FFT of glm air right? and for models that bartowski made quants of u could delete the quants to save space. just keep original models.. id like you to publicly announce wat ur gonna do before u start deleting models so we can archive some of ur stuff maybe. at least i know id like to
goodluck drumdrum, i still like trying your sloptunes no matter what anons say.
also instead of paying 200$/month u could rent seedbox and host models there or something..
>>
>>106932373
>schizobabble
Try running that through an LLM next time zoomie
>>
>>106932373
Question : >>106932363
>>
>>106931969
>open source work
The only thing he ever did was fill up their hard drives with shit models and there wasn't even anything open source about it.
>>
>>106931969
drummer, start an OF. I'll support you. show off that bussy while you do those 'toons baby
>>
>>106932373
Great advice. He should totally do that.
>>
>>106932395
He's retarded. LoRAs have always worked, but Drummer and the mouthbreathers that use his models probably wouldn't know how to load a LoRA. He also can't simply take llama.cpp and relicense it.
>>
>>106932433
loras are a pain in the ass to use with quanted shit
>>
>>106932433
I do remember LoRA not working in some specific circumstances (multi GPU?), but yeah.
As far as I know, people could release their LoRAs instead of just the final merged model.
I don't know how LoRA interacts with quantization however, if there's something specific you need to do for a specific quant and such, or if it only works with the unquanted model in GGUF format, etc.
>>
What's a lora?
>>
File: muah.jpg (1.08 MB, 3840x2160)
>>106931969
What does El Drummer actually do?
My assumption was the model merging/raw fucking around with the tensor data, for no good reason. But if he's actually tuning model weights and people enjoy them then respect.
>>
>106932201
this is who is now pushing for lora bs by the by
>>
>>106932492
He does tune, but it's all pretty half assed and not very interesting. Most attempts are big flops that contribute absolutely nothing.
>>
>>106932459
It's like qlora but without quantization
>>
>>106932492
he does indeed do tunes, david is the one that's mainly merges
>>
>>106932501
>>106932509
We've moved beyond entertaining the concept of somehow merging trained model weights, right? huzzah
>>
>>106932509
I love that David exists.
Where else would you get
>DavidAU/Qwen3-MOE-2x8B-TNG-Deckard-Beta-16B
>>
>>106932577
Did anyone actually try any of these turds? Does David actually do anything to the weights, or does he just slap that shit together in mergekit and call it a day?
>>
>>106932577
I mean, look at this shit
>This is MOE model config of TWO "DND" (double neuron density) 8B models.
>The first model is trained on the TNG/Star Trek Universe (2 datasets) via Unsloth.
>The second model is trained on the Deckard/PK Dick Universe (5 datasets) via Unsloth.
>Both models use a BASE of Jan V1 4B + Brainstorm 40x (4B+ 40x => 8B parameters.)
>The MOE - mixture of experts - config is 2x8B - 16B parameters. With compression this creates a model of 13B - all the power of 16B in 13B package.
>This MOE drastically upscales the BOTH expert models in this MOE model.
>This model can also be used for Role play, Star Treking, Science Fiction, writing, story generation etc etc.
>The BASE model is (a 4B model + Brainstorm 40x adapter):
This is amazing.
>>
>>106932589
that's not the point, it's drummer that needs to be stopped, david is a wholesome bean.
>>
>>106932589
I've tried a couple and they were all, without fail, schizo out of the box.
Or just exactly like llama 3 8b with high temp but taking double to 4x the memory.
>>
>>106932601
They're both retards wasting space and compute.
>>
>>106932601
if beans could be schizophrenic....
>>
File: file.png (149 KB, 603x920)
>>106932603
what class of model did you try doe?
>>
>>106932601
drummer bought an ad on 4chan. He is /ourguy/ regardless of any other factor.
>>
>>106931969
>scammer is no longer able to waste bandwidth advertising recycled toys over and over again
based
>>
>>106932616
I love the schizo ass model cards.
Seriously, it's pure ML voodoo.
It's great.
>>
>>106932631
Forgot the image.
>>
>>106932631
I assume you did consult the required reading material, right? https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters
>>
>>106932642
Of course.
Although you also have to keep the caveats of the individual models in mind, such as
>https://huggingface.co/DavidAU/Llama-3.2-8X3B-MOE-Dark-Champion-Instruct-uncensored-abliterated-18.4B-GGUF
>>
>>106932623
fuck off
>>
>>106932623
I forgot about that. Maybe drummer isn't a terrorist, he's just misguided.
>>
>>106932681
it'll be okay Icy-Helicopter-kyun do not worries
>>
Getting the MI50 to work on windows was such a pain in the ass.
>>
>>106932749
Do tell.
Just drivers? Some sort of incompatibility?
RocM issues?
>>
>>106932759
nta but amd never bothered with official windows support for the instinct cards
>>
File: cursed-nagatoro.png (3.43 MB, 1920x1782)
>>106932749
On Linux it worked out of the box.
>>
>>106932749
im flabbergasted you even got it to work on it
>>
>>106932759
>>106932987
I had to flash the Radeon Pro VII vbios and use the bootcamp drivers because the normal drivers wouldn't work for some reason, even though they are available on the AMD page.
Also for some reason the flashing tool refused to work on windows so I had to flash it on linux.
For anyone interested in flashing the original vbios for a better one here is the page I used: https://gist.github.com/evilJazz/14a4c82a67f2c52a6bb5f9cea02f5e13
>>
>>106933044
that's a lot of fucking around for having it gimped by windows anyway
>>
>>106931969
Why make this public so soon? I don't think "Thank you HF for giving my popular coomer tunes special treatment!" is good PR for HF.
>>
>>106933044
Cool shit.
Thank you for sharing the link too.
>>
>>106932492
>What does El Drummer actually do?
His biggest success was making a model's thinking process push the model into safe rejection mode. I think it was Nemo, so he made an unsafe model safe.
>>
>>106932749
What is this mindset? You are buying some used server gpus and are like "NONONO IT MUST WORK ON MY W10 MACHINE".
>>
>>106932593
>double neuron density
Wow. I think he actually isn't a shyster like drummer. He is just that sovlful.
>>
>>106932623
suck his dick and get HIV
>>
Is it david_AU cause he is golden?
>>
>NEVER
[4031, 3848]
>never
[37593]
>EXCLUSIVELY
[3337, 38953, 3166, 50309]
>exclusively
[327, 4256, 3210]
fuck you zuck
>>
>>106931567
Okay, I know there is stuff you can do to make text models send prompts to SD based on the contents of the chat.
How do you do that?

I've got Forge SD WebUI up and running.
I haven't done anything with locally hosting LLMs yet, only messed around with Janitor and Venus, but I can probably pretty easily get KoboldCPP+SillyTavern+Mistral up and running.
>>
>>106933185
get tokened, faggot
>>
>>106932428
I'm honestly surprised more people don't have OFs starring AI starlets.

The image technology is there.
The video technology is there.
The voice technology is there if you want to go that far.
The text generation technology is there.

Seems like Digital Pimp is a major career possibility.
>>
>>106933141
I'm traveling for some months so I'm lending my pc to my normie cousin because he wants to play some racing games and other stuff.
>>
Can any anons that use ik_llama.cpp sanity check me on my llama-bench.exe setup?

I can run llama-server without issue, but I can't seem to get bench to load. Using a IQ2 GLM-4.6 on 128+24GB.

I'm mainly getting "main: error: failed to load model 'model_path'"

Here is the PS script I'm using to start the server - I've got -ngl 1 just to test, as my issues started when I tried to load any of the model into GPU.


# Change to the directory this script is in
Set-Location -Path $PSScriptRoot

# === Full path to your GLM-4.6 model ===
$MODEL = "G:\LLM\Models\GLM-4.6-IQ2_KL\GLM-4.6-IQ2_KL-00001-of-00003.gguf"

# === Launch llama-server with recommended GLM-4.6 settings ===
& .\llama-bench.exe `
-m "$MODEL" `
-mmp 0 `
-ngl 1 `
-fa 1 `
-fmoe 1 `
-ctk q4_0 -ctv q4_0 `
-ot exps=CPU `
-t 20

Pause

>>
>>106933415
>-m "$MODEL" `
lol, lmao even. use your brain
>>
>>106933437
It's not very big ok

I have "$MODEL" on my scripts for loading it in llama-server.exe and it doesn't give me a hard time.
>>
>>106931969
The only decent model he put out was Unslopnemo 3.0. (4.0 is braindead, 4.1 is okay but less fun with writing style) Everything else I tried is either mega slop or just bad.
>>
downloading ling to see if it can replace kimi sex
>>
>>106933415
I've got it working now.

For some reason, only ~13 of my 24GB of VRAM is used during these benches. Is that normal, or should I be looking to fully saturate that?

$MODEL = "G:\LLM\Models\GLM-4.6-smol-IQ2_KS\GLM-4.6-smol-IQ2_KS-00001-of-00003.gguf"

# === Launch llama-server with recommended GLM-4.6 settings ===
& .\llama-bench.exe `
-m $MODEL `
-mmp 0 `
-ngl 999 `
-p 128,512 `
-n 128,512 `
-b 4096 `
-ub 4096 `
-fa 1 `
-fmoe 1 `
-ctk q8_0 -ctv q8_0 `
-ot exps=CPU `
-t 20

Pause
>>
>>106933766
>For some reason, only ~13 of my 24GB of VRAM is used
> -ot exps=CPU `
You can write an enormous -ot expression or you can wait for >>106931647
>>
Why didn't drummer do his own mememark? Big corpos lie about mememarks all the time. Why not make some fake bars himself?
>>
>>106933766
Do all of those arguments work on llama-bench?
I can't remember what it was exactly, but some stuff that worked in llama-server wasn't implemented in llama-bench IIRC.
Maybe it's the cache quantization, I dunno.
>>
>>106933790
Stop bullying Drummer... He might be bit simple but doesn't mean any harm to anyone.
>>
>>106933802
if you run llama-bench.exe -h it'll give you a list of what it accepts.

>>106933782
What would the enormous -ot expression look like? All I've ever seen is exps=CPU and ".ffn_.*_exps.=CPU"
>>
Why isn't there LFM2-VL 2.6B yet? Also abliterated / uncensored?

Need to caption 300k images and all the LLMs suck and I have to stick with Florence-2 still...
>>
>>106933822
That is Undi and Davidau.
>>
>>106933185
>token banning in 2025
we have string bans now gramps
>>
File: screenshot0240.png (1.99 MB, 2202x1238)
>>106932264
i agree with this image
>>
>>106934099
I can't read those bent letters
>>
>>106934183
>t. qwen vl
>>
File: niggers cant read this.jpg (66 KB, 1080x817)
>>106934183
are you black?
>>
>>106934190
>t. moniqwen
>>
>>106934211
They can't?
>>
>>106934216
They can barely read normal words, and let's not talk about reading comprehension...
>>
Minimum specs to run GLM 4.6?
>>
>>106934263
24GB VRAM + 128GB RAM (maybe 96?)
>>
>>106934263
128gb ram + 24gb vram
>>
>>106932235
Because it's not about the quality of the output, it's about how much you need to reroll and tard wrangle a small model until you get what you want, whereas a large model just gets it. The quality of a small model's output may even be better in a side by side comparison; the difference is that with a small model, you struggle to make it output what you have in mind, while with a large model, you're balls deep in actual rp
>>
>>106934269
>>106934271
>32GB 5090 with 64GB RAM
So close and yet so far.
>>
>>106934283
ram is cheap though
>>
>>106934277
You can reroll a small model a hundred times by the time your offloaded large model finish its first gen
>>
File: file.png (250 KB, 1303x760)
>>106934288
was*
>>
GLM 4.6/Deepsex for the first 20k tokens or so followed by Qwen 235B/22A thinking up to 60k context is the KINO setup for long-form lorebook RP, prove me wrong.
>>
>>106934288
I'm unfortunately a 2 slotkek.
>>
>>106934316
2x64gb sticks exist
>>
>>106934305
Does it actually get better and not worse with 20k prefill?
...
...
I actually never tried Qwen above 10k tokens and I was using it for like 2 months at least. With glmchan I am hitting 10k almost every day. Can't be just the tokenizer right? Weird.
>>
What are the differences between GLM 4.6 and 4.5 for practical use? I don't give a shit about benchmark faggotry.
>>
>>106934288
NTA but my cpu/mobo doesn't boot if I try to use more than 2 32gb ram sticks
>>
>>106934295
Waiting is fine, reading garbage output ruins immersion
>>
>>106934341
From what I noticed, the 2507 thinking version at Q8 does keep the same syntax as the previous context. I also use a user prefill regarding the paragraph formatting so it doesn't devolve into one word sentences and it seems to hold together.
I'm using it because it has the best high context performance out of all local models outside of Deepsex v3.2 at the moment (which isn't even implemented yet), while still being pretty damn fast even at high context. Again, it's meh when used at 0 context due to its quirks and lack of world knowledge, but with 20k+ context filled in, it's acceptable. Give it a try if you have the RAM.
>>
>>106934381
Have you tried updating your bios?
>>
MTP support soon inshallah
>>
>>106934635
And Qwen 80b!
>>
File: 1704599385920673.png (354 KB, 488x651)
can any anons recommend the current best general knowledge model? something encyclopedic on science, medicine, history, coding.
I dont care about roleplay or artistic output, just something to answer my inane comments, I am currently using gemini 3 27b
>>
>>106934656
In general, the largest the model, the more knowledge.
So something like kimi I guess.
Of course, Gemma models for example know a lot more than any other model in their weight range, but they are relatively small.
>>
>>106934635
Be the change you want to see bro
>>
>>106934635
Vibe coders will save us.
>>
>>106934381
Pretty sure for AMD there were a bunch of BIOS updates in June or earlier that enabled 64GB DIMM support for many vendors.
>>
>>106934381
>>106934707
Disregard that I thought you were trying to put 64GB sticks in there
>>
>>106931969
Based. /lmg/ shills BTFO.
>>
File: 1521254484420.jpg (625 KB, 2048x1365)
>>106934669
thanks anon, sorry I made typo with gemini, I was indeed using gemma 3 27b. I'll try kimi
>>
>>106934721
>cheetah
They want to be our pets SO BAD.
>>
>>106933658
the fuck kind of rig do you have?
>>
>>106934635
https://voca.ro/1915MlAOFtMx
>>
https://www.tomshardware.com/tech-industry/jensen-huang-says-nvidia-china-market-share-has-fallen-to-zero
>Jensen says Nvidia’s China AI GPU market share has plummeted from 95% to zero
lol, lmao even?
>>
File: 1751552205845476.png (210 KB, 498x529)
>>106931567
Question for anons who rp with LLMs: do you typically set a specific max output tokens setting? Or do you usually stick with whatever default your inference engine/webui uses? Sometimes I'll enter a prompt and the output from the "person" I'm role-playing with (I typically have a system prompt that tells the LLM to act as a specific person or persona) is only like a sentence or two, and other times it outputs an entire paragraph. Which output length I get seems to be completely random. Sometimes I'll do a particular prompt and it shits out a paragraph or two worth of text, then I'll restart the engine, input that exact prompt, and this time it'll only be a couple sentences. Is it better to set your own max token output? I don't really RP with it that often so I'm not really sure what counts as "too much", "too little", "good", or "bad"
>>
>>106934774
>Which output length it does seems to be completely random
If the model's output ends before the token limit, then it's done saying what it meant to say. If it reaches the token limit, the reply will get truncated.
>Is it better to set your own Max token output?
It may truncate the reply. But you can.
>I'm not really sure what counts as "too much", "too little", "good", or " bad"
Whatever you prefer. You can nudge the model by just instructing it to give short or long replies. Results may vary.

The model has no idea what the token limit is, nor does it know how many tokens it generated already. It generates tokens until "it's done" (by generating an EOS token). The token limit is just a setting for the inference program or the client, not the model.
>>
>>106934774
It isn't what you think it is. All it does is reduce the model’s context size by max output tokens, so the response will fit within the context
>>
>>106934851
>All it does is reduce the model’s context size by max output tokens
No. It stops generating once the output limit is reached in the current gen request. It's equivalent to the n_predict setting in a gen request to llama-server.
>so the response will fit within the context
No. It's to prevent run-away generation or to just generate in chunks with [continue]. The reply will not necessarily fit in the context as it can get truncated.
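If you want to see it at the request level, a minimal sketch of a raw gen call to llama-server (assuming the default port and the /completion endpoint) looks like:

curl http://localhost:8080/completion -d '{"prompt": "Once upon a time", "n_predict": 128}'

n_predict there is exactly that per-request cutoff; the model never sees it and will happily get truncated mid-sentence if it wanted to keep going.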
>>
Llama.cpp is refusing to load a finetune converted to gguf from an axolotl checkpoint, which according to Grok is because the rank of the lora_a is 256 while the rank of the lora_b is 128. The rank of the lora was supposed to be 128. Any ideas?
>>
>>106934886
>The rank of the lora was supposed to be 128.
Well? Is it 128? 100% sure?
>>
>>106934774
I set the output length to the maximum supported length when using instruct mode.
The model will generate as much text as it deems necessary. There's no point in cutting it short, especially when reasoning is enabled.

For text completion though I'll put the gen limit at 512 tokens, since text completion will just keep generating text until you stop it.
512 tokens is enough to write a paragraph or two, and gives me a chance to make edits or steer the model before continuing.
>>
>>106934774
for me, 550 is about as much as I allow for non-thinking models, for a more book-like experience with 3 paragraphs.
reasoning models you have to set higher because the reasoning process uses those tokens.
are you using Silly Tavern?
>>
File: 2025 dram market.png (88 KB, 1280x720)
dam
>>
>>106934875
Yes. It's what happens in any practical situation
https://github.com/SillyTavern/SillyTavern/blob/74c158bd2e98b8b4dc54d2bb0d088c5a5e918826/public/script.js#L5084
If you set a 4K max response with 16K context, you are only getting 12K tokens for your prompt and chat history
>The reply will not necessarily fit in the context
Wrong. Prompt + response can't be longer than the context
>to just generate in chunks with [continue]
The sole reason for generating in chunks is to provide the model with as much context as possible at the start of a reply
>>
>>106934774
You need to tell the model to keep its replies under 200 tokens unless asked to provide a long answer, for example.
I'll keep output length at infinite, it doesn't do that much.
>>
>>106935020
>The reply will not necessarily fit in the context
>Prompt + response can't be longer than the context
Yes. Meant to say "The reply will not necessarily fit in the token limit".
>All it does is reduce the model’s context size
It does not reduce the context size. It's set at launch. But it does reduce the gen limit so that it doesn't go over the context size.
>so the response will fit within the context
The *generated tokens* will fit in the context, not necessarily the entire reply. That cannot be guaranteed.
>>
>>106934898
When I downloaded another finetune from HF it had the same error so I think there must be some issue with the trainer or Grok was just wrong.
>>
>>106935099
Show your llama-server output where you get the error. Post the fucking models you tried, at least. Can you load other models?
>>
>>106935091
It reduces the available context size for the prompt and chat history. At this point, I refuse to participate in the nitpicking contest
>>
>>106935162
It cannot reduce the context size. That's set at launch. It reduces the gen limit so that it doesn't go over the context size.
>>
>>106935129
Ok, gimme a minute.
>>
>>106935187
You're either retarded or can't read
>>
>>106935187
Ring Attention exists.
>>
File: G3jKisDWsAAHAWW.jpg (506 KB, 2048x1536)
>>
>>106931969
this guy is such an e-begging piece of shit. his troon tunes, all of them, are worse than the originals
>>106931997
i mean, the drummer sucks, but this is the kind of corporate bootlicking you only see on r*ddit. literally the worst humans on the planet
>>
>>106932264
this
>>
>>106935564
>when she talks while deepthroating your dick as you fuck her ass
>>
>>106933196

You don't need Forge, Koboldcpp has image generation support too and the same models work in it. It even supports models forge doesn't like WAN, qwen image and kontext
>>
>>106934211
to be fair, that's very bad cursive.

also, it's anyone under the age of 30.
>>
>>106934945
>lpddr
>ai
>>
I wish there was a balance between GLM-4.6 and K2-0905. It feels like GLM-4.6 is a bit too clean and K2-0905 has too much of a slop tendency. If they were blended together it would be the perfect model.
>>
File: file.png (1 KB, 190x28)
>>106935980
projections from 2024?
>>
File: 68er.png (607 KB, 1007x496)
Huh....apparently I can't run GLM 4.5 Air Q4_K_M on a fucking 5090? something feels odd here, maybe my settings are just wrong when running llama-server?
Feels like this shouldn't be an issue but idk?

helppp
>>
>>106936224
Did you move the expert tensors to RAM?
Are you trying to use the full 128k context?
>>
>>106936224
> failed to open GGUF file
00001 file looks corrupted, looks like you need to download it again.
>>
>>106936236
>Did you move the expert tensors to RAM?
How do I do that exactly? I run it using this exact command:

"llama-server -m GLM-4.5-Air-Q4_K_M-00001-of-00002 -c 32768"

>Are you trying to use the full 128k context?
No, I use 32768. I also tried just not setting a context at all (which I think defaults to some piss low 4096 amount) and that also did not work
>>
>>106936242
So this isn't an "out of VRAM" error? that's what I figured was happening. The file is corrupt?
>>
>>106936252
yeah you're not out of VRAM, it can't open the file
>>
>>106936250
>--n-cpu-moe 99 (47 would work the same)
That will probably leave a ton of free vram, then you can lower that to 46, 45, 44, etc, until you fir as much of the model in VRAM as possible for the fastest possible generation.

>>106936252
>that's what I figured was happening
It is.
>>
>>106936255
>>106936256
>it's a VRAM error
>It's not a VRAM error
anons you're killing me here
>>
>>106936260
Test it with --n-cpu-moe 99 and you'll see.
>>
>>106936265
>>106936260
Oh no, actually. I think the other anon is right.
Do you have both .gguf files in the same folder?
>>
>>106936272
I'm dumb as shit. I had the 00002 in a different folder under that one (idk why) I'll try it with them both in the same folder and report back later.
>>
>>106936281
also thanks, anons
>>
>>106936281
Once you get that working, go read about the -ngl and --n-cpu-moe arguments/parameters, otherwise you'll be running the model completely on your CPU.
You can launch llama-server with the -h argument to get a help explaining what each parameter does.
>>
>>106936295
no problemo bro
>>
>>106936298
>otherwise you'll be running the model completely on your CPU.
That's really bad. I want to use my 5090.
What do you recommend I try for a 5090 running GLM air 4.5 Q4?

"llama-server -m GLM-4.5-Air-Q4_K_M-00001-of-00002 -c 32768 -ngl 99"

How about that? also I assume if I'm using -ngl I don't also want to be using -n-cpu-moe?
>>
File: 1739602150295048.png (749 KB, 904x702)
nani.. (maybe someone less lazy than me can capture gif/vid)
https://xcancel.com/kiriTNS_mk/status/1979538163833221607
>>
File: 1732065050824882.jpg (690 KB, 2048x1152)
@grok is this real
>>
>>106936322
I don't think that's currently road legal.
>>
File: 1730547060590878.jpg (707 KB, 2048x1152)
>>106936322
one more
>>
>>106936318
>I assume if I'm using -ngl I don't also want to be using -n-cpu-moe?
Using just -ngl, you'll be trying to load the whole model into your VRAM, and that just won't fit.
What you do is use -ngl 99 to tell llama.cpp to load all layers of the model to your VRAM, then you use --n-cpu-moe 99 to tell llama.cpp to exclude the expert tensors from that, which are the bulk of the weights of the model.
Of course, with -ngl 99 + --n-cpu-moe 99, you'll have a lot of free VRAM, so you can lower --n-cpu-moe to put more and more of the model in your VRAM.
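As a starting point, something like this should work (a sketch, the --n-cpu-moe number is just a guess, tune it down until you're near full VRAM):

llama-server -m GLM-4.5-Air-Q4_K_M-00001-of-00002.gguf -c 32768 -ngl 99 --n-cpu-moe 45

Drop it a few layers at a time and watch VRAM usage between runs.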
>>
>5 million downloads and still no goofs
AIIIEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
>>
>>106936333
Miku is above the law.
>>
>>106936345
I'll give that a shot, thanks
>>
>>106931567
Good morning /lmg/. What have you been up to?
>>
>>106936432
preparing to wipe my win10 install with arch
just making sure to back up everything 1st
>>
>>106936443
I've always wondered what distro I would use if I ever had to main Linux. What's a good one for a relative beginner that only has experience with the Linux CLI but otherwise mostly uses Windows?
>>
>>106936449
i mean ubuntu with Dash to Panel is basically this
>>
File: 1746933648372488.png (1009 KB, 2359x1749)
>>106936449
huh. Hard for me to answer since I was a beginner 25 years ago. Maybe just dive into the deep end with arch, or try ubuntu for 3 to 6 months, then after reinstall with arch with full knowledge that ubuntu is gay and is just an easy stepping stone.
>>
>>106936098
>clean
>slop
could you explain again but in proper english this time
>>
>>106936476
proper english is obsolete
>>
File: 1751501012438816.png (10 KB, 1083x140)
>pay for a year of claude opus
>they change the deal and fuck over your convo limits so even a few casual convos set you over
OWARI DA...
>>
>>106936512
get local
>>
>>106936512
>>>/aicg/
>>
File: 1737045152592995.png (104 KB, 271x238)
>>106936515
>>106936516
there's nowhere to talk about cloud models and /aicg/ are a bunch of sex crazed degenerates
>>
File: 1734479315727818.png (367 KB, 1080x2364)
>>106936432
>>
>>106936523
so then dont talk about cloud models. they suck.
>>
File: 1758474429804075.png (376 KB, 1080x2364)
>>106936523
You could always try the /vg/ general. What are you talking about? So much in depth that you're reaching the limits so quickly anyway?

>>106936528
>>
>>106936523
stay in your lane lil bro
>>
>>106936531
The models are better than anything we have locally, it's everything else about their usage that sucks.
>>
>>106936541
GLM 4.6 local mogs everything
>>
>>106936544
>mogs
opinion disregarded
>>
File: no.gif (2.06 MB, 638x266)
>>106936547
no u
>>
>>106936538
the prob is that "so much in depth" was drastically limited because anthropic wants to serve sonnet, not opus. they fucked over opus users. i guess this could be an argument for local since you're not at a random corpo's whims but in honesty local just doesn't measure up
>>
>>106936523
>and /aicg/ are a bunch of sex crazed degenerates
Where do you think you are?
>>
>>106936578
we here at lmg only utilize LLMs for productivity adjacent purposes
>>
Honest question, why do people bother with all the command line fucking around with llama when kobold is far simpler and does the same thing with a GUI?
Just use fucking kobold?
>>
>>106936607
To confuse dumb anons.
>>
>>106936603
productivity of semen
>>
>>106936607
it really depends on whether you want advanced features or not, like with vllm you can get parallelism, which can lead to much higher cumulative tokens per second if you are doing things like coding autocomplete.
Also, creating a script to run an LLM instead of opening a GUI can be easier for some people, and also helpful for developers when creating an application which uses an LLM.
>>
>>106936443
>>106936449
You can do gpu pass through from a Linux host to a Windows VM if you actually want windows
>>
File: 1731004385497993.png (254 KB, 900x806)
>>106936540
>>106936523
>>106936512
seriously though, if i want to complain about new claude limits is that just impossible?
>>
>>106936715
yea
>>
>>106936715
the bitching about anthropic limits is unbearable. unsubscribe and pay for it via API and you can use it as much as you like. oh, too expensive? now you understand
>>
File: 1742623164412708.png (142 KB, 934x876)
>>106936749
i mean.. can't i have venture capitalists subsidize my nothing convos about anime? that was comfy
>>
File: b8.png (203 KB, 1631x1718)
>>106936761
Well, there's an alternative. I can tell you how to get banned and get a full refund. Would you prefer that?
>>
>>106935639
What you perceive as size-related intelligence in smut scenes is the result of larger models knowing more _even after training data filtering_. A smaller model mid-trained/continually-pretrained on lightly filtered RP-relevant data instead of math reasoning and coding would perform better and not get confused like that.
>>
>>106934669
I swear Gemma-3-27B becomes a 13B model when it engages "RP mode", and probably around a 6B model during ERP.
>>
>>106936858
models all fall off past a certain context, probably even more so when in certain tasks it was not very well trained in
>>
>>106936607
Because I only need to type the command once and then I can run it whenever I want? The real reason to use kobold is for features that llama.cpp doesn't have, like phrase banning
>>
>>106936432
Thinking about all the fun to be had using local models, without using local models.
>>
>>106932433
>He also can't simply take llama.cpp and relicense it.
look at koboldcpp, it's AGPL
original code is still MIT, and an MIT project can be relicensed under a more restrictive license. it's GNU compatible
>>106932395
i know there were tons of issues, not all too sure about support. but if loras were properly supported, we'd definitely see more loras being uploaded instead of merges
>>
I did some analysis of DeepSeek V3.1 Terminus on artificial test cases to try to tease out ideal sampler settings, and I found the official suggestion of top-p=0.95 isn't an arbitrary "IDK lol this sounds good" choice but actually was a pretty good cutoff point in the cases where I added up token probabilities. Since then I've been running with top-p 0.95 and temperature 0.8, the temperature choice being more arbitrary -- what settings have others found enjoyable?
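For llama.cpp that's just (a sketch, the filename is a placeholder):

llama-server -m DeepSeek-V3.1-Terminus-Q4_K_M.gguf --temp 0.8 --top-p 0.95

Same values go straight into the sampler fields if you front it with SillyTavern instead.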
>>
>>106936782
Based
>>106936749
This is the correct answer
>>
glm 4.6 at q8 or deepseek 3.1 at q4?
>>
>>106937152
glm duh
>>
>>106935129
Ok, here it is, sorry for the delay.
The LoRa was trained with (more or less) the axolotl example config for this model except using a rank of 128.
The lora to gguf script works fine.
I then run the llama server with this command:
./llama.cpp/build/bin/llama-server -c 60000 --port 8001 -m ./data/huggingface-cache/hub/models--MaziyarPanahi--Meta-Llama-3.1-405B-Instruct-GGUF/snapshots/85b9bd67025a43
37e9694ec0edaf46437fe6283b/Meta-Llama-3.1-405B-Instruct.Q3_K_S.gguf-00001-of-00009.gguf --lora /workspace/llama405b-outputs/outputs/out/qlora-llama3_1-405b/checkpoint-60/checkpoint-60-F16-LoRA.gguf

This is the error it throws:

llama_adapter_lora_init_impl: - kv 11: general.quantization_version u32 = 2
llama_adapter_lora_init: failed to apply lora adapter: tensor 'blk.0.attn_k.weight' has incorrect shape (hint: maybe wrong base model?)
common_init_from_params: failed to apply lora adapter '/workspace/llama405b-outputs/outputs/out/qlora-llama3_1-405b/checkpoint-60/checkpoint-60-F16-LoRA.gguf'
srv load_model: failed to load model, './data/huggingface-cache/hub/models--MaziyarPanahi--Meta-Llama-3.1-405B-Instruct-GGUF/snapshots/85b9bd67025a4337e9694ec0edaf46437fe6283b/Meta-Llama-3.1-405B-Instruct.Q3_K_S.gguf-00001-of-00009.gguf'
srv operator(): operator(): cleaning up before exit...
main: exiting due to model loading error


>>106936985
It's a shame because merging the weights is a pain in the ass and takes a ton of disk space.
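If anyone wants to check the adapter themselves, the gguf Python package from llama.cpp's gguf-py should be able to dump the tensor list (assuming it's installed and exposes the gguf-dump script):

gguf-dump /workspace/llama405b-outputs/outputs/out/qlora-llama3_1-405b/checkpoint-60/checkpoint-60-F16-LoRA.gguf

That prints every tensor name with its shape, so the lora_a/lora_b ranks are visible directly.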
>>
I have never understood the furor around drummer's slops, the only thing he did that was decent was Unslop Nemo
>>
>>106937178
They started as low-tier, disgustingly-named models memed to popularity just enough that the guy in question saw a money-making opportunity.
>>
>>106937231
money-making how, what's his endgame? put the goofs behind a paywall?
or is the idea to just get a big following on twitter and become an Accomplished AI Evangelist and one day get a job
>>
>>106937292
He literally did this weird model that spits all sorts of random brand names at you, what the fuck was that about. You can really tell the guy is thinking really hard how to sell this crap
>>
File: open4work.png (16 KB, 472x119)
>>106937292
Like I said in the past before he even set up a Patreon / Ko-fi account: to saturate the space with his shit, no matter how good or bad, to get recognition and eventually hired for work in an actual company.

That's pajeet-tier behavior that retards keep rewarding and so we'll keep seeing more of it until he's accomplished his goals.
>>
>>106937169
This is the shape of the tensors in the LoRa GGUF:
Tensor Name: blk.0.attn_k.weight.lora_a
Dimensions: [16384 128]
Data Type: 0
Tensor Name: blk.0.attn_k.weight.lora_b
Dimensions: [ 128 1024]

But the model's blk.0.attn_k has the shape
blk.0.attn_k.weight     [16384, 2048]

So I'm not sure why lora_b shape doesn't match the shape of the attn_k matrix, but that seems to be the problem.
The curious thing is that the model merged seemingly without errors when using axolotl (haven't checked the generation quality yet).
>>
I just can't stop thinking about TheDrummer.
Like. I was just sitting at my computer, right? And all of a sudden. BAM... TheDrummer. So I had to post it. I had to tell someone.
I just can't stop thinking about him.
>>
On the other hand a large LoRa like that would consume a large amount of memory, so maybe it's a blessing in disguise.
>>
talking about drummer is now obvious bait
don't know how you fucks keep falling for this shit
>>
Should you finetune model on SDK code and maybe some examples you are using or will it just mess it up?
>>
>>106937313
Are you kidding? That's my favorite of his models and probably the one I've wasted the most time chatting to. I show it to everyone IRL and host it locally for people when they ask "How are the AI companies gonna make money when VC dries up?". It was inspired by a Black Mirror episode by the same name.
>>
File: file.png (2 KB, 106x150)
hello

first time trying MoE model, do i need to set up something specific in openwebui model settings to make it work or it just works?

i have 96gb vram across 6 cards, bigger models will get spread out on cards and part of a model gets used accordingly
>>
Also, --lora-scaled works perfectly in llama.cpp + Cuda, I use it all the time. I had issues with it a while back using AMD/Vulkan.
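For anyone who hasn't used it, the invocation is roughly (a sketch, scale value picked arbitrarily):

llama-server -m base-model.gguf --lora-scaled adapter.gguf 0.8

i.e. the adapter gets applied at 80% strength on top of the base GGUF instead of the full 1.0 you get with plain --lora.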
>>
>>106937519
Please stop wasting your time on this rock doing this shit.
We have a planet to save.
>>
File: file.png (3 KB, 145x151)
>>106937527
i have house to heat and the energy this machine uses is converted to heat, which is not going to waste, the hardware i'm using is second hand and as cheap as it can be
>>
>>106937513
Should I do a 24B version of Rivermind?

>>106937231
Yeah, good times. I think it was Smegmma 9B that made me realize it wasn't going to work in the long-run. Lots of people opted out because of the name.

Also, Lemmy (creator of Celeste) told me to stop fucking around if I wanted providers to host my models. That's when I decided to go with sci-fi, starting with Expanse ships, and released Rocinante as my first serious take.

>>106937392
>>106937404
I don't see how I'm causing harm to anyone, so I can't take them seriously. No idea why I get to live in their heads rent-free.
>>
>>106937610
>Should I do a 24B version of Rivermind?
yes
>>
>>106930625
DDR6 is coming out next year or early 2027. DDR4 is no longer being made, and as it stands right now the only RAM really being sold on the consumer market is DDR5, rather than both DDR4 and DDR5.
>>
>>106937610
Drummer, please figure out why your models write for {{user}} so often before making any more
Your cydonias are especially bad in this regard, every one that came after v2g.
>>
>>106937877
I think it might have to do with it being a small model that is being hit by a grift hammer for no good reason.
>>
>>106937888
Older models like Rocinante and Unslop were fine, and no worse than the original model when it comes to imitation
>>
i tried ling sex but i think i fucked up the template. ill try again later
>>
>>106937877
>why your models write for {{user}} so often before making any more
I don't use his finetunes/merges but this tends to happen when there's a fuckfest of merges in one model.
>>
>>106937950
>no worse
no better either
>>
>>106937951
If a 1T model can get fucked up by a template then it is bad. Maybe not as a whole, but that probably means that the sex stuff is a very small domain of the training and it can't generalize it at all.
>>
File: 111.jpg (62 KB, 772x448)
Holy shit, Chinese regulations are cuckolded like crazy. No wonder quality is going down so much.
>>
>>106938060
Why do they care about IP infringements and woke shit? I don't understand.
>>
>>106938060
China is a socialist country, and wokeness is a socialist concept. It's just that LLMs were a new thing, and the CCP didn't catch up fast enough.
>>
>>106938079
They're copypasting burgers regulation since they're trying to market their models there
>>
>>106937989
less censored though
>>
>>106937610
>Should I do a 24B version of Rivermind?

I mean I'd like one (not the Lux version) yeah! But I'm not sure if it'd be popular given what the guy I was replying to said.

I like how it chooses relevant brands/products to shill based on the topic.

> good times

I remember laughing at the Moistral model card, I can't remember what you wrote but it was something like "turn this into a moistral masterpiece". And I saw a HN comment like "Moistral? There's no way that can be real...".

Good times indeed!
>>
gmma soon
>>
>>106938113
>less censored
>than nemo
What the fuck are you on about?
>>
>>106936320
Not good for aerodynamics but I like the fluttering hair
>>
>>106931969
>suddenly a lot of (you)'s
hmm
>>
>>106937102
Just the default of top-p 0.95 and temp 0.6. The responses I get on rerolls are varied enough that I've never felt the urge to mess with the temperature.
>>
>mmmmm we can't just send http requests, we need 5 wrapper libraries and a 2 typing libraries
>all libraries are incompatible with each other two releases later
what's this programming paradigm called?
>>
>>106938305
"What in tard-ation are ya doin' stupid vibe coders?"
>>
>>106938305
>things that didn't happen
leave the hallucinations to the llm bro
>>
Does llama.cpp support Deepseek V3.2 and its meme sparse attention yet? I remember trying it when it came out over the API, going 'yeah, this is slightly better than V3.1(-Terminus) before forgetting about it the moment GLM4.6 released like two days later.
>>
>>106938899
>this is slightly better than V3.1(-Terminus) before forgetting about it the moment GLM4.6 released like two days later
The CohereLabs/c4ai-command-a-03-2025 moment of the whale...
>>
anything better than rocinante yet?

my 8ball from walmart says: no, not for local
>>
>>106938986
Nemo (not the drummer shittune) will probably remain the king of the era of undertrained > safe unusable LLMs. Luckily we are in a new era started by glmsex.
>>
>>106931969
>Offering free models is a scam
This general is so fucking retarded
>>
>>106939081
If they were so good, he wouldn't need to release a new one every few days, piggybacking off new models from legitimate AI labs.
>>
There's no way the haters don't know they're contributing to keeping TheDrummer in everyone's mind.
They secretly like him. They want to hug him and give him kisses and walk hand in hand with him for the whole world to see.
They got TheDrummy issues.
>>
>>106931567
#NotMyMiku
>>
>>106939149
they can't get the taste of his drummies out of their mouth
>>
>>106939149
HuggingFace is fed up with him too.
>>
>>106939149
Put drummer derangement syndrome in your next card. It is a collective activity of calling a faggot a faggot. If a newfag enters a thread right now they aren't going to download a tune of someone who is being called a faggot by multiple posters.
>>
File: disgusting.jpg (335 KB, 3840x2160)
In civilized and developed Japan, when a daughter brings her chosen one to meet her parents, they give him a test. They go through his models and check the quants. If the boy doesn't have at least q4, it's clear that he comes from a dysfunctional family. The test is 100% accurate, and even the WHO and the UN have acknowledged that families dominated by alcoholism, drug addiction, and incest always show a preference to q3 and lower quants. That is why I'm not surprised that most of you poorfags think that output from these retarded quants is acceptable. You were eating shit the whole life and I pity you, but have some honor and human dignity to not consume fecals when we are filling our palates with higher quants, it's a pleasure only for sophisticated gourmets. At the sight of f16 precision output, you would probably jump and scream like monkeys in a zoo. You dirty, muddy swine.
>>
Seems like this thread has stagnated BADLY.
>>
>>106939313
It's a thread about LLMs after all
>>
>>106939292
>q4
quantlet projection
>>
>>106939292
erm, according to this copetest there is barely any difference between q6, q8 and f16
>>
>>106939384
Why, don't you use your models with greedy decoding?
>>
f32
fa off
ngl 999
ncmoe 0
swa-full
ctx-size 0
>>
oh no no no no
https://xcancel.com/ylecun/status/1979595060447416733
>>
>>106939323
Does this mean we have Cold AI Winter #2?
>>
>>106939477
another one
https://xcancel.com/ylecun/status/1979596956277289353#m
yann buried llms
>>
>>106939313
zuck will develop agi and save it bigly
>>
File: 638er.png (486 KB, 995x103)
>>106936345
>>106936299
>>106936298
Update on this. I have it working now correctly using the following parameters and it uses up 100% of my 5090's VRAM. So that's good, however I feel this can be tweaked further for better performance? Any suggestions on how to modify this command to get more out of it?

Like I see others running shit like "--ctx-size 40000 --flash-attn --temp 0.6 --top-p 0.95 --n-cpu-moe 41 --n-gpu-layers 999 --alias llama --no-mmap --jinja --chat-template-file GLM-4.5.jinja --verbose-prompt"
(just as a random example)
>>
File: file.png (102 KB, 668x501)
threadly reminder this faggot browses /lmg/
>>
File: settingstavern.png (606 KB, 262x1879)
Gave sillytavern a shot with llama (also tried kobold because why not) and I have the same recurring issue with it. It keeps giving me massive blog post replies.
I can literally say something as simple as "I look around the room for a light" and it gives me a massive essay reply with many actions and events happening rather than a shorter reply that relates to what I said.

I assume this has to do with my settings as I have not tweaked them much. What should I change these settings to? I assume the tokens are wrong as well
>>
>>106939535
Flash attention is probably already on by default now, but you can add -fa 1 if you want to make sure that it's on.

>Like I see others running shit like
Read the llama-server help output and you'll understand that most of those don't really apply for your case.
>>
>>106939598
Usually the reply length is a function of two things :
1. The model's training.
2. Tour promot (system prompt + character card + examples + first message + etc etc).
Usually the second one will have the most impact.
Try tweaking the first message to be brief, see how much that impact things.
>>
>>106939599
I did read through the llama-server -h last night when it was suggested to me, and while most of it made sense I wasn't 100% sure what I could add that would help improve the parameters for the 5090.
Also yeah flash attention is on by default that's why I didn't add -fa to it.

So just stick with: "llama-server -m GLM-4.5-Air-Q4_K_M-00001-of-00002.gguf -c 32768 --n-cpu-moe 33 --n-gpu-layers 99" ?
does that seem fine or anything I should be adding?
>>
>>106939630
I think you are good yeah.
What speeds are you getting?
>>
>>106939645
>What speeds are you getting?
where does it show the speed? I looked through command prompt when it genned and didn't see anything
>>
>>106939607
what about the tokens?
>>
How much does the structure of your character cards matter for any given model?
If it does, what's the optimal format for GLM?
>>
>>106939607
>Tour promot
Holy fuck. I'm not even mobile posting.
What the hell.
>Your prompt*

>>106939656
See pic.
The part where it says
>prompt eval time = 1158.66 ms / 103 tokens ( 11.25 ms per token, 88.90 tokens per second)
> eval time = 152954.13 ms / 2859 tokens ( 53.50 ms per token, 18.69 tokens per second)
> total time = 154112.79 ms / 2962 tokens

>>106939667
You mean Tokens (Response)?
That doesn't control what the model wants to write, just where the generation will cutoff.
So if you set it to 120 tokens and the model wants to spew 1024 tokens, it'll just cut the text before it's done.

>>106939710
Now that's one hell of a question.
>>
File: this stuff.png (52 KB, 960x716)
>>106939712
And of course I forgot the image.
>pic related
>>
>>106939712
>>106939716
Do note that you want to get these values on a larger prompt, otherwise the prompt eval time will be meaningless, like in my pic.
>>
>>106939712
wats rest of ur rig's specs?
glad u stopped namefagging btw
>>
>>106939733
I never namefagged in my 17 years of 4chan.
Ever.
It's a gaymer notebook with 8gb of vram.
That screenshot is of qwen 30B A3B.
>>
>>106939755
wtf dont u have a 5090 and arent u running glm air???
glad you never namefagged in your 17 years of 4can
>>
>>106939712
>See pic.
>The part where it says
I'll check when i get home from work. Thanks for the advice.

Let's say the gen speed is a bit slower than I'd like, what would I do to the parameters to help speed it up? lower the 32k tokens?
>>
>>106939774
Different guy.
I'm the one giving advice to the dude with a 5090 (>>106939775).
>>
>>106939712
>Holy fuck. I'm not even mobile posting.
Maybe I'm just paranoid and should proofread better, but sometimes I swear the site changes or deletes words after I hit post.
>>
>>106939710
The vocab choices have more impact than structure.
Call your npc a tsundere and you need zero else on personality for example.
>>
>>106939826
that might be true, you're on windows
>>
>>106939598
>Always respond in 1-2 short paragraphs. Limit {{char}}'s response to less than 200 tokens unless specifically asked to provide a long answer.
If it doesn't understand instructions then it's a bad model.
>>
>>106939846
>Limit {{char}}'s response to less than 200 tokens unless specifically asked to provide a long answer.
I don't think models are able to count their own tokens, nor are they trained to correlate tokens to sentence length or whatever, but I guess it can serve as a heuristic to "short responses".
>>
What causes the AI to get stuck in 'a loop'? I'm using TheDrummer_Cydonia-R1-24B-v4-Q4_K_M and it was fine for a bit, but now the AI just keeps repeating it's last message ad verbatim, and then parrots word for word the last dialogue it gave.
>>
>>106939981
Bad model + bad settings, usually.
First thing is to check if your samplers aren't truncating the token pool too much or making the top tokens too likely.
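As a rough first pass (sampler names as they appear in SillyTavern, values are just a conservative guess, not gospel): Temperature 1.0, Min P 0.05, Repetition Penalty 1.05-1.1, everything else neutralized. If the looping stops, tighten things back up from there.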
>>
What's the chuddiest model without prompt massaging it?
>>
>>106940021
'toss
>>
>>106940021
Llama 2 with the wrong chat template.
>>
>>106940020
I'm using the kobold (godlike) preset, and I made sure to click the button at the bottom to load default order. I'm real stupid and new when it comes to this, I have no idea what I'm doing. The character card worked fine before, and I had way longer chats with it that didn't break down and loop like this.
>>
>>106940061
>I'm using the kobold (godlike) preset,
Yeah, don't do that.
I don't know how that one looks specifically, but those presets were always voodoo and were created a long, long time ago.
Try resetting your samplers and using something mild and more "default" like Temp 0.75, TopK 100, TopP 0.95.
See what that does.
Also, if you are quanting the cache, try disabling that, see if that helps at all.
>>
>>106939920
You sound retarded, retard.
>>
>>106940094
I would if I knew what the fuck most of that even meant, lol. But I'll try messing with the settings and try it again. Should I ignore the sillytavern thing about it working better on text completion with the koboldAPI and go back to using the KoboldAPI Classic? Or is there a place to get premade settings, or is this on a per-system kind of thing?
>>
File: 1749471529130802.png (94 KB, 1898x259)
>>106931567
Alright. Finally quantized and uploaded one of my side projects.

https://huggingface.co/AiAF/rp-sft-merged_1000_GGUF

What kind of prompts should I use to test it?
>>
>>106940264
you should consider licensing it under AGPLv3 or cc-by-nc-4.0 so kikes cant steal your models
downloading it rn, can you tell us what instruct template you used for this? or is this just a completion model liek llama 1
>>
>>106940264
>What kind of prompts should I use to test it?
Nala according to the instructions in
>https://justpaste dot it/GreedyNalaTests
>>
>>106940288
<start_of_turn>user [your prompt goes here] <end_of_turn> <start_of_turn>model [model response]


Idk why anyone would bother tuning a model to be a completion model only lol. The only decent use for base / completion models that I know of is if you have a RAG setup using a model that is good at doing tasks based on the prompt it received

>>106940307
Damn, I forgot that page existed. I'll test it with that
>>
File: topkek.png (327 KB, 1309x1120)
>>106940159
>>106940061
>>106939981
So I broke it free of its loop by fucking with its settings some and then having my character forcefully exit the scene, but then the AI went full batshit and fucking ended the RP on me, lmao.
>>
>>106940400
>"REDMPTION"
lol
>>
File: mysides.png (156 KB, 1271x480)
>>106940439
>>106940400
Oh my god I somehow made it worse. I loaded up a mistral 7b tekken preset or whatever, since apparently that would work for the model, changed the XTC settings to .1 and .5, and then the fucking thing started giving me OOC thoughts and comments, even though it never did that before. I had previously used 'ooc: whatever' before just fine without this happening. Pretty fucking funny, though.
>>
File: freshair.mp4 (951 KB, 1280x720)
>>106936320
https://files.catbox.moe/bjswrg.mp4
>>
>>106940307
user
"ahhh ahhh mistress"

model
*She rubs her paws along your thighs and stomach as she starts to undo your pants.* "That's right, little one, tell your mistress how much you want it." [end of text]

>>
File: offwithyerdick.png (188 KB, 1299x599)
>>106940674
Welp. I guess this is what happens when you force an AI to continue after having a meltdown and trying to end the RP. Nice to know the model isn't instant horny, but RIP to the character's dick.
>>
>>106940742
Hey.
Paw, called anon little one, ran with the mistress thing in context correctly.
You know what? I've seen worse.
How does it respond if you turn the
>"ahhh ahhh mistress"
into a whole ass descriptive paragraph?
Does it still respond with a short sentence or does it follow your input?
>>
File: rp-2b.png (21 KB, 975x277)
>>106940377
with the spaces? what about sysprompt?
>>
>>106940821
>>106940821
>>106940821
>>
>>106940768
I used link rel as a completion test:

https://files.catbox.moe/q768fb.txt

And used the following command via llama.cpp

./build/bin/llama-cli -m ./rp-sft-merged_1000-f16.gguf -f Nala-Test_Gemma2.txt
>>
>>106940820
I forgot to mention the chat template is Gemma. I have the same issue where if I forget to include the
--chat-template gemma
flag then the model would immediately start talking about random shit ad infinitum, because llama-cli by default expects your prompts to be in the prompt format the model expects. Using that flag fixed the issue. So maybe you need to tell your web UI / inference engine to use that prompt template
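So something along these lines (a sketch, same files as above):

./build/bin/llama-cli -m ./rp-sft-merged_1000-f16.gguf --chat-template gemma -cnv

-cnv puts llama-cli in conversation mode so your input gets wrapped in the Gemma turn tokens instead of being treated as raw completion text.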
>>
>>106941440
>the chat template is Gemma
>>106940377
>
<start_of_turn>user [your prompt goes here] <end_of_turn> <start_of_turn>model [model response]

And then we have this
>https://ai.google.dev/gemma/docs/core/prompt-structure
Every.... Fucking... time...
>>
>>106941492
I don't see what the issue is.
>>
>>106941503
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>>
>>106941503
<start_of_turn>user
knock knock<end_of_turn>
<start_of_turn>model
who is there<end_of_turn>
<start_of_turn>user
Gemma<end_of_turn>
<start_of_turn>model
Gemma who?<end_of_turn>

He used spaces instead of newlines.
>>
>>106941521
Numb nuts, explain what the issue is instead of looking for an excuse to argue with people. You aren't even using the model. So what are you complaining about?
>>
>>106941537
Did you read the text file it uploaded?

https://files.catbox.moe/q768fb.txt

The example written here >>106940864 was just written my me on the fly as a rough explanation as to how you're supposed to format your prompts
>>
>>106941553
>explain what the issue is
You really cannot see the issue? Like all the other times?
>>
>>106941588
See >>106941580

Even if that prompt template was formatted as badly as you say it is (it isn't, you didn't bother to read the file), that would not be causing the engine to spam text indefinitely.
>>
>>106939493
>GPTards
Yann nooooo!!!! I am the autist here who has no friends and even I know not to say that cause it can be misconstrued! YAANNN!!!!


