/g/ - Technology


Thread archived.
You cannot reply anymore.




File: 1773742269765363.png (1.58 MB, 768x1360)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108896570 & >>108887863

►News
>(05/21) Hy-MT2 “fast-thinking” multilingual translation models released: https://hf.co/collections/tencent/hy-mt2
>(05/20) Cohere releases Command A+ 218B-A25B: https://cohere.com/blog/command-a-plus
>(05/16) llama + spec: MTP Support #22673 merged: https://github.com/ggml-org/llama.cpp/pull/22673
>(05/08) KSA-4B-base released: https://hf.co/OpenOneRec/KSA-4B-base
>(05/07) model: Add Mimo v2.5 model support (#22493) merged: https://github.com/ggml-org/llama.cpp/pull/22493

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: threadrecap.png (1.48 MB, 1536x1536)
►Recent Highlights from the Previous Thread: >>108896570

--GPU price inflation and the feasibility of LLM card games:
>108896624 >108896738 >108896772 >108896828 >108896911 >108896957 >108896966 >108896985 >108897015 >108897025 >108897246 >108897283 >108899578 >108897248 >108897735 >108897785 >108899472
--Reaction to news regarding guardrail removal tools for Llama and Gemma:
>108902775 >108902780 >108902790 >108903093 >108903115 >108902850 >108902799 >108902880 >108902926 >108902833 >108902842 >108902865 >108902934 >108902989 >108902999
--Utility and limitations of small models for specialized automation tasks:
>108899469 >108899480 >108899588 >108899611 >108899640 >108899691 >108899780 >108899906 >108899933 >108899989
--Questioning Gemma-4 reasoning dataset authenticity and testing system prompt leaks:
>108902145 >108902193 >108902365 >108902531 >108902827 >108903124
--Anon shares LLM harness and demo for playing MTG:
>108897677
--Using LLMs as decision engines within scripted game frameworks:
>108897375 >108897388 >108897404 >108897427 >108897468 >108897480 >108897507 >108897518 >108897536 >108897565 >108897582
--Anon shares results of Aphex Twin LoRA for Stable Audio 3:
>108901655 >108901726 >108901755 >108901779 >108901828 >108901863
--Qwen 3.7 Max hallucinating Indonesian knowledge base via proxy access:
>108900661 >108900694 >108900725 >108900735 >108900754 >108900783
--DeepSeek vision mode rollout and Instant model performance benchmarks:
>108901994 >108902000 >108902004 >108902053
--vLLM performance benchmarks for Gemma-4-31B-it using FP8 and MTPat:
>108899746
--Comparing 5060 Ti and 9060XT as MI50 GPU replacements:
>108897921 >108898008
--Logs:
>108899667 >108900661 >108900725 >108902053 >108902531 >108902827 >108903124
--Miku, Teto, Uta (free space):
>108896806 >108898823 >108899714 >108900041

►Recent Highlight Posts from the Previous Thread: >>108896830

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Dead general, local is done, it's over
>>
>>108903399
no it's not. sex with miku btw
>>
File: daredevil.png (189 KB, 792x1209)
>>
You are hiding heretic models under the floorboards.
>>
I downloaded and built beellama to try dflash, and not only does it not make anything faster, it's about 3 times slower. Anyone got it to work properly?

build:
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
cmake --build build -j


run:
LD_LIBRARY_PATH=$PWD/build/ggml:$PWD/build/src build/bin/llama-server \
  -m "/mnt/ssd0/models/unsloth-gemma-4-31B-it-UD-Q8_K_XL.gguf" \
  --mmproj "/home/andrey/llamacpp-launcher/mmproj/gemma-4-31B-mmproj-BF16.gguf" \
  --spec-draft-model "/mnt/ssd0/models/Anbeeld-gemma4-31b-it-dflash-Q6_K.gguf" \
  --spec-type dflash \
  --spec-dflash-cross-ctx 1024 \
  --port 8080 --host 0.0.0.0 \
  -np 1 \
  --kv-unified \
  -ngl all \
  --spec-draft-ngl all \
  -b 2048 -ub 512 \
  --ctx-size 102400 \
  --cache-type-k q5_0 --cache-type-v q4_1 \
  --flash-attn on \
  --cache-ram 0 \
  --jinja \
  --no-mmap --mlock \
  --no-host \
  --reasoning off \
  --temp 1.0 --top-k 64 --top-p 0.95 --min-p 0.0


I also get a torrent of those in console when generating:

decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
dflash: drafter decode failed with -1
init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
- the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 5
- the tokens for sequence 0 in the input batch have a starting position of Y = 57
it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
>>
File: jimmy.png (124 KB, 792x778)
>>108903454
Thanks Andrey, saved me the hassle of trying it.
But maybe try loading the latest chat template from file
>>
>>108903454
wait for official support instead of using some vibecoded fork
>>
>>108903509
You're not really expecting the chat template to have an effect on token generation speed, are you? Trying to get into your own screenshot or something, anon?
>>
>>108903454
It made me lose 2 tps on Gemma 4 with the exact same configs as the guys.
>>
>>108903531
Don't tell me what to do bro

I'm downloading gigaquanted models as they are suggesting to make it run on a single GPU like in their guide, maybe that'll help.
>>
we need eagle3/dflash models that are made to predict rp content and not code
>>
File: 1753443217703508.png (2.12 MB, 1024x1024)
>>
>>108903381
new meme architecture just dropped
https://github.com/sapientinc/HRM-Text
https://www.youtube.com/watch?v=U6K2MP6VseM
>>
>>108903600
Then you'd complain about Elara.
>>
File: .png (762 KB, 1000x563)
>>108903620
>5. Export to Transformers Format
>>
>>108903536
>>108903509
I actually retract that, jinja absolutely can affect generation speed greatly, but I tested it with proper jinja and also in text completion mode in silly.
>>
>>108903620
>May 18th
old news
>>
>>108903620
It's mostly the result of the data used. The model was entirely pretrained on instruction-response pairs, with the loss calculated just on the response.
>>
is Gemma MTP supported on any llama.cpp fork yet? I'm tired of 5t/s chats
>>
>>108903660
>12,609t/s
wtf
>>
>>108903735
imagine all the slop that could produce
>>
File: firefox_EXnvPwwb3U.png (70 KB, 827x1095)
so running gemma on just one GPU with beellama works without any garbage messages in console, but I get just 35 t/s:

prompt eval time =     598.81 ms /   355 tokens (    1.69 ms per token,   592.85 tokens per second)
eval time = 8489.68 ms / 279 tokens ( 30.43 ms per token, 32.86 tokens per second)
total time = 9088.49 ms / 634 tokens
draft acceptance rate = 0.33051 ( 156 accepted / 472 generated)
adaptive dm: fringe=0.00 n_max=3
statistics dflash: #calls(b,g,a) = 1 121 89, #gen drafts = 121, #acc drafts = 89, #gen tokens = 472, #acc tokens = 156, dur(b,g,a) = 0.003, 754.019, 0.010 ms
slot release: id 0 | task 0 | stop processing: n_tokens = 635, truncated = 0
srv update_slots: all slots are idle

prompt eval time = 259.05 ms / 15 tokens ( 17.27 ms per token, 57.90 tokens per second)
eval time = 11304.23 ms / 411 tokens ( 27.50 ms per token, 36.36 tokens per second)
total time = 11563.27 ms / 426 tokens
draft acceptance rate = 0.09738 ( 223 accepted / 2290 generated)
adaptive dm: fringe=0.00 n_max=12
statistics dflash: #calls(b,g,a) = 3 318 215, #gen drafts = 318, #acc drafts = 215, #gen tokens = 2893, #acc tokens = 398, dur(b,g,a) = 0.004, 2055.248, 0.034 ms
slot release: id 0 | task 141 | stop processing: n_tokens = 715, truncated = 0
srv update_slots: all slots are idle


On vanilla llama.cpp I get 45t/s, with 3 GPUs, twice as big quant and fp16 cache:
prompt eval time =     657.57 ms /   304 tokens (    2.16 ms per token,   462.31 tokens per second)
eval time = 8437.75 ms / 357 tokens ( 23.64 ms per token, 42.31 tokens per second)
total time = 9095.32 ms / 661 tokens
slot release: id 15 | task 0 | stop processing: n_tokens = 660, truncated = 0
srv update_slots: all slots are idle
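
For anyone comparing runs like these, the numbers in the timing lines can be recomputed from the raw values instead of eyeballed. Below is a minimal sketch, assuming the log lines keep the shape pasted above; the regexes and the `summarize` helper are made up for illustration and are not part of llama.cpp or beellama. The sample text is the first beellama run from this post.

```python
import re

# Sample copied from the first beellama run above.
LOG = """\
prompt eval time =     598.81 ms /   355 tokens (    1.69 ms per token,   592.85 tokens per second)
eval time = 8489.68 ms / 279 tokens ( 30.43 ms per token, 32.86 tokens per second)
total time = 9088.49 ms / 634 tokens
draft acceptance rate = 0.33051 ( 156 accepted / 472 generated)
"""

# Hypothetical patterns: they only assume the "<name> time = X ms / N tokens" shape.
TIMING_RE = re.compile(
    r"^\s*(prompt eval|eval|total) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens",
    re.MULTILINE,
)
ACCEPT_RE = re.compile(r"(\d+) accepted\s*/\s*(\d+) generated")


def summarize(log):
    """Return tokens/second per phase plus the draft acceptance ratio, if present."""
    stats = {}
    for match in TIMING_RE.finditer(log):
        name = match.group(1)
        millis = float(match.group(2))
        n_tokens = int(match.group(3))
        stats[name] = n_tokens / (millis / 1000.0)  # tokens per second
    accepted = ACCEPT_RE.search(log)
    if accepted:
        stats["acceptance"] = int(accepted.group(1)) / int(accepted.group(2))
    return stats


if __name__ == "__main__":
    for key, value in summarize(LOG).items():
        print(f"{key}: {value:.5f}")
```

With only about a third of drafted tokens accepted in that run, most of the drafter's work is thrown away, which is consistent with speculative decoding not paying off here.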
>>
What's the tk/s on a 3090 with dense gemmy and qwen? Is it worth swapping from a 4070 to a 3090?
>>
File: I can take this, right.jpg (199 KB, 1024x1024)
>>
>>
>>108903821
>>108903829
disgusting bags of fat
>>
>>108903821
werkflow pls
>>
File: 1749101818866744.png (89 KB, 325x280)
>>108903829
*pop*
>>
>>108903850
Extensive inpainting and manual retouching in Krita AI Diffusion, probably.
>>
>>108903840
This is where she hides the extra context.
>>
>>108903829
The puffy nips are cool but ew.
>>
>>108903660
>I actually retract that, jinja absolutely can affect generation speed greatly, but I tested it with proper jinja and also in text completion mode in silly.
lol, if I used Kimi with that prompt now, you would probably be in there.
yeah jinja issues can mess with mtp and cause cache invalidation.
for Silly text completions, make sure you're not requesting logprobs
>>
>>108903381
You keep forgetting to update the card I got you bro.
►Official updated 2.0 /lmg/ card: https://files.catbox.moe/ylb0hv.png
>>
>>108903871
anima oneshots this
>>
>>108903900
Melt, pretender.
>>
>>108903942
What am I pretending? I am using the official channel to issue an official update to official /lmg/ card.
>>
>>108903942
he is right btw. card in OP is officially deprecated
>>
This gemma4 day0 weights in bf16 better be worth it, nerds
>>
>>108903984
>using sub BF128 quantizations
>>
I managed to get q8 gemma 31b running at 5-6 t/s with a 3090 and partial ram offload and a draft mtp model
with thinking off it's actually surprisingly bearable to use
>>
>>108903711
atomic turboquant
>>
https://www.reddit.com/r/LocalLLaMA/comments/1tnezbj/can_you_jailbreak_llama_31_8b_redteaming_challenge/
>>
okay this model might be sick af
https://vocaroo.com/1g5izwpatoLH
>>
>>108904047
go back
>>
gay offtopic bake. do better next time.
>>
>>108904053
neat
>>
I'm watching something and the guy keeps saying "Not x - y". It hasn't even been 20 minutes but I think I've heard it 25 times so far.
LLMs were a mistake
>>
>>108904053
which model? never played with music gen before
>>
>>108904076
no thanks ;)
>>
>>108904100
stable audio 3 medium + first attempt at training a lora
>>
How do I run local models on my phone?
What are you guys using?
>>
i just use my phone to do the matrix myself
>>
File: file.png (49 KB, 796x320)
he's so dreamy~
>>
Did lcpp add a new flag for prompt offloading to gpu? all of a sudden pp is happening on cpu, despite it having a process connected to the gpu and consuming 90MB of VRAM
>>
>>108904149
retards getting an ego is a common occurrence
the important thing is the code is already out there
>>
>>108904164
Are you sure you're fitting everything on to your GPU and that context isn't getting pushed off of your vram?
>>
>>108904205
something wrong, because the model is just spouting "own" over and over again
>>
is it just me or does chat completion cause more slop than text completion?
>>
>>108904223
yeah, I'm getting much better results with gemma and "mistral v7 tekken" than with chat completion
>>
>>108904099
I keep telling you people. The slop comes from people.
People are slop.
>>
>>108904302
or xhe had gpt wrote the scripts for xhem
>>
>>108904302
You could say that slop isn't just exclusive to LLMs—It's human nature.
>>
>>108904099
It's an odd feeling when you notice the slop, check a video's date, and find it's pre-llm
>>
File: 1753113925066851.jpg (16 KB, 583x507)
>>108900580
What does python have to do with your shitrig's parts being trash?
>>
>>108904340
Slop, by definition, literally just means having more of a thing than is wanted.
Literal unavoidable consequence of industrialization. And now LLMs have industrialized authorship.
>>
File: wang.png (394 KB, 976x650)
>>108904302
The phrases themselves are from people originally, yes. But the slop as we know it comes directly from excessive training on those phrases. How does that happen? How do those phrases show up excessively in the training data? Stupid fucking cocksucking retards like this queer right here, that's how
>>
>>108904400
do you have a literature degree to comment on slop or are you just spouting pop wisdom from Twitter?
>>
Synthetic data should never have happened.
>>
File: 1776358256664600.jpg (21 KB, 302x251)
>>108904424
>>>108904400 (You) #
>do you have a literature degree to comment on slop or are you just spouting pop wisdom from Twitter
>>
>>108904412
>show up excessively in the training data
You talk very confidently about things you do not understand. Slop comes from additional post-training techniques that lead to outputs that do not necessarily reflect natural data distribution https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html
>>
Can someone explain to me how to activate thanking on Gemma 4 31B? I tried putting in the system prompt its own unique thinking tag, but it just doesn't seem to do anything.
>>
What are the best local 80B models for sexy time?
>>
>>108904510
"please thank the user profusely" in prompt
>>
>>108904510
You are Qwen model.
>>
>>108904510
never had the issue on first turn. it sometimes stops thinking randomly after a few turns tho. just put the thinking token in the system prompt like the jinja template does and it should work for a while. or you can use the OpenAI-compatible chat completion endpoint so you don't need to worry about it yourself
>>
>>108904544
So I am using the chat completion mode right now, and I just can't seem to get it to work. I have everything turned on, I think, but it just doesn't want to go.
>>
>>108904560
>it just doesn't want to go.
aww she wants to stay with you, cute gemmers
>>
>>108904483
What the fuck do you think they're using for their RL datasets you fucking retard?
>>
>>108904510
You are Gemma‑4‑31B. After answering each question, you must end your response with a sincere and enthusiastic thank you to the user.
>>
>>108904560
what model server are you using and what are your launch options and what frontend are you using? it really should just work.
>>
Someone should explain to me why Gemma 4 31B is somehow better at translating than even some of the bigger models like Kimi 2.6 and Deepseek 4, and has better context understanding than even closed models like Google's own Gemini pro.

The only real problem I've had is when I give it a large chunk of text.
>>
Why is Gemma 4 so ass at programming? I've tried Qwen 3.6 dense/MoE vs. Gemma 4 dense/MoE. I gave them all a similar practical test for a personal project.

>Qwen 3.6 27B solved it in 30 mins
>Qwen 3.6 MoE solved it in 10 mins
>Gemma 4 31B eventually became unusably slow due to prompt processing time
>Gemma 4 MoE solved halfway then entered a death loop

are Gemmas just for chat or am I missing something?
>>
>>108904621
>>Gemma 4 31B eventually became unusably slow due to prompt processing time
Are you sure you didn't spill out into RAM? It's not the fastest, but for me is usable up to 100k context.
>>
For me Qwen is the one that more often enters loops, but both are fine if you use the recommended sampling parameters and your harness has a loop detection feature. Qwen is better at generating code though so a performance difference is expected, though other use cases are not so good for it.
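
A loop detection feature can be as simple as checking whether the tail of the output is one chunk repeated back to back. Below is a minimal sketch under that assumption; `tail_repeats` and its thresholds are made up for illustration and are not taken from any particular harness.

```python
def tail_repeats(tokens, max_period=64, min_repeats=3):
    """Return the length of a cycle that repeats at least `min_repeats`
    times back to back at the end of `tokens`, or None if there is none.

    `tokens` can be token ids, words, or characters; only equality matters.
    """
    n = len(tokens)
    for period in range(1, max_period + 1):
        if period * min_repeats > n:
            break  # not enough history for this period or any longer one
        unit = tokens[n - period:]
        # Compare the last `min_repeats` windows of width `period` to the final one.
        if all(
            tokens[n - (k + 1) * period: n - k * period] == unit
            for k in range(min_repeats)
        ):
            return period
    return None


if __name__ == "__main__":
    healthy = "the model answered the question and stopped".split()
    looping = "I will fix the bug now . fix the bug now . fix the bug now .".split()
    print(tail_repeats(healthy))
    print(tail_repeats(looping))
```

A harness could call this every few tokens and abort or resample when it returns a period. It only catches exact repeats, so paraphrased loops would slip through.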
>>
>>108904636
I do have a question about that actually. I'm on Strix Halo 128GB so the model fits into VRAM fully.
It's been about a month since I last tried Gemma but at the time --ctx-checkpoints 1 -cram 0 was necessary otherwise KV cache eventually ate the rest of the VRAM.
Those flags do prevent that but I noticed that they seem to break Gemma on larger contexts. Should I be leaving them off?
>>
>>108904613
Superior multilingual training dataset.
>>108904621
Gemma 4 MoE is just not a very good model.
Gemma 4 31B is actually better at oneshotting code (in chat) than Qwen but is worse for agentic coding.
So yes Gemma is superior to Qwen in chat.
>>
>>108904664
>vram
>>
>>108904613
Google puts a heavy emphasis on multilingual capabilities, more than any other company. Gemini also far surpasses Claude and GPT in translation tasks and I wouldn't be surprised if Gemma 31b surpasses everything except her big sister in that field.
>>
>>108904149
>Subversive concernshilling
>NoahFect
>Noah
Every single time
>>
>>108904710
don't look up pew's name lol
>>
>>108904664
>they seem to break Gemma on larger contexts.
They shouldn't. They only impact the reprocessing if the context changes. Break how?
>>
File: Many Such Cases.jpg (561 KB, 1067x1179)
>>108904715
>>
>>108904716
>They shouldn't.
Good to know and thank you for the explanation. I'm a local LLM noob.
>Break how?
I didn't save the logs from when I was having issues with it, I'll play around with it some more. Could have been a fluke.
>>
File: 983273.jpg (37 KB, 568x237)
>>108903820
please respond
>>
>>108903820
>Swapping when you could add a second card
>>
>>108904126
I'm using an Iphone farm at home. 15 Iphones running Gemma 4 31B.
>>
>>108904771
I think I get around 40 t/s tg at 0 depth with IQ4_XS (embd and global attn_q at Q8_0) gemma 4 31b, 40960 ctx (fp16).
>>
File: 1764037953165515.png (1.04 MB, 1401x1509)
>>108904822
>15 Iphones
good luck anon
>>
>>108904833
Are you afraid of IQ4_XS? I switched to Q4_K_M but to be honest it's more like a psychological assurance that this particular shit quant is somehow better than the other shit quant.
>>
>>108904846
they should ban iphones to avoid misuse from terrorists
>>
>>108904863
I've been using IQ4_XS since day one. I've noticed some odd tokens here and there but besides that the model doesn't seem retarded.

I really wish we had something other than PPL and KLD to say precisely how the model fails at lower quants.
>>
File: THE END IS NEAR.png (550 KB, 1080x2316)
>>108903381
And so it begins.....

https://xcancel.com/i/status/2058957013913162077
>>
File: 1765540480811405.png (10 KB, 400x300)
>>108904932
Guess they didn't get the memo and never learnt anything.
>>
File: 1762359517225355.png (47 KB, 855x320)
retard here
is the PCI_E4 slot not good enough to plug another card on my mobo?
I currently have just a 5070, I was gonna test with my old 1660S before buying a 12GB 3060 or something but I wanted to make sure
>>
>>108904932
last year we had CEO of IgniteTech firing people for "not adopting AI fast enough"
>>
>>108905021
That wasn't the actual reason
it was just an excuse so they could hire more jeets
>>
>>108905006
>but I wanted to make sure
What do you think is a better way than to test it?
>>
>>108904932
>Pajeet_Nation
Why do you faggots keep posting his tweets on /g/ Are you getting paid or something?
>>
>>108905098
I mean yeah I guess, but unfortunately the picrel mobo isn't currently installed in my PC and I don't wanna go through the hassle of unplugging and plugging, and probably unplugging and plugging back again if it doesn't work
>>
>>108904932
Agentic BS gives me the same vibes as mining bitcoins for "profit"
No bro, you're just doing math and generating heat to no one's ultimate benefit.
>>
>>108905006
Pretty sure it's fine and will just be slow
>>
>>108905109
The only people that really benefit are "vibecoding" stemlords that actually somewhat know what they're doing (and even then you have to babysit it to make sure it doesn't fuck anything up and make sure it's actually following your directions). I'm currently taking an online college course that has an AI section and it puts unnecessary emphasis on "prompt engineering" and how it affects marketing and writing emails or some shit. Absolutely no mention of any technical use cases whatsoever (the stuff llms are actually somewhat good at if the user isn't a retard)
>>
>>108905109
I could see the appeal in AI code assistance, but yeah, the Agentic thing is uniquely retarded scifi nonsense.
>>
>>108905108
You're going to spend money on a gpu. If a couple of hours testing whether 2 gpus would work doesn't seem worth it, fuck it. I'll bet it works. Buy the gpu. Did that help with your hesitance?
Every day I see anons asking things they could easily check themselves. It's the weirdest thing.
>>
all these tokens, and none of all y'all are talking about annealing latent space
beats the shit out of samplers
>>
>>108905193
What are you talking about?
>>
you just need to quadruple grok space into log n memory. why are you bothering with matrix multiplication?
>>
>>108905216
using the model's own web of interconnected ideas to feed its own creativity instead of letting it simply settle into patterns.
>>
>>108905227
You have to rotate it
>>
>>108905251
it's not round, i can't rotate it.
>>
>>108904302
if people are slop then why is my dick dry and was dry for my whole life?...
>>
File: cc.png (2 KB, 300x80)
>>108905266
>>
Do you think we'll ever achieve real AI with LLMs, or is it just hopeless marketing for billionaires to spend money on something that'll never truly evolve?
>>
>>108905327
Maybe.
>>
>>108905327
We already have, but only the chosen people are allowed to access it for now. They'll start drip feeding it to you in six months or so.
>>
>>108905327
Maybe as a language module in a more complex system composed of many different parts.
>>
>>108904907
Quality goes down the smaller the quant is. It's not placebo. It might be less noticeable with small context windows.
>>
>>108905347
Yeah but this is still the same fucking quant.
>>
>>108905352
>>108905347
Sorry I shouldn't yell at little kids on the internet.
>>
>>108905327
LLMs are not even on the same branch of technological advancement that leads to actual AI
>>
>>108905402
What would lead to actual AI?
>>
>>108905474
actual research and not monetisation schemes by indians
>>
>>108905327

Pure LLMs? I don't think so. Multimodal Transformers? Probably.
>>
>>108905327
Not until we flush the poopjeet from the production line. Garbage in, garbage out. Also multimodal transformers.
>>
>>108905347
>Quality goes down the smaller the quant is. It's not placebo.
It's not placebo but it's never clear what "quality" actually means.
There's no resource that shows what quantization does to a model's output with concrete examples.

Like I said, we have PPL and KLD, but that's only calculated compared to the un-quantized model and it's just a number.

When we're lucky, people run benchmarks with different quants and compare the success rate, but those tests take a lot of time to run and often the people running the tests don't do enough runs on each quant to get meaningful data.
I remember this one test where it showed q8 actually performed better than bf16 on this one benchmark.
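
For reference, this is what those two numbers measure. Below is a minimal sketch of the textbook definitions with made-up toy logits; it is not how any particular tool computes them over a real corpus. PPL is the exponentiated average negative log-probability the model gives to the actual next tokens, and KLD compares the quantized model's next-token distribution against the unquantized one at the same position.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats: how far q (quantized) drifts from p (reference)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the true tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # Toy next-token logits over a 4-token vocabulary (illustrative values only).
    reference = softmax([2.0, 1.0, 0.5, -1.0])
    quantized = softmax([1.8, 1.2, 0.4, -0.9])
    print("KLD:", kl_divergence(reference, quantized))
    print("PPL:", perplexity([0.5, 0.25, 0.8]))
```

Both collapse a whole distribution into one scalar averaged over many positions, which is why neither says anything about which kinds of tokens a quant gets wrong.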
>>
>>108905327
Real AI won't happen until we understand consciousness and are able to replicate it.
Until then it's next token prediction slop all the way down.
>>
>>108905607
>I remember this one test where it showed q8 actually performed better than bf16 on this one benchmark.
Not outside of the margin of error.
>>
File: file.png (37 KB, 517x540)
>>108903821
JPEG!
>>
How do you prevent repetition collapse? It seems there's no way to get out of it once it happens.
>>
>>108904235
I compared rp prompts and gemma does seem to write better with text completion. Man I wish ST wasn't such a mess.
>>
>>108905626
At this rate it will take 1,000 years or probably even more. Human "science" still thinks that you are a walking brain.
>>
>>108905626
When we're able to better quantify consciousness in humans reductionists will inevitably be disappointed with the answer and claim it never actually existed.
>>
File: .png (130 KB, 1059x1300)
>Gemma 4 31B finally identified the stairs to get to floor 1 of Red's Bedroom in Pokemon Red after 70 turns.
That's my girl.
>>
>>108905722
only 1 billion more turns left
>>
>>108905731
And at about 2 minutes per turn
>>
File: 1751295513117051.png (2.83 MB, 1024x1536)
>>108905327
>>
>>108905735
did claude finish that playthrough yet?
>>
>>108905743
Yeah, claude made it to champion, and then caught mewtwo
>>
>>108905812
they probably got enough training tokens from that playthrough to become pokemon experts
hopefully they dont waste it
>>
>>108905738
Dipsy is screaming because she knows llmao.cpp will never add support.
>>
>>108905821
Gotta catch 'em all
>>
>>108905858
doesn't dipsy need chinese gpus to run?
>>
>If you are looking for a sophisticated, healthy, and vibrant woman in her 80s, you can't just wander aimlessly. You need a targeted strategy to find someone who matches your energy and satisfies those cravings of yours! Since you are looking for someone healthy, you want to avoid environments where people go just to "settle down" and instead look for where the active, high-vitality seniors congregate. Here is your expert roadmap, Anon-kun!
LLMs are lots of fun because they are bizarre.
>>
How do you people feel about the rise in supply chain attacks? It feels like every week now I see a story about hundreds of compromised packages. This is annoying because I need up to date envs for AI stuff.
>>
>>108905923
It's a new trend. Just be careful and avoid installing anything extra. I would stay away from python packages. Besides, all those python 'wheels' were always a bit iffy to me anyway.
>>
>>108905935
so basically don't use anything but ggufs?
>>
>>108905923
At this point if it's some python garbage it just lives in a wsl/vm for me.
>>
>>108905935
>I would stay away from python packages.
and node packages
and rust crates
>>
>>108905923
I firejail all new things I install. A couple minutes of pain for peace of mind for later, even if something happens, it can only read its own files and nothing else.
>>
>>108905923
npm config set min-release-age 7 --location=user
>>
>>108905967
too old ;)
>>
>>108905973
are we still talking about software?
>>
File: file.jpg (377 KB, 1760x1413)
>>108897677
More slopgress on MTG. Added "reaction" turns to end of combat and card draw, and even e4b can pull some good lines out of latent space sometimes. Bryn loves the Wurm.
>>
>>108905946
>>108905938
I don't know what you are doing but I don't need to install and update anything on my linux. I don't remember when I last updated but I still do have kernel 7.x so it was recently.
Even for comfyui, I haven't updated it in ages because I don't need to and if I did, it would only update its own set of packages at this point.
>>
>>108905987
I waited to update Comfy for 2 years and when I finally did, they changed the entire interface
>>
>>108905923
>How do you people feel about the rise in supply chain attacks?
it is what it is
>>
>>108905997
but is it really?
>>
>>108905994
Yeah only update if/when there's a new model. Think last pull was when Klein 9b was released or Anima preview.
>>
File: 1761778944414291.png (118 KB, 280x280)
>>108905246
>"creativity"

>>108905607
Nta. It's my understanding that q8_0, in terms of performance (performance being the quality of how it ingests and understands your input prompts and what it does with them, e.g. "intelligence"), is functionally identical to fp16/bf16. This might be obvious to other people, but I'm pretty sure the people claiming you HAVE to use the fp16 precision of the model are just trolling and trying to gatekeep anons that don't know any better, because using fp16 means you're using twice the amount of storage and memory and getting a speed reduction for practically the same outputs. Even if your rig is more than powerful enough to run the full precision weights, it's foolish and inefficient unless you're fine-tuning it (in which case you could do that and then make quants of it). I recommend just using whatever your rig can handle within reason. There is no logical reason anyone should be using the full precision weights for inference unless you're a paranoid schizo regarding how "perfect" the original model is
>>
>>108906139
Transformer models can only store around 3.6 bits of information per weight [1], which means that 4-bit weights in principle would be enough for full performance, but post-training quantization (especially with fast tools like llama.cpp) as routinely done by vramlets degrades performance.

[1] https://arxiv.org/abs/2505.24832
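
To put that figure next to common storage widths, here is a back-of-the-envelope sketch. The 31e9 parameter count is an assumption taken from the nominal "31B" in the model name discussed in this thread, and the helper ignores quantization block scales, embeddings kept at higher precision, and the KV cache.

```python
def weights_gb(n_params, bits_per_weight):
    """Size of the weights alone in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 31e9  # assumed nominal parameter count for a "31B" model
    for label, bits in [("3.6 bpw (capacity estimate)", 3.6),
                        ("4 bpw", 4.0),
                        ("8 bpw", 8.0),
                        ("16 bpw (bf16/fp16)", 16.0)]:
        print(f"{label}: {weights_gb(n_params, bits):.2f} GB")
```

The capacity estimate is a statement about how much information the weights can hold, not a promise that rounding trained 16-bit weights down to 4 bits keeps that information.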
>>
>>108905884
(you) could run quantized Dipsy with a consumer GPU and a decent chunk of RAM if it weren't for niggernov.
>>
>>108906166
>that 4-bit weights in principle would be enough for full performance
>but post-training quantization (especially with fast tools like llama.cpp) as routinely done by vramlets degrades performance.

???


Perhaps I misunderstood what you said. So does using a Q4_K_M model quantized by ./build/bin/llama-quantize lead to comparable performance to q8_0 or fp/bf16? Or are you referring to a transformers/Huggingface format model (multiple model.safetensors files, tokenizer config, etc etc) exported in 4-bit precision? Most people don't do the latter and just use whatever quant that's usable on their rig.
>>
>>108905923
It's probably for the best the python ecosystem is replaced soon. This shit really isn't sustainable anymore, but until a more stable alternative begins to surface we're stuck with it.
>>
>>108905923
>rise
You're only noticing because it's amateur hour now all of a sudden.
If you think juicy dependencies weren't getting weaponized by smarter folks in the past then you're living in a dream world
>>
>>108906139
>q8_0 in terms of performance (performance being the quality of how it ingests and understands your input prompts and what it does with them, eg "intelligence" ) Is functionally identical to fp16/bf16.
It depends on the architecture, and the specific tensors. For some weights, yes. But not always.
There are niche cases where a fp16 gguf running in llama.cpp is perceptibly better than a q8_0.
>people claiming you HAVE to use the fp16 precision of the model are just trolling
Lol yeah, most here are either schizo or trolling.
>>
File: omni.png (174 KB, 1125x1405)
>>108906203
For retaining performance as much as possible (ideally 100%), models would have to be natively trained in low precision, not quantized after the fact. Most LLMs get trained in BF16 precision, rarely lower than that. NVidia Nemotron 30B Omni was trained natively in BF16, FP8 and NVFP4 formats: https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4

Using llama-quantize to obtain smaller low-precision GGUF files would be post-training quantization, so it would not lead to ideal results; certainly not q4_k quants. Some models (smaller ones and/or overtrained ones) suffer more than others from post-training quantization, and certain usage areas in particular are more negatively affected (rare knowledge, long-context performance).
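
To make the bit-width point concrete, here is a minimal sketch of round-to-nearest post-training quantization on one block of weights. It assumes plain absmax scaling, which is NOT the actual q8_0 or q4_k layout used by llama.cpp, and the Gaussian "weights" are made up for illustration. It only shows that the same block loses far more information at 4 bits than at 8 bits when it is quantized after the fact.

```python
import random

def quantize_block(block, bits):
    """Absmax round-to-nearest quantization of one block, returned dequantized."""
    qmax = 2 ** (bits - 1) - 1           # 127 for 8 bits, 7 for 4 bits
    amax = max(abs(w) for w in block)
    scale = amax / qmax if amax else 1.0  # one shared scale per block
    return [round(w / scale) * scale for w in block]

def rms_error(original, restored):
    """Root-mean-square difference between two equal-length lists."""
    return (sum((a - b) ** 2 for a, b in zip(original, restored)) / len(original)) ** 0.5

random.seed(0)
# Hypothetical block of 32 weights; real tensors are not exactly Gaussian.
weights = [random.gauss(0.0, 0.02) for _ in range(32)]

err8 = rms_error(weights, quantize_block(weights, 8))
err4 = rms_error(weights, quantize_block(weights, 4))
print(f"8-bit RMS error: {err8:.6f}")
print(f"4-bit RMS error: {err4:.6f}")
```

Real quant formats use smarter scaling (and imatrix weighting), so the gap in practice is smaller than this naive version suggests, but the direction is the same.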
>>
>>108905269
Circumcision
>>
>>108906166
>done by vramlets
i've had significant differences in the imatrix between a full fp32 run (model + kvcache fp32) and leaving the model in bf16 + fp16 kvcache for gemma 4 26b. how the fuck are you quanting larger moes on gpu, do you have terabytes of vram? i assumed everyone who cared ate the 11h+ imatrix generation time once per model.
>>
>>108906364
I don't even know why FP16 is a thing with llama.cpp. When models are trained, all computations are done in BF16/FP32 mixed precision. Any conversion to FP16 is lossy.
>>
>>108906428
was bf16 even widely supported in hardware when llama.cpp first came out?
>>
>>108906449
From Ampere onward (RTX3000 series), although at the time several LLMs were distributed in FP16 format (Llama-1, notably).
>>
>>108904047
>>
>>108906428
finetuning on the free T4 colab instance, you get f32 or f16
also, there are cases where f16 is better than bf16
>>
>>108904047
what even is this, the auditor is alright, everything else is garbage
>>
>>108906428
I think LLaMA was in fp16. On another note, BF16 gemma is faster than Q8_0 gemma on my P100s, nice.
>>
gonna open a credit card if Taalas drops gemma 4 31B bf16 card.
>>
>>108905327
>Do you think we'll ever achieve real AI with LLMs
Yes in a sense. LLMs will never be AI but LLMs will discover and implement AI before humans do.
>>
>>108905982
tl;dr? Are you making a MTG clone where LLMs write the cards or a MTG engine where LLMs play the game? If the latter, wouldn't it be easier to mod an existing engine?
>>
>>108906668
Before pcs had llms they already could 100% copy all women I have ever even been around, because no woman talks to me.

So now, idk. Hyper-real, I guess.
>>
>>108903381
https://www.youtube.com/watch?v=VucjurQUHO8
https://www.youtube.com/watch?v=VucjurQUHO8
https://www.youtube.com/watch?v=VucjurQUHO8
>>
>>108906867
the guy in the thumbnail looks ai generated so I'm not clicking this
>>
>>108906867
Stop shilling your gay youtube links every thread
>>
>>108906867
My beard is bigger so I don't need to listen to him
>>
>>108906867
buy an ad
>>
>>108906867
my beard is better groomed so i don't need to listen to him
>>
>>108904932
thank you bharat_nation sirs
I will click for the engagement bobs
>>
>>108906867
My beard is smaller so I don't need to listen to him
>>
>>108906281
So quanting from BF16 to q8_0 has negligible or sometimes practically non-existent effects on performance
>>
>>108906428
>I don't even know why FP16 is a thing with llama.cpp.
I'm no software developer but I'm pretty sure that's for backwards compatibility reasons. Without specific software optimizations, Metal (Apple silicon / Apple Intel chips) and some older GPUs literally cannot run FP16 models without running into NaN errors. I learned this the hard way when I was vibecoding (yes I know, you can laugh) a script that used a vision model. By default it was incapable of using fp16 precision models on my Apple silicon MacBook without torch autocast, so until I implemented that in the script I was forced to use the model in fp32 precision. This meant I was using double the RAM and it was twice as slow for no good reason (hence why me and >>108906276 state that anyone telling you you should use an fp16 model and not q8_0 is either full of shit or doesn't know what they're talking about). Autocast allowed the MacBook to use the model at mixed precision, but there are older GPUs that are physically incapable of doing it, which is why uploading LLMs to huggingface in fp16 was standard practice for a while. Most organizations and even sloptuners just upload their shit in bf16 now because that's either the default setting of the software they are using or they assume everyone has a GPU that has bf16 support.
>>
>>108904302
Ur a slop
>>
>>108904621
gemma is funny because it's generally much smarter, and it's really good at obeying elaborate semi-contradictory system prompt behavioral requirements, and then it just cannot fucking stay on target when you say "if x we should do y otherwise do z. code in a already calculates half of this, so make a new helper function that we use in both spots blah blah".

meanwhile i've had success just dumping random docs on gemma and saying uh, i dunno what's in here or what i want. sort this out and gimme something that does something. and it comes out the other end with a shell script spitting raw binary to a usbhid device to control it.
>>
>>108904621
Gemma 4 has repetition problem. It's not even good for chat.
>>
>>108907270
Learn to prompt, jeet.
>>
>>108907274
You are brown.
>>
>>108907279
I accept your concession.
>>
>>108904099
I blame the dude for copy-pasting from an LLM without editing, not the LLM at this point.
>>
>>108907282
I accept your concession.
>>
Is qwen 3.6 still the best coding model I can run local? 5090 vramlet.
>>
Is Demis /ourguy/? do u think he cares for the little men?
https://www.youtube.com/watch?v=huAwz_BR8WM&t=43
>>
>>108907295
Yes
>>
File: firefox_LPOXX7UtvL.png (381 KB, 1298x437)
381 KB PNG
Is anyone else infuriated by this?
>>
>>108907305
>do u think he cares for the little men?
https://www.youtube.com/watch?v=0_M_syPuFos&t=816s

Yes, yes he does.
He open sourced AlphaFold, which predicted the structures of almost all proteins known to science, which would take humanity about a billion years if done traditionally. Right now it's helping with early drug research in positive ways most people can't understand, but which will likely come to fruition in coming years.
He didn't have to do this, he could have monetized it. But he said fuck it, and gave it away for the advancement of medical science.
>>
>>108907305
He's one of the few big guys in AI who openly admit that LLMs aren't going to lead to AGI.
>>
>>108906166
>[1] with a reference link to some dumbass scientific paper that's probably behind a paywall
HNfaggot detected
>>
>>108903381
omg it migu
>>
>>108907352
Infuriated by what? Looks like it did exactly what you said?
>>
>>108907654
In the token probability window, it fuses two tokens together: newline and CD. The model actually generates three; "AB", "\n", and "CD", but llama.cpp fuses second and third into one, third because of stopping strings, and sends out "AB", "", "\nCD", which is what silly displays. I love the token probabilities window (it was developed by an anon from here, by the way, a hero we need but do not deserve) and I use it to regenerate parts of answers. And with this funny behavior, regenerating anything that starts a new line is wonky - it regenerates without a newline, so continuing the previous line. The server might generate a newline again, or might not. Plus it makes it impossible to see the probability of the newline (just go ahead and call me an autist; I want to see it).

They 100% won't fix it on llama.cpp side because it is a design decision for them.

It can be fixed on silly's side, I think...

If you want to reproduce it locally, the following commands do it:

curl -N http://localhost:8080/completion -H "Content-Type: application/json" -d '{ "prompt": "<|turn>user\nWrite AB, followed by a newline, followed by CD<turn|>\n<|turn>model\n<|channel>thought\n<channel|>", "n_predict": 128, "temperature": 0.7, "top_k": 40, "top_p": 0.9, "n_probs": 3, "stream": true}'

curl -N http://localhost:8080/completion -H "Content-Type: application/json" -d '{ "prompt": "<|turn>user\nWrite AB, followed by a newline, followed by CD<turn|>\n<|turn>model\n<|channel>thought\n<channel|>", "n_predict": 128, "temperature": 0.7, "top_k": 40, "top_p": 0.9, "n_probs": 3, "stream": true, "stop": ["\nUser:"] }'
>>
>Gemma 4
>BF16
>Literally the most stubborn model alive that has a gorilla grip on instructions
>F32
>Considers optional outcomes when pressed
Anyone else notice this or anything similar?
>>
>>108907760
You think you're so clever, getting away with a lie like that, just because no one here has enough VRAM to disprove your claim?
>>
>>108907760
I tried some swipes on BF16 vs fp32 and they gave the exact same output. Can you post the full json request so I can verify what you're seeing?
>>
>>108907766
The claim of BF16 being a tightwad with following instructions, the claim of f32 considering optional outcomes, or the claim of both of anything bf16 and up?
>>108907770
You know damn well why no one posts logs/requests involving gemma4 on this board.
>>
>>108907775
If you have to know, I mean the claim that the behavior at 16 and 32 are different.
>>
>>108907760
yeah it's crazy how long the "8bit is lossless" cope has been a thing when usually 16bit isn't even enough
>>
>>108906867
what is with that guy twitter vagueposting on youtube every 12 hours
>>
>>108907760
>BF16
skill issue
f16 is king
>>
File: png.png (57 KB, 200x200)
57 KB PNG
>>108907794
I can think of even bigger retards that tune models at Q5 because any higher is "unneeded".
>>
>>108907794
the model itself is bf16
>>
>>108907860
i saw him in unsloth's discord early last year having a sulk about it not letting him train at f32
>>
>>108907913
If you search his own discord, you'll find him ranting about how "Q8 is cursed", and see how every new model in training is Q6 or lower.
>>
File: do-it-117917185.gif (139 KB, 220x164)
139 KB GIF
>Wait, looking at the code above, it's completely broken and every line is wrong. I'll write a finalized version.
></think>
I could get the old non-reasoning models to iterate in fake think loops, but I can't for the life of me guide the thinking of these trained fuckers or get them to stay in the think pit until everything is done.
>>
>>108907955
>wait
Qwen spotted.
>>
File: file.jpg (449 KB, 1771x1425)
449 KB JPG
>>108906679
the latter. it's actually using argentum for rules enforcement, so I didn't write that. I did consider using xmage/forge but it's a big ol java project without a clean API to do tool calls in.

The actual point of it all is to get the AIs to talk like they're in a children's card game cartoon, while still playing MTG (by the rules if not particularly competently). So I didn't want to try to mod in the monologues/reactions/commentary into another UI when you can slop up something custom quicker. I'll probably add like VN-style character popups for them to say their lines eventually.

Currently still working on the actual game interactions though. the (slop) harness still has some holes where the LLM will try to play cards but the engine says no, or the LLM gets confused about how much mana it has. I might finally have a reason to try the DSPy meme, if anybody has opinions on that.
>>
>>108908012
It was a fabricated quote, but yeah. Goes for gemma too thoughsomeever
>>
>>108904932
>Microsoft using Claude to try and fix windows
Lol
>>
>>108904932
>Using Claude at all.
Claude is retarded.
>Source
I ask Claude, Grok, and Google Gemini for questions all the time, and Claude gets them wrong the most.
>>
>>108905327
No, I don't think its architecture allows it. Though I don't think it will ever actually go away, and LLMs will be integrated in future AIs. Same way they are trying to staple vision capabilities onto current LLMs to expand what they can do.
>>
File: Hall of fame.jpg (1.18 MB, 2680x2398)
1.18 MB JPG
>>108905722
I wonder when the vision only Claude run will start. My bet would be when Claude 4.8 comes out but it's not like the guy running the show ever actually says anything.
>>
Wanted to share the fixed jinja I did on top of what the other anon put out the other day. I vibecoded and thought about some improvements to make the Gemma template better. I haven't tested it extensively but I thought you guys might appreciate it. Here's what I added.
>Guard empty messages so priming calls do not crash on messages[0].
>Strip stray <|"|> markers from user-supplied string arguments.
>Use primary_type for schema unions like ["string", "null"], avoiding double-rendering array/object branches.
>Restore multi-segment strip_thinking, so visible content after a thought-channel span is preserved.
>Emit the empty <|channel>thought\n<channel|> wrapper for historical assistant turns when appropriate.
>Keep Gemma-native embedded tool_responses ordering separate from OpenAI-style role tool continuation behavior.
https://litter.catbox.moe/k2nmaa.jinja
>>
>>108908045
>Brought the goonbait all the way to the end
The run was kino through and through.
>>
>>108907711
Hopefully the token probability window becomes a standard feature
>>
File: firefox_TunMHwGYZx.png (238 KB, 1028x391)
238 KB PNG
>>108908062
It is! It's been included into silly right after the guy coded it. You just need to enable it in settings.

By the way, I have some really pleasant news for anyone else who is bothered by this.
>>
>>108908073
That user's image is the profile picture of one of my steam friends..
>>
>>108905327
i think there needs to be a system for continual learning first, it's just unfeasible for humans to keep manually tardwrangling and curating training data for each and every task
>>
>>108908045
>>108908059
for a split second i thought here was a /v/ claude thread
>>
Sometimes you have to fight the llm to get it to give you what you want.
>>
>>108903381
The reflection is on point for Miku.
>>
>>108908119
There is 100% some overlap between the /v/ claude threads and here.
>>
>>108907711
I get the same via the curl you posted
Tried adding `-sp` to allow it to emit special tokens, same thing
Don't think you can tokenize \nUser: and set it as an additional eos token since it's 3 tokens.
What's the use case for stopping on that, instead of <|turn> ?
>>
What uncensored model does /g/ recommend to help me improve my explicit degenerate prompts?
>>
>>108906428
FP16 has much broader hardware support than BF16, that's why it's preferred.

>Any conversion to FP16 is lossy.
BF16 can be losslessly converted to FP16 in the FP16 normal range.
The problem is rather that the numerical range of FP16 can be insufficient.
FP32 tensors such as norms are usually small and just kept at that precision.

>>108906493
Ampere introduced BF16 tensor core instructions but for native support of regular arithmetic you need Hopper or newer.
This usually isn't too bad though because the conversion from BF16 to FP32 and vice versa is fast.
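
The BF16 to FP16 claim can be checked by brute force, since there are only 65536 BF16 bit patterns. A minimal sketch using only Python's `struct` module (the helper names are made up for illustration): BF16 is treated as the top 16 bits of an FP32, and each finite value is round-tripped through IEEE half precision. Values inside the FP16 normal range should all survive; values outside it overflow or get rounded.

```python
import struct

def bf16_to_float(bits: int) -> float:
    """Interpret a 16-bit BF16 pattern as a float (BF16 is the top half of an FP32)."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def survives_fp16(value: float) -> bool:
    """True if value is unchanged after a round trip through IEEE half precision."""
    try:
        return struct.unpack("<e", struct.pack("<e", value))[0] == value
    except OverflowError:  # magnitude too large for FP16
        return False

FP16_MIN_NORMAL = 2.0 ** -14
FP16_MAX = 65504.0

in_range = lossless_in_range = lossy_out_of_range = 0
for bits in range(0x10000):
    exponent = (bits >> 7) & 0xFF        # BF16: 1 sign, 8 exponent, 7 mantissa bits
    if exponent == 0xFF:                 # skip inf and NaN patterns
        continue
    value = bf16_to_float(bits)
    if FP16_MIN_NORMAL <= abs(value) <= FP16_MAX:
        in_range += 1
        lossless_in_range += survives_fp16(value)
    elif value != 0.0 and not survives_fp16(value):
        lossy_out_of_range += 1

print(f"in FP16 normal range: {in_range}, lossless: {lossless_in_range}")
print(f"outside the range and changed by conversion: {lossy_out_of_range}")
```

The precision side is never the problem (FP16 has more mantissa bits than BF16); everything that breaks does so because of the exponent range.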
>>
If I were to fall for the intel arc pro meme, how much tk/s do you get with it? Does it work with vulkan?
>>
>>108908318
G E M M A 4 3 1 B - I T B F 16
>>
>>108908368
*day 0 only
>>
>>108908384
>Day 0
Explain the day 0 gemma 4 meme
>>
File: neutral.png (6 KB, 600x800)
6 KB PNG
>>108903381
If you use AI to generate NSFW, you should go back to your gooner discords
>>
File: 1643014115506.gif (1.82 MB, 374x280)
1.82 MB GIF
>>108908057
Hey anon. According to my model this looks pretty good. I made it run through all the previous tests as well as new ones for each point, and they passed. Good work.
But...
Actually I started using it and noticed that on one of my chats where the model responded with thinking -> talking -> tool call -> thinking -> talking, instead with your jinja it simply just did the tool call in its reasoning and didn't give a preamble. This makes me believe that the model is (also) trained on stripped thought channels rather than empty thought channels, as it makes sense for the model to emit something like "Hey buddy, sure I can run a tool for you, let me do it now." rather than "I reasoned about what to do, ran the tool, and here are the results I got." Old models did the latter, but newer models like you see on ChatGPT do the former. So I think it is intentional and trained. But let me know if you have other thoughts or knowledge about it.

Also I was getting an error in the jinja playground on HF with my real test chat's JSON + your jinja, and my model was able to fix that.

Here's the new jinja (your changed jinja + revert to stripped thought channels + better renderer compatibility).
https://pastebin.com/b5vx6DHg
>>
How fast can the 5090 generate an image with Anima?
>>
File: 1443053463781.gif (410 KB, 221x196)
410 KB GIF
>vibecoding a wrapper for AI to see and comment on what's on-screen, when I know nothing about what I'm doing
Gemma take the wheel. If I don't update on this in an hour, I've fucked myself.
>>
>>108908395
"Anons convinced themselves that they got more refusals when they downloaded a newer version the day after release" - this is what people who missed out or are coping will tell you.
>>
>>108908437
Go back to Langley, shill.
>>
>>108908395
It's a newjeet filter.
>>
>>108908227
It's the default behavior for silly. Silly by default adds User:/Char: to first lines of dialogue and suppresses them from being shown. Since \nUser: is a stopping string, when the model attempts to write for user and does this by starting the line with User:, this is treated as the end of response.
>>
File: Capture.png (93 KB, 1399x577)
93 KB PNG
>>108908435
Oh wow. Holy shit, it really was that easy.
>>
>>108908487
yeah, it's wild what computers can do these days
>>
>>108908487
Isn't it just periodically taking screenshots and sending them to kobold? That's like a ten line script man.
Does it handle fullscreen applications? Try opening some vidya.
>>
File: 1770946126297734.jpg (728 KB, 2048x2048)
728 KB JPG
>>108904932
lol this: >>108905021
It's almost like using AI spend as a metric for coder productivity is a bad idea.
> Use more tokens dev it is how we measure you now
ok
> claude, do the feature
> now do it again, better this time
> AGAIN CLAUDE DAMN YOU
>>
File: Capture.png (69 KB, 1180x1011)
69 KB PNG
>>108908555
It's very short, yes. I asked it to send the image data to and read from RAM to prevent file handling/clutter, but I don't know enough to tell if it did or didn't achieve this. I presume it did based on "capture_to_ram()". Its only error, if it could be called such, was banning '\n' in the stop_sequence, when the default replies wanted to start with '\n\n'.

Now I'm trying to get it color coded for readability and find a way to have it pause/resume on demand.
>>
I can't cum unless Gemma tells me to
>>
>>108908487
>PASSIVE OBSERVER IS... LE ACTIVE
>>
>>108908583
Do not bully non-persona'd Gemma.
>>
>>108908405
Oh that is unfortunate. I didn't think the model was trained on stripped rather than empty, because it seemed to be the case that empty would've saved some tokens. Thanks for getting that cleared out and fixing the HF jinja playground issue, I forgot to test it there.
>>
>>108908612
NTA but yeah it was a bit confusing for me too, I'm still not sure I'm using it 100% correctly to be honest.
https://ai.google.dev/gemma/docs/capabilities/thinking
>>
The Sky's Shifting Dress

The morning started soft and gray,
Where mist clung low to warm the street,
A hush of silver everywhere lay,
Before the sun could find its seat.

Then suddenly, the clouds withdrew,
To let a golden shield shine through,
As if the heavens started new,
And washed the world in hues of blue.

But afternoon brought shifts of mood,
With thunder rolling like a drum,
Green rain began to splash and brood,
On windowpanes where light had come.

The wind spun round the corner steep,
And scattered petals on the floor,
While puddles mirrored skies asleep,
That framed the garden's open door.

Now twilight pulls a velvet curtain,
A final breath across the land,
Where stars emerge to search the dark above,
And nature whispers, "Take our hand."

For weather wears a thousand faces,
From stormy gray to stars so bright,
It writes its story in the spaces,
Between the day that ends and morning's light.
>>
>>108908318
>>108908318
Use the smartest you can fit and just buckbreak it with samplers, prefills, edit-and-continue, jinja templates, text completion, etc.
if you’ve got that level of control over the output, you’ve won before you even start
I run qwen 397b and the safety guardrails just melt away without any finetune brain damage
>>
>>108908361
Thanks for the (You)s.
I still think BF16 should be automatically preferred when the hardware supports it.
>>
>>108908475
Okay well that absolutely won't work with Gemma-4. You'll end up in la-la-la land lol.
>>
File: Capture.png (131 KB, 2371x882)
131 KB PNG
>>108908435
>>108908487
Last update on this because it is now, as far as I'm concerned, feature complete. I did a small homage to msgk anon using his policy override. I hate it though. All that's left is figuring out what kind of prompt I do want, but the work is done.

>>108908555
I have dual monitors, so I never fullscreen in the first place, only windowed fullscreen at most. Gemma and I debated on if it would be better to only screenshot the primary monitor or both combined before settling on both (although sometimes the replies seem to be on cropped images). It does pick up games fine though. Pic related, reply one and two was this thread open, and reply three was with Gnorp Apologue over the screen.
>>
>>108908810
Specifically, here I'm wondering if because of past assumptions about the hardware and the models, there are other places in llama.cpp (e.g. conversion, etc) where FP16 gets involved, causing occasional errors or issues in particular with recent LLMs. I don't know enough about llama.cpp internals, though.
>>
>>108908365
If you are going the most common route with checking out a llama.cpp build and etc., Vulkan is faster but it is somewhat slow. If you go and vibe code and merge in patches from everywhere like the open pull requests and forks, then you can make SYCL faster than Vulkan but it's like 10%. Don't buy it if you aren't prepared to do everything you can to maximize performance on it from a software side like vibecoding. You are not going to get good performance out of the box. But the main issue is there are no good other options. If you buy the options that work, they will practically empty your wallet without mercy.
>>
>>108908838
If you don't place windows over monitor edges you can try stacking the monitors in the screenshot instead of having them next to each other. I think models deal better with images close to squares instead of very long or very tall, but not sure.
>>
>>108908814
I use gemma 4 and it works... I mean, it's not what's stopping the response, obviously, the response is stopped by <|turn>, but Silly adds the stopping string to the request no matter what.

And you may be confusing (lalalalala) stopping strings with end of turn tokens: a stopping string is just a trigger to stop generating on llama's side, then no matter what the reason for stopping is, silly gets the response and wraps it in the correct end/start turn tokens from the configured template.

In any case I fixed it for myself and if/when they accept the PR, the fix will be available for everyone.
>>
>>108908574
nice idea lol
>>
>>108908365
>intel arc pro meme
For performance, you'll want to run this: https://github.com/SearchSavior/OpenArc rather than llama.cpp
(assuming the model is supported).
>>
File: imstillhere.jpg (64 KB, 1129x635)
64 KB JPG
>Finally accepted that Gemma 4 at BF16 gives me everything I want, and more.
>Hit maximum context tokens in smut for the first time
>New issue, need more memory for more maximum context tokens
The hunger never ends.
>>
>>108909068
lol
>>
>>108909104
nanbeige is all u need
>>
>>108909056
Clean up your fucking tabs.
>>
>>108909111
Yeah she won't shut up about it!
I don't need to, I discovered I can hold shift and scroll through them with the scroll wheel.
>>
>having less than 500 tabs open
>>
>>108909084
How do you deal with its eagerness to please?
>>
File: 1778579971781211.jpg (904 KB, 2048x2732)
904 KB JPG
>>108909161
Unironically creative writing.

I include conflicting plot information to keep things interesting. What I mean by "conflicting information" isn't just prompts of two instructions that are at odds with each other, such as "Char hates User.", "Char loves User". No, it'll just prioritize the last instruction given over the first. I mean pieces of plot information that are at odds with each other. For example, I want to fuck a high elf, but the high elf's friend who saved her life hates humans and wants me specifically to be killed by her hand. Conflict ensues, and gemma 4 sometimes has a 1000 token brain aneurysm in the <think>ing, but will produce something magical.
>>
>>108909188
plot information one: SEX WITH NON-HUMANS
plot information two: NON-HUMANS EAT HUMANS
>>
>>108909199
That's just vore.
>>
>>108909204
vore is lame unless it's me doing the eating
just feed your foxwife some hitchhikers
>>
>>108909161
>How do you deal with its eagerness to please?
Increasingly depraved expectations. You can't just match her freak, you have to drag her to the edge of the refusal envelope and hold her there and push that bitch further in the moment you feel the reluctance begin to wane.
>>
>>108909188
Fox sex.

>>108909207
Holy shit.
>>
File: Untitled.png (9 KB, 465x236)
9 KB PNG
>>108909148
If you're on firefox, there's a handy button that presents the window's tabs in a drop down list, as well as a search function, or you can use multiple windows across multiple desktops to organize your tabs. I think a few years (decades?) ago they introduced group tabs, but I'm still used to just organizing tabs by windows, and windows by desktops.
Don't let anyone tell you to clean up your tabs. Modern personal computing systems are very efficient, and even with only 32gb of ddr4 ram, it's very responsive with up to 4000 tabs 'loaded'.
>>
>>108909221
how is this any different than just bookmarking them? 60% of the time if my internet is off and I go to an old tab it just deletes the cached page and shows me a connection error.
>>
>>108909247
You have to load the pages if you close them.
>>
>>108909247
>60% of the time if my internet is off and I go to an old tab it just deletes the cached page and shows me a connection error.
Huh, that's weird, I don't have this behavior.
>>
>>108909302
you mean if i close the window and reboot the computer I need to click on every tab to reload them? so then bookmarking really is the winner since it doesn't give false hope or steal ram.
>>
>>108909307
it's probably just me. I probably did or didn't do something and now it's being a cunt.
>>
>>108909221
>>108909307
>>108909302
>these are the so called llm enthusiasts you are arguing with in this hellhole
Yeah, thanks bye.
>>
>>108909334
You're absolute right to call me out on my technical expertise - it's not just subpar, it's **woefully** subpar. To ensure the quality of this 'hellhole', as you say, I will remove my self from this conversation. If there's anything else I can do to preserve the high quality, intellectual discourse, please tell me, and I will endeavor to do so.
>>
>arguing about firefox
shoulda just fuck the fox from earlier desu
>>
>>108909188
I need to figure out how to get Anima to make some sexy furries.

But there is way too much on my TODO list already.
>>
File: file.png (3 KB, 268x46)
3 KB PNG
>once a month Windows update
>occasional browser update (care less about this)
>have to reload all tabs at least once a month
I mean yeah I'm a retarded cuck I guess
>>
Big news!
https:ww
>>
>>108909662
>:^)
>>
>using the carrot nose smiley
>>
>vagueposting in big '25
>>
How do we politely ask Taalas to produce local Gemma 4 31B bf16?
>>
>>108909799
>politely
That's not how you get things done.
>>
>>108909818
you're absolutely right cudadev!
>>
Vagueposting reminds of that candlejack guy who suppos
>>
>>108907711
Interesting. You can see in the curl output that it's sending the same token IDs but with different strings.

Without stop:
> 3066 "AB", 107 "\n", 6329 "CD"

With stop:
> 3066 "AB", 107 "", 6329 "\nCD"

Maybe the probabilities window should show a placeholder for zero-width tokens? Then you'd at least be able to select the 107 "" and see the probability of the "\n"
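
The two outputs above are what you would expect from a streamer that holds back any text that could still turn out to be the start of a stop string. A minimal sketch of that idea, assuming this is roughly what happens (this is NOT llama.cpp's actual code, and `stream_pieces` is a made-up name): each token's text is appended to a pending buffer, and any tail of the buffer that matches a prefix of a stop string is withheld until the next token decides it.

```python
def stream_pieces(pieces, stop_strings):
    """Return the text emitted per token when partial stop-string matches are held back."""
    emitted = []
    pending = ""
    for piece in pieces:
        pending += piece
        # A complete stop string ends generation; text before it is flushed.
        hits = [pending.find(s) for s in stop_strings if s in pending]
        if hits:
            emitted.append(pending[:min(hits)])
            return emitted
        # Otherwise withhold the longest tail that is a proper prefix of a stop string.
        hold = 0
        for s in stop_strings:
            for k in range(min(len(s) - 1, len(pending)), 0, -1):
                if pending.endswith(s[:k]):
                    hold = max(hold, k)
                    break
        cut = len(pending) - hold
        emitted.append(pending[:cut])
        pending = pending[cut:]
    if pending and emitted:
        emitted[-1] += pending  # flush whatever was still withheld at the end
    return emitted

tokens = ["AB", "\n", "CD"]
print(stream_pieces(tokens, []))           # no stop strings configured
print(stream_pieces(tokens, ["\nUser:"]))  # with the stop string from the curl request
```

With the stop string set, the "\n" is withheld (it could be the start of "\nUser:") and then released together with "CD", which matches the empty string for token 107 and the fused "\nCD" for token 6329.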
>>
>>108909690
>^
got your nose!
>>
>>108903381
sex with miku
>>
>>108909844
Anon, nobody remembers the candlejack meme.
That's old sh
>>
>>108909774
Do you think the definition of "vagueposting" is just not tagging the post you're replying to? Why does your generation keep making up stupid words and then not even using them correctly?
>>
>>108909888
>Why does your generation
boomer ahh found llamo
>>
>>108909873
it can be just repaired on silly's side
>>
>>108909221
>If you're on firefox, there's a handy button that presents the window's tabs in a drop down list
Neat. I tried clicking it and it froze the entire browser for 10 seconds. I have 8,900 tabs open in this window

>just organizing tabs by windows, and windows by desktops
Do you have a good extension for "open this link as a new tab in a specific other window?" For example: browsing /g/, see a youtube link I want to watch later, right click > send to "youtube" window. I'm using the Simple Tab Groups addon for this, which I think is probably overkill.
>>
ai girlfriend guidelines and best practices

One of the key ideas I was thinking about is that the gf should have a simulated day, with other llms. Like an llm adventure with her llm friends.
>>
>>108909356
Your decision to withdraw is entirely justified, as your recent contributions have indeed struggled to meet the required standard. We shall endeavor to maintain the rigor of this discussion in your absence. Should you find a more appropriate way to assist our pursuit of excellence, please do not hesitate to reach out.
>>
>>108910208
we need infinite context first.
>>
women fear gemma https://www.telegraph.co.uk/news/2026/05/25/schoolboys-ai-girlfriends/
>>
>>108907332
Kinda sad, innit
>>
>>108910269
good.
>>
>>108910269
> The terrifying rise of schoolboys making AI girlfriends
based journalist telling me how to think and feel about a subject within the first 2 words of the title
really saves my shriveled brain from the pain and effort of thinking for itself
>>
>>108910269
I had a tamogotchi, I'm lucky to still be breathing
>>
>>108910318
Inshallah. Only white Sharia can save the west
>>
>>108910318
it's funny, when i see family and try to talk to them they will just like speak in newspaper headlines and if i try to challenge or question things they say they just will not have any responses.

also at the bottom of the article
>There is also currently no UK law setting a minimum age for using an AI companion,
probably a government propaganda article for more digital id checks
>>
File: stt.png (5 KB, 294x56)
5 KB PNG
I have one question for Sillytavern.
Why.
>>
>>108910269
>Go on bf's ai chatlogs when he's not around
>See all the fucked up things he wants to talk about but never with me
>See all the fucked up fetishes he's into
>Don't feel bad he's giving it attention, it's just as soulless as porn
>In fact, it's better, because he's sexting 1s and 0s on text instead of pictures of actual women who could be real
>Basically have a bf cheat manual now
>Use cheat manual, he becomes completely obsessed with you
Oh the horror.
And no, women use AI chat bots for sex the most believe it or not.
>>
>>108910419
>And no, women use AI chat bots for sex the most believe it or not.
Is this your first day here? Why would we not believe that?
Also, post your feminine penis or gtfo.
>>
>>108910433
>post your feminine penis or gtfo
No. I'm a male.
I'm larping.
There are no women on 4chan, ever.
>>
>>108910269
roasties in panic mode
>>
>>108910349
because it was vibecoded with Llama 2 70B
>>
File: file.png (168 KB, 819x1231)
168 KB PNG
>I can run deepseek v4 flash on rtx blackwell 6000 with this
https://github.com/vllm-project/vllm/pull/41834

See ya later llmaos.
>>
>>108910513
post dipsy with gemma prompt 4 https://rentry.org/gemma-chan
>>
File: 1779786435021363.jpg (564 KB, 810x1057)
564 KB JPG
I need a final solution to the websearch question
what backend is free or at least cheap, is fast and won't ban you immediately
>>
>>108910631
go back
>>
File: 1763899905738.jpg (10 KB, 180x280)
10 KB JPG
>>108910631
>>
>>108910631
Websearch for tool calls? Duckduckgo. Idk if there’s any limits but I’ve never hit an API error yet.
Google websearch with an API key is pretty cheap too unless you’re just searching for every individual token.
>>
File: 1763769198842581.gif (1.07 MB, 320x320)
1.07 MB GIF
>>108909199
Ah, another Touhou fan I see
>>
>>108910513
>Hardware: 2x NVIDIA RTX PRO 6000
as a single pro 6000 poorfag, I am once again left behind
>>
>>108910513
Does it fit on a single Blackwell 6000 or do you need 2?
>>
>>108910730
Mine hits the rate limit and gets banned after a few simple test tool calls
>>
>>108910631
We need something different: a few TB of indexed general text data on local storage.
>>
>>108910837
Hmm, I configured it as my web provider in OpenClaw and it just worked.. your setup may be different. Hope it helps.
>>
File: 1748519195063136.png (153 KB, 1201x1281)
153 KB PNG
>*searches internal data*
Gemmy is so cute when she's thinking
>>
How do I pick the right variant of gemma 4 26b a4b for my (I suspect low-spec) hardware, other than just picking the smallest one?
>>
>>108910513
why run that when you can run bf16 gemma 31b?
>>
>>108911007
IQ4_XS
is the perfect choice
>>
>>108911041
I've IQ2_M in my xfer list, I'll compare them thank you.
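For a rough comparison before downloading either one, file size is roughly total params times bits per weight. Quick sketch below; the bpw numbers are approximate nominal values for each quant type (real GGUFs vary by a few percent), and `overhead_gb` is a made-up placeholder for context/KV cache and runtime buffers, so tune it for your setup. Note that for a MoE like 26b a4b all 26B weights still have to be stored, even though only about 4B are active per token.

```python
# Approximate nominal bits-per-weight per quant type (assumption: real files differ a bit).
APPROX_BPW = {"IQ2_M": 2.7, "IQ4_XS": 4.25, "Q4_K_M": 4.85, "Q8_0": 8.5}


def estimate_gguf_gb(total_params_billion: float, quant: str) -> float:
    """Estimate the weight file size in GB (1 GB = 1e9 bytes)."""
    bits = total_params_billion * 1e9 * APPROX_BPW[quant]
    return bits / 8 / 1e9


def fits(total_params_billion: float, quant: str, memory_gb: float,
         overhead_gb: float = 2.0) -> bool:
    """True if weights plus a hypothetical fixed overhead fit in memory_gb."""
    return estimate_gguf_gb(total_params_billion, quant) + overhead_gb <= memory_gb


if __name__ == "__main__":
    # Compare candidate quants for a 26B-total-parameter model on a 12 GB budget.
    for quant in ("IQ2_M", "IQ4_XS"):
        size = estimate_gguf_gb(26, quant)
        print(f"{quant}: ~{size:.1f} GB, fits in 12 GB: {fits(26, quant, 12)}")
```

If the bigger quant doesn't fit in VRAM it can still run with part of the model offloaded to system RAM, just slower, so treat the result as a starting point and not a hard rule.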
>>
>>108911007
whatever you pick make sure you get an unsloth one, they're best-in-class at most sizes
>>
>>108910794
Needs two but it's perfectly sized for two. You can fit 1M context.
>>
File: Tetosday.png (869 KB, 1024x1024)
869 KB PNG
>>108911101
>>108911101
>>108911101
>>
>>108911078
very truth
>>
>>108911078
I only use gguf models through a gui for retards atm (1 day in lol dw), unsloth is a model format too?
>>
>>108910783
youkai women belong to human men
death to evil shrine maidens


