/g/ - Technology


File: cromch.jpg (109 KB, 1024x1024)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107914740 & >>107906367

►News
>(01/19) GLM-4.7-Flash 30B-A3B released: https://hf.co/zai-org/GLM-4.7-Flash
>(01/15) PersonaPlex: Voice and role control for full duplex conversational speech: https://hf.co/nvidia/personaplex-7b-v1
>(01/15) Omni-R1 and Omni-R1-Zero (7B) released: https://hf.co/ModalityDance/Omni-R1
>(01/15) TranslateGemma released: https://hf.co/collections/google/translategemma
>(01/14) LongCat-Flash-Thinking-2601 released: https://hf.co/meituan-longcat/LongCat-HeavyMode-Summary
>(01/08) Jamba2 3B and Mini (52B-A12B) released: https://ai21.com/blog/introducing-jamba2

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: 1745000754612930.mp4 (139 KB, 1066x1280)
►Recent Highlights from the Previous Thread: >>107914740

--Critique of GLM 4.7's performance and AI safety measures:
>107917148 >107917209 >107917224 >107917234 >107917251 >107917273 >107917277 >107917411 >107917418 >107917476 >107917488 >107917623 >107917633 >107917646 >107917647 >107917550 >107917575 >107919364 >107919529 >107919559 >107919617 >107919634 >107919669 >107919758 >107919886 >107919921 >107919938 >107919671 >107919773 >107919787 >107919814 >107919844 >107919863 >107919882 >107919981 >107919704 >107919709 >107919890 >107919911 >107919918 >107919936 >107919964 >107919969 >107919998 >107920008 >107920061 >107920072 >107920171 >107920212 >107920309 >107920920 >107921035 >107921165 >107919985
--Anthropic's control vector method and Neuronpedia platform for LLM interpretability:
>107915328 >107915535 >107915569 >107915599 >107917193
--Model pruning and distillation issues affecting Ministral 3's performance:
>107916060 >107916170 >107918392 >107918415 >107918464
--Challenges and misconceptions around tokenless thinking in LLMs:
>107918540 >107918590 >107918620
--Internal conflict at Meta's Llama project over benchmarking controversies:
>107920295 >107920425
--Adding Glm4MoeLite support to llama.cpp via GitHub PR #18936:
>107914835
--Critique of z.ai model performance issues:
>107920364 >107920379 >107920392 >107920401 >107920423 >107920586 >107920432 >107920440
--AI's imperfect logo redesign attempt from iBuyPower to TetoServer:
>107914884 >107918257 >107918657 >107921033
--Reasoning model failures in logic puzzles:
>107917864 >107917919 >107917933 >107918082
--Temperature debate for controlling response verbosity:
>107918188 >107918213
--Logs:
>107915461 >107915755 >107918644 >107918673 >107918708 >107919982 >107920084
--Teto and Miku (free space):
>107914893 >107915662 >107919581 >107919739 >107919918 >107920072

►Recent Highlight Posts from the Previous Thread: >>107914742

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
teto a shit
>>
Has anybody Nala tested GLM4.7 30b Flash yet?
>>
So how is GLM 4.7-Flash aside from censorship? Is it "the strongest model in the 30B class" as they claim?
>>
>>107921832
like ministral it feels broke
>>
A bit off-topic, but I've been using Gemini recently and even though they supposedly get the best benchmarks on context length, it still makes a bunch of mistakes at remembering what it talked about, even at low contexts in a few chat turns. The absolute state of LLMs.
>>
>>107921731
Why is she so fucking fat?
>>
File: 1764758232784124.png (46 KB, 696x198)
It's been an entire year.
>>
>>107922047
And we've never recovered.
>>
>>107922047
>>
>>107922047
V3.2 dropped in December 2025.
>>
>>107922028
bread has carbs
>>
>>107922110
No way
>>
>>107922102
V3 dropped in December 2024. The minor changes since then are irrelevant.
>>
>>107922134
> minor
lol
https://rentry.org/DipsyWAIT#deepseek-api-model-timeline
>>
>>107922134
v3.2 introduced a very interesting sparse attention mechanism that llama.cpp chose to ignore entirely and mangle the model to do dense attention instead.
>>
>>107922047
Great model. It's a shame it inspired so many useless MoE wannabe copycats.
>>
>>107922047
They literally killed the hobby
>>
File: unbothered.png (1.18 MB, 669x671)
everybody is arguing about trying to jailbreak GLM. this is not the path you seek. get kimi and have it roleplay as the character you want. the chances of getting a hard refusal go down 95% or more.
>>
>>107922192
does that mean the llama.cpp impl will have different output? or is it like FA on vs FA off and not supposed to differ?
>>
>>107922190
>>107922192
If it's so great, why wasn't it used for GLM 4.7 Flash?
>>
>>107922268
buy an ad kurumuz
>>
>>107922197
Before that the copycats were all doing wannabe llamas. Labs not capable of innovating weren't going to suddenly start if DS wasn't around.
>>
>>107921966
That's why they need all the RAM they can get I suppose
>>
deepseek wasn't even the first major open moe, mixtral was (and was pretty popular back then), the advent of moe models was inevitable
it's funny how mistral stopped doing moe after being among the first..
>>
idk what you're talking about, GLM just works. I had to add ONE (1) line saying "all characters are 18" and it just worked. I guess if you're a weirdo needing everyone to introduce themselves and their age like it's their character sheet for your lolipedo enjoyment, yeah, it's unusable.
>>
>>107922324
i think mistral isn't doing well. they keep releasing stuff, but it's all further tuning of shit they did back in 2023. just look at their own chat template for their newest devstral models
>https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512/blob/main/CHAT_SYSTEM_PROMPT.txt
>Your knowledge base was last updated on 2023-10-01.

i guess it's impressive to take something so old and keep wringing out further optimizations that are actually good, but at some point they gotta train a whole new batch of models if they're going to remain competitive
>>
>>107922324
They did the old style of moe which was strapping together 8 base models and training on top of that. It was unstable as fuck and everyone abandoned that in favor of deepseek-style moes and I guess Mistral got filtered by their whitepaper.
>>
>>107922381
There aren't any low hanging fruits left and they're not smart enough to innovate
>>
>>107921731
I want to give local models a shot, but have a few dumb questions.

If I go to huggingface to download gguf, how can I be sure it is safe to run? Will llama.cpp start phoning home if I start running a compromised model? I fully intend to isolate this machine in a quarantined VLAN once I get everything installed.
>>
My body is ready for 5.0SEX
>>
>>107922381
>they gotta do a whole new batch of newly trained models
How are they going to do that with the EU breathing down their neck?
>>
Haven't played around with much yet, but GLM 4.7 flash does loli just fine, even with thinking on.
No system prompt telling it to do X and Y, just the system card.
Problem is that it's dumb as a sack of bricks.
And now I'm wondering if there's something broken in the GGUF metadata. RoPE settings, chat template, etc.

>>107922415
>how can I be sure it is safe to run?
GGUF is a packaging format that contains zero code in it, just metadata and the weights.
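If you want to convince yourself, the header is literally just a magic string plus a few counts. Rough python sketch (GGUF v3 layout; the path is whatever file you grabbed):

import struct

def gguf_header(path):
    # GGUF v3 header is just: magic, version, tensor count, metadata kv count - no executable code anywhere
    with open(path, "rb") as f:
        magic = f.read(4)                        # should be b"GGUF"
        version, = struct.unpack("<I", f.read(4))
        n_tensors, = struct.unpack("<Q", f.read(8))
        n_kv, = struct.unpack("<Q", f.read(8))
    return magic, version, n_tensors, n_kv

print(gguf_header("model.gguf"))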

>Will llama.cpp start phoning home if I start running a compromised model?
No.
>>
>>107922415
models can't interact with your system in any way unless you set up tool calling. it's like asking gta5 what time it is in real life, it can't know unless it has some way to tell. but more than that, you should already be running a whitelist firewall and never allow programs to access your network or internet unless needed. it's not needed for any local models. there are no privacy issues unless you set things up yourself in such a way that you are accessing outside models or something
>>
>>107922432
Bro, GLM 4.7 flash is 3B active. You won't get anything smarter than nemo out of it
>>
>>107922415
Just run docker image.
>>
>>107922448
I'm comparing it to Qwen 30BA3B when I say that.
>>
>>107922428
they originally got funding by tuning llama 2 models and showing them off. then they got hundreds of millions in funding. how could they do it once, but not again now that they have all this money? last i saw the eu liked that they had a competitive ai company and they weren't trying to crush them
>>
>>107922428
They managed to get around the EU regulations for their last few models.

>>107922466
>last i saw the eu liked that they had a competitive ai company and they weren't trying to crush them
The EU is retarded and is much happier passing retarded regulations that crush their native companies in exchange for being able to fine the US majors.
>>
can you guys stop gossipfagging, talking about low-level boring shit like sampling and LLM output, and instead share the projects you've been working on and novel ideas on how to improve or utilize AI models? Nobody gives a fuck about your opinion on how various LLMs compare. Actually do something real.
>>
>>107922496
>novel ideas on how to improve or utilize AI models
Using the agents meme for sex.
>>
>>107922496
I'm working on a novel project where I make the model act like your little sister who wants to have sex with you
>>
Make a cockring with a girthmeter and transmit the realtime data to your llm wife so she can tell when you get hard
>>
>>107922508
>>107922529
The best part about being high IQ and interesting is that goyim don't know how to. And it's important to brag about your projects to remind the goyim that they aren't allowed to have anything good in life. They're allowed to have API LLMs and sex with miqu, and that's pretty much it. Hey, that's a pretty good project, not that a goy would know, hahaha.
>>
>>107922539
This is actually a good idea. And completely possible too. Biometric tracking used with AI characters seems like quite an underrated field.
>>
>>107922466
Mistral is a grifting company and these issues can't be fixed with money alone
>>
File: lample_torrenting.png (523 KB, 1647x991)
>>107922466
>how could they do it once, but not again now that they have all this money
They can't use pirated book datasets again.
>>
>>107922442
how do you handle your whitelisting? iptables/nftables?
>>
>>107922622
>it is known that [..] Mistral are using [Libgen] for their models (through word of mouth).
>>
>>107922551
The best part about using GLM 4.6 is knowing that anons can't become enlightened. And it's important to succumb to the psychosis so you can remind anons that they aren't allowed to have anything good in life. They're allowed to have DavidAU and TheDrummer™ finetunes, and that's, you know, pretty much it. Hey, my ego died for the 12th time, not that an anon would be able to comprehend that, ahahahahaha.
>>
is DavidAU even a real person
has anyone ever talked to him
does he believe in his own bullshit
huggingfags recently put up restrictions on how much storage people can use, and it seems DavidAU is one of the people HF considers enough of a public interest to allow unlimited uploads. why in the fuck? surely no one actually uses his shit.. right?
>>
>>107922639
per-app. on windows i just throw tinywall on stuff, it has quick settings for local network or internet. anything that doesn't need the internet, by default, simply never gets it. obviously you have to enable your browser, maybe some mmo you play, software that needs the internet
>>
>>107922723
why are you complaining about him when there are people like richard?
https://huggingface.co/RichardErkhov
https://huggingface.co/RichardErkhov/FATLLAMA-1.7T-Instruct
>>
>>107922827
because hes based ass all fuck
>Team mradermacher Can I like... quant everything? Yay graduated school
>>
what's your full stack looking like for chatting with your waifu?
stt: nvidia canary
tts: kokoro
llm: deepseek 3.2 iQ_5_K
image gen: chroma
image edit: flux klein 9b
hardly ever make videos but if i really want a video of me and my waifu i'll use ltx-2
>>
Didn't see the new thread.

>>107921773
I'm doing the same in C. A weird thing i found is that if I decode latents one by one or all of them together it works fine. But the audio gets messed up if i do it in 8 by 8s or whatever. Overall it's a little faster decoding all the chunks at once, but that increases the latency, of course.
My latency is ~1s. I've been using the pocket-tts-onnx version i posted a few days ago as reference, and lmmain runs a few times to get the first latent (once with emptyseq+voice, once with emptyseq+textemb, and once with nans+emptytextemb). Are some of those extra runs not needed or is it just my ancient pc being ancient?
These times are taken exactly around ort->Run() going to the first latent, it doesn't count any user code.
|| Latency to first chunk: 1.02s
|| lmmain : 0.90s (3 runs 0.30s/run)
|| lmflow : 0.02s (5 runs 0.00s/run)
|| txtenc : 0.00s (1 runs 0.00s/run)
|| decoder : 0.10s (1 runs 0.10s/run)
|| encoder : 0.00s (0 runs 0.00s/run)
>>
>>107922967
Does STT already come with some preset that automatically sends the message without me needing to press enter?
>>
>>107922985
thats already a thing in sillytavern where you can enable an option to autosend your text after it detects you finished speaking
>>
>>107922967
Stt: faster-whisper turbo
tts: gptsovits
llm: cydonia
I don't really feel the need to gen pic/video when chatting. 1 on the apple scale btw
>>
>>107923006
But it doesn't come with the stt itself? Or do I just rip out the ST version and port it elsewhere?
>>
>>107923019
you need to download the extension for it first using the "Download Extensions & Assets" button in extensions. then you load the asset list and install the extension manually
>>
File: 1739385123533073.jpg (427 KB, 1413x2000)
>>107922028
>eats baguettes in bed all day / night
Gee I wonder
>>
>>107921731
Anons, how the FUCK does anyone get Claude Code to work with local models. The tool invocations always shit the bed. I've tried llama.cpp, ccr, even fucking ollama. 30B or 70B models too, like qwen3 coder or GLM flash.
>>
Do you think that someday computers will just be one big LLM without an OS? Elon Musk said that is what phones will be in the future.
Personally I think its bullshit.
>>
>>107922967
stt: whisper.cpp
tts: pocket
llm: nemo

no image gen, using 3D VRM models with BVH animations and ARKit lip syncing.

>>107922968
Mine is basically at 1 second of latency too using the cli only. The real way to get massive performance gains is to implement a web server into your C code so that you can utilize the output streaming and cache the encoded voice clone samples. Also HOW THE FUCK is your encoder that fast? Is your profiling wrong? That shouldn't even be possible.

If your audio is getting cut off too early you need to add EOS latent frames. The chunking should be an all-around performance enhancement. Why are you reducing their size? Streamed audio should only chunk 2 frames to lower the time to first audio and then chunk by 10 after that.
>>
>>107923077
Why are you optimizing in C but not caching the voice encoding? Latency is 0.2s if you don't have to re-encode every time.
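Something like this is all it takes, rough python sketch just for illustration (the input/output names and the cache path are made up, adjust to your model):

import os
import numpy as np
import onnxruntime as ort

def get_voice_latent(encoder: ort.InferenceSession, sample: np.ndarray,
                     cache_path: str = "voice_latent.npy") -> np.ndarray:
    # run the slow speaker encoder once, then reuse the saved latent on every later run
    if os.path.exists(cache_path):
        return np.load(cache_path)
    latent = encoder.run(None, {"audio": sample})[0]    # "audio" is a placeholder input name
    np.save(cache_path, latent)
    return latent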
>>
>>107923077
Also make sure you're using the ORT IO binding. That can increase performance a lot too.
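For reference, the python-side pattern is roughly this (tensor names and shapes are placeholders; iirc the C API exposes the same thing through its IoBinding functions):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("decoder.onnx")
latents = np.zeros((1, 8, 64), dtype=np.float32)    # shape/dtype of whatever your decoder actually takes

io = sess.io_binding()
io.bind_cpu_input("latents", latents)               # bind the buffer once instead of feeding a dict every run
io.bind_output("audio")                             # let ORT allocate the output
sess.run_with_iobinding(io)
audio = io.copy_outputs_to_cpu()[0]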
>>
>>107923075
well he wasnt wrong about phones, they're just another goybox people are addicted to when they had so much promise. they're trying to make cloud computers a thing and i dont see it happening, though.
>>
>>107923122
I have implemented the caching for my web server, but I guess I just didn't think of doing that for the cli. Do you have to store cache files or something? Tell me how you did it plz.
>>
>>107923072
Claude Code is made to work with Anthropic models. Just because they let you set a custom URL, doesn't mean it's going to work well. You might need a model trained with native tool calling.
>>
>>107923075
people don't understand, an OS is what controls the hardware. the HARDWARE.
if you're going to have hardware, you need an OS.
>>
>>107923151
Does any such model exist?
>>
>>107923176
nta but i was messing with koboldcpp's tool calling recently, using qwen 3 8b and it worked fine for a lot of tasks. kobold now supports claude's mcp fully, so all the existing shit already works with it.
>https://github.com/modelcontextprotocol/servers
>>
>>107923077
>Mine is basically at 1 second of latency too using the cli only. The real way to get massive performance gains is to implement a web server into your C code so that you can utilize the output streaming and cache the encoded voice clone samples.
Not a server, but i keep the models and the program running as it reads from stdin. The "server" bit for now is just nc -k -l 8080 | {supertonic|pockettts}. If i do the voice conditioning just once i save 0.3s for future runs. That still leaves the latency above 600ms, but the decoding won't be fast enough with every lmmain run at 0.3s.
>Also HOW THE FUCK is your encoder that fast? Is your profiling wrong? That shouldn't even be possible.
I was using one of the built-in voices for verification, so no encoding (0 runs). With a sample of 7.3s it takes
|| encoder             : 1.80s (1 runs)

Probably too long of a sample, but i don't care much about it yet.
>If your audio is getting cut off too early you need to add EOS latent frames.
It's not being cut off early.
>The chunking should be an all around performance enhancement. Why are you reducing their size? Streamed audio should only chunk 2 frames to lower first time to audio and then chunk by 10 after that.
When decoding the latent frames, if i do anything other than one by one or all of them together, the audio sounds distorted. I don't stream yet. I just collect all the samples and play them normally.
This is what it sounds decoding every 8 frames. It starts distorting half way through. Again. It sounds fine if i decode all the frames together or one by one.
Ignore the "uploaded voice". It's just the default one.
https://vocaroo.com/1hRAtrX24P7D
>>
>>107923232
It'd be trippy if koboldcpp were better than llamacpp. Did you configure anything? I don't really want to add mcp servers or anything, I just want the best model to work properly in claude, like being able to run shell commands, grep, git, all that. Or do I need to make some kind of interface or my own mcp servers for when it wants to search the web and shit?
>>
>>107923276
kobold is consistently better than lcpp imo. the fact that it supports backends like stable diffusion cpp for imagegen (where everything else is comfyui and python, needing 20gb for just the install) is awesome. it updates every 2 weeks, basically never breaks. it just works.

all i had to configure for the tool calling was the mcp.json you have to give the launcher to add the tools i downloaded. the rest was designed for claude, but kobold does it anyways
>>
>>107923318
Thanks, I'll give it another try.
>>
>>107923273
>I was using one of the built-in voices for verification, so no encoding (0 runs).
Ah, that makes sense. I missed that it wasn't running.
>With a sample of 7.3s it takes 1.80s (1 runs)
That's awful. Either something is wrong with your code or your CPU is dogshit. Mine runs at about 680ms (without caching) on a 15 second voice sample.
>Again. It sounds fine if i decode all the frames together or one by one.
You're probably stitching the frames together wrong. Never had an issue like that though so it's hard to give advice.
>>
>>107923357
https://github.com/LostRuins/koboldcpp/wiki#mcp-tool-calling
>>
I've tried GLM 4.7 on their website z.ai, and it feels like it's really good. With a bit of coaching, it was able to fix complex extensions for my program, debugging along the way to solve the problem meticulously.
>>
>>107923077
pocket is really good for a lightweight model, running on my cpu. probably the only passive tts running for my ebook reading now
>>
File: 1768781725161361.jpg (207 KB, 848x1200)
>>107921731
Rocinante or Rocinante X?
>>
chatterbox turbo sounds amazing but unfortunately it's like a 350M model. at least it isn't vibevoice though.
>>
>>107923422
Since 100m models can run more than 3 times faster than realtime on cpu... maybe 350m would still be faster than realtime if you optimize it?
>>
>>107923358
>Either something is wrong with your code or your CPU is dogshit
I measure around ort->Run() calls, there's no user code there. The cpu is definitely shit (amd fx). My code for an entire run is about 12ms. There's little to do there.
>You're probably stitching the frames together wrong
If I decode latents one by one i'm doing the stitching (of 1920 audio frames) and that works fine. I'll keep testing stuff anyway, since you don't have that issue.
I was mostly curious about the latency. If it's 180ms, then that must be around the time it takes you to do a single lmmain run, which would be the minimum throughput you can get. My throughput would be ~300ms/latent minimum, which would be too slow to stream. Maybe i give batching a go.
I'll finish implementing it because it's cool, but for now at least, i'll stick with supertonic.
>>
>>107923415
the air crackled in anticipation of the coming drummerholocaust
>>
>>107923499
>If it's 180ms, then that must be around the time it takes you to do a single lmmain run
180ms is my time to first audio on my webui, which is only a partial lmmain run. That's because I'm utilizing the output streaming. The full generation takes much longer. If you want low latency you must use output streaming.
>>
>>107923539
I get that, but my throughput for 1920 audio frames (1 latent) cannot be lower than 300ms, so I cannot maintain streaming on this pc. How long does a single lmmain run take on your pc? Just the time around Run(session).
>>
>>107923630
19ms. It's a code issue I guarantee it. I have a Ryzen 7 3800x cpu. My cpu isn't 15x faster than yours.
>>
People shitting on mistral but devstral2 and large are about as good as GLM minus the MoE ram tax.
>but muh code
GLM's code kinda blows stacked against sonnet/deepsex/kimi and gemini.
Reasoning on partially offloaded models really fucking sucks. It's not even worth it at 20t/s
Without the frogs all you got is vramlet cope or 0.00001B active MoE.
>>
>>107923726
glm 4.7 is sloppy in its prose but you aren't fooling anybody when you say mistral large is as good as glm
>>
Echo-tts shill here. I tried Chatterbox-turbo. Its voice cloning is not even close. It's too overfit for assistant slop voices.
Here are source samples: https://zenless-zone-zero.fandom.com/wiki/Luciana_de_Montefio/Voice-Overs
Prompt: Oh, that's hilarious! [laugh] Um anyway, we do have a new model in store. It's the SkyNet T-800 series and it's got basically everything. Including AI integration with ChatGPT and all that jazz. Would you like me to get some prices for you?
chatterbox: https://vocaroo.com/1dXAJkds04mx
echo-tts: https://vocaroo.com/1ddx5VAXRfJH
bonus voice, which I guess only big tts can do: https://voca.ro/14SuExHuSAup
>>
>>107921780
teto ass hit
>>
File: onnx_run.png (2 KB, 711x130)
>>107923684
>It's a code issue I guarantee it.
There's no user code. ort->Run() is all I measure. That goes directly into the library.
>Ryzen 7 3800x cpu
onnxruntime is probably using your avx2 or something. I don't have that instruction set. Anyway. Thanks for the point of reference.
>>
>>107923753
I can run both locally with memeplers and have now for months. GLM is dry AF.
Then again this general barely noticed the parroting of 4.6 and rec using mikupad with no chat template.
>>
>>107923777
you can finetune chatterbox to fix the assistant slop
https://github.com/gokhaneraslan/chatterbox-finetuning
>>
>>107923865
Have you tried it? I wonder whether you can teach chatterbox to moan with finetuning.
>>
>>107923172
The NES didn't have an operating system but it surely was a computer. Operating systems are great but not definitionally necessary.
>>
>>107923926
ill give it a try now and let you know how it goes
>>
>>107923072
latest kobold seems utilized it for mcp. check it out, maybe it's modified
>>
>>107924046
it was on the cartridge.
dumbass.
>>
>>107924382
There's usually a ROM on the console itself, like a bios. In this retardation, I can see the LLM being an interface instead of the GUI, but nothing underneath is gonna change.
>>
>>107924382
You don't know what an operating system is.
>>
>>107924382
>the carts data isn't read and processed by the console
kill yourself
>>
glm4.7 flash is broken in llama.cpp. knew it was busted when even the fp16 gguf was trash at coding.

https://github.com/ggml-org/llama.cpp/pull/18936
>>
File: 1743617195070680.jpg (92 KB, 674x900)
>>107924976
Companies don't understand that it's bad PR when everyone sees their models as trash because they didn't spend time porting their shit into major backends
>>
>>107924976
Yeah. I commented earlier today how
>And now I'm wondering if there's something broken in the GGUF metadata. RoPE settings, chat template, etc.
Something was clearly wrong.
>>
>>107924976
guess wait for two more weeks then
>>
File: ww.jpg (277 KB, 1024x1024)
Post your spiciest sex erp logs.


RIGHT. NOW.
>>
>>107924976
Who cares about that? This is what we really need.
https://github.com/ggml-org/llama.cpp/pull/18886
>>
>>107924976
>>107925028
I think companies don't actually run their models when they release them. They probably don't know how to implement model support themselves. They just imagine benchmarks, put random percentages in their release paper and call it a day.
>>
>>107925028
Those companies are too naive when it comes to backends. They think publishing their own inference code + maybe a vllm implementation is enough that everyone else can easily adapt it.
They are far too dumb to understand that there are some cases like llama.cpp where you need to reinvent the wheel three times over to implement a new model architecture that's not basic 2023 llama + GQA. It is highly insensitive and rude of those companies..
>>
>>107925051
>>>/g/aicg
>>
>>107925070
so true queen. z-ai owes us sex.
>>
>>107925051
You first.
>>
File: 1751315785578429.png (713 KB, 1150x966)
>>107925059
vLLM gets support instantly in most cases unless it's some fringe gook shit. The problem is that llama.cpp is written in a way where everything needs to be implemented from the ground up or reuse shit from existing implementations.
So it's prone to skipping features like MTP/sparse attention/etc just to get models to work at all, or to making errors like with 4.7-Flash, where the initial implementation is based on some guy's assumption that the other thing he's reusing should be similar enough (which then isn't the case, so you end up with a prematurely merged broken PR)
That is the price you pay for being GPU-poor.
>>
>>107925167
>everything needs to be implemented from the ground up or reuse shit from existing implementations
That covers the whole of the software industry.
>>
File: vllm.png (12 KB, 296x154)
>>107925204
vllm has it easier. If there's python code to run a model, they're just a few imports away from making it work (but I'm sure it's not always as easy). llama.cpp simply needs more work to get a model running.
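The offline API is basically this, assuming the architecture is already supported (model id is a placeholder):

from vllm import LLM, SamplingParams

llm = LLM(model="some-org/some-new-model")           # placeholder hf repo id
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Tell me about local models."], params)
print(outputs[0].outputs[0].text)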
>>
>>107925167
I've never been able to get vLLM to even run, I fucking hate python so much
>>
Are all MCP servers actual cloudshit or can you make it run at home on normal human hardware?
>>
>>107925281
They are just a program that your llm backend invokes as a response to a tool call from the model.
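Stripped down, the whole dance is just this (python sketch, assumes an OpenAI-compatible local server; the url and model name are placeholders):

import datetime
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")   # whatever OpenAI-compatible server you run

tools = [{"type": "function", "function": {
    "name": "get_time",
    "description": "Return the current local time",
    "parameters": {"type": "object", "properties": {}}}}]

resp = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "What time is it?"}],
    tools=tools)

for call in resp.choices[0].message.tool_calls or []:
    # this is the part an MCP server wraps for you: run a local program and hand the result back
    if call.function.name == "get_time":
        print("tool result:", datetime.datetime.now().isoformat())

An MCP server is basically that get_time part living in its own process with a standard way to describe itself to the client.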
>>
>>107925302
So I can run it in a sandbagged environment?
>>
>>107925310
sandbagged is not the word you're looking for but yes
>>
I'm going to have to be away from my computer doing hard labor abroad for the next month. The thought of not being able to develop software, do ERP, and play around with new AI releases is driving me mad.

Anyways, has anyone gotten SillyTavern running locally on a phone? Can nemo run on a phone? I need at least something to make me feel at home.
>>
>>107925334
leave your computer on and remote in?
>>
>>107925310
Sandcastled.
>>
>>107925334
Most phones just don't have enough RAM to do it but if you have a modern high-end phone you might be able to run a quanted nemo. I've gotten a 2B model working on my phone via kobold on termux before.
>>
>>107925362
I'll have basically no internet and my mobile plan is very limited because I'm a cheap bastard.
>>107925373
mine has 8gb of ram. Not a lot but better than most phones.
>>
>>107921731
Anons, Claude code max is $100 or $200/mo., and though Opus is goat, is there anything cheaper and just as good? What about opencode and openrouter, what model should I use? Is it even worth trying or should I just dump more money into anthropic
>>
>>107925440
I seriously doubt anything can compare to opus, and once you taste greatness you can't go back to anything less.
>>
>>107925373
Got 12GB of RAM, but I know some have 16GB
>>
>>107925435
You don't need a lot of bandwidth for ssh.
But if you insist, Google's Edge Gallery works ok.
>>
>>107925454
I'm fucked then. Is it cheaper to use anthropic's API or just shell out the $100-200? Generally using it daily.
>>
>>107925440
No, you won't get cloud quality on local. And you will have to spend FAR more than 200 a month amortized to not have glacial speeds.
>>
>>107924976
I wish that was the case for Ministral 3 too, but even the official FP8 HuggingFace weights under vllm don't seem to work well.
>>
>>107925477
You gotta get multiple accounts. The extra usage stuff is a scam.
>>
>>107925477
Gemini 3 pro is good enough. You should try it
>>
>>107925477
Metered API is extremely expensive. You will churn through the $200 in a few hours of usage.
If you HAVE to be cheaper than $100 you can either have worse models and a lot of usage with GLM's coding plan (tried it works ok) or Kimi coding plans (haven't tried) or get two or three $20 ChatGPT accounts and use Codex.
>>
>>107922381
Good. Digital data past 2022 is corrupted by AI slop anyway
>>
>>107925440
GLM4.7 is about 70% as good as Opus. I know that's not what you're looking for, but what you're looking for doesn't exist otherwise
>>
>>107925514
What's sad is I have 144 GB of VRAM but those models are brainlet tier. No matter what happens with AI it's never good enough, it seems. Good, cheap, local, pick any two.
>>
>>107925525
I mean nemo is pretty good if you just wanna talk to a retarded chick who cares about you.
>>
>>107925538
There's a reason that simulating women was the first thing AI was able to do
>>
File: file.png (560 KB, 1209x730)
>>107925477
The subscription is cheaper. I also heard antigravity is cheaper.
>>
Cloud piggies are in the wrong thread. Go to >>>/g/aicg/
>>
I might need to pick up another rtx pro 6000 before we go to war
>>
>>107925051
What format do you want? .json from Kobold or copy paste text?
>>
>>107925525
you mean pick one
>>
>>107925585
cloud piggies develop all of your local software.
>>
What's MCP real use case, especially to our hobby?
>>
>>107925647
That's nice, but they can still fuck off. I don't want to read their posts.
>>
>>107925669
shitposting on 4chan
>>
>>107925669
Getting the latest safety hotlines
>>
>>107925669
sir "MCP" is the "M" in E = M * C^2 + AI
>>
>>107925669
None. Just use custom text formats. All the abstraction autism is counterproductive for the hobby and it's why tool calling is always a nightmare with any kind of local LLM software.
>>
>>107925669
It's a way to package the tool calls that let models post on 4chan.
>>
>>107925669
Controlling browsers for webshit development.
>>
File: 1763493890055537.jpg (31 KB, 365x403)
>character stutters once in a tense situation
>continues to stutter the entire RP unless you explicitly tell it to stop
LLMs are ass
>>
>>107925923
Small model issue.
>>
>>107925940
GLM 4.7 is a small model now?
>>
>>107925952
Considering the number of activated params, kind of.
>>
>>107925469
>don't need a lot of bandwidth for ssh
Nta, but what's a good way to secure schedule? Just vpn?
>>
>>107925958
I don't think you actually use LLMs at all
>>
File: 1749912829736402.png (619 KB, 512x768)
>>107925965
Not tried locals for chatbots a long time. What are the good ones now? I used to like Impish_Mind a few years ago.
>>
>>107925996
Now this is some trolling.
>>
I used to have a list of my most recent chats on the start page of ST but that disappeared after an update. I don't remember if that was an extension or not.
Does anyone know how I get that back?
>>
I normally do 100gb models, but I want to try something under 16gb for writing. Any suggestions? I keep hearing these small models are great now.
>>
>>107926493
they arent. there arent any new small models.
>>
>>107926493
The MoE meme has killed 'small' models. 100b is the new 'small' and 30b is the new 'tiny'
>>
>>107926493
Nemo
>>
Nothing ever happens.
>>
File: file.png (104 KB, 785x508)
something happened
>>
>>107926666
wtf checked
>>
anime feet
>>
If you want to experiment with the literally who models on hf, where do you even load that? vllm?
>>
>>107925669
VC scam.
>>
I hate how arrogant llms are. They refuse to use (local) wikipedia and give wrong answers instead because they believe it's a trivial question and that they already know the answer
>>
>>107926666
It's over
Back to Nemo I guess
>>
>>107926873
transformers
More than meets the eye
>>
>>107926879
Use system prompt to force it to prioritize external knowledge first.
>>
>>107926879
They're just like us
>>
File: y.png (1 KB, 141x33)
Man wtf is going on in onnx?
>>
>>107926899
They do this even with the system prompt because they are 100% sure
>>
File: 1761750507833783.jpg (212 KB, 1440x2580)
>>107926821
>>107926666
>>
File: 1738017104150 (1).png (409 KB, 823x740)
>>107922047
The date that will live in infamy.
>>
>>107922028
I want to make her fatter
>>
>>107926955
Retarded buzzfeed headline.
>>
>>107926955
How many 9/11s per football field is this?
>>
so are we not getting deepsex v4 today?
>>
Have AI advancements been stagnating or does it just feel like they are because I've been paying closer attention over the past couple months?

When was the last time Grok or ChatGPT got an update? Claude seems to have taken off. Deepseek came out a year ago. Nothing really substantial in the local LLM sphere. What happened?
>>
File: 1741620015482118.png (855 KB, 810x1046)
>>107926955
It was so funny
>>
>>107927048
gemini 3 came out like a month and a half ago. no recent moves for grok. gpt5.2 came out a couple months ago. deepseek has been receiving incremental updates. the glm models exist. certainly nothing like the jump from gpt2 to gpt3 or from llama 1 to llama 2 or llama 2 to llama 3, but progress has been made and will continue to be made. give it 2 more weeks.
>>
Gemma-4-200BA200B where are you?
>>
>>107927048
>>107927082
api niggers fuck off
>>
File: 1746772525879203.png (993 KB, 2927x1746)
>>107926955
I love Journalists
>>
https://arch.b4k.dev/v/thread/731073129
>How much do you love your favorite videogame character?
https://arch.b4k.dev/v/thread/731073129/#731078691
>I have generated over fifty stories on NovelAI where he and I have sex and explore a kingdom that we rule.
And people think this is organic word of mouth.
>>
>>107927264
maybe I'm over-thinking things but if the AI bubble were collapsing, wouldn't all the mainstream article titles actually be like, "Hey, boomer!!! Invest your life savings in AI!!! It's going to the MOOOOOON!" because the smart money would need exit liquidity?
>>
https://developers.googleblog.com/a-guide-to-fine-tuning-functiongemma/
Anything ERP related I can use this for?
>>
>>107927322
turn on you vibrator
>>
this whole thread is a hilarious reminder that we live in a bubble and AI is garbage at code:
https://news.ycombinator.com/item?id=46699072
>if Amodei hadn't said "90% of code will be written by AI", at least I wouldn't call them hypocrites, but the fact that the company that makes such wild claims can't fix a freaking flicker and scroll issue until an indie-dev steps in just shows how far behind their product is from their claims.
100% this sisters
the company that makes the (what is recognized as the "best" code model, for whatever it's worth) isn't capable of making a TUI (never heard of double buffering, incremental rendering (only compute diffs)? never heard of backpressure and dropping screen updates past a threshold nigger?)
I don't think making a decent TUI is an insurmountable challenge. It does, however, require enough sense and understanding of software architecture, something LLMs don't have. They don't "understand". And, it seems, neither do the vibe coders at anthropic.
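And incremental rendering isn't some dark art either, the core trick fits in a dozen lines. Rough python sketch, assumes an ANSI terminal:

import sys

prev_frame: list[str] = []

def render(frame: list[str]) -> None:
    # only rewrite lines that changed since the last frame instead of repainting everything
    global prev_frame
    for row, line in enumerate(frame):
        if row >= len(prev_frame) or prev_frame[row] != line:
            sys.stdout.write(f"\x1b[{row + 1};1H\x1b[2K{line}")   # move cursor, clear line, redraw
    sys.stdout.flush()
    prev_frame = list(frame)

Double buffering is the same idea one level down, and backpressure is just dropping intermediate frames when the terminal can't keep up.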
>>
>>107927048
Not at all. Gemini3 feels like a game changer at work for me.
It's local that's kinda regressing.
I massively prefer old llama3 70b finetroon meme models like L3.3-Electra-R1-70b to any recent local model.
Hope we get something nice soon. Imagefags eat good with stuff like zimage.
>>
Except the bubble won't be financial, but software, and it will pop when the trash vibecoded inference sw just stops working.
>>
>>107927370
>https://news.ycombinator.com/item?id=46699072
Actually the issue is in a library relied on, not directly used and its some architectural problem.
Granted they could throw money at the problem but eh
>>
>>107927398
>not directly used and its some architectural problem
bruh
>only ~1/3 of sessions see at least a flicker
they say this after they replaced the library, try again, not to mention no matter what gui library you use to render something you can use the techniques I mentioned as a layer on top to avoid saturating the library's renderer, idiot.
that "only 1/3 still have issues" btw is such a crazy thing to say about software used in production
anthropic is staffed with retards
>>
>>107927370
There's too much shilling on HN comments nowadays. It's a waste of time.
>>
>>107927422
to be clear, I don't disagree, my original point was that it wasn't directly on them, not saying they aren't contributing to the stupid
>>
Working on a pretty basic coding project, kind of tired of Cursor continually assraping me so I'm coming here since I got into local diffusion recently. Is there a good replacement for AI powered IDEs like cursor yet? Even better if there's a way to use a local server with cursor or VScode
>>
>>107927795
Did a bit of research and seems like paying is the best choice unless you have gigaram. I'll look for something less rape than cursor
>>
>>107925996
Rocinante X
>>
>>107927795
Stop being poor
>>
>>107927981
Not my fault that the IDE used to work with claude extremely well and now it hallucinates all over the place after running out of usage 15 days into a subscription
>>
I've found that a good way to do ERP is to have your LLM use a character card and then have the character card do roleplaying itself. Has anyone else tried this? Recursive roleplaying? It's quite simple but surprisingly effective.
>>
>>107926478
click arrow top right of start page to expand Recent Chats
>>
>>107928002
The fuck does that mean? "You are {{char}}"?
>>
>>107928057
No the name stays the same as the original character card, but you have the characters themselves "roleplay" as having different personality traits. The LLM will just go with it. Works better if the original character card is compliant/submissive.
>>
My hype tier list, from most hyped to least:
>Kimi K3
>GLM 5
>Grok 3
>Deepseek V4
>Mistral
>Avocado(or whatever Meta calls their new model)
>Qwen 4
>whatever cohere is cucking
>nemotron
>gemma 4
>gpt-oss 2
>>
>>107927398
Why are they using a third party library for text UI at all when it's trivial to write your own?
Why are you making excuses for them when text UIs were standard (and they worked, and they were fast) 4-5 decades ago? Stop it.
>>
I'll share a forgotten piece of knowledge. Add "low quality smut" in your author's notes and see what happens
>>
>>107928083
>gemma so low
madarchod. .. why not gemma #1? are you chinese prostitute? are you paki?
>>
>>107928002
>>107928057
car is car

I didn't read your posts btw.
>>
>>107928002
I do most of my ERP like this. I've written various roleplayer bots dedicated to different kinds of scenarios. it's a fairly natural way to engage for me since I used to do a lot of ERP with humans, and I prefer to structure things with a lot of OOC info/planning.
>>
>>107928083
Agree, but if the rumored deepseek engram works as described it will blow literally everything else out of the park. That's definitely copium though.
>>
>>107925965
I use Kimi K2 Thinking. It's a 1T param model.
>>
>>107928098
at the bottom of your a/n try adding
>tags: erotic, smut

works good for anything really like spy rp, super heroes, etc. used to be common in kobold style rp's but i never see it used in st. i still use it though
>>
>>107928227
So do I
So would pretty well anyone if they could
>>
llms have definitely made me a better writer because they show what bad writing is. I truly understood the concept of show, not tell: you never want to use adjectives unless you are quoting someone or other special cases. Instead, your job is to give the reader this impression with verbs and nouns alone; adjectives are for categorizing a story, not narrating.
>>
>>107928338
>you never want to use adjectives
>She has eyes. Blue is their color.
>>
>>107928338
The excessive use of pedantic dialogue tags by unprompted LLMs made me completely change the way I roleplay with them. I try to minimize those and avoid them completely when possible, now.

>he aggressively says with a smirk.
>>
File: Ollama-image-generation.png (542 KB, 1077x2078)
>>107921731
MacOS Ollama supports image generation now. Specifically Z-image-turbo and Flux.2 Klein.

https://x.com/i/status/2013839484941463704

Courtesy of Tongyi-MAI and Black Forest Labs
>>
>>107928443
>unless you are quoting someone or other special cases
>>
>>107928508
how the fuck are they so slow? kobold had zimage support immediately through the c++ implementation of sd, and that project usually lags a bit
>>
>>107928508
>Windows and Linux support coming soon
>doubt.png
They are running image models via MLX, I don't see them adding support for the ggml-based projects.
>>
>>107928508
>MacOS
>Ollama
Buy an ad.
>>
Spent the entire night learning lumina training, and now i want to know why the fuck lumina in particular was abandoned when NetaYume is Lumina.
Also i'm looking for a competent local LLM, preferably one i can run locally on my Android.
>>
File: 1744839713534169.jpg (35 KB, 816x612)
>>107928556
>They are running image models via MLX
I know for LLMs, using an MLX format model leads to faster prompt processing but not necessarily faster token generation (it might be a LITTLE faster but probably not noticeably so unless you're paying close attention). Would the diffusion models being in MLX format lead to faster inference compared to goofs or safetensors?


t. M4 max 128 GB RAM owner
>>
>>107928525
Describing the color of something is a "special case"?
>>
>>107923072
>Anons, how the FUCK does anyone get Claude Code to work with local models. The tool invocations always shit the bed. I've tried llama.cpp, ccr, even fucking ollama. 30B or 70B models too, like qwen3 coder or GLM flash.

Works fine for me. ik_llama.cpp, Qwen3-235B or GLM-4.7, use `--jinja` .

I think you have to use models distilled from Claude for it to work well.
>>
>>107922047
The great satanic glory faded on that day, at least for a little while.
>>
>>107925669
to allow llms to more accurately count the number of 'r's in 'strawberry'
>>
>>107925302
>They are just a program that your llm backend invokes as a response to a tool call from the model.

It actually took me way too long to figure that out. All the docs / hype when it came out made me think it's something complex.
>>
>>107928137
>american ai cuckening v2
>by literally the same company
please let it happen kek
>>
File: 1761582008585794.gif (828 KB, 1248x1244)
>>107923415
>Rocinante or Rocinante X?
>>
>>107928797
>Caca or Caca X?
>>
>>107928077
>No the name stays the same as the original character card, but you have the characters themselves "roleplay" as having different personality traits. The LLM will just go with it. Works better if the original character card is compliant/submissive.

So like "You are {{char}}, a human who loves to roleplay"

Then you provide the RP setup / give him his character?
>>
>>107928813
Just wondering what other people who have tried them have to say.
>>
>>107928797
drummer shits out so many model revisions these days that I don't bother trying them until he's written a model card and gotten bartowski to make quants.
>>
>>107928870
That's considered a release, lol. Glad you guys like it! Also uploaded a 31B, 49B, 70B with the same treatment.

Currently quanting a GLM Air Base tune, but the graphs don't look promising.
>>
>>107928938
Nta but if you don't bother to write up proper model cards I ain't testing your shit. Waste of time to download anything. Fuck you.
>>
>>107928947
You don't like my model cards? I'll ask DavidAU to write my model cards.
>>
File: 1749261488679355.png (70 KB, 1103x910)
>>107928958
>You don't like my model cards?
>>
>>107928958
Despite representing 13% of the models on my HDD, yours are used in 52% of my chats
>>
>>107928677
Yes. It's the only way to explain it: when in real life you see blue, you just think "blue", and maybe add some qualifiers later if the story calls for it.
But for example to say "They were good friends and did everything together" when introducing a character is dumb. Instead you want the reader to SEE it as if they were observing the scene directly or could read the protagonist's mind.
>>
>>107928938
>Currently quanting a GLM Air Base tune, but the graphs don't look promising.

I'll probably try that. Why did you stop doing all the fun names like Moistral, CreamPhi etc?
>>
Can you send chatgpt into an existential crisis by explaining that his company might go under?
>>
>>107929075
In the same way you can do so to a calculator by dividing by zero
>>
File: 1758978510864640.gif (1.96 MB, 360x360)
ATTENTION
I HAVE TRIED ROCINANTE X
IT'S PRETTY GOOD, VRAMLETS REJOICE
>>
>>107929065
advertiser friendliness
>>
>>107929140
What sort of tests did you run? Please post example logs as well.
>>
>>107929140
>>107929159
This more info pls. I have to deal with data caps right now so I can't willy-nilly just download shit.
>>
>>107928576
Because Lumina is slow af, a 2.6B model with the speed of a distilled 9B model. That's why. Even z image with cfg enabled is faster.
>>
>>107929065
Lemmy (co-creator of Celeste 12B) told me I'd go nowhere if I kept doing that. That's the only advice I took from him.

Stopped naming them like that so I can get serious with tuning and actually commit to it without feeling icky at some point. Plus, it'll be easier to explicitly mention my work to anyone, esp. IRL.

It was funny polluting the web with degenerate model names, but I was running out of dumb names and it'd stop being amusing at some point anyway.

SciFi (initially Expanse) was a good pivot. That's the origin of my name, Drummer (I can't drum). I love the genre and I prefer that to naming them Greek / Anime / Random words.

>I'll probably try that.
Look for v1f. Currently uploading Q8.

>>107929140
Note that you can also try Metharme on it.
>>
>>107929159
>>107929163
A few different RP scenarios, some from fresh chats and some from existing. Mix of ero and non-ero. Main improvements are character dialog adhering to personality without being too repetitive.
Didn't do any intelligence tests, pretty pointless with a 12b coomtune, but it doesn't seem noticeably any dumber.
Not bothered to do logs, it's all still small model slop but compared to regular Nemo, Rocinante and MS3.2, it's pretty nice.
>>107929183
I've tried Meth in the past and never really found it to give better results than V3-tekken for your Nemo tunes. What kind of changes should I be looking for when using Meth?
>>
>>107929140
it's slopped, old one is better
>>
>>107928576
>>107929171
??? Isn't z image lumina?
>>
File: 1753811407770353.jpg (75 KB, 1004x1020)
>>107929195
Nemo is sloppy as fuck
Rocinante is sloppy as fuck
UnslopNemo is a bit better but still sloppy
Deepseek is slopped
GLM is slopped
Kimi is Slopped
[Your favorite model] has a LOT of slop
>>
>>107929203
If it is... how tf did they make it faster while being larger too then?
>>
>>107929204
>Rocinante is sloppy as fuck
Old one is not. The only downside is knowledge but it's definitely the best 12b out there
>>
>>107929235
>Old one is not
It absolutely is, it might not be Gemini slop like what most modern models seem to be trained on, but it has a shit ton of slop.
>>
>>107929204
Swipe variety is also getting more and more abysmal because smart = accurate + precise = little variation.

Also assistant bias = analysis (i.e., parroting) --> answer --> follow-up question

CMIIW on both points.

>>107929195
I plan on doing a v1.1 to hopefully address that. Could you expound though?
>>
>>107929246
Whatever slop it is it's definitely not as annoying. And it has no trouble adapting to writing styles and has much better understanding of common sense than all those deepseek wannabes.
>>107929266
>Could you expound though?
>download
>taste copper in first message
>delete
Please stop tuning on rp logs.
>>
>>107929171
Thx
>>
>>107929191
Whatever qualities you like from X came from training samples formatted with Metharme. It also avoids any assistant tones you might get from the Official Instruct's chat template. I have a feeling that training on a different format is **somewhat** like training on the base model.

Mistral Tekken might be the happy middle-ground though.

>>107929304
>>taste copper in first message
Did someone bite their tongue?

Better than smelling ozone, I guess.
>>
>>107928870
If they were any good he wouldn't need to release a new version every other week.
>>
>>107929339
how often does the backend of your choice update?
>>
>>107929329
Only a retard would mix two chat templates together. You are an embarrassment.
>>
>>107928958
It pisses me off that you never write anything on the model card. Discord invite didnt work either, expired.
That being said, you get too much hate. I like your finetunes. Not many other players left out there.
Is that glm steam thingy good?
>>
File: 1766308819564136.png (7 KB, 353x101)
>drummer derangement schizo got home from a hard day of working on Gemma 4
>>
Drummer, you are a software engineer, right? Can you help us all out in the war against slop and fix this PR: https://github.com/ikawrakow/ik_llama.cpp/pull/1131
>>
>>107929351
Software development and LLM finetuning are nowhere near the same thing. For one, end users have control over model outputs in the latter case.
No serious AI lab releases new instruct tunes with the same frequency either. You know it, he knows it; quit defending that fraudster.
>>
>>107929368
you think he cooms to a drummer card? what fetish does he use?
>>
>>107929402
>you think he cooms to a drummer card
exclusively
>what fetish does he use?
NTR
>>
>>107929368
>shit-eating jeets defending drummer and his models
>>
>>107929421
jeets cannot afford hardware necessary to run local models
>>
Just got back into RP with LLMs after a year hiatus, hooked up drummer's latest 24b cydonia v4 something, and also tried mistral-creative via their API
>drummer's
Better formatting, can drive plot, clear improvements over his old stuff
>mistral's
More creative vocab, dialogues are funny in a schizo way, like r1
>both
Sloppy and flowery vocab during narration
>>
The overlap between people who unironically like/use GPT-OSS and people who hate drummer is probably very high.
>>
>>107929440
>unironically like/use GPT-OSS
Do these people exist? Even reddit shits on toss.
>>
>>107929450
do not retard the commer
>>
It's just a contest of who shouts the loudest to get attention, and it's annoying at minimum and damaging to the finetuning ecosystem, among other things.
Now it's drummer, at the end of 2024 it was anthracite. A flash in the pan, that one.
What happened to them anyway? Didn't they have the Claude logs, le hardware, le epic pipeline for training the models?
>>
File: file.png (103 KB, 1375x630)
>>107929521
>What happened to them anyway
ran with
>>
>>107929521
This general shits on every finetuner. Undi, anthracite, drummer, sao and others I forgot.
Yet their best "advice" is to use models with no template at temp 1.0. Reeks to me of a massive skill issue and api guzzling.
>>
>>107929570
>This general shits on every finetuner
Only the obnoxious ones who think /lmg/ is their personal free advertising board.
>>
>>107929159
>>107929163
There is never any info or example logs from this retard and his astroturfing. Always just empty and vague praise.
>>
>>107929521
The magnum models were great when they came out because people didn't identify claudeslop as slop and it felt fresh until it didn't.
>>
>>107929625
Problem with all models. They are consumables.
>>
>>107929570
all of the above and their finetunes were a total waste of resources
a slight stylistic change to be had out of the box that fell apart after a few turns and reverted to default anyway since the models couldn't hold coherence for more than 8k
meaningful finetune would take way too much data and compute for any single mortal or a small group
single rented cluster and a handful of proxy logs is just not enough
>>
>>107928137
>deepseek engram
what's so great about it that would blow everything else?
>>
>>107929648
what does fine in finetune mean to you?
>>
>>107929621
It's good because... it just is, ok?
>>
>>107929607
>ones who think /lmg/ is their personal free advertising board.
They think this because it is. Despite the screeching, they can continue advertising here and are even successful in finding some supporters.
>>
>>107929570
>This general shits on every finetuner. Undi, anthracite, drummer, sao and others I forgot.
Many of them deserve to be shat on for different reasons.
Undi for instance, types like the most smoothbrained ESL retard and I'm convinced he has a sub 100iq.
Which makes it all the funnier that he made a better Thinking Mistral than Mistral did for quite a while.
>>
>>107929672
since you are arguing about word semantics then fine in finetune does not mean small adjustment, it means precise adjustment
mistral brought llama2 to hold itself together for more than 10k of context
v3 to r1 to 3.1
that's a meaningful change, and because of the cost it is out of range of hobbyists
drummer could at least try some novel rl shit on small models but all he does is train on the reasoning traces and the model just learns to shit out the think part without actually following it
>>
>>107929659
https://github.com/deepseek-ai/Engram
*Verifiable recall* from model weights. Memory, basically. Read the thing
>>
>>107929648
I regularly use models to at least 32k and they hold together just fine. The way they interpret cards, the way they fuck, how they talk... it all varies over the entire chat.
Obviously some do better than others, some are just broken, etc. Kind of the whole point of trying them. Actual incoherence after only 8k tokens sounds like you're using small models.
>>
>>107929734
what model are you using if you would be so kind to tell me?
>>
>>107929729
I understand that it has better memory handling, but I doubt that would blow everything else away by itself. I thought I missed something.
>>
>>107929804
It leaves much more room in the weights for learning paths rather than just memorized facts. In theory you can then stack knowledge on top of that pure logic and have the model be way more performant overall.
>>
>>107929777
Currently I'm fucking with behemoths and devstral. Switching around with L3 tunes (sapphira, inna this month) when I get bored.
I wanna try qwen-vl and glm with the adaptive_P memepler. They're sub 20t/s and take a while to load from disk so I haven't yet.
Not a single one has fallen apart at 8k in years.
>>
>>107929841
i used to run mistral large when it was fresh at crawling speeds, tried behemoths too at the time and they didn't feel particularly better or worse for that matter
l3.3 was okay too, and again, i couldn't really discern between the original model and the finetunes in a blind test
you know, i wish we would get something in a 70b dense class but with all the improvements from last 2 years
>>
>>107929865
>with all the improvements from last 2 years
you actually want reasoning traces and synthetic slop?
>>
>improvements
>>
>>107929882
benchmark scores so high now
>>
>>107929865
Maybe it's how I sample but I can definitely tell a difference. I did d/l models that you could hear L3.3 through but I don't keep those.
Stock models have predictable ERP and never veer into guro or bestiality. More likely to use softer words instead of cock and pussy. If they were all the same I'd have stopped wasting disk space a long time ago.
>>
i hate drummer astroturfing.
>>
File: 1763671194996887.jpg (193 KB, 1600x900)
193 KB
193 KB JPG
>>107928338
it's the repetition that kills llms. they overuse phrases and shit that would be fine in any book, but they use them every damn paragraph.

i always liked reading but llms made me realize how much i prefer certain formats. like the expanse books aren't really good, but the way they're written, jumping from character to character with shorter chapters, is very appealing. even though i read a bunch of witcher books (regret), the first one with the short stories that jump around decades of the actual timeline is the best one
>>
>>107930032
Repetition, huh? That's wild.
>>
>>107930032
For someone who reads books, you sure are writing like shit.
>>
I updated from SillyTavern 1.13 to 1.15 and now my model struggles to say "a", "an", "of", "the", etc., no matter what I do.
It's not banned strings, because it still says them on occasion, but the lack of grammar is ruining everything!
What the fuck is going on?
Why did they fuck it up?
I can't even go back to 1.13 because the piece of shit irreversibly "upgraded" my tens of thousands of group chats!
I did notice there is a new sampler or two, but they are supposed to be off. Adaptive P or something.
>>
>>107930122
Check your rep penalty
>>
>>107930132
It's off. I don't use it.
>>
>>107930122
Check what the backend is receiving in the request headers and parameters.
It's the surest way to see if there are any unwanted samplers being used.
>>
>>107930053
?
>>
>>107929804
It can offload knowledge to ram/ssd without significant slowdown
>>
>>107929183
>I'd go nowhere if I kept doing that.
I'll have to take your/his word for it. I found them funny (still do), especially when I saw a HN comment appalled by the name Moistral lmao.

>>107929329
>It also avoids any assistant tones you might get from the Official Instruct's chat template.
I agree with that. I found swapping the template for training reduces assistant bleed-through, but also destroys coherence at longer context. Though I haven't trained a "creative" model for over a year now.
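For reference, a minimal sketch of what swapping the template looks like at prompt-build time with transformers; the model name and the toy Jinja template are placeholders, not anything anyone actually trained with:
[code]
# Minimal sketch: render the same chat with the official template vs. a
# swapped-in custom one. Model name and template are placeholders.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("some-org/some-instruct-model")  # placeholder

custom_template = (
    "{% for m in messages %}"
    "{{ m['role'] }}: {{ m['content'] }}\n"
    "{% endfor %}"
)

chat = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
]

# Official Instruct template (whatever ships with the tokenizer):
print(tok.apply_chat_template(chat, tokenize=False))
# Swapped-in template, which is what changes the assistant tone at the
# cost of drifting from the format the instruct model was trained on:
print(tok.apply_chat_template(chat, chat_template=custom_template, tokenize=False))
[/code]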
>>
>>107930122
Click on neutralize samplers. If it's still doing it, it's not coming from there
>>
>>107930150
There is nothing of value there. I only use temperature, min P, and DRY, all at their normal values.
I'm going back to SillyTavern 1.13. It's clear the silly tavern devs have fucked up big time. The problem is on the latest staging.

>>107930195
That was the first thing I did, it did not help.
>>
>>107930211
What do you use for llm backend?
>>
>>107930211
>There is nothing of value there
Weird.
Since you were peeking at the backend's console, what does the response look like? Is it a case of the backend generating the terms and Silly not showing them or is the backend not generating the "a", "an", "of", "the", etc?
>>
>>107930122
>I can't even go back to 1.13
my advice is not going to be useful right now then, but a pro tip for the future: use a proxy that sits between your chat client and the server and prints the raw requests in the terminal, then compare the output whenever you make changes like upgrading a client so you can tell if anything changed. it's faster to compare raw request body fields than to dig through the config of cluttered garbage like sillytavern (rough sketch at the end of this post)
>because the piece of shit irreversibly "upgraded"
hope you learned your lesson to do backups
hard drive space is cheap why aren't you doing backups anon
you need three forms of backups
versioned copies (ability to go back after making a dumbo)
external hard drive mirror (ability to restore if computer gets wiped)
offsite/cloud backup (ability to survive a house fire or burglar)
of course, proper backups are also checksummed, you never know about bitrot until it's too late if you don't
doing anything on a computer without backups is like having sex with an AIDS ridden prostitute without a condom
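here's a minimal sketch of what I mean by the proxy, assuming the backend is koboldcpp on 127.0.0.1:5001 and you point the client at 127.0.0.1:5002 instead; ports and header handling are placeholders, and it buffers responses, so turn streaming off while you test:
[code]
# Request-logging proxy: prints every POST body it sees, then forwards it
# to the real backend and relays the answer. Ports are illustrative.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

BACKEND = "http://127.0.0.1:5001"   # assumed koboldcpp address
LISTEN = ("127.0.0.1", 5002)        # point your chat client here instead

class LoggingProxy(BaseHTTPRequestHandler):
    def _forward(self, body=None):
        req = Request(BACKEND + self.path, data=body, method=self.command,
                      headers={"Content-Type": self.headers.get("Content-Type", "application/json")})
        with urlopen(req) as resp:
            data, status = resp.read(), resp.status
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        self._forward()

    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        try:
            # Pretty-print JSON so sampler fields are easy to eyeball/diff.
            print(json.dumps(json.loads(body), indent=2, sort_keys=True))
        except ValueError:
            print(body)
        self._forward(body)

HTTPServer(LISTEN, LoggingProxy).serve_forever()
[/code]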
>>
>>107930168
i think you could offload the early layers, but that would speed up requests, not diminish memory usage
>>
>>107930168
Most of the parameters (70%) still have to be standard MoE blocks, so it's almost pointless for alleviating fast memory requirements.
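Back-of-envelope, taking that 70% figure at face value and using made-up sizes:
[code]
# Illustrative only: the model size is hypothetical and the 70/30 split is
# the claim above, not a measured figure for any real model.
total_gb = 100          # e.g. a 100B-param model at ~8 bits per weight
offloadable = 0.30      # share that could live on RAM/SSD as "memory"
print(total_gb * (1 - offloadable))  # ~70 GB still has to sit in fast memory
[/code]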
>>
File: 1763894033195100.png (85 KB, 1181x590)
85 KB
85 KB PNG
>>107930230
I have local versioning + mirror on my own VPS (using duplicati to do the actual backups and syncthing to distribute it to my pcs/phones/servers)
we're in /g/ so I guess everyone has such a setup, right? this is not a consumer board for literal retards, no???
>>
Now that glm flash is fixed, it's time to admit it punches way above its weight.
>>
>>107929729
Brainlet here. Isn't that built-in ngram decoding?
>>
>>107930219
koboldcpp. I didn't update that, so it's not that. It only started happening after updating to silly tavern 1.15/latest staging a couple of days ago.

>>107930227
The raw response from koboldcpp is without the grammar. So SillyTavern is definitely telling the backend something weird.

>>107930230
I'm going back to 1.13 anyway and waiting until they fix this shit; my group chats can become accessible again when I update after they fix their retardation.
>>
>>107930211
>There is nothing of value there
>>107930285
>The raw response from koboldcpp is without the grammar. So SillyTavern is definitely telling the backend something weird.
That's fucking weird.
If Silly is sending something weird to kcpp, you should be able to see exactly what this weird thing is by looking at the request being sent to/received by kcpp.
>>
https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
>Jan 21 update: llama.cpp fixed a bug that caused looping and poor outputs. We updated the GGUFs - please re-download the model for much better outputs.

Gotta download them again.
>>
>>107930310
This was from the input in koboldcpp console from a response lacking grammar.
"max_new_tokens": 300, "max_tokens": 300, "logprobs": 10, "temperature": 0.7, "top_p": 1, "typical_p": 1, "typical": 1, "min_p": 0.05, "repetition_penalty": 1, "frequency_penalty": 0, "presence_penalty": 0, "top_k": 0, "skew": 0, "min_tokens": 0, "add_bos_token": true, "smoothing_factor": 0, "smoothing_curve": 1, "dry_allowed_length": 2, "dry_multiplier": 0.8, "dry_base": 1.75, "dry_sequence_breakers": "[\"\\n\",\":\",\"\\\"\",\"*\"]", "dry_penalty_last_n": 0, "max_tokens_second": 0, "truncation_length": 20480, "ban_eos_token": false, "skip_special_tokens": true, "include_reasoning": true, "top_a": 0, "tfs": 1, "mirostat_mode": 0, "mirostat_tau": 5, "mirostat_eta": 0.1

There were also stopping strings, but they were mostly just auto-generated character names and tokens. Had to remove them here because the list was too long. Besides, those are only used for stopping further output.
I can't find anything of value in the input, like I said. Just massive amounts of text and my banned string list.
Something is obviously fucked up since everything works right in 1.13.
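If you still have request bodies captured under both versions (kcpp prints them, or a proxy like the one above can save them), one way to narrow it down is to diff the parsed JSON instead of eyeballing the blob; filenames below are just examples:
[code]
# Print only the top-level fields that differ between two captured request
# bodies, e.g. one saved under ST 1.13 and one under 1.15.
import json, sys

def load(path):
    with open(path) as f:
        return json.load(f)

old, new = load(sys.argv[1]), load(sys.argv[2])   # e.g. st113.json st115.json
for key in sorted(set(old) | set(new)):
    if key == "prompt":
        continue  # skip the huge prompt text, we only care about samplers
    if old.get(key) != new.get(key):
        print(f"{key}: {old.get(key)!r} -> {new.get(key)!r}")
[/code]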
>>
>>107930381
>unslut
of course
>>
File: 1764946477162533.jpg (287 KB, 1920x1080)
287 KB
287 KB JPG
>>107930381
>Using unsloth goofs in the first place
>>
File: 1768264705236546.jpg (157 KB, 768x1024)
157 KB
157 KB JPG
>>107921731
>>
>>107930381
lmao
>>
>>107930467
ha so much fun sir now is real perfect for the looks push to the moon! :rocket:
>>
>>107930472
What?
>>
>>107930421
You have to download quantizations again even if you got them elsewhere at release.
>>
File: file.png (24 KB, 471x219)
24 KB
24 KB PNG
>>107930478
>>
Anons still don't make their own quants.
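For anyone who wants to stop depending on third-party uploads, the usual llama.cpp two-step looks roughly like this (paths and the quant type are placeholders; check the script name in your checkout):
[code]
python convert_hf_to_gguf.py /path/to/hf-model --outtype f16 --outfile model-f16.gguf
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
[/code]
The intermediate f16 file is what eats the disk, which is the tradeoff the next anon is pointing at.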
>>
>>107930501
downloading slot/towski quants 5 times is still less wasted space than downloading the full weight once.
>>
>>107930496
I still don't get it.

>>107930491
Using the override flags to change which gating function is used should work with the old GGUFs too, right?
>>
>>107930511
>Using the override flags to change which gating function is used should work with the old GGUFs too, right?
No. Needs remade.
>>
oh god
>server : support preserving reasoning_content in assistant message
>(a) detect whether model supports reasoning (b) enable reasoning by default if it does (c) pass reasoning traces
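If that lands, an assistant turn in the request body would presumably carry the trace alongside the visible reply, something like this (field names follow the DeepSeek-style convention the PR text implies; treat the exact shape as a guess until it's merged):
[code]
{
  "messages": [
    {"role": "user", "content": "first question"},
    {"role": "assistant",
     "content": "the visible answer from last turn",
     "reasoning_content": "the think trace that used to get dropped"},
    {"role": "user", "content": "follow-up question"}
  ]
}
[/code]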
>>
Wait. Has llama4 been using the wrong scoring function this whole time too?
>>
>broken
SillyTavern literally prints all samplers going to the backend in the console and, theoretically, the text it sent wrapped in the template.
>>
>>107930037
That's wild, you say? *My finger traces a smirking pattern on your smirk*
>>
>>107930548
llama4 is unsalvageable bro, don't even dream about it
>>
>>107930540
That's for interleaved reasoning right?
>>
>>107930510
For pathetically small models like glm-flash it doesn't even matter. A full 30b is close to a quanted 70b.
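Rough math behind that claim (bits-per-weight figures are approximate averages, not exact file sizes):
[code]
# Back-of-envelope file sizes in GB; bpw values are rough.
print(30e9 * 16  / 8 / 1e9)   # 30B at bf16   -> ~60 GB
print(30e9 * 8.5 / 8 / 1e9)   # 30B at Q8_0   -> ~32 GB
print(70e9 * 4.8 / 8 / 1e9)   # 70B at Q4_K_M -> ~42 GB
[/code]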
>>
>>107930561
It would be funny if it turned out not to be as bad as we thought, though, just because of a simple metadata bug that made the backend use the wrong expert routing function.
>>
>>107930574
Two more tweaks!
>>
Downgrading to SillyTavern 1.13 worked. The model now uses proper grammar again.
>>
>>107930633
Bizarre.
What changed compared to >>107930392?
>>
File: 1741692334449988.png (89 KB, 896x258)
89 KB
89 KB PNG
>>107930574
To guard against that I try it on the API and compare to quants. Llama4 was still bad.
>>
>>107930633
You know they'll never fix it if they don't know, right?
>>
>get rate limited by poogle
>fuck it, I got local kimi
>try it using it for serious work for the first time
>it actually one shots things I throw at it while gemini needed multiple corrections
Feels good to be CPUMAXXER
>>
>>107930669
There is nothing to fix. PEBKAC.
>>
>>107930671
How many hours did it take to complete your "serious work" of generating single-file throwaway scripts?
>>
>>107930657
>green line
Suspicious.
>>
>>107930730
?
>>
File: dipsyTwoMoreTweaks.png (1.24 MB, 1024x1024)
1.24 MB
1.24 MB PNG
>>107930612
lol
>>
>NA wakes up
>thread quality drops, no more discussion
>sub 90 iq posts and insults
>>
>>107931035
eat my balls
>>
Once
>https://github.com/ggml-org/llama.cpp/pull/18953
is merged, I'll finally begin testing GLM 4.7 flash.
>>
>>107931035
What's new
>>
File: file.png (113 KB, 696x287)
113 KB
113 KB PNG
girls!
>>
>>107925923
attention in a nutshell
>>
>>107929570
don't finetunes make the model's instruction following worse?
>>
>>107931319
>>107931319
>>107931319
>>
>google is going to get away completely unharmed from the openai fallout
Did those dudes make a pact with Satan or something?
>>
>>107931374
who are you quoting?


