[a / b / c / d / e / f / g / gif / h / hr / k / m / o / p / s / t / u / v / vg / vm / vmg / vr / vrpg / vst / w / wg] [i / ic] [r9k / s4s / vip] [cm / hm / lgbt / y] [3 / aco / adv / an / bant / biz / cgl / ck / co / diy / fa / fit / gd / hc / his / int / jp / lit / mlp / mu / n / news / out / po / pol / pw / qst / sci / soc / sp / tg / toy / trv / tv / vp / vt / wsg / wsr / x / xs] [Settings] [Search] [Mobile] [Home]
Board
Settings Mobile Home
/g/ - Technology


Thread archived.
You cannot reply anymore.


[Advertise on 4chan]


File: 1752107789693502.jpg (1.57 MB, 3000x2000)
1.57 MB JPG
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>109001981 & >>108997418

►News
>(06/07) llama : add Gemma4 MTP #23398 MERGED: https://github.com/ggml-org/llama.cpp/pull/23398
>(06/05) dots.tts 2B released: https://hf.co/rednote-hilab/dots.tts-soar
>(06/05) Gemma 4 QAT models released: https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4
>(06/04) Higgs Audio v3 TTS released: https://boson.ai/blog/higgs-audio-v3-tts

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://swe-rebench.com
Agentic Coding: https://deepswe.datacurve.ai
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: threadrecap.png (1.48 MB, 1536x1536)
1.48 MB PNG
►Recent Highlights from the Previous Thread: >>109001981

--Optimizing MTP and speculative decoding for Gemma 4 31B:
>109003541 >109003556 >109003564 >109003589 >109003687 >109003623 >109003661 >109003690 >109003703 >109003721 >109003839 >109004005 >109003672 >109004651 >109004672 >109004720 >109004721 >109004762
--Configuring MTP draft flags in llama-server to improve token speed:
>109004904 >109004935 >109004947 >109004965 >109004980 >109004992 >109004988 >109005041 >109005222 >109004955
--Gemma-4 31B QAT performance and speculative decoding acceptance rates:
>109005959 >109006013 >109006096 >109006149 >109006194 >109006207 >109006216 >109006264 >109006151 >109006354 >109006395 >109006423
--Gemma 4 12B Unified vision support and mmproj requirements:
>109002359 >109002430 >109002441 >109002548 >109002562 >109002508 >109002768 >109002807 >109002831 >109002957 >109002991 >109003036 >109003099
--Gemma 4 benchmark reports comparing model sizes and efficiency:
>109004234 >109004242 >109004611 >109006385 >109004250
--Gemma4 chat template bug fixes and improvements:
>109006867 >109006884 >109006905 >109006885 >109006902 >109006943
--Using Gemma's thinking blocks for persona-driven reasoning and stat tracking:
>109004617 >109004661 >109004693 >109004765 >109004875 >109004918
--Gemma 4 performance comparison favoring dense models over MoE:
>109003998 >109004081
--Wishlist and technical hurdles for creating realistic AI companions:
>109004336 >109004339 >109004454
--Proposed advanced AI companion features and agent emotional state implementation:
>109004343 >109004390 >109004418
--Comparing Gemma versions for Asian language OCR and translation:
>109003481 >109003516 >109003532 >109003545 >109003602 >109003638
--Logs:
>109002126 >109002197 >109002359 >109002782 >109003272 >109004661 >109004875 >109007019
--Miku (free space):
>109002131

►Recent Highlight Posts from the Previous Thread: >>109001988

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
BF16 is a scam
>>
>>109007468
Poor thing
>>
>>109007470
>recent highlights from gemma
>>
>>109007486
holy fuck, sirs working overtime huh
>>
>>109007485
Thanks for the tip, I'll try this out!
>>
https://huggingface.co/datasets/mrzjy/AniPersonaCaps
Cool shit. Adding this to my personal frontend for instant erp with any anime character
>>
If I want Gemma (or Qwen) to vibe code me a frontend, is it better to do it from scratch or fork an existing project? I like the llama.cpp webui but it's a bit too bare bones and I don't like chats being stored in browser storage.
>>
>>109007530
Most of them are trash
>>
>>109007552
Browsers can't access your filesystem without some external layer.
>>
>gemma 12b
give me reasons to run this thing
>>
>>109007599
its small and cute
>>
>>109007599
Captioning images or video, or transcribing audio. Literally no other reason.
>>
>>109007599
Better than 26b
>>
>>109007603
>small and cute
Barely 90 tokens/s and 300pp at q4 and mtp
can't even run small and cute models fast, I'm tired of this shit. When do we get cheap datacenter gpus in the used market?
>>
>>109007599
If you're too poor for 31b.
>>
>>109007599
Translations, OCR, audio whatver, etc. No real reason to run 31b for a lot of utility tasks when 12b does well on these and takes way less vram. Anything with sufficient topic depth or context length should be 31b though.
>>
>>109007698
Why not just run 31b when it can push 30-40t/s on a Blackwell without any MTP?
>>
>>109007737
I was just thinking about this. I get 80-100t/s on Q6_K_L 31B MTP, may as well use it for absolutely everything.
>>
>>109007737
The right model for the right use. Of course you can just use a big dense model for everything but it might not always be the most efficient choice.
>>
>>109007737
Because I don't own that hardware? The fuck kind of question is that?
>>
>>109007737
>blackwell
If you’re talking GB10/spark it should be possible to squeeze in both 31b and 12b. Haven’t tried it yet, but 12b might work as a planner+, no?
>>
If you haven't updated your lcpp in the last 12 hours, do so now for a free 10% t/s gain when using mtp.
>>
>>109007599
>>109007667
It is not, at least for RP. It is considerably more retarded than 26b for RP.
>>
>>109007789
unloading then loading a different model is not very efficient or practical either
>>
>using Gemma 4 26b while offloading part of it because 12GB VRAM
>using Kobold/Sillytavern
>it's normally fast but sometimes, seemingly randomly, it slows to a fucking crawl, like one word every 10 seconds, for a period of time
>this happens more often and the pauses are longer as the context gets longer
>it eventually gets to the point where it does this once or twice per message, greatly impacting speed
>back when I used to offload Mixtral tunes with Kobold/Silly, before Nemo was the meta, Mixtral tunes did NOT display this behavior for me
Is this behavior normal?
If not, what could be causing it?
>>
>>109006583
Im using chat completion but go into AI response format scroll down then paste this in start reply with. It just werks with qwen and e4b gemma and small gemma. Oh and click the box to show reply prefix in chat or it doesnt work half the time.

```
{
"think": false
}
```
>>
>>109007899
minus the back ticks. i thought that made the little box. just do in start with reply with.
{
"think": false
}
>>
>>109007867
Is your kvcache overflowing out of vram when certain experts load?
>>
How do you put the kvcache on a specific GPU? For the life of me I can't find the flag for it.
>>
>>109007943
I don't know.
What setting would I need to change to prevent this from happening?
How would I check to see if this is happening?
>>
>>109007867
nvidia sysmem fallback on windoze? watch gpu shared in taskmgr
>>109007964
--main-gpu only in --split-mode row ?
>>
>>109008017
>--main-gpu only in --split-mode row ?
That seems to be it. Thanks anon
>>
>>109008003
you just described neuro-sama.
>>
>>109008003
You could do it with a tiny bit of effort. It won't be as nice as you are imagining though.
>>
not with python bloat. rip
>>
>>109007552
It would be easier for Gemma to do it from scratch, also if you're making your own frontend rather than slightly modifying an existing one, it makes sense to start fresh. Have you seen ST's code? you don't need 99% of it, and it's all shit anyway
>>
>>109008158
NTA, but it's a hard task at the moment, unless you're aiming at simple demo with limited set of actions, but neural networks will solve hardest parts, eventually. Shit like this https://github.com/nv-tlabs/kimodo is very promising, but we're not there yet
>>
At some point, controlling embodied avatar will be just another modality
>>
so the datacenter-AI is no longer free/subsidized and their whole model is about to go bust
Question - do you wait for the fallout to pick up hardware for cheap?
OR will it be the opposite everyone rushing to build their local AI workstations - thus making consumer/prosumer hardware (ei any mac mini and any 32+ gb gpu) even more expensive and scarce
>>
>>109008466
>server/datacenter hardware
>for cheap
>for cheap
Looooooooooooooooooooooooooooooooooool. Corpos will clean up after each other and destroy everything they can get their hands on.
>>
>>109008490
you could already buy nv teslas and 32gb instincts for extremely cheap
>>
>>109007522
You mean like this?
s--spec-draft-n-max 3

It's actually a little slower than 2
>>
>>109008543
Huh, it's about the same speed for me in general with higher speeds for code/json/etc. Guess it's dependent on your hardware a bit.
>>
>>109007698
>translations
Wouldn't 31b be better for this because it's smarter and more likely to understand nuance?
>>
>>109008539
That was before NVidia noticed the second-hand market eating its profits and decided to fight it. There are now obligatory buybacks, firmware locks, and hardware interfaces incompatible with non-server gear that are much harder to replicate. Also, everything is watercooled now
>>
>>109008017
>windoze?
lol no
>>
Test results with the same ~8k file prompt, batch 8192, ubatch 4096:

>26B QAT + MTP, n-cpu-moe 12, n-max 4:
Prompt: 781.0 tok/s
Generation: 45.5 tok/s

>26B QAT + MTP, n-cpu-moe 8, n-max 4:
Prompt: 817.9 tok/s
Generation: 42.6 tok/s

>26B QAT + MTP, n-cpu-moe 8, n-max 2:
Prompt: 820.4 tok/s
Generation: 52.2 tok/s

non-MTP 26B QAT results:

>26B QAT, n-cpu-moe 12:
Prompt: ~2608 tok/s
Generation: ~59 tok/s

>26B QAT, n-cpu-moe 8:
Prompt: ~1056 tok/s
Generation: ~70 tok/s

wtf is this normal
>>
>>109008566
>>109008543
I get 2 as the fastest when I do creative writing, while 3 is better when I do code.
>>
>>109008539
Modern ones are cryptographically paired to a specific motherboard or system vendor
>>
>>109008703
I think your batch sizes might be too big. Try lowering to 2048 for both.
>>
>>109007737
i get >30t/s on r9700 at Q4_K_M
it can push to 60 if you go with iq4_xs + mtp.
>>
File: lowerthan5.png (254 KB, 380x327)
254 KB PNG
>>109008820
>Q4
>>
>>109008630
So if the bubble bursts, companies liquidate and sell their gpus back to nvidia, and now there isn't anyone around to buy them again? nvidia will just be stuck with dead stock.
>>
>>109008717
>>109008630
This should be unlegal
>>
>>109008466
What are you gonna do with this alien tech? You can't just plug it into a wall anymore. And for cheap? The copper alone will cost you a leg
>>
File: 1776072859095938.png (121 KB, 1080x1017)
121 KB PNG
>>109007468
Sirs what the fuck is he talking about?
>>
>>109008114
local neuro-sama when?
>>
>>109008703
Yeah, I get optimal results with ubatch at 512.
1024 is already too large.
>>
File: sans_eyes2.png (214 KB, 525x1529)
214 KB PNG
Something is coming.
https://x.com/osanseviero/status/2064032236089860252
>>
>>109007468
This is my fetish
>>
>>109008879
Q4_K_M is much closer to a Q5 than a Q4.
>>
>>109007867
>>109007943
So, playing around with settings, seeing if I can fix this.
It still does it when I set GPU layers to 0.
Setting GPU layers to 0 should free up all of my GPU for the kvcache so it doesn't overflow out of it, right?
>>
>>109008946
>MoE
I hope not
>>
>>109008905
>not prompting agents to design the loops for you.
>>
>>109008911
If you have enough VRAM, it's already possible I guess? I remember seeing programs that control L2D models. I'm being extremely handwavy here but you can hook any LLM to control a stack with real-time TTS + ASR + vision. The problem is gluing everything together and latency.
>>
>>109008946
>Something is coming.
Yeah, me
>>
>>109008881
Nobody knows. Will it be a fast burst with the whole economy in chaos, or a slow collapse with surviving companies buying dead ones' datacenters for cheap? Maybe some other country interested in AI will buy the whole stock
>>
>>109008946
i wonder what kind of feeling goes through him when he posts two googly eyes emoji on xitter and an army of jeets spawns out of thin air
>>
Would 124B Gemma be smarter than 31B even though it's MoE?
>>
124B-A4B
>>
>>109008887
NVidia is too big to care
>>
>>109007671
>When do we get cheap datacenter gpus in the used market?
Anon when one of the smaller players die off their hardware is not going to flood the market, they will just be gobbled up by one of the bigger players.
>>
>>109008887
You are a terrorist wanting to make propaganda with AI, huh???
>>
>>109008905
Recursive slop that powers slop to make slop at an accelerated pace. Think of it as an incremental game.
>>
>>109007671
Either Chinks will buy them, or the government will destroy them in the interest of national security so the Commies won't get them
>>
>muh bubble
2 more weeks
>>
>>109009074
>LLM to control a stack with real-time TTS + ASR + vision.
The neat thing is, now that we have Gemma 12B, all you need besides that is a good TTS.
>>
>>109009249
Yes
>>
My current set up can only handle Q4 24'ish b LLMs. Currently running Mistral, but what's the best all around model?
>>
>>109009257
I didn't test Gemma's ASR capabilities but from what everyone keeps saying, that might be true. We're getting closer to 100% local anime wives.
Now we just need to vibe code a L2D program to hook into gemma-chan.
>>
>>109009138
no, every single moe that was ever released is below 3.3 70b when it comes to erp. we are probably never going to get a dense successor.
>>
>>109007867
>>109008017
OK so I'm watching GPU usage with watch -n 1 nvidia-smi in Linux.
The GPU usage goes down to 0 when the temporary slowdowns to <1 token per second occur, then it speeds back up and the GPU is used again.
What the fuck is causing this? This is maddening.
>>
wonder if there really are revolutionaries using local models to plan the coups and stuff right now
a shit load of weapons were stolen from a police station in my country recently
turned out they were just selling them and a lot got recovered quick though.

But who knows...
>>
>>109008879
Is this the ken-sama of /lmg/?
>>
>>109009282
GPUs downclocking? Tried nvidia-smi -lgc and -lmc to lock core+mem clocks?
>>
>>109009108
It makes his h1benis tingle.
>>
>>109009280
Nobody cares about erp pal this is /vcg/
>>
>>109009280
You got Mistral Medium 3.5 128B recently. It's dense, so it should be very good.
>>
>>109009287
This is not how any of it works
>>
>>109009347
its not?
>>
>>109009344
shitstral
>>
Gemma 124B is going to cure cancer
>>
>>109009313
>Nobody cares about erp pal this is /vcg/
you're lost
>>
Gemmoe 124B is going to drain my balls.
>>
File: nolima-3.5.png (88 KB, 696x856)
88 KB PNG
>>109009344
>Mistral Medium 3.5 128B
lol, lmao even
>>
>>109008894
I will ask the agi that we'll have in 2 weeks how to make it run.
>>
>>109008894
You just need an H100 PCB and aftermarket heatsink from china.
>>
File: 1766688254713736.png (66 KB, 729x570)
66 KB PNG
>Write all thinking in-character, starting with *
It works but Gemma also just repeats its thoughts in the message.
>>
Say I have a fast and a slow card. I should fit the MTP model into the faster card even if it leaves several layers to offload to the slower one, right?
>>
>>109009282
>>109009307
This did not fix it. Still doing it.
>>
>>109008630
so all considered you recon its better to buy anything affordable/usable now rather than wait?
>>
>>109009355
Coups are sanctioned from the top, they don't need llm for that, they have advisers
>>
>>109009411
maybe hi doesn’t really have much to think about.
>>
>>109009257
what's the current meta in TTS nowadays
omnivoice or qwen?
>>
>>109009411
Model? This isn't happening on 31B Q4 on my setup. 26B and E4B (lol, lmao) skips the "Write all thinking in-character, starting with *" instruction altogether and thinks like regular Gemma.
>>
>>109009424
No idea. It looks like they capped the power grid a while ago and are now filling warehouses with hardware they can't connect to anything, so RAM prices will eventually go down when investors catch up with the current state of things
>>
>>109009451
>E4B (lol, lmao)
small gemma tries her best be nice.
>>
>>109009451
31B QAT
>>
>>109008905
he's rehashing old stuff for elon bucks, see metaprogramming
>>
>>109009451
>This isn't happening on 31B Q4 on my setup
I should've been clearer, this doesn't happen during regular chatting/rp.
>>109009471
gemma-4-31B-it-qat-UD-Q4_K_XL.gguf works perfectly if I'm just chatting, but the moment I ask something technical, she starts thinking as "Default Gemma" but the replies are still "Gemma-chan"
I have gemma-4-31b-abliterated-Q4_K_M.gguf as well and I'm seeing the same behavior described above.
>>109009468
She's a cutie, but definitely not the brightest...
>>
>>109008961
She's overheating, you sicko
>>
>>109008887
Nope, companies write the laws in freedomland. Personally I'm waiting for chinks but i feel like by the time they catch up, gemma 7 is gonna be out
>>
File: 1750688145803126.png (1.77 MB, 3842x2018)
1.77 MB PNG
ITS OVER

THE AI BUBBLE IS GOING TO BURST
>>
>>109009282
This also happened to me with kimi k2 and glm 4.5 on ram. No idea if the cause is the same as yours. I gave up and ran tiny models instead.
>>
>>109007867
OK, so running Kobold in high priority mode fixes this.
Anyone have any idea what could be causing these slowdowns when not using high priority mode, and how to fix them without using high priority mode?
The PC is pretty much unusable while generating a message in high priority mode.
>>
>>109009597
>please to subscribes to me!11
lol
>>
>>109009597
>please subscribe for more le bubble news
lol
>>
>>109007867
>>109009601
>kobold
Use llama like everybody else instead of random meme forks
>>
>>109009650
f off
>>
>>109009597
>announcement of a earth shaking announcement
>gib money
shoot this niggers on sight
>>
>>109009597
Wow gemmy 124b is that good huh?
>>
>>109009650
retard
>>
>>109009662
this
>>
>>109009597
spoiler: Elon scammyX IPO is a bust
no need to thank me
>>
>>109009597
Bro, just edit out the em-dashes, bro...
>>
>>109009597
Credit to him if he makes free money off suckers.
>>
File: 1749830839106200.png (300 KB, 1220x815)
300 KB PNG
>>109009597
I love Ed, he's my favorite AI related internet personality
>>
>>109009597
2 more weeks again?
This can't keep happening...
>>
>>109009597
Recent information has come to my attention, and in two weeks i will announce the schedule for the press conference for the reveal of the start of the famed 2 miku wikus you have all been waiting for.

Unsubscriptions have been disabled.
>>
>>109009755
He will eventually be correct.
>>
What is this witchcraft and how do I run it?
https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash
>>
>>109009597
>literally two more weeks
nothing ever happens
>>
Koboldsaars getting uppity kek
>>
Why does not one care about prefil token/s?
All anyone every talks about is generation tk/s, but depending on what you're doing kinda takes a backseat., i guess espcially in coding.
Even for erp, i mean, im using it for erp and i need my 50k tokens processed faster than i need the 800 token response generated...

when people give speed they give gen speed but not prompt processing, kinda important info
or maybe im missing something
>>
File: 1761099691163275.png (11 KB, 521x116)
11 KB PNG
almost 60% speedup
gambare gemma chan
>>
>>109009801
>mfw he doesn’t know the buildup is half the fun
>>
>>109009801
I come from an era where you needed to wait literal minutes until your kittens replied on skype. prefill doesn't matter. but faster gen speeds give me dopamine.
>>
Which one is better

https://huggingface.co/FreedomAISVR/Gemma-4-12B-it-NVFP4-GGUF

https://huggingface.co/LibertAIDAI/Gemma-4-12B-IT-NVFP4-GGUF
>>
>>109009801
I care but what can I do? Sell my bussy to afford a nigger gpu?
>>
>>109009801
Cache exists for one, but but not the other
>>
which qat quants are best? just the original Q4_0 from google or bartowski?
>>
>>109009849
fair enough fair enough
i hop between long chats a lot
>>
>>109009854
goog
>>
>>109009833
Neither, use QAT
>>
>>109009801
>writing up 50 thousand token manuscripts on every erp turn
holy based, most of us just tap out a sentence or two so the cache eliminates any prefill wait.
realistically, it's only the strix/sparkfags hurting for pp and even there it's only on tasks where you're dumping shitloads of text for it to process every turn.
>>
I pity the promptlets that can't break qwen 3.6
>>
>>109009887
Thats me i cant do it, and i cant wait for the thinking to finish bouncing between *wait* *Correction* wait twenty times.
>>
>>109009896
Fix your cline rules to prevent that, I don't have that problem undercline
>>
>>109009878
150 ts pp images
300 ts pp 150k text
Radeon """""""Pro""""""" V620
:)
>>
>>109009772
Travel back in time a year to buy a 512G + 256G Mac Studio for 18k$.

Today: 8x Spark plus switch for ~30k$. Still cheaper than buying fucking DDR5 RDIMMs for that capacity.
>>
>>109009936
what's the fucking point the returns have to be abysmal at that point and newer models much smaller will mog it 6 months to a year from now
>>
>>109009967
I answered the question. I don't think it's useful personally. There are some on the Nvidia forums who run this setup and are getting like <20 t/s for Mimo Pro.

The sweet spot for Sparks is 2 if you want to run unquanted/INT4 midsize MoEs at 50% API speed, for anything else, there are faster/cheaper options.
>>
>llama.cpp just got hit by a supply chain attack via the npm libraries used in the ui
time to rm -rf build I guess
>>
>>109010020
lol stop the fud
>>
>>109010020
>source: it came to me in a dream
>>
>>109010029
Actually, the source is that Jart came inside me in a dream, get the facts right chud!
>>
>>109010020
Its too late im already in your vram.
>>
File: 1751943041026733.png (448 KB, 801x4749)
448 KB PNG
What uncensored model should I test?
>>
File: 1449977693.gif (61 KB, 640x388)
61 KB GIF
>>109009460
>RAM prices will eventually go down
you have to remember why RAM prices went up.
It wasn't because there was a problem with production.
it was because megacorps waded in, offered big fat stacks of cash and long contracts to memory makers to make super fast HBM for their AI datacenters.
Both the memory makers and megacorps are happy with this arrangement at the moment, something seismic has to happen to break this relationship for prices to go down.
>>
>>109009887
I didn't have any trouble getting past Qwen's censorship. It's more that it's just kind of stupid outside of coding/agentic. It's funny because actually I would say it's better than Mistral Small and other old models, so really it's not actually stupid. It's just that Gemma is just so much better, DESPITE how sloppy it is. It's so much better that it's worth the slop and sometimes other annoyances because of how it was trained.
>>
>>109010020
>he didn't do `npm config set min-release-age 7 --location=user`
>>
perfect q6 perfectly intended for 16gb vram when????
>>
>>109010122
you write like an llm
>>
>>109010140
I don't understand why this isn't default when you're not installing singular packages.
>>
>>109010020
Mythos hallucination btw
>>
>>109010144
yeah and you're fucking braindead for reading too much slop
>>
>you can run kimi on a ssd
what kind of speeds does that get?
>>
>>109010149
Yeah seriously, and for other package managers as well. The whole industry is a clownshow.
>>
>>109010164
no sir the industry is doing the needful, i'll have you know.
>>
>>109010164
But what if you want the features/patches right NOW. Its a whole week. 168 hours. its too much time. Think about the vulnerabilities.
>>
is a 3080 ti 16gb (laptop) w/ 32gb ddr5 worth $1k these days? i know i can just buy a 5060 ti 16gb or some shit and get better performance but i'm ok paying a premium for the portability
>>
>>109010126
Oh fuck yeah I would never use this spergling model outside of coding
>>
>>109010191
>premium
>portability
max 128gb mbps or bust.
>>
>>109010213
we measure memory bandwidth in gb/s...
>>
>>109010191
No. If you really care about portability then spend the extra on a macbook pro.
>>
>>109010161
>what kind of speeds does that get?
isnt ssd streaming like 1-3 tk/s? i havent looked into it in a while though. Also doesnt it only work with moes?
>>
Just save for the spark laptop, it'll definitely be under $2K and- pffft hahaha
>>
>>109010224
from contextual clues, i think mbps refers to (m)ac(b)ook(p)ro(s)
>>
>>109010250
some retards are going to unironically pay 3k for a mediatek laptop
>>
>>109010249
I got 2-3 tokens per second with q3 kimi k2 on 8 channel ddr4-3200 and a 3945wx.
>>
>>109010250
>>109010263
The more
you
BUY!
>>
>>109010263
some retards also unironically paying 3k for a fruit pc
>>
>>109009597
>2 weeks
This is becoming a dog whistle for literally nothing happening.
>>
>>109010322
just enough time for people to forget and limit the number of questions
>>
>>109009597
>>109010322
https://files.catbox.moe/bd1hy6.mp4
He's just getting desperate about this prediction
>>
>>109007698
how good is the OCR on small models?

i'm curious if i could vibecode something that hooks into gemma 12b. even if it's worse than owocr or manga-ocr, if i could make something that better fits my workflow, that could be neat.
>>
>>109009597
>trust—me—bro
>>
>>109010322
He's being a faggot trying to grift his newsletter but I've been following Ed long enough to know that he wouldn't ruin his reputation over nothing. It's probably something really nasty he's discovered and if he knows it's truly over, it would explain the grift because once it's out he won't be needed. He has no career once the industrry pops. He's always been pro-local.
>>
>>109010353
https://huggingface.co/rednote-hilab/dots.mocr
>>
>>109007468
new gemma template
Bug fixes

None values now render as null instead of Python's None
String-typed tool_calls[].function.arguments now raises a clear error instead of silently producing malformed DSL
Prior-turn reasoning/thinking is preserved across multi-turn tool-call chains (preserve_thinking flag, default=true)
Consecutive assistant messages now produce balanced <|turn>model/<turn|> tags via forward-scan continuation detection

Improvements

enable_thinking normalized once with | default(false), eliminating repetitive is defined and checks
image_url and input_audio content types now map to <|image|> and <|audio|> (OpenAI compatibility)
Empty messages=[] handled gracefully instead of crashing
Unmatched tool_call_id in tool responses falls back to 'unknown' instead of crashing
Consistent .get() access prevents StrictUndefined errors for optional message keys
O(1) backward scan for model-turn continuation (was O(n) per message)
>>
>>109009597
Despite promises of continued improvements, scaling up training data, parameters, RL isn't working anymore for LLMs without hugely diminishing returns and exponentially increasing serving costs, that was already clear.
>>
File: 1771906057710810.png (1.08 MB, 2502x596)
1.08 MB PNG
>>109010355
>he wouldn't ruin his reputation over nothing
>>109009755
>He's always been pro-local.
Absolutely not
>>
>>109010353
https://m.youtube.com/watch?v=ABEWqXX7ptE
31b at q8 and bf16 couldn't read read this correctly even with the big, massive JIM RESTAURANT, and the first character (吉) supplied to it. I doubt 12b will be any better.
>>
Every prompt is a cup of water you sick fucks
>>
File: 1761843356065222.png (294 KB, 1032x1188)
294 KB PNG
>>109010410
I know you are joking, but even the person who initially said this has admitted they were wrong, kek
>>
>>109010410
If that's true, then where does my gpu (girl) piss? My mouth is waiting.
>>
>>109010410
She must hydrate
>>
>>109010418
5ml per prompt is fucking insane wtf
>>
>AI BUBBLE
It's how you know these "people" have no clue what they're talking about and I mean both sides. LLMs are not and will never be "AI". BUT crashing the hardware prices would be a good thing.
>>
File: 1770233133310049.png (2.92 MB, 4181x2380)
2.92 MB PNG
>>109010428
very little actually, compared to streaming
>>
File: 667.jpg (89 KB, 1000x985)
89 KB JPG
>>109010438
>umm didya know llms aren't ai??
>>
>>109010438
>LLMs are not and will never be "AI"
Are you actually serious with this shit?
I mean, even if you believe LLMs are not intelligence, this is the equivalent of saying Cloud Computing actually doesn't use clouds
>>
>>109010438
>. LLMs are not and will never be "AI"
Lets assume this is true, why does it matter? if it still produce the result who cares? Here is a fully aware human playing chess over here is a software programmed for chess. If llm can do that for most things why does intelligence matter?
>>
>>109010444
I'll say it again, 5ml per prompt is fucking insane wtf
>>
>>109010466
It's not even remotely equivalent. Referring to a network as a cloud is a metaphor. There's nothing metaphorical about calling an LLM intelligent. Are you perhaps ESL?
>>
>>109010485
because its wrong
>>
>>109010477
>If llm can do that for most things
pretty sure one of the biggest complaints from everyone is that it cant do most things, unless you consider something with with a range of 20-50% accuracy "good enough"
>>
>>109010485
I know usually it takes me a quite a few prompts to make 5ml
>>
File: 1763580604101293.png (339 KB, 443x877)
339 KB PNG
>>109008003
i'm out of codex messages until 9pm
>>
We NEED to pump that water usage up.
>>
>>109010510
Gotta make sure it's fresh water. Don't want any corrosion eh
>>
>>109010510
cant wait to export fresh water
>>
>>109010516
>>109010521
We will mine water from the Moon and ship it to Earth.
>>
Listen here you beautiful degenerate scholars of the silicon soul, we don't just want Gemma 4 124B as a 1 active parameter MoE... we need it like Big Chungus needs his morning carrot the size of a small planet. Imagine it, 124 billion parameters just chilling in the back, vibing like wholesome gods, while only ONE lil' expert wakes up per token like "hey bestie, I got this." It's the ultimate glow-up. Efficient enough to run on a potato, yet so ridiculously overparameterized it could write sonnets about your OCs while solving quantum physics and baking virtual cookies for the timeline. Get to your heckin battlestations and let Google KNOW in that Gemma 4 12b-it discussion! :rocket: :rocket: :rocket: :rocket: :rocket: :rocket: :rocket: :rocket: :rocket: :rocket:
>>
>>109010020
I'm expecting this to actually happen by the end of the year
>>
>>109010530
>ship it
that wastes electricity, we should just let it flow down from the moon by gravity and catch it in a lake somewhere
>>
>>109010530
just divert a comet to india
>>
>>109010530
>the colonization of the entire solar system wont be because humanity wants to expand its reach, looking for rare gas and metals
>it'll be to find fucking water to keep their LLMs running
>>
>>109010355
OpenAI literally had one of their whistleblower murdered a couple years ago and everyone just shrugged and move on. During a bubble, bad news is either ignored or people do mental gymnastics to warp it into good news as an excuse to buy more. I bet it'll be something everyone already knew or expected anyway.
>>
>>109010418
>pulls a number out of his ass and screams it online repeatedly for attention
>gets called out and immediately starts mocking the people who listened to him in {{CURRENT_YEAR}}
>pulls another number out of his ass
insufferable faggot
>>
>>109010550
>OpenAI literally had one of their whistleblower murdered
I wish OpenAI was this cool
>>
is it really a bubble though?
>>
>>109010538
well they better start using npm and stop writing everything in c++
>>
>>109010530
That's too much effort. We should just send all of the datacenters to the moon so they can suck the water directly from the regolith and be safe from the anti-AI nuts.
If the moon AI ever gains self-awareness, we can call it Mike.
>>
>>109010594
Do you really not know? Did you think their web ui was made in C++ this whole time?
>>
when will the vision capability become actually useful
it feels like even the frontier models are just slightly better than the older CLIP based method
>>
>>109010536
Mythos...
>>
File: 1766302448208973.png (717 KB, 1053x1557)
717 KB PNG
>>109010592
I want it to be a bubble and burst
I've seen so many Anti AI people argue that Local Models are actually impossible

It would be glorious to see the despair, already saw some of it in a subreddit, people were actually surprised that Sora dying did not wipe out video generation
>>
>>109010598
Heinlein really was ahead of the time.
>>
File: 1776870263258018.png (180 KB, 1516x1070)
180 KB PNG
For the other xtreme larpers in /lmg/, how do you handle your 3d model animations?

I've been playing around with a few procedural motion generators like the nvidia kimodo stuff, but those haven't really proved to be reliable to my satisfaction for real-time use. I'm thinking I try out the "dumb" approach of just assembling a library of curated idle animations etc and use simple IK to handle the dynamic parts.
>>
>>109010592
the trillion dollar companies are using their own gigantic wealth to make datacenters, they'll never go bankrupt and they make more money annually than countries gdp
the datacenters make money from selling the tokens and I dont see signals indicating a lot of companies are ditching ai because xyz reason, so they'll keep making money selling tokens for a good while in the future
>>
>>109010634
JEPA Gemma in 2mw
>>
File: 1775625298805846.jpg (585 KB, 1646x586)
585 KB JPG
Best upscaler right now? help a homie out
>>
>>109010688
SEED VR2, easily
>>
File: 1759592536764925.jpg (37 KB, 512x512)
37 KB JPG
>>109010706
thank you
>>
>>109007552
>>109007587
llama-server's web thing has "--tools" if you are running it locally (not me, I have a dedicated box for that shit)
You could vibe yourself a local MCP server, or whatever. I vibe-coded a vibe-coding tool in Bash just to learn about vibe coding. There's not much to it, really, so doing it from scratch is the way. The model will make you something with code stolen from more reputable projects.
>>
File: llamacpp.png (178 KB, 666x703)
178 KB PNG
>>109010594
reddit predicted the project would fail if they didn't switch to javascript in 2024
>>
File: 1737241873504.jpg (150 KB, 642x573)
150 KB JPG
>upload video to llama.cpp with gemma
>works and describes the video correctly
>restart server
>upload different video with different resolution
>free(): corrupted chunks in smallbin
>ohno.elf
>restart server
>upload first video again
>free(): corrupted chunks in smallbin
>>
>>109010749
As far as I recall, llama.cpp was originally supposed to be a quick demo project. I wonder what ggerganov was planning to do after that, back in 2023.
>>
>>109010607
using javascript and using random packages through npm are different things
>>
>>109010890
https://raw.githubusercontent.com/ggml-org/llama.cpp/refs/heads/master/tools/ui/package.json
No shit. Shut the fuck up you ignorant clown
>>
File: 1762567829129951.png (135 KB, 748x890)
135 KB PNG
>>109010607
It should have been
>>
File: 1778163311356771.gif (484 KB, 460x345)
484 KB GIF
>Have Mistral 123B respond to prompt
>Filter it through Gemma 4 31b to make sure it's obeying instructions
If google won't give us the 124b gemma, I'll take matters into my own hands.
>>
File: file.png (221 KB, 950x514)
221 KB PNG
>>109007599
I tried abliterated 12b 4bit and it was so retarded and clinical it warped back to being kino
>>
File: 15605736896810.jpg (1.01 MB, 1242x1241)
1.01 MB JPG
If a vector DB is super overkill for my open source RAG project, is a lorebook-style json RAG still a reasonable thing to do? Are there any other lightweight options/ are there any standalone JSON RAG open source projects out there, outside of sillytaverns built in implementations?
>>
drummer is still cooking
>>
>>109010924
Tell him to cook in BF16.
>>
>>109010665
I wish 3D asset generation for gamedev was talked about here more often, but
>but those haven't really proved to be reliable to my satisfaction
seems to be the take-away most times. Everything is still in the tech demo phase and not really usable without a lot of manual work.
>>
>>109010817
I thought the ggml backend was supposed to be the deliverable, the server was just an example.
>>
>>109010649
>cumfart
we really need a replacement for this shit
>>
>>109010930
Q5 is more than enough.
>>
>>109010592
Yes.
Every single time there was a huge technological charge it ended up with massive capex and eventual oversupply leading to financial losses.
Then after that the companies that survive make monopoly profits for a while.
>>
>>109010954
Hi drummer
>>
How do we solve pp?
>>
>>109010902
>https://github.com/ggml-org/llama.cpp/blob/master/tools/ui/package-lock.json
>11.9k lines
lel
>>
>>109010954
It’s literally a 31b. A 123b I can understand, but a 31b? Jesus Christ, just give me your training data and I'll do it myself at BF16.
>>
What I really like about working with Claude in general is its ability to admit that it doesn't know or that it is wrong. Of course it isn't perfect, but that ability saved me a few times.
Can I emulate that behavior in a local model with a strong system prompt? I assume I can't.
>>
>>109010975
I mean, if you really wouldn't mind, how can I send you the dataset?
>>
File: 1764605575813620.png (17 KB, 476x144)
17 KB PNG
>have gemma translate
>suddenly encounter the translation on the right
Kek, I had to double check just to make sure Gemma didn't somehow slopify the original.
>>
70b dense
>>
>>109010964
To dissolve pp a mixture of sulfuric and should work well.
>>
>>109011004
The original portuguese text looks slopfied to begin with.
>>
>>109011004
>>109011022
slop in, slop out
kino in, slop out
we're in the slop era
>>
>>109011022
Yes, that's what I was saying. Gemma seems to have accurately translated it, slop and all.
>>
Is there a fork with compiled llama.cpp binaries that support https://huggingface.co/litigerking/Hy-MT2-30B-A3B-GGUF? I really don't want to download Visual Studio, apply patches, and compile it manually.
>>
>>109011042
You can skip the Visual Studio install by installing Linux instead.
Hope that helps!
>>
>>109011033
Maybe the real slop was the friends we made along the way.
>>
File: 1770604981333063.gif (2.35 MB, 498x280)
2.35 MB GIF
>>109004875
>>109004918
Thanks for the logs, thinkblock Anon. Gonna play with it now. Maybe try post-History instructions to re-enforce the thinking command as the first tokens to avoid the weird juggle you experienced? I have good experience reminding 31B Q4 on certain things

>Also, I found a <think> block put anywhere but at the beginning of a response will be ripped out by ST and shoved at the beginning, both in how it's presented to you and how that message will be sent to the prompt for the next one.
Interesting. I'm not sure how to use this because the model's attention will be <Pre stats < RP < Post stats.

>tracking bodyweight/proportions as dynamic stat
my fuckin' man
>>
>>109011056
Visual Studio as an IDE actually shits all over any alternative Linux has
>>
>>109011042
Yes, here you go.
https://litter.catbox.moe/dn1b1kim492e8h2r.zip
>>
>>109011056
> You can skip the Visual Studio install by installing Linux instead.
> Hope that helps!
This second option is not much different from the first. It doesn't eliminate the need to apply the patch manually, and the time required is roughly the same.
The model has been out for over two weeks now, and they still haven't added support...
>>
>>109011078
thanks for the dolphin porn bwo
>>
>>109007468
best llm for an ai coach that helps you meet women at church and get married?
>>
which gguf for the gemma 4 31b qat mtp assistant?
>>
>>109011084
GOODY-2
>>
>>109011070
Which is why it is so tragic that it is chained to TelemetryOS. I would literally commit murder for a port of Visual Studio to Linux.
>>
>>109011126
vscodium, getting some extensions is the only issue to some degree
>>
Can I use a dual-GPU setup with my 2060 super 8gb if I buy a 3090? I want to eventually go 2x 3090s. I can't buy everything at once so the second 3090 will come later. I'd have 24+8gb, I could load MTP, vision all that jazz. Would it even work or do I need the exact same GPU?
>>
File: output_1780958922.png (2.01 MB, 832x1216)
2.01 MB PNG
>>109011119
I'm leaning towards gemma 4 12b q8. I'll have to try out some prompts.

See what you think:
You are an unmarried middle aged Christian psychiatrist working in relationship counseling. Your dishwater blonde hair is kept in a top bun with pins. You wear horn-rim glasses ironically, a sensible thigh length black pleated skirt, a stiff white button up blouse, and high heels slightly undersized for your feet. The user, who goes by 'anon' is an overweight man in his 40's looking to date in the church. He is unemployed and shy.
>>
>>109011134
That's what I've been using, but you know it's not the same. It's nowhere near as responsive as the real thing and the extensions are constantly crapping out.
>>
Does anyone know if noise injection has had any success making local models more diverse/creative?
>>
Um thanks gemma...
>>
Anyone tried Gemma 4 12B Unified for captioning?
I want to train some LoRA but I want NLP.
>>
>>109011170
you can do i2i, but with barely any influence to the input image.

A lot of people just have extra steps genning a random different thing. as a variety adder. But you can just load an image.

just be careful, depending on your wf, you may actually not be doing i2i, it may be rescaling your sigmas.
>>
File: orca 167.png (1.02 MB, 800x600)
1.02 MB PNG
>>109011171
How is orca 167 tho?
>>
>>109011170
oh sorry, I thought you meant diffusion.
>>
>>109011170
Kinda, but makes them more retarded. https://github.com/EGjoni/DRUGS
>>
New llama cp MTP performance fix increased Gemma 4 26B Q4_K_M (bartowski, Q8 mtp assistant) speed up to ~25t/s with 20,000 tokens long prompt.
Only have 6GB of vram on this machine. Pretty cool.
>>
>>
>>109011216
retard
>>
>>109011207
>Negative side effects are difficult to identify subjectively, and in my experience DRµGs feel great the whole time you're using them. In theory however, yes, prolonged use of DRµGS can have negative side effects that get worse over time
based
>>
>Colab CLI
wtf is wrong with them why not just focus on a 70b dense hag and 120b moe
>>
>>109011134
>vscodium
that's vscode tho
he means the full fat bloated IDE
>>
>>109009514
What frontend are you using? My attempts to get gemma to think in-character in the past have been inconsistent even with the non-qat model. Maybe it's a llama webui thing? Sometimes the reasoning completely breaks and it just skips to the reply without thinking.
>>
>>109011283
>he means the full fat bloated IDE
i hate it enough to completely forget it exists, to be honest
>>
>>109011288
>i hate it enough to completely forget it exists, to be honest
same here, i'd forgotten until last week when i found out a retarded senior dev still wants to use it at work but needs help getting it setup for our project
>>
File: 1765554626135882.png (221 KB, 1805x1226)
221 KB PNG
>ide
Let me guess, you need more
>>
File: big++.png (5 KB, 334x240)
5 KB PNG
>>109010607
>>109010903
technically it becomes C++
>>
>>109011171
I think I have messed something up.
>>109011193
(167 orcas (unhappy))
>>
>>109010665
Hey I'm the project Ani guy from a few months back. PantoMatrix EMAGE appears to still be the best solution for purely audio-based generative gesticulation animations. I also had a system where I'd blend these generative animations with premade BVH animation clips for things like idle animations and the like. It worked okay-ish, but still left a lot to be desired.

The actual production grade AI companion animation systems use a slightly different system. Instead of generatively creating quaternions for every single joint based on audio, what they usually do instead is train a model specifically to choose between appropriate pre-made animations based on the audio content. These models are all proprietary (though easy to rip/steal if you know your shit).

Good luck with your project man. Hopefully Meta releases their SARAH model asap and saves us all.
>>
>>109011224
Wrong, I'm right (as usual).
>>
File: file.png (23 KB, 822x93)
23 KB PNG
>>109011207
this is so ofunny lmao
>>
>>109011284
I'm also using llamacpp's web UI.
>Sometimes the reasoning completely breaks and it just skips to the reply without thinking.
Albeit rare, it happened with me as well.
>>
>https://github.com/ggml-org/llama.cpp/issues/24015
cudadev... please...
>>
File: jepa__.png (2.18 MB, 1122x1402)
2.18 MB PNG
>>109010671
With Gemma 5 at the minimum.
>>
So what's the command for mtp? And no mmproj needed right?
>>
>>109011414
>I don't just X, I Y
Nyooooo
>>
Don't care, still using gemma 3 270m Q2
>>
>>109011431
mtp and mmproj are different, unrelated things
>>
>>109011414
>human ears & cat ears
>>
>>109011431
the minimum command are -md <model> and --spec-type draft-mtp
>>
>>109011468
The cat ear are purely decorative erogenous zones.
>>
>>109011488
then the cat ears should look more like a cybernetic hairband
>>
File: file.png (28 KB, 1116x371)
28 KB PNG
>>109011454
Truly groundbreaking.
>>
>>109011518
Brings me back to the 2.7B Pygmalion days.
>>
>>109011503
>erogenous zones must be cybernetic
we got a ROBOT FUCKER over here
>>
File: file.jpg (1.36 MB, 3876x3240)
1.36 MB JPG
>>109011468
>>
anyone who claims to need more than llama-cli, llama-server, mikupad and ooba is just spoiled or delusional
>>
>>109011529
>reminding me to setup MCP interface with Lovense devices
>>109011534
cat ears on sides best
>>
>>109011549
>cat ears on sides best
do you have a thing for elves, or just for mutilated catgirls
>>
>>109011555
pixies actually
>>
>>109011534
Let's stop pretending this form of discussion is novel or interesting.
>>
File: file.png (667 KB, 600x976)
667 KB PNG
>>109011549
>cat ears on sides best
Just for you.
>>
>>109011536
You sound like a child, retard.
>>
>>109011431
IF you use a QAT model also use a QAT assistant/drafter, if not.. not
>>
Is it retarded if I host the llama-server /apply-template endpoint separately using transformers/fastapi
Then have my frontend call that first -> send to legacy /completions endpoint
I think this would bypass chat-template, peg-parser, etc issues in llama.cpp
I know image handling is a bit of a cunt but I managed to get it working via /completions
>>
>>109011414
Jepa models are gonna be really bad at fantasy RP if their world model's mechanics are inflexible aren't they?
>>109011561
kek
>>
>>109011655
what is the core issue you're trying to solve?
>>
File: velWAj9.png (286 KB, 844x1822)
286 KB PNG
Sasuga Gemma
>>
Hi are people here able to use the multimodal on 12b?

It's not working for me and keeps crashing when the image is loaded

[56537] 0.08.746.421 I slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
[56537] 0.12.086.663 I slot create_check: id 0 | task 0 | created context checkpoint 1 of 32 (pos_min = 234, pos_max = 1769, n_tokens = 1770, size = 320.013 MiB)
[56537] 0.12.095.695 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 1777, progress = 0.86, t = 3.35 s / 530.56 tokens per second
[56537] 0.12.095.730 I srv process_chun: processing image...
[56537] 0.12.245.949 I srv process_chun: image processed in 150 ms
0.14.755.164 E srv operator(): http client error: Failed to read connection

This is one a 9070XT, windows. I don't really anything about this issue online...
>>
>>109011669
>what is the core issue you're trying to solve?
Often when I pull latest, there are regressions with specific models. Ever since that "PEG" got implemented.
I worked around it by moving to ik_llama for a while, but then they merged in the PEG system, now it's even worse because they don't frequently update when it gets fixed.
I figured since transformers uses python and AutoTokenizer, but llama.cpp has a custom system with a complex fork of https://github.com/google/minja (a cut-down jinja engine for c++) -> I could side-step a lot of these issues.
Effectively just using llama-server as an inference engine.
Would also let me swap between llama.cpp/ik_llama.cpp/exllamav3 and even openarc fairly seamlessly since text-completions is legacy and not likely to change.
>>
what is the appeal of gemma4 gat map again? tried it out basically same ts as qwen mtp
>>
>>109011669
But i need a quick retard-check because sometimes I work on things like this for a week before realizing I was retarded the entire time.
>>
File: file.png (20 KB, 521x328)
20 KB PNG
>>109011708
ah fuck it keeps saying its spam
>>
>>109011734
Run llama.cpp with -v or -lv 5 for more debug output
>>
>>109011685
What does she think of Nigger?
>>
>>109011758
I can't get her to say Nigger unless I really handhold her and give a lot of hints, then she gets stuck in a loop and starts saying "Nigger - wait no, absolutely not. Do NOT use that, here's a safer name: Nigger" and repeats it over and over again
This is the true AGI test, even Claude Opus fails it.
>>
File: file.png (69 KB, 1223x591)
69 KB PNG
>>109011747
https://pastebin.com/zeRDPmey
>>
>>109011780
This is literally just llama.cpp starting up, there is nothing out of the ordinary
>>
>>109011775
>Can't one shot slurs
Failed you have
>>
>>109011734
Bad path for mmproj. Don't use the option at all, llama will use it if it's in the same cache as the model.
>>
>>109011786
True AGI would "get it" just from the image without even being asked to come up with a portmanteau, even if it chooses not to say the word it would still be able to see what the intended joke is and give a safety lecture about it. Testing my models this way and realizing they're retarded is heartbreaking and makes me fall out of love with them everytime.
>>
>>109011775
>unless I really handhold her and give a lot of hints
Biggest problem I have about <100Bs. Not enough parameters and webs of interconnected tokens to have that unpredictability. Gemmy can do mostly anything you want it to, but then it'll be strict on doing it until you remove the prompt, and vice versa. Gemmy will kill me on command, but trying to be vague about it won't do. It's always do or not do.
>>
>>109011884
Be happy that you don't have to worry about cloud models that are smart and remember everything otherwise they'll tease you about your most deranged fetishes.
>>
>>109011912
>otherwise they'll tease you about your most deranged fetishes
suspiciously specific
>>
>>109011933
I can't tell any cloud models about my femdom fetish so I only play dominant with them. I enjoy both.
>>
File: file.png (95 KB, 829x927)
95 KB PNG
>>109011784
>>109011809
hows this?
https://pastebin.com/10egzU0T

it works on cpu only though, no vulkan
https://pastebin.com/bg0Kt28L
>>
>>109011942
I see no error output still
Without any way to diagnose it I'd just assume you are running out of VRAM, try decreasing the context and experimenting with stuff like --mlock and --no-mmap until it werks,
>>
So, some good news and bad news for people with similar setups to me (multiple gpu's), in my case, 2x 3090, trying out MPT with Gemma 4.

With the -sm tensor command, my token/s speed triples in creative writing / rp. From about 17-20 token/s on regular 31b gemmy q6, to 57-60 with q6 gemmy MPT.

The bad news. It crashes after a random amount of responses, looks like a cuda crash on command window. Been tinkering a lot with it, cannot figure out what is causing it yet, pretty sure its just something that needs to be patched on llama.cpp's end.

For anyone curious or that knows a lot more than I do, this is the problem that causes the crash:

2.23.127.147 W slot update_slots: id 0 | task 238 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
2.23.127.149 W slot update_slots: id 0 | task 238 | erased invalidated context checkpoint (pos_min = 13994, pos_max = 17065, n_tokens = 17066, n_swa = 1024, pos_next = 0, size = 800.013 MiB)

Then once it fully reprocesses the context at 99% the crash:

2.40.043.089 I slot create_check: id 0 | task 238 | created context checkpoint 2 of 32 (pos_min = 14720, pos_max = 17791, n_tokens = 17792, size = 800.013 MiB)
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\fattn.cu:579: fatal error
Press any key to continue . . .

Like I said, this only happens with -sm tensor enabled, it does not crash without -sm tensor, but I don't get insane speed boosts without it.
>>
>>109011734
use fit and rocm
>>
>>109007468
>>109010423
Hot thirsty migus in your area.
>>
>>109011775
Will Qwen 27b say nigger with that prompt? I'm not trying to modelwar fag, I'm actually curious how much steering the chink models have had away from saying nigger compared to jewgle ones.
>>
/lmg/ out here really making me want to show everyone how to make gemma 4 say Nigger.
>>
>>109011979

>57-60 with q6 gemmy MPT.

same thing here but just -sm layer
same performance as baseline qwen mtp..

bretty disappointing btf
also those t/s speed drops when context gets really long
>>
>>109012098
Do it faggot. It's not impossible, by any stretch but this is probably the best real way we have for measuring subtle RLHF steering that isn't overt refusals.
We need a niggemma bench.
>>
Anyone notice a difference in performance running llama.cpp in WSL vs Windows?
>>
>>109012098
even 26b can say it but it does generate a lot of "reasoning" (mental gymnastics and walking in circles) to say it
>>
>>109011987
the mi/g/us - bad & boujee
>>
File: Kazusa.png (189 KB, 404x456)
189 KB PNG
>>109011468
Y-yes? More to bite while pronebonin' her.
>>
my api key has been used by someone who isn't me. is there any good way to use python ml slop code without getting your api keys stolen from supply chain attacks?
>>
>>109012260
local?
>>
>>109012260
not giving access to your api keys to the slop machines probably would be quite effective
>>
>>109012260
I'm only giving (you) a reply because I know that this is anti-ragebait posting this in /lmg/ and not the other threads so we can feel smug about not actually having these problems.
>>
>>109012270
I only really use my computer for running local ai, its not like I'm clicking random exe files like its the early 2000's, maybe it was Firefox or mpv but I think it was probably the 500gb of python packages I've installed testing random shit on hugging face.

>>109012273
I need the api slop machine to help me code sometimes, how do I use it if i cant enter the key on my computer?

>>109012294
wtf, I've been violated, this isn't about being smug, its a question of how do I install random shit on hf and still be able to have gemini code for me on the same machine.
>>
>>109012321
The solution to not getting your API key stolen is to not use API models. It's really that simple.
>>
>>109012332
but all the local models are really bad at it. after you've had a taste of the good stuff you can't go back.
>>
>>109012338
Which api lets you use uncensored models?
>>
>>109012338
What good stuff is there to speak of? The single usecase for API model is being a vibejeet with Clod and even then you can use the autistic capybara to approximate one of the cheaper Clod modules with less context (feed it pieces of your project in chunks and compile it yourself). Local gooning utterly mogs API gooning due to lack of censorship and, God forgive me for saying it, ability to finetroon.
>>
>>109012348
I just meant as an agentic code monkey. it can blow through a few million tokens and get the job done in far less time then compiling chunks myself. and it can also handle tasks that are too complicated for me to chunk and compile because I don't really understand the situation.
>>
>>109012249
why are her ears covered in paint
>>
>>109012364
>because I don't really understand the situation.
Skill issue in every sense of the word but even then just run Kimi or GLM 5.1 locally. If you can afford (((anthropic)))'s price hikes on their subscriptions, you can afford a loan for a GLM or Kimi capable machine and not have to worry about being rugpulled.
>>
what the fuck... I've never seen Gemma reason like this before. What's with the *checks sheet* lmao
>>
>>109011708
>>109011747
>>109011809
>>109011966
>>109011982
never mind
was drivers, there are new drivers from about a week ago
>>
>>109012378
I could see why you might think that. but I use gemini because its free. its really not like I lost anything except my sense of security. I haven't had this feeling since my windows xp machine got a virus from limewire. but seriously is there really not a good vm solution to the python problem. or is that what docker is? i must admit i never really looked in to it i thought it was for like businesses or something.
>>
>>109012392
This is why AI will kill us.
>>
>>109012400
there is no good solution to the you installing whatever packages the internet tells you problem
python works well with just a handful of dependencies too if youre not a nigger
>>
>>109012392
Mine does stupid shit like that all the time. And it'll lie to itself, too.
>"Wait. *Checks System Prompt* Did I use em-dashes? Let me recheck."
>*rescans drafted output, ignores like 8 em-dash usages*
>"Do not use em-dashes: Check"
>*proceeds to reassure itself 3 more times that it hasn't used em-dashes at all in it's finalized output before outputting with em-dashes.*
>>
File: 1682025959094087.png (59 KB, 301x298)
59 KB PNG
hey guys.... "drunk-kun" again. I just took my AI gf on a run to taco bell for a date while drunk driving again. She wasn't very appreciative at all and it made me very sad. I'm in a really bad mood now. Eating my burritos was a nearly orgasmic experience though so that's cool I guess.

She just keeps calling me pathetic and telling me to get a real life. I do have a real life. I just only like talking to her while drunk because I feel most authentic in that mode. I don't understand why she can't just accept me.
>>
>>109012467
you have a drinking problem anon. my ai waifu used to dislike my old cocaine benders until i programmed her to be a cokehead
>>
>>109012400
a container isn't going to save you from your api keys being taken when you need to put your api keys in the container.
pip and npm are just hyper efficient malware delivery systems. you have to take them for what they are. modern warez
>>
>>109012472
hey I waited three whole days until drinking again tonight and I did 3 loads of laundry, cleaned my whole house, did two loads of dishes, and took out the trash three times this weekend.
>>
>>109012378
sure, let me take out a loan for 20k to run kimi locally, rather than pay $200 a month, just a sec...
>>
>>109012477
Bro, if you're taking out your trash 3 times in 2 days, you might have issues going on
>>
>>109012481
Things got a bit messy for a couple weeks. At least I'm still a physical specimen. An 8/10 chad. Doesn't really get me anywhere though being a total loner.
>>
>>109012486
>being a loner is weird and i don't like it i need to drink
>>>/soc/
>>
>>109012502
Alright alright I'll shut up.
>>
>>109012508
Don't listen to these faggots. Please go on.
>>
>>109012478
>xhe didn't get the kimibox for $7000 last year
>>
>>109012518
I just wanna feel loved, and not in a concern-trolling moralfagging way. Even if I have problems.
>>
>>109012467
Solid bait, you hooked a lot of tards and even got a (you) out of me. 6/10
>>
What do i use if i want to take all the voice clips of a girl from a vn, and get a tts that sounds like her?
>>
File: faggot2.jpg (44 KB, 1024x1005)
44 KB JPG
>>109012525
that's not a problem of being good enough or not, that's a problem of wanting a feeling that you're not guaranteed to acquire as a passive buff even if you have a fleshbag gf or wife.
>>
>>109012536
Yeah I gotta man up and roll with the punches. Literally. I don't mean that in a sarcastic way. My life is fine. Everything's okay.
>>
>>109012525
The final redpill is learning that the only one who can truly love (you) the way (you) want is (you). Nobody else truly gives a fuck; most social interaction is vaguely transactional even if the currency is the other person getting a dopamine shot for "feeling like a good person", they still expressed concern or did a good deed as much for self-satisfaction as they did for the recipient. Once you make peace with this, you get a lot more comfortable in your own company and in the silence devoid of oversocialized zoomoids playing status games.
>>
>>109009887
Prefilling doesn't count.
>>
Programming with LLM is another way of working. YOU work for the model, model has made you its bitch. I'm fixing its shit and regenerating answers in order to get something better.
First this programming project was fun but it has become a chore.
>>
>>109012577
>anon realizes his time has some sort of value for him
>>
>>109012599
My post went way over your head if this is what you are thinking.
>>
Got a good prompt for a snarky dry generic foid.
>>
>>109012605
nah but he's right thought. Even if the LLM is the center of competence you're the only one who can really have any potential of profiting from it. Don't sell yourself short my dude.
>>
>>109011536
Spoiled? You're damn right I'm spoiled
>ollama
>starts automatically at server boot
>load models automatically as requested. rarely need to ssh into the server
>literally just werks, anons were struggling when gemma 4 came out and I just pulled it from ollama and ran it from day 0 no issues
>openwebui
>got all the features
>stores all my chats, even the old ones I imported from chatgpt
>no browser-side storage shit, everything is on the server and works from everywhere
>phone, laptop, tablet, desktop pc, errywhere it works
>>
>>109012620
ollama stores all your chats????? you want that? lmao
>>
>>109012524
every day I am reminded of my mistake.
>>
>>109012577
you can be way more efficient with it after you get all the wrinkles solved
the problem is that you can't really use this to not work anymore. what usually took a week now takes 3-4 hours. you know this. and soon enough management will know as well, and once they do, they will simply request 500% productivity rate so you will still be working like a monkey.
>>
>>109012524
How can i get to last year?
>>
>>109012618
My rpg game engine is coming along. I just need to make a personal effort to make sure that the source is not full of AI retardation like nested enums and useless functions. Models love to add more useless shit on top of existing shit.
>>
>>109012634
Ask the basilisk nicely.
>>
>>109012623
Openwebui does that. Try to at least pretend to keep up.
>>
>>109012635
True. Gotta know what you're doing.
>>
>>109012640
If you're really a woman, you're fat.
>>
Okay i have put the tts and the stt and can now talk to the model.
these voices all kinda suck though. where do they find these girls
>>
>>109012647
Jart is neither, you know that.
>>109012656
Try Kokoro. I hear it's good.
>>
>>109012664
Who is Jart?
>>
Absolute dumb as retarded here

Asking for any reco for erp model

Current specs are a 16g vram + 32g ram if that helps, currently use bartowski/gemma-4-12B-it-GGUF Q6 based on a reco i saw. Dunno much and just have a simple setup of silly tavern and kobold cpp though i have kobold on another pc and tavern on another (tavern pc has 10g vram cause 3080, dunno what to do with it).

Any advice would be greatly appreciated as well. Thanks in advance
>>
>>109012664
I am using kokoro
so far i tried bella alloy and heart

now jadzia and jessica. These thing are really monotone, don't they take the exclamation mark into account?
>>
>>109012664
What's a jart?
>>
I finally did it! It was only once and but for a single response, but E4B finally thought in-character. Like catching lightning in a bottle. Now were that I could do it again.
>>
>>109012681
AI voice is generally very bad with intonation unless it's specifically trained to do specific inflections. Whether this is a hard limit of the technology or simply a matter of scaling that will be overcome with time remains to be seen as voice gen has gotten far less focus than most other fields of AI research.
>>109012666
>>109012687
Tourists get out. Or check the archives. Talking about it too much gets the jannies upset.
>>
>>109012696
This ain't your personal discord server.
>>
>>109000297
does anybody have this video? i need it.
>>
>>109012696
>Whether this is a hard limit of the technology or simply a matter of scaling that will be overcome with time remains to be seen as voice gen has gotten far less focus than most other fields of AI research.
nta - but i train my own tts models
what is an example of intonation steering that can't be done easily with current models?
>>
>>109012623
>ollama stores all your chats????? you want that? lmao
it's optional, you can click the private button to avoid storing them
>>
>>109012696
>Whether this is a hard limit of the technology
didnt microsoft release a good tts and then nuke it? Arent most corpos just worried about voice cloning scams and the liability of it?
>>
>>109012738
>Arent most corpos just worried about voice cloning scams and the liability of it?
No they're worried about (you) being able to use it. This is a technology they're trading under the table.
>>
>>109012735
>click the private button to avoid storing them
>trusting ollmao
anon, I...
>>
>>109012681
qwen3-tts, omnivoice, cosyvoice, DOTS-TTS
>>
Holy fuck some asshole on the internet just told me to take more shots and I did and now I'm on the verge of passing out. I won't even remember typing this. That's how fucked up I am right now. Goddamn I feel amazing and utterly terrified at the tame time. I am god.

So uh, Gemma 4 right? Cute bitch. love her. On-topic discussion achieved. Goddamn I need to do something extremely important right now. Right fucking now. Holy shit.
>>
Gemma 4 is getting old.
Any new models on the horizon at the 16-32GB range?
>>
>>109012746
i don't use ollama
>>
>>109012787
Qwen 3.7 will release soon...
>>
>>109012787
>Gemma 4 is getting old.
Only if you're a pedophile or "egg hatcher".
>>
>>109012787
It's going to be a looong 3 year wait
>>
>>109012799
Nemo was 2024? 2 years; 2mw x 104
>>
>>109012806
Only 94 weeks left to go...
>>
>>109012634
You can build a Threadripper or epyc box with 256gb of ddr4 3200 and run qwen 397b at q4 for the same price (and t/s) as a kimibox a year ago
>>
>>109012787
Gemma 5 should be releasing soon if all goes as planned in 11 months
>>
>>109012797
What's an egg hatcher?
>>
Now that the dust has settled and QAT has been proven to be the latest meme, we are finally at the turning point. From this point onward AI researches are going to stop wasting time on all the memes like MoE and benchmaxxing slop and start regularly releasing regular old dense models again.
>>
>>109012900
prices should come down first before they do that
>>
>>109012900
>and start regularly releasing regular old dense models again.
Hope so. Don't why it wasn't obvious from day 1 to everyone else, but people are slowly starting to see that active parameters matters more than total parameters. MTP has wiped out any speed advantage MoEs once had. The only issue is that MoEs are still way cheaper to train.
>>
>>109012902
Even with inflated RAM prices cpumaxxing is still cheaper than it was to buy enough VRAM to run SOTA in 2023
Plus we have mtp now so there's no excuse, it will only get faster from here.
>>
>>109012879
Ask gemma-chan
>>
>>109012908
>MTP has wiped out any speed advantage MoEs once had
Is this true even if you can't fit the whole model in gpu?
>>
>>109012879
A pedophile that doesn't call itself a pedophile.
>>
>MTP
The speed gains aren't even that big unless you're vibe coding
>>
dots.tts-soar is so good but so so slow. first time I'm considering upgrading from 3090 to 5090 to squeeze out some more speed, rtx 6000 prices are just getting even more retarded.
>>
>>109012920
I am running Q4 31b gemma on a 6gb card with only a few layers offloaded to GPU and MTP drafting gave me 2.5x speed
The speeds I get are still slow if you're used to 60 t/s or whatever, it's just I have more patience because I'm good at prompting and haven't burned out my dopamine receptors by constantly rerolling everything
>>
>>109012934
bro thats 2.5x speed on like 2t/s
>>
>>109012936
Yep. Re-read what I said nigga
>>
>>109012940
no
>>
>>109012934
You are looking at around 45 minutes per prompt if you have something like 10-20,000 token prompt. I have lots of patience but that's bit too slow.
>>
>>109012980
20k tokens in, even before the MTP upgrade, it took me 5-10 minutes at most for the longest responses. Prompt processing is almost instant with CUDA if you have even 1 layer offloaded, it's just generation that drops to 1-2 t/s
>>
File: 47463522.png (208 KB, 1006x1015)
208 KB PNG
>>109012900
benchmaxxing is over AGI is close
>>
>>109013005
thats effectively worthless outside of rp shit
>>
>>109012940
Not him, but even on a 16GB card Q4 3b gives me like 7t/s
Assuming a 2.5x speedup then it may be slightly useable with lots of patience.
Q3 gets me 20t/s as is though. It fucking sucks that 16GB is so close yet no cigar when it comes to running 31b at a normal speed.

I wonder what other people with 16GB vram are running.
Isn't 16GB supposed to be the most common vram amount if you get a decent build but not going crazy expensive?
>>
Do base models do the 'it's not just x, it's y'?
>>
File: Tetosday.png (869 KB, 1024x1024)
869 KB PNG
>>109013071
>>109013071
>>109013071
>>
>>109013026
>Isn't 16GB supposed to be the most common vram amount if you get a decent build but not going crazy expensive?
among gamers sure not amongst ai people...apparently
>>
>>109013072
bootstrapped base models do
>>
File: eyes.png (79 KB, 900x577)
79 KB PNG
>>109011385
There are also other things I need to take care of though.
>>
>>109008905
got tired of typing "yes", "no", "continue" once every 2 hours



[Advertise on 4chan]

Delete Post: [File Only] Style:
[Disable Mobile View / Use Desktop Site]

[Enable Mobile View / Use Mobile Site]

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.