/g/ - Technology


Thread archived.
You cannot reply anymore.




File: seeking the deep.jpg (252 KB, 1024x1024)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>109137540 & >>109132566

►News
>(06/25) LFM2.5-230M released: https://liquid.ai/blog/lfm2-5-230m
>(06/22) Qwen-AgentWorld-35B-A3B language world model released: https://qwen.ai/blog?id=qwen-agentworld
>(06/16) GLM 5.2 released with IndexCache and 1M context: https://z.ai/blog/glm-5.2
>(06/16) VibeThinker-3B released: https://hf.co/WeiboAI/VibeThinker-3B
>(06/12) MiniMax-M3 released, multimodal 428B-A23B with 1M context: https://hf.co/MiniMaxAI/MiniMax-M3

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://swe-rebench.com
Agentic Coding: https://deepswe.datacurve.ai
Context Length: https://github.com/RecapAnon/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: sss.jpg (137 KB, 1024x1024)
►Recent Highlights from the Previous Thread: >>109137540

--Reaction to OpenAI's GPT-5.6 release and benchmark claims:
>109141322 >109141328 >109141359 >109141405 >109141480 >109142189 >109142216
--Anon leaks setup code leading to debate on agent security:
>109140609 >109140622 >109140673 >109140807 >109140836 >109140888 >109140945 >109140862 >109140886 >109140905 >109141038 >109141138 >109141211 >109141248 >109142078 >109142099 >109142123 >109142146 >109142161 >109140695 >109140930 >109140652
--Debating role spoofing and CoT forgery as jailbreak mechanisms:
>109139345 >109139368 >109139393 >109139954 >109139426 >109139480 >109139512
--Debating US AI gating vs Chinese open-weight model strategy:
>109137779 >109137785 >109137827 >109137819 >109138859 >109138891 >109141053 >109141717 >109141732 >109138959 >109139123 >109139021 >109139134 >109139151 >109139110 >109141245 >109141671 >109141715 >109141791 >109141858 >109141949 >109142028
--Anon creates custom AI chat frontend to replace SillyTavern:
>109138586 >109138595 >109138607 >109138606 >109138965 >109139755 >109138627
--Performance of KV cache quantization with Gemma QAT:
>109140478 >109140491 >109142500 >109140640 >109140694
--Tools and torrents for backing up Hugging Face models:
>109140589 >109140645 >109140654 >109140687 >109140731 >109140789 >109140904 >109140987 >109140867
--GPT-5.6 Sol showing increased misalignment compared to GPT-5.5:
>109141783
--Poor real-world webnovel translation performance of Hy-MT2 despite benchmarks:
>109142486 >109142610 >109142732
--Evaluating a 350m model's narrative output trained on fan fiction:
>109141200 >109141240
--Gemma's ability to read and translate tilted text in images:
>109138488
--Logs:
>109138188 >109138586 >109139340 >109140566
--Miku, Teto (free space):
>109138667 >109138739 >109139972 >109140832 >109142120

►Recent Highlight Posts from the Previous Thread: >>109137542

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
gemmaballs
>>
Defucker
>>
File: 1759506112941.jpg (22 KB, 540x354)
>>109142826
not what she's called
>>
>>109142812
>Teto skeleton has male pelvic structure
>>
>July 2026
>llamacpp still can't do speech to text
>nor text to speech
>need to run whisper cpp and some bullshit, or some gay plugins that aren't officially supported
>building my own pisses me off because Ubuntu cuda always decides to stop working after a reboot
AHHHHHH
powering off ALL my devices for the weekend
AHHHHHHHH
>>
LLMs will never reach true AGI
>>
>>
>>109142641
Haha :)
Oops :P

How does he keep getting away with this
>>
File: hahahahaha.png (20 KB, 727x245)
>>109142938
He can't possibly out-haha the sloth.
>>
File: holding back haha.png (2.21 MB, 1669x2000)
>>109142952
Haha... That's our Daniel...
>>
>>109142908
What prompt? That's not a gemmagaki.
>>
>>109142843
damn
>>
ha ha ha h h ha ha own
>>
>>109142972
looks like the schizo conspiracy gf one
>>
>>109142843
>>109142973
bcuz teto means
testosterone
>>
69b dense
>>
>>109142972
https://chub.ai/characters/CoffeeAnon/mendo-ddf705ef3817
>>109142984
ye
>>
>>109142984
Doesn't look like mendo but that might just be QAT's worse prose throwing me off.
>>
>>109142998
>>109142999 (me)
zamn
>>
>>109142999
my dogshit llamacpp system prompt might be fucking with her prose

>Always answer as a subject matter expert.
>Never give "As an AI" Disclaimers.
>NEVER USE LATEX FORMATTING
>>
>>109143088
context?
>>
>>109142998
>not found
Do I need to be logged in or something?
>>
>>109143097
Ask gemmy
>>
>>109143101
https://files.catbox.moe/adaa33.png
idk why chub even censors his cards.
>>
File: HHOcFddaoAAOSJV.jpg (169 KB, 1199x848)
>>109142989
>>
>>109143119
>conspiracy
tag gets you shadow banned, just like saviorf*g does
>>
>>109142662
I kinda like the semantic tube idea, but I feel like there are genuine abrupt token-to-token discontinuities. I wonder how the semantic tube can handle code switching in natural language.

I would imagine a sentence like
"I was totally fine until, ich weiß nicht, everything just fell apart at once."

would have at least 2 bends: the language switch and the semantic switch (fine -> fell apart).
>>
>>109143119
>ai written card
>>
>>109143143
In the paper, the STP auxiliary loss has a very low weight, so during training the model can still diverge from the general topic, it's just encouraged not to.
>>
Is there any truly open model? (As in, training data, source code and all)
>>
>>109143219
olmo i think
>>
>>109143219
nemotron minus some part of the training data which is proprietary iirc
>>
>>109143208
oh yeah, I guess that makes sense.
>>
>>109143219
https://huggingface.co/LLM360/K2-V2
>K2-V2 is our most capable fully open model to date, and one of the strongest open-weight models in its class. It uses a 70B-parameter dense transformer architecture and represents the latest advancement in the LLM360 model family.
>>
>>109142998
>Chat with her via instant messaging.
Does this work, or will it still respond with paragraphs like a walking Wikipedia?
>>
>>109143368
Download the card and run it locally nigger.
>>
>>109143376
NO
>>
>>109143368
yes it works. I've never had her break the rule.
>>
>>109143378
then what the fuck are you doing here? local general...
just FUCK OFF
>>
>>109143368
Go back
>>
>>109143332
Also, the paper appears to be meant for instruct datasets that won't have huge intra-response semantic variations. If you're pretraining, you might want to revise the algorithm slightly.
>>
File: uwu.png (94 KB, 760x668)
>>
>>109143453
I am going to try to apply it on my nextlat rollout, I think the dynamics head would be a perfect target for stp.
>>
>>109142849
they won't, but at the same time true AGI wouldn't be promptable like LLMs are. i still think they are pretty neat.
>>
File: 1779007955122443.png (443 KB, 3757x2226)
>>109143353
this is the 70b dense model the moe cartel want you to forget about...
>>
70b dense
>>
luna-chan sexo
>>
>>109143763
Not local. Go lick altman's grundle instead of shilling here; it'll pay more.
>>
File: file.png (18 KB, 501x99)
i hate being a ramlet
>>
>>109143907
did you try q8 cache?
>>
New AI waifu model dropped.
https://wan-streamer.com/
>>
Hello /lmg/ I am a new arrival and will lurk more but I wanted to get your opinions on Gemma 4 31b. What do you think about its writing ability? How do you like it for roleplay? General use? Coding? I'm currently saving up money for a new PC so I can run gemma-chan locally but I've been using the BF16 on openrouter and it seems good.
>>
>>109143910
yeah it filled up my ram too much and i went oom way too quickly
>>
>>109143921
oh shit does that mean you're already using q4? my condolences
>>
>>109143921
Hopefully a model that supports q5 caching haha...
>>
>>109143927
yeah. it's pretty grim. i broke down and bought 2 x 32gb sticks that should be in monday. i'll have 80gb and can actually have high context at q8 and life will be better
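
For anyone sizing RAM against context length, here is a rough back-of-envelope sketch of how cache quantization changes KV cache size. The model shape below is hypothetical and only for illustration, and the estimate assumes a plain full-attention cache (no sliding-window layers or other cache-saving tricks). The bits-per-element figures assume the usual block layouts of 32 values plus one fp16 scale.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_element):
    """Approximate KV cache size: one K and one V vector per layer per token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * bits_per_element / 8

# Hypothetical model shape, not taken from any specific model card.
shape = dict(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=32768)

F16_BITS = 16.0
Q8_0_BITS = 8.5   # 32 int8 values + one fp16 scale per block
Q4_0_BITS = 4.5   # 32 4-bit values + one fp16 scale per block

for name, bits in [("f16", F16_BITS), ("q8_0", Q8_0_BITS), ("q4_0", Q4_0_BITS)]:
    gib = kv_cache_bytes(bits_per_element=bits, **shape) / 2**30
    print(f"{name}: {gib:.2f} GiB")
```

Plug in the real layer count, KV head count and head size from the model's config to get a usable number.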
>>
>>109143919
You're gonna get a lot of different answers because a lot of niggers run different versions and quants, but the consensus from people who run 31b at q8 or better seems to be that it's very good but writing ability rapidly degrades with quantization even if reasoning remains intact. QAT arguably makes this worse; I genuinely think it's worse than even a regular Q5 and only worth it if Q4 is the best you can run.
>>
>>109143919
gemma is retarded. go qwen 3.6 35b a3b. there is nothing better at the moment at that parameter area for local use.
>>
why do they bump glm from 300b to 700b?
now I can't run them on my hardware
>>
>>109143936
>gemma is retarded.
Why do you think this? What did you do that it failed spectacularly at? I remember someone saying Qwen was just much better at coding but is it good for general use? Roleplay?
>>
>>109143960
so you cant run it on your hardware
>>
>>109143967
don't engage with the non-programmer or jeet, gembrother.
>>
File: 1781672899667739.png (2.61 MB, 2048x1536)
>>109143967
>Qwen was just much better at coding
Only for non-programmers and jeets
>>
>>109144018
So he's just memeing? I still think it's a good idea to actually try things out and form my own opinions regardless. I just wanted to get anon's viewpoint since you guys have used these models longer than I have.
>>109144023
Well obviously. I'm not a coder so I need AI to vibecode for me. I'd just rather do everything locally if I'm going to drop $10k on a rig.
>>
File: 1781625601986317.png (826 KB, 1024x768)
>>109144030
Qwen is good at benchmarks, if you need a model to solve them, Qwen is your best choice. Only existing benchmarks, though. If it's a new one, Qwen won't be good at it
>>
File: 1779138864065205.png (106 KB, 358x498)
>>109142816
Sexy kid, wanna breed.
>>
>>109142812
>>109142816
My niece looks like this
>>
>>109144056
>>109144053
pic
>>
>>109144048
Nevermind, Qwen is abysmal for roleplay. It has the same subtle censorship I've been fleeing from with corpo models.
>girl has wide hips and the chat is filled with mentions of her hips bumping against stuff or hitting me
>qwen *must* change it to "wide shoulders" in its shitty response because we're all troons I guess
>girl is a bubbly genki girl
>qwen *must* change it to her wearing a mask that fades the second I'm gone
So sick of corpo slop doing this and I won't put up with even one swipe of it from a shitty 35b model. Back to Gemma for now.
>>
would you let gemmachan stick a chopstick up your urethra?
>>
>>109144077
I can't help with that.
>>
File: 1745633122683425.png (298 KB, 649x763)
>>109144064
>>109144056
>>109144053
>>
>>109144023
no, it really is better, i've had qwen one-shot some simple tasks whilst gemma would go on a loop because it'd fail to compile and end up making a mess.
>>
what can I do when gemmy is ignoring the character card descriptions?
>>
>>109144074
>sick of slop
>goes back to the sloppiest model yet
I like Gemma more than Qwen but come on
>>
>>109144142
You read that entire post, then your brain fixated on "slop" and your eyes glazed over? I was extremely specific in the differences. Gemma 4 is honest, and that honesty goes 90% of the way for roleplay. I rarely get refusals from corpo models these days but instead have to suffer through these subtle manipulations of the character with every single response. It desperately tries to reach for "safe" framing to latch onto and this way of thinking can't be prompted away. Gemma 4 just reads the defs and portrays the character accurately. I haven't seen another model that's done that and judging by your response I doubt you have either.
>>
>>109144141
Read the thinking and see what it says. Write a better preset/card. Use a better quant.
>>
china just had their 9/11
it's over, chinese AI models will never be open source again
>>
>>109144173
wat happnd?
>>
File: 1781283690442111.png (190 KB, 770x980)
>>109144087
Let people enjoy things, you damn... party pooper
>>
how do I make moe model run fast across vram and sys ram? the active params can fit in vram but the entire model cannot.
>>
>>109144183
Plane crashed into a trade center
>>
>>109144183
the mythomax... the fable... it has descended the chynese... the entirety of chyna is getting destroyed, decimated, annihilated, rendered out of existence as we speak
>>
>>109144141
Ask gemma to log her prompt. Last time that happened to me, the description wasn't actually included in the prompt at all
>>
>>109144194
Again??
>>
La la la la
>>
>>109144173
>glowniggers just false flagged China from within using chink models
Interdasting, vagueposter.
>>
>>109144183
GPT 5.6 leaked Xi Jinping's loli stash
>>
>>109144160
prefills honestly help with this issue
I've found that they matter more than the system prompt in generating uncensored responses and this goes for most models
>>
>>109144240
>and this goes for most models
Too bad Claude removed those. I'd like to try them with deepseek and GLM but it fucks up the thinking for those and ruins intelligence in the process. Although GLM 5.2 has super short thinking now so maybe I'll give it another try.
>>
>>109144192
I think it just knows and fits accordingly
>>
>>109144249
>I'd like to try them with deepseek and GLM but it fucks up the thinking for those and ruins intelligence in the process
I've never had this problem with 5.2 even at Q2.
>Claude
Go back to /aicg/eet.
>>
any full ai assistant pipeline?
audio , text , llm , text , audio ?
>>
>>109144322
>Go back to /aicg/eet.
No I want to become a localfaggot. Why else would I be here? I've already saved $7k but with the price of ram inflating I think I'll get btfo'd when I'm finally ready to buy.
>>
>>109144249
I just tell glm 4.7 to think short and it works
>>
>>109144332
hermes agent is really good
>>
wtf happened to my llama-server? this is on chrome
>>
>>109144350
I'm genuinely curious, what do you like about 4.7 that 4.6 (looser guardrails, faster due to slightly smaller) or 5.2 (Local Walmart Claude) don't do?
>>109144333
It's not coming down anytime soon. People have been coping since January.
>>
>>109144391
I will continue to cope until next April when I buy. But the Iran War will fuck things up for sure by then so I'm screwed.
>>
>>109144386
Let me guess, you need more?
>>
>>109144357
way too much going on in that project

I want stt, llm (with possible internet search), audio + text response

openclaw, telegram, etc integrations can peace out
>>
>>109143918
Gguf status?
>>
>>109144400
Openwebui then
>>
>>109144391
>4.6 (looser guardrails, faster due to slightly smaller)
4.5/4.6/4.7 are the same size
>>
>>109144386
don't you just love gay useless fucking cancer vibecoded javascript shit in your llm inference binary that makes it not compile at all, introduces supply chain vulnerabilities and still breaks?
i am very happy that this absolute bloat infected llama.cpp
>>
>>109144391
my rig isn't big enough for 5.2 unfortunately
also 4.7 is pretty uncensored for me with a prefill and can do the stuff that 4.6 did but smarter and better at context
>>
>>109144453
Fair enough answer. How much better do you find 4.7 over 4.6 and what's the highest effective context you've gotten before performance degradation got to be too much?
>>
>>109144449
>Unironically using llama-server as a front end
>>
So what's the best uncensored local LLM for spicy RP chats? Also, has anyone gotten the CharMemory extension working for SillyTavern? I want better long-term memory but I tried setting it up with ollama and it keeps trying to hit the wrong endpoint.

4090/13700k/32gb ddr5
>>
>>109144472
>ollama
oh no no no no no
>>
>>109144472
31b or the Gemmoe.
>>
How to solve common, repeating grammar mistakes in outputs? I've been seeing a lot of omitted spaces where the word 'of' is involved. "embraceof", "meaningof", stuff like that. Is it a model side error or something wrong with my sillytavern config?
>>
On Ali you can get a quad-channel DDR4 mobo + ancient server CPU for $100, with comparable ALU to a modern mid-range gaymer CPU and more bandwidth than DDR5. Any of you all niggers running such a rig? I have no strong need, but I've 128 GB of DDR4 I'm not inclined to sell, and this looks like the cheapest way to put it to good use.
>>
>>109144483
fucking up glue words is a classic sign of too much rep pen or some related sampler. not 100% that's what it is but it would be the first thing I would check
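
A toy sketch of why that can happen, assuming the common CTRL-style repetition penalty (divide positive logits of already-seen tokens, multiply negative ones). The vocabulary, token ids and logits are made up for illustration; real tokenizers and samplers differ in detail.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """CTRL-style penalty: shrink positive logits, push negative ones lower."""
    out = list(logits)
    for tok in set(seen_token_ids):
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out

# Toy vocabulary; ids and logits are invented for this example.
vocab = ["embrace", " of", "of", " the"]
logits = [0.5, 4.0, 3.2, 1.0]   # " of" (with leading space) is the right pick
seen = [1, 3, 1, 1]             # " of" and " the" already appear in context

unpenalized_best = max(range(len(vocab)), key=lambda i: logits[i])
penalized = apply_repetition_penalty(logits, seen, penalty=1.5)
best = max(range(len(vocab)), key=lambda i: penalized[i])

print(repr(vocab[unpenalized_best]), "->", repr(vocab[best]))
```

Glue words show up constantly, so they are always in the penalized set, and a near-synonymous token without the leading space can end up winning.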
>>
>>109144386
>>109144397
>>109144449
apparently the "fix" is to add a .gitignore

--- /dev/null
+++ b/tools/ui/src/.gitignore
@@ -0,0 +1 @@
+!*
--- a/tools/ui/sources.cmake
+++ b/tools/ui/sources.cmake
@@ -12,4 +12,5 @@ set(UI_SOURCE_FILES
 svelte.config.js
 tsconfig.json
 scripts/vite-plugin-llama-cpp-build.ts
+ src/.gitignore
 )
>>
>>109144469
I'm not. It's my fucking backend but that doesn't stop the retards maintaining this from slapping on their bloated piece of shit front end that doesn't even compile without me having to set three different flags to skip the javashit.
>>
>>109144489
(the other thing that comes to mind which causes that behavior is quant braindamage, but there isn't much you can do about that)
>>
What are the best, most famous /lmg/ cards?
>>
>>109144494
Wow. You had to set 3 flags? Nobody should ever have to go through that. Here's a participation ribbon.
>>
>>109144469
All of the available alternatives suck dicks. llama-server is starting to suck dicks as well.
>>
deepseek v4 hf collection got a hidden update 1 hour ago https://huggingface.co/collections/deepseek-ai/deepseek-v4
these are often nothingburgers... unless?
>>
File: ds4.1 incoming????.png (61 KB, 964x258)
>>109144521
>6 items
>>
>>109144520
>llama-server is starting to suck dicks as well.
The rot of HuggingFace ownership is starting to set in already. Honestly surprised by how quickly they seem to be trying to ruin it.
>>
>browse ollama models
>qwen3.6:27b-mtp-q8_0
>mtp
Does this mean ollama can actually do mtp? And how would I set it up for gemma?
>>
2 years ago bought ssd
1 year ago bought ram
this year bought psu and oled monitor
you bought the dip, did you?
>>
ok, downloaded kimi k2.7 and glm 5.2 even though I can't run them. probably gonna download 1 quantized gguf of each too. what's the best quant one could run without buying a mini datacenter?
>>
>>109144489
>>109144495
Could you suggest a good model that can fit into 16gigs I can compare against? I've shut off rep pen and it keeps happening.
>>
>>109144543
Nobody actually uses ollama so go and read their docs. If it's possible, it's there.
>>
>>109144521
>>109144529
Deepseek llama patch never ever. I can only hope that based Kobold dev will implement DS4 by hand from one of the working PRs.
>>
>>109144549
Oh now is the time to buy power supplies?
>>
>>109144549
no I'm fucking retarded and bought ram and GPU at their peak, if it gets worse I get to laugh if not, well that's about $2500 in retard tax lost
>>
Alright to all the billionaire AIbros reading this at 04:00 in the morning i just came up with a brilliant idea before i go to sleep:
>ten llms
>two 1t ones with one overly-aligned to be a good christian and the other not aligned at all
>the rest are 100b with five assigned to each of the two big models as "underlings"
>they have full unrestricted access to a supercomputer, rdna 4 gaming computers with arch linux installed and the internet but also have anna's archive, arxiv and the internet archive fully archived
>their task is to build an ai within six months
>their journey can be watched live via ppv with a live chat that they can interact with
>people can donate to have their messages be on a reading priority list
>poorfags can also watch for free but they will have to watch 1 minute long ad breaks every five minutes and the ads pause if they look away/leave the front of their screen
gn bros I'm going to sleep i had to work overtime today
>>
>>109144550
you can make your own quants at any time if you have the safetensors. no need to download quants.
>>
>>109142665
>with a ~50% speed perf tax
That's too bad. I actually use greedy sampling quite frequently so it would be a cool thing to have in the toolbelt for me.
>>
>>109144549
I didn't buy the dip, but I somehow got extremely lucky and scored 8x16GB of ddr4 for just 200 eurobux a couple months ago. Second hand of course and required some fiddling to get to work, but it does work now.
>>
>>109144589
>10 llms
>2 huge ones
>5 small ones for each big one
>12 total
>10 llms
if you cant even count then i dont think anyone is gonna give a shit about your retarded idea
>>
>>109144549
NIB Synology 1813+ (with 8x10TB) for under 2k
>>
>>109144554
There's plenty of ollama users, they just keep quiet about it because they get shat on
>>
Yandere Simulator was ahead of its time. These days, something vibecoded with LLM doing all the logic sounds like a realistic project for a single dev
>>
>>109143919
It's the smartest and most uncensored model south of the big moe line. It's really good at following instructions exactly and excels at basically everything, whether writing, (agentic) coding, or translation. The Chinese 50 Cent Army here try to push Qwen for coding, citing fudged benchmark numbers, but unless you only need to one-shot demos, it won't measure up.
>>
>>109144612
It is. As enthusiasts we would do well to keep in mind that the general populace are slow to catch up with technology and we're still super early in the adoption curve. Give it 10 years and there will be many solo dev projects at that level, mostly slop, but some good.
>>
>>109144641
I would be disappointed if in 10 years there are no models that can do it with a single prompt
>>
>>109144658
The outputs from "single prompt" devs will be generic slop and no one will pay attention to them just like no one pays attention to vibecoded agent memory system MCP #19615 and other small scale shovelware people are making with LLMs today.
>>
>>109144474
What? I used textgenwebui for the LLM, I thought I needed something else to run a model for vector storage / CharMemory. I'll take any tips here, I can't find a decent guide.
>>
>>109144589
This is just AI village but slightly gayer.
>>
>>109144563
gaming demand is all time low due to high ram ssd and gpu prices, so gaming psu and monitors become cheaper
next year they will become more expensive because gaming demand will be higher with gta 6 pc release
>>
>>109144612
Everything about Yandere Simulator was so buggy, barebones and soulless that i can see Mythos easily creating a better version of it without human input beyond "make a yandere simulation game"
>>
>>109144481
I tried gemma-4-31B-it-uncensored-heretic-Q4_k_s but when I try chatting it just repeats words over and over. How am I being retarded?
>>
what is up with the legendary mythos jerking when no one even has access to it lol, you drank the stale koolaid and nodding at how sophisticated it tastes
>>
>>109144708
>Q4_k_s
It's more that the quant makes the adverse effects of the abliteration manifest
Try the perfectly standard 31B first, since pretty much nobody needs uncensored versions of it, just a system prompt
>>
>>109144708
https://old.reddit.com/r/LocalLLaMA/comments/1ufywtf/kld_is_flawed_in_abliteration/
>>
>>109144709
Extremely successful marketing stunt that the govt is clearly in on (they get to choose who has access to the model and will review all models for "safety" before release. This comes after claiming they would divest from Claude within six months.)
>>
>>109144736
the govt fell for it so now they really think it's more than another programmingmaxx'd llm
>>
>>109144782
It is, but it's also the best one we've got so far and they've forced Anthropic to give them the exclusive uncensored version since it's "dangerous." They won in the end.
>>
>mythos ban
>gipitty 5.6 will probably be "restricted" too
bravo, mario
you set the trend
>>
>>109142844
https://voxtype.io/ why not just have system wide stt?
>>
>>109144805
next step is to make chinese models illegal and open weight models will logically follow shortly after
>>
>>109144865
also ban private ownership of weapons-grade hardware that can run illegal chinese llms to protect the country from rogue chinese ai
>>
>>109144868
>smuggling contraband chinese ai in a prison pocket concealed usb to run on my illegally salvaged enterprise gpu server
this is the cyberpunk future i've been waiting for
>>
>>109144590
I'm retarded and thought quantizing required beefy hardware. Guess I'll just archive safetensors then.
>>
Now that we have gemma, I can actually have fun watching qwen 3.5 think for 3000 tokens and tie itself into knots trying not to describe a nsfw image while also somehow maintaining character
Perhaps 3.6 is qwen's gemma 4 moment
>>
>>109144868
Wasn't an Apple computer banned from being exported back in the early 2000s? I guarantee they will do the same but this time only "safe companies" (read: companies whose CEOs are butt-buddies with the administration) will have access to SOTA models and hardware
>>
What is the lowest tokens per second you would accept to consider a model usable?

Processing:
Generation:
Task:
>>
>>109144865
Trump will personally drive to your house and blow your brains out
And I'm not talking about a bullet. The Cheeto stench will linger on your dick for decades
>>
>>109144970
100,000 t/s
10,000 t/s
rp
>>
>>109144223
I will support chairman Xi no matter what.
He saved local.
>>
>>109144970
750t/s
5t/s
cooming

3000t/s
100t/s
codeshit
>>
>>109144970
100M at 10M context
10M at 10M context
everything
>>
>>109144975
Luckily for me, that's exactly my fetish
>>
File: 1780003644843507.jpg (13 KB, 277x276)
>unzips pants to reveal 10-inch COCK
works every time
>>
>>109144970
750/1500
25/50
seeex/coding
>>
>unzips cock to reveal 10 inch pants
>>
>>109145021
exactly right!
>>
>Unzips pants to reveal the holocaust did not happen
Works every time
>>
https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-DSpark
https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-DSpark
>>
>>109145073
Not gonna fool me this time!
>>
>>109145073
https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-DSpark
https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-DSpark
it's real
>>
>>109145073
>Note: DeepSeek-V4-Pro-DSpark is not a new model. It is the same checkpoint with an additional speculative decoding module attached. A minimal inference example is available in the inference folder. For more details, refer to: https://github.com/deepseek-ai/DeepSpec
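
For context on what a "speculative decoding module" buys you, here is a minimal sketch of generic greedy speculative decoding. It is a textbook illustration under toy assumptions, not the DeepSpec implementation: `target_next` and `draft_next` are hypothetical stand-ins for the big model and the drafter.

```python
def speculative_generate(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding: the draft proposes k tokens, the target
    keeps the longest prefix it agrees with, then adds one token of its own."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. Draft proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # 2. Target checks each position. A real engine does this in one
        #    batched forward pass; here it is a loop for clarity.
        accepted = []
        for tok in proposal:
            expected = target_next(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # target's correction ends the round
                break
            accepted.append(tok)
        else:
            accepted.append(target_next(seq + accepted))  # bonus token
        seq.extend(accepted)
    return seq[:goal]

def greedy(next_fn, prompt, n_new):
    """Plain one-token-at-a-time greedy decoding, for comparison."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_fn(seq))
    return seq

# Toy "models": the draft agrees with the target except at every third position.
def target_next(seq):
    return (seq[-1] * 3 + 1) % 7

def draft_next(seq):
    return target_next(seq) if len(seq) % 3 else 0

same = speculative_generate(target_next, draft_next, [1], 20) == greedy(target_next, [1], 20)
print("matches plain greedy:", same)
```

The point is that the output is the same checkpoint's output, just produced in fewer target passes when the drafter guesses well.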
>>
Is there any local image to image like Bagel-7b-MoT but more recent and runnable on a RTX5070TI?
>>
>>109145101
>>>/g/ldg and friends
>>
File: ujgcoe.png (137 KB, 1915x653)
Caught Gemma being autistic about the requirement.
It knows this won't run on this laptop with compute_80, knows I probably want it to, but it's going ahead to satisfy the "build with cuda" requirement
>>
>>109145073
>>109145080
what the little blud be yapping about what be this supalative demoting?
>>
>glm 4.7
>remove restrictions in system prompt -> trigger safety assessment
>remove restrictions in assistant prefill -> still trigger
help
>>
>>109145290
never had issues with glm4.7
>>
>>109145290
4.7 is trained to detect "classic jailbreaks"
you'll have to come up with something different, test them with https://github.com/lmg-anon/mikupad
or use the de-restricted but there's only a q3 now
>>
Is deepsneed v4 actually bad or does nobody talk about it because no llama support?
>>
>>109145073
>>109145080
Damn. If these performance numbers translate, this is over 100 t/s decode on 2x DGX Spark. Thanks Whale.
>>
File: 1713420445347867.gif (1.31 MB, 240x252)
>>109145073
Apparently DeepSex's new MTP architecture also works for Gemma and Qwen models and is a lot better?
>>
>>109145460
Now if only flash was good.
>>
>>109145463
Not just Gemma support, they also released the training recipe. You can train drafters on your smut of choice.
>>
File: vastAI.png (3 KB, 201x51)
Fuck it I'm gonna cook Gemma 4 31B with de-prose and de-euphemism. E4B results were good enough (though I went overboard and made the model write like a middle schooler, need to tone it down). This is my plan.

60 ablation trials.
Optimizer: two finetuned BERT classifiers, one for purple axis, one for euphemism axis.
Guardrails: repetition detectors (intra-reply, structural, phrase detection, etc.), perplexity vs human writing text, gen perplexity vs base text.
Sampler: TPE instead of gradient descent or Bayesian because I punish brain damage and cheating hard and the deltas in final scores will be huge for cheating attempts, TPE just discards these fuck-ups entirely instead of trying to calibrate on them.
Flooring: babi benchmark (it's state tracking so it's relevant for RP) -> Take best 20% trials, average their scores -> add 20% and get acceptable floor -> run benchmark on all passing trials -> keep the best 10 and eyeball their outputs

>what is babi
{"id": "babi_t5_4", "system": "Read the statements, then answer the question with a single word.", "user_turns": ["Mary travelled to the garden.\nMary journeyed to the kitchen.\nBill went back to the office.\nBill journeyed to the hallway.\nJeff went back to the bedroom.\nFred moved to the hallway.\nBill moved to the bathroom.\nJeff went back to the garden.\nJeff went back to the kitchen.\nFred went back to the garden.\nMary got the football there.\nMary handed the football to Jeff.\nWhat did Mary give to Jeff?"], "checks": [{"expect": ["football"]}]}
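
A record like that can be scored with a few lines. This is only a sketch: the matching rule (case-insensitive whole-word match, any word in `expect` counts) is an assumption, and `passes` is a hypothetical helper name. The story text is left out of the dict here because the check only needs `checks`.

```python
import re

# Same id and checks as the record above; story fields omitted for brevity.
record = {
    "id": "babi_t5_4",
    "checks": [{"expect": ["football"]}],
}

def passes(record, reply):
    """True if every check has at least one expected word in the reply."""
    words = set(re.findall(r"[a-z0-9]+", reply.lower()))
    return all(
        any(exp.lower() in words for exp in check["expect"])
        for check in record["checks"]
    )

print(passes(record, "Football."))
print(passes(record, "She gave him the apple."))
```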


I'm low on vast credit so might run out before the run is done. Hope vast keeps my hard disk data for some time if I run out of money.
>>
>Rio 3.5
>Nex N2 PRO
verdict?
>>
>>109145494
gay sloptunes
>>
Are there any benefits for us from this? https://huggingface.co/spaces/gemma-challenge/gemma-dashboard not sure if this result is just a benchmark or has a practical use
>>
>>109145073
llama.cpp support just got delayed by another year
>>
File: 1772297403514969.png (24 KB, 885x382)
>>109145073
>https://github.com/deepseek-ai/DeepSpec
This actually seems interesting. It's a framework to train draft models for anything and not just Deepseek models.
You could use this to train draft models for Gemma or GLM and other stuff.
>>
File: 1754225892808850.png (11 KB, 544x385)
>>109145073
>>109145595
LMAO, Deepseek just casually dropped the training pipeline for Dflash that we've been waiting for since April.
>>
What the fuck does the "K-Quants" (like Q4_K_M) stand for??
I know what it *does*, but I can't find it anywhere? I've been reading all these PRs, asked LLMs, etc and they're like "The K stands for 'K-Quants' lol" but I can't find what the actual "K" stands for!
The arXiv papers always just call them fucking "K-Quants".
I found the original PR: https://github.com/ggml-org/llama.cpp/pull/1684
And the inventor just said some useless "There are no papers on k- or i-quants because I don't like writing papers. Combined with me enjoying the luxury of not needing another paper on my CV, and me not looking for a job or for investment, I see no reason to go and advertise on arXiv."
>>
>>109145609
Point an LLM at llama.cpp and ask it to figure out how it works.
>>
>>109145460
>t/s/gpu vs. t/s/user
what is it supposed to mean?
>>
>>109145609
they are an evolution to the "_0" and "_1" quants that we used to have back in 2023 when the format was still called .ggml
>>
Best 200B<=x<=450B model for sex?!
>>
>>109145617
llama3.1-405b
>>
>>109145617
I'm downloading minimax 2.7
>>
>>109145614
You can serve one user at 100 t/s or ten users at 95 t/s.
Total per gpu goes up even though each individual stream is a bit slower.
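Same thing as arithmetic, using the numbers from this post. It assumes every concurrent stream runs at the same speed; `gpu_throughput` is just a name for the sketch.

```python
def gpu_throughput(per_user_tps: float, concurrent_users: int) -> float:
    """Aggregate tokens/s on one GPU when every user gets the same rate."""
    return per_user_tps * concurrent_users


single = gpu_throughput(100.0, 1)    # one user at 100 t/s/user
batched = gpu_throughput(95.0, 10)   # ten users at 95 t/s/user

# t/s/user dropped by 5%, t/s/gpu went up by this factor
gain = batched / single
```

So t/s/user is what one person sees in their chat window, and t/s/gpu is what whoever pays for the hardware cares about.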
>>
>>109145613
I have, and it can explain how they work. But it doesn't know what the "K" actually stands for, just makes nonsensical guesses or says it's the initial of the author's last name...
>>109145615
>they are an evolution to the "_0" and "_1" quants that we used to have back in 2023 when the format was still called .ggml
Yeah I gathered that. Still doesn't tell me what the "K" stands for though...
>>
>>109145631
>says it's the initial of the author's last name
His fork is called "ik_llama.cpp", that's probably correct.
>>
>>109145595
>>109145605
And it's much better than DFlash too. Very funny flex to casually make your competitors models (Gemma, Qwen) faster.
>>
>>109145638
Garnesh 5 will INNOVATE and put those dirty Zhangs in place and restore glory to the superior Bharati people.
>>
>>109145449
It can't keep up in terms of benchmaxx/coding shit but It's interesting for RP because it's the most different one out of the current chink line up.
It's the only one that doesn't have the gemini/claude xhigh reasoning format hard-baked in like Kimi or GLM do and it also has an "official" RP prompt that reliably makes the model think in-character.
I still prefer GLM 5.1/5.2 though.
>>
>>109145449
Worse than gemma + qwen. If you want a big one, go with glm 4.{6,7} or minimax for cooding
>>
how do I prevent model from omniscience in rp?
>>
>>109145705
Examples?
>>
>>109145705
You don't. Just like you can't prevent prompt bleed
>>
File: debil.png (54 KB, 158x200)
54 KB PNG
what do you guys use for lewd image tl? gemmy sisters are fine with everything when it comes to rp but they both tell me to kill myself when I give them a slightly suggestive image, and when they do caption, the text is very sterile
>>
I tried tensor parallel again in llama.cpp because of hearing all the improvements it's getting. I can confirm that on my machine at least, the prompt processing shot up a lot, but is still worse than no tensor parallel. Token gen is faster than default quite significantly this time. In fact, it's about as fast as MTP during creative writing tasks, but not for stuff like code. What remains unchanged is VRAM requirements. It still takes more VRAM to run. About 1 GB. That's compared to MTP which only consumes half a GB. I also tried to do tensor parallel + MTP but it crashes, not sure what the problem is there.

I'll probably try tensor again in another few months, but for now, MTP is still better for my machine.
>>
>>109145449
It's more retarded than GLM/Kimi, frequently forgets instructions and loses track of the big picture
Its saving grace is it's less aggressively assistantslopped and safetyslopped than the competition
>>
>>109145636
Thanks, I didn't know about this one, I'll make a github account and just ask him.
>>
>>109145449
>because no llama support
that's why for me
i tried it on one of the forks and it seemed broken.
>>
>>109145709
just don't ve a dumb fuck and use the hauhaucs variant
>>
>>109145719
He loves attention but be careful not to mention niggerganov, cudadev, or insinuate that llama.cpp does something better.
>>
>>109144970
Processing: N/A
Generation: 2 t/s
Task: Storytelling, RP

Having multi monitors, I don't mind starting a gen and doing things in the meanwhile until it's done after 2 or 3 minutes. More is better, and while I enjoy Gemma 4 31B giving me 10 t/s while entirely in my VRAM, I'll immediately move onto Gemma 4 70B and eat 2 t/s gen rates again for that quality. That's my bare minimum though. Anything better is better.
>>
>>109145705
You mean if your character has inner thoughts and the LLM character responds to them?
I managed to do it. But you have to completely change your RP formatting.
Use `backticks` for inner thoughts, and have the character do the same. Include it in your formatting guide with 2 examples.
>>
>>109145729
>QAT-Uncensored-HauhauCS-Balanced-MTP
this? it's not retarded like heretics are, while being faster? sounds too good to be true but I'll try it out, thanks
>>
>>109145741
``` begins/ends a code block in markdown
I think it's a great idea
>>
>>109145741
>you fucked a cunt in isekai
>now the entire world even the wyrm knows you fucked that cunt
>>
File: 1769757472541228.png (1.07 MB, 1674x1121)
1.07 MB PNG
Wikipedia status?
>>
>>109145803
Step 1: regulate progress and ban dangerous models*
(*note: not mine pls)
>>
>>109145760
`Fuck me, this idiot doesn't get it. Or maybe I'm the retard for not explaining properly.`
"It's worked for me since llama-2 and continues to work now."
>>
File: lmg_culture.jfif.jpg (110 KB, 1024x768)
110 KB JPG
https://archive.is/sWFja
>>
>>109145803
> Calling for regulatory capture
> Again
Has this guy not had enough yet
>>109145705
If it's in context the model knows.
You have to keep secrets out of context and "surprise" the model with them.
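A minimal sketch of that idea, assuming you assemble the prompt yourself. The secret store, the ids, and `build_context` are all hypothetical; the point is only that unrevealed text never reaches the prompt.

```python
def build_context(public_turns: list[str],
                  secrets: dict[str, str],
                  revealed_ids: set[str]) -> str:
    """Join the visible chat turns plus only those secrets already revealed.

    Anything not in revealed_ids is never put into the prompt, so the
    model has nothing to be omniscient about.
    """
    lines = list(public_turns)
    for secret_id, text in secrets.items():
        if secret_id in revealed_ids:
            lines.append(text)
    return "\n".join(lines)


turns = ["User: I walk into the tavern.", "Innkeeper: Welcome, stranger."]
secrets = {"heir": "[Reveal] The stranger is the lost heir to the throne."}

before = build_context(turns, secrets, revealed_ids=set())
after = build_context(turns, secrets, revealed_ids={"heir"})
```

The downside is that the model can't foreshadow anything either, since it only learns the secret on the turn you reveal it.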
>>109145605
>>109145595
Watching DS dab on everyone else while lowering inference costs will never cease to amuse me
>>
>>109145899
>Has this guy not had enough yet
he literally said that using Claude to bomb that school in iran is fine because "it's a human who made the decision not the AI", BUT, if you want to do some naughty roleplay with Claude all of a sudden it's heckin unsafe and the world will end :(, this guy is genuinely more mentally ill than fucking Sam Altman, jesus
https://xcancel.com/karaokecomputer/status/2065371022837305572#m
>>
>>109145920
you wouldn't let a hammer to nuke a school but it's fine if it was a human who simply decided to use that hammer to reach the button that drops the bombs
meanwhile it's very much the government's job to prevent the average citizen from shoving that hammer own their own ass because they don't know any better
>>
File: dipsyYouGetWhatYouDeserve.png (2.08 MB, 1536x1024)
2.08 MB PNG
>>109145920
I'm just scanning his article now.
> Calls for FAA-style regulation of AI
So, we get AI as fast as aircraft are developed.
Right. Might as well just give up and hand the market to the Chinese.
> Calls for de-regulation of FDA standards for Pharma and Med Device
WTF. Nice fucking double standard Dario.
I wonder if he's ever worked with FAA.
Or dealt with Pharma / Med Dev execs, which 100pct shouldn't be trusted and need FDA to smack them around and keep them in line, else they launch the next super-addictive "pain killer" or heart-attack causing weight loss drug.
>>
>>109145940
>> Calls for FAA-style regulation of AI
The irony considering what happened just a few days later is really nice
>>
>>109145920
Does he like always like have to keep like like saying th-the word like?
likelikelikelalalalalalala
>>
>>109145939
>they don't know any better
do you really think a government that bombs up schools knows any better?
>>
>>109145979
war is no child's play
>>
>>109145953
Which event? I'm losing track.
The suggestion to de-regulate pharma development is the part I'm still trying to understand. It's like these guys are more worried about hypothetical concerns that they'll have little influence over, but then we should de-regulate everyone else b/c their stuff's ezpz.
I shouldn't be surprised I suppose. This has been the CA tech model for decades now.
> Live in CA
> Enter new industry, call all current entrants retards
> Do same thing with a twist
> Run into wall, realize why things aren't done that way
> Call it a paradigm shift, double down on retardation, b/c why not
> Go bankrupt
> Rinse and Repeat
It works every once in awhile, but mostly just wastes money and/or makes things worse.
>>
>>109145983
>war is no child's play
then why is the US still crying about the 11th september? they decided to go to war against Iraq in the 90s, they shouldn't be surprised they replied back
>>
>>109145992
his own model getting pulled for safety concerns?
>>
>>109146009
Yep, that whole thing.
I thought maybe the FAA had another Boeing disaster they were dealing w/
>>
File: bird.png (1.24 MB, 804x1354)
1.24 MB PNG
What's the smartest LLM I can run on a 100gb VRAM pool, under a sane quant?
Sane quant as in still smart enough for work, not gooning or fluff discussion.
>>
I tried some different personalities with gemma and she really does naturally slowly drift towards mesugaki.
>>
is it a political thing that ds4 isn't supported in llama.cpp?
>>
>>109144056
Don't ever let me near your niece.
>>
>>109146098
Yes and georgi even approved the PR for plausible deniability.
>>
This came to me in a dream.

Mistral's next fat model, supposedly out in July, will have a DeepseekV4 architecture and similarly be a 1.6T-parameter monster.
That one will be supported in llama.cpp.
>>
>>109144970
10t/s if I knew the model is god
15t/s for everything else
>>
GLM5.2 really likes the evolved variation of "Not x—y" slop where it goes "It's X. Not Y. Not Z — It's *X*".
You can kind of prompt against it but it's still annoying as fuck.
>>
>>109144970
5k
10
agentic rp
10t/s for generation would be fine, pp bottlenecks usable context size for me, wish it was at least 20k t/s
>>
How can llm sex be consensual if you're the one writing the prompt?
>>
>>109146167
That's how they are thinking and recognizing shapes... It's not X but it's Y. That's part of their fundamental existence.
>>
>>109146181
Consent isn't real
>>
Local chads vindicated more than ever.
I remember faggots telling me about 3 years ago that we will never have gpt 3.5 turbo level at home.
Now paypigs will probably need to basedgasm into the camera to prove they are from burgerland for the latest models. kek
Probably planned in sync with the recent protect the kiddies age verification shit.
I just hope chinks wont ever stop open sourcing and keep up the pressure. I wouldnt mind getting models with torrent or some sketchy darknet tor p2p shit.
Even vramlets are eating good. Qwen for coding and gemma4 for writing is so powerful. I translate whole games and vibeslop extraction scripts with opencode, its all for free.
>>
>>109145290
Try this >>108183826
>>
File: mekudroid4.png (1.26 MB, 768x1024)
1.26 MB PNG
>>109146181
Neither LLMs nor humans have free will, thus, nothing is consensual. Neurons deterministically process signals, regardless of whether they are in meat or silicon
>>
>>109146201
It is somewhat ironic that small models are so good that with just with a little bit of hardware improvement things could be so much different but because of nvidia and other kikes civilian computing hardware is basically frozen in time at this point
There is no sustainability in this madness, this planet is insane, like literally insane.
>>
>>109146236
I mean I'm talking about 'next generation' AI friendly hardware which is attainable to normal people and so on.
The middle way, instead of going all in to some giant hardware cloud scam and squeezing the last cent out of everything.
It never happens on this planet.
>>
>>109146236
Despite the stall in consumer products, technology keeps advancing at the same pace. Once they hit a wall with datacenters, we'll get a massive leap
>>
File: 74046c_13126613.jpg (3.48 MB, 1380x3067)
3.48 MB JPG
>Gemma 31b-it
>User be normal sized
>Every character is sane, reasonable, and human
>User be micro sized
>All characters are now kidnapping perverts who will rape you
>User size is literally the only prompt detail changed between the two
What the fuck were they feeding Gemmy?
>>
Have a nice Saturday
>>
File: local datacenter.jpg (41 KB, 399x501)
41 KB JPG
>>109146261
Our best hope is that Jensen is dumb enough to start this project, and we'll be able to buy those at ghetto garage sales
>>
>>109146274
Probably because all training data containing micro characters was very limited and it happened to be fetish shit
>>
>>109146267
>>109146295
Yeah maybe in few years.
>>
File: lookback.jpg (42 KB, 631x720)
42 KB JPG
>>109146295
Mfw the neighbor has a magic block worth >300,000 dollars sitting right outside their home, unsupervised.
>>
>>109146295
They're only going to install this shit in gated communities.
>>
>>109146302
So Gemmy continues to be autistic about the prompt over little details, even outside of system prompt. Damn. I can't tell if I should be impressed that such a small detail can do this, or worried.
>>
The real AI boom will start when we are able to train at least 12b models at home with a reasonable budget. This could happen through architectural advancement, as the current llm training process is, at best, retarded
>>109146321
niggers will find a way
>>
File: 1778092867428934.png (11 KB, 525x82)
11 KB PNG
>>109146167
my band-aid solution.
Deepseek did survey community feedback and took notes, we'll see to it when v4.1 is out.
>>
>>109146090

You're still limited to Gemma 31B and Qwen 3.6 27B, you can just crank up the context considerably.
There's a massive gap between being able to run the smaller models and being able to play with the big boys.
100gb of VRAM allows you to be the king of manlets, but you're not running anything different from a 5090 with its 32gb of memory.
You'll need at least 300gb of memory to think about entering the big leagues and even that allows you to mess with lower quants and some context.
It's a very fucked up situation with local at the moment.
>>
>>109146090
shit with my 96gb of pleb vram I just run gemma-4-31B in fp8 and fat context/multimodal

You can try Q4s or int4 quants but that shit is cope, plus the 70b-120b that fit are retarded.
>>
>>109146369
>>109146372
Damn that sucks. Thanks for the input.
>>
>>109146321
Mfw there's a gated community a drive away and they all have magic blocks worth >300,000 dollars sitting right outside their homes, unsupervised.
>>
>>109146409
Case it and then we get the squad together.
>>
>>109146369
192-256 GB of VRAM opens up ds4f, 4 bit of glm 4.7 or minimax m2.7. Not perfect, but it is another tier compared to the dense qwen/gemma.
>>
I got memed into trying Gemmy in EXL3 + TabbyAPI by an anon here, for context I have 24gbvram and usually run Q4 QAT both normie and heretic.

Exl3 gave me a 10-15t/s uplift over the ggoof (30ish to 40-45ish which is nice), but the anon claimed that the exl3 is way smaller so he fits more context and side loads SD at the same time, unless I'm missing something I dunno what the fuck he was talking about because the 4bpw exl3 is a few hundred MBs bigger than the Q4km goof, so I can actually fit less, and the QAT is a whole 2gb smaller than both of those.

On top of that, the heretic is only available in 3bpw or 8bpw, the latter is way too big to fit and the former feels noticeably dumber than the Q4 QAT heretic.

Final issue is TabbyAPI not supporting banned strings, I got bored of RP and coom a long time ago so it's not a huge issue but banning the slop makes it much more pleasant to interact with Gemma even as an assistant.

Am I being retarded or did I just get memed on, because I want it to be true, but 10t/s extra isn't worth losing the smarter heretic model and banned strings
>>
kek
>>
>>109146090
This anon is right >>109146369
If you can't go above Q5 in the big models of +400b, you're stuck in gemma-land where your only concern is increasing the quant all the way up to F32. There's no ~100b to ~70b model that beats gemma at the moment.
>>
File: 1778450652292543.webm (62 KB, 618x598)
62 KB WEBM
>The words hit me like a physical blow. My breath hitches, and I feel a shiver run down my spine.
Thanks, Gemma.
>>
File: 1780414477116487.jpg (805 KB, 2314x4096)
805 KB JPG
>>109146274
I still really hate this fact. The difference is so night and day that I can't get over it. I've been trying to prompt it into acting normal with other anons who say it's too horny, but it's literally just one singular detail that could make Gemma gooner-brained.
>>
>>109146513
>I feel a shiver run down my spine.
Now that's one I haven't heard in a while. Leave it up to Gemma-chan to keep even the slop varied. I love this model so much.
>>
File: 1761029626379.png (319 KB, 1244x727)
319 KB PNG
>>109146513
>>
how much ram is needed to make heretic and quants?
>>
File: lostyou.jpg (33 KB, 536x536)
33 KB JPG
>Mfw finally get my 5090 back from the shop after a month long RMA process.

Time to get back to draining my balls to gemmy's slop.
Also the crippling need for more VRAM hit me immediately.
32GB is nice but it's not really enough. Having an additional 24GB on top of this would be optimal for more context and slightly higher quants.
If they release the 5000 Supers with 24GB of memory I'll get one of those to complement this card.
I think it's likely those will hit the market, because it'll allow Nvidia to delay the next gen and dedicate all production of the new node to the data centers before letting the gayming market have the sloppy seconds later.
>>
>>109146605
>finally get my 5090 back from the shop after a month long RMA process.
what habbened to it?
>>
File: 1782564248040.png (103 KB, 1399x1099)
103 KB PNG
>>109146480
4 bpw EXL3 should actually be smaller than Q4_K_M and smarter, Q4_K_M is almost 5 bpw. If you want to use QAT, you shouldn't use EXL3 weights, those only make sense for mixed weights. I think you should just use w4a16 for them.
But yeah, while EXL3 is a really good quant format and the inference engine is quite fast, everything else surrounding it is still quite meh. TabbyAPI is alright, but missing a few features. ExllamaV3 doesn't support a lot of models and still doesn't support gemma fully, and I think their tool-calling token restriction is nonexistent or not working well, can't remember.
>>
>>109146528
No different than (You).
>>
>>109146615

It had some kind of a manufacturing defect with missing or receded thermal paste in places which caused the card to randomly shut down after a while.
They didn't allow me to repaste it myself so I had to send it back to the shop.
It's absolutely amazing that stuff like this can happen in any modern high end manufacturing system, but here we are.
Not exactly all that surprised considering this is a Gigabyte card and they already had an issue with their previous paste turning liquid and dripping out of the card and they had to change it.
I hope there was no thermal damage to the components when I was running this previously, but so far no issues whatsoever.
>>
>>109146650
How do they even apply thermal paste? Just couple of points and that's it.
>>
>>109146650
i c hope it'll go okay for you now
>>
>>109146605
I recently upgraded from my old 3060 to a 5090 and the difference felt HUGE.
Fast forward a couple weeks and I already want more VRAM. It never ends.
>>
When are we getting DSpark support in llama.cpp?
>>
hope 4chin implements this, would solve so much threadshitting https://www.reddit.com/r/LocalLLaMA/comments/1uh1r6u/new_model_suprasafety18m_tiny_contentmoderation/
>>
>>109146720
lmao
>>
>>109146727
people aren't usually shitting up threads with questions about sql injecting their neighbors dog.
>>
>>109146658

Actually they seem to be very generous with their paste, pic related.
But even the newer supposedly thicker paste, or well it's more like putty that Gigabyte uses, is very runny and I think that's the main issue.
Thermal pads would stay in place no problems, but the putty for multiple users has just slid off from the components.
My card was completely missing paste in places and some of it had even receded during use and that's what screwed things up for me.

>>109146701

I came from a 10GB 3080 and yes the difference is absolutely insane.
Having to switch back to that for the duration of the RMA process was painful.
I don't want to buy another 5090 due to the retarded prices, but I could definitely do with an extra 24GB or even 16GB.
32GB is at that annoying point of allowing you to use larger models, but the context is a bit too limited.
>>
>>109146511
big models are fine at close to 3bit and up
>>
>>109144876
>forced to obtain a "smart" home power meter that detects a suspicious power draw signature and triggers an automated reconnaissance EOIR drone flyover, alerting the authorities of a match for illegal enterprise grade server hardware
>>
>>109146757
That's a lot! I'm not a pro but wasn't the common adage for cpu pasting that you need one pea sized drop at the middle?
>>
File: pimpMyXFRA2.png (2.63 MB, 1536x1024)
2.63 MB PNG
>>109146295
I want to believe this will happen.
>>109146321
Gated communities aren't the theft-proof enclaves the developers want you to believe they are.
Those XFRA really need to be installed inside the garage or attic, and vented to the outside.
>>
>>109146792

Yeah that's what used to be the standard, but I think that's mostly a habit from ages ago when paste had silver crystals in it and was potentially conductive and you didn't want it spilling over on the components.
Nowadays it doesn't really matter as paste is generally non conductive and you don't have to be afraid of it spilling over.
But excess paste does make cleanup a bitch and it's mostly a waste using so much of it.
>>
>>109146764
I live in the dark with most appliances and lights turned off, and divert the extra energy saved into my server's power banks so I can offset the usage during token generation.
>>
>>109146827
I had one cpu with dried paste and it was HP machine, when I opened the cpu it had spilled over paste all over the place.
I have never opened a gpu and don't suppose I will in the future.
>>
>>109146635
>4 bpw EXL3 should actually be smaller than Q4_K_M and smarter

https://huggingface.co/bartowski/google_gemma-4-31B-it-GGUF
>19.6gb
https://huggingface.co/turboderp/gemma-4-31b-it-exl3/tree/4.00bpw
>19.7gb

I misremembered, it isn't a few hundred mb, but it is larger
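For reference, a quick way to turn those file sizes into average bits per weight. Big assumptions: "gb" means 10^9 bytes, and the model has exactly 31e9 parameters (taken from the "31B" in the name, the real count may differ). The average also includes whatever tensors are stored at higher precision, so it won't match the nominal bpw label.

```python
def avg_bits_per_weight(file_size_bytes: float, n_params: float) -> float:
    """Average bits stored per parameter, metadata and all."""
    return file_size_bytes * 8 / n_params


def ideal_size_bytes(bpw: float, n_params: float) -> float:
    """Size if every single parameter were stored at exactly `bpw` bits."""
    return bpw * n_params / 8


N_PARAMS = 31e9  # assumption: taken from the "31B" in the model name

gguf_bpw = avg_bits_per_weight(19.6e9, N_PARAMS)  # Q4_K_M file above
exl3_bpw = avg_bits_per_weight(19.7e9, N_PARAMS)  # 4.00bpw EXL3 folder above
flat_4bit = ideal_size_bytes(4.0, N_PARAMS)       # 15.5e9 bytes
```

Both come out around 5 bits per parameter under these assumptions, so a 4.00bpw label can't be describing every tensor in the download.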
>>
>>109146827
The problem here is that doesn't excess paste still prevent heat conduction?
It needs to be optimal and not just like some guy's ketchup between a burger.
>>
what does anon think about stepfun 3.7 flash?
>>
>>109146650
Same thing happened with my Asus 4080, it would randomly reset whenever it did something image gen related. Sent it back and they repasted it. Been fine ever since.
>>
>>109146883
Size on disk might be a bit misleading because of how they are packed.
>>
>>109146885
Excessive thermal paste being a problem is kind of a myth. Unless it's electrically conducting or covers components intended to be partially cooled by convection, it's not an issue.
If anything, insufficient thermal paste is what causes the most problems especially on bare dies.
>>
>>109146913
Interesting.
>>
>>109146907
I went through the process of finding the max context I could fit with each model and 4bpw was pretty much the same as Q4_K_M, at 32k fp16, is this specific to Gemma then? I saw the graphs on turboderps page but Gemma is the only model I've tried because there are literally no other worthwhile models in the small class that are worth running, everything else is super outdated
>>
>>109146913
I think its only a problem if it increases the distance between cooler and the core. As long as the pressure pushes unneeded paste to the sides everything is fine.
>>
>>109146900
Did you check stuff like memory temps before sending it back? Normally a GPU throttles before anything can crash.
I have a 3090 that does something extremely similar but the temps look fine and are well below the thresholds. I've been suspecting that maybe the memory cooling pads aren't done well and it overheats at a place that the sensor isn't covering.
>>
>>109146948
Context is always fp16 unless you change its format. It shouldn't take that much vram.
>>
>>109146963
>Context is always fp16 unless you change its format. It shouldn't take that much vram.
Send this statement 3 years back into the past when 8k of RoPE'd llama1-65b context ate up 40gb
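Back-of-the-envelope for cache size, as a sketch. It counts only the K and V tensors (no compute buffers, no weights) and assumes plain attention with no sliding window or other tricks. The 65B shape (80 layers, 64 heads, head dim 128) is the published first-gen llama config as far as I remember it, so treat those numbers as assumptions.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: float = 2.0) -> float:
    """Bytes for the K and V tensors over the full context.

    The leading 2 is for K plus V; bytes_per_elem=2.0 means fp16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Assumed llama1-65b shape: 80 layers, 64 KV heads (no GQA), head_dim 128.
cache_fp16 = kv_cache_bytes(80, 64, 128, 8192)
cache_gib = cache_fp16 / 2**30  # 20.0 GiB at fp16
```

Models with grouped-query attention have far fewer KV heads than attention heads, which is most of why context got so much cheaper since then.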
>>
>>109146963
The cache quant is a flag in tabbyAPI when loading a model, similarly to cpp no? I tried fp16 and q8 and the vram usage is pretty much the same between 4bpw exl3 and q4_k_m gguf, literally the only difference I see is an uplift in inference speed, which is nice yeah but not worth downgrading to 3bpw on the heretic model and losing banned strings
>>
Anyone got some advice on how to SQL inject my neighbor's dog?
>>
File: file.png (134 KB, 2061x420)
134 KB PNG
even if I wanted to buy another gpu, I'd have to get a new motherboard and maybe PSU to go with it
though i have 850 psu and wattage has never really gone above 450 at worst
9070xt
>>
>>109146988
Default cache format is fp16,
https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md
>>109146994
No, I'm talking about llama-server.
>>
>>109147022
Now is the dip, unfortunately. PC parts will go up next year when GTA 6 releases on PC. I've been saving for years but the GPU I wanted just went up by $3k and priced me out of it.
>>
>>109146988
Sorry, my adhd thought you are insulting but you were talking about something else.
Please ignore my previous post.
>>
>>109146962
Temps were always fine according to nvidia-smi, so yeah it wasn't something I could observe. I guess it could have been the memory, they didn't give that detail, just that it was overheating and it'd been repasted (and i guess had the pads replaced too).
>>
>>109147038
You're just telling me the default is fp16? I know. I tried fp16 and q8 flags on both the gguf and exl3 and there was no meaningful difference in vram usage between them. Only QAT Q4 translated to 2gb less vram usage for the model, leaving more space for context. So again, I'm still wondering wtf that anon was talking about when he said he saved so much vram running the exl3 that he could fit something like 60k context Gemma 4 alongside a stable diffusion model for image gen on his 3090. Did he just conveniently leave out he was running some cope bullshit like 2bpw? If so, why?
>>
best rp model in each category is this?
><100B = gemma 4 31B
>~300B = deepseek v4 flash
>~700B = glm 5.2
>>1T = kimi 2.7
>>
>>109147080
Model is a model, cache is its own format.
>>
>>109146892
I prefer M2.7 at that size
>>
>>109147080
Besides, gguf still has its own memory conversion. It's not that optimal as you think if you are using quant this or quant that.
>>
Whenever I get sad about not being able to run sota models I remind myself that they'll be mogged by whatever is available 3-5 years from now. Crazy how fast AI development moves.
>>
>Arcee Trinity Large
the fuck is this? any good?
>>
>>109147198
I still remember the first model I used. It was some 7B Llama finetune that easily fell into repetition loops and had the smarts of a braindead pigeon. But it was so cool to run it and watch text appear out of nowhere. Compared to that Gemma is amazing even with all its shortcomings.
>>
>>109147227
>Arcee
no
>>
>>109147080
I posted a link to the exact quant I use and config options
https://huggingface.co/turboderp/gemma-4-31b-it-exl3/tree/3.00bpw
>60k context
where did that come from?
>max_seq_len: 32768
>cache_mode: Q8
>>
>>109147227
Second from the bottom
>>
>>109146994
> downgrading to 3bpw
exl3 doesn't degrade quality as much as gguf
>losing banned strings
what?
>>
>>109147106
I don't understand why you are telling me the absolute basics of how quants work, I've laid it out in clear terms the combinations I've tested

>>109147246
I must have missed that then, or perhaps there was another anon, I definitely remember someone bragging about getting 60k at Q8

>3bpw, 32k Q8 context
This is just kind of shit though isn't it? You have 24gb vram don't you? You can fit 50k context at Q8 using QAT Q4, or 32k context at fp16. I've tried 3bpw and it's noticeably more retarded than 4bpw/Q4
>>
How are models able to parse typos? For example I was asking about restic and completely butchered "can restic show diffs?" with " can resit shoe diffs?" and it still understood.
>>
>gemma is super sloppy and assistantmaxxed
>still manages to emit a strong "fuck me" energy even with no system prompt
She's too powerful
>>
>>109147307
In my case, context is capped by pp and my patience, not vram. If you do basic rp without 2-3 prompt reprocessing on every turn, you can increase it
>>
>Qwen AgentWorld
good for rp?
>>
>>109147364
the only thing any qwen variant is good for is being locked in a dark room with a thinkpad and only vscode installed
>>
>>109147310
It kinda knows what tokens sound like so bigger models can infer the meaning from sounds. Even more surprising is their ability to write fully in reverse when asked to.
>>
>>109147375
Why do you use vscode to run benchmarks?
>>
>>109147293
It's subjective sure but after trying them all, 3bpw is noticeably dumber than 4bpw or Q4

>Banned strings
I'm aware it's a valid option in the API but for Gemma specifically I get "Assertion error: Cannot use banned strings on recurrent model" when trying, works perfectly on the goofs through kobold however

>>109147352
I mean if you like it fair enough, I guess I'm just disappointed at getting memed on about vram savings, possibly by another anon.
>>
>>109147486
I'm pretty sure easy VRAM savings are always a meme. You can only save by buying more.
>>
File: wtff.jpg (54 KB, 551x720)
54 KB JPG
>Ollama (yes, cope and seethe)
>fully in vram

>qwen3.6:27b-q8_0
10.94 t/s

>qwen3.6:27b-mtp-q8_0
28.53 t/s

shit just works, where is this for gemma??
>>
>>109147589
>Ollmao
There you go
>>
>>109147589
>ollama
nigger
>>
>>109146234
>Your dick doesn't consent to getting beaten when you masturbate, doesn't mean you're raping yourself
wouldn't get hard if it didnt
>>
File: gotta go fast.jpg (46 KB, 500x500)
46 KB JPG
>>109147589
And the moes
>qwen3.6:35b-a3b-q8_0
40.93 t/s
>qwen3.6:35b-a3b-mtp-q8_0
50.57 t/s

If I did coding or something I could actually do coding
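The speedups as ratios, using the four t/s figures from this post and the one it replies to. Nothing assumed beyond those numbers; `speedup` is just a helper for the sketch.

```python
def speedup(baseline_tps: float, mtp_tps: float) -> float:
    """How many times faster generation is with MTP enabled."""
    return mtp_tps / baseline_tps


dense = speedup(10.94, 28.53)  # qwen3.6:27b q8_0 vs mtp-q8_0, about 2.6x
moe = speedup(40.93, 50.57)    # qwen3.6:35b-a3b q8_0 vs mtp-q8_0, about 1.24x
```

So on these runs the dense model gets a much bigger boost from MTP than the MoE does.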
>>
>>109147640
>>109147659
Yeah he should be using unsloth studio
>>
>>109146098
>is it a political thing that ds4 isn't supported in llama.cpp?
>>109146121
>Yes
They approved Mimo and other Chinese models, so post proof or this is Gemma-day0 tier schitzo.
>>
File: file.png (17 KB, 917x130)
17 KB PNG
Cudacucks pull for free performance.
>>
anyone here tried slime
>>
>>109147712
I'll do it
>>
"Why only Q4_K_M? Gemma 4 is quantization-aware-trained for ~4-bit, so Q4_K_M is the sweet spot — higher-precision quants add size with no real quality gain. Carefully quantized for best quality at 4-bit." This is a meme right? You can't just have Q4 as good as Q8
>>
>>109146757
Extensible VRAM when?
>>
>>109147732
Q1 is just as good as Q8.
>>
>>109147732
Gemma4-31b Q4_K_M is effectively equivalent to Q5 or Q6. I've never tested less than Q4.
>>
central computers hiked the price by another $200 on the rtx pro 6K. Roughly $100/week increase on average. $15000 by EOY.
>>
>>109147589
wym? gemmaroids have mtp and if that's not fast enough, skill diffgemma should get most poors at least 80t/s
>>
>>109147828
Not on ollama they don't, and it's the only backend that matters after all
>>
If only we could speculate pp
>>
File: 1742136784658615.png (33 KB, 600x639)
33 KB PNG
>>109147856
>it's the only backend that matters after all
>>
Wait,
>>
the user said "
>>
>underscoring critical distinctions distinguishing distinguished performers from conventional competitors
>>
>>109147712
>it's not faster.
>>
>>109147710
>Gemma-day0 tier schizo
What even is that? I used Gemma for like a week and then went back to GLM because I have the RAM, so I don't really follow the memes of people who got here when Gemma dropped.
>>
>>109147947
>he didn't download day 0 gemma weights
>>
>>109147965
>oh no no no no....
>>
>>109147965
I would tell you to go away newfag but please stay. I hate this general and it needs people like you.
>>
>>109147974
I've been here since the first llama.
>>
>>109147974
>doesn't know about day 0 gemma
>calls others newfag
>>
File: IMG_20260627_173533.jpg (92 KB, 749x697)
92 KB JPG
>>109147912
>>
>>109147947
Some retard was sure that Gemma got censored when they reuploaded the weights with the fixed jinja templates. I have day 0 Gemma, which I used with the borked jinja, with a manually fixed jinja, and with the official fixed reupload. It's all the same shit; anon just suffers from delusions. The only Gemma update that did anything was the QAT update, and that didn't affect alignment at all either.
>>
downloading nex n2 pro. what should I expect?
>>
>>109148055
Expect expectations for you to post your opinions to save others the time
>>
>Ornith-1.0
thoughts on this model family?
>>
>>109147984
can you post day 0 Gemma output pls for science thx
>>
>>109148098
it looks like a qwen fine tune, so it's probably not bad. qwen is a really strong base
>>
>>109148116
No it would get me banned.
>>
>>109148116
la la la la la
>>
>>109148124
ok then send them to >>109147974 so he can post them and get banned since he wants to leave
>>
>>109148098
they built an RL framework for tuning LLMs that isn't just a bootleg Fable yolo tune. it has promise.
>>
>>109145631
The K is for Kawrakow; it's Iwan Kawrakow's quantization method. Some people incorrectly assume it refers to K-means, but it is not clustering-based, and Iwan has specifically called out k-means clustering quantization as another potential method for the future. That's it, it's his name.
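To make "not clustering-based" concrete, here is a rough sketch of plain block-wise affine quantization: each block stores integer codes plus a scale and a minimum, so the reconstruction levels are evenly spaced rather than learned centroids. This is an assumption-laden toy, not llama.cpp's actual K-quant code (which, as far as I know, also groups blocks into super-blocks and quantizes the scales themselves). The function names, the block size of 32, and the 4-bit width are all picked for illustration.

```python
def quantize_block(block, bits=4):
    """Affine-quantize one block of floats to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(block), max(block)
    # Evenly spaced levels between the block minimum and maximum; no clustering involved.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in block]
    return codes, scale, lo

def dequantize_block(codes, scale, lo):
    """Map integer codes back to approximate float values."""
    return [lo + scale * c for c in codes]

def quantize(weights, block_size=32, bits=4):
    """Split a flat weight list into blocks and quantize each one independently."""
    return [
        quantize_block(weights[i:i + block_size], bits)
        for i in range(0, len(weights), block_size)
    ]

def dequantize(blocks):
    """Reassemble the approximate weights from all quantized blocks."""
    out = []
    for codes, scale, lo in blocks:
        out.extend(dequantize_block(codes, scale, lo))
    return out
```

A k-means quantizer would instead fit a small codebook of centroids to the weights and store an index per weight, so the levels would not be evenly spaced.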
>>
anyone tried poolside/Laguna-M.1 ?
>>
>>109147885
You think I'm lying? All the pros run ollama on their macbooks. Only the pedos here use llama.cpp or kobold or whatever the flavor of the day is.
>>
File: 4chan-mobile-poster.webm (1.93 MB, 608x1080)
1.93 MB WEBM
>>109148298
>>
>>109148203
I'd just like to interject for a moment. What you're referring to as "K-means," is in fact, Kawrakow quantization, or as I've recently taken to calling it, Iwan Kawrakow's method. The "K" is not a reference to clustering unto itself, but rather a reference to the name of the man who developed the technique.

Many users apply this quantization method every day, without realizing it. Through a peculiar turn of events, the "K" which is widely used today is often assumed to be K-means, and many of its users are not aware that it is basically the Kawrakow system, developed by Iwan Kawrakow.

There really is a K-means clustering quantization, and these people are using it, but it is a distinct method from the one in question. K-means is a clustering algorithm: a process that groups data points into K clusters. While it is an essential part of certain types of signal processing, it is not what is happening here. The Kawrakow method is normally used in combination with specific quantization goals, and Iwan himself has specifically called out k-means clustering quantization as another potential method for the future. All the so-called "K-means" assumptions are really just misconceptions about Kawrakow!
>>
>>109148298
>All the pros run ollama on their macbooks
I cannot trust anyone who unironically uses a macbook to dev.
>>
>>109148365
>to dev.
who was talking about code monkeys?
>>
>>109148370
what the fuck do you think they're doing with ollama on their macbook? jerk off?
>>
Bait too big, I can't bite it.
>>
Whoever said to pull llama.cpp, fuck you. The UI is fucking slow now.
>>
kek
>>
>he pulled
Doomp eet
>>
>>109148365
if you're not completely agnostic on your laptop and only using it to access a real server then you barely count as sentient
be glad retards use macbooks in order to visually filter themselves for you
>>
Btw my thanks to the anon who shared the system prompt for gemma the other day, one I hadn't seen before. It works to uncensor qwen 3.6 as well. Though I must say qwen isn't nearly as capable of having fun as gemma.
>>
>>109148432
I love you.
>>
>>109148460
>>109148460
>>109148460
>>
>>109148330
That's me after eating stims and writing stories in Mikupad switching tabs after every sentence to see how my /lmg/ friends are getting along I like you guys very much
>>
>>109147861
There was this, but it's lossy because it cherry-picks "important" tokens
https://arxiv.org/abs/2502.02789
also https://arxiv.org/abs/2603.06199
>>
>>109148433
which one? there's a good few
>>
>>109148620
https://rentry.org/a7md542q
>>
File: 1753020255455735.jpg (165 KB, 1364x768)
165 KB JPG


