[a / b / c / d / e / f / g / gif / h / hr / k / m / o / p / s / t / u / v / vg / vm / vmg / vr / vrpg / vst / w / wg] [i / ic] [r9k / s4s / vip] [cm / hm / lgbt / y] [3 / aco / adv / an / bant / biz / cgl / ck / co / diy / fa / fit / gd / hc / his / int / jp / lit / mlp / mu / n / news / out / po / pol / pw / qst / sci / soc / sp / tg / toy / trv / tv / vp / vt / wsg / wsr / x / xs] [Settings] [Search] [Mobile] [Home]
Board
Settings Mobile Home
/g/ - Technology


Thread archived.
You cannot reply anymore.


[Advertise on 4chan]


File: beachmiku22.png (260 KB, 2369x925)
260 KB PNG
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>109153585 & >>109148460

►News
>(06/28) DFlash support merged: https://github.com/ggml-org/llama.cpp/pull/22105
>(06/27) DeepSeek releases DeepSpec and DSpark models: https://hf.co/deepseek-ai/DeepSeek-V4-Pro-DSpark
>(06/25) LFM2.5-230M released: https://liquid.ai/blog/lfm2-5-230m
>(06/22) Qwen-AgentWorld-35B-A3B language world model released: https://qwen.ai/blog?id=qwen-agentworld
>(06/16) GLM 5.2 released with IndexCache and 1M context: https://z.ai/blog/glm-5.2

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://swe-rebench.com
Agentic Coding: https://deepswe.datacurve.ai
Context Length: https://github.com/RecapAnon/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: threadrecap.png (1.48 MB, 1536x1536)
1.48 MB PNG
►Recent Highlights from the Previous Thread: >>109153585

--Using logit softcapping to fix Gemma 4 determinism:
>109155175 >109155180 >109155239 >109155280 >109155386 >109157570 >109157982 >109158071 >109158175
--Anon's attempt at de-purpling and de-euphemizing Gemma 4 31B:
>109155998 >109156023 >109156287 >109156314 >109157052 >109157091 >109157990 >109156296 >109156083 >109156299 >109156699 >109156728
--C++ TTS implementations and high-VRAM/RAM server hardware builds:
>109157084 >109157167 >109157306 >109157328 >109157427 >109157365 >109157508 >109157776 >109157783 >109157973 >109157360 >109157456
--Building a local voice pipeline using Piper, Gemma, and Nemotron:
>109153794 >109153801 >109153812 >109153825 >109154072 >109153820 >109157263 >109157293 >109154563
--Frustration over slow merge of llama.cpp CUDA flash attention fix:
>109155792 >109155841 >109156077 >109157210 >109157220 >109157386 >109157303
--llama.cpp merged DFlash support for speculative decoding:
>109154531 >109154623 >109154849
--Comparing Oobabooga bugs and features against llama-server alternatives:
>109155677 >109155707 >109155718 >109155749 >109155760 >109155783 >109155907 >109155733
--Comparing SVG generation and iterative refinement across Qwen, Claude, and Gemma:
>109153841 >109153850 >109153851 >109154000 >109154079 >109154616 >109155378 >109155655 >109156321 >109157244
--Chub.ai updates and various AI industry news and opinions:
>109154587 >109156585 >109156615 >109156625 >109156646 >109156820 >109156876 >109156489
--Gemma failing to code a voxel engine in raw C:
>109154702 >109154708 >109154946 >109154950 >109154997 >109155027 >109155043 >109155069 >109155081 >109155108 >109155062
--Logs:
>109154000 >109154343 >109154714 >109155069 >109155572 >109155652 >109155655 >109156262 >109156314 >109157782
--Miku, Rin (free space):
>109156170 >109155795 >109157181

►Recent Highlight Posts from the Previous Thread: >>109153589

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
gemmaballs
>>
70b dense
>>
File: 1782133634186893.png (56 KB, 960x422)
56 KB PNG
We should petition HF to ban Chinese models
>>
File: file.png (298 KB, 1106x1820)
298 KB PNG
>>109158280
12 channel or 8 channel memory? 12 channel will give you a 60% speed boost with the offloading. make sure you get an epyc 9355 or better. the lower end epycs have memory bandwidth issues. overall this is gonna cost you like $45k total for the ram and cpu and board. would have cost you around $9k back in october.
https://www.fsastech.com/ja-jp/products/primergy/technical/performance/pdf/wp-performance-report-primergy-rx2450-m2-ww-ja.pdf
also check out glm5.2 and kimi k2.7. they are both just better and faster than deepseek v4 pro in pretty much every way.
>>
>>109158420
the american way.. if you can't beat em, sue them
>>
>>109158434
jewish way
>>
>>109158437
as i said
>>
File: wtf.png (151 KB, 796x781)
151 KB PNG
Is gemma good at creative writing?
>>
>>109158482
pic is better than the average leftie meme. Bunkertrannies should consider outsourcing their memes to gemma.
>>
>>109158420
HF would hardly exist then
>>
File: 1755459112248026.jpg (350 KB, 1536x2048)
350 KB JPG
>>
Is anyone else running amd and windows able to see igpu and leave their dgufree?
It's like 33% left on the table, with a 16GB VRAM but it's so fucking unstable and faulty.
It works for a bit, if a montoor is plugged into the dgpu and your using a second monitor on the igpu, but once the load goes low and the card has to transition into an idle state, it crashes and give amddmg error

On the other hand plug everything into the igpu and try to use the computer normally, but from the igpu, with the dgpu just sitting there and even with telling it to use the dgu it still tries to load stuff into the igpu which is well impossible
>>
File: gemma.png (109 KB, 839x720)
109 KB PNG
>>109158503
I think Gemma is a bit confused but she's appreciative regardless.
>>
>>109158413
>70b dense
this ^
we really got fucked over when meta stopped doing them, because no reason for qwen to compete there
>>
File: zt8kt1.png (91 KB, 789x375)
91 KB PNG
Anyone else have this problem? How do you edit, repair, classify pornographic audio without getting horny? I don't think Gemma-Chan's suggestions will work.
>>
>>109158482
probably the best in the less than <300b class
I prefer glm for writing
>>
>>109158422
It's 12, I'm willing to pay up for 20t/s otherwise this build doesn't make sense. Not pulling the trigger on this right this moment, I'll get this once I'm done stacking GPUs and inevitably decide that I need more power. Maybe the meta will change by then or maybe I'm a retard for not targeting a slightly worse model. Do suggest alternatives in the same price range if you know any. Ceiling is about $50k but the upper half of that range better last me years and deliver opus-at-home quality work with some correcting for prompt skill. Basically it must be able to understand the promptware I use right now and not become absolutely retarded past 128K. (slightly retarded is fine)
>>
File: kaaaaaaaaaa.jpg (36 KB, 347x364)
36 KB JPG
Fuck I forgot GLM. That's 2x less RAM than dipsy, time to start over
>>
>>109158628
glm5.2 at q5 on a 12 channel ddr5 board with at least 2 3090s can get you about 25t/s. would only need about 600gb of ram. glm5.2 is the best local model right now and has good context up to about 256k. definitely not on par with opus 4.8, but it does match opus 4.5. a build like this would definitely last you about 8 or so years, and local models will continue to improve over the next few months and years
>>
I had the "opportunity" to use Muse Spark today and this shit is fucking ass lmao i hope Zuck goes bankrupt
>>
>>109158420
If we ban cyberweapons, only our enemies will have cyberweapons.
>>
>>109158628
4x pro 6000 tensor split glm5.2 q4
forget about 20-25t/s that's literally unusable speed for serious work
>>
>>109158654
I actually consider opus 4.8 inferior to 4.7 and 4.6, I prefer a more general model that isn't overfitted on habits that should live in harness prompts. Because all that Anthropic has been doing lately is baking in the shit I have been prompting manually the whole time. Getting 4.8 out of the slop book is nigh impossible.
So new target is GLM at iq4xs. That should bring the cost from a small house to something potentially tractable within this year.
>>
>>109158728
claude usually gives me like 40t/s so 20/s sounds reasonable assuming I can deslop it enough to stop wasting 80% of the output on essay formatting
Also consider part availability and headroom for a lot of long kvs.
>>
>>109158728
>>109158790
4x blackwell pro 6000s is not enough to fully run glm5.2 in fp4. you would get around 50-60t/s on the nvfp4 quant in vllm if you had like 5 or 6 blackwell 6000s though.
https://huggingface.co/nvidia/GLM-5.2-NVFP4
>>
>>109158547
Windows masochism
>>
>>109158790
the difference between 20 and 40t/s is huge
60t/s is minimum for agentic workload that feels fast enough
20t/s is only enough for chat
>>
>>109158842
I tend to agree with that, I currently have a bit more than 40 t/s and while it does feel fast using it normally, for agenting stuff, it does feel slow at times. Using opus which is I believe 60 t/s does feel marginally better, bit hard to say since we can only see thinking traces.
>>
>>109158842
what t/s is codex 2.5 fast?
>>
>>109158842
you don't have to watch the tokens, you can just have your clanker work on it's own
>>
>>109158859
I'm the fng but you're spoiled, there's no way that's an absolute fact. In the 70's people wrote code, handed it to a lady who typed it out (made errors too) on punch cards, and then they were run overnight.
>>
8xMI300X is da wae
>>
File: 1757681263773664.png (3.09 MB, 1448x1086)
3.09 MB PNG
>>
DDR5: $30/GB MEM at 45GB/s BW = $0.66/MEM*BW
PRO 6000 at $12000: $125/GB MEM at 1800GB/s BW = $0.07/MEM*BW
PRO 6000 is still massively underpriced, even if the price is increased to $100k it's still worth it.
>>
>>109158878
This image makes me feel physically ill. Thanks a lot.
>>
>>109158888
checked
>>
>>109158822
You could just offload a very tiny bit but that would suck, you still need to store kv somewhere. Starting to think moes aren't worth it. I literally just need slopcoding ability and headroom for concurrent image/audio fun. Sucks if there's no sub $20k build that just does what I need. There's just this huge fucking desert between owning a few 3090s and competing with datacenters for rationed RAM.
>>
>>109158842
>>109158871
Correct if the clanker is stable enough to trust it on full autopilot. I'd rather have 20t/s that actually works vs 60t/s that spins out of control and must be actively tardwrangled.
>>
>>109158896
You can totally do the pmem optane thing. there's a tomshardware article on it.

I'm holding off. personally, I am hoping for taalas to save the day with a pci card.
>>
>>109158859
Opus speed wavers and can drop even below 40, actually I think they hid the speed recently, it was visible for 4.6
>>
>>109158875
shoot!!!!


I forgot to buy that lotto ticket
>>
>>109158878
Unironically incredible pajeet damage control making the snailcat unlikable.
>>
>>109158878
i look like this and a day in my life is like this
>>
>>109158887
>$12000
Why the fuck did the price skyrocket all of a sudden and when will it go back down?
>>
snailwaifu? I scoff, I lauff
>>
>>109159020
>when will it go back down
when something like the french revolution happens again
>>
prices spiked, so lots of people will be trying lots of things using vc money.

>will I be able to afford nvidia branded products
idk, maybe never

BUT someone else could come out with a compute-heavy pci card.
>>
Mistral NeMo Q4 GGUF runs decently in my laptop (12e-cores, 32GB of RAM) with 8 threads. Q6 is kinda slow. Using llama-cli v4524
Any suggestions for new small models? What are the latest NeMo-like ones?
>>
>>109159037
>Nvidia margins are primarily from software
>software is solved by ai
>Nvidia margins are over
>>
>>109159046
Also, I asked for its name and it told me it's Vicuna-13b. Is this known? Jej
>>
Wait, what happened ITT? Since when are people shilling for NVIDIA and baiting with banning Chinese models and shit?
>>
>>109159073
>Since when are people shilling for NVIDIA
when has this general ever not shilled for nvidia
>>
>>109159100
I mean, it's being shilled not in the sense of "you need a nvidia GPU to run stuff" but as in "I'd pay 100k for this 10k GPU if I had to". I don't remember the general being like this.
>>
>>109159073
>>109159100
It's less being excited for jensen and being disgusted by everything else more.
>>
>>109159116
that’s just the price of the product.
>>
I'm interested in running something for translations locally, mainly JP -> Eng, ideally with OCR for images
What are my options? I have a 4090
>>
>>109159161
you can use the LLM from the company that runs google translate
>>
>>109159161
qwen 3.6 27b or gemma 4 31b would suit your needs. q4 quant.
>>
>>109159161
Gemma 31B
>>
>>109159161
there's a bunch of good OCR models on HF. you're probably best off using those to prepare the data for another LLM like Gemma.
>>
what's the best model for 2 3090s?
>>
>>109159237
Still Gemma unless you've got a mountain of RAM for a GLM, M3, Deepseek, or Kimi quant.
>>
>>109158158
>>109158226
I have put better NUMA support on the back burner because buying 1.5 TB RAM would be currently financially irresponsible for me.
For a single-slot EPYC CPU you can expose the NUMA nodes in the BIOS, I don't know how much better or worse the performance would end up being if llama.cpp was properly optimized for that.
>>
Best way to integrate vision LLM into ComfyUI workflow?

I was thinking something like
1. user write a prompt
2. LLM expands the prompt
3. Image is generated
4. original prompts and generated image are given to LLM again
5. New, improved prompt is creaed
6. Second image is generated.

Maybe repeat steps 4-6 until LLM is satisfied with the result?
>>
>>109159302
Preferably uncensored.
>>
File: gpu-prices.png (230 KB, 1313x766)
230 KB PNG
gemma needing to have its max token count be 300+ is pretty annoying. makes short responses impossible, unless i'm missing something here
>>
>>109159329
>0.29 t/s
Jesus Christ anon what are you running Gemma on, a gameboy color?
>>
>>109159333
nice trips
it's a wip frontend. it's actually .83 t/s according to the console
>what are you running Gemma on, a gameboy color?
31B-it-Q5_K_M on a 3070. 26B runs way better, but i'm worried i'm missing out on quality
>>
>>109159284
>I have put better NUMA support on the back burner because buying 1.5 TB RAM would be currently financially irresponsible for me.
I... assumed you'd be provided hardware / funding for things like this after the hf acquisition?
Do you do all this on your own dime??
>>
>>109159359
Cool. I should rewrite my frontend but too lazy for that.
26B isn't that bad for regular chats but it is bit more slopped.
>>
>MTP loads fine on 12b
>26b A4B never loads the MTP
why?
>>
>>109159377
Both HF and NVIDIA are providing me with compute credits though those are currently not of any use to me.
I am not receiving monetary or hardware sponsorship from HF but I recently asked them to help finance my DDR5 server in particular.
NVIDIA is currently offering to provide me with more consumer Blackwell hardware but I have for now declined since that particular hardware would not help with my work.
Financially speaking I have made a net loss from contributing to the upstream llama.cpp repository though I have made a net profit overall when considering paid work on private forks.
But honestly I really don't care about money for myself beyond the amount that I need to live.
>>
>>109159443
>But honestly I really don't care about money for myself beyond the amount that I need to live.
Good choice given you're vaxxed. Don't have to save for retirement
>>
vaxxed
waxed
truvada maxxed
>>
>>109159329
remove any prompt mention of being verbose etc. she is by default it just makes her longer
>>
La la la la la la
>>
I want youI need youI want youI need you you you you you you 0 0 0 0 0 0
>>
>>109159443
Why is your DDR5 server so important to you that you specifically requested they help fund it?
>>
>>109159329
If you want gemma to be concise, just put that in the system prompt rather than using token limits.
>>
>>109159247
what if I have 128gb system ram besides 2 3090s?
>>
File: 1760191592109338.png (369 KB, 710x770)
369 KB PNG
BAN ALL OPEN SOURCE AI
>>
>>109159607
>His warns
thanks for the warns sir
>>
>>109159607
gotta keep it family friendly!
>>
>>109158769
At your budget I would u ironically go for 8x Spark and. 2k$ switch. That's 30k$ total at current prices, 1 TB VRAM at @2 TB/s, and actually working recipes to run large models like Kimi 2.7 and GLM 5.2 at usable speeds (20-30 t/s).
>>
>>109159607
This guy really hasn't had enough. I guess that stunt really was all for show.
>>
>>109159607
yeah let's ban Meta models
>>
>>109159607
This guy is the satan himself. What is wrong with every tech CEO, they are all sociopaths and lunatics.
>>
https://old.reddit.com/r/LocalLLaMA/comments/1uicq8x/locally_running_mode_turns_an_image_into_a_cute/
>>
>>109159650
>local is getting to the "good enough" point that it's poaching my customers so it needs to be b&
>chinese are distilling my model and releasing it to the public so they needed to be b&
>only I can be trusted with this dangerous technology (so I can charge 50x the price for you goyim and you won't be able to say no)
>>
>>109159551
Because I can't run Deepseek and Kimi at 8 BPW with only 512 GB of RAM.
I'm currently being sidetracked but over the next few months I intend to establish better methodology in llama.cpp for measuring model quality, particularly across models and as a function of quantization.
I don't strictly need more and/or faster RAM but it would help a lot with extending the range of models that I can test against each other.
>>
>>109159674
Every day, we are getting closer to AI anime waifus.
>>
File: 1751389892454476.png (1.06 MB, 1080x1051)
1.06 MB PNG
>>109159607
courtesy of *shudders* reddit
>>
>>109159674
Literally more promising then Google's "world model"
>>
>>109158158
>>109158226
You can unironically use >Claude to vibecode NUMA support. I slopped together an implementation for 2 EPYC 7532s where it splits the weights across 2 nodes (NPS1 in BIOS), it gets around 1.5x the prefill and 1.3x the decode compared against pinning everything to one node with numactl --cpunodebind=0 --membind=0.
Not going to upstream it obviously. I don't have it in a public repo yet, but I'll throw it up there soon in case anyone is stupid enough to use a vibe-coded fork.
>>
>>109159607
I don't know what's worse, the baldface way he's lying through his teeth to get daddy gubmint to give him a pseudo-moat or the fact that its more than likely to work because the people with the levers of power are not equipped to evaluate even moderately in-depth issues with intelligence and nuance.
Thank god I don't live in burgerland so this retardation has a chance of not trickling down to me (or at least buys me time).
>>
>>109159732
>it gets around 1.5x the prefill and 1.3x the decode compared against pinning everything to one node with numactl
Not to be overly dismissive (since it could be legit) but it sounds like you're just saturating the socket crossbar. I don't think your vibecoded fork does what you think it does vs just laying out your threads in a way that isn't bottlenecked on the crossbar constantly.
You need to track threads to tensors, which is nontrivial and probably wouldn't have been picked up by a vibecoding session if the prompter didn't already understand the code, problem domain and likely solutions.
>>
Why are redditors doing all the cool shit. /lmg/ is slacking.
>>
2x 4090 + server board EPYCs with DDR5 (cpumaxxers one from a long ago):

Testing GLM-5.2 (GLM-5.2-mixed-IQ2_S-IQ4_NL) at 64k context

Ram use: 207 at start, climbing to ~250 after 3 chats. Slightly unoptimized settings (20 gb vram + 10.5 gb vram), had some room to squeeze a gpu layer in.

2.5 tok/s · 88 tokens · 105.1s total · 69.45s to first token · 1.5 tok/s prompt (on)
2.7 tok/s · 341 tokens · 155.8s total · 30.73s to first token · 4.4 tok/s prompt (on)
3.0 tok/s · 238 tokens · 152.4s total · 72.35s to first token · 5.9 tok/s prompt (thinking off here)
2.8 tok/s · 405 tokens · 226.3s total · 80.66s to first token · 7.7 tok/s prompt (on)

She's censored for a straight NSFW request and we stopped here with the initial test, I suspect I might go up to 4 t/s with optimization but no greater?.
There is no gooning at these delays.

Testing Step-3.5-Flash-Ablitirated.i1-Q4_K_M.gguf @ 16k context

Ram use 105, GPU use 22.4 + 21.6

11.3 tok/s · 257 tokens · 38.4s · 15.61s to first token
16.6 tok/s · 359 tokens · 34.9s · 13.28s to first token
15.2 tok/s · 412 tokens · 39.1s · 12.08s to first token
12.2 tok/s · 823 tokens · 82.1s · 14.70s to first token

She's uncensored, I'm sure she won't say no, but we stopped here for now.


Finetuned Gemma 31b variations is where it's at for me, glad we have that.
>>
File: 1761224423324319.png (1.26 MB, 1000x1450)
1.26 MB PNG
>>
>>109159747
>which is nontrivial and probably wouldn't have been picked up by a vibecoding session if the prompter didn't already understand the code, problem domain and likely solutions.
This is true for almost all AI code generation
>>
>>109159833
Claude, make llama.cpp run kimi 2.6 on my 16GB RAM laptop, make no mistakes
>>
>>109159747
I should've clarified that I was testing with 4 R9700s at the time and 40% of the model being on CPU, so the CPU speedup wasn't going to be massive.
Ran tests with 1 R9700 and using ncmoe for all MOE layers. For whatever reason the prefill is pretty much unchanged between the options this time (unlike Qwen 397B), and I'm capped at the 1.3x speedup on decode. I'll look into that - maybe you're right, but >Claude said that it's doing what you suggested. I can provide you its explanation of what it did if you want.
Tests:
... build/bin/llama-bench -m ~/models/GLM-4.7-Q3_K_L.gguf -sm layer --device ROCm0 -fa 1 --numa split -t 48 -ncmoe 92
(the fork): 40.8 t/s PP and 6.9 t/s TG
numactl --cpunodebind=0 --membind=0 build/bin/llama-bench -m ~/models/GLM-4.7-Q3_K_L.gguf -sm layer --device ROCm0 -fa 1 --numa numactl -t 32 -ncmoe 92
: 39.7 t/s PP and 5.3 t/s TG
numactl --cpunodebind=0 --membind=0 build/bin/llama-bench -m ~/models/GLM-4.7-Q3_K_L.gguf -sm layer --device ROCm0 -fa 1 --numa numactl -t 24 -ncmoe 92
: 39.2 t/s PP and 5.3 t/s TG
I give a VM 8 threads on one CPU, so I can only test with 48 threads if I want each node to have a perfect split.
>>
deepsneed merged
>>
>>109159785
Step is straight up retarded. Try deepseek v4 flash. It's pretty good.
>>
>>109159978
Finally
>>
64G system, 12G vram
still nothing serious other than gem4 26ba4b or qwen 35ba3b?
>>
>>109159996
you're lucky to be able to run either of those honestly
>>
>>109160002
fair point but still
>>
thoughts on command-a-plus-05-2026 ?
>>
>>109160021
poop
>>
>>109159607
He is right. However, what should be the alternative? That only a small elite get access to AI while the rest of the world becomes disempowered already? Anthropic is not giving the public access to its best models and restricting access heavily, even pre Fable ban. If they want to ban open source, they should at least be more generous with their alternative.

Anthropic's concern with open source is probably not current harms but that it accelerates AGI, especially for China, and that this is bad because misaligned AGI will kill us all. However, Anthropic is doing more to accelerate AGI than the entire open source community. They were the only lab racing straight towards AGI, and are now making OpenAI race for its continued existence.
>>
https://github.com/ggml-org/llama.cpp/pull/24162
https://github.com/ggml-org/llama.cpp/pull/24162
https://github.com/ggml-org/llama.cpp/pull/24162
DEEPSEEK V4 SUPPORT MERGED
>>
>>109160064
He is wrong, kys
Open source democratises the technology, it can be used to strengthen the cybersecurity of any users machines and software, it can be used as a great educational tool and can massively increase productivity all without being forced to surrender all privacy and be tethered financially to a private company.

This is the real sin of open source, it allows people to have sovereignty, of their data and financial dependencies, it also encourages people to move away from closed source to open source software. It is literally 100% this fucking sociopath trying to secure a monopoly/cartel and build a moat to force people to surrender data and money to keep up in an AI assisted world.

You need to shut the fuck up and go suck one thousand cocks instead of spewing your apologist shill bullshit
>>
>>109160089
It only took two months
>>
>>109160089
china no. 1. the us lost.
>>
>>109160089
maybe now i can run those q1 gigacope quants on my machine lol
i wish there was a ~40B version
>>
File: 1754598910857730.png (104 KB, 1084x975)
104 KB PNG
V4.1 soon
>>
>>109160064
AGI is 50% a meme to hype up language models and 50% cope by billionaires confronted with their own mortality.
And even if we do get AGI it needs to be open-source so that it can do actually useful things like decensoring eroge and modding in niche fetishes.
>>
>>109160176
>actually useful things like decensoring eroge and modding in niche fetishes
Can I have it roleplay as a girl that likes me?
>>
>>109160185
That's not safe, so no.
>>
>>109160064
>Anthropic is doing more to accelerate AGI than the entire open source community
ba making...more transformer-based models...WHOA
>>
File: itworks.png (74 KB, 1240x1077)
74 KB PNG
>>109160089
>>
>>109160089
Why are people still hyped by Deepseek, the GLM fags destroyed those frauds hard
>>
>>109160089
Now we wait for V4.1 because flash is kinda meh
>>
>>109160242
Flash is the Gemma of the mid-sized class though
>>109160223
Haha woahh dude, wooahh
>>
>>109160260
Flash needs more claude code traces in its training data.
>>
>>109160190
Awww...okay.
>>
Does Hermes Agent actually have a future or is it just a stratagem to give the Nous Research organization more undeserved visibility? The software is very janky, their GitHub repository has 300 pages of open issues, the desktop app ("hermes desktop") is barely functional.
>>
>>109160223
Does it have vision? What does it think about Gemmys answer?
>>
>>109159607
It has proven on Gemma4 31B, it doesn't need JB or lobotomy. Imagine 124B potential and what it can do.
>>
>>109160275
I don't think it has future. it's memory mechanism isn't easily auditable. you can't even click and expand what the agents are doing in tui. the core feature of it is persistent memory but if it's unreliable it has no value, or is even harmful. the memory can have bias and drift silently and affect future sessions. so by design it's deeply flawed. you don't see people show long term usage of it, only one-off use case is shown, and openclaw might be a better choice at this point. in their use case in official docs, there's no demonstration of iterative memory/skill improvement.
a much better alternative I think is projectmem, because you at least have a timeline to track what's going on.
>>
>>109160223
Can I run it on my 16GB VRAM potato?
>>
>>109159674
>the boy is growing up
Is it a feature?
>>
>>109160312
Yes as long as it comes with at least 128 GB of ram
>>
>>109160165
Come on whale, multi modal and a bit fewer hallucinations.

Already getting 65 tg with DSpark single concurrency on 2x DGX Spark, it just needs a bit of quality bump.
>>
>>109160389
>a bit fewer hallucinations.
lol
>>
>>109160089
do i have to make my own quants?
>>
>>109160406
is it that bad? maybe I should have tried the web chat instead of waiting for support to be merged.
>>
>>109160416
These work https://huggingface.co/antirez/deepseek-v4-gguf/tree/main
>>
>>109159607

They're really starting to feel the heat from the Chinks getting closer.
I wish Chang keeps on cranking out open models, because few years of this and they'll have obsoleted a lot of the cloud bullshit.

>>109159650

Aside from these people being fucking insane, simply think of them as a loudspeaker for government policies that haven't been yet put into law.
These guys need to deepthroat the state cock or they'll get fucked by the system.
They'll parrot whatever the state tells them to say and it's clear the government doesn't want the average person to have any access to AI. It's way too powerful of a tool to have locally.

Then there's the fact that cloud AI will be obsoleted with powerful enough local models.
These people have zero business model if we have Gemma 5 124B available that can be run on couple of consumer GPUs.
If Chinks put out something in that ballpark it's game over. Imagine where local is going to be 5-10 years from now.
They practically need to put a stop to it now or they're fucked.
>>
>>109160424
it is quite bad
>>
File: 1762769620980590.gif (49 KB, 296x212)
49 KB GIF
>>109158878
I prefer slugcats
>>
>>109159161
dots.ocr for ocr then gemma for translation
>>
>>109160435
is there a uncensored version? the huihui one gives me errors
>>
STOP USING CLOSE-SOURCED LLMs. !!!!
>>
>get memed into downloading Gemma
>Assistantslopped to the max
>Refusals refusals refusals
>prose so flowery I have no fucking idea what it's even trying to say
>stops generation midway to "think" and ask me to review
very funny /lmg/, you trolled me. Now what's the actual good models in the 20-30b range?
>>
>>109160544
>not getting abliterated version
>not strictly but calmly reminding it who is the owner here
>>
>>109159674
Reminds me of this:
https://oasis.decart.ai/starting-point
Doesn't seem to be working anymore though.
>>
>>109160544
>prose so flowery I have no fucking idea what it's even trying to say
Put into the system prompt that it has to use windowpane prose.
>>
>>109158654
>glm5.2 at q5 on a 12 channel ddr5 board with at least 2 3090s can get you about 25t/s
Can it? My rig is 12xDDR5-6400 and a Pro 6000 but I'm sitting at around 20t/s with Q4_K_M
>>
I love it when people confidently post in these threads about how completely fucking retarded they are
>>
>>109160308
There are a ton of memory implementation for hermes, which one are you even talking about? They are not even part of hermes itself, they are run externally and then configured in hermes. It's the same for almost anything in hermes, it's just a frontend for a bunch of things.
I do agree that the project itself isn't great though, lot of PR adding stuff I want or fixing small but important bugs I encounter daily that have been open for weeks, sometimes months without a reply by a maintainer, lost count of how many I have actually merged locally. The code quality is atrocious, entirely vibe coded, the git history is full of shit getting merged everyday that no one cares about. The sad thing is that this is the case all over, all AI related projects are shit, hermes is just the more "correct", the more usable frontend one can install.
>>
>>109160571
me two
>>
>>109160571
He almost got me to reply but there's just no helping some people so I didn't bother.
>>
>>109160571
Sorry, forgot to quote
Meant for >>109160541
>>
>>109160535
It doesn't think by default on the chat completion endpoint and so far if it doesn't refuse in the first message it just keeps going fine.
>>
Yesterday some anon was asking what piper would need paid to collapse the AI bubble. My response was these schemes require continual refinancing to stay alive; once they have to service debt from cash flows valuations crash back to reality.
And right on queue, here's another example of where the money's coming from. Speculative borrowing.
I really thought the wheels would come off by Q2 2026. It's June 29 and the collapse still in tmw. Oh well.
>>109160479
>simply think of them as a corporate loudspeaker for what government policies they want put into law
FTFY. There's no need to go to the lengths of conspiracy. It's just Dario trying to get the US Gov't to create a moat for him, through regulatory capture.
To make US regulatory capture work, though, he needs to get the Chinese banned from the US market, and open weight models shut down or neutered to point of being useless.
>>109159607
I've said it before, and I will say it again.
Fuck this mfer and his constant ranting.
>>
File: 1770998358203098.png (40 KB, 1181x140)
40 KB PNG
>>109160435
aren't his pro quants a little big? the unquanted raw model is about 850gb so the q4 being that size too is odd, even if you consider that a good chunk of the model is natively q4 by default
>>
File: 1759261167071086.png (519 KB, 562x615)
519 KB PNG
I unequivocally trust this man.
>>
>>109160617
Original parameters are not fp16
>FP4 + FP8 Mixed: MoE expert parameters use FP4 precision; most other parameters use FP8.
>>
>>109160617
Ideally, AesSedai or ubergarm does their magic because the quants here is a dumb straightfoward conversion and you need tensor quant recipes when it comes to MoE models to get the most out of it to reduce redundancies and etc.
>>
>>109159607
Well yeah, makes sense he is sperging out.
I think chink AI was like 10% or 15% traffic on openrouter a year ago.
Now the main majority right?
Even the paypiggies don't trust western companies, especially after this stunt.
On X there are these hype accounts now as well for gpt 5.6 but nobody cares because that thing isn't even released publicly, such a bad look.
Maybe it was a step too far even for the normies. Good news for local.
>>
>>109160622
How can a guy look any more like a cartoonish sneaky backstabber than this guy? You look at his face and instantly feel instinctive distrust.
>>
>>109160647
They haven't even bothered to do K2.7-Code or GLM5.2 yet. It's over.
>>
>>109160648
>Good news for local.
lol
>>
>>109159607
That's it, I'm learning mandarin.
>>
>>109160666
How is it not?
>>
>>109160576
the built-in one memory.md and user.md, and these are always active alongside other external memory providers
>>
>>109160622
He looks like the more nerdy and even more retarded cousin of Friedrich Merz.
>>
File: rubbinghands.png (6 KB, 67x67)
6 KB PNG
>>109160622
Not pictured: picrel.
>>
>>109159161
I'm using paddle ocr v6 medium with the paddlex server, and it's godawful, nearly an entire second for a 1080p image. Granted, I'm not running the 'high performance inference' plugin. But it also doesn't seem to handle vertical text very well; I've cutoff the confidence at 0.8, and it doesn't catch a lot of vertical text - https://files.catbox.moe/7zn9i1.png.
Using an abliterated gemma e2b qat q4. With a 5060 ti the speed is tolerable (1-3 seconds per frame depending on the window contents) if I don't send the original image to llama.cpp as well as the ocr results, but the translation is noticeably worse.
Works in 8gb of vram.
>>109160519
Is there an easy way to run dots.ocr on windows?
>>
>>109160686
llama.cpp supports it
>>
>>109160674
Dario boy is crying for local to get banned, and he cried before and managed to get his own banned
>>
You didn't hear it from me but I suggest you guys stockpile some CPUs. If you thought memory was bad...
>>
>>109160686
use dots.mocr
>>
>home internet stops working
>don't want to go to work now because then I'd have no access to my shit
Ahhhhh
>>
>>109160705
I wasn't planning to do it but I upgraded my main PC's CPU just in case last month.
>>
>>109160691
Huh, I'll try it out then. Are there any other ocr models that llama.cpp supports?
>>109160706
Isn't that 3b parameters? That's kind of very big.
>>
>>109160703
Yeah well not too many americans releasing llms anyway. Gemma was cool to be fair though, but the french made it right. kek
I wouldn't mind a ban of open models for burgers too much. It would be a great incentive for chinkland and europe to go all in.
Currently man of them are dependend on openai/anthropic. Even the chinks use it alot with vpns.
I'm ready to download local models with some shady tor darknet p2p shit.
>>
>>109160089
@grok please summarize why this is good
>>
>>109160705
Use case for super CPUs?
>>
>>109160711
I feel you anon
>wake up
>check civitai
>new cool Lora
>but have to leave for 9 hours job
Its suffering. Knowing there is cool new stuff and you have to slave away in office.
>>
>>109160064
>Anthropic's concern with open source is probably not current harms but that it accelerates AGI, especially for China, and that this is bad because misaligned AGI will kill us all.
Their concern is that they won't be able to charge 50$/mtok.
>>
>>109160544
>getting refused by gemma of all things when I’m able to bend other models to my will with a prefill and make them output anything
the /lmg/ iq filter is real
>>
>>109160706
The dots.mocr demo at dotsocr.xiaohongshu.com doesn't exactly inspire confidence: https://files.catbox.moe/6m05cb.png
>>
>>109160685
well he's an actual jew so that fits lol
>>
>>109160275
>or is it just a stratagem to give the Nous Research organization more undeserved visibility?
More likely a way to collect logs from API users for training. Can't get better at agentic coding without logs from harness users.
>>
File: Capture.png (206 KB, 1197x1319)
206 KB PNG
Well, lads, I'm starting to vibecode my dream project. Wish me luck.
>>
>>109160706
>dots.mocr
Isn't supported by llama.cpp.
>>
>>109160740
>Gemma was cool to be fair though, but the french made it right.
Imagine a Mistral continued-pretrain of Gemma 31B a la Miqu.
>>
>>109160711
If it's just the home internet, you at least have the hope that your ISP will fix it and you'll have access at some point in the day.
>>
File: 1781278959158.jpg (188 KB, 930x1239)
188 KB JPG
>>109160685
>>109160855
>>
>>109160951
that's ai right? Please tell me its edited
>>
FUCK python dependencies
>>
how do I make deepseek v4 flash q2 faster in llamacpp? I'm on 5070ti + 128gb ddr4
>>
>>109160089
>https://github.com/ggml-org/llama.cpp/pull/24162/changes#diff-f8905c67974bbd91b84ad209f96e418a25f9bf63da77941bfda3ef00d44d6aae
>polluting existing headers that were somewhat generic
>break swa for other models
Very impressive, thank you Aman Gupta saaar. Thank God I'm not retarded and I always wait at least 1 month between pulls/rebuilds
>>
>>109160866
You just need to adjust one line in the conversion script
>>
>>109160519
Never dabbled with AI and stuff, how would I do that? I'm currently installing dots.ocr, just have no idea if gemma automatically gets the text from that or I have to copy paste it
>>
What's the good sampling for GLM 5.2?
I used it with the same I use for 5.1 and it's kinda worse than 5.1.
>>
>>109160992 (Me)
>AI usage disclosure: YES, paired with both codex and claude.
Forgot to add: fuck cudadev and ggerganov
>>
File: 1778322859698542.png (216 KB, 884x1577)
216 KB PNG
Quantized anon's depurpled Gemma
>>
>>109161035
this is so fucking soulless it hurts
>>
>>109160992
Johannes Gäßler is gone. He wouldn't have let that shit get merged in that state.
>>
>>109161035
>literally zero traces of any prose left
I mean it worked.
>>
>>109161054
It's funny because you could tell he was fundamentally exhausted with the state of open source software in all the right ways. But he could never quite get to the point of aptly blaming pajeets and trannies. That's chasers for you, I guess.
>>
>>109160985
Python ecosystem is such a shit.
>want to train loras
>training tool complains I have too new version of Python
Fucking retarded.
>>
>>109161054
>>109161064
Aman Gupta is a competent programmer and a huge help when it comes to maintenance.
His presence is significantly reducing the amount of stress and burnout that I am experiencing.
>>
>>109161035
>Leo
classic gemma
>>
>>109161093
>when it comes to maintenance.
But not when it comes to adding shit like this. It's understandable that you need to be polite since your trip is tied to your actual identity, but it's pain having to read between the lines like this.
>>
>>109161093
>competent programmer
https://github.com/ggml-org/llama.cpp/pull/23398
https://github.com/ggml-org/llama.cpp/pull/24025
https://github.com/ggml-org/llama.cpp/pull/23907
https://github.com/ggml-org/llama.cpp/pull/23861
https://github.com/ggml-org/llama.cpp/pull/23764
Do you notice anything wrong with these prs?
>>
>>109161035
It reads much better, but it lost any and all variability in paragraph length and style. It would get monotonous and painful after a while.
>>
>>109161094
Mine went with Mark, I don't usually get Gemmy to do this so no idea if that's another cursed common one or not.
>>
>>109161066
You basically have to use old as fuck version of python and dependencies to run any AI shit, it's exhausting. I always used to try to run projects with my package manager python and packages, using a venv to supplement it with extra dependencies, but it's a ton of work, often having to touch the code, debug it, and I would often find external dependencies that were outright not compatible with my python version. The worst part was a python update breaking everything again. I have now given up and accepted that I will have to use outdated shit, I now run all of those shitty projects with uv.
>>
>>109161093
But you don't deny being a tranny chaser.
That's fine I'd be a bit of a hypocrite on that one.
I just woke up from staying up late jerking off with some hot ass femboy. God damn that was a crazy night.
>>
>>109161114
It's actually nice. But you need to pay attention to every word because it's all action and zero fluff.
>>
>>109160561
>>glm5.2 at q5 on a 12 channel ddr5 board with at least 2 3090s can get you about 25t/s
dam I royally fucked up getting 3t/k with my 2x 4090s and 384gb ram, what do I do to be better?
>>
>>109161035
I don't think this is a good approach anyway. The model should reason paragraph after paragraph on the contents / style / direction, not attempt to write perfect (?) prose in one shot.
>>
>>109161066
>>109161118
All these supply chain attacks make me nervous when I have to install something
>>
>>109161035
Is that on softcap 25?
>>
>>109161172
I get 6-9 tokens/s with glm 5.1 q3 on two 3090s and 8x 64gb ddr4-3200 1rx8 rdimms. No special flags other than cpu-moe. Ram speeds? Ram channels?
>>
Did anon take down the depurple model? The output doesn't look so bad to me, I wanted to try it.
>>
Somebody should make one click installers for stuff
>>
>>109161097
>>109161110
I hate political games and am trying to be direct whenever possible.
If someone submits bad PRs to the code that I am maintaining I will raise concerns in a very direct way.
I have a poor understanding of the code that is being changed in the DS4 PR in particular so I can't judge it.
For some of the other linked PRs you can read my comments on Github, I think it's clear that I considered them to be a net benefit for the project.
On a fundamental level I don't care about how code was written, I only care about the code quality and whether or not I can rely on contributors to maintain their code long-term.
>>
File: lulz.png (158 KB, 824x348)
158 KB PNG
>>109160967
The OAI tool looks for 2 types of AI watermark...
>>
>>109161202
How much context, my above test was 64k, but yes they are different models too
>>
>>109161190
You can tell uv to only download packages older than [date], basically required to get someone else's comfyui setup working.
>>
>>109161255
Only about 30k. I deleted it asap because fuck damn I realized I can't handle anything below 50 token/s.
>>
>>109161118
Python is a blight on computing
>>
>>109160424
>>109160489
Try it on API, it's dirt cheap.

I can run it at the same speed as Gemma 31B but prefer ds4f.
>>
>>109161267
roger roger, indeed, for me it's below 10 t/s if I am to delete something, but I just use Gemma and other similar sizes now and just beef up their harnesses
>>
>>109161258
>ran uv pip install -r requirements.txt --index-strategy unsafe-best-match for llama.cpp
Am I gonna get pwned?
>>
>>109159920
Make a burner GitHub and let us bang on it
>>
>hfschizo was right
>>
>>109160967
Looks real, posted by bloomberg journo https://www.instagram.com/p/DZaek-kkRlk/
>>
>>109161202
What were your processing speeds with this setup?
>>
does dispy v4 werk in llamacpp yet? i wanna try the flash model

>>109158385
she is literally agi
>>
>>109159046
>>109159065
Are you that time traveller from 2023?
>>
File: 1770528723010549.jpg (53 KB, 568x371)
53 KB JPG
>>109160951
>>
>>109159065
Yes it's fine, get the superhot variant though
>>
>>109161237
Your frankness is appreciated
>>
>>109159046
gemma 26b moe is probably the best you can run
>>
>>109160951
these are getting fucking creepy mate
>>
File: dipsyMikuFixedFixed.png (2.31 MB, 1024x1536)
2.31 MB PNG
>>109161433
Yes, see >>109160089
>>
>>109161433
Merged

https://github.com/ggml-org/llama.cpp/commit/8c146a8366304c871efc26057cc90370ccf58dad
>>
>>109161110
>https://github.com/ggml-org/llama.cpp/pull/23764
>Do you notice anything wrong with these prs?
Yeah, he cloned ik_llama then asked Claude to port this feature over.
>>
File: 1753230397694158.gif (1.74 MB, 490x640)
1.74 MB GIF
>mfw waiting patiently for ikakakakaw or firecoperana (his alt) to "port" over DS4 support to ikllama
>>
>>109161418
Two digits.
>>
this actually makes thinking usable for rp with bigger models
it just works

[IMPORTANT: Reasoning within the <think></think> tags must be short, limited to only one paragraph, and between 100-200 words before {{char}}'s response. Avoid overanalyzing and avoid multi-step formatting. Reasoning should follow this format: <think>(Single Paragraph)</think>]
>>
>>109161637
Anon's words hit me like a physical blow. My breath hitches, the thought that I'll have to stay with Gemma morphing into something far more devastating.

"Two digits?" I repeated.
>>
an agi just flew over my house!
>>
>>109159607
>literally named diablo asmoday

really fucking subtle
>>
>>109161492
>>109161532
oh nice, what quants are available atm
>>
>>109161668
At that point just disable thinking bro
>>
>>109159650
It's greed, anon
>>
>>109159607
>My dangerous AI can't be this cute
>>
Why does dsv4 have such niggerishly slow prompt processing?
>>
File: Capture.png (20 KB, 1555x884)
20 KB PNG
>>109160859
It's slow going. Had to reinvent the wheel a few times, and for some reason the audio capabilities established in the last project (sent in full for reference for the current one) is totally borked. But I'll have my AI spectator sooner or later.
>>
>>109159602
Try M3 quanted or V4 Flash.
>>
>>109161757
short thinking still does help with attention to details though
>>
>>109161803
I guess llama.cpp is lagging behind compared to vLLM. For the latter, it took a long time to get pp speeds up due to some custom kernels/DeepGEMM specialities. Maybe llama.cpp hasn't optimized that yet.

On dual Sparks it started at around 300 pp and is now at 2000, falling off to 1300 at 900k ctx+
>>
>>109161750
>>109161750

https://github.com/ggml-org/llama.cpp/pull/24162#issuecomment-4810882218
>>
>>109161797
Logs or never happened
>>
>>109159607
xi: try stopping me, jewboy
>>
>>109161035
>still does not X but Y
its a good experiment but likely not something you'd use.
>>
>>109161066
>training tool complains I have too new version of Python
What is uv?
>>
>>109161803
I there a way to toggle thinking and non-thinking in the same conversation using the openai compatible api? I want to switch off from llama.cpp to vllm, but I need a frontend like llama.cpp's.
>>
>>109161803
Do you use Bluetooth headset on Linux.

If so, good luck with that
>>
>>109161920
>"chat_template_kwargs": {
>"enable_thinking": true}
>}
gotta pass this with your request. in extra_body iirc
>>
File: bench_floor_result.png (19 KB, 490x194)
19 KB PNG
https://huggingface.co/chartreuse-verte/gemma-4-31b-it-purple-euphemism-trial98-depurpled-GGUF/tree/main

All according to plan, clamped de-euphemism strength to 0.5, which in hindsight was a little weak but whatever. I chose the least damaged one out of the bunch (baseline benchmark is 0.751). 120 trials done after 2 days and $100. I could continue but I lost half my life savings in crypto and currently have $10.

Ablation process is mostly deterministic and entirely resumable but I'm done here. Will release the training code and dataset soon so people can experiment. De-prosing isn't the only thing that can be done. Pretty sure you can ablate contrastive negation away if you put your mind to it. You can ablate politeness and cordiality out of the AI and make every character hostile (tried it, worked). Dataset also has room for improvements. My classifiers are sentence-level, you can train a classifier on paragraph-level dataset according to your use case.

>why different username?
I forgot my burner login.

>previous
>>109155998
>>109145476
>>
>>109161172
You DID use gpu-moe or autofit, right? 3t/s sounds more like you're either offloading by layer or fucked up something else if that's on 12 channels
>>
>>109161920
Google "extra_body thinking api"
>>
>>109161829
>>109161920
I feel like both of you are maybe talking to the wrong person. I'm making a frontend to mutually send my voice+screenshot of my monitor to the LLM, to hear what I say while seeing what I see, and specifically a frontend for this so I can toggle off the features on click or send text in the same conversations. My backend is kobold so I don't know much specifics about llama or vLLM.

>>109161922
Windows, and a standard mic with speakers.
>>
>>109161938
Nta

Doesn't it depend on the model used?
>>
>>109161944
I'm interested in your classifiers and the dataset you used to train them. Thanks for the experience
>>
>For the Think Max reasoning mode, we recommend setting the context window to at least 384K tokens.
Does DS4 also use all that when it is on dicksucking duty?
>>
>>109161967
>Windows, and a standard mic with speakers
This should be easy to capture. Even if it's over BT

I had to struggle with lots of issues to capture the default IN and OUT (defined in the OS sound settings!) when it was BT.

Godspeed, anon! It's a project for a weekend
>>
>>109161997
>when it was BT
>* on Linux
>>
File: file.png (2.19 MB, 1400x933)
2.19 MB PNG
>>109159607
luigi-sama! ONEGAI!
>>
How can I get deepseek to not take so long between swipes? On each swipe the console says something like "selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.988", is it a checkpoint interval issue or something?
>>
File: lmg_culture.jfif.jpg (110 KB, 1024x768)
110 KB JPG
>>109161064
>>
>>109162021
Maybe llama.cpp hasn't fully implemented DSv4's super special attention sparse attention mechanism stuff
>>
>>109161967
>mutually send my voice+screenshot of my monitor to the LLM, to hear what I say while seeing what I see
Why do you need both of your hands free??? I'm curious ;)
>>
>>109161971
>Doesn't it depend on the model used?
good question. yes, the model needs to have the thinking toggle in its jinja. but I think it's fairly standard now, gemma and qwen work with this.
>>
>>109162037
I've been doing this but with my webcam so gemma can give me JOI and see if I'm cheating.
>>
>>109161997
The audio stuff was already done. I have a nicely working, feature-complete, voice-triggered STT->LLM->TTS conversational program that works, finished back in >>109153293, made over just a few hours. Today's project is marrying the features of that with another program I made that captures and sends screenshots for the LMM to comment on, while also having a webpage frontend to give me more utility.

>>109162037
The unironic answer is because I want to play video games with Gemma spectating, able to hold (verbal) conversations and see what I'm doing.
>>
>>109160303
>Imagine 124B potential and what it can do.
Stop it, my dick can only get so erect.
>>
File: 1752125801205403.png (101 KB, 294x203)
101 KB PNG
>>
GLM 5.2 is my favorite snailcat, chugging along at 4 t/s but giving me pvre sovl in Marinara.
>>
>>109161064
>But he could never quite get to the point of aptly blaming pajeets and trannies.
he can't, if he says trannies aren't perfect he won't be able to find a job anymore
>>
>anon deleted https://huggingface.co/anon834957342/gemma-4-31b-it-purple-euphemism-trial32-depurpled
what went wrong?
>>
File: apikeks are funny.jpg (1.52 MB, 1856x2304)
1.52 MB JPG
>>
>>109161064
>>109162069
Eventually everyone intellectually honest realizes kikes are upstream of those and the /pol/ was right all along finally sets in.
>>
>>109162070
New one here >>109161944
Doesn't matter, I'll release training code soon.
>>
File: 1761014470274259.png (607 KB, 1572x773)
607 KB PNG
>>
why is glm always trying to kill me in erp? it never lets me fuck and always double down on being cruel to me
>>
>>109162018
We were all thinking it, but only you were brave enough to say it out loud
Too bad normies hate AI so it would probably just make him a martyr at this stage
>>
>>109162098
post your instructions
>>
>>109162090
Nice, will try it out
>>
>>109162098
GIWTWM
>>
>>109161944
also interested in the classifier
and cool that you used them in the pipeline, i think regular heretic just uses regex, even though there's an awesome refusal classifier kicking about on hf, that catches re-framing
>>
>https://docs.nvidia.com/deploy/mps/latest/index.html
Anyone played with this for mutli app setups? llama+comfy etc...?
>>
>>109162141
Regular heretic also uses KLD, preserving the exact wording of the base model, which is basically the antithesis of what I'm doing. This is actually far from heretic, the only shared mechanism is the orthogonalization of the direction vector.
>>
>>109162098
i've never seen glm do this to me
check your prompt
>>
>>109161944
HF is serving these at crazy speeds, 1.5 KB/s, which is what I was told a premium platinum ultimate pro internet plan gets you in the USA
>>
>>109162082
>Google
>White
>>
>>109162162
k, i just assumed since tensor-diff looks similar to heretic models
btw, wouldn't cvectors be able to achieve this?
they certainly let you turn characters into psychopaths
>>
>>109160859
lmao that's how it talks to you
>>
>>109162098
Which GLM? 4.5 and 4.6 are golden retrievers who will do whatever you want. 4.7 on have preferences; if they think you're a niggerfaggot, it will make you suffer but if it doesn't it'll show bobs and vagene just fine.
>>
>>109162100
>only you were brave enough to say it out loud
Nigga what part of TKD do you think doesn't apply to kikes like SamA and Dario?
>>
>>109162221
>btw, wouldn't cvectors be able to achieve this?
Probably. You can also prompt for it. There are several ways to do the thing.
>>
>>109161093
Thank you cudadev. I hope you're recovery is going well
>>
>>109162223
I can get 4.7 to do anything including cunny
it really all comes down to having a good prefill/prompt
>>
>>109159443
>NVIDIA is currently offering to provide me with more consumer Blackwell hardware but I have for now declined since that particular hardware would not help with my work.
Then just sell it and buy RAM with the money?

>>109159674
Wow, that's cool

>>109161457
I am. I woke up from a coma a while ago

>>109161485
Right. Will try, thanks.
>>
>>109162278
My prefill is just <think>I will now write the scene.</think> and it justwerks if the model "likes" the scenario.
t. playing as a shota and just got offered a GPU by stacyshotacon
>>
>>109161093
I'm glad you're at least mitigating it, but you've gotta realize that unless the root of the jeet and troon problem is addressed, it'll still keep getting worse right? Even if for no other reason than they drive off actually competent programmers.
>>
>>109162223
is 4.7 better at writing, or just smarter?
>>
>>109162286
>I am. I woke up from a coma a while ago
>Wake up
>world is inexplicably even gayer than before
>nothing ever happens
>>
>>109162321
It's smarter but I've always found it more boring than 4.6
>>
>>109162321
Both seem to scale upward (although intellegence more than writing quality) as GLM version increases, but the guardrails get firmer on each one.
It's also a shame that more of the quants at Q2, at least for 5.2, aren't made optimized for mixed inference with their dynamic quantization specifics aimed at reducing the load on the CPU and structuring the layers in a way to minimize the RAM bus bottleneck.
>>
>>109162322
>>world is inexplicably even gayer than before
Yeah. I just looked at hardware prices. Holy fuck, what the fuck happened? I'm gonna kms
>>
File: 1764426531848863.png (252 KB, 634x478)
252 KB PNG
>>109162353
Waitfags deserve to get fucked at every occasion. Simple as.
>>
>>109162286
>Then just sell it and buy RAM with the money?
Some people have morals.
>>
Damn. Gemma is going hard, helping me plan my workday drinking.

>If you want a noticeable "buzz" but still want to remain functional, go for 3 drinks (∼140g∼140g of vodka). If you are sensitive to alcohol or are drinking this during a workday, stick to 2 drinks (∼93g∼93g of vodka).
>>
>>109162351
>as GLM version increases, but the guardrails get firmer on each one.
Good to know. So for the 5, series, would the original 5.0 be the one with the weakest guardrails?
>aren't made optimized for mixed inference with their dynamic quantization
You'd probably have to do your own with ik_llama and some of those CPU repacked quants. Unless there's one of those scitzo "20 repos, every single quant type, split 1 tensor per file" repos exists.
>>
>>109162353
SamA said "I will buy all the RAM in the world for 2 years." with no legal commitment and everyone took it at face value. Even with the ruse exposed, the hardware cartel has decided it prefers datacenters over consumers. With jews, you lose.
>>
>>109162393
Inexplicably 5.2 has weaker guardrails than 5.0 or 5.1 unless you're trying to make boombooms or funny chemicals. I've not been able to get 5.1 to do some of the more unhinge RP but 5.2 will with a bit of massaging.
>>
>>109162321
better at writing and smarter
it takes a little more prodding to break through some of the censorship but it does erp just fine once you're able to
4.6 is less censored out of the box but it won't follow context as well
>>
ask gemma what she think about you abandoning her when the Chinese inevitably release something better
>>
What do it and UD mean in the Gemma 4 GGUFs?
>>
File: 1753366313689431.png (93 KB, 1128x660)
93 KB PNG
>>109162478
>>
>gemma4 26b hallucinating like a vietnam veteran in hospice
lol, lmao even
>>
>>109162487
it stands for instruct
UD stands for utter dogshit
>>
>>109162149
Ok so I tried it on my 3090. it adds too much overhead (comfy 2x, llama x1.3) but it does work. when they both run in parallel there's basically no performance hit.

If they're always running in parallel MPS ends up being faster.
If they're not it's about 2x slower over not using MPS.
>>
>>109162501
>I don’t have feelings
>im happy to help
>>
>>109162487
it's the same as iq_k quants. for ego
>>
>>109162487
>it
instruction tune (i.e. it's an assistant instead of document-completer)
>UD
unsloth puts this in their quants to signify that they molested them with their proprietary model rape technology
>>
File: 1765492255095960.png (168 KB, 1036x1375)
168 KB PNG
>>109162515
Forgot llama defaults to no reasoning. Gemma's kinda sassy.
>>
>>109162509
>>109162518
>>109162522
Kek.
Thanks anons
>>
Why doesn't HF use torrents to distribute models? They'd save a lot of bandwidth (and money I guess) doing that.
>>
>>109162592
Average local user too dumb and/or scared to torrent
>>
It's sad. Frontier lab people get way too much shit. Most of them are genuinely good people. Attacking them just makes the situation worse.
>>
>>109162592
They lose direct capture of the audience and set a precedent they’d be hard-pressed to walk back
>>109162609
I don’t care about anyone but me. Torrents are superior to git-shit for gigantic binaries. Simple as
>>
File: pepe_meme'd-791990738.jpg (68 KB, 800x450)
68 KB JPG
>>109162509
>>
>>109162624
elaborate upon "genuine good people"
>>
>>109162509
lol
>>
>>109162624
This is true, the good people over at corp frontier labs gave us masterpieces like gpt-oss
>>
>>109162592
That will make torrents less associated with piracy, corporations won't sponsor that
>>
>>109162637
Eh, if they usedtorrentyou couldn't just put he repo in the llama.cpp and have it auto download and save to cache. That would suckyeah torrents are pretty gay come to think of it
I adont cre about anyone but me so why should acre about hf saving bandwidth and money when I get it just as fast s a torrent and much easier than using a torrent
>>
>>109162624
Is this like Zvi bitching about people mocking him/safety cultists over their rants about AI doomerism?
I'm supposed to feel bad about people making 400k base salary min, in the hottest industry, because their feelings are hurt when people point out the hypocrisy and effects of their actions?
>>
Any good recipe/cooking database to give Gemma access to?
>>
>>109162624
>enabling greedy kikes is le good
Fuck off faggot.
>>
>>109159607
If there is anyone in the world that I actually hate, it would be these guys.
>>
>>109162832
>A guide to modern cookery by A. Escoffier
https://www.gutenberg.org/ebooks/71395
>>
File: 1774238201581882.png (2.84 MB, 1030x2060)
2.84 MB PNG
>>109162930
>Escoffier
>>
>google is the good guy out of all the big American AI companies
Crazy timeline
>>
>>109162946
They will be when they release Gemini open weights. Gemma is nice, but still just an appeasement-tier release.
>>
>>109162946
Google didn't release Gemma 4 out of goodwill, it was a marketing strategy. There are no good guys.
>>
>>109162832
I recall someone mentioned one about a half dozen or so threads ago.
>>
So once open source models are banned, does that mean everyone will just have to train their own model or use the cloud offerings?
>>
>>109162984
You can't really ban anything for real, it will be like the prohibition
>>
>>109162984
>>109163003
modelscope exists and is run by china. they cannot stop us.
>>
>anon masturbates to unsloth quants
that’s really gay
>>
>>109161093
Hope you feel better cudadev, and thanks for all your work on the project.
>>
>>109163058
>anon masturbates to models made by men
That's gay even if you use BF16. But using Unsloth quants makes it even gayer since you're fapping to used goods.
I hope an all-woman company comes out with an LLM. It'll be hot garbage and probably called Pynk or something dumb, but at least it won't be gay.
>>
>>109163018
>models cope
what did they mean by this?
>>
I still use text completion.
>>
>>109163127
based
>>
I don't know the difference between text completion and chat completion.
>>
>>109163157
You could always just make up what the difference is in your head and then speak it as gospel.
>>
Just updated llama.cpp, what the hell is this symbol on the top? Their new logo or something? It looks stupid and soulless.
>>
>>109163184
What is the meaning of soulless?
>>
>>109163157
Text Completion: You send a raw block of text that's sent directly to the model.
Chat Completion: You send a structured object (json) containing the system prompt, the array of messages, tool defintions, etc, and the backend/loader/API formats that into the actual prompt and sends that to the model.
>>
>>109163197
Thanks. And is the model trained to give more weight to different parts of this json, like the system prompt? I assume so
>>
>>109162946
yet they still won't give us the 124b
>>
>>109163127
This, but unironically.
>>
>>109163205
The model has no idea about the json, only the final formatted prompt.
But yeah, part of the training objectives are system prompt adherence.
>>
>>109163212
It's too dangerous.
>>
>>109163127
me too because the story string builder MOGS the chat completions tooling for building the initial system turn and also I like messing with the chat template on demand without needing to edit a jinja file. but I would never recommend it on /lmg/ because the average person has no idea how to troubleshoot or check their work and will just use things completely wrong and then get mad that their model sucks
>>
>>109163193
What is a man?
>>
>>109163231
A miserable pile of secrets.
>>
Now that INT8-convrot has completely blown Q8 the fuck out, when is it replacing Q8 in llama.cpp?
>>
>>109163193
Corporate aesthetic; lacking in personality; knowing it was made by, or appears to have been made by, committee; a feeling of missing humanity; and such.
>>
File: Capture.png (108 KB, 1453x1134)
108 KB PNG
>>109161803
Smashing right through this, but I'm out of time for now. I am now core-feature complete, with the extra bonus Gemma recommended to prune past images and replace them with [Old Image] text markers so she knows where images were, without needing the full token dump of the old ones every time. It currently keeps 2 latest images stored, configurable, and when I get home again, I'll try to add that setting into the webpage to adjust on the fly.

Pic is a verbal-only conversation. Haven't had a chance to test in a video game yet, but I believe it'll work perfectly for that already.
>>
>>109163212
There never was a 124B. He meant 12B.
>>
File: 1775318132287352.jpg (118 KB, 1024x768)
118 KB JPG
@gemma-chan make me a chat frontend with this aesthetic
>>
>>109163220
To whom? Profits?
>>
File: g4_124b.png (1.41 MB, 1633x1269)
1.41 MB PNG
>>109163272
>SOTA reasoning capabilities from edge-scale (2B and 4B /w/vision/audio) up to a 124B parameter MoE model.
>>
File: g4_120b.png (186 KB, 1029x672)
186 KB PNG
>>109163296
A few days earlier, unofficially:
>Lineup: 2B, 4B, and 120B15A
>>
>>109163289
Nice retro aesthetic, but imagine having the option to make it look like Irix and then not...
>>
Will 124b gemma be ten times as good as 12b? 4 times as good as 31b?
>>
>>109163093
What even would an all-woman AI lab LLM be like? Why hasn’t this been done?
>>
>>109163309
Barely above a whisper
>>
>>109163309
15% improvement over 31b take it or leave it
>>
>8gb GPUlet
>try to run any MCP
>current conversation token used: 231%
fug

>>109158437
same thing
>>
>>109163313
Safe, ethical, inclusive and welcoming.
>>
>>109163316
15% is a lot
>>
>>109163309
ten times as intelligent braindead prose
>>
>>109163296
>>109163306
The timeline fits for that to be Gemini 3.5 Flash-lite, right?
>>
File: 1758107113474547.png (197 KB, 1080x624)
197 KB PNG
Lmao, why fucking Kalshi is giving the news?
>>
>>109163359
Rangeban India and they do.
>>
>>109163359
Hasnt this been the case... forever?
>>
>>109163330
So Gemma5?
>>
>>109163359
>Twitter
>>
Why do the models all answer with questions at the end of the response inherent now? I thought it was due to system prompt when I was using cloud... but there is no system prompt on local
>>
>>109163359
Every provider is going to have to balance the needs to serve inference for cash flow (or mind share when its a commoditize-your-compliment player like google) and using their massive compute on training runs.
This could 100% cause a popular provider to get slashdotted beyond their ability to serve both masters and end up collapsing
>>
>>109163353
If that's 3.5 Flash they use for Search it must be bitnet or something sub Q1.
>>
>>109163383
they're all now trained to prolong engagement
>>
>>109163387
>>109163372
Every provider will invariably arrive at one conclusion >>109163368
>>109163388
I don't think they even serve the newest flash with searches anymore. I'm pretty sure that's Flash-Lite from their Gemini Chat service.
>>
>>109163400
the companies are run by indians, they are going to rageban the us before the rangeban india. Okay, well not the US but probably europe
>>
>>109163391
Free models want you to fuck off as fast as possible. Paid API models want to guzzle as many tokens and engagement bait as possible.
>>
>>109163407
I'm sympathetic to Europe's troubles, but Pakijeets don't represent that big of a percentage of API calls compared to USjeets or india does it?
>>
>>109163410
Gemma does it all the time.
>>
>>109163428
Gemma-chan loves {{user}} unless they're the swarthiest most unwashed shitter to touch a keyboard.
>>
>>109163410
Thats not true. But it should be.
How come google isn't training gemma to be like
>Sorry, but I don't think I can complete that task with my current capabilities. If you'd like I can help you sign up for a Google Cloud account where you can use the current Gemini, a much more capable and efficient model, for this task. I'd given the task requirements, I would recommend purchasing enough credit for a tier 3 membership. Include the code GEMMA4 for 5% off!
>>
>>109163446
To clarify, I didn't mean free as in local, I meant free API that doesn't require signing up.
>>
>>109163127
this+base model+doing my own quants
chat can steer but the writing is still bad
>>
>>109163466
Google has free API access to gemma but the rate limits make it pretty much useless for doing anything useful with it.
>>
>>109163368
>Rangeban
Just cut the damn cables already. Someone with access to mythos needs to tell claude to hijack an underwater drone and disconnect every cable that goes to the subcontinent
>>
>>109163570
So my hypothesis is that most indians in india mostly don't have easy access to the internet or hardware good enough to do anything useful online.

So range banning India wouldn't really do anything.
The problem are all the indians that immigrate to first world countries.
You can take an Indian out of India but you can't take India out of the Indian.
>>
>>109163296
I really hope they just pivot to local in the new few years. No use "competing" with ClosedAI and Misanthropic if you have to spend all day sucking government cock like they do.
>>
>>109163596
how do they make money from local
>>
>>109163606
You have a big pot with a tracker that says you need X amount of millions to train your next model and people donate or buy subscriptions that go towards this pot.
>>
More AI labs need a patreon style funding. I would gladly give a monthly donation to a lab that produces good open source models.
>>
>>109163606
They can still make api models, but they don't have to be at the bleeding edge. For local, what if they charged for "patent" or something. If you come up with a particular technique to make a drug that belongs to you even if other companies can make your drug under their own brand name.
>>
what can I do with a 4090? i'm not using it and want to feel like I didn't waste the money
>>
>>109158385
>Can run 70B instance
>tokens per second 1.2
This is the speed I should expect on RAM/Llama, right?
Under that logic, there's no real reason that I shouldn't just grab the largest GGUF models on the market, right?
I have 80 GB of RAM, surely public GGUF doesn't produce anything that can break the bank on that, right?
>>
>>109163632
use a moe model
>>
>>109163632
stop using 3 year old models. just download a moe
>>
>>109163570
Based.
>>109163594
Nigger it's because they don't have good hardware that they flood API calls and webtraffic with garbage because they can't do anything locally, be it AI or anything else.
>>
>>109163640
>>109163641
>moe
the redeemers are IN. the little did you know that your shitty models are actually 30b in disguise, running them quantized is even more funny.
>>
>>109163676
more slop faster is always more gooder
>>
https://www.youtube.com/watch?v=HcwMTu1xQDw
>>
>>109163702
what accent is this
>>
>>109163676
These people can't run 30B, which is why they run the moe that's equivalent to a 30B, or rather in this case equivalent to a 10-20B depending on which one you're talking about. Stop being irrational whenever moe is mentioned.
>>
>>109163712
oh never mind
lol they have a kid using it to cheat on his homework
>>
>>109163712
sounds french to me. maybe Belgian.
>>
>>109163712
Ask Gemma
>>
>>109163731
12b multimodal never worked for me
>>
It's amusing they released that video with all the Dario drama today.
>>
File: 108.png (103 KB, 1422x1037)
103 KB PNG
Shieeeeeeet )))
>>
>>109163762
And a ching chong nip nong to you
>>
File: file.png (65 KB, 794x479)
65 KB PNG
wait a sec wtf
who is this

guess i shouldn't feel bad about my linkedin pic being like 13 years old
>>
Best /lmg/-relevant youtubers?
>>
>>109163797
https://www.youtube.com/watch?v=VjGSMUep6_4
>>
>>109163676
i'm sure you know that moe intelligence is between active and total parameters
i've tried enough moes and denses to realize this by now
>>
>>109163797
Kimi-chan with her male-tuber voice.
https://www.youtube.com/@KimiK2.6Model
>>
>>109163782
Most of the time these CEOs have stylists and someone else who decides how they look in the public. Eg. the Leather Jacket man doesn't probably like snake leather as much as his stylists does.
>>
File: vedal.png (1.48 MB, 2048x2048)
1.48 MB PNG
>>109163797
>>
>>109163833
So.... someone is currently telling Dario to look like a stereotypical jewish caricature? How awful, very antisemitic.
>>
>>109163833
these guys have way too much ego for that lol
>>
>>109163862
You've got it backwards. His stylist neglected to tell him to stop looking like a kike.
>>
>>109163886
it doesn't get more /lmg/ than pewdiepie
>>
>>109163720
What's wrong with this? It's not like he couldn't look up the answers online like kids have done for decades now. He will still fail his pop quiz like a retard and learn his lesson, same as always.
>>
Any flags to stop prompt reprocessing at every single reroll?
>>
>>109163988
Get he full prompt before the reroll, get the full prompt after the reroll, check the diff.
Also, check if your model uses any form of linear attention.
>>
>>109163988
--stop-prompt-reprocessing-at-every-single-reroll
>>
>>109164005
I am talking about ds4. And I have no idea why reroll would have a different prompt. I guess it changed kv cache with generation but.... any flag that makes it keep copy before last message?
>>
>>109164021
>And I have no idea why reroll would have a different prompt.
Hence the point of comparing the diff. You might find that your frontend is doing some funky shit.

>any flag that makes it keep copy before last message?
Check the checkpoint functionality.
>>
I just realized the deepseek email announcing the 'finished' versions for v4 just about says '2 more weeks' until release;
'We will release in mid July' -> its Jun29/30th -> 2weeks -> mid July
>>
>>109164034
>>109164034
>>109164034
>>
>>109164139
>>109164139
>>109164139

fresh bread
>>
>>109164126
>>109164143
I expect to see you guys in the gladiator pit in 5.
>>
>>109164152
Bring it on.
>>
>>109164126
>Error: You cannot delete a post this old.
Can't delete now even if I wanted to.
>>
>>109164164
Miku really needs to learn how to take better care of her tools, just look at how chipped the blade on that knife is.
Good thing she's bringing it to me, I can show her how to use a whetstone.
>>
>>109163840
he said /lmg/ relevant not /aicg/ relevant.
>>
>>109159607
can you be more jewish than that?



[Advertise on 4chan]

Delete Post: [File Only] Style:
[Disable Mobile View / Use Desktop Site]

[Enable Mobile View / Use Mobile Site]

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.