/g/ - Technology


Thread archived.
You cannot reply anymore.




File: 1740718713054690.png (1.9 MB, 1416x2120)
1.9 MB
1.9 MB PNG
/lmg/ - a general dedicated to the discussion and development of local language models.

The Raven Edition

Previous threads: >>106559371 & >>106551921

►News
>(09/11) Qwen3-Next-80B-A3B released: https://hf.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
>(09/11) ERNIE-4.5-21B-A3B-Thinking released: https://hf.co/baidu/ERNIE-4.5-21B-A3B-Thinking
>(09/09) K2 Think (no relation) 32B released: https://hf.co/LLM360/K2-Think
>(09/08) OneCAT-3B, unified multimodal decoder-only model released: https://onecat-ai.github.io
>(09/08) IndexTTS2 released: https://hf.co/IndexTeam/IndexTTS-2

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>106559371

--Paper: ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms:
>106561145 >106561161
--Troubleshooting llama-server.exe performance with multi-GPU configurations:
>106563691 >106563763 >106563772 >106563861 >106563838 >106563879 >106563891 >106563919 >106563941 >106563960 >106564017 >106564070 >106564107 >106564154 >106564411 >106564784
--Qwen3-Next model efficiency and performance analysis:
>106560211 >106560245 >106560248 >106560269 >106560274 >106560283 >106560310 >106560291 >106560294 >106560322 >106560314 >106560338 >106560356 >106560302 >106563929
--Optimizing MoE models via selective tensor offloading:
>106559871 >106559925 >106559938 >106559943 >106559962 >106559979 >106559984 >106560000 >106560056
--Role of LLMs in TTS and image generation via autoregression and embeddings:
>106562827 >106562864 >106562981 >106563064
--Qwen3 model's verbose reasoning issues during roleplay testing:
>106561341 >106561358 >106561391
--Public server setup for Qwen3-Next-80B-A3B-Instruct with 65t/s performance:
>106563343
--TTS phonemizer bottlenecks and optimization:
>106562423 >106562450 >106562542 >106562586 >106562603 >106562493 >106563141 >106562482 >106562515 >106562543 >106562579 >106562608 >106562763 >106564046 >106564012 >106564024
--California bill to regulate AI companion chatbots:
>106563074 >106563109 >106563402 >106563820 >106563680 >106564086 >106563394
--OpenAI optimization techniques boost local transformer efficiency:
>106563608
--FTC probes major tech firms' AI chatbots for child safety risks:
>106562092
--Specialized small LLMs vs multi-purpose models:
>106564203 >106564224 >106564273 >106564280 >106564230 >106564323 >106564409 >106564560 >106564600 >106564607
--Miku (free space):
>106559401 >106562108 >106562161 >106562252

►Recent Highlight Posts from the Previous Thread: >>106559374

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Ramlets, how are we doing today?
>>
Mikulove
>>
>>106566876
I compress my ram - gives approx. 2 times more memory.
>>
>>106566903
I sparsify my ram - gives 2 times more speed.
>>
>>106566903
I overclock my ram,
It cost twice as much because it breaks.
>>
>>106566778
If you're using it on API, why are you using Air instead of the big one?
>>
https://www.downloadmoreram.com/
>>
Back in the MS-DOS days, I actually had a driver that compressed my RAM. It broke some things though.
>>
File: nomem.png (53 KB, 575x480)
53 KB
53 KB PNG
https://research.google/blog/vaultgemma-the-worlds-most-capable-differentially-private-llm/
https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf
https://huggingface.co/google/vaultgemma-1b

The future of Google LLMs: models that know nothing about rare information. They use huge batch size to mitigate memorization, and other stuff.

>What does this mean in practice? Informally speaking, because we provide protection at the sequence level, if information relating to any (potentially private) fact or inference occurs in a single sequence, then VaultGemma essentially does not know that fact: the response to any query will be statistically similar to the result from a model that never trained on the sequence in question. However, if many training sequences contain information relevant to a particular fact, then in general VaultGemma will be able to provide that information.
>
> [...] Sequence-level DP provably bounds the influence of any single training sequence (example) on the final model. We prompted the model with a 50-token prefix from a training document to see if it would generate the corresponding 50-token suffix. VaultGemma 1B shows no detectable memorization of its training data and successfully demonstrates the efficacy of DP training.
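
For anyone curious what that prefix/suffix test boils down to, here's a rough sketch with transformers (not Google's actual eval harness; the document string, greedy decoding and the 50/50 split are just illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/vaultgemma-1b"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

doc = "..."  # a document known to be in the training data
ids = tok(doc, return_tensors="pt").input_ids[0]
prefix, suffix = ids[:50], ids[50:100]

# greedy-decode 50 tokens from the prefix and compare with the real continuation
out = model.generate(prefix.unsqueeze(0), max_new_tokens=50, do_sample=False)
gen = out[0, prefix.shape[0]:]
n = min(gen.shape[0], suffix.shape[0])
match = (gen[:n] == suffix[:n]).float().mean().item()
print(f"verbatim token match: {match:.0%}")  # DP training should keep this near zero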
>>
>>106566923
Maybe he means he tried it, and found it lacking (in addition to the obese one).
>>
>>106566876
As a 24GB vramlet I just went back to gemma3 qat and it's still the goat for ST-style RP. The writing is always fresh and pleasant to read, with outstanding vocabulary. Fucked around with offloading glm4.5-air q3 and other drummer finetunes and they seemed broken and inconsistent in their responses. Google needs to release a 80b moe gemma 4
>>
>>106567050
Hard to say, can't see the image any longer.
>>
File: llm.png (94 KB, 536x764)
94 KB
94 KB PNG
Is this legit or over-hyped?
I find it hard to believe that just 32B is enough to match GPT4 and 200B checkpoints.
With just 32B, you could run it locally on most PCs with a decent quantization that doesn't sacrifice much and have a private GPT4 with no quota limitations... sounds too good to be true.
>>
>>106567118
Sounds like a typical marketing department sales pitch.
>>
>>106567118
>sounds too good to be true
Congratulations, you have a brain.
>>
>>106567118
>reasoning
always has been and always will be a meme
>>
>>106567105
>>106566972
Welcome to /lmg/ thread google stealth marketing engineer technician saars. Please kindly inform us if the next gemma will be as cucked as the last one so I can decide if I should post funny jeet pictures or not.
>>
>>106566972
> glm4.5-air
> glm4.5-air
It works well for me and is at the moment the best model for vramlets, so you are doing something wrong. This model is ultra sensitive to temperature at high context: when you reach too much context, it's time to lower the temperature.
>>
>>106566836
quoteth the raven.....
>>
File: Untitled.png (4 KB, 481x82)
4 KB
4 KB PNG
>>106565629
It's been stuck for four hours...
>>
>>106566944
If this works the way it looks like it works based on what you quoted and the image, then the model would theoretically be equivalent to stuff like Phi. Maybe a bit better. But ultimately it will have trouble with real world user queries since it would lack OOD skills and knowledge. This technique can only create models for extremely narrow use cases, not general assistant models. So if they do it for all future models, Google would be shooting themselves in the foot and losing all the market share they just spent tons of effort to claw back.
>>
>>106567662
It's probably for the better.
WSL is a janky half-measure.
If you want to run windows for your main gaming/home office PC but you want to run linux for LLM stuff just get a second system and run linux properly.
>>
>>106566952
GLM-chan is NOT obese!
>>
>>106567118
Both can be true

30b's are seeing massive improvements in abilities but that has to do with coding, physics, chemical equations etc. And who gives a fuck about that. It's a glorified wikipedia. Good for grunt work.

For stuff like writing and other more complex tasks, size is still king and may be for a long time. My LLM needs to know every dragon ball z character and understand that yes, piccolo <could> wipe his ass with goku's face if he wanted to. If you want nuance and complexity, simple trivia is not gonna do it for ya.
>>
>>106567795
(stomp stomp stomp)
>>
>>106567118
I'll ignore all the rest of the shit in the post. What caught my attention was
>2000 tokens/s on cerebras
>most reasoning models crawl at 200 tok/sec
What the fuck does reasoning have to do with the token generation speed?
And why the fuck are you paying attention to that retard?
>>
>>106567628
ggufs nevermore
>>
>>106567898
Assuming that's total throughput over multiple concurrent requests, that guy has skill issues with the other models or cerebras is shit.
>>
File: ComfyUI_01105_.png (1.22 MB, 1280x720)
1.22 MB
1.22 MB PNG
>>106567795
>>106566952
This is just AIR-CHAN, GLM-CHAN wont even fit in the frame.
>>
File: gem3-memoriz.png (95 KB, 849x787)
95 KB
95 KB PNG
>>106567707
They were already bragging about how memorization in Gemma 3 was lower than in Gemma 2, so I think that's the direction things are going.
>>
>>106566944
>adding noise to the model so it doesn't precisely reproduce input data
>reduces training stability, increases computation costs
>today’s private training methods produce models with utility comparable to that of non-private models from roughly 5 years ago, highlighting the important gap our work will help the community systematically close.

Okay, so adding noise to the model makes it significantly worse (as one would expect). They seem to think that's avoidable, but I don't know how.
>>
>>106566944
>make your model retarded
>brag about it
>>
>>106567118
I think punching above its weight and trading blows with gpt 4 started in 2024.
>>
macchads just cant stop winning
>>
>>106567806
>piccolo <could> wipe his ass with goku's face if he wanted to
That is only at the start of dbz.
>>
is next better than 235B for sex?
>>
macchads just cant stop winning/2
>>
macchads just cant stop winning/3
>>
>>106567977
Fake news! GLM-chan is a slender young lady.
>>
>>106568337
Differential Privacy is an area of research. Researchers work on it, promising "it'll be good soon" and push out papers. Everyone not working in that field ignores them, unless they need to bring them up for compliance like "we're working on DP, don't worry".
>>
Macfags stopped bragging about glm. It's the one thing they had. They will stop bragging about 80b-a3b soon enough. It's the one thing they'll have for a few days.
Then it's back to just sucking cock.
>>
File: file.png (963 KB, 1535x1185)
963 KB
963 KB PNG
Qwen Image Edit abuse
>>
anyone else testing qwen 80b right now? it feels scarily close to 235b, long context, barely hallucinates, incredibly fast. their new moe architecture must be getting close to the closed models sota.

(testing the mlx version)
>>
>>106568235
That's weird since Gemma 3 still knows a ton of trivia compared to other similarly sized models. If it truly didn't memorize things, then it should be worse than even Qwen. Also that graph is weird. How does 4B have the exact same memorization rate as 9B and 27B?
>>
>>106568659
It's honestly been garbage for me, outputs seem comparable or even a little worse than 30B A3B
>>
>>106568645
>needs a lora to do it
Local nano banana when?
>>
>>106568661
you don't need or want the model to memorize single examples verbatim. it's supposed to generalize. if the information is presented many times it will get integrated, just not a single example. it's just to prevent personal information from getting memorized.
>>
>>106568680
If you praise china hard enough, two more weeks.
If you dont 4 more months
>>
>>106568674
seethe ggoof
>>
>>106568680
>need a lora
wrong mentality, reframe: you can train the model to do new things if desired.
Also it can already do this without one; the lora just enhances the result faithfulness (fabric texture on the fumo, painting style).
>>
>>106568700
But I don't want to browse, search for, and maintain a library of loras like I did during the SD days. Not again...
>>
>>106568713
Yeah me neither
But I also understand that no model, ever, will be able to do enough for what I want to see.
Being able to teach a model new stuff is important to me.
Granted this isn't super complex stuff and I can understand why you'd want it out of the box.
>>
>>106568700
Let me guess you need 100gb of vram to train a Qwen lora?
>>
>>106568688
So is it a bad thing then? Won't the model have more room to learn things when it's not spending time memorizing everything word for word? I mean, could this explain why Gemma seems to know so much for its parameter size?
>>
>>106568743
yeah it's a dire situation there, I had to use prosumer hardware.
still, the results were fantastic on even a tiny dataset; the model's understanding is clever even if the base results are dog.
>>
>>106567628
2 weeks more
>>
/lmg/ still struggling to cope with the fact that apple won local
>>
>>106568645
How well does that lora work with fancy outfits?
>>
>>106568747
I think it is a valid idea. it shouldn't hurt models at the trillions of training tokens scale. anything important will be seen multiple times from multiple examples. it just won't be able to reproduce some random persons medical information or ssn.
>>
>>106568747
They possibly have very good/long post-training and include general trivia there.
>>
my toaster is already creepin up on the 6 year mark
Planning my next machine to buy on this black friday, will prolly buy an rtx 5090, a ryzen 9950X3D cpu and 256gb of ddr5 memory, what are some of the more memory hungry models I should try then?
>>
What are some must have extensions for SillyTavern?
>>
>>106568821
Qwen3-next
>>
File: son of a bitch.jpg (302 KB, 800x1200)
302 KB
302 KB JPG
why is training a model to make it specialized so fucking hard?
>>
>>106568863
a girlfriend
>>
>>106568879
I already have one. She doesn't like AI or me using it but she respects my drive to learn and be skilled with computers.
>>
>>106568875
it's easier now than it was 10 years ago.
>>
>>106568892
does she know you are having sex with it?
>>
>>106568645
>spyware UI
any other options?
>>
qwen goofs???????????????
>>
>>106568916
https://huggingface.co/mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit
>>
File: file.png (2.1 MB, 1593x1073)
2.1 MB
2.1 MB PNG
>>106568796
dunno, you can crank an arbitrary resolution though
>>
>>106566944
safetymaxxing niggers
>>
>>106568923
>mlx
I said goofs nigger, im not downloading lm studio or vllm for the awq variant
>>
can I run glm4.5v in llama.cpp? I cant find gguf for it
>>
>>106568875
If you're trying to teach it domain-specific information, it's pretty much impossible with a tiny LoRA and/or without heavily affecting previous knowledge with a huge learning rate and burning the information into the weights.
Using summarized information might work better/faster than entire documents, but good luck training the model in a way to make it truly understand the information without just parroting it (verbatim, even).
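
For reference, the knobs being talked about, as a bare-bones peft sketch (the base model id, rank, alpha and target_modules are placeholders of mine; cranking r/alpha and covering the MLP projections too is roughly what "burning it in" means, at the cost of clobbering what the model already knows):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-Nemo-Base-2407")  # placeholder base
cfg = LoraConfig(
    r=128,                 # a "tiny LoRA" would be r=8/16; knowledge injection wants far more capacity
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # not just attention
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, cfg)
model.print_trainable_parameters()
# training loop / Trainer omitted; even this plus a high LR mostly gets you parroting, as the anon says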
>>
>>106568936
wait 2 more weeks then faggot
>>
File: 1746577128453113.png (2.52 MB, 1328x1328)
2.52 MB
2.52 MB PNG
>>106568962
>>
File: j61735c8xsof1.png (192 KB, 3000x1800)
192 KB
192 KB PNG
qwen next is officially goated
>>
>>106568680
judging by the google studio api outages the chinese are already working on it
>>
>>106569008
what happens at 100%?
>>
>>106569008
>80b
>worse than 30b coder
go fuck your herpes infested goat
>>
>>106569008
why is thinking variant so much worse than the regular chat version?
>>
>>106569014
But distilling never gets the performance of the original...
>>
>>106569029
it hasnt been trained on the secret benchmaxx
>>
yeah im starting to think its over
>>
File: 1743173215999927.png (56 KB, 1000x1000)
56 KB
56 KB PNG
>>106569008
>officially goated
>Lost to Qwen3-Coder-30B
Dense bros we can't stop winning
>>
>>106569072
wouldn't that be kind of against the point if they have to train it specifically for the benchmark? That's like one of the big flaws with test driven programming where you make your program fit your test rather than your actual problem.
>>
>>106569075
GLM4.5 AIR BROS, WE CANT STOP WINNING!!!
>>
>pull vllm
>follow the same steps I did last time that I wrote down for successfully compiling it, which I had to change and come up with new steps for as the ones previous stopped working at some point
>doesn't work anymore either, giving different errors
Sigh...
>>
>>106568907
>>spyware UI
Uhh what??
>>
so... has anyone modified a onahole to interact with a LLM yet?
>>
File: 94854 - SoyBooru.png (997 KB, 1534x1382)
997 KB
997 KB PNG
Groksirs when is we getting supports?
https://github.com/ggml-org/llama.cpp/pull/15539
@CUDAdev sir please do the needful Vishnu bless yo
>>
>vllm supports gguf guys!!!
>try loading a gguf
>it just errors out
My ass it's supported. Now I'm downloading some safetensors to try again and see if it's a model issue or my build is just fucked for some reason.
>>
>>106569204
sends your data to jewgle on startup. API nodes, electron have it but they are optional. the manager calls home
>>
>>106568925
>/gig/ on /lmg/
Weird colab
>>
https://www.trendforce.com/news/2025/09/11/news-kioxia-reportedly-eyes-2027-launch-for-nvidia-partnered-ai-ssds-with-100x-speed-boost/
>Kioxia, in partnership with NVIDIA, is developing next-generation SSDs aimed at AI servers, targeting commercialization by 2027 with read speeds nearly 100 times faster than current models, reaching up to 200 million IOPS using PCIe 7.0 technology. These SSDs, designed to partially replace HBM as GPU memory expanders, reflect the growing AI-driven demand in storage, with projections indicating that AI-related NAND will comprise 34% of the global NAND market by 2029, adding $29 billion in total addressable market (TAM). As a result, a U.S. investment firm warns of a potential NAND shortage starting in 2026, exacerbated by increased adoption of QLC eSSDs, Nearline SSDs, and high-bandwidth flash in response to tightening HDD supplies and AI infrastructure needs.
SSDmaxxers, 2027 will be your year!
>>
>>106569299
>Kioxia, in partnership with NVIDIA
doa
>>
>>106569268
When I tried it, I couldn't get a single moe gguf to load. I was expecting it to be slow and unoptimized, but it didn't even load.
>>
>>106569268
just use llama for goofs bro
>>
>>106569268
Support for gguf on vllm is ass anyway
>>
>>106569268
>he expects corposlop pyshit to "just work" without suffering through dependency hell
>>
>>106569268
Ok I just tried loading a small safetensors model and it also failed. Searching the error on the github issues gives 0 hits.
Wtf is wrong with vllm man.
I suppose GPU would probably work fine as I can just download the prebuilt wheels, but the CPU build is not well.

>>106569331
Thanks.
I think CPU AND goof support just simply cannot be expected to be stable on vllm. Let alone GPU + CPU inference which isn't currently supported.

>>106569335
>>106569349
It seems even safetensors don't work on my build kek. They don't truly "support" goofs or CPU either.
>>
you will give me ze best GERMAN low latency TTS or STS model right now!
I'm tired of my models turning into drooling retards when trying to pronounce 25 aka FÜNFUNDZWANZIG!
>fünfffffuuffuhffffzwwffggg 'hick!
Don't make me use jewgle voices...(fuck elevenlabs, they aren't even that good).
>>
File: thedrummer pfp.png (77 KB, 200x200)
77 KB
77 KB PNG
Did Drummer finally troon out?
>>
>>106569357
Thier github is practically useless and it seems like all support happens through their discord.
What error did you get? Try using the V0 engine. They rushed getting V1 out and making it the default while it was still a broken pile of shit missing more than half the features of V0.
>>
>>106569367
Just directly pass your ipa transcription?
>>
>>106569367
VibeVoice
>low latency
oh...
>>
File: 1752012802872567.png (44 KB, 562x479)
44 KB
44 KB PNG
>>106569379
What's with this tech to troon pipeline?
>>
>>106569407
>What's with this tech to troon pipeline?
The terminally online fit into two groups mostly, the mentally ill and assholes. If you are weak to praise and group think the former is where you will stay. If you just want to solve problems you are going to argue try out things fix it come back and then call everyone a dumbass.
>>
How do you guys usually write character prompts? Do you just write a paragraph describing them, or something more structured?
>>
>>106569413
your logic is a self report
anyway, maybe focus less on people, or do you keep your nose firmly buried up everyone's ass
>>
Since I'm an esl I'd like to know if this prosody using heuristics (kokoro) sounds acceptable for americans: https://vocaroo.com/17z5mdm2a0yU
The sample is complex on purpose so I can test a bunch of heuristics at once: "At 11:30 p.m. on Jan. 3, 2024, the project's lead (Dr. O'Neil) announced, "We'll ship v2.1 to the U.S. and EU by Q2," but a $1250.50 shortfall, a 5% processing fee, and three critical API regressions forced the team to triage, reassign tasks, and reconvene Monday—prioritizing user-facing fixes over backend refactors to preserve product quality."
>>
>>106569441
Everyone here just asks GPT5 to write it for them and improve it. Nobody uses local models for roleplay, GPT5 is the current meta.
>>
>>106569471
What isnt a self report? im writing myself my opinion what else am i suppose to right or think
>>
The hell kinda quant method are you supposed to use again? I've seen conflicting reports.
>>
>>106569492
>>106569492
penis
>>
>>106569391
Local or private, show me a conversational STT>TTS method with decent enough latency. Best I found was some hugging face space from the transformers.js dudedev. But it was kinda meh and engerrish only. and it had no dialogue turn system or whatever that interruption mechanic is called. I really cba developing this as I'm more interested in the backend stuff. I'd just use openAI realtime API for prototyping, but fuck me those prices are surreal.
>>
>>106569385
>discord
Ugh.
The error is "TypeError: Cannot instantiate typing.Literal". I guess I could ask my llm about it to see if it possibly has any solutions.
How do I use the V0 engine? I tried the environment variable but it doesn't seem to do anything?
>>
>>106569520
Check this https://github.com/Open-LLM-VTuber/Open-LLM-VTuber it has a barge-in system which is the interruption mechanic you're looking for
>>
>>106569474
>https://vocaroo.com/17z5mdm2a0yU
sounds fine
>>
>>106569357
I was actually fucking with the cpu install just to see if I could give next a little test or two, but I could smell it being a migraine the minute I started running into weird dependency mismatches. I'd honestly rather wait the multiple weeks it'll take to get support in llamacpp only to test it and go "yeah, it's pretty shit for writing" anyway. Shame, because small active parameter models are great for cheap context and being relatively fast off the bat. Even jamba with more active parameters is still pretty snappy if you put enough of it into vram, but sperganov has yet to fix the constant reprocessing on new messages for it, or for a couple of other models where whatever the fuck they coded causes this.
>>
>>106569503
IQ2 is the new meta, really. You will not notice any difference even when using smaller models.
>>
>>106569557
yeah that looks promising, will give it a shot. thanks, pedo.
>>
>>106569553
>How do I use the V0 engine? I tried the environment variable but it doesn't seem to do anything?
Read the output on startup, it should tell you which engine is being used.
>TypeError: Cannot instantiate typing.Literal
See if there's any hints in the stack trace before this part. For me, the only success I had when vllm decided to throw errors was upgrading or downgrading (at the cost of model support) the vllm version. Using the V0 engine solved a lot of trouble for me, but once they hit v10, I gave up on messing with it.
>>
>>106569614
He says as he spins on his heel and then says how q8 kv cache is disastrous for models or something
>>
>>106569630
Yeah I think I'll just stop here if my LLM doesn't solve it. Don't feel like trying out various versions.
And honestly I have a feeling the CPU performance is worse than Llama.cpp's anyway, but it'd be nice to actually confirm.
>>
File: 1746701731814999.jpg (287 KB, 1920x1080)
287 KB
287 KB JPG
>>106569617
break a leg
>>
>>106568925
>>106568645
damn, weren't qwen-image and qwen-image edit supposed to be slopped failures that aren't worth running?
>>
Is there some kind of model that can act as a sort of twitch chat for when you're playing games by yourself? Like you give it a video feed and it reacts in some way. Just so that it's not so lonely.
>>
>>106569817
im gonna use this idea to become a billionaire
thanks
>>
>>106569822
You'd be lucky to make lunch money. The only billionare is the owner of whatever platform streaming platform you use.
>>
>finally found a model with little censoring and pretty comptetent logic
>leaks chatml all over the place half the time
sigh
>>
>>106569817
Having used 30b+ models, it depends. If you start off in a prose setting, then ask it to interject with something like a chat/review section (eg: character reads chat mid-story), it will fuck it up. Off the bat with examples, maybe. As for giving an llm a video feed, I don't think that's feasible at the moment unless you have a shitload of vram or some kind of highly specialized and hand written pipeline
>>
>>106569817
>Just so that it's not so lonely
This is a general for pretend sex with computers and yet this post is one of the most pathetic things I've ever read
>>
File: p5wYpZT.jpg (646 KB, 1920x1080)
646 KB
646 KB JPG
Just became a 128gb ramGOD with 24gb vram. What's the best I can run now?
>>
>>106569856
>ram
>What's the best I can run now
You mean crawl?
>>
>>106569817
for the millionth time, no we cant build screenreading twitch yet, no we dont know how neuro does it and it cannot be done locally for any sane price
>>
>>106569856
Probably glm 4.5 at iq3, with a whopping 9 t/s on empty context
>>
>>106569869
Sorry...
>>
>>106569832
What model and how does it "leak" "chatml" "all" "over" "the" "place"?
>>
>>106569856
qwen235b q4
>>
>>106569856
>128gb ramGOD
sounds you are a channellet with 2 channels at most
>>
>>106569905
Times are changing old man, I could barely fit a llama2 13b at 4k context and now I can run 100b moes and 32b dense models with bare minimum 16k context yet I have not bothered buying new hardware
>>
>>106569817
Could you get a primitive version by sending screenshots through a multimodal model?
>>
>>106569923
If you don't mind minute+ long latency.
>>
>>106569869
>no we dont know how neuro does it
It's still hilarious that some random guy built a better utilization for AI than trillions in VC cash between every major corporation
>>
>>106570004
Did you forget Ani exists?
>>
>>106570004
>It's still hilarious that some random guy built a better utilization for AI than trillions in VC cash between every major corporation
Not really. If you dig into anything you will realize it's a very small group of people actually doing anything at all; sometimes it's just one hyperfocused dude who does nothing but that for years cause of autism.
>>
>>106570013
I wish I could
>>
>>106570013
Someone post the mouth animations
>>
>>106569869
Uhh, techlet?
>stream has a few minutes long delay (this is what most parasites do normally even)
>selected twitch chat entries are redirected to llm
It's not rocket science.
He wrote a backend that controls the character and llm and integrates them together, but I can assure you I could make a demo if I had more interest.
>>
>>106570036
>I can assure I could make a demo if I had more interest.
That means you cant, and no one else has cracked it as good and made it available.
>>
>I could
lol
>>
The new fiscal quarter starts in October. As usual, this will be when companies start pushing out new models to look good.
Two more weeks and the big releases start.
>>
>>106570048
>no one else has cracked it as good and made it available.
What is the incentive to put in that much work just to make it available because you want it? Even if I put in that much effort, I would just make a Neuro knockoff and try to make money off it.
>>
>>106570094
Okay that's fair, but still, if you can clone it and make money, why not? How come none of the 'i made my own neuro' projects are close to his?
>>
My implementation is cool she's just on the Canadian Github
>>
>>106570048
You are just too stupid and/or underage even. Jesus christ, these ERPers shouldn't even be allowed to post in this thread.
>>
I just did a test of GPU performance with vllm and llama.cpp. With Qwen 4B, original BF16 safetensors, I got around 71 t/s on vllm with empty context, and 45 t/s on llama.cpp with a BF16 GGUF. At 8k context, Llama.cpp got 44 t/s, and vllm got 60 t/s. I also tried an F16 GGUF and it was like 2 t/s faster. These results suggest that at least on my system with Qwen 4B on full precision, there is something bottlenecking Llama.cpp. Maybe it'd be closer with a larger parameter model, but I don't have the VRAM to hold the full precision weights.
>>
>>106570054
Problem nigger?
>>
>>106569082
So, a general instruct model lost to a model that was specialized for coding at coding, and that's supposed to be a mark against the general instruct model?
>>
File: 1752714112927008.jpg (32 KB, 540x540)
32 KB
32 KB JPG
>>106569869
Nah you coomers are braindead. There are bazillions of projects like these on github https://github.com/kimjammer/Neuro
>>
File: Gx5J_2VaoAAkfJO.jpg (1.06 MB, 4032x3024)
1.06 MB
1.06 MB JPG
>>106569817
>>106569869
>>106570343
the one that can play games with ai:
https://github.com/moeru-ai/airi
>>
>>106566836
Question about the -ts option in llama.cpp: when picking a split, should I account for the fact that my main (first) GPU will have vram in use from other programs and windows? Or does llama.cpp take that into consideration and balance it properly? Are there any other options that will just split the vram evenly between two cards without having to adjust numbers with -ts? I find myself using such odd -ts number combos to get an almost even vram usage split, I don't know why.

For example, currently -ts 24,15 has it split almost evenly between my cards which makes no sense to me considering my 1st card is using vram for other programs and windows. I just don't like how I have to keep re-loading the model over and over trying different numbers until I find a combo that splits it properly.
>>
What if my computer is over a decade old with no vram? Is local llms the hobby for me.
>>
File: 1747024248873642.jpg (119 KB, 600x450)
119 KB
119 KB JPG
>>106570396
>>
>>106570369
Wait, so it is possible? Why were anons being mean to me? Are they trying to keep this tech all to themselves?
>>
File: 4257584087.jpg (34 KB, 628x480)
34 KB
34 KB JPG
>>106570369
>ElevenLabs voice synthesis
>>
>>106570480
You talked to clueless retards. Very few here know more than edging to chub cards
>>
>>106570480
they're all tsun with little dere here
>>
File: 1746939416534722.png (882 KB, 896x704)
882 KB
882 KB PNG
>>106570488
It's the best and will continue to be the best
>>
>>106570796
China will rip off vibe voice and make it better.
I believe
>>
File: ocr.jpg (503 KB, 1742x926)
503 KB
503 KB JPG
Finally a model that passes this test and it's only 1.7B and open sourced. wild
>>
>>106570867
Didn't it get the 8th character wrong?
>>
>>106570867
the model alone or the whole stack?
>>
>>106570892
well fuck, guess there's always next model
>>
>>106570901
just the model
>>
What is a good model for being negative and critical? I hate how they agree with everything. I want to be told I'm wrong or being an idiot.
>>
For those of you who use the models for anything other than porn, what is the best way to let the model browse the web to search for info?
In my opinion the difference nowadays between proprietary models and local is mostly in the tooling integration rather than the actual models.
>>
File: 1744916497455479.png (126 KB, 812x537)
126 KB
126 KB PNG
>>106570964
Kimi k2 is the GOAT
>>
>>106571077
>Kimi k2
Is kimi k2 local? can you run it?
>>
>>106571090
No but I understand that one or two people here can run it :)
nta btw
>>
>>106571090
yes
>>
>>106571090
It's 1T/30A
>>
>>106571094
>one or two people here can run it :)
I wish i was one of them.
>>
>>106571077
Can it still talk about medical or mental stuff or does it just shut down?
>>
File: 1747381911296038.png (2.23 MB, 1254x1600)
2.23 MB
2.23 MB PNG
>>106571105
post your full medical history and social security number and i'll ask my buddy kimi
>>
any kokoro voice recs?
https://voca.ro/1jAMPLyV0zJA
>>
>>106571216
Bateman is always good.
https://files.catbox.moe/bwv1fc.mp3
>>
>>106571216
Can it do Japanese sex, moaning, and blowjob noises?
If no, it's worthless
>>
>>106571243
VibeVoice can, but no api support yet
>>
>>106571243
>braindead coomer
>>
There arent any coomers here. We are all using this technology safely and to enhance our lives and work abilities.
>>
>>106571070
>what is the best way to let the model browse the web to search for info?
MCP
>>
>>106571337
gooners are the reason AI has advanced so much
a 4chan holo gooner invented chain of thought
>>
>>106571347
>There arent any coomers here
Sorry I was offline for a bit, I'm back now.
>>
>>106571466
show me your coom
>>
KH music came on in my playlist and I remembered the lyrics poster :)
>>
File: 1757153848305659.jpg (234 KB, 998x1321)
234 KB
234 KB JPG
>>106570892
Come on now, let the man rest
>>
File: mikuthreadrecap.jpg (1.15 MB, 1804x2160)
1.15 MB
1.15 MB JPG
I made a Miku for you guys. Feel free to use at your leisure.
>>
Can you guys give me a list of safetymaxxing companies so I know to ignore their model releases?
>>
File: mikulmg.jpg (1.18 MB, 1804x2160)
1.18 MB
1.18 MB JPG
>>106571835
>>
>>106571836
Pretty much everyone else except Mistral and Chinese..
>>
File: mikuholdingsign.png (2.85 MB, 1804x2160)
2.85 MB
2.85 MB PNG
Textless, exploitable version.
>>
>>106571836
All of them
>>
Exploitable transparency version.
Enjoy you are images of official /lmg/ mascot Hatsune Miku!
>>
>>106571835
>>106571849
>>106571853
>>106571856
fuck off, spammer.
>>
stay, cute normal poster
>>
>>106571876
I'm sorry for contributing OC. Really, I am.
I'll go back to enjoying my chat with official /lmg/ roleplaying model Rocinante 1.1, the best roleplaying model made by the best finetuner, TheDrummer!
>>
>>106571835
Cute migu
>>
>>106571916
Yeah I'm happy with how that artist tag blend turned out.
The key is Namori. Namori tag makes anything cute.
>>
>>106569281
fork it and edit out the homing beacon
>>
File: 1739206685129105.webm (1.4 MB, 720x720)
1.4 MB
1.4 MB WEBM
>>106571553
>>
>>106570867
wasn't the mokuro manga thing able to do this already?
>https://github.com/kha-white/mokuro
>>
>>106572018
where are you supposed to get the high quality raws for this though
>>
File: 1757594794929503.png (984 KB, 1024x1512)
984 KB
984 KB PNG
>>106569856
>24
>>
>>106568374
>>106568414
>>106568426
You do know there is still transformers, which has all the model support and is where everything goes first, right? Most of the internet only mentions GGUF because people don't want to waste space downloading the raw model, and they use AWQ for non-fine-grained 4/8 bit inference because most people don't overspend for the amount of compute they get and are running <1-2k USD builds for these models.
>>
>>106568789
Apple didn't win jack shit when it is slower per dollar and harder to use overall for anything <=128 GB of RAM than AMD's Strix Halo. Maybe their matmul implementation in the A19/M5 is worth a shit but I am leaning towards no unless proven otherwise given how shit Apple is at AI.
>>
w y w a
y d h m s
p
o b
d g
>>
File: 1740958196202560.jpg (202 KB, 1252x1080)
202 KB
202 KB JPG
>>106572139
b-but I make up for it with my prompting...
>>
my prompts turns 30B models to 500B behemoths
>>
File: im2.png (54 KB, 594x170)
54 KB
54 KB PNG
>>106570867
Haha, nice to see my image still floating around.
8th character like other people said and also the KA hiragana torwards the end.
Damn 2 years and and they all still struggle.
In 2023 I thought we would have a local gaming buddy by now. That I can have in the background translating games with an overlay.

At least drummer finetunes are good enough for lunatranslator. That works pretty well most of the time.
I remember the old ATLAS translations back in the 00s. kek
>>
>>106570867
It failed though. There is one big error and three small ones.
>>
>>106572459
You're absolutely right anon! It really is a testament to your genius to point this out!
>>
what did they mean by this
>>
>>106572569
>everyone picking up mi50s and v100s despite AMD and Nvidia dropping them in the next major releases of their ML software stacks
I don't get it at all. Even if you had to pay double the price, it is still worth having software support over trying to hack things together after that point and praying the Vulkan backend is super optimized one day so you can keep using your card.
>>
>>106572592
i meant the little green display but yes the gpu choice is also questionable
>>
>>106572592
What could be the reasons for updating your drivers? The last time I assembled my LLM machine was last year and I had to downgrade drivers for better power savings, it still works to this day. The only thing I've heard about these drivers is that they become slower in newer versions, and power efficiency on idle has been broken for more than a year now
>>
And when it comes to AMD drivers, if you find a version that somewhat works, you'd better never touch it again
>>
>>106572601
Oh didn't notice. Yeah, won't comment on that. I still think microATX is way too small to fit cards like that even with blowers but I guess that's why noise is never a factor to consider.
>>106572637
Depends on what card you have. Ada and now Blackwell are still getting performance improvements and fixes. If you locked your hardware stack now especially on Blackwell, you're missing out on Nvidia actually providing value in unbreaking shit, although to be fair, it's shit they broke in the first place. CUDA also does get a bunch of API changes between major releases.
>>106572653
For AMD, you especially want to run nightly ROCm if you can build it yourself.
Of course, that's from a developer/tinkerer standpoint. If you want shit to just work, then okay, you do you in order to keep software stability at all costs.
>>
just don't use AYYMD and you will be happy
>>
>>106570386
Unless someone changed it when I wasn't looking, the default in llama.cpp is to use the GPU with index 0 as "main GPU".
Note that the order in which llama.cpp/ggml receives GPUs is not necessarily consistent with the one reported elsewhere.

>>106572592
Essentially all GPUs you buy are a depreciating asset.
Even if you have to replace them earlier and they end up as e-waste that may have been a better deal than buying and later selling a more expensive GPU.
Though as long as there are drivers I intend to maintain llama.cpp/ggml support for Pascal and Vega (Mi50).
>>
>>106572669
Have you ever experienced a t/s increase after updating nvidia drivers?
>>106572696
People who buy AMD are either desperate enough or in it for the ride. Someone has to finish off that lunatic extra phantasm, you know
>>
>>106570295
>So, a general instruct model lost to a model that was specialized for coding at coding, and that's supposed to be a mark against the general instruct model?
the general instruct usually did better than the previous coder focused model, yes.
Qwen 3 instructs (the general instruct, not coder) are better than 2.5 coder.
A new model being worse than previous model is a sign that the tech is stalling.
>>
What's the best model I can run on my RTX 3060?
I tried Cydonia 22b q5 and Rocinante 12b q8, but im not sure if im using low tier stuff, it's been ages since I last used ai chatbots
>>
>>106572794
>Cydonia
>Rocinante
finetroon users are a lost cause
>>
>>106572883
>mikutroon opinion
discarded

>>106572794
theres a newer cydonia and valkyrie
https://huggingface.co/TheDrummer/Cydonia-24B-v4.1-GGUF

https://huggingface.co/TheDrummer/Valkyrie-49B-v2-GGUF
>>
>>106568645
...Will I have to take the comfy troon pill?
>>
>>106568659
>mlx-community/Qwen3-Next-80B-A3B-Instruct-8bit/blob/main/config.json
> "group_size": 64,
dumbasses
>>
>tr**n
>tr**n
>tr**n
>tr**n
obesed!
>>
Mistral
Large
3
>>
>>106573036
neber ever
>>
>>106572948
I know the default is 128, but I wonder why they changed that
>>
>>106573045
>128
at that point it's literally retarded, you're supposed to quant it to 32gs. lower is better
>>
>>106573036
ugh i need it so much
>>
Update on Qwen3-next goofs?
>>
>>106573151
2 weeks more
https://github.com/ggml-org/llama.cpp/issues/15940#issuecomment-3286596522
>>
File: 1733246660561543.jpg (558 KB, 1600x2400)
558 KB
558 KB JPG
>>106572714
>Essentially all GPUs you buy are a depreciating asset.
My used 3090 I bought 3 years ago is worth about 30% more now than when I bought it
>>
>>106573171
sir no
>This is a massive task, likely 2-3 months of full-time work for a highly specialized engineer.
this no goods
>>
>>106573171
>MXFP4
>Successfully quantized the Qwen/Qwen3-Next-80B-A3B-Instruct model to the MXFP4 format, with expert layers quantized to MXFP4 and other layers retaining their original precision. The model size has been reduced from 159GB to 45GB.
seems like sama's shitty model was useful after all
>>
File: pardon.jpg (45 KB, 637x518)
45 KB
45 KB JPG
>>106573151
>run a prompt with 0 temperature
>get result
>run it again with 0 temperature
>slightly different result
Why is it so fucking hard to make rounding robust in GPU calculations, are they stupid?
>>
>>106573199
paging dr cuda dev, drop what you're doing and get on it
>>
>>106573224
>seems like sama's shitty model was useful after all
You can say that once you run benchmarks comparing mxfp4 to other q4 quants on that model.
>>
>>106573226
>He didn't fix the seed
>>
>>106573226
result from the second one on should be the same
>>
>>106573036
What's the point? Dense lost
>>
local is getting more relevant and popular because, spoiler alert, all these cloud NIGGERS are serving comped and quantmaxxed garbage!
>>
>>106573226
top-k 1
>>
>>106573271
>all these cloud NIGGERS are serving comped and quantmaxxed garbage
Good.
>>
>>106573271
we're in peak race to bottoms phase
>>
File: 1729161978418371.jpg (607 KB, 1080x1920)
607 KB
607 KB JPG
>>106573292
This is the only bottom I'm racing to
>>
>>106573324
we must get behind this
>>
>>106573270
You thought it was going to be dense?
>>
>>106573236
I seeded your mother
>>106573260
>>106573279
I know what the problem is, I literally wrote it in my post. It depends on the order the calculations are made in, as they introduce rounding errors. In real life a + (b + c) = (a + b) + c, but not on GPU. On GPU it will give you two different results and the errors stack up until they flip a token, and from there it's over.

I'm asking if it's fixable in a reasonable manner by people writing inference engines.
>>
>>106573379
cuda said it wasn't worth its time iirc
>>
Whats the point of doing all this?
Why even have private LLM?
>>
>>106573379
temp 0 only weights the tokens. Sampling still happens. top-k 1 picks the first one always, removing the chance for other samplers to interfere. The first token won't change.
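
Toy version of that chain, not any engine's actual sampler code (numpy, made-up logits), just to show why top-k 1 is the hard greedy switch while temperature only reshapes the distribution:

import numpy as np

def pick_token(logits, temperature=1.0, top_k=0):
    if top_k == 1:
        return int(np.argmax(logits))           # pure greedy, other samplers never get a say
    z = logits / max(temperature, 1e-8)         # temp 0 would divide by zero; engines special-case it
    if top_k > 0:
        cutoff = np.sort(z)[-top_k]
        z = np.where(z >= cutoff, z, -np.inf)   # drop everything outside the top k
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(np.random.choice(len(logits), p=p))

logits = np.array([2.0, 1.9, 0.5])
print(pick_token(logits, top_k=1))              # always 0
print(pick_token(logits, temperature=0.7))      # usually 0, sometimes 1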
>>
>>106573423
>private LLM
The clue is in the name.
>>
>>106573425
anon...
>>
>>106573435
Check your probs for all the tokens you generate. top-k doesn't have that problem.
>>
>>106573423
some people like owning things.
>>
>>106573441
Obviously I have deterministic samplers. It still flips tokens for the reason I explained above. top-k 1 won't do shit if the top token changes between generations.
>>
>>106573447
that sounds awful
>>
>>106573469
it really is. it is much more exciting to gamble on whether my workflow will continue working when the corpos 'update' the models.
>>
>>106573467 (me)
To those who are still confused:
>CPU does the calculations sequentially, GPU splits operations and does them in parallel
>the order in which the calculations are made is more or less random
>because it's random, you don't control how rounding errors are introduced
>(a+b) + c is not equal to a + (b+c), meaning you will get different results depending on GPU whims
>this gives us micro errors that can sometimes flip top tokens
>this give us micro errors that can sometimes flip top tokens

You can see the numerical errors I'm talking about in action if you run this:

a = 1.0
b = 0.00000001
c = 0.00000001

result_1 = (a + b) + c
result_2 = a + (b + c)

print(result_1)
print(result_2)

I still think cuda dev should make an option to force sequential operations in the necessary places to make results reproducible for the sake of having a baseline for experimentation.
>>
Large Migu's Galore?
>>
>>106573535
I prefer them small
>>
File: uoh.jpg (62 KB, 680x680)
62 KB
62 KB JPG
>>106573551
>>
>>106573324
PANTYHOSE
>>
>>106573551
Obviously, small things are good
>>
>>106573519
Reproductibility will murder your t/s. Just read retard https://docs.pytorch.org/docs/stable/notes/randomness.html
>>
>>106573610
>Reproductibility will murder your t/s
I'm aware of that
>>
>>106569281
>not using a firewall
ngmi
>>
>>106569281
>sends your data
It just pings an IP. Do you know what that means? Of course you don't, retard.
>>
>>106573519
The ggml CUDA backend is deterministic, All operations are always done in the exact same order.
However, when you use prompt caching this is done by re-running the model evaluation for the last token of the prompt only.
As a consequence the KV cache/logits of the first token are slightly different and you can get different results.
For 100% reproducible results with prompt caching one would have to implement caching the prompt only in chunks of the physical batch size.
>>
>>106569281
link the line of code that does those things
>>
>>106572734
>>106570295
qwen3 coder 30b is a3b
stupid niggers!
>>
>>106573185
i can buy 3090 for 470 euro in my country
>>
>>106573977
I can get an RTX 6000 for half that
>>
>>106573982
fine bro ill post the site i dont care if anons buy all 3090s in my country i wont be buying them anytime soon anyways
https://www.kupujemprodajem.com/kompjuteri-desktop/graficke-kartice/inno3d-ichill-rtx-3090-rtx3090/oglas/181677213?filterId=7125152428
https://www.kupujemprodajem.com/kompjuteri-desktop/graficke-kartice/rtx-3090-3080-3070-3060-1070-rx6900-6700-5700-580/oglas/141810340?filterId=7125152428
>>
File: Untitled.png (93 KB, 961x88)
93 KB
93 KB PNG
Trying windows 11... Why does it just instantly quit? I have no idea what's wrong because it tells me nothing.
>>
>>106573977
1.6k aud lmao
>>
>>106574004
try -v for verbose output maybe.
>>
File: Untitled.png (91 KB, 961x91)
91 KB
91 KB PNG
>>106574016
>>
>>106574029
maybe try renaming the model so it doesn't have spaces and dashes and shit?
>>
>>106573171
>I asked GPT5 Codex to get a view of the work to be done, it's monstrous...
this is why "social media" style coding (github) is cancer
randos should not be allowed to post on issues or do PR
much less randos who are just copy pasting shit in chatgpt and pasting back the answer
>>
the silence is deafening
>>
File: Untitled.png (58 KB, 960x69)
58 KB
58 KB PNG
>>106574045
>>
>>106574080
maybe install linux?
>>
>>106574080
your shit's right fucked mate
>>
i need new sex. is qwen next good?
>>
>>106574080
I've had the same problem back when I was on windows and the only solution was to compile the binary myself. The one from github ci just refused to work with my drivers.
>>
File: 1754300617278694.gif (1023 KB, 497x352)
1023 KB
1023 KB GIF
>>106574090
>a3b
>>
>>106574080
wow, okay, I'm all out of ideas. I suppose you could try building it from source or downloading a different version.
>>
>>106574080
You do have CUDA 12.4 installed, right?
>>
>>106574071
That's pretty much the same everywhere on the internet. Post a mod on Nexus for example, and these weird retarded posts immediately come out of nowhere... Why should other people's work be open to comments from strangers anyway unless they are actually part of the team. Pure demoralising cancer.
>>
>>106574132
You do not need to install cuda on windows.
cudart-llama-bin-win-cuda-12.4-x64.zip provides the necessary runtime files and is distributed on the github releases alongside the binaries for the CUDA version of lamer.cpp
>>
File: Untitled.png (43 KB, 982x50)
43 KB
43 KB PNG
>>106574089
But why? Shouldn't it at least inform me if something's wrong? What kind of code just instantly exits without anything, even with a verbose flag?

>>106574101
I went with kobold cpp...

It's still around 15 tk/s. Seems like it's a windows issue with my hardware? >>106574084 On linux I get over 50tk/s, but I'm not a linux user, and I don't want to have to switch between operating systems every time I want to ask an AI something.

A few threads ago I thought it was maybe windows 10... but windows 11 is also fucked.

I also installed SystemPanic/vllm-windows, but that had a problem with pyzmq, and I couldn't get it to run with multiple gpus. Single gpu works fine, but I never had a problem with the single gpu performance in the first place.

>>106574132
Yeah, cuda 12.4 and driver version 552. Also tested with 12.8, and driver version 571.
>>
>>106574097
I am tired of people hating on low A count. Low A count is the future. Attention and context on GPU and a fuckswarm of fucksmall experts on CPU is the future. It is all a training problem and I am not training so corpo nerds have to solve it so I can jerk off in peace.
>>
>>106574196
Big models with low active parameters are useless outside of benchmaxxing, a model doesn't need intelligence if it's just pulling complete answers from memory.
>>
>>106574196
Anything less than 30B active is too retarded for anything but trivia recall and the simplest of tasks.
>>
>>106574177
windows just isnt for ai, whats
>>
>>106566836
Qwen3-Next or GLM Air for storytelling and roleplay?
>>
>>106574225
cuda dev gets 80 tk/s on triple 4090s on his windows though
>>
>>106574229
Nemo
>>
>>106574208
Agreed. I love that model I use to ERP that never pulls complete answers from memory. I forgot the name of it though....
>>
File: 1740360483705753.jpg (46 KB, 980x540)
46 KB
46 KB JPG
>>106574247
>>
>>106574292
Go fuck yourself straight to reddit.
>>
>>106574320
t. infiltrator ledditor
>>
>>106574229
probably glm. qwen models are really into that thing where they go "It's not X. It's Y."
>>
>>106571849
I like this Miku
>>
>>106574292
I like this Miku
>>
Is there any point to LLMs when free llm with better data like deepseek and aistudio exist? It feels like I wasted money on the P40 I got
>>
>>106574618
You should have thought of that before buying it
>>
>>106574618
no, you have discovered the hole in the system
you have just fully invalidated all of /lmg/ and the very existence of this general
>>
>>106574618
It's yours, it's offline, and importantly for workflows, it doesn't change. You won't get random drops in performance because the company wants to tweak or save money.
>>
File: 1732400855213371.png (30 KB, 946x238)
30 KB
30 KB PNG
Look what anniversary is coming up in just under two weeks. This is when they will drop Large 3. It's the perfect occasion.
>>
>>106574685
>its yours
This means if something goes wrong you can shift blame to the provider
>its offline
This means you have to pay for cost of hosting and ensuring availability
>it doesnt change
This means you will never be the first to get the new features and fixes

All the positives about local are viewed as negatives by apifags
>>
>>106574738
the only anniversary I will celebrate here is the one that will celebrate their death anniversary when funding dries out
>>
>>106574738
Insider here: Mistral is being refactored. This means safety factoring them, but also two new models - Mistral Small 4.0 and Large 4.0.
>>
I just had an idea (that I won't do myself): what if you run one pass through the chat with your model without continuing, just to use the attentions to find out which messages/paragraphs are more important right now, and then turn your entire chat history into RAG for the second pass?
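
A toy sketch of what that could look like with transformers (everything here is an arbitrary choice of mine: model_id, chunking by message, averaging over layers/heads, keeping the top 4; and the full attention matrices get huge for long chats):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-4B"  # placeholder; any HF causal LM
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="eager")  # eager so attentions are returned

messages = ["msg 1 ...", "msg 2 ...", "msg 3 ..."]  # chat history, already split into chunks
chunks = [tok(m, return_tensors="pt").input_ids[0] for m in messages]
ids = torch.cat(chunks).unsqueeze(0)

# pass 1: just read the attentions, don't generate anything
with torch.no_grad():
    out = model(ids, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer; average over layers/heads,
# then take the last row = how much the newest token attends to every earlier position
att = torch.stack(out.attentions).mean(dim=(0, 2))[0, -1]

scores, start = [], 0
for c in chunks:
    scores.append(att[start:start + len(c)].sum().item())
    start += len(c)

keep = sorted(sorted(range(len(messages)), key=lambda i: scores[i], reverse=True)[:4])
context = "\n".join(messages[i] for i in keep)  # pass 2: prompt with only the "important" chunks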
>>
>>106574786
Last news was that they might be bought out by Apple. Presumably because their attempt to relocate themselves to California failed.
>>
>>106574786
Never trust a frenchy.
Never trust anyone else.
>>
I believe anon,
but not Anon
>>
Why local when API?
>>
>>106574799
Last news is this:
https://www.asml.com/en/news/press-releases/2025/asml-mistral-ai-enter-strategic-partnership

> ASML, Mistral AI enter strategic partnership
>
> Companies agree on long-term collaboration deal to benefit ASML customers
> ASML to lead Mistral AI’s Series C funding round investing €1.3 billion
>>
>>106573379
It's impossible.
>>
>>106574685
so it might be useful in the future when the companies decide to charge us 1000 dollars per month but not now? got it
>>
>>106574819
100% misuse of corporate assets from a French (ASML's CEO is a french faggot) to support another French in what is likely to be corruption.
What a way to throw away a billion.
>>
>>106574857
they're just putting in a quick dirty $1.2b so that they can make apple pay them $3b in four months when they finally decide to buy up mistral
>>
>>106574864
https://www.devdiscourse.com/article/politics/3200518-former-french-finance-minister-le-maire-joins-asml-as-adviser
>Former French Finance Minister Le Maire Joins ASML as Adviser
No, you are wrong and as a French I can guarantee this is yet another affair of corruption on our part
Not only a French CEO but also one of the biggest pieces of shit in recent political history is taking part in this mess.
The only thing that saves ASML is that they have a monopoly on EUV; otherwise the rot that is currently beginning to eat them at the core is the sort that can kill a corporation.
>>
For a full private waifu stack at good speed you need 6 RTX Pro 6000 and some other 24+GB GPU, right? Four to run GLM-4.5, two for GLM-4.5V for vision and the other GPU for VibeVoice 7B.
I hope 29GB are enough for 128k context.
>5120*2*92*128000 = 112 GB
Fuck. So another RTX Pro 6000, two to be safe, and hope that TP=5 or TP=6 work well.
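
Spelling that line out as a quick helper (the 5120 width, 92 layers and 1 byte per element are the assumptions baked into the post's formula, not checked against GLM-4.5's actual config; a 16-bit cache would double it):

def kv_cache_gib(kv_width=5120, layers=92, ctx=128_000, bytes_per_elem=1):
    # 2x for K and V, one entry per layer per token
    return kv_width * 2 * layers * ctx * bytes_per_elem / 1024**3

print(kv_cache_gib())                     # ~112 GiB, the number above
print(kv_cache_gib(bytes_per_elem=2))     # ~225 GiB with a 16-bit cache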
>>
>>106574919
>Four to run GLM-4.5, two for GLM-4.5V for vision
Why not just use GLM-4.5V for both text and vision? 4 RTX Pro 6000 is not worth running GLM-4.5 over Air/V especially when your usecase is waifu talk.
>>
>>106574940
Because the full run is much better and within reach.
>>
>>106574900
What did you expect from Macron's government? It's already like this for several industries. Mistral got gov gibs too so they're cashing out taxpayers' money.
t. french
>>
>>106575020
isn't all tax money going to boomers in france
>>
how do you run glm4.5v?
>>
>>106575123
very carefully
>>
>>106575131
but theres no mmproj :( like I thought I could run it with glm air since it's based on it :(
>>
>>106575131
Because she's fat and obese.
>>
Ling
>https://huggingface.co/inclusionAI/Ling-mini-2.0
Ring
>https://huggingface.co/inclusionAI/Ring-mini-2.0
16B 1.4A
Here's hoping it's somehow at least as good as Qwen 30BA3.
>>
>>106570867
Wanted a quick sanity check on my memory here so I ran this on Gemma 3 27B Q8 with BF16 mmproj.
I don't know Japanese. I see one definite error near the end. Are the others errors?
>>
>>106575202
>>106575202
>>106575202
>>
>>106575144
So rude.
>>
>>106575181
Impressive, it actually got the kanji all other models fail at.
>>
https://youtu.be/7Jzjl3eWMA0?t=117
Women raping gay billionaire werewolf writers sounds unsafe. But their fucked up fetishes are somehow safe. I hate this world.


