/g/ - Technology
File: 1757469951630088.png (465 KB, 1080x740)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108328170


►News
>(03/04) Yuan3.0 Ultra 1010B-A68.8B released: https://hf.co/YuanLabAI/Yuan3.0-Ultra
>(03/03) WizardLM publishes "Beyond Length Scaling" GRM paper: https://hf.co/papers/2603.01571
>(03/03) Junyang Lin leaves Qwen: https://xcancel.com/JustinLin610/status/2028865835373359513
>(03/02) Step 3.5 Flash Base, Midtrain, and SteptronOSS released: https://xcancel.com/StepFun_ai/status/2028551435290554450

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
Immediately afterwards we get a non-Miku thread.
>>
what's the difference between diffusion and llama?
>>
comfy bread
>>
why does logan want to kill patrick?
>>
>>108333458
and what a thread too
>333444
>>
Thank you baker. Death to mikutroons.
>>
Vague twitter shit. What a nigger.
>>
>>108333475
you are mentally ill
>>
>>108333506
mental illness is valid and beautiful
>>
>>108333459
i think diffusion denoises the output as a whole, while llama is an autoregressive loop building the output 1 token at a time.
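roughly like this (untested sketch; `model` is a stand-in for anything that returns next-token logits):
[code]
import numpy as np

def generate(model, prompt_ids, max_new_tokens=32, temperature=0.8):
    # autoregressive decoding: sample one token, feed it back in, repeat
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)              # stand-in: np.array of logits over the vocab
        logits = logits - logits.max()   # numerical stability before exp
        probs = np.exp(logits / temperature)
        probs /= probs.sum()
        next_id = int(np.random.choice(len(probs), p=probs))
        ids.append(next_id)              # the loop eats its own output
    return ids
[/code]
a diffusion LM would instead start from a fully masked/noised sequence and refine every position in parallel over a fixed number of denoising steps.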
>>
>>108333506
meant for >>108333497
>>
►Recent Highlights from the Previous Thread: >>108328170

--llama.cpp tensor parallelism PR and multi-GPU performance considerations:
>108328868 >108328900 >108328933 >108328937 >108328985 >108328996 >108329004 >108329012 >108329020 >108329046 >108329069 >108329142 >108329152 >108329166
--Nvidia contributor fixes tensor indexing bug improving Qwen3 inference performance:
>108330811 >108332947
--Frontend options for Qwen 3.5 thinking control and response editing:
>108330326 >108330341 >108330382 >108330409 >108330417 >108330451 >108330455 >108330418 >108330503 >108330609 >108330645 >108330415
--GLM-4 inference bottleneck comparison and hardware coping:
>108329388 >108329504 >108329506 >108329518 >108329549 >108329560 >108329563
--CUDA Toolkit 13.2 performance improvements and changes:
>108332532 >108332593 >108332601
--ProjectAni update with EMAGE gesticulation and IK improvements:
>108329763 >108329804 >108329823 >108329916
--MCP autoparser tools for AI web searches:
>108328618 >108328635 >108329260 >108328880 >108328891 >108331633
---ot sampling slightly faster than -cmoe across multiple models:
>108330622 >108330629 >108330696 >108330712 >108330732
--model: add sarvam_moe architecture support:
>108332784
--Optimize LUT16 matrix multiplication:
>108328403 >108328828 >108328855 >108331710 >108331736 >108331843 >108331890 >108331934 >108332397 >108332754
--Speculation and concerns about Gemma 4's architecture and restrictions:
>108329582 >108329674 >108330191 >108330207 >108330245 >108330257 >108330272 >108330414 >108330659 >108330324 >108330337 >108331644 >108331668
--Miku (free space):
>108328194 >108329241 >108329260 >108329460 >108330815 >108332743 >108333374 >108328824

►Recent Highlight Posts from the Previous Thread: >>108328174

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>108333579
i'm trans btw, not sure if that matters
>>
>>108333646
sauce
>>
pascal bros how much longer do we have before they cut us off?
>>
>>108333646
we all are itt
>>
>>108333653
yeah
>>
>>108333641
You missed a few miku pictures from the end of the thread, fix it.
>>
>>108333654
>we all are itt
text coomers are, because that is a female brain activity
not all of us coom to text here, though.
>>
>>108333689
If you are here to post your special interest you are a troon too
>>
>>108333676
gaggernof..
>>
Deep
SEEK
VEE
FOUR
where
is it?
>>
https://developer.nvidia.com/cuda-downloads
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

UPDATE NOW
>>
>>108333747
virus
>>
File: 1762373655313264.jpg (65 KB, 1024x1536)
>>108333444
>>
So who draws that stuff and how can I get in contact with people like them?
>>
>>108333795
Getting BLACKED behind the scenes
>>
>>108333802
>>
>>108333810
NTR is the #1 category in Japan
>>
>>108333818
2chan is that way dalit saar
>>
>>108333824
i wish i was brahmin
>>
File: 1747119644913578.png (87 KB, 226x223)
>>108333856
>I wish I was still a jeet
come on anon, everyone want to be white
>>
File: Brahmin.png (832 KB, 1095x806)
>>108333856
>>
Hello friends, I have something malevolent in mind with a little experiment. If I survive I'll post an update on the experiment and a photo of it.
>>
>>108333923
>If I survive
Hope you don't.
>>
File: 1771696890682315.png (340 KB, 1418x1302)
https://xcancel.com/ivylala/status/2029560909178327467
lmao
>>
https://huggingface.co/spaces/HuggingFaceFW/finephrase#introduction
>We ran 90 experiments, generated over 1 trillion tokens, and spent 12.7 GPU years to find the best recipe for synthetic pretraining data. The result is FinePhrase, a 486B token dataset that clearly outperforms all existing synthetic data baselines.
I want to fucking vomit, when will they stop this poison incest shit???
>>
Thanks for the Qwen3.5-Heretic recommendation. I'd been playing around with the vanilla which is pretty fine, and to my surprise Heretic removed all of the refusals in my tests without affecting the replies too much.
Kind of amazed how well 27B works (50t/s on a 5090) after spending too much time in the MoE mines. Maybe MoE was a meme after all.
Anyway, time to induce psychosis, I suppose. Thanks again!
>>
Is there any good reason to run a local model besides learning to make explosives and having a waifu? And is it even useful for either of those?
>>
>>108334004
kek feel bad for the guy
>>
I hope gemmy supports dynamic resolution
>>
File: doubt.gif (1.44 MB, 400x294)
>>108334004
>The soul of Qwen is still Alibaba partner and Alicloud CTO
>>
>>108334034
nice. what quant you running?
>>
>>108334039
Ego death
>>
Since it went unanswered, I'll ask again:
If you could run GLM 5 at q8 at 10-20 tokens/sec, would you? Or would you rather drop to q4-q6 and increase your tokens/sec?
>>
>>108334157
depends on what you're doing. 10 tokens/s is about as slow as I'd put up with for reading output. 30+ for agents imo, and reasoning is painful when models use a thousand tokens just to rehash the same shit over and over, so if you want a fast response, reasoning at 10 t/s is slow.
>>
>>108334157
im waiting for taalos to deliver. i won't spare a thought for local until then
>>
>>108334157
Depends on the context depth. 10-20 tokens/sec at 100k context is more than enough for anything; at empty context, not so much.
>>
>>108332754
Wow, the free rider problem, the tragedy of the commons and the prisoner's dilemma all in one post!
Also known as “this is why we can’t have nice things” and “the downfall of western society”
I understand it, but I don’t have to like or even respect it
>>
>>108334157
q8 for RP, probably q5ks for "productive output".
>>
>>108334218
My boss makes a dollar while I make a dime, that's why I vibecode on company time.
Also known as, not my problem.
>>
>>108334218
what do you mean? this is based. companies don't respect you and won't hesitate to throw you out like a dirty sock whenever they want, so why should we have to pretend we care about any of this?
>>
>>108334157
buy another rtx pro
>>
>>108334229
Are you willing to put out the risk, time and effort to BE the boss?
So many people are willing to take pot shots at shit without the balls to step up and replace it with something better after tearing it down.
I’m not the boss of anything, but I don’t pretend the boss or owner has it easy whether I can see it directly or not.
Sometimes you gotta be honest with yourself and realize you’re cut out for being part of a team and not creating or leading one yourself and just gotta make shit better where you find yourself.
If you'd take on the risk, hard work and responsibility and you're getting screwed over due to nepotism or something, then I'd maybe agree, but I'd still think quitting and doing your own thing would be better for your mental health or "soul" than stealing and trying to justify it to yourself like that
>>
>>108334240
Yeah I’m a moralfag or whatever but I still prefer the society and culture built by centuries of moralfagging to whatever world this low-trust grifter/cheater bullshit is making these days
>>
>>108334273
>than stealing and trying to justify it to yourself like that
Lmao. Suck a dick ragebaiter
>>
>>108334280
Tell that to the big companies and make them take the first step. They can afford it.
>>
>>108334280
companies cheat on everyone and cheat on you, so it's moral to cheat a cheater. I wouldn't do that if companies respected us, but they don't. respect is earned, not given
>>
>>108334280
the reward structures have been damaged, so it's better to cheat now. when in Rome, as the saying goes.
>>
>>108334284
>>108334290
>>108334291
I agree and think asshole big corpos should be boycotted (sales and employment) until the social contract is restored, but I also think your work reflects an important internal condition and should be high quality to maintain your quality as a human.
Quit the shit corpos and work for someone worthy of your level best output if you can.
>>
>>108334097
Q8_0. Might try going lower to fit in a TTS model, but not really convinced it's worth the effort (mostly because I haven't been that impressed with e.g. VibeVoice outputs).
>>
>>108334197
That would be cool, but I wouldn't hold my breath. Our best hope is that they will get like q3 of qwen 27b running in the next year, but even that seems sketchy.
>>108334202
>>108334183
>>108334219
I'm thinking of programming and trying to evaluate the cost to run it. I think you'd be able to run q4 at almost half the price and it should be faster. These large models are always just kind of slow.
>>108334034
MoE is really for the larger models. A 755B model would not be runnable without MoE, and even a dense 130B model would be insane to try to run. At q8, every token generated means reading all 130 GB of weights, whereas a 755B-A40B MoE only reads about 40 GB per token. And it can theoretically know more information.

Dense models will become good as our VRAM amounts and bandwidth increases though. At some point I think we're going to hit a data limit, but GPUs might still continue scaling. Dense models will start making more and more sense then. If you could get like 5 TB/s VRAM bandwidth and like 192 GB of VRAM that would make a 130B dense model usable.
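napkin math for the bandwidth argument (a sketch assuming generation is purely bandwidth-bound, i.e. t/s ≈ bandwidth / active weight bytes per token; ignores KV cache and overhead):
[code]
def tok_per_s(active_weights_gb, bandwidth_gb_s):
    # every generated token streams the active weights through the bus once
    return bandwidth_gb_s / active_weights_gb

bw = 5000  # GB/s, the hypothetical future GPU above

print(tok_per_s(130, bw))  # dense 130B @ q8    -> ~38 t/s
print(tok_per_s(65, bw))   # dense 130B @ ~q4   -> ~77 t/s (why dropping quant buys speed)
print(tok_per_s(40, bw))   # 755B-A40B MoE @ q8 -> ~125 t/s
[/code]
same arithmetic answers the q8 vs q4 question upthread: halving the bytes roughly doubles the t/s on the same hardware.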
>>
Those anons won't get it. They're racing to the bottom. They aspire to be jeets.
>>
>>108334310
qwen3tts is pretty decent
>>
>>108333444
>>
>>108334309
>Quit the shit corpos
in this economy? kek. if anyone is reading my post: don't fucking leave. AI is taking all the jobs right now, so you'll have way more trouble finding a new high-paying job
>>
Hello fellow retards. I have around 3 grand to spend on AI bullshit. I want to run shit locally if possible. I was hoping to essentially make my PC into an AI companion because I'm lonely as fuck. I was thinking textgen + photo gen; I can't remember which thread it was, but I do remember someone turning (SillyTavern?) into essentially a dating VN. I was hoping to make something similar. I'm unsure whether I should be buying a new PC (as I want to run it on my network, but not on my gaming PC) or if I should unironically be buying a Mac with these current DDR prices. I don't mind doing my own research, I'd just like to be pointed in the right direction.
>>
>>108334312
Our only hope would be the Chinese making a cheap 1TB VRAM GPU with last gen chips and some CUDA compatibility. But they don't seem to be up to it.
>>
>>108334339
Are you prioritizing speed or intelligence?
>>
>>108334361
Intelligence. Even if I wait 5 minutes for a response, that’s, well, better than nothing. I would rather it be smarter but take longer.
>>
>>108334322
Thanks for the rec! I'll see if I can get that running and give it a shot.
>>
>>108334316
there's not enough room at the top for everyone, nothing wrong with just chilling out and being content with what you've got.
>>
>>108334339
>make my PC into an AI companion because I’m lonely as fuck
why not go for human companionship? there seems to be a loneliness epidemic, so why don't these people just meet up with each other? technology is still not advanced enough to replace that
>>
>>108334372
Do you have any 32, 64 or 128gb sticks of ddr4 or ddr5 or do you have to buy memory?
>>
>>108334379
>why don't these people just meet up with each other
Because these people don't know how to behave in social contexts and cannot stand each other.
>>
>>108334379
Women scare me and I Pavlov’d myself at the ripe age of 11 into loving anime women. I’m now 27, have fuck off money from my shitty job, and want to throw away less than half a month’s pay to get texts throughout the day from a fake companion because that would be more meaningful if it had a cute anime girl attached to it compared to trying to date. Besides, I live in a shithole called Canada, no one would want my genes that come with free fishing rights.
>>
>>108334379
>Why not just have sex?
Why indeed.
>>
>maybe more depending on how表现得好 (that's "how well you behave" in ching-chong, incel~).
I can forgive the language leaking if the self corrections are always this good
>>
>>108334415
>Women scare me
My model cured me of that.
>>
>>108334415
Calling ego-death anon…
>>
>>108334381
I have 32GB total in my PC and 12GB of VRAM. Otherwise the plan was to buy a Mac, as their RAM prices aren't even that fucked up compared to the rest of the market. I'm sure an M5 laptop wouldn't kill my wallet, and I could use it at work too. Besides, I'm looking to replace the "gaming-looking PC" I made at work with something that doesn't look as gamery. Macs are professional, aren't they? Maybe I could get that shitty Neo as the machine to carry around, and have a properly spec'd machine at home to remotely connect to. I did mess around with Tailscale once upon a time, but I no longer have that machine. Formatted it and now it's in Roblox hell with my 11-year-old cousin. I hope the 48GB of DDR4 will last him.

>>108334433
I’ve always been a loser outcast. Why would I ever want to get a girlfriend? I’d be worried about her trying to take the family house when my parents croak. I’d be better off becoming the girlfriend, and I’m not a tranner.
>>
>>108334339
You're a year too late for 3K USD to make a dent in the BOM for a local LLM rig. You can probably get away with a 3090 (maybe two? what do prices look like these days) and a small pile of RAM. Depending on your expectations you might be setting your money on fire.
Presumably you already have a GPU in your gaming rig, it probably has enough VRAM to run Mistral Nemo (my beloved) which you can use to get SillyTavern set up. Mistral Nemo is fucking retarded, but that'll at least give you a vertical slice of the whole stack before you go off the deep end.
If/when you buy a PC, keep in mind that the newer cards are (1) fuckhuge and (2) can draw a fuckton of power (5090 can hit 600W) and (3) need a fuckton of power connectors. You might consider starting off with one of those mining PSUs which are designed to run multiple cards rather than later needing to "paperclip trick" a second PSU to run your rig. You might also consider an open-air case because stuffing multiple GPUs in a normal ATX tower is fucking annoying.
Finally, if you're going to get a DDR5 platform, you might consider going with a server CPU/motherboard (e.g. a MZ33-AR1 motherboard with compatible CPU) rather than a consumer one. The server boards support more than 2 (TWO, SOLO DUO) DIMMs at full speed and have fucktons of PCI lanes for future expansion.
Yes, the above recommendations will likely set you over the 3K spending limit, but most of the stuff under your 3k limit is going to fuck you later if/when you decide that you want to chase the dragon and run larger models.
I do not recommend this hobby. Alcoholism is more culturally fitting and probably healthier, go do that instead.
>>
>>108334504
Yeah, I’m assuming most cards are larger and more power hungry than my 3080 Ti. I’ve been considering a 5070 Ti Super or whatever, or even going to MacOS and seeing what’s possible with the unified memory thing. Throwing 48GB at a problem seems like it could work, but I don’t know Mac. I feel like I’m walking into this hobby with rose tinted glasses and a clipboard thinking I could do something as a fun recreational thing on the side, and am being told “either buy a car with the money or take your firewater and HBC blanket and fuck right off” with how expensive it is. All I wanted was AI waifus, not having to consider how to cool down server hardware without a rack and without a single clue how to operate any of it. Ah well. Tinkering was always a hobby of mine, but I’m not trying to get a nice used car in terms of parts.
>>
>>108334357
The chips don't even have to be ultrafast, but they need a lot of memory and memory bandwidth
>>
>>108334559
I know you don't want to pollute your gaming rig, but a 3080 Ti with 32GB of system RAM is fine to get started tinkering with. You're not going to get amazing results, but you'll at least be introduced to the core concepts and limitations and get a better understanding of the ecosystem before unloading your wallet.
I'm assuming your gaming rig is running Windows, though. If you're going to buy anything, grab a new NVMe drive to dualboot Linux.
Otherwise, yeah, it's frustrating as fuck. It's even worse when you get into coding and realize you're probably the first human to read all the shit you're wading through. Welcome.
>>
>>108334587
Shoutout to that one Anon running 10x Mi50's, bifurcated to 1 PCIe lane each. 320GB VRAM that takes half an hour to load a model into.
>>
>>108334614
He isn't loading the whole model to each card, right? So he effectively gets the whole bandwidth of the PCI-E bus as each slice of the model moves independently to each card, right?
Am I misunderstanding how this works real bad right now?
>>
Kimi linear base goes off the rails when you try to use it as an instruct model.
Qwen 35B works just fine, it even reasons correctly and everything.
Base model my ass.
>>
>>108334600
Ah. I have 11 on one NVMe and 10 on the other NVMe. Games mixed on both. I need to deep clean both and install Linux on one and a fresh install of 11 on the other, I think. Both are 2TB. When I was in high school I took some coding courses, but it was C# and Java at the time that I learned. I haven't touched an IDE in years. I did have ComfyUI running at one point but I was too much of a brainlet to do much with it, and I did have oobabooga on the same PC too. I'm just trying to have it on a different machine, so I can game when I want without having to open or close things, and preferably in a different room where I'm not pumping 1000+ watts of heat into my bedroom. Although opening the window currently works to cool things down, that won't always work.
>>
>>108334626
>as each slice of the model moves independently
That might actually not be implemented in llama.cpp. It loads them sequentially on my system. It wouldn't matter for most people since they are limited by their storage bandwidth.
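napkin math (all numbers assumed) for why the load ends up storage-gated rather than PCIe-gated:
[code]
model_gb  = 320   # total weights spread across the ten cards
nvme_gb_s = 3.0   # assumed NVMe sequential read speed
pcie_gb_s = 2.0   # assumed PCIe 4.0 x1 throughput per card

# sequential load: each shard goes disk -> RAM -> one card's x1 link,
# so the slower of the two links gates the whole thing
print(model_gb / min(nvme_gb_s, pcie_gb_s) / 60, "min best case")  # ~2.7 min

# a 30-minute load implies ~0.18 GB/s effective, i.e. the real bottleneck
# is something slower still (HDD, mmap page faults, single-threaded reads)
print(model_gb / (30 * 60), "GB/s effective")
[/code]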
>>
>>108334646
>It loads them sequentially on my system.
Interesting.
Does direct-io or mmap help, or maybe do the opposite?
I'd fuck around with those flags and see if anything changes, if you haven't already.

> It wouldn't matter for most people since they are limited by their storage bandwidth.
True, I suppose.
>>
>>108334626
I don't remember the details, but I was under the impression that each card got a different subset of model layers.
I'm too stupid to understand how inference works, but I'm under the impression that the data "moving between" each layer is much less than the layers themselves, i.e., each layer is a matrix, and each compute step is just a vector? So the vector being moved around would be an order of magnitude smaller and wouldn't need as much bandwidth.
I don't know how PCIe bifurcation works but I thought it was static, each card would be hardstuck with 1 lane.
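order-of-magnitude sketch (assumed 70B-ish dimensions) of why the per-token traffic between layer-split cards is tiny:
[code]
hidden    = 8192        # assumed model width
act_bytes = hidden * 2  # one fp16 hidden-state vector per token: ~16 KB

# one transformer layer is roughly 12 * hidden^2 weights (attention + MLP),
# which stays put on its card
layer_bytes = 12 * hidden**2  # ~0.8 GB at 1 byte/weight (q8-ish)

print(act_bytes, layer_bytes, layer_bytes // act_bytes)  # vector is ~50,000x smaller
# so even a x1 link copes with layer-split (pipeline) inference;
# tensor parallelism is the mode that actually hammers the interconnect
[/code]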
>>
Does anyone truly understand vLLM? I've been trying to get 0.17.0 to work on the Kaggle 2xT4 environment for two days because I want to use Qwen3.5 and feel like I have lost a lot of brain cells. I can't manage to get anything more recent than 0.13.0 to work there. What is the best way to understand it better? Just try to read and understand the docs?
>>
>>108334813
I remember trying it once and just giving up due to all the fiddliness.
Granted, I didn't try that hard but still.
Isn't there a docker image of the latest version somewhere?
Or one with a decently up-to-date version you could upgrade from?
That might be easier than rawdogging it.
>>
my job is paying for claude max and it's so good at shitting out ui code in seconds it's depressing. meanwhile qwen 122b keeps going into thinking loops constantly when I try local workflows. It spoiled me, never try it.
>>
I just tried 27B Derestricted. It's a fucking retarded piece of shit... most of the time, but after more testing I found a few times that it performed smarter than the original to a surprising degree. And I am not referring to unsafe prompts with that statement, or prompts with sensitive content. Such a weird model. The "heretic v2" version by someone else is much more consistent and actually feels somewhat close to the base model but with fewer refusals, though I'm not certain if it's an equal model generally across sfw and nsfw. My personal belief is that it's probably a bit dumber. Sometimes it does seem to understand prompts slightly better. Sometimes worse. The issue is that the base model is pretty damn dumb to begin with, even if it's good for a 27B.
>>
>>108334841
If you want to be less depressed, try Claude through Antigravity. Night and day, Antigravity being night. It will make you see that Claude Code is about half really smart model, half good scaffolding and UI, not just a mega genius model.
>>
File: F.png (31 KB, 1505x225)
Guys... wtf?

It was working yesterday!
>>
>>108334859
NTA, but funny. My friend just reported the exact opposite.
>>108334841
>qwen 122b keeps going into thinking loops constantly
Use BNF to constrain the size of the thinking block.
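a sketch of the grammar route, assuming a llama.cpp llama-server on localhost:8080 (its /completion endpoint takes a GBNF string in the `grammar` field; the grammar and the 2000-char cap below are illustrative, not tested):
[code]
import requests

# cap the think block, then force the closing tag;
# GBNF's bounded repetition {0,n} does the limiting
grammar = r'''
root ::= "<think>" [^<]{0,2000} "</think>" rest
rest ::= ([^<] | "<")*
'''

resp = requests.post("http://localhost:8080/completion", json={
    "prompt": "...",      # your usual prompt here
    "grammar": grammar,
    "n_predict": 1024,
})
print(resp.json()["content"])
[/code]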
>>
>>108334614
That's rough, but typically these LLMs don't consume much power when idle so it should be alright. Seems like a server CPU setup would be better though.
>>
>>108334859
>>108334841
What if he uses qwen 122b in Claude Code?
>>
>>108334357
It's still promising that the guy isn't wearing a leather jacket.


