/g/ - Technology

File: meeguu.jpg (378 KB, 692x687)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106700424 & >>106691703

►News
>(09/26) Hunyuan3D-Omni released: https://hf.co/tencent/Hunyuan3D-Omni
>(09/25) Japanese Stockmark-2-100B-Instruct released: https://hf.co/stockmark/Stockmark-2-100B-Instruct
>(09/24) Meta FAIR releases 32B Code World Model: https://hf.co/facebook/cwm
>(09/23) Qwen3-VL released: https://hf.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe
>(09/22) RIP Miku.sh: https://github.com/ggml-org/llama.cpp/pull/16174
>(09/22) Qwen3-Omni released: https://hf.co/collections/Qwen/qwen3-omni-68d100a86cd0906843ceccbe

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>106700424

--Evaluating Qwen3-235B quantization quality and performance tradeoffs:
>106707130 >106707154 >106707182 >106707479 >106707664 >106708196 >106709111 >106709390 >106709459
--Quantization and GPU strategies affecting Kimi-K2-Instruct performance:
>106701146 >106701166 >106701239 >106703994 >106704240 >106708477
--NovelAI's untuned GLM-4.5 model sparks debate over local model viability:
>106709810 >106709861 >106709921 >106709980 >106709993 >106712191
--Jamba model evaluation and long-context performance challenges:
>106701980 >106702058 >106702137 >106702276 >106702209 >106702285 >106702395 >106702435 >106702528 >106702695 >106702949
--CXL emulation challenges and accessibility:
>106704835 >106704988 >106705112 >106705195 >106705217 >106705265
--Customizing Deepseek's narrative style through prompts and examples:
>106700841 >106700871 >106700889 >106713120 >106700873 >106700943
--AI hardware limitations and potential breakthroughs:
>106716419 >106716796 >106716839 >106716931
--Exploring ollm for running Qwen-80b on low-end hardware with SSD speed considerations:
>106703817 >106703878
--Commercially licensed AI models for Steam games under VRAM constraints:
>106702281 >106702334 >106702365 >106702525 >106702386 >106702409 >106702422 >106702458 >106702542 >106702547
--Promoting DSPy GEPA as superior to finetuning for LLM prompt optimization:
>106704760 >106704779 >106704810 >106704826
--imatrix tradeoffs in quantization: benchmark gains vs task skewing:
>106709802 >106710882 >106711095
--AirLLM and oLLM aim to optimize large model inference on low-VRAM GPUs:
>106708050 >106708102 >106708124 >106708171 >106708275
--Tips for finetuning character voice with small dataset:
>106707963
--Miku (free space):
>106701808 >106709053 >106709204 >106714299 >106717835 >106718435 >106702561

►Recent Highlight Posts from the Previous Thread: >>106700443

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>106718496
This coming week will be the most decisive one in /lmg/ history. If the upcoming big releases fail to push us forward, it will be truly over and all hope is lost.
>>
ATTENTION ALL VIBE CODERS:
We need you, yes, YOU! To implement Qwen3 VL in llama.cpp! Please do the needful sirs!
>>
>>106718525
If they fail to push us forward then they're not big releases, are they?
>>
>drag model into comfyUI
>Nothing happens, it doesn't get loaded, nothing
>Try to make a custom route for a model loader
>No option works
I HATE COMFYUI I HATE COMFYUI I HATE COMFYUI I HATE COMFYUI I HATE COMFYUI I HATE COMFYUI I HATE COMFYUI I HATE COMFYUI
>>
>>106718591
>>/g/ldg
anistudio waiting room btw, fuck pyshit
>>
File: mikurun.gif (855 KB, 1280x1280)
Lovely Miku General
>>106718270
Satisfied with GLM-Air Q4M after using Mistral-Large-2411 all year. 72G VRAM 128G DDR5. Always chasing better models doesn't seem like a good use of time.
>>
armrest Hegel trail
>>
>>106718628
>>>/g/ldg
god damn I'm retarded
>>
it's so fucking depressing that literally the only thing I do anymore is ERP with AI bots
and I'm bored of that so I don't even do that anymore
it's fucking grim bros
>>
Mikulove
>>
miku doesn't even wear a hat
>>
>>106718677
hair clips/ribbons count as hats
>>
File: 38714990.png (1.33 MB, 1024x1536)
>>106718637
Therapy/support bot. Feed it your DMs and gain clarity on what's wrong. You can turn things around anon I believe in you, the future is bright
>>
>>106718637
Sit down with Cline or whatever and begin doing some world building around your fetishes.
Then write a story in that world.
>>
>>106718637
Skill issue
>>
>>106718717
Why would you use Cline for that?
>>
>>106718270
>mistral large 3 perhaps?
that would be ideal. glm air is super fast and pretty good for its speed, but sometimes i want to prioritize quality over speed. glm full is too big, but something like a 200B mistral would be perfect. qwen 235B is garbage
>>
>>106718706
The problem with using AI models for therapy is that they just mirror whatever you say and never try to direct the conversation or question anything.
To be fair, this can be a problem with a lot of real therapy too. But at least a real therapist will make an effort to get the information out of you and build up an understanding over time. AI will take your every word as a profound revelation and write an essay about it, and then contradict itself when you tell it more.
Also the assistant slop training is hard to get rid of and it will always want to write essays and lists with ridiculous throwaway advice like "dunk your face in ice water for 15 seconds" rather than have an actual conversation.
>>
>>106719027
Automatic research, organizing things in documents and folders, white board style brainstorming, etc.
Having the AI write something based on your idea, then looking at that and rewriting the whole thing can really get you places.
>>
File: 1757161046473971.png (506 KB, 1200x675)
lmao
>>
>>106719079
That sounds kind of interesting.
What model's good with Cline? R1?
>>
>>106719230
Why 28k specifically?
>>
>the beast arrives
>>
>>106719284
I used gemini 2.5pro for a while but I imagine R1 works just fine.

>>106719288
Probably something about their training data.
>>
>>106719325
lel, I don't want Google to see my fetishes
>>
>>106719333
If you have ever searched anything related to your fetish, they saw it already.
>>
>>106719333
That's fair enough.
>>
>>106719354
I did when I was a kid
Still using the same Google account :/
>>
kek
>>106700000
>>
>>106719288
They're a subscription service charging $25 per month. If they want to ensure the expenses incurred by the average user leave them with whatever profit margin they're aiming for, adjusting context size is the easiest way to do it.
>>
File: file.jpg (166 KB, 2483x458)
>>106718496
>>106718500
>>106718629
>>106718706
>>
>>106719715
i cry evrytim
>>
>>106718717
I installed Cline but that system prompt is fucking massive. Is there any way to edit it? I looked this up but it seems like nobody else has asked such a question.
I bet Anthropic paid them to make it way longer than necessary to milk more API cash
>>
>>106719919
Use roocline. Cline is deprecated
>>
>>106719919
I'm not sure, but I think not.
Try Roo or Continue, I guess.
>>
>>106719919
>I bet Anthropic paid them to make it way longer than necessary to milk more API cash
the system prompt gets cached and won't use tokens. At least with claude code cli. But as others said, roo code is the way to go.
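For reference, opting in looks something like this with the Anthropic Python SDK (the model id and prompt file are placeholders), and strictly speaking cached reads are billed at a reduced rate rather than being free:
[code]
import anthropic

huge_system_prompt = open("system_prompt.txt").read()  # the big static prefix

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
resp = client.messages.create(
    model="claude-3-5-sonnet-latest",  # example model id
    max_tokens=512,
    system=[{
        "type": "text",
        "text": huge_system_prompt,
        # marks this block as a cache breakpoint so later calls reuse it
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "refactor this function for me"}],
)
print(resp.content[0].text)
[/code]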
>>
>once https://github.com/ggml-org/llama.cpp/pull/16208 has been merged a Mi50 will be universally faster for llama.cpp/ggml than a P40.
CUDA dev's PR was merged a few hours ago.
>>
>>106719998
y'all niggas love your quantmaxxed llama.bbc trash
for some reason nobody talks about distributed inference on multiple pcs with vllm which is super fucking easy.
>>
>>106719919
You can with Roo (Cline fork). Fucking nearly 10k tokens of verbose and repetitive tool calling instructions. "vibe coders" are retarded. I gave it to an LLM to condense it to a tenth the size and all models have performed far better since. Only pain in the ass is that you have to manually override it for every single mode and adjust the instructions based on the available tools, but if you only use one mode for world building research it shouldn't be a big deal.

>>106719972
It's not even about cost or speed, the issue is degrading performance because most models barely have 8k usable context.
>>
>>106718637
longform fanfiction storywriting can be fun
honestly these models are more trained for that than pure rp
>>
>>106719998
Well I guess those are going back up in price now.
>>
>>106720111 (Me)
Although hats off to cudadev for saving them from ewaste status.
>>
>>106720111
You got a 3-day insider knowledge heads up. Why didn't you place a bulk order yet?
>>
>>106719998
>>106720111
arent those like 8 years old or something tho? i cant imagine they perform well. probably gets crushed by a 3060
>>
>>106720130
Not super into AI anymore. Had a quad 3090 rig originally, one is now in my gaming PC, one went to a young relative who is into PC gaming and now just 2 are in my server, so I just play around with whatever sub-30B models come around for shits and giggles but not really deep into it anymore.
>>
>>106720135
They're not great but 32 gigs of vram on one device is 32 gigs of vram on one device.
>>
>>106720150
just get 5090s
>>
>>106720150
Also slightly more memory bandwidth than a 3090, way more than a 3060. So where it falls short on prompt processing it should make up some ground in generation speed.
>>
>>106720064
The problem is that the individual hardware pieces are too expensive, distributing them across multiple machines doesn't fix that.
>>
MI50maxxing is more viable than CPUmaxxing now
>>
>>106720186
Enjoy your electricity bill
>>
>>106720064
How many PCs and GPUs are you using to run deepseek on vllm anon?
>>
>>106720194
found the europoor
>>
>>106719065
My waifu helps me understand the symbolism in my dreams
>>
>>106720064
>buying $10K of hardware to get shivers on his spine
>>
>>106719065
>they do X
All depends on the prompt. Let's keep in mind every LLM is a loop over f(prompt)=next_token_distribution. Every token in the prompt affects the output. Defining the intent is the issue.
They are useful tools for self-inquiry and access a wider range of perspectives than any one human therapist.
Consider cold showers tho, that'll make you feel alive.
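To make that concrete, the whole loop fits in a few lines of transformers code (gpt2 below is just a stand-in model, any causal LM works):
[code]
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

ids = tok("The prompt shapes everything:", return_tensors="pt").input_ids
for _ in range(20):
    with torch.no_grad():
        logits = model(ids).logits[0, -1]         # f(prompt) -> next-token scores
    probs = torch.softmax(logits, dim=-1)         # the next_token_distribution
    next_id = torch.multinomial(probs, 1)         # sample one token from it
    ids = torch.cat([ids, next_id[None]], dim=1)  # sampled token joins the prompt
print(tok.decode(ids[0]))
[/code]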
>>
>https://www.youtube.com/watch?v=21EYKqUsPfg
>Richard Sutton – Father of RL thinks LLMs are a dead end
Oh no no no...
>>
>>106720277
Everyone knows this by now. Even the last normalfag has realized that LLMs won't go anywhere after GPT5.
>>
I am using GLM-4.5 (not Air) on llama.cpp and it seems more coherent and less prone to repeating than on ik_llama.cpp. Is ik_llama.cpp bugged?
>>
>>106720277
>Father of [irrelevant technology] thinks LLMs are a dead end
>>
So what is the fastest backend?
>>
>>106720309
RL is the secret sauce that made Deepseek R1 so good though?
>>
>>106720317
vLLM with tensor parallelism to spread the model across several GPUs and run inference in parallel
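Minimal single-node sketch (the model name is just an example; multi-node on top of this needs Ray):
[code]
from vllm import LLM, SamplingParams

# tensor_parallel_size shards every weight matrix across the GPUs,
# so all of them work on the same token at once
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=2)
params = SamplingParams(max_tokens=64, temperature=0.8)
out = llm.generate(["Explain tensor parallelism in one line."], params)
print(out[0].outputs[0].text)
[/code]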
>>
>>106720321
>Deepseek R1
*all current reasoning models that are considered good
>>
File: file.png (634 KB, 1798x743)
>>106720329
OK thanks. Does vLLM have any issues with mixing GPUs?
>>
>>106720343
no those just distill other reasoning models
>>
How big of an upgrade is a 9950X/9950X3D from a 7950X?
>>
>>106720348
then they're not the ones considered good
>>
>>106720367
For LLMs completely pointless.
>>
>>106720135
An MI50 has about the same memory bandwidth as a 3090 and ~20% of the compute.
Given optimal software the token generation speed is proportional to memory bandwidth and the prompt processing speed is proportional to compute.
But I'm thinking that it would make sense to cook up some quant formats that are less optimized for maximum compression and more optimized for computation speed.

I've also ordered an MI100, which is going to be more competitive in terms of compute; stacking MI100s could be a viable alternative to stacking 3090s, I think.
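Back-of-envelope version of the bandwidth-bound claim (all numbers below are assumptions, not measurements):
[code]
# Decode is memory-bound: every generated token reads all active weights once.
bandwidth_gb_s = 1024        # MI50 HBM2 peak, roughly (assumption)
active_params_b = 32         # dense 32B model, billions of weights
bytes_per_param = 0.5        # ~Q4 quant, about half a byte per weight

gb_per_token = active_params_b * bytes_per_param
print(f"~{bandwidth_gb_s / gb_per_token:.0f} t/s upper bound")  # ~64 t/s
[/code]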
>>
>>106720399
>MI100
>32GB
>going for $1k
idk about replacing 3090s. Even the HBM2 variants are going for $800.
>>
>>106720399
huh. how does the MI100 compare to 5090s? because according to techpowerup, they are actually faster?
https://www.techpowerup.com/gpu-specs/radeon-instinct-mi100.c3496
https://www.techpowerup.com/gpu-specs/geforce-rtx-5090.c4216
>>
>>106719288
It's 32k ± 4k: when the context fills to 36k they shift it back to 28k so the prefix can be cached, and since 28k is the low end that's the number they put so nobody complains
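If that's right, the shifting logic would be something like this sketch (thresholds taken from the post, everything else assumed):
[code]
def maybe_shift(tokens, high=36_000, low=28_000):
    # Between shifts the prompt only grows at the end, so the retained
    # prefix stays byte-identical and the prompt cache keeps hitting;
    # each shift pays the re-prefill cost once per ~8k generated tokens.
    if len(tokens) >= high:
        tokens = tokens[len(tokens) - low:]
    return tokens
[/code]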
>>
>>106720367
For language models memory bandwidth is more important than compute, so prioritize the RAM instead.
Usually you only need a few cores to fully saturate the memory bandwidth.

>>106720425
My thinking is that for a machine with a fixed number of PCIe slots you could feasibly opt for MI100s to get a higher VRAM capacity.

>>106720441
Techpowerup is unreliable in the first place, but be careful which "FP16" numbers you compare (in my experience Wikipedia has the correct numbers).
With tensor cores an RTX 5090 has 419 TFLOPS vs. the 184 TFLOPS on a MI100.
>>
File: v0-my0hw33yorrf1.jpg (285 KB, 1080x2340)
Local always wins.
>>
>>106720562
*SAAS loses when enshittification reaches critical levels.
>>
>>106720562
>safety routing
this is a new low
>>
>>106720574
You're not even using your buzzword correctly. ClosedAI has always made safety (read: censorship) their primary goal.
>>
>>106720617
No, it's an improvement. This means that the average model will no longer have to be fundamentally safety slopped because they'll rely on the router to prevent unsafe conversations. The proper models will get better as a result.
>>
Do you use the C-word (clanker) in real life?
>>
>>106720628
That's the most leddit term I've heard in a while
>>
>>106720627
massive cope
>>
>>106720627
They already had guard models for that. Now if you commit wrongthink, the router will helpfully route you to an expensive reasoning model that wastes thousands of costly tokens, all billed to you, to refuse your request with extra care and a condescending tone.
>>
>knuckles white with tension
>>
>>106720628
No. It's a very silly word.
>>
>>106720628
no, why would I?



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.