/g/ - Technology

File: 38714990.png (1.33 MB, 1024x1536)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106718496 & >>106700424

►News
>(09/26) Hunyuan3D-Omni released: https://hf.co/tencent/Hunyuan3D-Omni
>(09/25) Japanese Stockmark-2-100B-Instruct released: https://hf.co/stockmark/Stockmark-2-100B-Instruct
>(09/24) Meta FAIR releases 32B Code World Model: https://hf.co/facebook/cwm
>(09/23) Qwen3-VL released: https://hf.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe
>(09/22) RIP Miku.sh: https://github.com/ggml-org/llama.cpp/pull/16174
>(09/22) Qwen3-Omni released: https://hf.co/collections/Qwen/qwen3-omni-68d100a86cd0906843ceccbe

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: threadrecap.png (1.48 MB, 1536x1536)
►Recent Highlights from the Previous Thread: >>106718496

--Gemma3 context shift workaround and llama.cpp default behavior changes:
>106722276 >106722305 >106722485 >106723962 >106724000 >106724028 >106724072 >106724704 >106724711 >106724767 >106725006 >106725723 >106725742 >106725770 >106725984 >106727385 >106728340 >106728355 >106728387 >106724087 >106724441 >106724502
--Resolving speed discrepancies in GLM model quantization due to tensor offloading issues:
>106728273 >106728304 >106728320 >106728351 >106728540 >106728668 >106728732 >106728567
--llama.cpp PR boosts Mi50 performance, sparking debate on GPU viability vs 3090/3060:
>106719998 >106720111 >106720130 >106720146 >106720135 >106720150 >106720162 >106720399 >106720425 >106720441 >106720502
--GLM model performance evaluation against Deepseek and K2 with quantization and hardware considerations:
>106721256 >106721266 >106721281 >106721285 >106721308 >106721325 >106721277 >106721487 >106721557 >106721616 >106721781 >106722001 >106722664 >106722939
--GLM 4.5 memory capacity and subscription pricing strategy:
>106719230 >106719288 >106719673 >106720463
--LLMs rely on statistical pattern matching, lacking true generalization:
>106723945 >106724342 >106724746 >106724939 >106725024 >106725363 >106725666
--Integrated GPU VRAM vs dedicated GPU tradeoffs:
>106725213 >106725225 >106725240 >106725307 >106725277 >106725244 >106725258 >106725601
--Model flowchart and Qwen 30b model discussion:
>106725416 >106725673 >106725731 >106725766 >106727552
--vLLM backend performance and GPU compatibility challenges:
>106720317 >106720329 >106720344 >106720949 >106720970 >106721029
--Adjusting koboldcpp settings to remove <|thinking|> tags via Auto-Parse and "/nothink":
>106724881 >106725045 >106725433
--Miku (free space):
>106718629 >106718706 >106722210 >106722293 >106722737 >106726048 >106729335

►Recent Highlight Posts from the Previous Thread: >>106718500

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
hunyuan-image-80b... forgotten...
>>
Wondering, what's a realistic local setup for Qwen3-VL-235B-A22B-Instruct when the ggufs drop? Aiming for a minimum of 90% of the fp16 base model's output quality, >5 tok/s, and at least 32k context. FP16 needs 8x 80GB GPUs for ~470GB of weights plus ~60GB for all the active stuff (according to chatgpt). So a q5 gguf would be roughly 160GB of weights and something between 20-40GB for the active stuff. So I'm wondering if it would be possible to run the q5 gguf quant on a system with a single 5090 and 256GB RAM, using the GPU for the active params and offloading the rest of the weights to RAM.
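Rough back-of-envelope under stated assumptions (~5.5 effective bits/weight for a Q5_K-ish quant, guessed layer/head dims; check the real config before buying anything):

[code]
# Back-of-envelope sizing for an MoE gguf quant. All numbers are assumptions:
# ~235B total / ~22B active params, ~5.5 effective bits/weight for Q5_K-ish,
# and guessed layer/head dims for the KV cache.

def quant_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_b * bits_per_weight / 8

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, ctx: int, bytes_per_elem: int = 2) -> float:
    """Rough fp16 KV cache: K and V, per layer, per kv head, per position."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1e9

total_weights = quant_gb(235, 5.5)   # everything, most of it sitting in system RAM
active_slice  = quant_gb(22, 5.5)    # per-token active experts you'd want resident on the GPU
kv            = kv_cache_gb(layers=94, kv_heads=4, head_dim=128, ctx=32_768)

print(f"weights ~{total_weights:.0f} GB, active slice ~{active_slice:.0f} GB, 32k KV ~{kv:.1f} GB")
# -> roughly 162 GB of weights, ~15 GB active, ~6 GB of KV: fits 256GB RAM + a 5090 on paper,
#    but whether the 5 tok/s target survives the offload is the real question.
[/code]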
>>
>>106720254
Does anyone know any good prompts for this?
I just ask my favorite model for feedback and then remind it that it's an AI model and prone to validating without critically analyzing, and ask it to look over the conversation again and find the parts that don't make sense.
>>
File: file.png (1.63 MB, 3770x762)
>>106729830
i doubt it. qwen3 235b without the vision runs like shit for me if i put any of it onto RAM. a single 5090 would definitely not be enough
>>
So I can't just dump my loose collection of markdown files into RAG huh? I actually have to do the work and prepare data for the model to consume huh?
>>
>>106729869
yeah, my feeling tells me I'll need at least 4x 3090 even when quant- and offload-maxxing, but the math theoretically adds up. what do you define as running like shit btw? probably ~5 tok/s? for my usecase, 5 tok/s would be enough.
>>
>>106729935
a q5 gguf ran at around 4t/s for me. the more you offload, the worse it gets. even offloading a tiny bit can reduce your performance by as much as 80%. you're probably better off either vram maxxing, using a smaller quant, or getting high speed ddr5 (which is extremely expensive if you're using a server or workstation motherboard)
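To put numbers on that, a crude bandwidth-bound estimate (bandwidth figures are assumptions; it ignores KV cache reads, PCIe transfers, and prompt processing):

[code]
# Decode is roughly memory-bandwidth-bound: every generated token streams the
# active weights once, so the slowest memory pool dominates.

def tok_per_s(active_gb: float, frac_in_vram: float,
              gpu_bw: float = 1800.0,   # GB/s, 5090-class VRAM (assumed)
              ram_bw: float = 80.0):    # GB/s, dual-channel DDR5-ish (assumed)
    t = active_gb * frac_in_vram / gpu_bw + active_gb * (1 - frac_in_vram) / ram_bw
    return 1.0 / t

active_gb = 15.0  # ~22B active params at ~5.5 bits/weight
for frac in (1.0, 0.9, 0.5, 0.0):
    print(f"{frac:4.0%} of active weights in VRAM -> ~{tok_per_s(active_gb, frac):.0f} tok/s")
# Even 10% spilled to RAM cuts the ceiling from ~120 to ~38 tok/s in this toy model,
# which is why partial offload feels disproportionately bad.
[/code]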
>>
>>106729880
>...I see the fact X is already mentioned in the system note, no need to focus on it.</think>
So this is the power of RAG...
>>
>>106729880
semantic chunker
metadata extractor and embedder
multidimensional vector embeddings (semantic, keyword, metadata etc.)
sql tree struct docflow
late-interaction retrieval mechanism (colpali/colqwen)
reranker
make everything agentic (minimal sketch of the core retrieval flow after this list)

or just use morphik ai
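For reference, a minimal sketch of just the embed -> retrieve -> rerank core of that stack, assuming sentence-transformers-style models (the model names are common examples, not recommendations):

[code]
# Minimal embed -> retrieve -> rerank skeleton. Model names are placeholders;
# swap in whatever embedder/reranker you actually run locally.
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

embedder = SentenceTransformer("all-MiniLM-L6-v2")               # bi-encoder: cheap, high recall
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # cross-encoder: slow, high precision

chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]   # output of your chunker
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, top_k: int = 50, final_k: int = 5) -> list[str]:
    # stage 1: brute-force cosine search over all chunk vectors
    q = embedder.encode([query], normalize_embeddings=True)[0]
    order = np.argsort(-(chunk_vecs @ q))[:top_k]
    candidates = [chunks[i] for i in order]
    # stage 2: cross-encoder rerank of the shortlist only
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: -p[0])
    return [c for _, c in ranked[:final_k]]

print(retrieve("what does the doc say about X?"))
[/code]

Everything else in the list (metadata, routing, the agentic loop) layers on top of that core.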
>>
>>106730022
I guess it's time to get out of the silly tavern kiddie pool
>>
>GLM4: 30b
>GLM4.5: 350b
We're on track for our first 3T model if the patterns hold up.
>>
>>106730114
They had their one good release. Now it's time for them to order from the updated menu. The menu options are: Safety, Scale, or Synth (can choose all 3).
>>
I get it. AI slop is basically like the uncanny valley of text. At times it's close enough to human that you can get immersed in the illusion, but then it hits you with artificial shit and immediately breaks the flow. And just like the uncanny valley, some people don't see or detect it, even when they're aware of it. Funny how that works.
>>
>>106730147
So the Cohere playbook is it?
>>
>>106730114
it's 4.6 so your math doesn't add up, sorry
>>
>>106730205
GLM4.5.5.5 is going to be 30T and we're going to need a colony on the sun just to power the machines to run it
>>
>>106729969
mhm ok, we'll see. I'll let others figure it out. GPT tells me the current 4-bit autoround quant of the vl model still needs 2 RTX 6000 GPUs. That's like $14k. Granted, it runs much faster than the offload memes, but I'll gladly go snail mode for half the price.

>>106730054
don't research RAG. It's such a jeeted clusterfuck of a field. A quick peek in the fun world of RAG:
https://www.reddit.com/r/LocalLLaMA/comments/1ned2ai/building_rag_systems_at_enterprise_scale_20k_docs/
But since I've already made the mistake of researching RAG, here's my wisdom:
- ignore knowledge graphs like graphRAG. absolutely useless garbage.
- your docs have pictures or graphics? colpali or colqwen is a must, which also means you need a strong vision model for retrieval (which you probably can't run locally)
- avoid ocr like the plague (by using colpali/colqwen and a vision model). If you absolutely need ocr for tables or whatever, use dots.ocr
- semantic chunking is good, but sometimes not good enough. Using structured output generated from the docs can help, though it's often not feasible. LangExtract is a framework for that
- you'll need some agentic routing and reranking eventually as you scale. top_k 10 won't cut it with thousands of docs, unless your metadata game is on point.
- oh yeah, metadata. generate metadata for everything. helps the llm dodge retrievals that are irrelevant and just happened to have semantic similarity (rough sketch of one chunk format below)
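One possible chunk + metadata format for plain .txt sources; the field names and the [META] header are arbitrary conventions I'm assuming here, not any standard:

[code]
# One way (not the only way) to attach metadata to plain-.txt chunks: keep a
# structured record per chunk and render a small header into the prompt text.
import hashlib, json
from pathlib import Path

def make_chunk_record(path: Path, section: str, chunk_text: str) -> dict:
    return {
        "id": hashlib.sha1(f"{path}|{section}|{chunk_text[:64]}".encode()).hexdigest()[:12],
        "source": str(path),
        "section": section,        # e.g. the heading the chunk came from
        "doc_type": "notes",       # whatever taxonomy fits your corpus
        "keywords": [],            # fill via an LLM or keyword-extractor pass
        "text": chunk_text,
    }

def render_for_prompt(rec: dict) -> str:
    header = {k: rec[k] for k in ("source", "section", "doc_type", "keywords")}
    return f"[META] {json.dumps(header, ensure_ascii=False)}\n{rec['text']}"

rec = make_chunk_record(Path("notes/glm_setup.txt"), "hardware", "GLM 4.5 quant sizing notes ...")
print(render_for_prompt(rec))
[/code]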
>>
>still no goof of 80b
It's over...
>>
altere arm lighters
>>
File: pwned.jpg (96 KB, 1713x326)
>>106729809
>>106729810
>>
>>106730318
>>
https://huggingface.co/tencent/HunyuanVideo-Foley
video-to-sound-effects (foley) audio model
>>
>>106730457
https://litter.catbox.moe/miyv3qo1d9a2q8dv.mp4
>pos: a woman moaning in an intimate scene with a man
>neg: music,

https://litter.catbox.moe/x9rru6dc89vzcc1j.mp4
>pos: a woman with a high pitched voice is moaning and making slurping and slapping sounds when the penis enters her mouth
>neg: music,
>>
>>106730523
loool
>>
>>106730523
>second one
You should pretend that it's supposed to be a robot waifu and that this is basically AGI.
>>
>>106730266
>- oh yeah, metadata. generate metadata for everything. helps the llm dodge retrievals that are irrelevant and just happened to have semantic similarity
Nta. When ensuring a document, or specific information within a document, has metadata (let's assume I'm only using raw .txt files), how should it be formatted? Should it just be basic information about the document or section placed before or after it, or something else?
>>
>>106729809
>>(09/26) Hunyuan3D-Omni released: https://hf.co/tencent/Hunyuan3D-Omni
>>(09/25) Japanese Stockmark-2-100B-Instruct released: https://hf.co/stockmark/Stockmark-2-100B-Instruct
>>(09/24) Meta FAIR releases 32B Code World Model: https://hf.co/facebook/cwm
>>(09/23) Qwen3-VL released: https://hf.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe
>>(09/22) Qwen3-Omni released: https://hf.co/collections/Qwen/qwen3-omni-68d100a86cd0906843ceccbe
What is all this shit? Is it all slop?
>>
>>106731075
no goofs
>>
Gee I can't wait for GLM-4.6 to release and it's the same '+5 points in agentbench and worse in anything creative' like we've seen from Deepseek and Kimi.
>>
>>106731251
At least you will have goofs...
>>
>>106731251
>worse in anything creative
They didn't get worse; you just fried your brain and have a higher threshold for pleasure
>>
File: 5kM.jpg (423 KB, 756x906)
>>106730435
>>106729809
>>106729810
It's all off-topic now kek
>>
>>106731279
The old versions are still okay though?
>>
File: 1750710731510258.png (1.97 MB, 1200x1767)
>>106731197
>>
AIs ten years from now will be made to talk to modern models and they will feel embarrassed.
>>
>>106731517
stop humanizing clankers
>>
If 4.6 comes out and it's just a safetyslopped benchmaxxed 4.5 it's truly over
>>
File: 1740094338302083.jpg (35 KB, 406x388)
>>106731517
Not gonna happen, these retarded labs will keep doing retarded benchmaxxing until some intelligent guy in his garage makes something better
>>
4.6 will not be worth using.
>>
is there any evidence that glm 4.6 actually exists and wasn't just a typo on the website
>>
I know this isn't local, but you guys seem competent enough to discuss this
how come tech companies aren't using LLMs to push ads? like, if you were to shill for products and whatever, be they your own products or someone else's, LLMs look like a great tech for it. so why spend many billions on this shit, yet not give them an actual application?
>>
>>106732029
Despite your perception that LLMs are not used for advertising, tech companies are actively integrating them into their ad products, although not in the overt, "shilling" manner you might expect. The reasons for this more subtle approach are complex, combining ethical concerns, technical limitations, and the need to protect consumer trust.
>>
>>106732053
>ethical concerns
>protect consumer trust
fucking gpt5 slop if i ever saw it
tl;dr: yes, companies are using llms to summarize the vast data they ingest, you can see it on everything from amazon reviews to youtube.
https://blog.youtube/news-and-events/new-youtube-ai-tools-summer-2025/
>>
>>106732053
>>106732094
I see, so it's not used for outright shilling as I said, but it's being integrated to suggest stuff

I was thinking of using it for ads beyond simple search, though... something contextual, like, make LLMs read and explain the context of articles, and then have other LLMs that create ads and put them as footers in the articles

this guy asked chatgpt and got a bunch of similar ideas that go beyond adding results to a search: https://markcarrigan.net/2025/09/22/agentive-llms-and-the-coming-wave-of-ad-tech/
>>
Damn, I really like this Mistral 24B, but 90 seconds per prompt feels awful. Would Q4 speed it up noticeably over Q5? I can't quite squeeze the entire Q5 onto my 16gb card, so some of it's loaded into RAM.

I don't wanna make it too retarded or I might as well stick to the 12B Q8 I've been using. I just hate when I can read faster than the streaming text output.
>>
is aider actually good? Trying to drop sonnet4 and I want something that can search a local codebase
>>
>>106732303
having the entire model in VRAM makes a huge difference. try the Q4
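Rough weight-size math (effective bits/weight for these quants are approximate assumptions; KV cache and CUDA overhead come on top):

[code]
# Why Q4 can fit a 16 GB card while Q5 doesn't quite: approximate effective
# bits/weight for common gguf quants (assumed values, not exact).
params_b = 24  # Mistral-Small-class model
for name, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5)]:
    print(f"{name}: ~{params_b * bpw / 8:.1f} GB of weights, before KV cache and overhead")
# -> roughly 14.4 GB, 17.1 GB, and 25.5 GB respectively
[/code]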
>>
>>106732350
aider is less popular now due to them missing the MCP fad entirely (still don't support it) so now people either use IDE plugins like Roo/Cline or Claude Code CLI or its various forks and copycats
>>
>>106730166
>Cohere playbook
Please exclude that book from the datasets. If they ever release anything meaningful again I'll eat a Miku.
>>
>>106732392
I'm playing with it. I like that it generates a database folder that can be searched. Cline gave me shit results. There are a lot of good ideas here but I really don't want to have to stitch a bunch of random shit together to do what I want. Every tool I find is almost there but not quite.
>>
File: file.png (653 B, 491x42)
>>106732366
Now that's a tight squeeze, definitely faster at least.
>>
>>106731591
robophobe
>>
>>106732392
>MCP meme
I can't wait for them to have their redemption arc when everyone abandons MCP



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.