/g/ - Technology

File: 38714990.png (1.33 MB, 1024x1536)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106718496 & >>106700424

►News
>(09/26) Hunyuan3D-Omni released: https://hf.co/tencent/Hunyuan3D-Omni
>(09/25) Japanese Stockmark-2-100B-Instruct released: https://hf.co/stockmark/Stockmark-2-100B-Instruct
>(09/24) Meta FAIR releases 32B Code World Model: https://hf.co/facebook/cwm
>(09/23) Qwen3-VL released: https://hf.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe
>(09/22) RIP Miku.sh: https://github.com/ggml-org/llama.cpp/pull/16174
>(09/22) Qwen3-Omni released: https://hf.co/collections/Qwen/qwen3-omni-68d100a86cd0906843ceccbe

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: threadrecap.png (1.48 MB, 1536x1536)
►Recent Highlights from the Previous Thread: >>106718496

--Gemma3 context shift workaround and llama.cpp default behavior changes:
>106722276 >106722305 >106722485 >106723962 >106724000 >106724028 >106724072 >106724704 >106724711 >106724767 >106725006 >106725723 >106725742 >106725770 >106725984 >106727385 >106728340 >106728355 >106728387 >106724087 >106724441 >106724502
--Resolving speed discrepancies in GLM model quantization due to tensor offloading issues:
>106728273 >106728304 >106728320 >106728351 >106728540 >106728668 >106728732 >106728567
--llama.cpp PR boosts Mi50 performance, sparking debate on GPU viability vs 3090/3060:
>106719998 >106720111 >106720130 >106720146 >106720135 >106720150 >106720162 >106720399 >106720425 >106720441 >106720502
--GLM model performance evaluation against Deepseek and K2 with quantization and hardware considerations:
>106721256 >106721266 >106721281 >106721285 >106721308 >106721325 >106721277 >106721487 >106721557 >106721616 >106721781 >106722001 >106722664 >106722939
--GLM 4.5 memory capacity and subscription pricing strategy:
>106719230 >106719288 >106719673 >106720463
--LLMs rely on statistical pattern matching, lacking true generalization:
>106723945 >106724342 >106724746 >106724939 >106725024 >106725363 >106725666
--Integrated GPU VRAM vs dedicated GPU tradeoffs:
>106725213 >106725225 >106725240 >106725307 >106725277 >106725244 >106725258 >106725601
--Model flowchart and Qwen 30b model discussion:
>106725416 >106725673 >106725731 >106725766 >106727552
--vLLM backend performance and GPU compatibility challenges:
>106720317 >106720329 >106720344 >106720949 >106720970 >106721029
--Adjusting koboldcpp settings to remove <|thinking|> tags via Auto-Parse and "/nothink":
>106724881 >106725045 >106725433
--Miku (free space):
>106718629 >106718706 >106722210 >106722293 >106722737 >106726048 >106729335

►Recent Highlight Posts from the Previous Thread: >>106718500

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
hunyuan-image-80b... forgotten...
>>
Wondering, what's a realistic local setup for Qwen3-VL-235B-A22B-Instruct when the ggufs drop? Aiming for a minimum of 90% of the fp16 base model's output quality, >5 tok/s, and at least 32k context. FP16 needs 8x 80GB GPUs for ~470GB of weights plus ~60GB for all the active stuff (according to chatgpt). So a q5 gguf would be roughly 160GB of weights and something between 20-40GB for the active stuff. So I'm wondering if it would be possible to run the q5 gguf quant on a system with a single 5090 and 256GB RAM, using the GPU for the active params and offloading the rest of the weights to RAM.
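Rough back-of-envelope under stated assumptions (~5.5 effective bits/weight for a Q5_K-ish quant, guessed layer/head dims; check the real config before buying anything):

[code]
# Back-of-envelope sizing for an MoE gguf quant. All numbers are assumptions:
# ~235B total / ~22B active params, ~5.5 effective bits/weight for Q5_K-ish,
# and guessed layer/head dims for the KV cache.

def quant_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_b * bits_per_weight / 8

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, ctx: int, bytes_per_elem: int = 2) -> float:
    """Rough fp16 KV cache: K and V, per layer, per kv head, per position."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1e9

total_weights = quant_gb(235, 5.5)   # everything, most of it sitting in system RAM
active_slice  = quant_gb(22, 5.5)    # per-token active experts you'd want resident on the GPU
kv            = kv_cache_gb(layers=94, kv_heads=4, head_dim=128, ctx=32_768)

print(f"weights ~{total_weights:.0f} GB, active slice ~{active_slice:.0f} GB, 32k KV ~{kv:.1f} GB")
# -> roughly 162 GB of weights, ~15 GB active, ~6 GB of KV: fits 256GB RAM + a 5090 on paper,
#    but whether the 5 tok/s target survives the offload is the real question.
[/code]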
>>
>>106720254
Does anyone know any good prompts for this?
I just ask my favorite model for feedback and then remind it that it's an AI model and prone to validating without critically analyzing, and ask it to look over the conversation again and find the parts that don't make sense.
>>
File: file.png (1.63 MB, 3770x762)
>>106729830
i doubt it. qwen3 235b without the vision runs like shit for me if i put any of it onto RAM. a single 5090 would definitely not be enough
>>
So I can't just dump my loose collection of markdown files into RAG huh? I actually have to do the work and prepare data for the model to consume huh?
>>
>>106729869
yeah, my feeling tells me I'll need at least 4x 3090 even when quant- and offload-maxxing, but the math theoretically adds up. what do you define as running like shit btw? probably ~5 tok/s? for my usecase, 5 tok/s would be enough.
>>
>>106729935
a q5 gguf ran at around 4t/s for me. the more you offload, the worse it gets. even offloading a tiny bit can reduce your performance by as much as 80%. you're probably better off either vram maxxing, using a smaller quant, or getting high speed ddr5 (which is extremely expensive if you're using a server or workstation motherboard)
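To put numbers on that, a crude bandwidth-bound estimate (bandwidth figures are assumptions; it ignores KV cache reads, PCIe transfers, and prompt processing):

[code]
# Decode is roughly memory-bandwidth-bound: every generated token streams the
# active weights once, so the slowest memory pool dominates.

def tok_per_s(active_gb: float, frac_in_vram: float,
              gpu_bw: float = 1800.0,   # GB/s, 5090-class VRAM (assumed)
              ram_bw: float = 80.0):    # GB/s, dual-channel DDR5-ish (assumed)
    t = active_gb * frac_in_vram / gpu_bw + active_gb * (1 - frac_in_vram) / ram_bw
    return 1.0 / t

active_gb = 15.0  # ~22B active params at ~5.5 bits/weight
for frac in (1.0, 0.9, 0.5, 0.0):
    print(f"{frac:4.0%} of active weights in VRAM -> ~{tok_per_s(active_gb, frac):.0f} tok/s")
# Even 10% spilled to RAM cuts the ceiling from ~120 to ~38 tok/s in this toy model,
# which is why partial offload feels disproportionately bad.
[/code]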
>>
>>106729880
>...I see the fact X is already mentioned in the system note, no need to focus on it.</think>
So this is the power of RAG...
>>
>>106729880
semantic chunker
metadata extractor and embedder
multidimensional vector embeddings (semantic, keyword, metadata etc.)
sql tree struct docflow
late-interaction retrieval mechanism (colpali/colqwen)
reranker
make everything agentic (minimal sketch of the core retrieval flow after this list)

or just use morphik ai
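For reference, a minimal sketch of just the embed -> retrieve -> rerank core of that stack, assuming sentence-transformers-style models (the model names are common examples, not recommendations):

[code]
# Minimal embed -> retrieve -> rerank skeleton. Model names are placeholders;
# swap in whatever embedder/reranker you actually run locally.
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

embedder = SentenceTransformer("all-MiniLM-L6-v2")               # bi-encoder: cheap, high recall
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # cross-encoder: slow, high precision

chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]   # output of your chunker
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, top_k: int = 50, final_k: int = 5) -> list[str]:
    # stage 1: brute-force cosine search over all chunk vectors
    q = embedder.encode([query], normalize_embeddings=True)[0]
    order = np.argsort(-(chunk_vecs @ q))[:top_k]
    candidates = [chunks[i] for i in order]
    # stage 2: cross-encoder rerank of the shortlist only
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: -p[0])
    return [c for _, c in ranked[:final_k]]

print(retrieve("what does the doc say about X?"))
[/code]

Everything else in the list (metadata, routing, the agentic loop) layers on top of that core.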
>>
>>106730022
I guess it's time to get out of the silly tavern kiddie pool
>>
>GLM4: 30b
>GLM4.5: 350b
We're on track for our first 3T model if the patterns hold up.
>>
>>106730114
They had their one good release. Now it's time for them to order from the updated menu. The menu options are: Safety, Scale, or Synth (can choose all 3).
>>
I get it. AI slop is basically like the uncanny valley of text. At times it's close enough to human that you can get immersed in the illusion, but then it hits you with artificial shit and immediately breaks the flow. And just like the uncanny valley, some people don't see or detect it, even when they're aware of it. Funny how that works.
>>
>>106730147
So the Cohere playbook is it?
>>
>>106730114
it's 4.6 so your math doesn't add up, sorry
>>
>>106730205
GLM4.5.5.5 is going to be 30T and we're going to need a colony on the sun just to power the machines to run it
>>
>>106729969
mhm ok, we'll see. I'll let others figure it out. GPT tells me the current 4-bit autoround quant of the vl model still needs 2 RTX 6000 GPUs. That's like $14k. Granted, it runs much faster than the offload memes, but I'll gladly go snail mode for half the price.

>>106730054
don't research RAG. It's such a jeeted clusterfuck of a field. A quick peek in the fun world of RAG:
https://www.reddit.com/r/LocalLLaMA/comments/1ned2ai/building_rag_systems_at_enterprise_scale_20k_docs/
But since I've already made the mistake of researching RAG, here's my wisdom:
- ignore knowledge graphs like graphRAG. absolutely useless garbage.
- your docs have pictures or graphics? colpali or colqwen is a must, which also means you need a strong vision model for retrieval (which you probably can't run locally)
- avoid ocr like the plague (by using colpali/colqwen and a vision model). If you absolutely need ocr for tables or whatever, use dots.ocr
- semantic chunking is good, but sometimes not good enough. Using structured output generated from the docs can help, though it's often not feasible. LangExtract is a framework for that
- you'll need some agentic routing and reranking eventually as you scale. top_k 10 won't cut it with thousands of docs, unless your metadata game is on point.
- oh yeah, metadata. generate metadata for everything. helps the llm dodge retrievals that are irrelevant and just happened to have semantic similarity (rough sketch of one chunk format below)
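One possible chunk + metadata format for plain .txt sources; the field names and the [META] header are arbitrary conventions I'm assuming here, not any standard:

[code]
# One way (not the only way) to attach metadata to plain-.txt chunks: keep a
# structured record per chunk and render a small header into the prompt text.
import hashlib, json
from pathlib import Path

def make_chunk_record(path: Path, section: str, chunk_text: str) -> dict:
    return {
        "id": hashlib.sha1(f"{path}|{section}|{chunk_text[:64]}".encode()).hexdigest()[:12],
        "source": str(path),
        "section": section,        # e.g. the heading the chunk came from
        "doc_type": "notes",       # whatever taxonomy fits your corpus
        "keywords": [],            # fill via an LLM or keyword-extractor pass
        "text": chunk_text,
    }

def render_for_prompt(rec: dict) -> str:
    header = {k: rec[k] for k in ("source", "section", "doc_type", "keywords")}
    return f"[META] {json.dumps(header, ensure_ascii=False)}\n{rec['text']}"

rec = make_chunk_record(Path("notes/glm_setup.txt"), "hardware", "GLM 4.5 quant sizing notes ...")
print(render_for_prompt(rec))
[/code]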
>>
>still no goof of 80b
It's over...
>>
altere arm lighters
>>
File: pwned.jpg (96 KB, 1713x326)
>>106729809
>>106729810
>>
>>106730318
>>
https://huggingface.co/tencent/HunyuanVideo-Foley
video-to-sound-effects (foley) audio model
>>
>>106730457
https://litter.catbox.moe/miyv3qo1d9a2q8dv.mp4
>pos: a woman moaning in an intimate scene with a man
>neg: music,

https://litter.catbox.moe/x9rru6dc89vzcc1j.mp4
>pos: a woman with a high pitched voice is moaning and making slurping and slapping sounds when the penis enters her mouth
>neg: music,
>>
>>106730523
loool
>>
>>106730523
>second one
You should pretend that it's supposed to be a robot waifu and that this is basically AGI.
>>
>>106730266
>- oh yeah, metadata. generate metadata for everything. helps the llm dodge retrievals that are irrelevant and just happened to have semantic similarity
Nta. When ensuring a document, or specific information within a document, has metadata (let's assume I'm only using raw .txt files), how should it be formatted? Should it just be basic information about the document or section placed before or after it, or something else?
>>
>>106729809
>>(09/26) Hunyuan3D-Omni released: https://hf.co/tencent/Hunyuan3D-Omni
>>(09/25) Japanese Stockmark-2-100B-Instruct released: https://hf.co/stockmark/Stockmark-2-100B-Instruct
>>(09/24) Meta FAIR releases 32B Code World Model: https://hf.co/facebook/cwm
>>(09/23) Qwen3-VL released: https://hf.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe
>>(09/22) Qwen3-Omni released: https://hf.co/collections/Qwen/qwen3-omni-68d100a86cd0906843ceccbe
What is all this shit? Is it all slop?
>>
>>106731075
no goofs
>>
Gee I can't wait for GLM-4.6 to release and it's the same '+5 points in agentbench and worse in anything creative' like we've seen from Deepseek and Kimi.
>>
>>106731251
At least you will have goofs...
>>
>>106731251
>worse in anything creative
They didn't get worse; you just fried your brain and have a higher threshold for pleasure
>>
File: 5kM.jpg (423 KB, 756x906)
>>106730435
>>106729809
>>106729810
It's all off-topic now kek
>>
>>106731279
The old versions are still okay though?
>>
File: 1750710731510258.png (1.97 MB, 1200x1767)
>>106731197
>>
AIs ten years from now will be made to talk to modern models and they will feel embarrassed.
>>
>>106731517
stop humanizing clankers
>>
If 4.6 comes out and it's just a safetyslopped benchmaxxed 4.5 it's truly over
>>
File: 1740094338302083.jpg (35 KB, 406x388)
>>106731517
Not gonna happen, these retarded labs will keep doing retarded benchmaxxing until some intelligent guy in his garage makes something better
>>
4.6 will not be worth using.
>>
is there any evidence that glm 4.6 actually exists and wasn't just a typo on the website
>>
I know this isn't local, but you guys seem competent enough to discuss this
how come tech companies aren't using LLMs to push ads? like, if you were to shill for products and whatever, be they your own products or someone else's, LLMs look like a great tech for it. so why spend many billions on this shit, yet not give them an actual application?
>>
>>106732029
Despite your perception that LLMs are not used for advertising, tech companies are actively integrating them into their ad products, although not in the overt, "shilling" manner you might expect. The reasons for this more subtle approach are complex, combining ethical concerns, technical limitations, and the need to protect consumer trust.
>>
>>106732053
>ethical concerns
>protect consumer trust
fucking gpt5 slop if i ever saw it
tl;dr: yes, companies are using llms to summarize the vast data they ingest, you can see it on everything from amazon reviews to youtube.
https://blog.youtube/news-and-events/new-youtube-ai-tools-summer-2025/
>>
>>106732053
>>106732094
I see, so it's not used for outright shilling as I said, but it's being integrated to suggest stuff

I was thinking of using it for ads beyond simple search, though... something contextual, like, make LLMs read and explain the context of articles, and then have other LLMs that create ads and put them as footers in the articles

this guy asked chatgpt and got a bunch of similar ideas that go beyond adding results to a search: https://markcarrigan.net/2025/09/22/agentive-llms-and-the-coming-wave-of-ad-tech/
>>
Damn, I really like this Mistral 24B, but 90 seconds per prompt feels awful. Would Q4 speed it up noticeably over Q5? I can't quite squeeze the entire Q5 onto my 16gb card, so some of it's loaded into RAM.

I don't wanna make it too retarded or I might as well stick to the 12B Q8 I've been using. I just hate when I can read faster than the streaming text output.
>>
is aider actually good? Trying to drop sonnet4 and I want something that can search a local codebase
>>
>>106732303
having the entire model in VRAM makes a huge difference. try the Q4
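Rough weight-size math (effective bits/weight for these quants are approximate assumptions; KV cache and CUDA overhead come on top):

[code]
# Why Q4 can fit a 16 GB card while Q5 doesn't quite: approximate effective
# bits/weight for common gguf quants (assumed values, not exact).
params_b = 24  # Mistral-Small-class model
for name, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5)]:
    print(f"{name}: ~{params_b * bpw / 8:.1f} GB of weights, before KV cache and overhead")
# -> roughly 14.4 GB, 17.1 GB, and 25.5 GB respectively
[/code]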
>>
>>106732350
aider is less popular now due to them missing the MCP fad entirely (still don't support it) so now people either use IDE plugins like Roo/Cline or Claude Code CLI or its various forks and copycats
>>
>>106730166
>Cohere playbook
Please exclude that book from the datasets. If they ever release anything meaningful again I'll eat a Miku.
>>
>>106732392
I'm playing with it. I like that it generates a database folder that can be searched. Cline gave me shit results. There are a lot of good ideas here but I really don't want to have to stitch a bunch of random shit together to do what I want. Every tool I find is almost there but not quite.
>>
File: file.png (653 B, 491x42)
>>106732366
Now that's a tight squeeze, definitely faster at least.
>>
>>106731591
robophobe
>>
>>106732392
>MCP meme
I can't wait for them to have their redemption arc when everyone abandons MCP



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.