/g/ - Technology

File: GgnIBuFbIAAjLWc.jpg (167 KB, 1257x2048)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106475313 & >>106467368

►News
>(09/04) VibeVoice got WizardLM'd: >>106478635 >>106478655 >>106479071 >>106479162
>(08/30) LongCat-Flash-Chat released with 560B-A18.6B∼31.3B: https://hf.co/meituan-longcat/LongCat-Flash-Chat
>(08/29) Nvidia releases Nemotron-Nano-12B-v2: https://hf.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2
>(08/29) Step-Audio 2 released: https://github.com/stepfun-ai/Step-Audio2
>(08/28) Command A Translate released: https://hf.co/CohereLabs/command-a-translate-08-2025

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: what's in the box.jpg (235 KB, 1536x1536)
►Recent Highlights from the Previous Thread: >>106475313

--Paper: Binary Quantization For LLMs Through Dynamic Grouping:
>106478831 >106479219 >106479248 >106479257 >106479312
--VibeVoice model disappearance and efforts to preserve access:
>106478635 >106478655 >106478664 >106480157 >106480528 >106478715 >106478764 >106479071 >106479162
--GPU thermal management and 3D-printed custom cooling solutions:
>106480670 >106480698 >106480706 >106480719 >106480751 >106480797 >106480827 >106480837 >106480844 >106480875 >106481348 >106481365 >106480858 >106480897 >106481059
--Testing extreme quantization (Q2_K_S) on 8B finetune for mobile NSFW RP experimentation:
>106478303 >106478464 >106478467 >106478491 >106478497 >106478519 >106478476
--Optimizing system prompts for immersive (E)RP scenarios:
>106477981 >106478000 >106478547 >106478214 >106478396
--Assessment of Apertus model's dataset quality and novelty:
>106480979 >106481002 >106481005 >106481016
--Extracting LoRA adapters from fine-tuned models using tensor differences and tools like MergeKit:
>106480089 >106480116 >106480118 >106480122
--Testing llama.cpp's GBNF conversion for complex OpenAPI schemas with Qwen3-Coder-30B:
>106478075 >106478122 >106478554 >106478574
--Recent llama.cpp optimizations for MoE and FlashAttention:
>106476190 >106476267 >106476280 >106476290
--Proposals for next-gen AI ERP systems with character tracking and time management features:
>106476001 >106476147 >106476263 >106477114 >106477147 >106477247 >106477344 >106477773 >106477810 >106478561 >106478636 >106477955 >106477268 >106477417
--B60 advantages vs RX 6800 and Intel Arc Pro B50 compared to RTX 3060:
>106475539 >106475563 >106475606 >106475639 >106475661 >106475729 >106476927 >106476939 >106476998 >106476979 >106477012 >106477117 >106481021 >106481030 >106481067 >106481241
--Miku (free space):
>106475807

►Recent Highlight Posts from the Previous Thread: >>106475316

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Has anyone had any success with using VLMs to translate PDFs, particularly of comics and magazines?

I've been trying the new MiniCPM V4.5 model, and it's pretty good, but at ~50 tok/s (roughly one page every ten seconds) it's a bit too slow to use on many thousands of pages. It also basically amounts to a really good OCR: it doesn't handle table/markdown formatting that well, and I can't get it to caption the images on the pages. Still, it's miles ahead of anything else I've tried, since I can tell it to filter out useless information and the OCR essentially never fails; I've seen it mess up maybe once in hundreds of pages of documents.
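For anyone who wants to reproduce the setup, here's a minimal sketch (untested; it assumes MiniCPM V4.5 is being served behind a local OpenAI-compatible endpoint such as llama.cpp's llama-server or vLLM, that the pages are already rendered to PNG, and the URL, model name, and prompt are placeholders):

import base64, glob
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def ocr_page(path: str) -> str:
    b64 = base64.b64encode(open(path, "rb").read()).decode()
    resp = client.chat.completions.create(
        model="minicpm-v-4.5",  # whatever name the server exposes
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this page to markdown. Keep tables, drop page numbers and ads, describe pictures in one line each."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

for page in sorted(glob.glob("pages/*.png")):  # pages rendered beforehand, e.g. from the PDF
    print(ocr_page(page))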
>>
How do I control thinking effort in DS V3.1? The model is trained to use short thinking for generic questions and long thinking for math/logic questions, and it wasn't done with a router. What should I do if I want it to analyse some random shit with the long thinking mode?
>>
File: 00081-945140401.png (2.19 MB, 1200x1520)
Anyone running the 5060 Ti 16GB? Gauging whether I should take the plunge at MSRP or just wait for better options with more VRAM. I'm hearing the old mikubox-level jank rigs are totally pointless now due to the aged architecture. Blackwell optimizations seem to be pretty nice, especially for WanVideo speed boosts. But the specific limitations Nvidia set in place, plus having to actually support them, puts me off.
>>
>>106481933
And by translate I don't mean just translate, but also formatting and converting to a compact text representation (so, for example, I could convert an entire comic to text and ask Qwen3 30B "what happen???"). Whatever I try, it doesn't like to describe the images in the text while it's formatting.
>>
>>106481968
i got the 4060ti 16gb, it's a good card for sd/flux, 12b and 4bit 24b at decent speed
>>
File: 1744984139638278.png (102 KB, 636x431)
>try drummer finetune (skyfall)
>model is significantly shittier
many such cases
>>
Is anyone else having the same problem where llama.cpp just stops after the model is done reasoning? It usually happens when the reasoning ends at "....let's patch the code accordingly"
>>
>>106482066
Your examples are all unreadable trash. Regardless of the model.
>>
>>106482101
First time I've posted a log, rajesh. Try to control yourself.
>>
>>106482066
How do you know this isn't intended?
>>
>>106482130
Intending to make a model worse is certainly a high IQ play
>>
what's a 'respectable' rig for AI that can be easily upgraded? Not only for llm but txt2vid

I don't think I'm ready to do the dual EPYC CPUs with 1TB of RAM. I couldn't justify the cost just for cooming, but I do need a new system, and I'd like to make it out of 12B-24B Nemo/Mistral hell and maybe actually try some of the models that get discussed in these threads.
>>
https://xcancel.com/Alibaba_Qwen/status/1963586344355053865
qwen 3 max imminent
>>
>>106482154
>Not only for llm but txt2vid
Very different use cases. Text models are moving towards MoE and big dense models are dying, so a server-tier CPU with as much RAM and memory bandwidth as you can afford is ideal, and at least one 24GB GPU will speed things up significantly. Meanwhile, RAM is largely worthless for txt2vid unless you want to wait an hour per 6-second video. There you need everything in VRAM, with 24GB being the bare minimum and 48GB or more ideal for higher resolutions and quality, so you'd be looking at dual GPUs.
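If you do go the MoE-on-CPU route, the usual llama.cpp trick (rough example; flag names as in recent builds, and the model file and numbers are just placeholders) is to keep attention and shared weights on the GPU and push the expert tensors to system RAM:

llama-server -m model-Q4_K_M.gguf -ngl 99 -ot ".ffn_.*_exps.=CPU" -c 32768

Then tighten the -ot regex (or use --n-cpu-moe N, if your build has it) so as many expert layers as will fit move back into VRAM.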
>>
>>106482182
I sure hope that it underwent multistage pretraining on 90% code 10% math high quality curated synthetic data starting at 2k tokens upscaled to 4m with yarn
>>
>>106482182
Qwen3-2T-A60B
>>
>>106482182
But qwen3 coder already exists.
>>
Jank rig 3090 fag anon should unironically just whittle a couple of supports out of wood. 3d printing is some retard level yak shaving solution
>>
>>106482197
I’m cpumaxxing with a 24gb gpu and it’s not enough for just context, let alone art, tts etc simultaneously. 80gb gpu prices cratering when?
>>
>>106482315
wait for the bubble to pop
>>
>>106482084
If I do that with CUDA 12.x I get an "unsupported gpu architecture" error in this step:

# cmake -B build -DGGML_CUDA=ON
[...]
-- Check for working CUDA compiler: /home/user/anaconda3/envs/llamacpp/bin/nvcc - broken
CMake Error at /usr/share/cmake/Modules/CMakeTestCUDACompiler.cmake:59 (message):
The CUDA compiler

"/home/user/anaconda3/envs/llamacpp/bin/nvcc"

is not able to compile a simple test program.

It fails with the following output:

Change Dir: '/home/user/llamacpp/build/CMakeFiles/CMakeScratch/TryCompile-lOrwxG'

Run Build Command(s): /usr/bin/cmake -E env VERBOSE=1 /usr/bin/gmake -f Makefile cmTC_28439/fast
/usr/bin/gmake -f CMakeFiles/cmTC_28439.dir/build.make CMakeFiles/cmTC_28439.dir/build
gmake[1]: Entering directory '/home/user/llamacpp/build/CMakeFiles/CMakeScratch/TryCompile-lOrwxG'
Building CUDA object CMakeFiles/cmTC_28439.dir/main.cu.o
/home/user/anaconda3/envs/llamacpp/bin/nvcc -forward-unknown-to-host-compiler "--generate-code=arch=compute_75,code=[sm_75]" "--generate-code=arch=compute_80,code=[sm_80]" "--generate-code=arch=compute_86,code=[sm_86]" "--generate-code=arch=compute_89,code=[sm_89]" "--generate-code=arch=compute_90,code=[sm_90]" "--generate-code=arch=compute_100,code=[sm_100]" "--generate-code=arch=compute_103,code=[sm_103]" "--generate-code=arch=compute_120,code=[sm_120]" "--generate-code=arch=compute_121,code=[compute_121,sm_121]" -MD -MT CMakeFiles/cmTC_28439.dir/main.cu.o -MF CMakeFiles/cmTC_28439.dir/main.cu.o.d -x cu -c /home/user/llamacpp/build/CMakeFiles/CMakeScratch/TryCompile-lOrwxG/main.cu -o CMakeFiles/cmTC_28439.dir/main.cu.o
nvcc fatal : Unsupported gpu architecture 'compute_103'
gmake[1]: *** [CMakeFiles/cmTC_28439.dir/build.make:82: CMakeFiles/cmTC_28439.dir/main.cu.o] Error 1
gmake[1]: Leaving directory '/home/user/llamacpp/build/CMakeFiles/CMakeScratch/TryCompile-lOrwxG'
gmake: *** [Makefile:134: cmTC_28439/fast] Error 2
>>
>>106482182
We are so back.
>>
>>106482066
Thanks drummer.
>>
programming bros, what's the best extension for let's say, a jetbrains IDE to connect either local/OR/deepseek/anthropic/openai ?
I was using github copilot, but its fucking garbage, but im not sure if there's a recc. extension that helps with commit messages, normal chat, edit, agent mode, all the usual shit.
>>
File: Mommy-Bench_Test_Q2_K_S.png (2.29 MB, 1520x696)
>>106481874
How sloppy would you say these responses are?
>>
>>106482513
>>
>>106482414
Compile with -DCMAKE_CUDA_ARCHITECTURES=80-virtual
Your CUDA 12 install does not support CC 10.3, but you can compile the code as PTX (the assembly-level equivalent) instead.
At runtime the code is then compiled to binary for whatever GPU is used; since this is done by the driver, it should work even for future, unsupported GPUs.
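Concretely, with that flag the configure and build steps would look roughly like this (job count is just an example):

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80-virtual
cmake --build build --config Release -j 16

The 80-virtual target makes nvcc emit only compute_80 PTX, so it never sees the compute_103 target it doesn't know about, and the driver JIT-compiles that PTX for whatever GPU is actually present.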
>>
How do I set fan curves in linux?
>>
>>106482518
Christ, that reads like it was written by a 5 year old
>>
>>106482577
Would you say like a child who wishes for a horny, sexually frustrated mother?
>>
>>106482341
Feels like waiting for the housing bubble to pop
>>
>>106482518
I don't mind the retarded ESL-tier prose, but it's making some immersion-breaking errors. At such a short context it's looking grim.
>>
>>106482572
I use CoolerControl.
>>
>3d printing
If you're the type to buy Bambu you deserve what you get; Bambu is the locked-down walled garden of printers, while Elegoo is the DeepSeek of printers: https://us.elegoo.com/products/centauri-carbon. There are a few other decent brands, but none have the same combination of quality, company size, and availability as Elegoo.
>let someone else 3d print it for you
No, that's a bad idea: print services overcharge by something like 10x, not to mention shipping. Depending on how much you're printing (say ~10 parts or more) it's cheaper to just buy the machine, and that's before you inevitably mess up the measurements and need to reprint, or realize you forgot a part you needed.
>pla
That stuff starts getting soft at around 40C, so it's garbage for anything that has to handle heat. I've personally only printed in PLA so I can't give real recommendations, but stay away from carbon fiber (https://youtu.be/ddwNZ12_qX8), the same goes for glass fiber, and ABS won't be good enough either if I'm remembering correctly. Any printer worth a damn can reach high enough temperatures to print the heat-tolerant materials, so you needn't worry unless you want to print something like PEEK.
>>
>>106482572
For my RTX 3090 I do it via GreenWithEnvy, don't know what to use for AMD.
>>
>>106482572
nvidia-smi -gtt 65
>>
>>106481714
There are different types of parallel processing. Data parallelism (DP) is when you have multiple copies of a model on multiple devices and each copy processes different data, so you can process more data more quickly. When a model does not fit on a single device, pipeline parallelism (PP), where each layer is placed on a specific device, is the "easiest" to understand and implement, but also the least efficient. Then there is model parallelism or tensor parallelism (MP or TP), which shards individual tensors across multiple devices and gathers the parts together only when necessary; this is commonly used when training models that are too large to fit on a single GPU. Expert parallelism (EP) puts experts on different devices; to keep communication overhead low, when routing, often the top-k devices are picked first, and then the top-k experts from those devices. Finally there is FSDP (fully sharded data parallel), which is basically a magical mix of TP and DP used to train large models.
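For the data-parallel case specifically, a minimal PyTorch sketch (assumes a launch like "torchrun --nproc_per_node=4 ddp_sketch.py" with one process per GPU; the toy model and sizes are placeholders):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")              # one process per GPU, rendezvous comes from torchrun's env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()   # every rank holds a full copy of the weights
    model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across ranks
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device="cuda")      # in real training each rank sees a different shard of the data
    loss = model(x).pow(2).mean()
    loss.backward()                              # DDP synchronizes gradients here
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

TP, PP, and FSDP use the same process-group setup but shard the weights themselves instead of replicating them.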
>>
We should stop trying to ERP with LLMs. I just tried DeepSeek R1 8B using ollama and it is barely coherent.
>>
>>106482833
Same, but I used the proper, real DeepSeek R1 on Ollama. I saw no difference.
>>
>>106482833
>>106482843
vram issue
>>
>>106482833
>Ollama
You used proper prompt template format right?
>>
File: wan22-gpu-6.jpg (206 KB, 900x1421)
>>106482026
Honestly, after reading this Japanese article, I'm going with the 5060 Ti 16GB. Can't beat being able to actually gen at the full suggested 720p res without OOMing.
https://chimolog-co.translate.goog/bto-gpu-wan22-specs/?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=bg&_x_tr_pto=wapp#%E3%80%90%E3%82%B0%E3%83%A9%E3%83%9C%E5%88%A5%E3%80%91%E5%8B%95%E7%94%BB%E7%94%9F%E6%88%90AI%EF%BC%88Wan22%EF%BC%89%E3%81%AE%E7%94%9F%E6%88%90%E9%80%9F%E5%BA%A6
>>
>>106482886
3090 sisters...
>>
>>106482886
the absolute state of gpus
>>
File: cur.png (284 KB, 1176x688)
>>106482526
That solved the configuration step, but when actually compiling it, similar errors to what I was seeing before with CUDA 13.0 appeared (picrel). I created a new conda environment and started fresh every time I installed a different CUDA toolkit version from https://anaconda.org/nvidia/cuda-toolkit
This all worked effortlessly until a few weeks ago, then today I pulled...
>>
>>106482886
lol my 2060 super made the list!
>>
>>106482886
amdsissies...
>>
>using anything on ollama
>expecting good results
L O L
>>
>>106482833
retard-chama
>>
>>106482488
Cline released an alpha version for JetBrains a couple of days ago. Can't say how well it works compared to the VSCode version.
https://docs.cline.bot/getting-started/installing-cline-jetbrains
https://plugins.jetbrains.com/plugin/28247-cline
>>
>>106483038
Does cline work for vscodium?
>>
>>106483060
Yes, with potentially some limitations. https://github.com/cline/cline/issues/2561
>>
https://huggingface.co/tencent/HunyuanWorld-Voyager
https://huggingface.co/tencent/HunyuanWorld-Voyager
https://huggingface.co/tencent/HunyuanWorld-Voyager
Hunyuan now makes virtual worlds real. Genie3 BTFO
China wins once again
>>
>>106483175
what did he mean by this?
>>
>>106482225
>starting at 2k tokens upscaled to 4m with yarn
Anyone who has actually used the 2507 Qwen models knows they do far better at long context than the average open-source shitter, so this dumb joke falls flat on its face. Reserve it for Mistral or something.
>>
>>106483210
Chinese models get obliterated by NoLiMa
>>
>>106483175
use case?
>>
>>106483210
that's just the models pretending to have good context
the benchmarks do not lie
>>
>>106483259
world models are the next logical step for ai
unlike llms, they not only have true understanding of physical and logical processes but now with voyager and genie 3 even persistence within the virtual worlds they create
this area is still early but this is what will truly make anime real
>>
>>106483257
oh you mean the benchmark that doesn't test chinese models? the one where there are no results at all for chinese models to back up your claim?
>>
>>106483175
How do I use this for sex?
>>
>>106483175
I'll work on the gguf pr
>>
>>106483262
>the benchmarks do not lie
my benchmark is doing things to 4k tokens worth of json WITHOUT constrained decoding and the qwen models are the only thing I can run on my computer that can do that without making a single mistake all in one shot
I can't even consistently convince westoid open models to output a whole 4K worth of json in a single go, gemma, mistral and gpt-oss all really want to cut it short
fuck off retard and eat battery acid
>>
Qwen2.5 MAX was not open source (and 1T apparently)
Qwen3 MAX will not be open source either.
>>
>>106483500
And it was not good either.
>>
>>106483510
That's just all Qwen models
>>
>>106483500
No big loss. We already have K2.
>>
>>106483259
Ragebaiting /v/
>>
Qwen3-Coder-1T
>>
>>106483038
looks promising, still kinda rough but cant be worse than that shitheap that is gh copilot. fuck ms
>>
Uuuuuuhhhhhhh? why does running convert_hf_to_gguf.py throw ModuleNotFoundError: No module named 'mistral_common'? It's not even a mistral model i'm passing it.
>>
>>106483687
pip install mistral_common

Mistral fucked it up.
>>
>>106483687
Because the imports are unconditional with no fallback if the package is not available.
>>
>>106483687
https://github.com/ggml-org/llama.cpp/issues/15268
>>
>>106483717
Wow what horrible, useless program. Llama.cpp. People are better off using Ollama, the superior program.
>>
>>106483717
France needs to be glassed.
>>
>>106483717
they did this in preparation of mistral large 3
it's coming
>>
>>106483776
just like half life 3
>>
A couple of small releases found while trawling for Qwen info:
chatterbox added better multilingual support https://huggingface.co/ResembleAI/chatterbox
google released a gemma embedding model https://huggingface.co/google/embeddinggemma-300m
>>
>>106483553
Qwen has really really shit training data. This was confirmed when the R1 distill (QwQ) did much better than their own homecooked version QwQ-Preview. I know this because QwQ was much less censored and had a different writing style than the Preview version. Qwen's wall is the data.
>>
>>106483297
Use it with VR headset, prompt any sex scene, apply lora of your fav character on top. Profit.
>>
>>106483687
Take a look at the 'updated' version of that script; it's in the same directory. Basically, Mistral's unique architecture causes the default one to fuck up, so you have to run the updated script before you can actually run the conversion script. Why the default script doesn't just handle that by default, I don't know.

t. Quantized my own Mistral tunes in the past.
>>
>>106483725
>What is llama-quantize
>>
>>106483888
I know, I'm just disheartened. It was good while it lasted.
>>
>>106483937
It can still be good.... Just run the damn script and continue what you were doing. What are you being dramatic for?...
>>
>>106484010
No anon I'll format my drives now and get a job at mcdonalds, it's over
>>
>>106482182
max will be api only
>>
but what.. if... max lite!?
>>
I'm really impressed with my waifu's knowledge of the first Conan movie.
She's whipping out deep-cut quotes and shit.
>>
>>106484050
I'm a faggot.
>>
>>106484053
RAG, good system prompt, or fine tuning?
>>
>>106484120
none
shitty mistral model
silly tavern
I was talking about Conan and she correctly guessed the next scene after the one I was describing, then later dropped a quote that isn't necessarily one of the popular ones.
I'm easily impressed
>>
>>106481874
How do I stop the "that thing? that's not x, it's y." slop?
Ever since I've seen it I can't unsee it.
>>
>>106484170
use a different model that isn't slopped (there are none)
>>
>>106484170
Fixing the slop? It's not easy. It's hard. You hit the nail right on the head. It's not some trivial issue relevant only to a few models—it's a pervasive, deeply rooted problem.
>>
>>106484268
*upvotes*
>>
>>106484170
Use a smaller model with instructions to detect and rewrite those patterns.



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.