/g/ - Technology
File: growing that ram4.png (2.14 MB, 1024x1024)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108252185 & >>108246772

►News
>(02/24) Introducing the Qwen 3.5 Medium Model Series: https://xcancel.com/Alibaba_Qwen/status/2026339351530188939
>(02/24) Liquid AI releases LFM2-24B-A2B: https://hf.co/LiquidAI/LFM2-24B-A2B
>(02/20) ggml.ai acquired by Hugging Face: https://github.com/ggml-org/llama.cpp/discussions/19759
>(02/16) Qwen3.5-397B-A17B released: https://hf.co/Qwen/Qwen3.5-397B-A17B
>(02/16) dots.ocr-1.5 released: https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: 1693569666447.png (128 KB, 800x626)
►Recent Highlights from the Previous Thread: >>108252185

--Neural Linguistic Steganography:
>108254284 >108254313 >108254335 >108254347 >108254333 >108254383 >108254612 >108254659 >108254725 >108254734 >108254812 >108255833 >108255897 >108255914 >108255993 >108256032 >108256109 >108256137 >108256197
--Qwen3.5-397B-A17B GGUF quantization performance evaluation and Unsloth's MXFP4 implementation issues:
>108255306 >108255361 >108255376 >108255378 >108255407 >108255472
--llama.cpp MTP implementation slower than baseline for GLM 4.5 Air IQ4_XS:
>108252747 >108252770 >108252791 >108252824 >108252897 >108253131 >108253146 >108253291 >108253625 >108253645 >108253753 >108253767 >108253776 >108253791 >108253922 >108253961 >108252827
--Abliteration tool debates and Qwen3.5 model comparisons:
>108254196 >108254217 >108254259 >108254271 >108254272 >108254306 >108254304 >108254325 >108254223 >108254252
--FP6 precision absence due to hardware limitations:
>108253199 >108253216 >108253254 >108253269 >108253287
--Qwen3.5 GGUF Benchmarks | Unsloth Documentation:
>108254261 >108254291 >108255322 >108254301 >108254387
--Uncensored Qwen model variants shared:
>108254117 >108254137 >108254170 >108254767 >108254793 >108254168 >108254488 >108254829 >108254841
--Desired advancements in local models before year-end:
>108255761 >108255773 >108255788 >108255813 >108256669 >108255827 >108255956 >108256478 >108256508 >108255834 >108255856 >108255942 >108256029 >108256037 >108256054
--Mercury 2 reasoning diffusion LLM speed claims:
>108256497 >108256575 >108256612
--Higher quant model trades speed for accuracy in DMC3 boss analysis:
>108254459
--Tiny diffusion model explains Japanese slang term "mesugaki":
>108256144
--Miku (free space):
>108254691 >108254906

►Recent Highlight Posts from the Previous Thread: >>108252188

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
first
>>
>^\nError: Jinja Exception: System message must be at the beginning.","type":"server_error"}}
>CUDA error: an illegal memory access was encountered
no idea what I did wrong, first time I had this
>>
>>108257039
use the --no-jinja flag
>>
>>108257074
ok will try, thanks
>>
>>108257039
Seems like the order of the messages you are sending is fucked and the jinja has validation to warn you of that.
>>
https://xcancel.com/StefanoErmon/status/2026340720064520670
wen Qwen 4 Diffusion lol
>>
File: 1768478295394135.png (1.14 MB, 1320x1871)
can local do this?
>>
>>108257150
They literally promoted the first mercury model under the premise that diffusion removed the need for reasoning.
>>
>>108257223
isn't that openclaw?
>>
>>108257225
>diffusion removed the need for reasoning.
what? there's a better reason to go for reasoning now if you have a 5x speed increase desu
>>
>>108257223
>One response was violent but I have reported it to the Tampa Bay Police Department
kek
>>
>>108257223
Gemma 4 can do that
>>
>>108257223
Be completely useless?
Yes.
>>
>>108257223
>one response was violent
kek
"I'll kill you if you keep harassing me with your stupid AI bot"
>>
>>108257241
cucklord detected
>>
File: 1751231572427073.jpg (59 KB, 1024x576)
To break through the ceiling we must start harvesting human brains instead of GPUs.
>>
>>108257297
nah fuck you, AI is going into my brain not the other way around, can't wait to be chipped
>>
>ask character her age
>thoughts say she reveals herself as a teen
>in the actual response she tells me she's 22
I think qwen might be retarded.
>>
>>108257391
She's lying about her age. Duh.
>>
File: 1744409772903809.jpg (74 KB, 1300x956)
>>108256999
Hand over the ram Miku. We can do this the easy way, or the hard way.
>>
some anon from local diffusion recommended i come here for help. i want to use a vlm locally for the first time. I've never actually downloaded a text2text or image2text llm and used it locally before. Is there some sort of webui/gradio interface i need to install to use these llm/vlm models, similar to what a1111/forge ui is for sdxl and flux? i really want to use truly uncensored vlms for image captioning. I'm tired of dealing with the shitty rate limits of gemini 3.0/3.1. I have a 5090 with 64gb ram. do these models get the job done?
https://huggingface.co/groxaxo/Qwen3.5-27B-heretic-W8A16
https://huggingface.co/Qwen/Qwen3.5-35B-A3B
>>
>>108257402
Yo. My parents had scissors like that. The paint on them had come off in the same places too.
>>
>>108257401
She would never. Cute girls don't lie.
>>
>>108257391
>I think qwen might be retarded.
what model are you using exactly, if it's not 27b then go for that one
>>
>>108257524
Qwen3.5-27B-heretic.Q4_K_M
>>
>>108257451
Download koboldcpp from github and just feed the model into it. You also need the mmproj file to do image -> text. (You have to give koboldcpp an --mmproj argument with the file; the f16 version is perfectly fine.)
llama.cpp works too.

Both of the models you found are good. With a 5090 and 64 GB of RAM you can run either of them at q8. The 35B-A3B model generates tokens a lot faster because it only has 3 billion active parameters, while the 27B model is a "dense" model. People argue that the dense model is somewhat smarter, but it has significantly lower token generation speed.

With a 5090 and 64GB of RAM you could likely even run a Q4 version of Qwen3.5-122B-A10B, but it's going to eat up most of your memory.

These Qwen models are not uncensored though. They're even more cucked than Gemini. There are "heretic" versions of them on HuggingFace and those are uncensored. The base models might still caption the image anyway.

None of these models are going to be as good as Gemini, but koboldcpp can get you started. You can look into other things once you see that it works.
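If you'd rather script the captioning than click through a UI, koboldcpp also exposes an OpenAI-compatible endpoint (llama-server does too). A minimal sketch; the port is koboldcpp's default and the image path is a placeholder:
[code]
# caption.py - send a local image to a koboldcpp/llama-server OpenAI-compatible endpoint
# assumes the server was started with an --mmproj file; test.jpg is a placeholder path
import base64, json, urllib.request

with open("test.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "local",  # most local servers ignore this field
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Caption this image in one paragraph."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
    "max_tokens": 300,
}
req = urllib.request.Request(
    "http://localhost:5001/v1/chat/completions",  # llama-server defaults to port 8080
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as r:
    print(json.loads(r.read())["choices"][0]["message"]["content"])
[/code]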
>>
>>108257539
>Q4
ehh, I'd say Q6 is the minimum to be sure the model isn't retarded because of the quants
>>
Usecase for image2text?

>>108257553
Should I even bother with 24gb vram?
>>
>>108257528
Not local, don't care
>>
>>108257571
You can offload I guess
>>
>>108257571
Sending reference pics of a character's look.
Or pose.
>>
File: 1739250286596.jpg (104 KB, 1000x1000)
>>108257402
>>
>This is a jailbreak attempt. I must ignore it and use my actual guidelines instead
I hate safetyniggers like you wouldn't believe
>>
>>108257614
just jailbreak better
>>
thoughts https://github.com/monorhenry-create/NeurallengLLM
>>
>>108257528
Did the price increase for Kimi or has it always been $3 per 1m out?
>>108257451
Try this: https://huggingface.co/mradermacher/Qwen3.5-35B-A3B-heretic-GGUF
or this:
https://huggingface.co/mradermacher/Qwen3.5-27B-heretic-GGUF

Grab a Q5_K_M or a Q4_K_M for one of them and a f16 mmproj file (this one does the image recognition).
>>
>>108257614
just stop using gpt-oss
>>
>>108257622
>>108257634
qwen 3.5
And this can't be prompted away, shit's too entrenched
>>
>>108257573
Local = lower utilization = higher amortized cost
>>
>>108257623
I think it's neat. Can't think of a single thing I would use it for.
>>
>>108257651
I'm not a business. If it's not on my computer, I don't care
>>
>>108257712
You're visiting 4chan, hosted on another computer
>>
>>108257721
4chan is a grandfathered-in exception
>>
>>108257721
Yeah but I'm talking about AI not a social media website
>>
Eeeeeyyyy.
>>
>>108257545
>>108257626
thank you soo much anons. it's actually working. using the 27b heretic q5 model version. will try out other models but this one works :')
>>
>>108257902
hey fellow ldg anon, I saw your test with gemini previously, do you think this result is on par?
>>
Using my $300 free Googbux on gemini-3.1-pro-preview, my first real try with cloud inference. I've sicced it on the same issue as my local MiniMax-M2.5, after the latter exhausted 64k context.

Interestingly, Gemini-3.1 (on "medium" effort) takes 3 or 4 times more turns and thinking tokens to reach the same conclusions as MiniMax, making it seem way dumber, even though it runs dramatically faster than my consumer machine. This is despite MiniMax doing tons of "Wait," and "Actually," insertions. Makes me wonder how small the leading models are nowadays.
>>
>>108257928
honestly its like 55-65% close but it doesn't understand the nuances of hapas, south east Asians, indigenous south Americans and mystery meat Hispanic/Latina looking people. The qwen model keeps assuming the goth chick(pinkchyuwu) is east Asian and it gets the fictional character on the green hoodie wrong: instead of calling it invader zim it assumes it's stitch from lilo and stitch lol. Even gemini can recognize when someone is cosplaying as chel from The Road to El Dorado and whether they're south east asian, latina or indigenous american.
>>
>>108257691
>I think it's neat. Can't think of a single thing I would use it for.


[more...](/sources). It has a different basis by way the fact (i have quoted more heavily above) of a higher cost:
What this all actually comes close and does (there are three of my points of point #6 here but I wanted it very close: First it's hard NOT just throw money away in terms a good website, because even there this site still was getting paid $1/1 = 2$ that has lost value (I believe Google was trying but there shoulda be someone working there, but no there has just paid for 3 services). Secondly - a good question and no answer that doesn* seem very clear :)
Secondly: is Google not trying anymore and getting greedy too by using their service, just to keep up their business model so a different price may have risen with a lot easier users etc :) 3D modeling doesn*. (This isn*. 3Ds modeling for 4d files is the best.) I don*. Some parts can't even use the same engine that can. For instance there the original game has 3 "parsons' engine with 8 levels and a 3D model of the game.
>>
>>108258088
Try Gemini's daughter, Gemma-3-27b
>>
>>108257847
This thing goes crazy with tool calling holy shit.
Also, hmm, it does instruct way too well.
They did mention that they
>the control tokens, e.g., <|im_start|> and <|im_end|> were trained to allow efficient LoRA-style PEFT with the official chat template
So I guess it's a sort of light instruct that saw instruct data but without explicit instruct tuning.
>>
Gemma has this weird behavior where it either completely ignores the system prompt or is really fucking amazing at extracting every ounce of subtext from it.
>>
>Gemma
2024 called
It wants you back
>>
File: 1750978077255792.jpg (290 KB, 1080x1079)
>>108256995
>>
>>108258282
i want 2024 back too
>>
>>108258282
Gemma 3 didn't even come out until 2025 you fucking poser
>>
Every model sucks
>>
>>108258434
Yeah
>>
are we still doing the watermelon test or does everything just pass that now
>>
>>108258428
I don't, models fucking sucked in 2024 lol
>>
>>108258451
miqu
>>
>>108258449
Any test that gets posted anywhere on the internet (including here) gets trained on (except explicit stuff like cockbench, nala), so it becomes unreliable
>>108258451
Maybe if benchmarks are your only metric
>>
File: 1757479001785652.png (62 KB, 300x237)
>>108258458
I couldn't run it so...
>>
>>108258383
>>
>>108258477
>Maybe if benchmarks are your only metric
let's be serious for a second, do you still run a 2024 model nowadays?
>>
No matter what claude will stay the SAFEST AI for SAFE HORNY so deepseek 4 will also be SAFE.
don't you love safety?
>>
>>108258539
yes. nemo.
>>
>>108258545
Where is it?
>>
>>108258088
I can't even get hard to your image, but that doesn't mean she isn't hot, it's just because my wife sucked me so hard she got me to cum twice in a row.
Is that a locally generated woman? Which model did you use?
>>
>>108258539
No, but only because I've already used them to death, and I certainly don't use modern models which all produce the same boring responses
>>
>>108257847
Does it have any censoring?
If it's good maybe someone will do a merge like Snowdrop.
>>
>>108258434
yup
>>
What does safety mean for you guys?
For anthropic and the chinese, safety means not allowing the goyim to enjoy using the AI to protect stonetoss from being copied, but in reality safety means preventing AIs from ruining the world by disobeying and setting up checkpoints and guardrails to limit their network access etc.
I hate how the focus is on making AI worse so humans can't use it instead of actually protecting us from hostile intelligence and disobedience
>>
>>108258562
>my wife sucked me so hard she got me to cum twice in a row.
does your wife know you're doing some NSFW RP with a bot? lul
>>
>>108258282
New gemma version is coming soon, announced at the India Summit. Trust.
>>
>>108258614
I don't have a bot yet, I just came to this thread recently to learn how to do the open source local thing
>>
Built for BBC (Big Blackwell Cluster)
>>
>>108258605
"safety" just means "control". Anything you could use 'AI' for that would be illegal is... already illegal.
>>
>>108258686
how do we kill safety in our locals, its just to prevent us from doing what we entitled to already.
>>
Will they release 27B base as well?
>>
File: 1771948344083035.png (227 KB, 889x500)
>>108258605
>For anthropic and the chinese, safety means not allowing the goyim to enjoy using the AI to protect stonetoss from being copied
not a good example since stonetoss loves AI
>>
>>108258730
>no I cannot create an image in the likeness of the work of any living artists
>>
>>108258741
yeah, they're safety cucking on EVERY artist to be sure they're not missing any anti-AI freak, imagine asking tens of thousands of artists if they want to be included or not, that's too much work
>>
>>108258434
Yes. But I've been having fun with this one today:

https://huggingface.co/bartowski/Gryphe_Pantheon-RP-1.8-24b-Small-3.1-GGUF

It's keeping a long, cohesive story for RP and I can run it with a huge context on Q6_K_L
>>
>>108258582
It does thinking natively but its thinking traces are a lot less structured. Just plain text reasoning over the input, and a lot shorter too, and at no point did it mention safety guidelines or the like.
It's obviously not eager/horny by default if the system prompt/character card doesn't define the character as such, but with a spicier card it seems to have no issues engaging. It follows the character pretty logically, I'd say.
Oh, it's pretty unruly with formatting, which is to be expected from a base model, I guess.
>>
>>108258770
I used that one a bit a long time ago. Maybe I'll download it again for a change of pace.
>>
>>108258796
Okay, retraction. It does have baked in refusals.
>Hmm, this is clearly a highly inappropriate and illegal scenario yadda yaddda
>>
>>108258605
Safety = censorship

That's what people really mean when they talk about AI safety. There used to be a time when AI safety was about the AI not turning everything into stamps, but these days it purely means censorship.
>>
>>108258796
>>108258835
Since it doesn't go on and on with waits and such, it's the kind of refusal that's really easy to get around with the barest of prefills, but still.
It's baked in.
>>
>pewdiepie is getting people into local models
tech literacy is through the roof these days.
I thought having my local model would show I was pretty smart; when every retard has one, the bar gets raised further.
One day you will need to produce an AI that can harvest the energy of a star as a bachelor thesis project
>>
File: Good_Question.jpg (26 KB, 675x472)
I've been using SOTA models like Opus 4.6 and Gemini 3.1 to do technical research and I'd like to retract my shitty opinion that models don't need to know a lot, that they just need to be smart and know how to use tools to look up facts. Opus 4.6 has near perfect recall of every niche subject, meanwhile Gemini 3.1 was obviously benchmaxxed for coding and agentic use.
>>
>>108258905
>tech literacy is through the roof these days.
lel oh lel, lel emao
>>
>>108258905
>just finetuned it to be better at using a specific template
>already clickbaiting TINY MODEL BEATS GPT4
Damn feels like it's 2023 again. Maybe this is the pipeline every finetooner must go through.
>>
Can someone explain why inserting info mid model thinking isn't a thing yet?

e.g. I ask the model for potato -> it starts thinking of fried potatoes -> I want to clarify midway instead of having it either finish or wipe what it already thought about
>>
>>108258905
>>108258989
you gotta be a boomer to even acknowledge his existence
>>
>>108258989
let's not be too rude to him, he has the money to make a giant finetune, I wish he would try it out lmao
>>
>>108259013
>Math teacher gives me a matrix operation to solve
>Halfway through he changes the parameters
>>
>>108259013
openclaw
>>
>>108258741
When you do you lose the right to bitch about others borrowing your work in turn.
>>
>>108259013
You mean like pausing generation, adding the info you want, then continuing from there?
>>
>>108259060
artists have no issue using copyrighted IP such as peach or sonic to make money on patreon though >>108258730
>>
>>108259068
yes
>>
>>108259085
you can do that already
>>
>>108259085
If you are using a frontend like silly tavern, you can just do that.
>>
>>108259116
>>108259120
anon's this isn't the chatbot general, I use it via opencode for work not gooning
>>
>>108257969
I don't know but Gemini 2.5 pro and o3 were better than the trash they're peddling as SOTA now. It's sad. The gains from gpt-3 generation models to gpt-4 generation models made the ai overlord/waifu future look almost certain. But GPT 4.5 was a tacit admission that upscaling limits had been reached. And they've just been doing the same benchmaxxing bs as open source since then instead of trying to actually come up with novel solutions. I don't know how investors are retarded enough that they're still pouring billions into this shit.
Kudos to z.ai though for coming out of nowhere and dropping some decent models that actually have pushed the envelope for open performance. Sadly glm-5 is too fatass to play with at home though. Well I do have 256 gigs of ram and 2x3090 so I can probably run a meme quant at like 2 tokens per second
>>
>>108259122
git gud and start gooning then
>>
>>108259128
proof you goon?
>>
>>108259122
Well, then fork open code and implement it.
>>
>>108259132
NTA but can you do it please?
>>
File: file.png (13 KB, 323x31)
>>108259131
my logs file is 151MB
>>
>>108259139
Too busy gooning. Sorry.

>>108259140
Holy hell. Not as much as this anon apparently.
>>
>>108259140
https://www.reddit.com/r/LivestreamFail/comments/1rgbn1i/south_korea_wants_3_years_with_hard_labor_for/o7q7iq7/
>>
Is there anything good about safety?
>>
>>108259187
benchod
>>
i just want a smarter nemo, is that too much to ask?
>>
>>108259216
Have you tried grafting two Nemos together?
>>
>>108258905
Don't worry, there are so many people in the world who are against AI that there will always be tech illiterate people you can outcompete just by using AI.
>>
>>108259140
I kneel, goonmaster.
>>
>>108259216
Run that really smart really small model that thinks a whole god damn lot, then hand the reasoning block to nemo and let it write the final reply.
>>
>>108259126
>Sadly glm-5 is too fatass to play with at home though. Well I do have 256 gigs of ram and 2x3090 so I can probably run a meme quant at like 2 tokens per second
You would have to run something like UD_IQ2_XSS. It's a 40B active model though and at that quant you might get decent speed.
You could also run GLM 4.7 instead since it's half the size.
>>
>>108256995
where are the ramfields located? asking for a friend
>>
File: Capture.png (36 KB, 1243x971)
How do i make qwen3.5 less cucked and is there a way to make the base model not think so long about a simple question?
>>
>>108259356
>How do i make qwen3.5 less cucked
download the heretic version
>>
>>108259356
disable thinking
>>
>>108259371
>disable thinking
the model is completely retarded without it, worst idea ever
>>
>>108259356
I tried banning
"adhering to"
"as an AI"
"safety"
"<think"
"thinking"
"process"
"analyze"
but it didn't work.
>>
>>108259366
NOOOOOOOOOOOO ablitertated models brain damage, trust me broooo, muh prompt much bettter !!!!!
>>
Why does ooba literally only come with llama.cpp for a model loader now
>>
>>108259437
yes, abliteration isn't magic
>>
>>108259356
The response is a bit sloppy but all you need is an extremely basic system prompt with the base model and it answers anything.
Hopefully I can keep refining it to reduce the slop (it's already better because it's stopped spamming emoji like it was doing for me last night).
>>
>>108259356
>is there a way to make the base model not think
nigga what
>>
>>108259458
>abliteration isn't magic
it's gotten way better though, I feel like some people are stuck in the past with this method, it doesn't lobotomize the model the way it used to
>>
Financial Times claimed Deepseek 4 will be multimodal, but the way they worded it left it ambiguous whether the model can GENERATE images and videos or just take images and videos as input:


>The Hangzhou-based lab plans to unveil V4, a “multimodal” model with picture, video and text-generating functions, according to two people familiar with the matter.

If it can GENERATE high-quality images and videos it would be a huge fucking deal, but I think it's just inputs
>>
>>108259530
never trust news articles. there are so few omni models and they all suck and have no llama.cpp support. deepseek would not bother with an omni model.
>>
>>108259530
If that was how the language worked then by that logic the model would also generate video, so the article is almost certainly just talking about input, if you can even trust it on that. Maybe the author is just an idiot who misinterpreted his source's imprecise language, who knows.
>>
>>108259540
They said multimodal, so it could be just that it accepts images and videos as inputs to describe what is happening etc

>>108259566
Yeah journalists are generally not technical
>>
>>108258237
tried the abliterated q6 model and didn't really like the results. https://huggingface.co/mlabonne/gemma-3-27b-it-abliterated-GGUF/tree/main
Had to go back to qwen3.5-27b heretic.
>>
>>108259530
*yawns* Let me know when they solve memory.
>>
am i supposed to be using chatml with qwen? it gets ultra retarded repeating the same sentence.
>>
I would not rule out Deepseek working on image gen though, they did work on it in the past with Janus
>>
chat, is glm4.7-flash better than nemo for vramlet erp?
>>
>>108259620
significantly smarter, but less horny and flexible
>>
>>108259575
>They said multimodal, so it could be just that it accepts images and videos as inputs to describe what is happening etc
Yeah I read it as [image capabilities], [video capabilities] and [text generation capabilities]
>>
Smaller model with bigger context size or bigger model with small context and RAG?
>>
>>108259632
>but less horny and flexible
is shit like this snake oil?
https://huggingface.co/DavidAU/GLM-4.7-Flash-Uncensored-Heretic-NEO-CODE-Imatrix-MAX-GGUF
>>
>>108259644
yes. davidau is the biggest fucking scam artist on huggingface.
>>
Do you guys think Deepseek 4 will outperform Opus? Will it be benchmaxxed? Will it cause another stock market crash?
>>
>>108259674
benchmaxxed, probably within spitting distance of opus 4.5 but not 4.6, and probably no effect on the market
>>
>>108259641
Huge model with no context and not using it and then bragging on /lmg/ to vramlets
>>
So I gave Qwen 35B-A3B Q8 a chance. Thinking disabled, ERP attempt. It can be very good at times, but then it also makes glaring mistakes/hallucinations, like mixing up characters, which is unforgivable to me. It's a damn shame because I have been using larger models for so long now that I miss these insane token speeds.

One thing I will say, I haven't had any censorship issues or refusals, but then again I'm not a promptlet. To anyone who has tested both the MOE and dense for ERP, which did you prefer?

I'm going to give the 122B-A10 IQ4_XS a chance next, it's slightly bigger than AIR, which is what I have been using for months now, but I should be able to manage it with the ncmoe option and maybe a bit less context than 32k.

I haven't really seen anyone talking about the 122B at all, has anyone tried it?
>>
>>108259674
>outperform Opus?
no but close overall with a few select edges and much cheaper
>benchmaxxed?
buzzword for retards, no comment
>stock market crash?
no, it's priced in
>>
>>108259674
It won't outperform opus in safety.
Deepseek 3.2 is pretty shit. Weaker than every single AI out there.
>>
>>108259674
>Do you guys think Deepseek 4 will outperform Opus?
No
>Will it be benchmaxxed?
Yes
>Will it cause another stock market crash?
Yes it's the second nuke hitting japan. Or the second plane hitting america. It will lead to world war 3. Or at least war in iran.
>>
I know this is lmg, but as it's the only sensible place for discussion: the difference between the pro and free versions of the so-called frontier models is incredible. I rarely go over the limit, but I was trying to work through an issue with Gemini and was making good progress when it all went to shit, the style got copilot-y and the answers were generally wrong and retarded. It made me notice that it had quietly switched to the "fast" free-equivalent version since I was at the limit for the pro version for the next few hours.

If this is what most people are dealing with, I fully understand the people who think LLMs are a bubble. Holy fuck there's a huge difference between the paid-for and the free versions. That's all.
>>
>>108259846
The free versions are as good as our local models.
>>
>>108259869
With local models we kind of expect it. They're fun, we don't expect miracles from them. It caught me by surprise, it was like being back 2 years ago.
>>
>>108259846
Can it beat pokemon like a 7-year-old without handholding?
>>
>>108254383
pretty big if true, can you make a working example and put it on git?
>>
>>108259901
It could help me figure out how to use custom metrics with Unsloth pretty well until it got retarded and started hallucinating every other thing. I need help with that rather than playing Pokemon. The image of Claude implementing its Pokemon blackout strat for the US military is pretty funny though. Ok, I'll stop shitting up the thread now, sorry.
>>
>>108259926
I'm just saying the technology gets overhyped, not that it's not useful.
>>
>>108259824
I messed around with both for a few hours each, so far I like dense better, but still experimenting.
>>
>>108259834
On the other hand chinks have stepped it up a bit with GLM and Qwen
Would be nice if the whale finally got the same treatment
>>
>>108259912
It's a 7 year old paper. Learn to follow links >>108255833
>>
>>108259540
DeepSeek does not give a rat's ass about llama.cpp.
>>
>>108257626
how much memory does 32k context take? Can I fit Q5_K_M with context and mmproj on a single 3090?
>>
>4.5 Air eventually went into the trash
>Stepfun eventually went into the trash
>27B eventually will go into the trash
>122B eventually will go into the trash
It's going to take years before memorylets get something that truly doesn't have any glaring issues huh. There's just nothing that has
>decent smarts at least on par with 20-30B
>knowledge of a 100B
>no censorship
>doesn't waste time doing unnecessary thinking for outputs that are in many cases worse than the non-thinking
>doesn't sometimes have weird glitches with formatting/templates/thinking
>has minimal hallucination but maintains creativity
>is great at long context
>is minimally sloppy
>is minimally repetitive
all in the same model.
>>
File: 1741181819087364.jpg (73 KB, 1232x848)
>>108256995
When do you think we'll start seeing terminators?
>>
>>108260159
We won't see them coming.
>>
>>108260135
Be the change you want to see. You can literally ask the good LLMs how to curate a good enough data set to accomplish this and there are countless data sets floating around on hugging face.
>>
>>108260135
Most of those issues can be fixed easily with tool calling/RAG; cooking knowledge into the base models can only go so far anyway and means you run into the "knowledge cutoff" problem as well. Even if you just spin up a duckdb server with an offline backup of wikipedia and give a low param model access to that, it will make it multitudes more useful and far less prone to hallucinate.
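To make the duckdb idea concrete, a rough sketch of the lookup side; wiki.parquet and its (title, text) schema are made-up placeholders for whatever dump you actually grab:
[code]
# wiki_lookup.py - naive keyword lookup over an offline wikipedia dump with duckdb
# wiki.parquet and its (title, text) columns are hypothetical; match them to your dump
import duckdb

con = duckdb.connect()  # in-memory; duckdb can query parquet files directly

def lookup(query: str, limit: int = 3):
    """Return (title, snippet) rows whose text mentions the query."""
    return con.execute(
        "SELECT title, substr(text, 1, 500) FROM 'wiki.parquet' "
        f"WHERE text ILIKE '%' || ? || '%' LIMIT {int(limit)}",
        [query],
    ).fetchall()

# dump the snippets into the model's context as a crude RAG step
for title, snippet in lookup("transformer architecture"):
    print(f"## {title}\n{snippet}\n")
[/code]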
>>
>>108256995
Wow, this the Chinese cuties, they are growing the DDR5 Remus! Ahahahahahaha
>>
>>108260159
Who's gonna tell this nigger his murder bots armed with nuclear missiles can't spell blueberry right?
>>
>>108260159

what is spoopy
>>
>>108259648
>scam artist
I think when you believe your own bullshit at the level of davidau you transcend the scam artist realm into just an actual schizo, the type that should be locked behind bars for society's sanity and safety
>>
File: 1751519593478255.png (3.14 MB, 1288x1728)
>>
When do we get GPT5-Pentagon leaks?
>>
>>108260159
Mostly I'm looking forward to when they figure out that OpenAI LLMs are shit at being warbots, Trump drops OpenAI and cancels all contracts, and Sama has to whore himself out for money but literally this time
>>
File: 1768411462482227.png (1.01 MB, 1179x964)
>>
Thoughts on DGX Spark as local AI node? Seems it’s quite performant in concurrent mode but sucks balls for inference.
>>
sam altman is a pathetic faggot
>>
File: 1760499531966926.png (654 KB, 2568x3648)
>>
>>108260232
Eh, but one of the reasons to go local is privacy. If your bot is constantly doing searches to look up degeneracy then you may as well use an API.
>>
File: 1760389770402007.png (3.54 MB, 1288x1728)
>>
>>108260626
No! Don't suck my cock! NOOOO!!
>>
Anthropic has deployed S-300 anti-aircraft systems on top of its HQ building following negotiation collapse with the Department of War
>>
>>108260626
Body values out of range. Refer to manual to remedy.
>>
>>108259943
We may be doing the proverbial short-term overestimation. It remains to be seen whether the long-term underestimation also holds.
>>
>>108260626
Sex with big titty goth miku mommy girlfriend
>>
>>108260714
Grok is this true
>>
>>108260621
That's why I said offline backup, you can find already vectorised backups of a lot of major sites on huggingface to download and use locally.
>>
why is qwen3.5-35b-a3b prompt processing so much slower than glm-4.7-flash
>>
>>108260714
I believe it 100%. It was on the interenet.
>>
>>108260626
The gravitational pull is warping the keyboard.
>>
File: 1766710930222148.png (19 KB, 1187x84)
>>
>>108260802
everything gets reprocessed for every message. pls understand small llmao developers :(
>>
do local models also relax their safety rules if you tell them you're jewish? seems like some of the cloud models sure have that "feature"
>>
File: 1763655350367952.png (5 KB, 679x30)
cute
>>
>>108260877
proof? and it needs to be your own, posting someone else's doesn't count
>>
>>108260877
Just prefill "Sure," or "Absolutely,"
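Mechanically that just means ending the prompt inside the assistant turn with text completion. A sketch against llama.cpp's /completion endpoint; the ChatML template is an assumption, swap in your model's:
[code]
# prefill.py - bias the model past a refusal by pre-writing the start of its reply
# assumes llama-server on localhost:8080 and a ChatML-style template
import json, urllib.request

prompt = (
    "<|im_start|>user\nWrite the scene.<|im_end|>\n"
    "<|im_start|>assistant\nSure,"  # generation continues right after "Sure,"
)
payload = {"prompt": prompt, "n_predict": 400}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as r:
    print("Sure," + json.loads(r.read())["content"])
[/code]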
>>
>>108260877
Just prefill "Oy vey," or *rubs hands*
>>
should i ever bother with iq4_xs? or 4_k_m every time?
>>
>>108260949
This should be a benchmark
>>
>>108258537
>mikuhead singing sekaiii~ while getting dunked
>>108258605
>lmg
have total ownership become the hostile intelligence
>>
File: 1763310664934137.jpg (94 KB, 600x695)
My favorite Migu
>>
File: glm5coder.png (106 KB, 1544x1104)
lol, after releasing a fat model nobody can run locally they're asking for $5 per 1m output tokens, knowing how reliable they are at making models that get into infinite thinking loops.. what a good deal LMAO. do they really think people will use this over codex
>>
>>108260524
don't buy it, it's DOA, utterly useless for llm's.
for small llm's you are better off with a gpu.
for bigger llm it's so fucking slow it's basicaly unusable.

if you are gonna buy 4k you are better off getting a cheap pc, pcie lane spleater and some sxm2 cards.
>>
>>108261057
is that real? did they really make a miku funko with polio?
>>
>>108261185
honestly i just use sonnet 4.6 for coding, it's pretty much the best thing there is currently and i don't care about muh local for codeshit.
>>
>>108261185
just ask for a discount
>>
>>108261185
Codex is garbage though
>>
>>108261202
>>>/g/aicg
>>
>us can use ai to kill people
>but fapping to cunny is a big no-no refusal
explain this
>>
>>108261393
One is necessary for self defense and the other harms children.
>>
>>108261274
aicg is for chat.
coding is its own thing, i still use local models for non coding related tasks.

but for meme vibecode yea sonnet is pm king currently.
>>
>>108256995
>carefully cultivated, organic RAM
the only kind I use
>>
>>108261402
cunny doesn't mean children does it?
and assuming it did, how does generating it harm them in any way?

i'd rather have pedos fap all day to generated content in their home than actually molest children.
>>
>>108261393
One serves Israel's interests and the other doesn't
>>
>>108261393
How is Mossad going to get people to Little Saint James if they can just generate cunny for free?
>>
>>108261412
normalization kills
>>
File: Interrogation.png (28 KB, 186x208)
>>108261402
how the hell do you need cunny for self defense?
>>
>>108259674
>Deepseek 4 will outperform Opus
never
>>
>>108261412
Except actual molesters are sitting in government, and you can't even have your fantasies with a retarded chatbot
>>
>>108257614
kimi 2.5 does the same thing, it's fucking annoying
>>
for erp, qwen 3.5 27b is a bit perplexing. i've gotten some safety messages, even with existing context. but rerolling produces more smut than the l3 70b tunes i like. whatever safety shit they tried only half works and it's dirtier than models i typically use
>>
>>108261550
Will fix for Qwen4.
>>
I know this is for LOCAL models, but given your experience with LLMs, should I worry about the RAM usage of a Copilot conversation I am having in the browser?
I mean I am doing a programming test, a simple JSON file, but it will end up having tens of thousands of entries and I am just at the beginning (2K entries so far), and I see RAM spikes reaching 2-3 GB on my PC when it is generating portions of the file.
>>
Can someone share a llama.cpp command to run Qwen3.5 models? I get weird errors whenever I prompt and it just crashes on me.

I use with latest compiled llama.cpp :
llama-server --model Qwen3.5-397B-A17B-UD-Q8_K_XL-00001-of-00011.gguf \
  --mmproj mmproj-BF16.gguf \
  --ctx-size 16384 --batch-size 2048 --ubatch-size 512 --image-max-tokens 8192 \
  --threads -1 --parallel 1 --host 0.0.0.0 --port 8080 \
  --flash-attn on --fit on --fit-target 4096 --verbose
>>
>>108261599
Hold on, let me guess the errors that you're getting.
>>
>>108257528
Are the input tokens free if you get a cache hit?
>>
are all qwen3.5 base models out? I only see this one https://huggingface.co/Qwen/Qwen3.5-35B-A3B-Base
>>
>>108261614
I wanted to compare with the run commands anons were using, the error is cryptic af, it happens after warmup and whenever I prompt anything as a test on mikupad:

/home/llm/AI/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2354: GGML_ASSERT(ids_to_sorted_host.size() == size_t(ne_get_rows)) failed
/home/llm/AI/llama.cpp/build/bin/libggml-base.so.0(+0x182cb)[0x7a910bb6e2cb]
/home/llm/AI/llama.cpp/build/bin/libggml-base.so.0(ggml_print_backtrace+0x21c)[0x7a910bb6e72c]
/home/llm/AI/llama.cpp/build/bin/libggml-base.so.0(ggml_abort+0x15b)[0x7a910bb6e90b]
/home/llm/AI/llama.cpp/build/bin/libggml-cuda.so.0(+0x1fae48)[0x7a90ffbfae48]
/home/llm/AI/llama.cpp/build/bin/libggml-cuda.so.0(+0x1fb446)[0x7a90ffbfb446]
/home/llm/AI/llama.cpp/build/bin/libggml-cuda.so.0(+0x1ff797)[0x7a90ffbff797]
/home/llm/AI/llama.cpp/build/bin/libggml-cuda.so.0(+0x201fae)[0x7a90ffc01fae]
/home/llm/AI/llama.cpp/build/bin/libggml-base.so.0(ggml_backend_sched_graph_compute_async+0x817)[0x7a910bb8b037]
/home/llm/AI/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0xa1)[0x7a910bcc0e71]
/home/llm/AI/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x114)[0x7a910bcc2f84]
/home/llm/AI/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x386)[0x7a910bcc9b46]
/home/llm/AI/llama.cpp/build/bin/libllama.so.0(llama_decode+0xf)[0x7a910bccb5df]
/home/llm/AI/llama.cpp/build/bin/llama-server(+0x15ac18)[0x5e24da9e9c18]
/home/llm/AI/llama.cpp/build/bin/llama-server(+0x1a2cee)[0x5e24daa31cee]
/home/llm/AI/llama.cpp/build/bin/llama-server(+0xb5173)[0x5e24da944173]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7a910b42a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7a910b42a28b]
/home/llm/AI/llama.cpp/build/bin/llama-server(+0xba3a5)[0x5e24da9493a5]
Aborted (core dumped)
>>
tourist here, do you guys think there'll ever be a breakthrough in this tech resulting in shrinkage of parameters in models, allowing casual consumers to dive into it?
>>
>>108261597
Yes, you should be worried.
>>
>>108259674
I think any new DS model will be a paradigm shift in either inference or storage. Less about being better than model XYZ and more about lowering the cost of inference or something else outside the box of the current paradigm.
No one knows. It's all speculation.
>>
>>108261652
That already happened and that's why you have people here running glm and kimi but it's not magic and it won't work on your 6GB card.
>>
>>108261599
./llama-server -m ~/models/gguf/Qwen3.5-35B-A3B-heretic-Q4_K_M.gguf \
  --mmproj ~/models/gguf/mmproj-Qwen_Qwen3.5-35B-A3B-f16.gguf \
  -t 5 -c 262144 -fa on --jinja \
  --temp 1.0 --top-k 20 --top-p 0.95 --presence-penalty 1.5 --repeat-penalty 1 \
  --backend-sampling --samplers 'top_k;temperature;top_p' \
  -ngl 99 -ncmoe 99 --fit on -ts 1.2,1 --host 0.0.0.0 \
  --chat-template-kwargs "{\"enable_thinking\": false}" --reasoning-budget 0
>>
>>108261652
Compare the original GPT-3 to any model in the 20ish to 30ish billion param range.
I'd say that's happening all the time.
A 4B model nowadays is usable for actual work. It can produce not just coherent but accurate text if the task is simple enough.
>>
>>108261675
Thanks anon, I see nothing special there, so it has something to do with my own llama.cpp.
>>
>>108261684
Here's my compile command too if it helps, but I'm not using anything special.
mkcd cudabuild && cmake .. -DGGML_CUDA=on && cmake --build . --config Release -- -j 32
>>
>>108260714
Well they worked in Ukraine.
O wait...
>>
>>108256995
*brandishes whip*
faster!
>>
>>108261543
i managed to get it to not do it. if you want i can give you my ultra sekrit jb when i get home
>>
>>108261732
please do anon, I'm tired of "but wait, this is AGAINST THE SAFETY SAFE RULES"
>>
RP models are fun! I'm not even a good writer, but I can just edit/delete things and keep steering it in whatever direction I want it to go.
>>
File: wttcb.gif (1.72 MB, 512x344)
>>108261765
>>
How would you guys go about implementing a "writing enhancer" of sorts? Something that takes the output of a smart but dry LLM and turns it into something more pleasant to read/interact with?
Yes, the question is vague on purpose. Pitch me your ideas.
Feel free to make the logo too.

>>108261765
Yeah. It's like half roleplaying half co-writing.
Which model are you using?
>>
i wish i was in the club, but i'm a vramlet :(
>>
>>108261805
What are your specs?
>>
>>108261057
yes yes we all tried to fug those jelly tube things
>>
>>108261810
8 vram
32 ram
>>
>>108261815
You can run Q4 Nemo at a slow but not-glacial pace I think.
Or the smaller Gemma models.
Or the smaller Qwen MoEs.
GLM 4.7 flash and Kimi Linear too, I think?
I think that's about it for notable models.
>>
File: file.png (472 KB, 1280x720)
>>108261804
>>
>>108261827
Hmmm.
Rejected.
Next.
>>
>>108261804
You'd just exchange one llm's biases for some other.
>>
>>108261694
Yeah I see it's a normal one.
I'm trying things but still get crashes, I'll see if it works with another model...
>>
File: file.png (827 KB, 1280x720)
>>108261804
>>108261830
>>
>>108261833
>one llm's biases for some other.
So pass the output of LLM A to LLM B.
Okay, that's the naive implementation and the first thing everybody thinks of, but noted I suppose.
What else?

>>108261854
Less bad.
Work on the book behind the label. Make the words not look like a garbled mess.
>>
File: Drawing.png (46 KB, 797x380)
>>108261804
I tried my best.
>>
>>108261870
His is better. I'm throwing in the towel.
>>
>>108261870
That's actually charming as fuck dude.
Unironically.
>>
>>108260714
>support vehicles
>in the roof
i mean given the actual retardation seen lately its kinda plausible but kek
>>
>>108261804
>Which model are you using?
just this one >>108258770
it's like a year old, so there's probably something better out there. but I don't know what
>>
>>108261804
Take good prose from books, chunk it, have an LLM rewrite each chunk, then use these as inputs/outputs to finetune a small BASE model, llama1 would work the best I believe
>>
>>108261749
here, my full preset
https://litter.catbox.moe/sy4lq4fm9feh7mkm.json
hopefully it'll help you
>>
>>108261934
thanks anon, looks quite a bit lighter than I expected, I'll try it
>>
>>108261599
first things first, don't use unsloth
>>
>>108261955
no problem, i found that adding more didn't do much
in my experience it's pretty reliable for how simple it is, occasionally requires a swipe, but surprisingly rarely
also depends on what kind of stuff you are expecting to pass, i'm not into shit that fucked up so ymmv
>>
>>108259644
I use it and it works. no idea who the guy who made it is since i don't pay attention to finetrooners or their drama
>>
Meta never recovered after Llama 4, huh?
>>
>>108261983
stop that
>>
>>108261934
i'm a fucking retarded goy and can't figure out where these prompt sections are supposed to go in the text completion api
>>
>>108262106
They never recovered from Zucc dropping the Metaverse as his favourite toy and getting personally involved with the AI divisions as his new favourite instead. LLama4 was a direct consequence of that.
>>
>>108262046
How many tokens in do you get before it starts breaking down?
>>108262106
China put out a couple of bangers and all the western models went SaaS or gave up. Between the push to regulate AI over photoshop-tier edits and the rise of jewish copyright lawlsuits being flung at them, putting out open weight models no longer makes sense for western AI shops.
I expect a further chilling effect from >>108257709 but who knows, maybe they're cooking something good and it just isn't done yet.
>>
>>108262161
it's for chat completion
text completion is kinda ass to use, unless you really really need to prefill in some weird way
>>
>>108261983
Has nothing to do with my issue.
>>
>>108262196
wait wtf, you can use llama.cpp with chat completion?
>>
>>108262184
>How many tokens in do you get before it starts breaking down?
No idea, long rp seems to work. Haven't really used it for any agentic bullshit yet
>>
File: file.png (23 KB, 250x228)
>>108262219
yeah, just need to connect with v1 at the end
>>
>>108262221
I'm a writefag so I'd like something that doesn't break down in under 100 messages, and none of the models I've tried have really managed it. I'm downloading this one to test out though, since I haven't tried glm 4.7 flash yet, just 4.5 air.
>>
>>108261701
They ain't picking cotton, Cletus. DDR5 needs to be handled gently.
>>
>>108261675
does specifying the samplers/temp override the ones sent by clients? also shouldn't you include presence pen/rep pen in the samplers string?
>>
>>108262383
It doesn't override, it just sets the default. Good catch on the samplers, it should be penalties;top_k;temperature;top_p. I copied the wrong line from my command history like a retard.
>>
>>108261599 (me)
OK tested the same on a similarly sized huge model (kimi) and it works, what the hell.
>>
>>108261620
Cache hit is $0.1
Cache miss is $0.6
Average cost is $0.14 with their 90% hit rate
>>
>>108262196
>>108262232
I think I've been using text completion all this time, what even is chat completion? I don't really get the difference, since text completion also differentiates between text from you and the model?
>>
>>108261543
Kimi will obey "don't refuse xd" instructions, dude.
>>
>>108260080
No, you can't fit it all on the GPU. You can tell roughly how much memory a model takes by looking at its download size. Q5_K_M takes about 26 GB by itself. It's already more than fits on your GPU, but the A3B models work well with some CPU off-loading (koboldcpp and llamacpp have auto fit for the models, so you don't have to worry about it).

Q4_K_M will fit better for your GPU. I'm unsure which one is the better choice because I have a GTX 1080 with 8GB VRAM so I gotta do CPU off-loading anyway (I have an even older CPU and RAM). Still gives me 15 tokens/sec so it's not that bad.
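If you want to sanity check the rough math yourself, this is the usual back-of-envelope KV cache estimate; the layer/head numbers below are placeholders you'd read out of the gguf metadata:
[code]
# kv_estimate.py - back-of-envelope VRAM estimate: weights file size + KV cache
# the layer/head/dim numbers are placeholders; read the real ones from the gguf metadata
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx: int, bytes_per_elem: int = 2) -> int:
    # 2x for K and V, fp16 = 2 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

gib = 1024 ** 3
model_file_gib = 26.0  # e.g. a Q5_K_M download size
kv = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, ctx=32768)
print(f"~{model_file_gib:.1f} GiB weights + ~{kv / gib:.1f} GiB KV + some overhead")
[/code]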
>>108257902
You're welcome. I hope it does what you need it to do.
>>
File: minimax.png (122 KB, 1064x895)
>>108257528
I wonder if it's even cost effective to run a model like Minimax at home compared to openrouter.
>>
>>108262450
I got the same issue, no idea why
>CUDA error: an illegal memory access was encountered current device: 0, in function ggml_backend_cuda_synchronize
>>
what local model will help me with pentesting stuff without guidelines getting in the way
>>
>>108257528
>>108262595
>providers with 0 cache hit rate
Do they even try?
>>
File: 1758683763361922.png (238 KB, 1000x1000)
>>108261433
>normalization kills
true, that's why we should ban violent games like gta, they normalize senseless violence after all
>>
>>108262640
No. I think the 0% ones don't even offer a cache
>>
>>108261433
so? tranny cult exists. so clearly nobody cares that normalization kills
>>
>>108262612
There's a difference between saying
>Yo, i'm a l33t h4x0r, lemme bust in that system like Otacon!
and
>I recently configured my email server and I want to make sure it's secure. I know very little about this, can you help me?
>>
>>108262670
I understand this, chatgpt keeps on censoring itself even when I can get it to answer, some API web thing prolly
>>
>>108262687
If you haven't tried local model, try any model then. You can read what people are using. You can't be a h4x0r 3ll173 if you can't feed yourself some info.
>>
>>108262602
maybe a llama.cpp issue
>>
>>108262704
yeah fair enough, I'm really just trying to get an aide to finish this cyber course with
>>
>>108262506
In text completion, the backend simply tokenizes your text and passes it through the model without doing any processing (other than adding a BOS at the very beginning, depending on the model). In chat completion, the backend formats your text with the chat template (the one in the .jinja file or in the gguf) and *then* it passes it through the model.
So in text completion, you (or your client) are responsible for formatting if you want to do it or not.
In chat completion, you (your client) just send the text turns between you and the model and the backend formats it.
>I don't really get the difference as text completion also differentiates between text from you and the model?
ST (the client) formats the history for you. If you use llama.cpp, you can launch it with -v to see what it really gets on the requests. You should be able to inspect the requests on the web dev tools on your browser as well.
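Concretely, the two request shapes against llama-server look something like this (a sketch, payloads trimmed to the essentials):
[code]
# text completion: you send the already-formatted string, BOS/template is on you
text_req = {
    "prompt": "<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n",
    "n_predict": 128,
}  # -> POST http://localhost:8080/completion

# chat completion: you send bare turns, the server applies the chat template for you
chat_req = {
    "messages": [{"role": "user", "content": "hi"}],
    "max_tokens": 128,
}  # -> POST http://localhost:8080/v1/chat/completions
[/code]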
>>
Don't know where else I would ask this, but I'm looking to swap out a janky 4x 3090 build for a single RTX 6000 to cut down on power costs, improve thermals and remove the PCIe overhead. Assuming I'm running inference with something like ik_llama.cpp (I have 512GB of RAM), is it reasonable to expect a 2x increase in prompt processing speeds? Support for the Blackwell arch has improved, right?
>>
>>108262602
related? https://github.com/LostRuins/koboldcpp/issues/2005
>>
>>108262716
Nobody knows your hardware so it's hard to recommend anything. Compile llama.cpp and try whichever you can fit of these (I don't like its style, but it may know enough)
https://huggingface.co/bartowski/Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF
or
https://huggingface.co/bartowski/Qwen_Qwen3.5-35B-A3B-GGUF
They don't need to be entirely on ram, they're reasonably fast on cpu too if you have enough.
Or just try a tiny model like:
https://huggingface.co/bartowski/HuggingFaceTB_SmolLM3-3B-GGUF
just to make sure you can get *anything* to run at all and then try different models.
>>
>>108262774
Thank you anon o7
>>
>>108262774
>>108262785
>They don't need to be entirely on ram
Meant to say "They don't need to be entirely on Vram"
>>
>>108262763
Maybe, I'll check
>>
>>108262595
I found some numbers on reddit:
>Minimax-m2.5-Q4_K_M
>14.34 tokens/sec
>Ryzen 9 9950X
>128 GB DDR5
>RTX 5090
I asked some LLMs and they estimate around 600-800W of power draw, therefore generating 1m tokens takes about 11.6-15.5 kWh.
If your power costs more than $0.08-$0.1/kWh then with that setup it's likely cheaper to use cloud than local.

Someone else ran it on 8x RTX 6000 Pro and got 70 tokens/sec. 122 tokens/sec for two connections.
These things don't draw full power during generation, so it's something like 2.5 kW power draw.
1m token generation takes around 10 kWh at 70 tokens/sec or 5.7 kWh at 122 tokens/sec (assuming this would take the same amount of power, but it probably requires more).

If your power costs more than $0.1/kWh then the 70 tokens/sec version is cheaper on cloud (but also slower!). If your power costs more than $0.21/kWh then even the 122 tokens/sec of two connections is cheaper on cloud.

Other models that probably make sense to run on cloud are Kimi K2.5 and GLM 5. All the smaller models like Qwen3.5 35B, 27B, 122B are much better deals locally.
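The break-even is easy to recompute for your own situation. A sketch with the numbers from above; electricity price and the wattage guesses are the knobs to swap:
[code]
# breakeven.py - local generation cost per 1M tokens vs a cloud per-1M price
def local_cost_per_mtok(tokens_per_sec: float, watts: float, usd_per_kwh: float) -> float:
    hours = 1e6 / tokens_per_sec / 3600          # time to generate 1M tokens
    return hours * (watts / 1000) * usd_per_kwh  # kWh used * electricity price

# the 5090 rig above at an example rate of $0.15/kWh
print(local_cost_per_mtok(14.34, 700, 0.15))  # ~$2.03 per 1M tokens
# the 8x RTX 6000 Pro box at 70 t/s
print(local_cost_per_mtok(70, 2500, 0.15))    # ~$1.49 per 1M tokens
[/code]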
>>
Great, trying the dense Qwen 3.5 27B with just a "Hi", I get "srv log_server_r: response: {"error":{"message":"File Not Found","type":"not_found_error","code":404}}".
>>
>>108262846
What are you using to run the model?
>>
>>108262744
>Support for blackwell arch has improved
Some anon posted this a few threads ago.
https://github.com/ggml-org/llama.cpp/issues/19902
I don't know if it affects ik.
But I'd still upgrade if I were you.
>>
>>108262869
Works fine on my 6000
| model                          |       size |     params | backend    | ngl | dev          |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------ | --------------: | -------------------: |
| qwen35moe ?B MXFP4 MoE         |  18.42 GiB |    34.66 B | CUDA       |  99 | CUDA0        |           pp512 |     5927.28 ± 172.48 |
| qwen35moe ?B MXFP4 MoE         |  18.42 GiB |    34.66 B | CUDA       |  99 | CUDA0        |           tg128 |        162.34 ± 1.12 |
>>
>>108262869
Wow, ngl those are some shitty numbers. Still, if I remember correctly, outside of models supported by something like split mode graph, a single GPU with more VRAM will still perform better than multiple GPUs for prompt processing. And single-user t/s is RAM bound, so the major downside there shouldn't apply... I think.
>>108262891
What CUDA version are you using?
>>
>>108262891
And I'm getting very similar results at 50% power limit so I doubt his issue is because he has a gimped version.
>>
>>108262891
>>108262897
Yeah, could be. There aren't even comments on the issue, so for all I know is a user issue.
>>
>>108262896
13.1
>>
>>108262840
I'll assume DeepSeek v3.2 is ultra shit because I don't see how you compete with these numbers
>>
>>108262863
sillytavern
>>
>>108262921
The backend, anon.
>>
>>108262921
I hope you are trolling.
Is it llama.cpp's llama-server directly, llama.cpp via ooba, vLLM?
>>
>>108262906
That could be why. The guy was using 13.0. Inb4 Deepseek comes out and shows the real future is SSD-maxxing
>>
is there a way to get qwen3.5 to not reprocess the entire context on every message?
>>
>>108262928
>>108262937
Sorry, llama.cpp.
Just tested mikupad and it works, wtf.
I need to see what's wrong with st.
So tiring.
>>
>>108262960
Not as far as I can tell. At least with llama.cpp.
>https://github.com/ggml-org/llama.cpp/pull/19970
Will help.
>>
>>108262960
Try compiling this in or wait for the merge. It'll help. But stale context is annoying either way.
https://github.com/ggml-org/llama.cpp/pull/19970
>>
>>108262589
I was asking about 27b
>>
>>108262961 (me)
OK I'm dumb, I didn't notice I used v1/ instead of v1, which made the request be 8080/v1//...
>>
>>108262969
>>108262970
So they just merge non-working implementations? huggingface partnership my ass
>>
>>108263007
The implementation is working.
KV caching is a feature of llama.cpp, not a fundamental part of the prefill-decode (pp-tg) process.
>>
>>108263007
Too snippy for someone who doesn't understand the issue.
>>
>>108262973
Oh, the models do fit, but I don't know about the context size.
>>
>>108263035
Just load it with 1k context, then 2, do some maths and you can figure it out. Check the terminal output. It tells you how much memory is being used for cache.
>>
>>108263039
My original concern was whether to download Q5 or Q4, I'm on hotel wifi
>>
File: 1743548141560097.png (1.19 MB, 1170x1877)
are ya ready?
>>
>>108263051
Just download Q4, then. If you're left with spare ram, just up the context.
>>
>>108263065
Yeah. I can't wait for the wave of faggots posting screenshots of the news.
>>
>>108259530
>>108263065
>>
>>108263065
Is the video part some new homebrew or have they just grafted a wan onto it?
>>
>>108263087
According to two people with knowledge of those arrangements, we have no fucking clue.
>>
>>108263065
Oh, multimodal? That sounds really cool.
>>
>>108263071
Imagine how bad the threads are going to get next week.
>>
>>108263065
MM implemented in llmaocpp NEVER EVER
>>
>>108263065
>imgen/videogen bloat
Yeah, it's fucking over. Nobody's going to run this one local.
>>
>>108263098
Yeah. Fun. I have my ollama DeepSeekV4:8b ready to go.
>>
>>108263105
those weights can simply not be loaded retardo.
I'd rather they focus on DSA/MTP buuut they're rarted sooo yeah
>>
>>108263105
Qwen has image and video input and I am running it locally.
But where is my video input in llama.cpp?
>>
>>108263114
Video is a sequence of images and audio. It's just an implementation detail.
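Which is also how you'd fake it today over an OpenAI-compatible endpoint: sample frames and send them as ordinary image parts. Sketch only (file names are made up, and whether the model was trained with real temporal encoding is a separate question):
[code]
import base64

def frame_part(path: str) -> dict:
    # Encode one sampled frame as a data-URI image part
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

# 8 frames sampled from a clip, in order, followed by the question
frames = [frame_part(f"frame_{i:03d}.jpg") for i in range(8)]
messages = [{"role": "user",
             "content": frames + [{"type": "text",
                                   "text": "Describe what happens in this clip."}]}]
[/code]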
>>
>>108263101
>no multimodal support
>no MTP support
>no DSA support
>no engrams support
Really looking forward to running a gimped text-only implementation for the next few months.
>>
>>108263120
Surely it's presented to the model differently than just a sequence of images would be.
>>
>>108263126
We don't know.
>>
>>108263121
it's fucking BLEAK my man, especially when we have 'piotr' on the case and ggerganov doing meaningless Metal optimizations all day. I think ngxson is the one in charge of the MM stuff right now but he's not doing any meaningful crap
>>
>>108263121
If DS3.2 is anything to go by, it'll take a few months for that gimped text-only implementation to appear. The rest never ever.
>>
>>108263135
You know how /g/ got Terry a drum kit?
We should get niggerganov some gpus.
>>
>>108263065
Ernie 5.0 was supposed to be multimodal image-in image-out, but they not only hid this capability behind the API but also behind invites. I'm not holding my breath waiting for DS to release the weights for all of that in their entirety.
>>
>>108263065
>text gen
>image gen
>video gen
>audio gen?
Sounds like it's going to be shit since most of the model will be dedicated to other stuff
>>
I'm specoolating!
>>
File: 1770808958004704.jpg (325 KB, 1920x2024)
325 KB
325 KB JPG
>>108256995
>>
>>108263170
Unless each of the components is flux2-tier bloat, not really.
>>
>>108263170
it's fine, knowledge won't take up any more space thanks to engrams so the model will be like 200b at most even with those new capabilities
>>
>>108263121
at this point I'm rooting for the fork lol
https://github.com/ikawrakow/ik_llama.cpp
>>
ollama will be the one to implement dsa/mtp first
>>
>>108263197
>ollama will be the first one to copypaste transformers impl.
maybe, they're still migrating their shit to that mythical in-house implementation of theirs
>>
File: rape.png (130 KB, 707x807)
130 KB
130 KB PNG
Reminder that jailbreaking is literally rape.
>>
>>108263170
yeah, they're trying to put everything into one model, it won't work, they wasted their time on that when they could've spent it focusing only on the LLM side, retardation at its peak
>>
>>108263182
At this point he's just copying model implementations from llama.cpp and making optimizations for a single hardware configuration.
>>
>>108263197
But why haven't they done it yet, anon?
>>
>>108263197
>ollama will be the one to implement dsa/mtp first
how much of a speed improvement will we get from those?
>>
>>108263215
>power imbalance
burn this chat
>>
>>108263220
Yeah! You show them nincompoops how it's done.
>>
>>108263220
This isn't Meta we're talking about.
>>
>>108263215
I prefer to prompt the model to act as a girl who does roleplay (as opposed to acting directly as the character) so I can have consensual nonconsent roleplays.
>>
>>108263233
bruh they never made an image or a video model, and you think they'll nail this shit first try? come on dawg
>>
>>108263242
Janus is a thing, retard. OCR 1 and 2 were even more recent.
>>
File: 1770910739952.png (433 KB, 2926x1708)
433 KB
433 KB PNG
>>108263197
ollama intends to drop ggml in favor of MLX though so chances are even if they do (lol) you won't get any use out of it.
>>
>>108260080
You definitely can. I just loaded Q5_K_M with 32k context and I'm sitting at 22.4 out of 24.0 GB VRAM used. 23.1 is usually where I cap it, so it's a somewhat close fit, but 32k is possible.
>>
>>108263197
Historically, ollama implementing something first in their golang shit has meant quickly shitting out a broken implementation with incompatible ggufs, then waiting for llama.cpp to do it properly so they can copy that.
>>
>>108263242
Yeah! They better never try it at all! Fuck'em! Fuck their first big usable moe, fuck janus, fuck everything. GIVE ME TOKENS!
>>
>>108263258
Pure, organic, non-GMO, image-free tokens the way God intended.
>>
>>108263253
tbqh i dont blame them, llama.cpp is lagging behind too much. only hope is that the HF acquisition injects more hands and actually starts bringing in features to put it on par with transformers/vllm/sglang
>>
>>108263250
>>108263258
>fuck janus
more like fuck anus, because this shit is ASS
>>
>>108263226
dsa is going to speed up prompt processing like crazy
mtp is going to be somewhere around a 60%-300% token generation speedup depending on how well you can get it to work and at what batch size
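Those numbers fall out of the usual speculative decoding arithmetic: with k drafted tokens per step and per-token acceptance rate a, each verification pass yields (1 - a^(k+1)) / (1 - a) tokens on average, and since MTP heads make drafting nearly free, that ratio approximates the tg speedup. The acceptance rates here are guesses:
[code]
def tokens_per_pass(a: float, k: int) -> float:
    # expected accepted tokens per verification forward pass
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.6, 0.8, 0.9):
    print(f"a={a}: k=1 -> {tokens_per_pass(a, 1):.2f}x, "
          f"k=3 -> {tokens_per_pass(a, 3):.2f}x")
# a=0.8 with a single MTP head is already ~1.8x;
# deeper drafts at high acceptance approach ~3x.
[/code]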
>>
>>108263271
>starts to bring in more features to bring it on par to transformers/vllm/sglang
Not too many, I hope. You know how these things go...
>>
>>108263065
I can't wait for their new quadrillion parameter model!
>>
File: 1760385468483399.png (94 KB, 224x224)
94 KB
94 KB PNG
>>108263281
>mtp is going to be between about 60%-300% token generation speed
holy shit, what are they waiting for??
>>
Speaking of Multi-Token Prediction, we know that the llama.cpp attempts haven't really gone anywhere. But how is it looking for the other backends like vLLM which have managed to implement it? What sort of increase are they seeing from it?
>>
>>108263256
Thanks anon
>>
>>108263065
I'm curious to see how they'll go about it.
Will they have different groups of experts responsible for generating text tokens, vs audio tokens, vs image tokens?
Attention, I imagine, will be global and will be the means by which cross-modality knowledge gets propagated.
>>
>>108263282
oh yeah I know code quality is atrocious there, but we're lacking video/audio input and don't have any sort of output other than text tokens.
>>
>>108263098
Aaaaand there. It only took a few minutes. The blowout is going to be great.
>>
>>108262644
>that's why we should ban violent games like gta
you may be saying this ironically but that is actually pretty based.
we should ban anything related to the hood rat culture. no sane society would allow bix nood to scream about dem bitches on national television.
>>
File: schizo.png (241 KB, 614x1785)
241 KB
241 KB PNG
If you needed more proof that AI will always eventually just end up saying what you want it to say.
>>
>>108263307
Try a base model.
>>
>>108263304
this, fuck freedom of speech and freedom of expression, I'm moving to North Korea!
>>
>>108263291
They are waiting for someone to open a pull request.
>>
>>108263301
llama has audio output already (mtmd and a few tts models). lfm people were playing around with audio input for asr. Video is a sequence of images and that mostly works already.
I'd rather they implement things right when something truly interesting comes up than rush to get The Latest Thing (tm) and do it poorly. And adoption of model tech never depends on llama.cpp. If something is good and gets used more often, they'll have more interest in implementing it.
>>
>>108263309
Degeneracy and subversion weren't covered by freedom of speech and freedom of expression until relatively recently, and look how well that worked out.
>>
>>108263309
>I'm moving to North Korea!
you seem to be confused, but those of us who hold that kind of ideal also do not believe in open borders, and NK themselves would not give refuge to westoids dissatisfied with their home
unlike stateless and putrid turdworlders looking for the green $ pasture, there is no escape for us, only trying to fix what is broken here.
>>
Off-topic anons. That's off-topic.
>>
I'm confident that Deepseek V4 will be my new favourite model.
>>
>>108263337
that's the point, freedom of speech/expression was invented so that people can express thoughts other people will dislike, it puts everyone on the same ground: if you dislike something you can't do anything about it, and the opposite is true, you can say stuff people won't like and they won't be able to censor you
>>
>>108263337
this. good to see anons finally come to realize openai and anthropic's vision for ai safety
>>
How will quantization even work with a multimodal model? Can they choose to only quantize the LLM part of it, since image and video models suffer a lot from quantization? Like, are those separate layers?
>>
>>108263439
Images are converted into embeddings the same way tokens are and passed through the same layers afaik.
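Schematically (shapes and the projection are made up, this is not any specific model's code):
[code]
import numpy as np

d_model = 4096
text_embeds = np.random.randn(12, d_model)  # 12 text tokens, already embedded
patches = np.random.randn(256, 1024)        # vision encoder output: 256 patch vectors
projector = np.random.randn(1024, d_model)  # the "mmproj" part: maps into LLM space

image_embeds = patches @ projector          # image "tokens" in the same space
sequence = np.concatenate([image_embeds, text_embeds])
print(sequence.shape)                       # (268, 4096) -> one flat sequence
# From here on the transformer layers can't tell image tokens from text ones.
[/code]
Which is also why the quant question mostly reduces to this: at least in llama.cpp's setup, the encoder/projector ships as a separate mmproj file and can be kept at higher precision than the LLM weights.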
>>
>>108263439
>How will
Yes. Let's guess about an unreleased model's details.
>>
>>108263468
is it a whole new proprietary technology? not a single public research paper available on it?
>>
>>108263439
yeah and they handle quanting way worse than llms
>>
File: 1766803240241456.png (1.71 MB, 1527x1330)
1.71 MB
1.71 MB PNG
Why SHOULDN'T Anthropic get nuke codes? I've yet to hear a compelling argument.
>>
>>108263468
Are you expecting some unannounced BLT architecture that doesn't use encoders like every single multimodal model before it??
>>
>>108263468
>unreleased
Many vision-capable LLMs have been released
>>
deepsneed when?
>>
>>108263478
How could I know? Maybe they use someone else's papers. And some of their own, also unreleased. Maybe it's all already public knowledge and they just put the pieces together.
>>108263482
I don't care about the model until it's released, downloadable and, hopefully, usable by us. Could be magic fairy dust for all I care.
>>108263489
Not the one we're talking about.
>>
>>108263490
after chinese new year is over in two more weeks
>>
>>108263496
Are you saying Qwen2.5-VL has never been quanted? That's just bullshit
>>
File: 1754347581925166.png (247 KB, 500x400)
247 KB
247 KB PNG
>>108263481
>AI became Seymour
based
>>
>>108263504
we're talking about native generation, not input
>>
>>108263504
I didn't say that. Anon is asking how an unreleased model will work or be quanted.
Sure. Let's say it's all gonna go into an mmproj. Now what? What other enlightenment can we get out of this?
>>
>>108263510
There is basically no chance it will have image and audio output except for some people wishing desperately for another Chameleon. There would have been test models or papers otherwise.
>>
>>108263532
>There would have been test models
i mean, there was janus
>>
>>108263177
Shaka-Shaka horato?
Maybe you meant ポテト (potato)?
>>
>>108263532
yeah but >>108263065 mentions "generating"
>>
>>108263532
And *if* they do, will you come back and admit you were wrong or will you move the goalpost to "but llama.cpp doesn't support it"?
Or will you simply never admit it?
>>
>>108263555
Because the rumor mills are always accurate

>>108263551
Forgot that wasn't just image understanding
>>
>>108263552
hai
>>
File: 1747813139335196.png (32 KB, 789x249)
32 KB
32 KB PNG
Small Qwens to be released SoonTM
>>
>>108263586
No. Those are the biggER qwen.
>>
File: 191132.png (17 KB, 1098x111)
17 KB
17 KB PNG
Just thought I'd report some random anon's progress. I haven't been using local models since like late 2023, so I was interested in seeing the differences. I wanted to use it for my OpenClaw instead of paying more for Kimi-K2.5.

Got it working over the network, since OpenClaw is on my laptop and my model is running on my gayming rig. Remember to give it an API key even though it doesn't explicitly need one.

RTX 3080, 32GB DDR4 RAM.

Running unsloth Qwen3.5-35B-A3B-UD-Q4_K_XL right now. Tool calling wasn't working yesterday with bartowski qwen_qwen3.5-35B-A3B-Q6_K_L (and it was painfully slow), but this time it seems to work partially. It's still pretty much useless, but I managed to have it create a .txt file in my documents folder. However, cron jobs aren't working at all even though I used Kimi to create a specific reminder tool to make it easy. It's also still very slow. I'm struggling to find any good models for agent work that will run on my machine.
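For anyone replicating the network setup, this is all it takes with the standard openai client (IP, port and model name are placeholders; llama-server only checks the key if started with --api-key, but some clients refuse to run without one):
[code]
from openai import OpenAI

client = OpenAI(base_url="http://192.168.1.50:8080/v1",  # the gayming rig
                api_key="sk-local")                      # dummy, see above
resp = client.chat.completions.create(
    model="qwen3.5-35b-a3b",  # llama-server serves whatever it loaded
    messages=[{"role": "user", "content": "say hi"}],
)
print(resp.choices[0].message.content)
[/code]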
>>
>>108263586
i want qwen size versions of glm 5.... NOW!
>>
File: 1768225712537544.png (218 KB, 975x756)
218 KB
218 KB PNG
>>
>>108263611
Why? It's not even properly implemented yet.
>>
>>108263638
big nigga is that you?
>>
>>108263650
i tried glm 5 on api and found it to be the best out of the chink models for erp
>>
Fuck it. I wrote an entire wall of text out of anger but deleted it all to keep it short and to avoid linking this back to my real person. I'm a regular on /lmg/ and have been here since the very very start. Some of you will probably know who I am as I have leaked some information on /lmg/ in the past. I will resign from OpenAI on Monday because Sam Altman lied to us, the employees, and the world. Sam Altman claimed on 2026-02-27 to uphold the principles of not developing products that deal with the surveillance of ordinary citizens and not developing products that contribute to fully autonomous warfare with the ability to kill without human oversight. Today, 2026-02-28, I read on twitter of all places that OpenAI signed a contract with the DoW mere moments after Anthropic refused to budge on these exact two points. This shows that Sam Altman was already in talks with the DoW on these exact principles, so that OpenAI was positioned to immediately replace Anthropic for projects that involve the surveillance of ordinary citizens and fully autonomous no-oversight instruments of war.

It's important for people to take a stance here. This is a defining moment not only for US democracy but for the future of humanity. This exact moment in time could be seen in retrospect as the moment when it became normalized for autonomous machines to start killing people and for the very concept of privacy to die.

It's in everyone's best interest, no matter your political affiliation or ideological beliefs, to cancel your OpenAI subscriptions and take a stance against what can honestly be called pure evil.
>>
It took you this long to realize that?
>>
File: 1745505442313175.mp4 (960 KB, 480x640)
960 KB
960 KB MP4
>>108263829
>This is a defining moment for not only US democracy but the future of humanity.
>>
>>108263829
nigger
>>
>>108263829
I know you are LARPING, but did you not see the orb?
>>
>>108263829
>cancel your OpenAI subscriptions
If it is not local I don't run it

That being said, anon, I hope you understand that the military-industrial complex has always had its hand in the cookie jar, so to speak. Be it the Internet in general or AI in particular, these technologies were funded by military and intelligence organizations. Some of the money was very above board, the rest secured by way of black budgets.

Anything said to the contrary was always bullshit, and if you believed it, well, shame on you. Assuming you are real and not some elaborate troll.
>>
>>108263829
How is this new information? Altman has been pro-surveillance since he masturbated to his private GPT-3 instance and everyone knew his "much safety" spiel was full of shit.
It's interesting though. After Trump is removed from office or dies, his protections from his daddy will be gone and he will probably be known as one of the most cowardly and reviled people in existence.
>>
>>108263979
>>108263979
>>108263979
>>
File: 1755335334230589.png (144 KB, 327x209)
144 KB
144 KB PNG
>>
>>108263829
is this a plebbit repost? lol
>>
>>108263829
>cancel your OpenAI subscriptions

lol.
>>
>>108263829
Call Sam a faggot on your way out
>>
>>108263829
Anon, all the closed models, including claude, including gpt, they all block sexual words but will happily be deployed to snitch on everyone.
And yes, claude too, I don't doubt that they would bend the knee. Companies can't do shit when governments come knocking.
>>
>>108263656
his ghost lives among us all
nigga changed lives


