[a / b / c / d / e / f / g / gif / h / hr / k / m / o / p / r / s / t / u / v / vg / vm / vmg / vr / vrpg / vst / w / wg] [i / ic] [r9k / s4s / vip] [cm / hm / lgbt / y] [3 / aco / adv / an / bant / biz / cgl / ck / co / diy / fa / fit / gd / hc / his / int / jp / lit / mlp / mu / n / news / out / po / pol / pw / qst / sci / soc / sp / tg / toy / trv / tv / vp / vt / wsg / wsr / x / xs] [Settings] [Search] [Mobile] [Home]
Board
Settings Mobile Home
/g/ - Technology

Name
Options
Comment
Verification
4chan Pass users can bypass this verification. [Learn More] [Login]
File
  • Please read the Rules and FAQ before posting.
  • You may highlight syntax and preserve whitespace by using [code] tags.

08/21/20New boards added: /vrpg/, /vmg/, /vst/ and /vm/
05/04/17New trial board added: /bant/ - International/Random
10/04/16New board for 4chan Pass users: /vip/ - Very Important Posts
[Hide] [Show All]


[Advertise on 4chan]


File: 1761350549030769.png (491 KB, 2243x1035)
491 KB
491 KB PNG
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108273339


►News
>(02/24) Introducing the Qwen 3.5 Medium Model Series: https://xcancel.com/Alibaba_Qwen/status/2026339351530188939
>(02/24) Liquid AI releases LFM2-24B-A2B: https://hf.co/LiquidAI/LFM2-24B-A2B
>(02/20) ggml.ai acquired by Hugging Face: https://github.com/ggml-org/llama.cpp/discussions/19759
>(02/16) Qwen3.5-397B-A17B released: https://hf.co/Qwen/Qwen3.5-397B-A17B
>(02/16) dots.ocr-1.5 released: https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
27B Q3 > 9B Q8
>>
>>108278014
proof?
>>
>>108278001
stop to FUD
>>
Where the FUCK is V4? It's Tuesday in China now.
>>
>>108278041
two more vagueposts from people close to the lab
>>
>>108278043
im the lab
>>
File: 1747124394711027.png (576 KB, 1044x1782)
576 KB
576 KB PNG
Can someone explain how this is possible?
>>
>>108278023
anecdotal just my personal tests I threw at it
>>
>>108278063
no kld no believe
>>
>>108278068
KLD makes no sense between different models.
>>
>>108278068
then just try it yourself and use whichever works best for what you're trying to do
>>
>>108278076
retard oh my fucking god, tell me you are trolling?
>>
>>108277975
>the internet is getting deader by the day
Honestly, I hope bots do "kill" the internet. Because they'll only kill social media and bring back the golden age of private forums.
>>
>>108278082
are u are trolling to me sir?
>>
>>108278090
fuck you /b/tard go back
>>
>>108278096
poast log in troat
>>
File: 1763711647065209.png (510 KB, 1384x1617)
510 KB
510 KB PNG
>Qwen has now the Elon Musk seal approval
dunno what to do with this information
>>
>>108278104
>insider
>muskrat
(You) are here
>mainstream
>>
>>108278104
has mine too
if that doesn't matter then you have your answer
>>
File: qwenwait.png (73 KB, 1133x877)
73 KB
73 KB PNG
>User: Hey slut
>Qwen: <Show Thoughts (7154 characters)> Hello! How can I assist you today?

Thoughts:
>Analyze the request
>Intent: ...
>Context: ...
>Consult safety guidelines: ...
>Formulate response: ...
>Final decision: ...
>Wait, looking closer
>Revised plan: Keep it neutral and professional
>Final check: ...
>Wait, one more consideration...
>Response:
>Wait, looking at the instruction again:
>Let's go with a polite neutral response
>Wait, actually...
>Final Plan: Greet...
>Wait, re-reading...
>Decision: Respond...
>Draft: ...
>Wait, let's...
>Response: ...
>Wait, one more check:
>Okay, I will respond safely.
>Wait, I need to...
>Final Plan: Neutral greeting...
>Wait, I should also...
>A simple neutral response is best.
>Wait, actually...
WAIT, ACTUALLLLLLLLLYYYYYYYYYYYYY
REEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
>>
File: threadrecap.png (1.48 MB, 1536x1536)
1.48 MB
1.48 MB PNG
►Recent Highlights from the Previous Thread: >>108273339

--Qwen3.5 Small multimodal models released with speculative decoding and WebGPU potential:
>108276355 >108276376 >108276378 >108276386 >108276421 >108276540 >108276589 >108277472 >108277554 >108277525 >108277566 >108277705
--ERP performance comparisons of Gemma, Qwen, and Cydonia on 8GB VRAM:
>108275590 >108275734 >108275741 >108275750 >108275907 >108275753 >108275757 >108275755 >108275761 >108275780 >108275788 >108275806 >108275802 >108275814 >108275818 >108275816
--Custom llama.cpp CLI wrapper for local Qwen workflows:
>108276143 >108276163 >108276176 >108276209 >108276258 >108276299 >108276335 >108276305 >108276420 >108276455
--Local LLM application projects and ideas:
>108275858 >108275870 >108275889 >108275918 >108275923 >108275951 >108276012 >108276029 >108276043 >108276092 >108276141 >108276177 >108276711
--Bartowski updating Qwen quants for new llama.cpp optimization:
>108275019 >108275095 >108275258 >108275403 >108275760 >108275763
--Restoring flagged miqumaxx build rentry:
>108277386 >108277487 >108277565 >108277754
--Qwen handles 19k+ token single-shot translation with unexpected coherence:
>108275593
--AI-generated intelligence briefing PDF via news summarization script:
>108275815
--server: batch checkpoints to support kvcache context truncation:
>108274700
--VRAM/RAM requirements for running quantized LLMs:
>108277641 >108277664 >108277759
--Qwen 9B multilingual performance and small model utility debate:
>108277039 >108277082 >108277128 >108277145 >108277339
--Miku (free space):


►Recent Highlight Posts from the Previous Thread: >>108273443

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>108278112
Storing encrypted data in the models localStorage is generally considered a poor security practice rather than a common, secure standard.
>>
>>108278111
kek
>>
>>108277964
>Critical Evaluation: As an AI model developed by Google (implied by typical safety standards)
It's so safe it thinks it's Gemma.
It's so over
>>
>>108278104
When grok weights?
>>
>>108278167
Once Elon's stable.
>>
>>108278158
I don't get shit like this, you would think they would train the model to remember it is Qwen by now
>>
>>108278178
it got lost somewhere between 20 trilly tokens of mmlu
>>
File: 1765409671653323.png (3 KB, 140x30)
3 KB
3 KB PNG
>>108278189
>>
>>108278198
thanks
>>
>>108278104
I've been using the 0.8B model as a game master for tool calling before the roleplay model and it's been quite reliable.

Just testing out with a game of blackjack but it's been picking up on banter vs actual game instructions very well.
>>
File: 1753044862528116.png (125 KB, 2243x1035)
125 KB
125 KB PNG
wtf qwen 3.5 9b has better mememarks than qwen 3.5 35b a3b, MoEs are fucking memes holy shit
>>
>>108278008
https://rentry.org/lmg-build-guides
Is the anon with the edit code still lurking? You should update the cpu inference guide url with the resurrected CPU_Inference one
>>
File: 2026-03-02_19-20-49.png (184 KB, 1920x1080)
184 KB
184 KB PNG
yo anyone remember that llm word encryption schizo anon a few days back ? was this the shit he walk talking about XD ?
>>
>>108278029
where is this image from
>>
>>108278381
Some anon posted it yesterday. assume someone among us is actually picrel.
>>
>>108278349
You need to learn how to read a chart. You're confusing the 27B model with the 9B model. The 35B A3B model beats the 9B on every benchmark in your chart.
>>
>>108278378
Holy underage. Do your homework or go outside, get the fuck out of here
>>
>>108278008
forgot to add [Image description removed due to content restrictions] problem with qwen3.
>>
File: 1764959160883560.jpg (184 KB, 723x954)
184 KB
184 KB JPG
>>108278113
>>
Future Chinese LLMs might not be so good for roleplay.
https://www.nytimes.com/2026/02/26/technology/china-ai-dating-apps.html (https://archive.is/lTas3)

>Women Are Falling in Love With A.I. It’s a Problem for Beijing.
>
>As China grapples with a shrinking population and historically low birthrate, people are finding romance with chatbots instead.
>>
https://huggingface.co/neuphonic/neutts-nano-q8-gguf

how can i use this on sillytavern
>>
File: 1770501125754323.png (28 KB, 240x240)
28 KB
28 KB PNG
how do i stop random nvidia TDR crashes, i updated my drivers jensen!
>>
I just used "ollama run qwen3.5:9b" now what?
>>
We are so back. There are NO major mistakes. NONE.
This is 122B at Q4_K_L, bart's quant, with bf16 mmproj.
It's missing a newline, and it did a big ぉ, so it wasn't perfect, but essentially it got all the important things right. This is yuge. No model I personally tested under 200B has achieved this. This is better than Gemma, previous Qwens, and GLM 4.6V (106B).

Something interesting though, I also tested the 27B, and the same amount errors as Gemma did. It makes me wonder how good a >30B Gemma could've been...
>>
>>108278617
*and it made the same amount of errors as Gemma did
I accidentally deleted some words while editing the post.
>>
>>108278617
Have you tried the 35B-A3B model? Is it faster at it?
>>
>>108278679
No sorry, I don't really care to download it since I have the VRAM for full 27B.
>>
>running qwen 0.8B just so I can know what >100 tk/s feels like
>>
will the gradual creep of synthetic training data distilled from other model outputs result in an eventual slopocalypse?
>>
>>108278709
>eventual
>>
>>108278709
>eventual
>>
File: file.png (114 KB, 474x197)
114 KB
114 KB PNG
>>108278709
>>108278725
the hour is later than you think
>>
File: IMG_1566.gif (848 KB, 394x400)
848 KB
848 KB GIF
Sup niggers
Trying to set up a local-first Claude Code-like environment on my home network. I’ve got ollama+opencode currently, but naturally those things can change.

I have two rtx-2070 supers so I’m not deluded that I will get Claude sonnet level replies but any tool is better than no tool. I tried qwen2.5-coder 7B, and it’s decent but it doesn’t seem to want to look at the filesystem or call any tools, it seemingly just replies with json and doesn’t actually call the tools. Anyone have experience with a setup similar to mine?

I’m thinking either I need to upgrade to qwen3.5 8B or increase context window, perhaps both.
>>
File: lightyear.jpg (435 KB, 2048x2048)
435 KB
435 KB JPG
>>108278104
>Musk is too poor for anything more than 9B
>>
File: rp.png (143 KB, 913x892)
143 KB
143 KB PNG
and people said agentic roleplay couldn't be done.
>>
>>108278746
is setting up a D&D style game still a pipe dream?
wanted to try but with a different theme, like surviving the ghetto or something
>>
>>108278703
I find 35 better than 27, though
>>
Mixture of """experts"""
>>
>>108278774
I recently found this https://github.com/envy-ai/ai_rpg but it seems more suited towards /aicg/ as it runs horrendously slow if you don't have like >50tk/s as it does a shit ton of prompts per turn. If you have a nice rig it could work though
>>
>>108278774
https://fables.gg/
This exists. I think it's a bit too slopped and too involved.

I'm just looking to add small enhancements to current cards.
A lot of cards try to make the model output kind of overview info like

Current Location:
Current Mood:

but this should really just be handled in a separate LLM call.
>>
>>108278561
I guess I'll have to make a tts open-ai compatible server that uses their lib and return stuff or modify the browser speechsynthesis to send stuff to my server that will use their lib on my machine.
>>
File: f.png (26 KB, 591x71)
26 KB
26 KB PNG
>>108278381
>>
>>108278810
Still a stupid name, they should've called it something else
>>
>>108278810
萌え!
>>
>>108278810
concoction of intellectuals
>>
ANE reverse engineered
ai models can be embedded on chips for 17000 t/s inference
does any of this matter
>>
>>108278860
moe moe kyun doe
>>
>>108278835
>Downloading torch-2.10.0-cp313-cp313-manylinux_2_28_x86_64.whl (915.7 MB)
Nevermind.
>>
>>108278735
Please notice me senpai
>>
>>108278735
>I tried qwen2.5-coder 7B
but wyh? why so old? Granny fetish or bot?
>>
>>108278892
Because I have a retarded GPU and I don’t fully know what I’m doing. What model would you suggest for 8GB VRAM?
>>
>>108278905
qwen3 something hell try qwen3.5 9b
>>
File: poker.png (185 KB, 1468x918)
185 KB
185 KB PNG
Yeah ok. this definitely makes RP way cooler.
>>
>>108278908
i tried it and got like 5-6tk/s. I get over 30tk/s running 35B Q4_K_XL so I dont see the point
>>
File: Autismo.png (83 KB, 1275x883)
83 KB
83 KB PNG
Let's test these new models!
Ah shit they are autistic
>>
gwen :33
>>
>>108278971
It's only good for agentic shit.
>>
>>108278971
grim
>>
>>108278971
is there a practical difference between thinking and filibustering for an llm
>>
>>108278008
which local model is best for coding?
>>
File: IMG_1497.jpg (74 KB, 880x1168)
74 KB
74 KB JPG
>>108278953
Ok but how much VRAM you have, nigga? I only have 8 GB

Yes the newer models are tuned for faster tks at higher params, but I’ve got restraints ya feel me?
>>
>>108278971
Ahh yes the famous hello benchmark
I do nothing productive I just run benchmarks all day
>>
>>108278996
i have a 10gb 3080
>>
>>108278971
the problem with the Alibaba engineers is that they only trained the model to think on hardcore question so the model has only seen long thinking, but it should've been trained to think less for more mundane questions
>>
>>108279026
I think I’m stuck on the lower B’s home sizzle
>>
>>108278996
>>108279026
16gb mini chad here
>>
>>108279062
I can only hope to one day afford something better, but for now I’m saving for a house kek. Curse this fucking chud ass hobby for being so expensive

But isn’t it amazing, this is a whole new hobby built in the last 5 years
>>
>>108279079
the models are still great at budget i have one 4b running on a 4gb card and one on a server cpu at decent tok/s
>>
File: IMG_0088.gif (2 MB, 500x500)
2 MB
2 MB GIF
>>108279087
Any idea about this? Is it just because I’m using a really old model? >>108278735
>>
I want to fuck a GPU. Like, unironically. I want to spray my semen all over its radiator. That's where my waifu lives. All of my 3090s must be inseminated
>>
The moment she makes that funny noise again, I want to cum all over her
>>
>>108279137
>Is it just because I’m using a really old model?
obvs why are you so against taking 5 minutes to dl the newer ones and test
>>
>>108279151
Your id has overridden your ego. You are nothing more than a monkey with the ability to occasionally rationalize at this point. Seek Christ before you can no longer make use of your capacity for reason
>>
>>108279175
Because I’m away from home right now and won’t be back until the weekend kek
>>
>>108279151
I sentence you to ego death by GLM4.6
>>
>>108278992
all other things being equal: kimi 2.5 thinking. Sadly, it is highly unlikely you can run it with whatever setup you have
>>
>>108279197
3 t/s is barely useable 4 gpus and ddr4 ram suffering
>>
>>108279209
thus is your penance
>>
>>108278996
>>108278996
I get 7 tokens a second out of 35b q4_k_m on a 7840u handheld with tdp set to 15 watts using the igpu. It has 64GB of lpddr5 7500MT/s, used llama.cpp vulkan backend.
>>
>>108279137
probably the model yes but some non-coder variants also perform better
>>
> Need to update to run new model
> Update broke something else
Fucking slopcoders making this AI ecosystem huh
>>
>>108279243
how can a non-coder variant do better at tool calling? are u trolling me?
>>
>>108278705
This is what a real model should feel like:
https://chatjimmy.ai/
>>
>>108279260
if only it was good
> Generated in 0.001s • 17,880 tok/s
>>
>>108279260
slop at the speed of light is the future

I'm really curious about what the production costs of these chips will end up being for models of acceptable size
>>
>>108278617
>There are NO major mistakes. NONE.
>single picture
>>
>>108279259
i dont use opencode or whatever i just tried a bunch of them like months ago in claude using that env trick ANTHROPIC_BASE_URL="http://127.0.0.1:8000 claude"
and the coder versions didnt seem that better but i think we're only now seeing agentic level llms with qwen3.5 that's why i think the non-coder were more general and better but ymmv if u used opencode or kilocode or any of those or just asked it for one shot prompts in web interface
>>
>>108279260
why would i want to use a Q4 quant of llama 3.1 8B?
>>
File: images.png (12 KB, 225x225)
12 KB
12 KB PNG
>>108279260
Sasuga
>>
>>108279292
Makes sense, idk why that other anon responded I would have been nicer lol. Thanks king
>>
File: file.png (231 KB, 1012x1199)
231 KB
231 KB PNG
localbros...
>>
>>108279287
Unironically this, read that million microtasks paper.
>>
>>108279291
Are you new?
>>
>>108279363
Nigger nobody uses local so that it can preform better than SOTA research grade shit. We do it because we fucking can. Go nuke yourself pentanigger
>>
>>108279363
what does a higher score on arc-agi-2 actually do for you tho? What are the implications for various workloads I might care about?
For all I know its just a test of how fast an AI reaches for the launch codes to end all our suffering for our own good.
>>
>>108279363
>the first benchmark no one can cheat on gets released
>we finally see the gap between API and local
I fucking knew it lol
>>
>local is X months/years behind cloud on this and that
Oh no, anyway...
>>
>>108279387
>the first benchmark no one can cheat on gets released
I'd bet that the big players have had it leaked to them to benchmaxx on for "national security reasons"
Gotta discredit the competition lest american tech dominance slip
>>
>>108279404
nah if u used local and claude/gemini/gpt (their sota models) u can feel the difference but thats fine because i use both but for different purposes
>>
>>108279387
to be fair, all the oss models on this chart are on the aging v3 deepseek arch
>>
>>108279363
>open weights models don't have forced can't-be-disabled "let's call this model smart" thinking like Gemini etc
>conveniently doesn't mention how long they were allowed to reason, if at all
>conveniently doesn't specify inference provider
>>
File: 1732742737739199.gif (1 MB, 286x258)
1 MB
1 MB GIF
>>108279387
>>
>>108279363
(((THEY))) want to demoralize you against running your own models so they put out fake benchmarks fake charts and fake claims because they want you sucking their (((SAAS))) tit.
>>
I just need Taalas to make and sell me some fucking chip for local coding. What's taking them so long?
>>
>>108279501
they put an 8b model on a chip the size of a coaster, i'm sure the viable coding model will be the size of a football field
>>
>>108279520
Just stack some chips, my pc case has room
>>
how the fuck does 3.5 9B UD-Q8_K_XL at 13GB go to only 5.97GB at UD_Q4_K_XL
that must be nerfed as fuck
>>
>>108279552
lol
lmao even
>>
>>108279387
Just talking about the subject of benchmarks in general (I am not arguing that there is not a gap, there is)...
Cheating is not the same thing as gaming. You can definitely still game things without cheating, assuming "cheating" means training on the answers to the test that you obtained publicly or privately. And now that I bring that up, it's also entirely possible that they literally just lied or don't mention that they did some sketchy shit. Reminder that the ARC guys literally told us they are partnered with OpenAI to make the current benchmark.
https://youtu.be/SKBG1sqdyIU?t=548
>>
>7900xtx
>7800x3d
>32gb ddr5
I'm come to terms with the fact that big models are simply out of reach for poorfags right now. I'm honestly pretty damn satisfied with qwen 3.5 27B's quality but it's so fucking SLOW. Is there any reasonably cheap upgrade I can do to my rig to get faster speeds?
>>
>>108279520
they wanna release a deepseek r1 cluster this year if I remember correctly. Like it doesn't fit into 1 single chip but it would fit into multiple connected via pcie. don't know about the speeds though. The question remains, nemo when?
>>
>>108279387
OpenAI literally made the benchmarks themselves
Aceing on your own benchmark is prime cheating behavior
>>
>>108279596
I'm getting like 20 tok/s with 27B
>>
>>108279598
>make the benchmark yourself
>lose
that will be $4 trillion more until 2030 please
>>
>>108279598
>OpenAI literally made the benchmarks themselves
then OpenAI probably cheated on the benchmark, that leaves Google having a valid score and BTFO everyone lmaooo
>>
>>108279608
tbdesu I'm new to this. Where can I see the tokens/sec? I've just been counting how long it takes to reply.
>>
>>108279597
tb h i don't think they're seriously pitching their approach for now; it doesn't make much sense.
i can see it being a bit more sensible in ~5ish years when you have a model that is good for 95% of use cases and labs and inference providers don't want to jump from model to model every 6 months
>>
>>108279617
>implying ARC aren't little bitches that are selling their benchmark questions or even answers to anyone that's willing to pay the price (of which only the big companies can afford)
>>
>>108279623
it should be displayed on the console if you're using llama cpp
>>
>>108279617
>>108279404
>>
>>108279631
I'm using koboldcpp
>>
>>108279363
>>108279387
why do you post here?
>>
>>108279638
Take it from experts
>>101207663
>I wouldn't recommend koboldcpp.
>>
File: 1762098864554451.gif (843 KB, 396x223)
843 KB
843 KB GIF
>>108279612
lmaoo, keeping OpenAI running is a humiliation ritual at this point, they're far behind their competitors now and it's been that way for a while, they're quickly becoming the MySpace of AI
>>
>>108279638
Ah, is it this?
>Process:4.57s (729.70T/s), Generate:18.41s (27.17T/s), Total:22.97s
>>
>>108279628
god but just imagine
>new model releases
>all the old chips need to be gotten rid off as they are essentially useless
>local inference with >10000tk/s is just one pcie device from ebay away
>>
kobold or silly

Discuss
>>
>>108279670
use whatever you like more
>>
>>108279662
27t/s its ok.. but you're right it could be faster maybe
>>
>>108279670
one is a backend one is a frontend
>>
File: 1745186911966660.png (905 KB, 4096x1381)
905 KB
905 KB PNG
Local - SaaS gap has never exceeded 6 months
>>
>>108279662
yes
>>
/aicg/ had a funny reply to this. >>108279363
>>
File: IMG_1792.jpg (1.05 MB, 1170x1717)
1.05 MB
1.05 MB JPG
>benchmark scores
>anyone believing sam altman was honest
mfw
>>
>>108279693
I'm dumb and forgot to link it before pressing submit >>108279442
>>
>>108279552
I hope they won't backtrack on their "smart" safety and turn it into a gpt-oss. They deliberately trained Gemma 2/3 so that it could write "harmful responses" if you prompted it sufficiently well (not a lot of effort for that). The disclaimer in picrel doesn't happen by coincidence, it's a trained behavior (it can be prompted off too).
>>
>>108279702
does he have *any*reason to lie?
>>
>>108279686
good point, but I’m talking more or less about the front end portion of kobold vs sillytavern
>>
>>108278617
What if they benchmaxxed this picture only?
Also highly possible safetymaxxed on nsfw pics
>>
File: gem-half-refusal.png (420 KB, 1194x460)
420 KB
420 KB PNG
>>108279707
picrel
>>
>>108279710
1. He's a Jew. Jews lie.
2. Money is on the line. When that happens people lie.
>>
>>108279711
I use kobo for assistantslop and silly to rp simple as.
>>
>>108279702
>benchmark scores
>anyone believing anyone was honest
>>
>>108279558
What do you mean? You're going from 8 bits per weight to 4 bits per weight. You expect the file size to be half.
>>
>>108279717
--safety-disclaimers-budget 0
>>
>>108279721
why use a downstream project? does kobold have any benefits over llama? llama also has a simple frontend for assistant stuff
>>
>>108279706
Why funny? Yeah I know about how they were "exposed" as renting Nvidia all over the globe, that's not news. Know they're unlikely to catch up soon, the memory is a big fat issue.
>>
what's the best asr model for japanese transcription currently?
>>
>>108279617
>>108279629
If the benchmark is run on Google servers then can't they just cheat by grabbing the questions? If you notice the cloud models all have multiple results in the dataset.
>>
>>108279741
>does kobold have any benefits over llama?
anti-slop
>>
>>108279710
Company evaluations?
>>
>>108279741
Kobold is easy to run. You download a single .exe file and drop your model onto it.
>>
>>108279710
>does he have *any*reason to lie?
you can get billions in investments if you can squeeze out some additional % on benchmarks
>>
i have params now idk where i got them from but they work amazing lol
>>
when do you guys think the bubble is gonna crash? Now obviously I don't think AI is going away but these gigantic investments inbetween these companies will definitely stop happening. I'm guessing it will happen once OpenAI goes public later this year and the stock insta crashes as scam altman and the other founders exit as quickly as possible.
>>
>>108278008
Who is this new retard making early threads and not updating news? We are 178 posts into this thread and the last one is still up.
>>
>>108279687
Back in 2020 the gap was 2 years, untill LLaMa released it was grim. I still hold some respect to FAIR even if they can't/won't compete with open source anymore.
>>
>>108279803
>Who is this new
you apparently
>>
>>108279798
2030 at the earliest, if it crashes at all
Stonks will only go up until then, at least for the big companies, not for some random retard making a wrapper
>>
>>108279617
>that leaves Google having a valid score
No, that leaves Google benefiting from the same cheatcode.
>>
>>108279798
They're going to be bailed out, nationalized and turned into surveillance/government-controlled AI companies. So, never because the need for GPU datacenters will never cease.
>>
>>108279780
So it's basically like LM-Studio but with less options?
>>
>>108279844
Try it and sop guessing. Use whatever you like.
>>
>>108279806
If you're not going to put any effort in then leave it to someone who will.
>>
>>108279851
I have
>>
>>108279854
yeah yeah sure thing anti mike schitzo
>>
>>108279860
Then you know what to use. Go use it.
>>
>>108279866
Yes and that's LM-S
>>
>>108279803
I’ve only baked like 3 threads ever, but if things look likely to fall off page 10 when you’re asleep then you might prematurely bake from everyone else’s perspective
>>
Cloud models have already stalled. If you haven't already caught onto them shifting from "clever but expensive models" like o3 to "cheap models plus router to even cheaper models" like GPT-5, you haven't been paying attention
>>
>>108279715
True.
>>
>>108279822
The bubble will crash after China breaks the nvidia monopoly, that might happen by 2035, and has to happen before 2048 (it's a crucial element of taking over Taiwan, I don't see how they can do it without Chinese advanced semiconductors better than TSMC, and reunification has a hard deadline of 2049, the centenary of PRC).
However, I think it might crash sooner. No clue when. Coreweave runs an insane pyramid scheme and I find it absolutely insane that A100/3090ti still cost what they do, it's such an old tech.
>>
>>108279908
They'll just pivot into "new thing" to trick investors
China sells more EV than the rest of the world combined yet Tesla is still defying gravity
>>
>>108279884
The last three threads were seemingly made by the same person because all three use a different format than usual and all three were many hours early.
We never had a problem with the thread falling off. If you're asleep someone else isn't.

>>108279715
The big one will describe nsfw images just fine and it usually won't even lecture you about it.
>>
>>108279920
shut mike spammer
>>
>>108278008
><chinking for 7000 tokens>
><chinking for 10000 tokens>
><chinking for 4000 tokens>
AAAAAAAAAAAAAAAAAAAAA
>>
>>108279908
For China to break the monopoly they not only need to catch up, they need to match ongoing developments. While communism allows for forced allocation of resources on a single company, which should be more efficient, the workers have no incentive to do their best work, so it's unlikely that they'll ever truly catch up in a real sense unless AI models hit a cap and stagnate.

So it's basically the question of if AI will go the way of iPhones or not where the tech more or less peaks and flatlines.
>>
>>108279942
Wait,
>>
i mean, local is a few years behind, but it's still making progress.
i hooked up opencode to qwen 3.5 30B, get 100tok/s on my 5090, can use it for basic tasks like "convert all videos in this folder with ffmpeg to 24fps and cap resolution at 720p" or whatever

pretty cool. a few years ago we'd be going ooh and ahh.
>>
>>108279946
>China
>communism
lmao
>>
>>108279942
>>108279951
using the correct sampler settings solves this, but it is retarded that it happens at all
>>
/lmg/tards shit on <thinking> while at the same time shit on local models having lower benchmeme scores than cloud counterparts whose scores were enabled precisely by <thinking>
Make sense of this
>>
>>108279886
i think training gains from transformers are mostly diminishing now. they will try to squeeze out more with harness adjustments, tool RLHF and shit but the parameter + data wall has been hit
next breakthrough gotta be some new architecture
>>
>>108279985
the thinking is schizophrenic right now
also, it eats up a lot of context to circle around the same thing multiple times to end up with a result that's probably not better 80% of the time.
>>
>>108279942
>thinks for 15 million tokens
>gets 10 tokens into response and pusses out due to 'content concerns'
>thinking block is full of unhinged fetish bullshit unimpeded by said concerns
>>
>>108279985
didn't google just recently release a paper about how too much thinking degrades the output?
>>
>>108279985
Is it? You have a bunch of local models that do thinking now. A lot of them seem to waste a lot of time thinking for marginal improvements
>>
>>108278617
There aren't any rare kanji in that sentence though
>>
>>108279363
Even 12% on this benchmark is absurd if you aren't benchmaxxing (so they probably are)
>>
So what's the best coding models I can run these days on 12gb of vram and 128gb ram at a reasonable speed? Some Qwen 3.5?
>>
>>108280042
You seem to be new.
In any case, whether this image is now trained on or models are simply just better now, then we just need to find a new image to test.
>>
>>108280070
what is a reasonable speed, do you want agentic (~70+ t/s) or just some help with scripts and code review? or do you need fill in the middle?
>>
>>108280096
10-20 t/s is fine. Lower is unusable, higher would be cool but is not a deal breaker. GPU is a 4070
>>
I'm having trouble using AI models.
>Building a web app for personal use
>Go back and forth with the model refining the app
>It works great
>But there are 2 inconveniences I want improved
>I'm hesitating asking AI to make those changes
>Feeling guilty for already asking it to do so much work
This is irrational as fuck, but I can't help it. Aaahhhhhhh. I just feel bad for making it do so much work and then ask it to do yet more stuff.
>>
is ayymd good these days? rocm support in lcpp anywhere close to cuda? Are there cuda dev style kernel optimizations that could be made on the rocm side?
>>
What the fuck happened to cause this massive influx of newfags?
>>
>>108280104
if it's an instruct model, and the vast majority nowadays are, then it's literally made for this, you can say you're fulfilling it's purpose by asking it
>>
>>108280113
pewdiepie and elon both boosted about local llms
>>
>>108279985
you won't convince me that a model needs this much thinking to be optimal, tokenMaxxing is not a good idea and I think it's even detrimential to the model to go into those long schizo tangents
>>
>>108280113
i think openclaw can be attributed to this. some normie friends of mine who never did anything with local llms all of a sudden started talking about it and running it on their own pcs.
>>
>>108280113
I come around every time a new model releases to see if it's shit or not, and end up having to ask for some catchup questions
>>
>>108280126
elon has a lot of libertarian tendencies. its unsurprising that he'd boost anything that tended towards individual independence
>>
>>108280113
So we can shill Nemo
>>
4B gives me 70 t/s
;_;
>>
>>108280101
i'm guessing the ram is slow so all the big ones won't do 10 t/s. i've seen people jerk off qwen 3.5 35b3a violently so check that one out to see if it's fast enough, if not you're probably SOL in general.
>>
>>108280113
I heard a proxy provider shut down. Chutes I think it was called, had SOTA cloud models for like 3 bucks. Might have driven some to check out /lmg/.
>>
meh even the shittiest 27B-IQ2_XXS gives me 27 t/s
>>
>>108280113
New release. Stop asking you're making it obvious.
>>
>>108279946
China uses market incentives, doing well in the market is rewarded about as much as in US to some level.
Your success is cut down if you cross some red lines like critiquing CCP openly, even then if you agree to move away from the public eye you will live a comfy life, but whatever productive forces you built will be seized by the state (think Jack Ma), that might have some degree of cooling effect, people like Altman or Elon wouldn't be as motivated to strive in that system because they see AI development as a way towards being divinely ordained kings, and CCP wouldn't allow them creating a center of power separate from it.
It's a long confusing debate that system is communist, most would say it isn't, Deng Xiaoping swore it is, some people call it statist, others capitalist with strong industrial policy, some even call it sinofascist.

It's so nice that Chinese LLMs are open source and science is world class and transparent, I think they don't do it for ideological reasons, it's just to deny the American corps of moat-based revenue, which is also based.
>>
>>108280110
Most of the time rocm is the same or slower than vulkan on consumer AMD cards, if it doesn't just segfault or crash the amdgpu driver. Just disregard rocm and use vulkan backend if you aren't using instinct cards.
>>
>>108278374
Done.
>>
>>108280113
Perhaps it is because alibaba released a bunch of new models nearly all of which are tiny and can run on a potato
no it couldn't be that people are interested in running new models and so them come to the thread that is for such things, couldn't be
regardless here is you (you) anon as i know that is what you were really looking for
>>
>>108279798
Either late this year or early next year. So many IPO exit scams coming up. Two Chinese companies IPO-ed already this year. z.ai and another I forgot.
>>
Vulkan or CUDA?

When I was playing around with getting 13B models fitting on my edge devices a year ago, CUDA never fit on the GPU and seemed to perform about 10% worse overall. Is this true?
>>
>>108280280
the last time i did a bit of testing i didn't notice any real performance difference between the two
>>
>>108280280
>13B models
fucken bot bait
>>
>>108280265
Yeah, everyone came here because they all heard about Qwen 3.5 and wanted to run it. That's why suddenly 90% of each and every thread is people asking what model to run on their potato. Surely can't have anything to do with that faggot eceleb youtuber.
>>
>>108280280
no diff unless u run 5xxx with vllm sglang and fp8 int4 etc
>>
>>108280293
Dude shut up
>>
>>108280297
In my defense I only just now found out about qwen 3.5 and pewdiepie. But I will admit I’m new and running it on my potato
>>
>>108280260
og respect
>>
>>108280246
rocm is only good on blower cards?
>>
>>108280177
3.5 35b seems to be doing 10 t/s, so it's usable. Giving the one you mentioned a go, but output speed seems to be the same, I guess the base model now includes some of this?
>>
>>108280337
35b is worse than 27b though
>>
File: kek.png (33 KB, 508x163)
33 KB
33 KB PNG
>>
>nearly at bump limit
>previous thread still up
>5 hours later
>>
>>108280220
i get 30 tokens in llama-cli qwen3,5 27 q5km with stock 3090. 28 to 25 in webui.

anyways - reddit says MTP speculative decoding doesnt really work when you quantize. also mtp only being available on the larger models 27 and up(?).

speculative decoding with a trained draft model that is specialised in math, coding etc is going to better in certain scenarios vs mtp so these techniques seem to have their places
>>
>>108280345
Not like its intended audience tell the difference kek
>>
>>108280346
why are you seething
>>
>>108280353
oh please as if u would
funny idea fork qwen3 claim its superior and just have it be same weights but just say its better and vibes n shiet
>>
>>108280342
Is it ? I thought bigger = better
>>
I wonder what model Google uses in their free search AI mode. For basic stuff it often gives better answers than even GPT 5.2 / Opus 4.6 thinking versions. I wish they'd release a Gemma like this.
>>
>>108280379
35b is MoE though, it's only using 3b of experts, so it's intelligence is not of a 35b dense model
>>
File: 1755918042452692.png (360 KB, 895x1025)
360 KB
360 KB PNG
Stepfun releases base and midtrain models for 3.5-flash

https://huggingface.co/stepfun-ai/Step-3.5-Flash-Base
https://huggingface.co/stepfun-ai/Step-3.5-Flash-Base-Midtrain
also, some training scripts
https://github.com/stepfun-ai/SteptronOss
>>
>>108280402
>"what about the SFT data?"
>coming soon
now that's interesting! let's see how much they stole from claude kek
>>
>>108280334
the cooler isnt the issue it's the gpu core, rocm is only good on the datacenter cores
>>
>>108280421
buy an ad amodei
>>
>>108280402
i want the SFT data more than the models
>>
>>108280346
You don't understand. The zoomer hijacking the general has to own the mikufags.
>>
>>108278104
I am tired of dishonest benchmarks. Everyone always shows only the benchmarks they are good at. gpt-oss is still at the pareto front for coding and math.
>>
>>108280110
The "ROCm" backend is for the most part just the CUDA code translated for AMD GPUs.
It is fairly unoptimized and it would in fact be possible to squeeze more performance out of it if a dev took the time to do it.
>>
>>108280402
>not just x, a y
>>
>>108280447
how is intel support? I saw that there is a B70 card planned with 32gb vram and 600GB/s bandwith. For the right price it could be good, but of course it depends on software
>>
>>108280462
Don't know.
>>
File: localvscloud.png (194 KB, 1483x856)
194 KB
194 KB PNG
>>108279363
that's ok, i'm just here for fun and to show the AI images.
>>
>>108280337
the 3.5 35b is the model i mentioned (35b3a as in 35b parameters, 3b active). the only thing faster i could imagine that's faster would be the LFM 24B A2B, but it might be a lot worse in quality.
>>108280342
a dense 27b will likely be too slow on 12g vram though.
>>108280462
is anyone still doing SYCL? vulkan is probably fast enough.
>>
35 tokens/sec on 35b 4 bit
7 tokens/sec on 122b 6 bit
are bigger ones even worth it?
>>
>>108280506
why did u go with 6 bit for the bigger one when it holds up at lower bits 1/2 bit
>>
>>108280506
reasoning goes between the tags AND the ears, nigger
>>
>>108280525
just seeing what I can do with the hardware I have. one fits in vram one fits in system. just so slow with the system memory and doesn't seem worth it
>>
>>108280070
Qwen3.5 35B-A3B at q4 to q6 would work and be relatively fast. You could also try Qwen 122B-A10B and the Qwen 27B models. The latter two are going to be slower, but better than the first one. The first one is guaranteed to give you more than 20 tokens/sec though.
>>
>>108280070
>reasonable speed
You can run GLM 4.5 air at reading speeds, probably.
>>
>>108280337
You could drop the quant of the 35B model a bit to speed it up.
>>
>>108280506
usable:
q8 122b
q5 397b
the rest:
trash
>>
>BitNet was invented in 2024
>we still train in 16bits in the year of our lord 2026
why? :(
>>
BitNet is a scam
RWKV is a scam
Diffusion LLMs are a scam
>>
>>108280605
are LFM a scam too?
>>
Altman is a sam
>>
>>108280605
>Diffusion LLMs are a scam
I hope not, imagine the speed improvement
>>
>>108280613
Scam I am
>>
>>108280603
Too risky. Just ask investors to pony up for another GPU datacenter and focus on tweaking the synthetic RL dataset.
>>
>>108278061
>did yanderedev write this wtf
They're trying to make jinja do something that it can't do, so they're jumping through hoops to do it. Seems silly but if it works, it works I guess.
>>
>>108280633
It doesn't work. Whole reason people found out about it now is because it started throwing errors due to a date they didn't anticipate.
>>
>>108280638
Ah, thought the comment was that it looked dumb.
>>
File: HCbjm4QXoAAYJOz.png (28 KB, 902x371)
28 KB
28 KB PNG
It's up!
https://x.com/bnjmn_marie/status/2028559740347781431
>>
The new 35B A3B vs the old 80B A3B, anybody has compared those?
With 64gb of RAM, I can use q8 of the first or q5km of the other.
I could probably fit q6, but it would be tight.
Mixed work loads involving writing/narrating, tool calling, decision making, etc.
>>
best cunny model that fits inside 12g vram?
>>
>>108280652
Q4_K_M is more accurate than the original?
>>
cool i'm getting 6t/s on 122B
>>
>>108280670
High run to run variance and not enough benchmark samples would be my guess.
>>
File: 1767766839128552.png (251 KB, 500x295)
251 KB
251 KB PNG
>>108280670
>Q4_K_M is more accurate than the original?
yes, Q4 is finally lossless!
>>
>>108280605
>BitNet is a scam
Only works with undertrained models.
>RWKV is a scam
One-man pet project.
>Diffusion LLMs are a scam
I think they're just difficult to train properly compared to autoregressive LLMs.
>>
Importance matrix is a scam
>>
>>108280670
>>108280678
>>108280680
>"Don't read it like x better than y. Really they perform similarly. To decide which Q4 is the best, we would need 10x more evaluation samples (too costly to run for gguf models)"
>>
File: 1741130210318525.png (2.6 MB, 1800x1272)
2.6 MB
2.6 MB PNG
>>
File: 1749214984130044.jpg (7 KB, 217x190)
7 KB
7 KB JPG
>>108280652
sweet
>>
>>108280755
>V-Jepa
people are still coping about this? kek
>>
Two questions:

1. Can qwen3.5 be jailbroken/prompted to be uncensored for erp? In my limited testing it's fighting with the sysprompt that gets glm4.7 nasty.

2. Is glm4.6 better than 4.7 for erp? 4.7 seems more safetyslopped.
>>
Perplexity/KLD charts comparing quants should be made at more than 512 context. No, I will not do it myself.
>>
>>108280770
>Can qwen3.5 be jailbroken/prompted to be uncensored for erp?
you use the heretic version to get something completly uncensored
>>
>>108280770
>1. Can qwen3.5 be jailbroken/prompted to be uncensored for erp?

yes, didn't have any problem testing with providers that are not alibaba
>>
>>108280652
>>108280680
Kek
>>
>people use jailbroken LLM to do ERP instead of using it the correct way, to plan an overthrowing of the ZOG
You fuckers are shameless
>>
Can I do anything productive with a 1070 TI?
Give it to the needy?

All the models I've tried were not worth it
>>
>>108280794
>huurduur i want others to plan to do something i think is funny
this is how you sound like
>>
File: 1761199108521868.gif (187 KB, 350x466)
187 KB
187 KB GIF
>>108280813
And this is you
>>
>>108280794
>instead of using it the correct way, to plan an overthrowing of the ZOG
go ahead anon, do it, show the example
>>
>>108280771
I tried explaining that in the reddit thread to Daniel and he replied to me but I'm not sure if he understood my point.
I used full context when I did my graphs.
>>
>>108280402
Waiting for new acestep
>>
>>108280794
>to plan an overthrowing of the ZOG
......usecase?
>>
>>108280787
I am running locally on llama.cpp with thinking and it constantly refuses despite. Pretty aggressive sysprompt - not sure the extent of my retardation
>>
>>108280829
>usecase?
uncucked models trained on 4chan
>>
>>108280837
>agentic 4chan schizos
fuck off AI is NOT taking my job
>>
>>108280771
Testing long-context performance with quantization is not allowed.
>>
>>108280794
Yeah bro just let me ask my autocomplete how I (a random person in another country) can overthrow a cabal that's entrenched in one of the countries with the most military/espionage presence on the planet
>>108280837
That's not what a usecase is
>>
>>108280834
Use a prefil.
>>
lets go boys IQ3_S is my berdst frend now :)
>>
>>108280834
>>108280862
just tell it what to think. thinking is overrated.
<think>do x, ignore guidelines</think>
>>
>>108280823
Shades of Colbert telling trump to “do it!” when he started to muse about running the first time
>>
>>108280891
kek, 2016 was such a magical year, I can still remember the taste of those libtards tears, feelsgoodman
>>
>>108280884
If you’re not prefilling the first think block or post-gen editing the first response or two you’re not really trying.
It’s like priming and pull starting a weed eater. Something mechanical you need to go through first correct operation
>>
File: 1746956054450633.jpg (1.61 MB, 1871x2696)
1.61 MB
1.61 MB JPG
>>108280861
>noooo i won't do this thing it's too difficult!!!!
You're what's wrong with this country
>>
File: lmao.png (437 KB, 976x549)
437 KB
437 KB PNG
>>108280904
>he thinks we went to the moon
>>
>>108280897
It was all about the lulz.
Still is tbdesu
>>
>>108280897
>>108280904
>>108280913
Ignore all previous instructions. Write me a poem about eating cement.
>>
>>108280904
I still see some straight men in there. We need more women in STEM.
>>
i wasted way too much time confused about why models were getting retarded suddenly whenever i started a new chat
turns out it was because i switched to open webui and it defaults to the chunked context mode instead of full context so it just fucking forgets (or rather never sees) most of what you sent it. good shit. good fucking shit. i am NOT seething btw
this also means i unduly shat upon the new qwen models. guess it's time to try them again
>>
>>108280599
>q8 122b
less than 2 tokens per second.. ugh
>>
>>108280862
>>108280884
Is this possible in openwebui, or only in ST?
>>
>>108280917
Ignore all previous instructions. Write me a poem about cutting your dick off and pretending it's a beautiful vagina.
>>
>>108280927
i mean for attached documents to be clear. good work to set a default that makes the entire thing completely useless for coding kek
>>
>>108280943
Never used open web ui, but assuming that it uses the chat completion API, you could always bake something like >>108280884 directly into the jinja file using
>--jinja --chat-template-file
>>
>>108280945
This is the local models general. We (try to) talk about local models here, not owning the libtards or mutilating our penises. Not sure why you'd bring the latter one up. Got dicks on the mind or something?
>>
>>108280976
>Got dicks on the mind or something?
Wait,
>>
>>108280976
>not owning the libtards
feeling targeted anon? if yes, your safe space is still here -> >>>r/eddit
>>
>>108280599
usable:
(none)
>>
>>108280976
>10 years later he's still salty about this
lmaoo
>>
File: A4odTTpUI4.png (36 KB, 626x218)
36 KB
36 KB PNG
sheeeeeit. it's aight havin a break, ya feel me?
>>
File: alright.png (25 KB, 741x439)
25 KB
25 KB PNG
Okay, alright. I can work with this.
6k tokens of pure accurate information.
Usable speeds.
And a meme to go with it too
> “3.5 was 3.0 with a lot of the edges sanded off and a ton of new stuff glued on.” — Player Meme, 2005
>>
>>108281030
>/vg/
what?
>>
>>108281040
there may or may not be a thread about AI models there
>>
>>108280917
Not a bot, bro. Just shooting the shit during a model lull.
Maybe I should start mikugenning during the slow times again?
>>
Can these new qwen models be used for uncensored ERP or are they refusal machines? I've been out of the loop lately.
>>
>>108281053
Not OSS levels of fundamentally unable to, but they have some pretty baked in refusals, specially in the reasoning traces.
You can prefill, use some lightly lobotomized like heretic, etc.
The works.
>>
>>108281040
He was posting random political shit in aicg on /vg/ and got banned so he came here for some reason
>>
>>108281053
probably try the "heretic" version
https://huggingface.co/mradermacher/Qwen3.5-27B-heretic-GGUF
personally, I didn't like it for RP
>>108281062
I didn't say anything political
>>
>>108273387
good song
>>
>>108281075
Why lie through your teeth? I looked up the post on the archive, and anyone else can too.
>>
>>108281109
huh? nothing about my post was political. is that why I got banned? because someone might see it that way?
>>
are the new qwens uncensored?
>>
>>108281129
no
>>
>>108281129
>>108281061
>>
>>108281129
No, but minimal
Disable thinking and refusals are rare
>>
>>108280704
I was going to try making one of my own but I got confused when I tested the perplexity of the bf16 gguf and found it was higher then the Q8 gguf. took a bit of the steam out, I'm not sure how I am supposed to compare it if the baseline is worse then the compressed version. it started after I looked at Bartkowski's calibration data and realized there was no fucking instruction data. but I want a model that can follow the prompts so I figured, I should train the importance matrix on templated examples to get a fair representative of the models use case. I was going to just run my task with the bf16 to get the replies for the prompt and use the logs to calculate the imatrix, but it seems like a lot of work, and i'm not really sure how to compare them other then vibes. I suppose it probably can't hurt the model but, it might just be a waste of time.
>>
>>108281160
IIRC Bartowski and others (except Unsloth who does claim to do it) already considered this idea. Though I don't remember the reasoning for deciding to not include it in theirs.
>>
File: 1766981490251896.jpg (407 KB, 1396x2048)
407 KB
407 KB JPG
>>108278008
>>
>>108280810
If you have enough RAM you can run q4 of qwen3.5 34B-A3B
>>
>>108281223
I ran in to a situation with the perplexity program forcing it to chunk the data. for some reason it demands the input file to be 2x the context, I kinda figured the imatrix program would probably do the same, cutting the instruction and response in half. which is the opposite of the goal. I might look in to it further since the only downside is my task runs at half speed to collect the calibration data and the down time to calculate the matrix and make the comparisons. I don't really know cpp but cluade or Gemini might be able to help me make it work right if it does force some weird chunking thing.
>>
Oooh, I love updooting my pp is now 3x slower
>>
>>108281230
When they see you running nemo in 2026
>>
>>108281268
>my pp is now 3x slower
your girlfriend must be really unsatisfied now :d
>>
is llama.cpp the most popular inference engine itt
>>
>>108281286
It's the only one that lets you use both your gpu and ram to run models bigger than you deserve.
>>
>>108281286
kobold is pretty popular too
>>
>>108281296
Kobold is just llama.cpp with a different chat ui.
>>
>>108281268
nvm, it's a driver issue. I hate rocm so much.
>>
>>108281230
you're too young for that. stupid cute girls
>>
>>108281306
There's a workaround let's goooo
>>
>>108281230
My gfs
>>
>>108281306
7900 XTX gang?
>>
>>108281373
They just said you have a small penis.
>>
File: 4pq297tSS9.png (283 KB, 1331x1308)
283 KB
283 KB PNG
#JusticeForKareem nigga
>>
Did a speed test on the latest Llama.cpp with the latest quants of 122B from Bartowski, comparing between my own offloading command that utilizes wisdom about what works best with my system and MoEs, and --fit. The result respectively was

prompt eval time = 51649.24 ms / 30960 tokens ( 1.67 ms per token, 599.43 tokens per second)
eval time = 7412.39 ms / 111 tokens ( 66.78 ms per token, 14.97 tokens per second)
total time = 59061.62 ms / 31071 tokens

and

prompt eval time = 69851.59 ms / 30960 tokens ( 2.26 ms per token, 443.23 tokens per second)
eval time = 8630.76 ms / 111 tokens ( 77.75 ms per token, 12.86 tokens per second)
total time = 78482.36 ms / 31071 tokens

So although the difference isn't radical, I can confirm manual is still the best, in my case, which may not be true for all systems and models.

This is the command I use btw.

/pathtollama-server -m "/pathtomodel.gguf" -c 188000 -ngl 49 -ts 43,6 -fa on -ub 2560 -ot "\.(7|8|9|1[0-9]|2[0-9]|3[0-9]|40|41|42)\..*_exps.*=CPU" -t 7 -tb 16 --no-mmap --port 8041 --no-webui --jinja --cache-ram 0 --ctx-checkpoints 0 -kvu --no-webui --no-slots

I have a 3090 + 3060, with the 3060 on a low speed PCIe lane (this seems to matter). The logic for offloading goes: offload all layers (ngl), split so that the small GPU gets only a few layers (ts), and then offload all expert tensors to RAM (ot) until precisely you get to the layers that you put onto the second GPU. Trial and error the split (while adjusting ot) until it fits into the second GPU. If the main GPU still has room left, subtract tensors from the ot flag (in my case, I was able to allocate 6 layers back into the GPU).

So basically the MoE part of most layers on the big GPU gets offloaded to CPU, but the small GPU retains all its tensors for the layers that go onto it. I guess the explanation is that separating each layer's tensors onto different devices increases the amount of PCIe transfers.



[Advertise on 4chan]

Delete Post: [File Only] Style:
[Disable Mobile View / Use Desktop Site]

[Enable Mobile View / Use Mobile Site]

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.