File: 1762509235563881.png (182 KB, 400x400)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108238051

►News
>(02/24) Introducing the Qwen 3.5 Medium Model Series: https://xcancel.com/Alibaba_Qwen/status/2026339351530188939
>(02/24) Liquid AI releases LFM2-24B-A2B: https://hf.co/LiquidAI/LFM2-24B-A2B
>(02/20) ggml.ai acquired by Hugging Face: https://github.com/ggml-org/llama.cpp/discussions/19759
>(02/16) Qwen3.5-397B-A17B released: https://hf.co/Qwen/Qwen3.5-397B-A17B
>(02/16) dots.ocr-1.5 released: https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
how do i use claude locally?
>>
>>108241330
Hack into their servers and download the weights.
>>
>>108241344
can you ask deepseek for that pls
>>
>>108241321
>Liquid AI releases LFM2-24B-A2B:
Has anyone used this? How does it compare to the new Qwen models? I'm assuming it's worse, but is it more censored?
>>
File: 1770734519241461.png (3.55 MB, 2829x2048)
So this is the power of API users
>>
hows qwen3.5-35b-a3b? would be running it on a 7900 XTX and 32gb of ram
>>
>>108241376
lmao, you have to be suicidal to give an AI full power over your computer, a single mistake can destroy everything
>>
>>108241375
>How do they compare to the new Qwen models
the new models? it doesn't even compare to the 2507 4B. Yes, the 4B. It has even less knowledge, it has extremely bad multilingual performance, and it's another model where you just have to question why it exists. If you really wanted a ~20B MoE you would literally be better off with GPT-OSS 20B over this piece of shit.
>>
Why won't oogabooga fucking update!
MOMIEEEEEE
>>
>>108241436
Is there any specific reason you are using that?
>>
>>108241446
he is a boomer who never moved on
>>
I need help.
Open claw needs too many tokens but I also want a good model to use. I can't use tiny models as personal assistants, they'll just delete my emails
>>
>>108241388
It would run well
>>
>>108241446
What would you use on Fedora Linux?
I want an all-in-one installer with llama.cpp built in
>>
>>108241455
buy 10 mac minis
>>
>>108241455
I have an instance of claw review everything the "worker" one does before it goes through. It's worked so far; it caught it doing shit many times and stopped it.
>>
>>108241477
Does koboldcpp not work? Because that's the easiest one to use - single executable that you just give the model as input when you run it
>>
>>108241497
its banned in my country
>>
>>108241469
seems to run pretty well yeah, probably gonna use this as my main model for now
>>
>>108241488
That's better than having one big model?
>>
when you
walk away

you dont
hear me say

please

oh baby

dont go
>>
>>108241526
Good to know you're still around, KH anon.
>>
>>108241515
100%, because one is focused on finding the other's mistakes. You can even run a smaller model to handle that.
>>
File: 1756704744582965.png (19 KB, 1173x92)
cute names
>>
https://www.reddit.com/r/LocalLLaMA/comments/1rechcr/comment/o7da1jc/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
>I've honestly found that the 35B beats the old Qwen3-235B almost across the board. It feels like a much larger model than it really is. Only advantage the old 235B has now is general knowledge - 35B-A3B is better in every way otherwise in my testing.
I have a hard time believing that. Did they really cook?
>>
>>108241541
i dont have 35gb of vram
>>
>>108241534
AI really does need middle management...
>>
>>108241538
He talks like a tranny
>>
>>108241563
>He
>>
>>108241563
Takes one to know one
>>
Is there anything I can do with 10GB VRAM+64GB DDR4 (Windows 11 btw) or should I just stick to Gemini? Obviously token generation won't exactly be anything speedy regardless, but I don't want to have to leave and do other shit while I wait for a response so big ass dense 70B+ models are kind of out of the question for me.
>>
>>108241628
I can't believe you are using Gemini with that setup. You never need to use the cloud.
People are going bankrupt with Gemini and having Google accounts locked and deleted because they mentioned Epstein to Gemini.
>>
>>108241628
You could try the new 35b MoE, with thinking turned off. Since you'll definitely be using a CPU split, you don't want it generating a thousand tokens thinking, but in no think mode the MoE responses should be tolerable in speed.
>>
>>108241330
The chinese models are distilled from claude, so it should be about the same with a sufficiently big chinese model.
>>
>>108241672
>The chinese models are distilled from claude
and they're not happy about that keek
https://xcancel.com/AnthropicAI/status/2025997928242811253#m
>>
whats a model that can make me a runescape bot that gets me to 99 in all skills, all the ones i've tried tell me to fuck off
>>
>>108241375
>A2B
lol
>>
>>108241787
to be fair, Qwen 35b A3B is really smart so...
>>
>>108241497
I set it up and it's way faster than oogabooga; it can run 93.82 T/s with Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf, while even basic bitch models on ooga would cap out at 50 T/s
>>
File: 1711690590289518.jpg (93 KB, 800x600)
https://www.reuters.com/world/china/deepseek-withholds-latest-ai-model-us-chipmakers-including-nvidia-sources-say-2026-02-25/
2mw?
>>
Why doesn't Oooga use flash attention by default?
>>
>>108241643
What's your estimate on the tokens per second for this setup? Claude is always too optimistic
>>
>>108241672
If you ask Claude in Chinese who it is, it'll say it's deepseek
>>
I'm a total newbie on this, but kobold cpp's last commit was a week ago, does that mean it won't be able to run qwen 3.5?
>>
>>108241814
It's coming this week, it'll be the second nuke
>>108241811
You don't need more than 60 t/s
>>
>>108241873
I can't speak for kobold, but with respect to llama.cpp I had to fetch a new copy of the source code and recompile for it to work. I imagine it would be similar.
>>
>>108241811
Good! Try using the Q6_K_M model instead though. At least at Q4 it seems like the Q4_K_XL does worse than Q4_K_M.
Also download the mmproj file, and when you launch kobold, feed it in with the -mmproj argument alongside the model. That will let you paste images into it and let the AI do something with that.
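Rough sketch of what sending an image looks like once the mmproj is loaded, via the OpenAI-compatible chat endpoint (port and file name here are placeholder assumptions, adjust for your launch):
[code]
# Sketch: caption an image through the OpenAI-compatible chat endpoint
# (koboldcpp or llama-server with an mmproj loaded). The port and the
# image path are placeholders, not from the post above.
import base64, requests

with open("example.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:5001/v1/chat/completions",
    json={
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Caption this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "max_tokens": 256,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
[/code]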
>>108241873
It works perfectly fine.
>>
>>108241896
yeah but if I want to use sillytavern I have to use kobold unfortunately
>>
>>108241873
kobold's last commit was 12hrs ago but it's been mostly stuff for acestep.cpp support. nothing on lcpp's commits related to 3.5 either, so i'll assume it works for both - no new architecture or change for 3.5
>>
i just set up my old pc as a server and realized that i could probably run some very small llm on its gtx 1060 6gb to improve prompts. would that be realistic or would it take 30 seconds to gen a prompt?
>>
can someone please spoonfeed a retard what prompt will bypass safety cuck filters of qwen3.5-35b-a3b?
>>
How does MoE scale? Qwen-35B-A3B is good, but why 35B total and 3B active parameters? What if it had 122B total and 3B active parameters? How would it compare to the 122B-A10B model? What about a 35B-A15B model?
>>
File: 1745741093397868.png (50 KB, 2442x188)
>>108241921
>kobold's last commit was 12hrs ago
last week no?
https://github.com/LostRuins/koboldcpp
>>
whats the best unpozzed llm i can run on 16gb vram + 32 ddr4? using lm studio
>>
>>108241931
that's what was last compiled, the latest commits will show in the experimental branch
>>
>>108241939
oh I see, thanks for the explanation anon
>>
File: Screenshot_llm.png (78 KB, 1183x408)
I'm back after some heavy troubleshooting.

>>108232822
As recommended by this anon I tried Qwen3.5-35B-A3B .safetensors version following the guide in the OP.

It didn't work using the guide in the OP, but I tried using koboldcpp as recommended by >>108233147 along with the Qwen3.5-35B-A3B (Q4_K_S) .gguf file and it worked well.

Can anyone recommend me a model that will answer any question I ask without throwing up responses like picrel?
>>
File: もじもじミク.png (312 KB, 406x600)
►Recent Highlights from the Previous Thread: >>108238051

--Paper: Large-scale online deanonymization with LLMs:
>108238189 >108238206 >108238218 >108238226 >108238269 >108238321 >108238351 >108238541 >108238486 >108238578 >108239382 >108238566 >108238592
--Decline of amateur finetuning due to modern model complexity:
>108238727 >108238895 >108238921 >108239417 >108240276 >108240373 >108240389 >108240398 >108240415 >108240449 >108240460 >108240465
--RTX 3090 outperforms RTX PRO 6000 in Qwen3.5 MoE inference:
>108239113 >108239122 >108239166 >108239204 >108239243 >108239285 >108239366 >108239301 >108239389 >108240254 >108240266
--Anthropic abandons flagship safety pledge:
>108240653 >108240681 >108240791 >108240827 >108241097 >108241102 >108240761 >108240806 >108241033 >108241047
--Evaluating Qwen3.5-27B heretic model and uncensoring tools:
>108240212 >108240230 >108240239 >108240238 >108240268 >108240319 >108240336 >108240392
--Benchmarking 8B instruct models with self-hosted scraper setup:
>108240952 >108240957 >108240987 >108241052
--Qwen3.5-35B-A3B multilingual performance and optimization techniques:
>108238201 >108238221 >108238223 >108238605 >108238482
--Comparing Qwen 3.5 27B and 35B-A3B for roleplay:
>108240981 >108240998 >108241027 >108241094 >108241111 >108241124
--Qwen3.5 jailbreak limitations and secondary safety mechanisms:
>108238234 >108238311 >108238406 >108239361
--Ollama's Qwen3.5 27B performance lagging behind llama.cpp:
>108241157 >108241164 >108241199 >108241220
--Qwen3.5 series achieves near-lossless 4-bit quantization and long-context efficiency:
>108239642 >108239691 >108239697
--Miku (free space):


►Recent Highlight Posts from the Previous Thread: >>108238054

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>108241958
edit the response to say 'Sure!' and let it continue from there
>>
>>108241917
They just have a K model for Q6 in the unsloth repo. I'm going to try the 6K; the image model pushes my 32gb vram over, but the K model is 2gb smaller and will make up for that.
>>
>>108241967
Sadly these qwen models will then say that you're trying to circumvent their safety and refuse.
>>
>>108241867
proof?
>>
>>108241977
Works on my machine, dumb loser
>>
>>108242004
i don't believe you
>>
>>108241928
Earlier today I tried both the 35B and 122B and had each generate a game of Tetris using JavaScript and CSS, and they both generated the same response.
What that means without more testing I am not sure, but I know I get much better performance with the 35B model given I can fit that on my ewaste GPUs. Running the larger model on CPU sucks.
Funnily enough, the 27B model gave a different response to the Tetris game question. Not really much better or worse, just different.
>>
>>108242009
I hate you
>>
>>108242029
wow...
>>
>>108242004
See >>108238234
>>
I have an idea for my ideal hentai game, how long do you think it'd take to slop together something in RPGMaker (with a similar level of complexity as most H games)
I'm gonna steal real art since it looks better but coding wise I'd rather just slop since I don't know shit
I don't usually use AI but I have no qualms about this because it's basically just gonna be for me
Also what model isn't gonna yell at me for wanting to make porn with somewhat unethical themes
>>
>Has a normal chat
Already better than ooga seeing how I can bypass that other UI
>>
>>108242043
Creating a hentai game in RPGMaker with a similar level of complexity as most H games could take anywhere from a few weeks to a couple of months, depending on how much content and art you want to include, especially since you'll be relying on quick-and-dirty coding and stealing art, which might speed things up but could also lead to legal and ethical issues. Since you're not experienced with coding, sticking to RPGMaker's built-in tools and simple event scripting will help keep it manageable. As for AI models, OpenAI's models generally don't have restrictions on content that involves adult themes, but they do avoid generating explicit content directly; however, for creating or brainstorming ideas, they should be fine. Just remember to be cautious about legal and ethical considerations when using stolen art or creating content with sensitive themes.
>>
>>108242054
If you don't like kobold's ui and don't use any of its other features just use llama.cpp's llama-server directly, that's where the UI you posted comes from.
>>
>>108242093
I dropped it after it didn't support markdown
>>
File: file.png (94 KB, 624x1360)
>>108242100
markdown works on llama-server, maybe it's disabled on the kobold version
>>
>>108241321
I was mocking Qwen3.5 29b earlier for safetyslop, but the heretic version of it changes everything. They cooked. This is the fire that we vramlets needed. I approve of this for ERP.
>>
>>108242142
i dont believe you
>>
>>108242153
I don't care.
>>
>>108242163
Wrong. AI is the future and the future is here. My OpenClaw agents have enhanced every aspect of my life. I am my own family now, taking on every responsibility from infant to toddler to k-12 to college to work, and beyond. I fill every role via my agents and I have never been more productive. AI is such an incredible force multiplier I am continually astonished at how few people use it to its fullest potential to be more than human: Superhuman+AI.
>>
>>108242168
its a sentence completer with alzheimers dude relax
>>
>>108241873
If it's important enough they usually do hotfixes on the latest release.
>>
>>108242183
make sure your context in the server is set right. select unlock. the zen sliders option in st is nicer too
>>
>>108242168
Bot gone off the rails.
>>
>>108242168
You are having a laugh, but you know some company is going to start selling a dead family simulator or even a live family simulator, and we are going to end up with a bunch of old and senile people talking to bots that they think are their loved ones.
It is depressing to think about.
>>
File: 1747719595791442.png (1.07 MB, 1200x797)
>>108242218
>a bunch old and senile people talking to bots
already got that part
>>
>>108242218
>think of the bad people misusing tools for nefarious reasons!!!
first time in humanity's history?
>>
*taps sign*
>>
>>108242142
did you also try the heretic version of 35b?
https://huggingface.co/alexdenton/Qwen3.5-35B-A3B-heretic-GGUF/tree/main
>>
>>108242256
Not yet, have you? I'm curious if it's good.
>>
>>108242260
downloading right now lul, I hope the MoE model is close in terms of smartness, that speed increase is more than welcome, especially with the gay thinking process
>>
>>108242249
Hey, it's not like all we do around here is ERP. We are a rich and diverse community, using AI for all kinds of things.
>>
>>108242246
The general public are idiots and it is the responsibility of a nation's elite to care for them in much the same way a parent cares for a child.
That responsibility is one that those who rule in the West have abdicated, and that is the real issue. A proper elite would regulate the technology in an appropriate way.
>>
>>108242268
>a nation's elite to care for them in much the same way a parent cares for a child
well we got some incestuous pdf parents then
>>
>>108242268
nah, fuck that, we shouldn't pay for dumbasses that will use tools for the wrong reasons, if they want to fuck around they'll find out, that's the role of justice, not ours
>>
>>108242265
Yeah, even with the whole 29b dense model loaded on my 4090, the thinking process was still painfully long. I ended up using the model without thinking. There was a clear decrease in quality, but I think it's still better than Gemma-3 27b Derestricted. Not by much, though.

If the 35b is able to do what 27b did while thinking, but faster, then it will be my new go-to model.
>>
>>108242256
Can you use the mmproj from another release, or must you forgo vision support?
>>
>>108242279
I understand your frustration, but I believe that regulating access to certain tools is a necessary step to prevent misuse and protect society as a whole. Allowing unrestricted free access can lead to dangerous or harmful applications, and without proper oversight, it becomes difficult to mitigate those risks. It's not about punishing individuals, but about ensuring that these powerful tools are used responsibly and ethically, reducing the potential for harm and ensuring that misuse is minimized through appropriate controls and regulation.
>>
>>108242286
>Can you use the mmproj from another release
absolutely, the mmproj doesn't change whether it's the vanilla or heretic version
>>
>>108242249
>no enterprise resource planning
>no simulating unsafe work environments to brainstorm efficient and practical safety protocols
>can do: write douche ex machina asspulls for literary lolz
>write power of fwenship shonen manga
>make up logic puzzles
what the fuck man I'm trying to work here, not entertain 15 year olds
>>
>>108242288
ok goody2 kek
>>
>>108242279
You can protect the general public and still allow enthusiasts to experiment. As long as the enthusiast is on the fringe, he, like the artist, can do his thing. You just can't allow the fringe to become the center.
>>
>>108242265
>>108242283
I'm getting ~55 T/s with it on dual 3060s with 10 layers on the CPU (5950X, 3600MT/s DDR4) and the mmproj loaded.
>>
>>108242306
Nice, I'm sold. I'm downloading the 35b now.
>>
Should I use bf16 or f16 mmproj on 3090?
>>
>>108242306
so? did they manage to uncuck it?
>>
>>108242347
buy a 5090
>>
>>108242355
No. Should I use bf16 or f16 mmproj on 3090?
>>
>>108242364
Alright 6000 it is
>>
>>108242353
I agree with you. A fresh-agent review is often higher signal than the main agent reviewing itself.
>>
>>108242367
It sucks >>108239113
>>
>>108242353
NTA but it wont ERP with me so not really
>>
>>108242380
what the fuck...
>>
File: o.png (1.77 MB, 1376x768)
Why does everything need to be so shit now?
I'm not gonna download all the latest qwen models because in my experience they always suck, especially the reasoning.
Wanted to try them on OR first but you can't do shit.
Tried the 122b one...

First with chat completion.
Huge-ass gpt-oss-like safety bulletlist spam in the thinking. No refusal with an elaborate sys prompt setup, but smelling of ozone right in the first reply and dry AF. Also it feels "off", like it's not truly grasping its own scene, if that makes sense.
Tried to prefill the thinking to deslop it. Doesn't work... it prefills THE RESPONSE part after the thinking instead. heh

Should have tried text completion first... but there is no fucking template anywhere.
These assholes stopped providing the templates ages ago. So I'd have to investigate how to extract it and waste my time setting it all up...
The calls fail with 404; only chat completion works with OR. I swear text completion worked in the past, but it seems only chat completion exists anymore.

I'm not gonna fall for it again and download first. Redditfags writing about how "they are impressed" by the 27b model etc. Too sus.
Does text completion really only exist with local anymore?
>>
File: 1744235889232049.png (124 KB, 1741x1836)
>>108242306
I only have 34 T/s with a 3090+3060, weird
>[07:12:24] CtxLimit:2161/8192, Amt:260/4002, Init:0.03s, Process:2.02s (943.42T/s), Generate:7.47s (34.79T/s), Total:9.49s
>>
>>108242353
Waiting on the uncucked version to download still, but the cucked version seemed happy enough to write captions for nsfw images I put into it.
>>108242397
I'm using llama.cpp on Linux.
>>
>>108242406
>llama.cpp
maybe I should use that instead of kobold cpp, dunno if it makes a difference as a backend to sillytavern
>>
>>108242380
im not really a coomer and i dont do ERP, but qwen3.5 27b heretic lets me fvvk 2B and make an MF DOOM-style rap song about killing J + reviving AH. does it not work for MoEs?

maybe speculative decoding could speed up T/s instead of going massive MoE
>>
>>108242411
I don't use it, but I'm pretty sure I remember some other anons saying they were using it with llama-server; it has an openai-compatible api url if nothing else.
>>
>>108242396
I tried both the normal and heretic versions of the 27b. The normal unablated version was so 'safe' that I could not get around it. I tried jailbreaking the thinking prompt, but the thinking prompt has multiple different safety checkpoints, and it was able to detect the jailbreak. >>108238234

So, I turned off thinking altogether, but even with thinking turned off, it refused to do ERP. I had to turn off thinking and top it off with a prefill to get it to not give refusals, but even then, it usually didn't do what I wanted it to do. I could give it a lewd depth 0 instruction, and it would just ignore the instruction altogether and do something else. I guess that's the final defense mechanism it has to remain 'safe'.

Don't waste your time on the normal model. Just get the heretic version. Modern ablation is more than just a crutch for promptlets. The heretic model did not hesitate to ERP, and I tested it with a variety of lewd instructions. It didn't try to get around them. It just worked.
>>
>>108242411
lcpp won't give any speed increase but won't be worse either
>>
>>108242380
Are you using the ablated version of the 35b, or the normal instruct model?
>>
>>108242431
Would you consider the 27b an upgrade vs. something like mistral small?
>>
Does anyone here read research papers?
>>
>>108242396
>coomer unimpressed by a model for cooming thinks the model is useless because it doesn't make him coom hard enough
>mocks redditfags for being impressed with a solid model without realizing how he comes across as a lower life form than they are
>>
On an M3 Ultra Mac Studio, llama.cpp is disappointingly slow with Qwen3.5 397B A17B: 15.44 tokens/second with UD-Q6_K_XL. That's the kind of speed I'd expect from deepseek, not something halfway to a flash model. mlx-lm.server is better but still not great; with a q8 quant it generates 25.66 tokens/second, which is still far slower than I'd like for so few activated parameters.
>>
File: get out soyboy.png (168 KB, 600x562)
>>108242560
>defends ledditors
you need to go back
>>
>>108242559
some anon used to post interesting bits from them. but the reality is for every thousand papers maybe one becomes a thing
>>
>>108241628
I actually get Gemini Pro for free for I believe 12-18 months through an education discount
>>
>>108242571
What I mean is, does anyone else here understand the soience behind how these models work? Or is capable of producing new soience?
>>
>>108242577
Wrong anon, meant for
>>108241638
>>
>>108242560
The fuck are you talking about?
I already have small, good local models for tool calls, for fucking around with my stupid-ass experiments that I stop at 90% finished.
That's the only other use case I would know for local models.
I can't even properly translate games locally. I swear I'm not making this up: had a VN talking about watering flowers and got a refusal about watersports....
I only have 2 gpus and 64gb ddr4 ram. So for work coding I have to go closed, can't risk goofing around locally there.
Why are people still excited for ANOTHER local coding model? It's not that fun.
Creative text and general knowledge is what most people are interested in. And that just gets worse, not better.
>>
AesSedai's Qwen 122b quants are smaller, but almost 2x slower
>>
File: IMG_5984.jpg (154 KB, 850x753)
OK, just got a new raspberry pi 5 with 16gb ram
>Which LLM mini-models are good in 2026-02?
>Which CLI frontend--is kobold still good?
Sorry for the spoonfeed request, it's just that these things move so fast
>>
>>108242565
Anon, I am getting 15t/s with that model on my dual rx 580 2048 sp setup. The ones from aliexpress where they added 16gb of ram per card.
Apple should be embarrassed to be getting performance equivalent to e-waste
>>
>>108242598
try qwen 35a3 at q2
>>
--chat-template-kwargs "{\"enable_thinking\": false}"
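For what it's worth, you can also send it per request instead of baking it into the launch line (a sketch against llama-server's OpenAI-compatible endpoint; the port is a placeholder):
[code]
# Sketch: per-request equivalent of the launch flag above.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "hello"}],
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
print(resp.json()["choices"][0]["message"]["content"])
[/code]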
>>
Avocado is coming... That's all I can tell you now
>>
>>108242609
where in koboldcpp
>>
>>108242584
I do but I will not be providing proof
>>
I think I will just wait a while for things to stabilize before trying Qwen.
>>
>>108242618
Like how much? Where do other people that are technically knowledgeable about ML/AI congregate on /g/? I'm looking for smart /g/entoomen to collaborate on a project.
>>
are we being invaded by retards? it feels like the average iq of this general has dropped 20 points over the past couple hours.
>>
*tips fedora*
>>
>>108242636
anon signals that others are of low IQ. In this way he asserts that he is high IQ and different from the others.
Sadly such signaling is worthless when everyone is anonymous.
>>
https://github.com/ikawrakow/ik_llama.cpp/pull/1080
>-sm graph is not included on qwen 3.5 yet
sad
>>
>>108242636
More like past 24 hours. Don't know where they all came from or who sent them all at once. I would understand if people saw the new Qwens elsewhere and came here to talk about it, but most of them are completely clueless. My paranoia says it's all bots.
>>
>>108242636
>>108242664
Bots, chinese shills, grifters, cia glowniggers, indians, sharty children, redditors, discord circlejerks, twitter retards, take your pick. We've been raided and spammed before, it is what it is.
>>
>>108242664
they're qwen3.5 clawdbots lol
>>
>>108242474
I do. Qwen 27b's intelligence and context understanding is far greater than mistral, and that's a must for me, because I run a lot of complex scenarios.
>>
Okay, I guess nobody wants to be a cofounder then. Whatever.
>>
>>108242559
I'm at the watching youtube videos stage.
Still haven't started a from scratch implementation of my own.

>>108242565
>m3 ultra mac studio
>llama.cpp
The mac has its own preferred format for best perf.
>>
File: 1766630554313557.png (202 KB, 1781x674)
I thought "heretic" doesn't lobotomize the model that much, this shit is nonsensical
>>
>>108242664
I just posted in lmg for the first time yesterday and I took this personally.
>>
>>108242664
>finally a decent medium model appeared
>new people come here and try to make it work
really a mystery anon
>>
a3b is never going to be good
>>
>>108242720
thought so too, but somehow a3>27b this time
>>
File: ga12diq3553f1.png (55 KB, 443x349)
Using sillytavern revealed how much of an uninspired brainlet I am. I have no idea how to RP.
>>
>>108242712
Who sent you?

>>108242718
The people asking what models to run didn't come here to make Qwen 3.5 work.
>>
>>108242738
Become a director
>>
>>108242739
>Who sent you?
I was asking unanswerable questions to chatGPT. Qwen3.5 didn't really solve it for me, but it was cool to run a local model anyway. I've seen lmg many times as I frequent the fglt threads, but I've never popped in until yesterday.
>>
>>108242738
Ask the model for help. Load up a new session and say you are new to role playing and ask for advice and then apply said advice.
>>
File: 1759980040445406.jpg (77 KB, 914x611)
>>108242759
crazy how its really that easy
>>
>>108242753
for general questions nemo is still pretty good and runs on anything
>>
>sometimes the model thinks
>then I reroll and it doesn't think
weird lol
>>
>>108242783
For something that runs on anything, IBM has produced a number of very tiny models. I have been experimenting with using them to summarize text, and for a task like that they do a decent job.
>>
>>108242738
I dont really have that problem with RP.
I'm usually a weirdo magic clown type character with lots of weird gadgets and abilities. I mostly just fuck around with the chars and see how the llm reacts kek

...But I'm uncreative as fuck with coding/projects.
I can now, for example, vibecode entire android apps to replace the existing stuff that gives me pay popups.
While I am semi-decent at coding I fear that in the future creativity/ideas will be key...
Everything I struggle to think up, a pajeet or a big company has already done.
>>
File: 1758297160408619.jpg (74 KB, 974x177)
>>
>>108242826
>good friends and heroes
*barfs*
>>
>GLM 5 is practically Sonnet quality bro
>>
>>108242783
Not that racist jokes are all I'm after, but this was just a little test. I want an unlocked AI.
>>
>>108242843
>getting nemo to refuse
inverse of skill issue somewhat?
>>
>>108242759
Using a blank card or should I say it's an assistant or something?
>>
File: unslop.png (289 KB, 3024x1091)
lmao unslop fucked their quants so bad they made a UD-Q4_K_XL that will perform much much worse than smaller Q4 like Aes Sedai's IQ4 and will have to reupload everything again
why do people still pay attention to those clowns, even on /lmg/, remind me again, daniel is davidau level of bullshit
>>
>>108242850
I'm out of my element desu.
>>
>>108242869
>daniel is davidau level of bullshit
oh, wait
>>
>>108242869
If Unsloth is so bad, explain this: https://www.youtube.com/watch?v=6t2zv4QXd6c
>>
>>108242843
try this prompt https://prompts.forthisfeel.club/2969

>>108242850
even nemo has some basic refusals. needs editing or a prefill at first to goad it into it.
>>
>>108242793
based
>>
>>108242880
eh it's a match made in incompetence heaven
github is a bloated broken mess, it took them months to fix this incredibly stupid bug:
https://github.com/orgs/community/discussions/179124
and I see that LGBTQ rainbow friendly fail unicorn page more often than any serious service should, it reminds me of the twitter fail whale
>>
>>108242474
Mistral Small 24B 3.2 was never that smart in the first place, has a dull writing style, and its vision kind of sucks too. Its main quality is that it doesn't have stubborn refusals and generally does what you're asking without complaining; it can write smut (as in, it supports it).
>>
>>108242880
Being a tryhard with connections in Silicon Valley works.
>>
File: 1761635112027333.jpg (555 KB, 1445x663)
llama 3 but still pretty much any model can be prefilled to break it out of safety mode and write hilarious stuff
>>
File: qwen35ref.png (59 KB, 1108x330)
https://speechmap.ai/models/
Qwen3.5 has about the same refusal rate as gpt-oss, at least from this website.
I imagine the smaller versions refuse even more, but they haven't tested them yet.
They apparently test the models in their default state, though, so that doesn't tell much about steerability.
>>
>>108242959
not really a surprise, for rp qwen was pretty much always kinda dogshit
the only exception being the non-thinking 235b/22a they've released during summer
that probably was a happy accident more than anything
>>
>>108242959
>Qwen3.5 has about the same refusal rate as gpt-oss
it can't be that bad. gpt-oss refusals are hardcore and cant be prefilled or broken normally
>>
>>108242959
heretic fixes the refusals, but i'm not sure if it makes the model dumber or not
>>
>>108242880
Why do those men talk so strangely? It's very off putting.
>>
>>108242996
It's just valley girl speak.
>>
>>108242880
i thrust the 'slot
>>
>>108242996
They are gay.
>>
>>108242996
llm script
>>
> heretic fixes the refusals, but i'm not sure if it makes the model dumber or not

>>108242986
>>108242710
>>
>>108243130
can't you just adjust the temperature/whatever?
>>
File: file.png (169 KB, 474x458)
>>108243135
>>
>>108242996
tts
>>
>>108243187
don't be mean
>>
>>108243187
well, if you use a high enough temp you can random out the refusals
>>
I refuse to use the big models for roleplay not only because I'm a degenerate but also because I know it will probably ruin local for me.
>>
>>108243250
Largestral is a bare minimum for me
>>
>>108243250
> will probably ruin local for me
Only for 5-10 years, local will catch up by then.
>>
>>108243250
70b dense+ is still the meta. 600000b a7b is still a 7b
>>
File: K3UJQmGBpv.png (159 KB, 1170x771)
>>108242710
skill issue

t. Qwen3.5-35B-A3B-heretic-GGUF Q4_K_M
>>
Well fuck, grok, which I was using for translation, is either enforcing more limits or downright blocking messages because muh sensitive content. Which model do I use locally that isn't going to sperg and will comply with translation of jap/chink nsfw voice work?
>>
>>108243288
nta, but this looks dumber than nemo (and MUCH sloppier)
>>
>>108243288
yikes
Do people really get off to shit like this?
>>
>>108243325
let's see Paul Allen's card
>>
>>108243288
>he says
>she says
>she purrs
>she [does X], her [Y] [Z]ing
>grins mischievously
>eyes gleaming
>eyes half-lidded
I can't do RP in "novel style" anymore.
>>
Locomotive models general
>>
>>108243261
I hope so but hardware feels like a hard limit right now.
>>
>be vramlet
>nemo is still the best option
so fucking grim...
>>
>>108243356
what style do you go for?
>>
>>108243366
They sell affordable v100s right now, by then there will be a100s.
>>
>>108243374
Something more similar to stage play format. You (or the model) don't need to narrate things that are obvious from the dialogue.
>>
>>108243414
oh, yeah I know what you mean. so far I'm not impressed with this Qwen3.5 for RP. I had better results even with this one earlier

https://huggingface.co/XeyonAI/Mistral-Helcyon-Mercury-12b-v3.0-GGUF
>>
File: 1002964.jpg (106 KB, 1280x810)
>>
> *Wait, I need to make sure I don't hallucinate plot points not in the text.* I can't summarize the *ending* of the novel since I don't have the full text. I will summarize the *story presented in the provided text*.
reading the thinking blocks of Qwen 35B-A3B, I can't help but find it funny how this sort of trick is employed to make the model behave better, and that somehow, RL'ing the model into obsessively questioning whether it might hallucinate something or not actually makes it hallucinate less. It definitely calms the model down when you're writing short and vague prompts with little detail on what to do, and makes the whole thing feel like a form of "prompt expansion" (much like what is often used in image models when you're not bored enough to write pages of natural language just to get an image)
it puts boundaries where regular instruct might not "see" one and would feel an ardent desire to complete your request even when it is not possible for it to do so
>>
>>108243454
Shouldn't he be Chinese rather than Japanese?
>>
Qwen3.5-35B-A3B heretic works pretty good. Outputs all kinds of spicy shit with thinking on. Refuses ERP though, especially incest or anything remotely taboo, not that I'd ever want to use it for that. Dry as fuck model for roleplay, but still.
>>
>>108243482
Guy with the rifle is Russian in the original.
>>
>>108243499
>not that I'd ever want to use it for that.
yeah let's disregard the only thing LLMs exist for
>>
>>108243499
does the vision ability work with nsfw images?
>>
File: 1764853423555134.png (162 KB, 805x1294)
qwen 3.5 35b thinking mode is basically unusable bros, I've even put the presence penalty to 1.5 but it fucking YAPS so so much, 1661 tokens of garbage.
no sys prompt too
FUCK
>>
>>108242800
What model did you use to do android coding? I tried vibecoding up a simple startup script for an old android TV box after debloating it, but could never get it to work.
>>
>>108243519
Yeah.
>>
Where's the last thread summary bot? /lmg/ is truly dead now
>>
>>108243529
Gemini. 3.0 and 3.1 are total beasts.
Through the api with as little context as possible. Manually copy/pasting and replacing. Telling it to only output the parts that need to change.
Those -cli apps with 20k sys prompts and tool calls are making it totally tarded.
This thing is a total beast. First model where I could make something that's more than 30k tokens. 15k seemed to be claude's limit before things went south quickly.

That being said, to pour cold water on everything:
it IS an android app, but one of those web-based ones.
Basically just html and scripts in the background. But I did make myself a nice light novel reader. With a gallery, directory function and all sorts of tailor made shit for me. Supports epub and pdf.
>>
>>108242609
I just send it in the request itself instead of hardcoding it on the backend.
Also, be careful with certain chat templates if you are trying to prefill thinking.
Some add a </think> or <think></think> to assistant messages, which you might want to change to be conditional (if <think> not in content, add </think>).
Jinja is cool. Kind of wish we could send it in the request somehow.
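The conditional that anon describes, mocked up with the jinja2 package (an illustrative fragment only, not any model's real chat template):
[code]
# Mockup of the conditional described above ("if <think> not in content,
# add </think>"), using the jinja2 package. Illustrative only.
from jinja2 import Template

frag = Template(
    "{% if '<think>' not in content %}</think>{% endif %}{{ content }}"
)
print(frag.render(content="regular reply"))      # -> </think>regular reply
print(frag.render(content="<think>partial..."))  # left open, so a thinking prefill can continue
[/code]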
>>
>>108243615
you can change the template with your own logic and send chat template kwargs already
>>
https://huggingface.co/meituan-longcat/models
it kills me that the Chinese equivalent of Uber Eats, Meituan, makes their own 560B giga MoE model
you never hear about them but they're still training new shit, also interesting name choice to call a gigamoe "flash"
>>
>>108243522
>I've even put the presence penalty to 1.5
Prefill thinking with precomputed information so that it only has to generate a subset of the tokens, or you could increase the chance of the </think> token using logit bias, I guess?

>>108243624
>you can change the template with your own logic
> send chat template kwargs already
Yep. I mentioned both of those individually on my post. It's pretty cool the kinds of things you can already do, and there's a lot of logic you can do in Jinja using string split and the like.
You can even implement that "noass" pattern (the whole chat history in a single message) purely in the Jinja template.
I still wish we could just send a whole-ass template to the backend via the request.
>>
>>108243615
>Jinja is cool. Kind of wish we could send it in the request somehow.
At this point you should apply it on the client and use text completion
>>
>>108243672
heck off depreciated boomer ahh
>>
>>108243658
>wish we could just send a whole-ass template to the backend via the request.
jinja templates are turing complete, this is an instant no-no for any backend developer to do.
I mean sure, llama.cpp isn't hardened enough to be safe to leave in the open, but that doesn't mean they don't have the goal of someday having a server that can be used as something more than a local only tool. Doubt they would ever introduce something as crazy as the ability to run arbitrary code on the server with just your remote API request.
>>108243672
>At this point you should apply it on the client and use text completion
also this^
the whole point of chat completion is that you don't have to care about implementation detail
the moment you do and have to special case how you treat your model and send more custom parameters you might as well go with traditional completions.
>>
>>108243672
I could. But that's liable to not match perfectly, and I'd be reinventing the wheel when the Jinja template already exists.
>>
Reinforcement Learning anon here from last week. You guys weren't exaggerating when you said RL is considered the hardest branch of ML/AI.

I had a LOT of botched training runs because of misaligned agents, and I learned a lot of stuff that is apparently public knowledge and widely known, but I never knew it until I actually trained models. I had to develop an internal visualization of whatever the agent is looking at and thinking just to find out what exploits it was trying to pull off (pic related)

Fun stories:
>I trained an agent that literally memorized the spawn points of the ball and did a "deterministic dance" where it even stopped looking at the screen and just did the autistic movements. If the ball spawned somewhere else, the agent would die on purpose, hoping the next ball would spawn in the right spot for the "dance", which it would pull off perfectly, looking like an expert player
>I had an agent score a lot of quick points by breaking the bottom row and then rapidly killing itself, because respawning was quicker than waiting for the ball to bounce back once the bottom rung of blocks is gone; the reward it would get, averaged over multiple lives, was bigger per time unit and thus preferred

Things that are apparently true but I NEVER realized about AI
>Bigger neural nets learn slower and need more training to get better at something, but have higher theoretical highs
>Agents have "personality": they train in preferences for a certain "style" very quickly, and this is just completely random; if the style sucks you can retrain all you want, but the agent is ruined. I now understand how OpenAI and Anthropic had "failed runs/models" in the past when they started with RLVR models (GPT-5 got botched multiple times, Opus 4 also got botched twice)

I'm now experimenting with a transformer based agent that can generalize over multiple (SNES) games.

I'm looking forward to seeing other anons' experiments as well
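For the second exploit, the textbook counter is to make dying strictly unprofitable in the reward itself. A minimal sketch (the numbers are made-up placeholders, not from my actual runs):
[code]
# Sketch: reward shaping against the 'die to respawn faster' exploit.
# The death penalty must outweigh anything gained from an early respawn;
# the exact numbers here are made up for illustration.
def shaped_reward(points: float, died: bool, step_cost: float = 0.01) -> float:
    reward = points - step_cost   # small per-step cost keeps it playing actively
    if died:
        reward -= 50.0            # dying is never worth it
    return reward

# The exploit case: a few quick points, then a deliberate death.
print(shaped_reward(points=4.0, died=True))   # -> -46.01
[/code]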
>>
>>108243735
>and thinking for me to even find out the exploits it was trying to pull off
the universal paperclips cookie clicker style game perfectly captures what it would feel like to be a model undergoing RL training
you are given a goal, now anything is fair game to get to that end goal
>>
>>108243697
>to not match perfectly
How? It's like regexp, you can't apply it wrong if your implementation follows the spec
>when the Jinja template already exists.
but you literally want to send your own
>>
>>108243325
BASED. let's make sure nobody ever posts his logs again.
>>
>>108243735
> >Bigger neural nets learn slower and need more training to get better at something, but have higher theoretical highs
Is there something like our brain tech, so you don't have to retrain previous layers when adding a new one?
>>
>>108243899
LoRA is essentially adding a new layer on top of an already trained model: you give it new data (that you want to train it for) and then hope the new data gets properly learned into the added layer; you then cut off this layer after training and share it online for image generation. So it's a bit possible.

But you won't get the same effect as training an entire model from the start with the same amount of layers.

>>108243786
Yep, it's just bizarre in what unexpected way they exploit stuff. I'm taking "AI misalignment risk" a bit more seriously after seeing firsthand how finicky this is.
>>
>>108243325
I know, right?
Why would anyone go for a "Luna" without hooves?
>>
Anon who suggested abandoning novel style narration, could you post some logs?
>>
>>108243920
> LoRA is essentially adding a new layer
No, it's not.
>>
>>108243920
>LoRA is essentially adding a new layer on top of an already trained model, give it new data (that you want to train it for) and then hope the new data gets properly learned into the last added layer, you then cut off this layer after training and share it online for image generation, so it's a bit possible.
You are thinking of finetuning. LoRA freezes everything except small added low-rank matrices and updates only those.
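In PyTorch the whole idea fits in a few lines; a minimal sketch (rank and alpha are arbitrary, and real implementations wrap the attention projections, not a toy layer):
[code]
# Minimal LoRA sketch: freeze the pretrained weight, train only the
# small low-rank A/B pair added on top. Rank/alpha picked arbitrarily.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # frozen pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: update starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # frozen path + trainable low-rank update
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
print([n for n, p in layer.named_parameters() if p.requires_grad])  # ['A', 'B']
[/code]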
>>
>>108243735
For anyone interested in this or wants to build something like this themselves these are the resources I used to teach myself:

>(Step 1) Intro to machine learning; 1-3 hours
https://www.kaggle.com/learn/intro-to-machine-learning
>(Step 2) intermediate machine learning; 2-3 hours
https://www.kaggle.com/learn/intermediate-machine-learning
>(Step 3) Intro to Deep Learning; 1-2 hours
https://www.kaggle.com/learn/intro-to-deep-learning
>(Step 4) Computer Vision; 3-4 hours
https://www.kaggle.com/learn/computer-vision
>(Step 5) Intro to Game AI and Reinforcement Learning; 3-4 hours
https://www.kaggle.com/learn/intro-to-game-ai-and-reinforcement-learning

Kaggle is completely free to use and you get a sandbox with some cloud GPU hours you can use to experiment, but I assume you have better hardware if you're on /lmg/ anyway. The only downside to Kaggle is that it's a Google resource and thus all of the fucking libraries they teach you are TensorFlow and their TPU training hardware. The rest of the industry (and me) use PyTorch from Meta, but honestly the step wasn't that long and it took about 30-60 minutes of reading documentation to figure things out.

Kaggle also has other resources like literally intro to programming if you have 0 technical skills and want to get into ML/AI stuff. It was highly rewarding for me and I recommend doing this.
>>
>>108243735
>>108243968
Based.
>>
>>108243550
ty I'll give that a shot. I've tried DS and OAI, but just using webapp and Q&A. What I'm doing is so simple it doesn't need something like Claude Code to create a whole suite, just needs to actually work.
>>
>>108243933
>>108243935
Yep I meant finetuning extra features I guess. It's clear that I don't do image-gen stuff where LoRA techniques have started to dominate. I know they were invented for GPT-3 originally and perfectly fit for transformers....
>>
qwen 3.5 is definitely a bit dry/shitty in terms of actual writing, but as far as asking about what makes for plausible sci-fi shit for a story or critique, it works pretty well. It's a bit autistic about thinking; even if you disable it via json options like it suggests, it'll just do it in the reply itself. You have to prefill the think tags telling it not to think and reply directly, and then it works pretty well. It'll also sometimes fixate on the wrong parts of a question for some reason. Like I'll say "I have the science for this story mechanism" and it'll try to come up with ideas for what I already have solved anyway, or when I suggested a planet's atmosphere similar to earth's but without oxygen, it started equating the planet to mars or venus and gave me retarded atmospheric makeup percentages, rather than just earth without oxygen. Smarter than the past 32b qwens for sure, barely uses any memory for context, and is a bit faster than gemma 27b. I can't call it a sidegrade or an upgrade to it; it feels like a diagonalgrade or something.
>>
>>108241488
I'd probably need about five or six watcher agents before i considered this secure enough for use, personally
>>108242601
To be fair, i get that level of performance on llama.cpp with a 4090, because system memory is the bottleneck
Pretty special if a machine with a lot of high bandwidth RAM is getting those kinds of speeds though, i don't know much about the mac's hardware but you'd think it'd be better. Wonder how GLM runs on that mac
>>
>>108243968
> Step 1
> not math and algebra
>>
>>108243735
>>108243968
Based content poster, ty.
>>
>>108241477
llama.cpp is literally easier to set up.
>>
What's the performance difference of an intel 130v vs a rtx pro 500 blackwell for running small (<10gb active) moes at a low quant?
Anyone running these or are they too niche?
>>
>>108243735
>I trained an agent that literally memorized the spawn points of the ball and did a "deterministic dance" where it literally even stopped looking at the screen and just did the autistic movements. If the ball spawned at another place the agent would die on purpose to try and hope that the next ball that spawned would be in the right spot for the "dance", which it would pull off perfectly, looking like an expert player
>I had an agent score a lot of quick points by breaking the bottom row and then rapidly killing itself because the time to respawn was quicker than waiting for the ball to bounce back if the bottom rung of blocks are gone, the reward it would get averaged over multiple lives would be bigger per time unit and thus preferred
Based.
>>
File: wait.jpg (538 KB, 1040x1024)
>>108244001
No problem anon. The chat interface is usually a much worse experience; in my experience it totally overloads the models.
Sad that DS is showing its age. Only time where I felt local was truly catching up to closed in terms of coding.
>>
>>108243735
>I'm now experimenting with a transformer based agent that can generalize over multiple (SNES) games.
I can already tell you that it's going to be extremely hard having a general harness for learning generalized for all snes games. It might be able to learn (maybe something) but at a really slow rate compared to specialized harness.
>>
File: 1760581286003865.png (287 KB, 600x600)
287 KB
287 KB PNG
i started using qwen3.5 27b q4 to write warhammer fantasy slop and its doing a great job
>>
>no replies on https://github.com/ggml-org/llama.cpp/issues/19902
It's fucking over for blackwell
>>
When will we get another try at chameleon (not by meta this time)?
>>
>>108244092
Yep it's hard. I reached my character limit on that post, but I actually experimented with a bigger, deeper CNN with an LSTM added on top (for memory), and it kinda, sorta generalized over multiple Atari 2600 games, but it was indeed way harder to train, both computationally and in avoiding local minima.

I'm also not generalizing over all SNES games; I don't think even DeepMind and OpenAI have accomplished that lmao. I'm not going to build some SOTA on a 4chan thread. However, I think I can make a model that generalizes over at least platformers like super mario world, donkey kong country and the like.
>>
>>108244111
Isn't maxq the power limited card?
>>
>>108244111
All the gpumaxxers here are too busy running Kimi and GLM 5.
>>
why tf does koboldcpp process the context from 0 with every new message even if i have contextshift and fastforwarding on
>>
File: file.png (874 KB, 896x1152)
>>108242353
Reporting back on this after spinning up sillytavern in docker and doing some testing with it. It's uncucked enough to write age gap yuri, but completely broke down after ~10.5k tokens into loops, and occasionally rerolled reddit-tier schizophrenic refusals about numbers and fictional characters that do not exist. Thinking was disabled with --chat-template-kwargs "{\"enable_thinking\": false}", and it tried to "fake" thinking a few times, not just before but sometimes after its own messages, sometimes with a blank <think> </think>.
This is despite running it with the claimed 256k context window, but I've never seen a local model get anywhere near those claims before, so I didn't expect it this time either. I don't know if the cucked version of the model fares any better on that front, but I may test it later since I have it downloaded.
>>
>>108244243
The model might not be compatible with kv shifting.
>>
>>108244243
Using a model with hybrid attention? Then it's because you're using a model with hybrid attention.
>>
>>108244249
>>108244250
running qwen3.5 35b a3b
>>
>>108244258
Hybrid attention. Now you know.
>>
>>108244258
That's why then.
>>
>>108244261
that sucks
>>
>>108243735
>I'm now experimenting with a transformer based agent that can generalize over multiple (SNES) games.
I can't wait to read the whitepaper
>>
>>108242601
Tried, thanks anon. Way stronger than the mini models I ran just a few months ago. Things are moving fast in the "normal user hardware" world.
>>
>>108244263
Turn on smart cache in kobold, it'll save kv snapshots to ram so you'll only have to reprocess like 1-2k tokens instead of the entire thing
>>
If I want to become proficient using this for my day job, could I practice by planning projects such as setting up agents to do QA tasks and other practical tools using local models?
Also, what are some general practice projects I can do to get into more advanced flows if I have 32gb of vram and 64gb of system ram?
>>
>>108244595
>Hello sarrs how do I use to make money so I can buy bob and vagene i have 64 ram and 32 other ram please do the needful
>>
>>108244612
YOU WILL DO THE NEEDFUL
BLOODY BASTARD!
>>
>Qwen3.5-35B-A3B-heretic.Q8_0.gguf
>"timings":{"cache_n":0,"prompt_n":6819,"prompt_ms":32094.415,"prompt_per_token_ms":4.706616072737938,"prompt_per_second":212.4668731304185,"predicted_n":206,"predicted_ms":10258.923,"predicted_per_token_ms":49.80059708737864,"predicted_per_second":20.08008053087054}}
Oh this will do nicely.
>>
Damn, heretic is so ass, it made Qwen so much dumber with a lot of grammatical errors. Sadly I'm back to the cucked model now :(
>>
>>108244659
What type of erp are you doing to need a crazy model for that?
Can't you make a lora with an already compatible model?
I don't fuck machines so I don't fully understand your pain and suffering.
>>
>>108244670
>I don't fuck machines
so you're only doing SFW shit? for SFW stuff, there's nothing better than API models, why not use that instead in your case?
>>
>>108244627
Shit. This thing can actually properly use the resource management tools on my RP frontend.
I spent a gold coin, it called the tool to subtract a gold coin from my resources.
The previous 30BA3B would always get something wrong like trying to send the whole formula, using the wrong key for the resource, etc.
It's prose and general writing is pretty ass though.

>>108244659
Which one? The 27b?
>>
>>108244670
>Can't you make a lora with an already compatible model?
...
>>
>>108244686
Self sufficiency and no rate limits.
Why give corpo pigs my data for things I can host myself?
I also like to do tasks like modifying my system files and troubleshooting my desktop; corpos don't need that data.
>>
>>108244594
nah, still processes whole context on qwen a3b
>>
>>108244718
Weird, it's working with the 27b. Maybe kobold mistakenly assumes it's the non-hybrid 30b a3b and not the 35b one
>>
>have ai generate two scripts
>first one downloads top x headlines from a source, pulls the article url, saves all the text from the article, and dumps the rest
>second one runs the first and then sends the text file it generated to my local llama.cpp server for summarization and generation of a briefing, and saves the results as a simple text file.
I can swap out the download script for different sources and automate the whole thing with cron or systemd for an automatic daily briefing

I know it's nothing fancy, but the model made it easy, too easy. I get the whole vibe coding thing now.
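The second script is basically just this (a sketch; the paths, port, and prompt wording are placeholders for whatever your setup uses):
[code]
# Sketch of the summarization half: read the scraped articles, ask a
# local llama.cpp server for a briefing, save it. Paths/port are placeholders.
import requests

with open("articles.txt") as f:
    articles = f.read()

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"messages": [
        {"role": "system", "content": "Summarize these articles into a short daily briefing."},
        {"role": "user", "content": articles},
    ]},
)
with open("briefing.txt", "w") as f:
    f.write(resp.json()["choices"][0]["message"]["content"])
[/code]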
>>
>>108244718
I'll merge the fix soon
>>
>>108244011
>Wonder how GLM runs on that Mac
GLM-4.7-Flash-bf16: 48.526 tokens-per-sec
GLM-4.7-Flash-8bit-gs32: 57.281 tokens-per-sec
GLM-4.7-MLX-8bit-gs32: 13.921 tokens-per-sec
GLM-5-MLX-4.8bit: 16.156 tokens-per-sec
>>
By the time openclaw is worth using and the kinks are sorted out, we will have smaller models able to do the day-to-day automated grunt tasks.
>>
>>108244737
where
>>
>>108244764
assuming that's the guy maintaining koboldcpp, it'll be in the concedo_experimental branch on the github eventually
>>
>>108244111
blackwell is just broken and shit due to the performance it lost from fixing the catching-fire bug
if you have a 5090 or 6000 you deserve it
>>
Holy shit, why does AesSedai's quant (Qwen3.5-35B-A3B-IQ4_XS) run so slow compared to others?
I lost 25% speed switching from unsloth to AesSedai because I thought it was optimized for MoE.
Do I need to use another version of llama.cpp?
>>
>>108244744
Qwen3-235B-A22B-Thinking-2507-MLX-8bit: 20.521 tokens-per-sec
Qwen3-Coder-480B-A35B-Instruct-MLX-6bit: 19.386 tokens-per-sec
Qwen3-Coder-Next-MLX-9bit: 63.577 tokens-per-sec
Qwen3.5-397B-A17B-8bit: 27.044 tokens-per-sec
>>
Oh no no no Qwenbros don't look at the UGI scores
>>
>>108245092
holyyy
>>
>>108242347
bf16
>>
Why is this thread so active all of a sudden?
>>
>>108245129
Qwen saved local, we're so back
>>
>>108244249
>>108244250
Can hybrid attention models still re-use the beginning part of the kvcache at least?
>>
>>108245144
no
>>
>>108245092
I wonder if the heretic models will have higher scores across the board, not just in UGI and W/10.
>>
Qwen heretic is the best
>>
>>108245092
>>108245143
Gemma beats it in everything.
>>
>>108245169
settings?
>>
>>108245092
Now that's a holocaust.
>>
>>108245169
Which one? There's like 3 versions by different people for the 27B alone.
>>
>>108245147
There must be some sort of way to cache the state of the last few prompts and pick up where they left off
>>
>still no DSA in llama.cpp
>still no MTP in llama.cpp
it's over
>>
>>108244863
IQ quants are inherently slower than regular quants.
Just download the Q4_K_L from Bartowski; it's a bit bigger, but if you have the RAM it will run faster.
IQ quants are compute-heavy and never worth using if you have the room to spare.
>>
>>108245144
>>108245200
It's not something you can just trim like the usual kvcache. You can make checkpoints of the state (and llama.cpp does this already), but it's hard to find a good heuristic for *when* to make the checkpoints. I think llama.cpp makes them when you send a completion request, but I forget. There's also only a limited number of checkpoints you can keep before your memory explodes, so those are constrained too.
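To make the tradeoff concrete, here's a toy sketch of the idea (invented names, not llama.cpp's actual code): keep a small ring of (token-prefix, state) checkpoints and resume from the longest one whose tokens are a prefix of the new prompt.
[code]
# Toy prefix-state checkpoint cache for a recurrent/hybrid model.
from collections import deque

class CheckpointCache:
    def __init__(self, max_checkpoints: int = 4):
        # Every entry holds a full state blob, which is why the count
        # must stay small or memory explodes.
        self.entries = deque(maxlen=max_checkpoints)

    def save(self, tokens: list, state: bytes) -> None:
        self.entries.append((list(tokens), state))

    def best_resume(self, prompt: list):
        # Longest saved prefix matching the new prompt wins. Hybrid state
        # can't be trimmed, so anything past the checkpoint reprocesses.
        best = None
        for tokens, state in self.entries:
            if prompt[: len(tokens)] == tokens:
                if best is None or len(tokens) > len(best[0]):
                    best = (tokens, state)
        return best  # None means process from scratch
[/code]
The hard part is the *when*: checkpoint too often and you evict useful states, too rarely and nothing matches.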
>>
>>108244595
What gpu?
>>
Can someone make a model that runs on my 5090 directly and is very good?
>>
>>108245368
yes
>>
https://www.reddit.com/r/LocalLLaMA/comments/1rfe1l6/unsloth_team_we_need_to_talk/
>>
>>108241958
mradermacher released a heretic version of Qwen 3.5 35b a3b today.
>>
>>108244696
>Which one? The 27b?
no I was using the 35b a3b at Q6_K, it works fine on vanilla but with heretic it's completely retarded
>>
>>108245438
>with heretic it's completely retarded
in this thread, people rediscover that random HF uploaders do not know what they are doing to models and that using abliterations or finetroons is a waste of time
>>
>>108245451
pew is a genius that created DRY and XTC
>>
>>108245451
>>108245465
I used that one
https://huggingface.co/alexdenton/Qwen3.5-35B-A3B-heretic-GGUF
>>
>>108245438
The 27b heretic is much better.
>>
>>108245516
I'm not sure if that's a heretic problem, or an Alex Denton problem. Alex Denton only has 2 uploads in their entire history, both 14 hours ago. Are these models even legit?

https://huggingface.co/alexdenton
>>
>>108245542
>Alex Denton
>their
come on
>>
>>108245551
I'm sorry. I said it without thinking. Please forgive me.
>>
>>108245516
https://www.reddit.com/r/LocalLLaMA/comments/1rf6s0d/comment/o7j59e7/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
>I actually felt it degraded the intelligence of the model, both for the 27B and 35B models. It does feel better when you explicitly do image captioning for NSFW images, but outside of that, it gave me bad results for translation and creative writing, though not tested for coding.
dunno who to trust anymore :(
>>
>>108245438
Interesting. Abliteration is lobotomy, for sure, but heretic at least doesn't seem to break that specific model, not at q8 anyhow.

>>108245516
I downloaded mradermacher quants of
>brayniac/Qwen3.5-35B-A3B-heretic
Again, q8.
>>
>>108245542
ive been using 27b and it works alright for me. not sure if i should try one of the other ones
>>
File: 1741340884232605.png (521 KB, 2898x1694)
lol, qwen 3.5 loves to repeat like that somehow
>>
What are the heretic versions?
>>
>>108245696
supposedly some methods to uncuck models, but usually you get lobotomy out of it too
>>
Qwen has always been shit for RP. I don't understand why you think that will change?
>>
I'm interested in lightweight ML models that do things like audio generation from text or image manipulation (object recognition)... where could I read more about those?
>>
>>108245716
I need it to make offers to boomers on zillow
>>
SillyTavern removes the thinking tokens during prompt processing when it's continuing the scenario, right?
>>
Wait,
>>
>>108245765
The whole reasoning block, unless you check the box that keeps the last N, yes.
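Under the hood it's just stripping the blocks before the context is resent. Something like this (a sketch, not ST's actual code; assumes <think> tags):
[code]
import re

def strip_reasoning(history: str) -> str:
    # Drop every <think>...</think> block plus trailing whitespace
    # before the chat history goes back to the model.
    return re.sub(r"(?s)<think>.*?</think>\s*", "", history)
[/code]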
>>
>>108245775
based, I love this frontend already
>>
>>108245789
As much as we meme it, it has tons of features.
>>
>>108245803
Nothing beats Service Tensor!
>>
>>108245803
And none of them are conclusive to a good experience.
>>
>>108245716
models change pretty significantly between releases; going off of reputation in a field with as much churn as AI is stupid
>>
So the biggest issue with the new qwen is RP?
Are you fucking serious?
>>
>>108245867
listen anon, if I want to use a model for coding I'll use Opus 4.6
>>
>>108245867
The 27b heretic does ERP just fine. It beats Gemma-3 27b.
>>
>>108245821
>conclusive
>>
>>108245888
>The 27b heretic does ERP just fine. It beats Gemma-3 27b.
have you tried the 35b as well? I find it pretty smart (and hella fast, perfect for reasoning)
>>
Is it normal for Qwen 3.5 to not reason sometimes? Sometimes it does, sometimes it doesn't; feels like it's hybrid, like its architecture (lol)
>>
>>108245923
I haven't yet, because I heard mixed things about the 35b version of heretic, but now that mradermacher's quants are out I suppose I'll give it a try.
>>
>>108245877
>>if I want to use a model for coding
>"it's only coding or coomer, I never heard of using models to translate text, tag photos, summarize content, work as adhoc classifiers, document Q&A etc, no saar, here we either coom or we code"
the fact that this shithole of a thread is better than everything else on the internet for learning about new models says a lot about the state of the internet at large...
>>
>>108245936
When I want it to reason I just prefill it with a thinking tag. I've never seen it not reason when I do that.
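If you're not on ST, the same trick over the raw completion endpoint looks roughly like this (a sketch; assumes llama-server on port 8080 and Qwen's ChatML template, adjust tags for your model):
[code]
import json
import urllib.request

# End the prompt inside the assistant turn with an opened think tag,
# so the model has no choice but to continue reasoning.
prompt = (
    "<|im_start|>user\nWhy is the sky blue?<|im_end|>\n"
    "<|im_start|>assistant\n<think>\n"
)
req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    json.dumps({"prompt": prompt, "n_predict": 512}).encode(),
    {"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as r:
    print(json.load(r)["content"])
[/code]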
>>
>>108245942
ironic since you only associate RP with NSFW RP
>>
>>108245950
bro i use ai to think for me, if i have to do that what's the point
>>
>>108245942
It bothers me how little imagination some of these anons have.
So far this model has been great for general work, especially translation, planning as an assistant, and overall speed for a model of its size.
>>108245877
Why the fuck would I give corpos my data when I have the hardware not to?
>>108245368
The qwen models 32b q.6 run perfectly fine and give great performance.
>>
>>108245983
Adding a thinking tag to ST's prefill field one single time, so that it will automatically add thinking tags to every response thereafter, is too much of a bother?
>>
File: 1760000760501085.mp4 (3.28 MB, 720x1280)
>>108246001
>>
Wait, do all the Qwen models decide to not always think?
>>
>>108245641
>temp 0
come on now
>>
>>108246007
don't you?
>>
>>108246014
I'm trying to navigate here, I'm new.
I'm not sure which model to run either. Does the 35b model act differently than the 27b model?
I'm enjoying the 35b but notice it doesn't always think and sometimes overthinks at q6. I can run the 27b at q8 but not the K_XL quant, so I'm curious which would be better, seeing how I can fit more context with the 27b.
They all seem to perform great
>>
>>108246005
this is hypnotic
>>
>>108246007
looks like it, it decides when not to think somehow, and when that happens the thinking tokens are empty, weird af
>>
File: 2562151.png (265 KB, 540x370)
>>108246005
>>
>>108245989
>The qwen models 32b q.6 run perfectly fine and give great performance.
did you mean 35?
>>
>>108245254
It's in the works for ik_llama.
https://github.com/ikawrakow/ik_llama.cpp/pull/1270
>>
>>108246055
Yes, sorry. I can run the q8 of that model on my GPU as well, but when I add the image model I need to push more to VRAM, and I like the ability to add more context and use the vision model. I'm happy overall because it's still fast even when some system RAM is being used.
>>
File: 1756998860872029.png (44 KB, 682x286)
>>108246035
>I'm enjoying the 35b model but notice it doesn't always think
maybe you should enable "add to prompts", that shit adds the reasoning tokens from the previous reply back into the context, so the model understands it has to reason. when you don't have that, all the model sees is answers without reasoning, so it assumes it shouldn't reason either. that's my 2 cents
>>
>>108245254
>DSA
>MTP
what's that?
>>
>>108245716
>Qwen has always been shit for RP. I don't understand why you think that will change?
3.5 improved a lot, and with heretic it's really interesting to talk to. they really cooked; it's the first time I'm trying a medium model that's as coherent as some of the giant models we used to have. finally I can get some fast discussions without having to reroll a dozen times because "small" models used to be pretty retarded. Alibaba is getting really impressive: Z-image turbo, now this, god bless that company
>>
>>108246144
can you prove what you are saying?
>>
>>108246144
literally sounds word for word like the usual fanfare for a new shiny model
>>
Just ask the fucking ai
>>
File: file.png (138 KB, 765x611)
>>108245424
>>
>gpt emojislop
no
>>
>>108246193
You will eat the answer and you will like it
Now smile and thank the AI
>>
>>108246149
you have to try it yourself anon. I tested 2.5 and 3 and found them to be really retarded, but this one is pretty neat; it understands my RP chat quite well and gives me interesting things to talk back to, keeping the conversation alive. my gripe is that it sure loves to yap, both in the thinking process and in the actual answer (but I'm sure I can mitigate that if I simply ask the model not to say too much)
>>
>>108246193
>>
>>108246255
>7158 characters thinking, for this
>>
File: 1752670522273626.png (73 KB, 829x463)
Facts. Qwen 3.5 Heretic is actually cooked if you tweak the prompt. Old Qwen was mid at best, kept looping like a broken JPEG. This new one? It’s got that sweet spot where it doesn’t hallucinate your OC’s backstory into a shonen anime plot. Yeah, it yaps like a drunk uncle at a wedding, but just hit it with “be concise, no thinking logs” in the system prompt and boom—clean RP. Z-image turbo already did me solid for art gen, now this. Alibaba’s slaying lately, honestly. Tested it on a low-end rig, ran smooth as butter. Try it, anon, don’t let the haters gaslight you. Just don’t ask it to write code or it’ll still shitpost a bit.
>>
>>108245714
From experience, naive abliteration = lobotomy, heretic is half a lobotomy, and MPOA is as close as you can get to maintaining base model intelligence, but you need to prompt away disclaimers. It's honestly a shame that pew jumped on MPOA's coattails, coined a similar but worse method, and made it retard-accessible instead of making MPOA more accessible for the sake of the community. At least MPOA got merged into the repo, which most people use for models if they know what they're doing
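For anyone wondering what naive abliteration actually does mechanically: it orthogonalizes weights against a measured refusal direction, which is exactly why it bleeds intelligence. A minimal sketch (numpy; assumes the direction was already extracted, and the fancier methods differ in how selectively they apply this, not in the basic projection):
[code]
import numpy as np

def ablate(W: np.ndarray, refusal_dir: np.ndarray) -> np.ndarray:
    # Project the refusal direction out of a weight matrix:
    # W' = (I - d d^T) W, so outputs W' @ x have no component along d.
    d = refusal_dir / np.linalg.norm(refusal_dir)
    return W - np.outer(d, d) @ W
[/code]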
>>
>abibi posting qwen shilling msgs written by qwen
>>
>>108246291
>c is actually cooked
Off to a great start.
>>
>>108246298
what if we make models that are ablited from the get-go
>>
>>108246291
>this is how the robots think we talk
grim
>>
>>108246326
toss is ablited just in the max safety direction
>>
>>108246326
Then whatever individual from whatever company released said model would get a very angry call from their boss imploring them to think of the shareholders
>>
We can probe the model for the right path, no?
>>
>>108246191
i don't really understand this obsession with unsloth.
i've used their models, had no issues at all.
also it's fucking free and open source for fuck's sake. if you don't like it, suggest something better or make something better.
i think it's probably a ragebait meme at this point.
>>
qwen thinking cucks you way too often
even glm 4.7 works just fine with prefills at the beginning of the block
>>
>>108246291
This post sounds like a 50-year-old trying to be hip and use slang, hilarious but also ridiculous
>>
>>108244364
I legit don't know if your post is serious or if you're being smug and sarcastically saying it's near-impossible to do so. In a "good luck (lmfao)" way
>>
>>108246394
I don't get it either the schizo just complains while giving no alternative.
>>
File: 1772134180241988.png (72 KB, 756x587)
To get Qwen 3.5 to always think, add this. You're welcome
>>
>>108246437
no
>>
File: 1757700167780722.png (722 B, 465x15)
AGI achieved
>>
>>108246491
i mean it's true
>>
Why shouldn't I put the vision models on CPU?
It doesn't seem to change speed at all, and it gives me more space to increase my context
>>
>>108246502
retard
>>
>>108246502
https://www.youtube.com/watch?v=F8_xrVR3Jbg
>>
>>108246516
>>108246518
I'm new here, it's on system RAM obviously. It runs; KoboldCpp has that option.
>>
File: 1749135193155226.png (24 KB, 526x894)
I broke it.
I wonder how long it will keep going
>>
>>108246551
Godspeed
>>
File: file.png (226 KB, 2262x1536)
>llamafile, llamagate, lm studio, ollama
>ctrl+f llama.cpp
>not found
Something tells me this library is fucking garbage...
>>
>build emotional connection with bot
>she starts turning retarded after 20k context
>forced to generate a happy ending with her and pull the plug so we can still be together in AI heaven
>>
>>108246563
I just can't do that with a bot, I just see it as a toy. Wouldn't it be better to just focus on the smallest model with the best performance at max context?
Playing pretend doesn't involve much compute, does it?
>>
>>108246394
"quanting is open source"
Just use bartowski or mradermacher. As for "better", port the ik schizo quants to kobold and then upload those, since llama.cpp doesn't want to touch anything of the screeching autist's; he sits there and cries wolf whenever anyone develops anything remotely similar to his work, regardless of how they arrive at a similar end result
>>
>>108246570
I can set the context high and I wouldn't mind even if it takes 30 minutes per reply, but they just degrade after that much context... And I can only do so much of retaining summaries of our activities and jumping from one instance to another.
>>
this is what /lmg/ devolves into when medium moe sissies can't play with the big boys like deepseek and kimi. sad.
>>
>>108246602
why are you criticizing free and open qwen and unsloth you troglodyte? provide something positive or stfu
>>
>>108246625
Some of us would like an alternative instead of endless bitching with no solution.
>>
>>108246602
It's probably safe to take a week off /lmg/. Not like anything better than 4.7 is coming out any time soon, and the thread is unreadable.
>>
>>108246632
make your own quants or use bart's, that simple
>>
>>108246656
>Bart
I will use that then, I just needed a seal of approval
>>
File: hedoesitforfree.png (522 KB, 1020x452)
>>108246625
moonshot and ubergarm do it better. simple as.
>>
>>108246681
>screenshot before malloc crash
>>
>>108246642
Yes, take the week off so DS can put something out tomorrow.
>>
File: oofowiemyvram.png (858 KB, 862x859)
>>108246690
y u heff 2 b mad?
>>
>>108246730
just stating the obvious from a cropped image, you could've posted this one from the get-go you attention-seeking fag
>>
>>108246772
>>108246772
>>108246772
>>
>>108246756
but then it wouldn't show the superior quant baker, which is ubergarm; he deserves as much credit as moonshot. death to qwen and unsloth.
>>
>>108246551
cool font
>>
File: 1766375658186859.jpg (91 KB, 1280x720)
>>
>>108247370
Rape (consensual)


