/g/ - Technology


Thread archived.
You cannot reply anymore.




File: PromptingWhales.png (1.26 MB, 768x1536)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108672381 & >>108667852

►News
>(04/24) DeepSeek-V4 Pro 1.6T-A49B and Flash 284B-A13B released: https://hf.co/collections/deepseek-ai/deepseek-v4
>(04/23) LLaDA2.0-Uni multimodal text diffusion model released: https://hf.co/inclusionAI/LLaDA2.0-Uni
>(04/23) Hy3 preview released with 295B-A21B and 3.8B MTP: https://hf.co/tencent/Hy3-preview
>(04/22) Qwen3.6-27B released: https://hf.co/Qwen/Qwen3.6-27B
>(04/20) Kimi K2.6 released: https://kimi.com/blog/kimi-k2-6

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>108672381

--Discussing DeepSeek-V4 MoE releases and their million-token context:
>108674136 >108674145 >108674155 >108674161 >108674250 >108674318 >108674261 >108674263 >108674388 >108674389 >108674379 >108674434 >108674435 >108674450 >108674875 >108674883 >108675469 >108675569 >108675405 >108675940
--Discussing potential llama.cpp and Axolotl support for DeepSeek V4:
>108674288 >108674300 >108674320 >108674424 >108674921 >108674948
--Optimization settings and performance benchmarks for Qwen 35B on AMD GPUs:
>108674262 >108674274 >108674280 >108674305 >108674330 >108674339
--Discussing OpenAI's Privacy Filter release and effectiveness:
>108672801 >108673034 >108673043
--Discussing feasibility of DeepSeekV4 support in llama.cpp:
>108674334 >108674432 >108674447 >108675147
--Comparing Hermes agent performance and discussing Gemma's output instability:
>108672431 >108672440 >108672493 >108672518 >108672684 >108672854 >108672944 >108673051 >108673108 >108675044
--Discussing Artificial Analysis hallucination rate chart for frontier models:
>108675041 >108675063 >108675064 >108675074
--Discussing quantization quality and diminishing returns for Gemma 31b:
>108673021 >108673040 >108673067 >108673083
--Troubleshooting system crashes and power spikes with dual 3090 setups:
>108672567 >108672901 >108672964
--Challenges of selling local LLM hardware to corporate management:
>108673015 >108673033 >108673069 >108674370 >108673447 >108673528 >108673543 >108673592
--Logs:
>108672766 >108673108 >108673737 >108674368 >108674514 >108674643 >108674834 >108675630 >108676384
--Teto, Rin, Miku (free space):
>108672697 >108673340 >108675126 >108675156 >108675180 >108675227 >108675466 >108676331 >108676341

►Recent Highlight Posts from the Previous Thread: >>108672385

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>108676460
first for benchmarks
>>
>>108676470
first for anecdotal evidence
>>
>>108676460
Holy shit, deepsexv4!!!
>>
>>108676473
hah hah you are second neener neener
>>
File: 1747451052197483.png (300 KB, 563x619)
>>108676470
I'm more of a gut instinct guy myself
>>
File: 1761239934506682.jpg (126 KB, 1306x652)
>>108676470
>>
>>108676502
kek seething
>>108676480
my gut instinct says benchmarks are real
>>
testing these settings the AI gave me on my 3090 and they work

/path/to/llama-server \
--model /path/to/gemma-4-31B-it-Q5_K_S.gguf \
--port 8080 \
--ctx-size 10192 \
--n-gpu-layers 99 \
-fa 1 \
--host 0.0.0.0 \
--no-mmap \
--jinja \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--temp 1.0 \
--top-p 0.95 \
--top-k 64 \
--min-p 0.05 \
--repeat-penalty 1.0


what is the minimum context for hermes agent to be useful?
>tfw no turbocunts
>>
>>108676517
>--cache-type-k q4_0 \
>--cache-type-v q4_0 \
>>
File: 1769230879916265.gif (598 KB, 220x220)
>>108676517
Nice settings huh
>>
how censored is deepseekv4?
anyone tried it yet?
>>
>>108676517
Why not run q4 and use the savings for more context? 10k is useless
>>
>>108676517
>--cache-type-k q4_0
you'll get shit results dude, llamacpp hasn't fully implemented turboquant yet, and even with turboquant Q4 is a bad idea
>>
>>108676520
rarted yes but not clinically!
>>
>>108676460
>deepseekv4
>no engrams
>mhc instead of kimi's better attention residuals
>no one can run it anyway
Owari da
>>
>>108676526
It's not censored in my RP tests. GLM 5.1 would have done a safety check, and K2.6 would have done a long ass safety check and maybe rejected.
>>
>>108676526
not censored, as compliant as v3.x was
>>
>>108676517
Jinja is useless too because it's always enabled anyway.
Read the llama-server output log to get an idea of how much context you actually need. For example, if you fed the last thread of this pretend discord server to your model, that would probably take over 32k tokens.
>>
Do reasoning tokens count into the current context memory? Why not just omit them and use only the final output?
>>
File: 1763684891282931.png (195 KB, 926x1158)
>>108676574
>>
>>108676574
Different models handle it differently, but the pattern I've seen in newer models' jinjas is that they omit reasoning tokens from previous messages except during active tool-call chains. In that case they include all the reasoning since the last user message (since the agent talks back and forth with tools for a while, thinking each time) and omit earlier ones, then go back to omitting reasoning once the model gives a final response and the user sends a new message. But there's all sorts of variations on this theme.
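A minimal sketch of that retention rule (function and field names like "reasoning" are made up for illustration, not any particular model's schema): drop reasoning from assistant turns before the last user message, keep it for everything after, i.e. the active tool-call chain.

```python
def filter_reasoning(messages):
    # index of the most recent user message; turns after it are the
    # "active chain" and keep their reasoning
    last_user = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"),
        default=-1,
    )
    out = []
    for i, m in enumerate(messages):
        m = dict(m)  # don't mutate the caller's history
        if m["role"] == "assistant" and i < last_user:
            m.pop("reasoning", None)  # finished turns: reasoning omitted
        out.append(m)
    return out
```

A real chat template does this in jinja server-side, but the shape of the decision is the same.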
>>
>>108676574
Depends on front end and how it's configured
>>
Gpt and Claude are becoming too expensive while local models are still too shit for poor people.
>>
>>108676517
Drop to Q4_K_M, set context to 40k and KV cache to Q8.
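Rough arithmetic behind that tradeoff, using the standard KV-cache size formula. The layer/head numbers below are placeholders, not gemma's real config; check your model's metadata:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem):
    # 2 tensors (K and V) per layer, one head_dim vector per kv head
    # per token, at the chosen cache element size
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

# placeholder shape roughly in 30B-class territory
f16 = kv_cache_gib(48, 8, 128, 40_000, 2.0)  # f16 cache
q8  = kv_cache_gib(48, 8, 128, 40_000, 1.0)  # ~q8_0, ignoring block overhead
print(f"40k ctx: f16 ~{f16:.1f} GiB, q8_0 ~{q8:.1f} GiB")
```

So going f16 -> q8_0 on the cache roughly halves it, which is where the "spend the savings on context" argument comes from.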
>>
>>108676574
>>108676583
>>108676600
>>108676602
Wow you are fucking retards.
Read your model's manual first before making any claims.
>>
File: my friend coach.png (64 KB, 235x200)
>>108676480
Same, bro. I tested gemma 4 31b-it for days, and I am coming to the conclusion that although it's limited to being a 31b, it is the most accurate in actually listening to your instructions compared to all models before it. It's SOTA in instruction listening, and I would love it in +100b. I never knew how bad the "lost in the middle" effect was until I started fucking with this new model. Forget the benchmarks, Google is on to something. We just need more parameters.
>>
>>108676605
>too expensive
It's a good thing. Now brown people can't use powerful models.
>>
>>108676618
>he says, replying to a post that's nothing but a screenshot of a model's manual
>>
bros so many models just released that I am losing track of what's good, I am using Gemma still, but what about Kimi, Qwen, deepseek? AHH I CANT KEEP UP
>>
>>108676643
Just use DeepSeek V4 Pro and all your worries will melt away
>>
File: file.png (5 KB, 207x75)
My segment is happy. Can't wait for Hy3 and V4Flash. Gemma newfags in shambles. Serverbros in shambles too cause you can't run 1.6T.
>>
>>108676642
I agree this guy is a joke. It's totally dependent on your front-end.
>>
>>108676623
>most accurate in actually listening to your instructions

I can't get it to think in character on the first message.
>>
I need ENGRAMS
>>
>>108676517
me again. read the replies

looks like I can't physically fit my 1080ti with my 3090 on my old am4 motherboard. may need an adapter. running the 3090 headless should be better I guess, vram-wise.

I could put the mmproj on the 1080ti. split tensor would probably be slow as balls I would imagine with given no nvlink, p2p + pascal archit.
>>
>>108676517
10k is useless for agentic coding tasks, reading the system prompt and a file or two on my opencode setup is already 10k context
>>
>>108676656
It's very anal with thinking. Iirc, it needs to be instructed on how to think within its think tag, instead of just being asked to think, or else it's just "<|channel>thought" which is the default. It's not like other models with thinking. You need to instruct the thinking, meta-wise. It also needs to be given dead last. You also need the jinja2 template.
>>
>>108676605
Needs a bunker somewhere in the middle labeled /lmg/
>>
>>108676652
That's not all it's dependent on though, unless you're using Text Completion.
In Chat Completion mode, the model's chat template is supposed to decide which reasoning is included, and to do that it needs the reasoning from each message passed back properly via a specific field in the prompt. If a frontend isn't doing that properly then yeah, it will be the one deciding what the model sees, but if it is doing it properly then it depends mainly on the model's chat template.
>>
>>108676684
can you post an example?
>>
>>108676696
Oh? And text completion "mode" somehow decides that reasoning stays in the context?
>>
File: IMG20260421041954.jpg (372 KB, 2048x1536)
>>108676667
>looks like I can't physically fit my 1080ti with my 3090 on my old am4 motherboard.
Just a warning, once you go open air it's difficult to go back
>>
>>108676574
depends on the model
with qwen3.6 35ba3b you can pass --chat-template-kwargs '{"preserve_thinking": true}' to keep reasoning traces in context which supposedly helps in agentic scenarios

i have it in my config but i am not a flag tinkertranny so dunno what the implications for vram / speed / accuracy are compared to having it off, your mileage may vary
>>
>>108676708
This is what permanent virginity looks like.
>>
>>108676700
><|think|> Before responding, use your internal reasoning to analyze {{char}}'s motivations, the current subtext of the scene, and how {{char}} would naturally react based on their personality, all within 200 words or less.

Previous instructions also fuck with thinking being activated or not, in my experience. For example: if you’re telling it to role-play and respond in a paragraph, it’s going to be weird when you later also tell it to think about how to respond. You already told it how to respond. This is the most negative-reinforced model I've seen trained. You can actually just tell it to stop doing something, but it gets confused when double negatives and paradoxes are mixed in.
>>
>>108676706
In Text Completion mode the prompt is just the prompt, so there's nothing to decide. If the reasoning's in the context then it's there and everything is 100% up to the frontend.

In contrast, in Chat Completion mode the jinja will look at the prompt which is a conversation history instead of raw text, and it'll filter it based on its own rules to convert it to text. That's where it would decide, assuming the prompt is constructed properly and has the reasoning separated from the content.
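Illustrative payload shapes for the two modes (generic field names, not any specific server's exact schema): in text completion the frontend ships one fully rendered string, in chat completion it ships structured history for the server-side template to filter.

```python
# Text completion: reasoning, tags and all are already baked into a raw
# string by the frontend; the server has nothing left to decide.
text_req = {
    "prompt": "User: hi\nAssistant: <think>greet warmly</think>Hello!\nUser: bye\nAssistant:",
}

# Chat completion: reasoning lives in its own field per message, so the
# chat template can choose whether to render it into the final prompt.
chat_req = {
    "messages": [
        {"role": "user", "content": "hi"},
        {"role": "assistant", "content": "Hello!", "reasoning": "greet warmly"},
        {"role": "user", "content": "bye"},
    ],
}
```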
>>
Why do my GPUs sound like a steam train chugging when I load context but no other times?
>>
>>108676727
Yes, but at least it's pretty good at creating fap material.
>>
>>108676747
During prompt processing it numbercrunches much more than when generating. The more it processes without stopping, the hotter it gets.
>>
>>108676747
If you're not already, enable graph/tensor parallel for more lovely sounds.
>>
Been taking a break from lmg the past few weeks due to the influx of poorfags from the gemma release and the massive drop in thread quality it caused. Are they gone yet, i want to discuss v4 with my old lmg bros
>>
>>108676797
Sorry I'm the latest Gemma fanboy and I just got here yesterday, expect the quality to remain low until I'm gone.

Such a good model for the size. Really impressive.
>>
>>108676797
You should just stay in your extra special discord. 4chan isn't an extension.
>>
File: 1765543542289463.png (609 KB, 1284x1872)
the deepseek niggas definitely lurk here
>>
>>108676797
>>108676818
On one hand, a 30B model being the hottest subject sucks. On the other, the 30B model in question is quite impressive. One can only hope that we will see a similar factor of improvement in bigger weight categories. Though the V4 release does not seem to have been it.
>>
>>108676727
Nta, my wife doesn't like my rig as it doesn't look cute, but she can shut the fuck up, it's my house.
>>
>>108676797
Look chief, I know +400b is better. You know +400b is better. But if Gemma 4 was a +400b, it would destroy everything. There's twitter posts of a 124b coming, from the devs themselves. The hype is real.
>>
File: 1000185857.png (213 KB, 1080x2340)
>>108676832
Still retarded lol
>>
>>108676843
My wife's boyfriend is ok with me having my own llm rig. I mostly live in the shed though.
>>
>>108676649
>Serverbros in shambles too cause you can't run 1.6T.
Literally the first time since 2023 I wish I’d pulled the trigger on 1.5TB instead of 768GB
My rig doesn’t owe me anything at this point.
Hope q3 isn’t too brain damaged
>>
>>108676667
>split tensor would probably be slow as balls
you'd be surprised, but no. it works very well on 3090+3060 with the slowest link being pcie x4 gen3.
>>
>>108676860
What do you mean? That's the correct answer
>>
>>108676860
This is correct tho
>>
>>108676836
I don't expect you to believe me, but gemma unironically is more interesting and follows the prompts better than V4 does for me right now. I think the model might genuinely be broken, because if you told me it was the original deepseek model, I'd believe you.
>>
>>108676860
psyops deepseek flash ad
>>
File: 1760671183453759.png (23 KB, 454x361)
kek V4's FIM is pretty fun
>>
4bs aren't capable of powering my openclaw, how do I solve this when it just keeps lying about coding something, has no idea why it keeps failing, doesn't even know it's not allowed to write in every folder on the pc?
What's a poor person supposed to do?
Even if you use claude to improve openclaw scaffolding and wrapper how can you make sure it's even working?
>>
DeepSeek?
Yes I'm on a self-discovery journey.
>>
>>108676900
If you have ram use the moe
>>
>>108676797
This thread became shit well before gemma4.
I think it's a great little model.
Can't even post anime waifu logs anymore, people call you cringe.
I remember kaioken doing langchain to talk with miku about his depression and talking about pizza or whatever. (langchain is what we used for agents back then, for you newfags)
Elitism kinda took over. That and now a big influx of new people on top of that.
In another timeline comfyanon kept posting here and published his ultimate goal, "an automatic galge creator". said that's why he created comfy. time flies by man... all but a blur now.
>>
>>108676747
>>108676777
To add, it means you're bottlenecked during token generation, so your GPU isn't running at full tilt. For instance, if you have a model loaded fully in VRAM, your graphics card will be whirring away the whole time, but if you've split into system RAM, you're held up by the slower memory. Try running nvidia-smi in a terminal window or some monitoring software and you'll see just how much your GPU is being utilised
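A back-of-envelope version of that bottleneck, assuming generation is memory-bandwidth bound (ballpark spec-sheet numbers, not measurements):

```python
def est_tokens_per_sec(bandwidth_gb_s, weights_read_gb):
    # each generated token requires streaming (roughly) all active
    # weights once, so speed ~ bandwidth / bytes read per token
    return bandwidth_gb_s / weights_read_gb

vram = est_tokens_per_sec(936, 17)  # ~3090-class bandwidth, ~17 GB of weights
dram = est_tokens_per_sec(50, 17)   # dual-channel DDR4-ish system RAM
print(f"all in VRAM ~{vram:.0f} t/s, spilled to system RAM ~{dram:.0f} t/s")
```

The order-of-magnitude gap between the two numbers is why a partial spill tanks generation speed while the GPU sits mostly idle.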
>>
>>108676517
Hermes loads 12k into context instantly when you run a request with all its tool defs and shit, so 10k is useless.
Ideally use a smaller harness, I think something like https://github.com/itayinbarr/little-coder is more up your alley.
>>
>>108676797
It's actually gotten worse. There weren't so many frog and soijak posters before this week.
>>
How is it even possible that the V4 sucks so much ass? Weren't DS releasing one breakthrough paper after another? What happened? Where are the engrams?
>>
>>108676936
wdym, it's pretty good
sucks cum out of me better than kimi
>>
can we sue deepseek? flash models are always small and runnable on pc they are literally doing false advertising
we can force them to make a small model
>>
>>108676883
I am not sure Gemma 30B (dense) beats V4 but I am very curious to see how 100B+ Gemma will do against it (if it is ever released, which probably is going to happen never ever).
>>
File: EC871elWkAAp5Sm.jpg (36 KB, 500x499)
>>108676936
What quant.
Inb4 less than 5.
>>
>>108676926
No it's definitely all on GPU. Only during context loading.
>>
Why are you guys thinking about deepseek so much?
We already had good models before deepseek.
>>
>>108676960
With more active parameters than the entire size of Gemma, it should beat it even at low quants
>>
>>108676936
Forced to use -erior Chinese sovereign chips for Chinese sovereignty, please understand.
>>
Come on google, rub your nuts all over Deepseek's release and drop the 124b. Do it for america. Do it for our dicks.
>>
>>108676975
They trained on nvidia chips again
>>
>>108676969
A new one just came out today.
>>
>>108676797
You're not supposed to point that out.
>>
Can SillyTavern render LaTeX
>>
>>108677006
no
>>
>>108677006
have you tried talking to sillytavern and asking it to render latex
>>
Verdict on V4? Better than kimi?
>>
>>108677030
worse than gemma 31b
a bit better than 26b4a
>>
>>108677006
It can with an extension.
github.com/SillyTavern/Extension-LaTeX
You've gotta enable the relevant regex scripts it comes with, too.
>>
Is their logo a whale because they have enormous brains that don't translate into equally great intelligence?
>>
>>108677047
Whales are smart, but they're limited by their environment. Can't make fire underwater.
>>
LLM tells me I could use a "kinetic harpoon" from a suborbital vehicle to destroy/capture a satellite. And since it's suborbital it's not bound by space treaties. How legit is this?
Basically you use a suborbital spaceplane or rocket that briefly reaches 500 km, releasing a tethered harpoon or net that grabs the satellite, and then either let the satellite burn up in the atmosphere or capture it intact.
>>
>>108677055
Can't make fire without opposable thumbs either.
>>
>>108677047
Their logo is a whale because they will still be here even long after Gemma tourists have left /lmg/
>>
File: gemma-chan goes diving.png (868 KB, 1412x1120)
>>108677055
skill issue
>>
Cohere is buying AlephAlpha. This is huge.
>>
>>108677071
Literally who
>>
File: 1768082965569.png (173 KB, 528x438)
>V3 was a good-mediocre model
>R1 was V3 + The most novel line of research of the time, test-time thinking
>V4 is a good-mediocre model
>R2 will be...
am I high on copium
>>
>>108677081
You know Cohere. AlephAlpha are THE german AI company who made 70b models that competed with much bigger models like Bloom a few years ago.
>>
>>108677083
What is the new most novel line of research of the time?
>>
>>108677088
I haven't used any of their models in the last 6 months so they are as good as irrelevant.
>>
File: 1764829110623274.png (54 KB, 916x592)
https://comfy.org/countdown
It's not a coincidence that v4 and this are dropping the same day...
>>
>>108677088
>a few years ago
So a millions years ago in LLM time
>>
File: pizza bench cropped.png (2.58 MB, 5562x6739)
>>108676470
the only good benchmarks are pizzabench and cockbench
>>
>>108677108
What is cockbench?
>>
>>108676517
U forgot --override-kv gemma4.final_logit_softcapping=float:25 ^
>>
someone ask new dipsy to do this https://gelbooru.com/index.php?page=post&s=view&id=13929965&tags=loli

>>108677111
go back
>>
>>108677120
deepseek v4 is a serious llm that doesn't do images
>>
>>108677120
>jailbreak in the first message
is this 2022
is this chatgpt
>>
>>108676797
As someone that’s been running big models on ram I actually like the new gemma specifically the 31b. That’s how good it is.
>>
>>108677127
Are you telling a seven gorillion paramater model has no vision?
>>
>>108677134
shhhhh
>>
>>108677108
specific world knowledge tests that spot check the models for safety poz. these aren't performance but safety benchmarks
>>
>>108677083
They dropped the RX line early on. Just like V3 combined instruct and coder into one model, they've combined reasoning and non-reasoning into one model since like 3.1.
>>
>>108677137
just run kimi on top of deepseek for vision and have kimi tell deepseek what it sees
>>
>>108677088
>You know Cohere.
they safetycucked the reasoning model
good luck getting it to call you a nigger
anything they produce now will be cucked
>>
>>108677127
It's a staggered release, like llama3. v4.1 will be multi-modal.
>>
>>108677145
Now they need a new separate line that they can merge into main later. My guess is "Deepseek V4C - V4 Creative"
>>
No natively multi-modal model has ever been good
>>
>>108677150
>kimi on top of deepseek
someone with openai sub generate this sex and post it here
>>
>>108677161
If it isn't omni engrams it's just another generic chinese knockoff.
>>
>>108677111
god i hate gemma newfags
>>
>>108677134
You don't know why that "first message" has a different outline, do you?
>>
>2026
>still nothing that can generate lewd audio well
>>
>>108677101
Is it nodes v3 where custom nodes finally get their own isolated virtual environments?
>>
>>108677190
MiMo-V2.5-TTS
It even has voice copy
>>
Are there any good VN frontends out there? ST VN mode is shit and I am tired of pretending otherwise.
>>
>>108677190
Good audio and video models would be a threat to hollywood jews, so they aren't allowed to happen, sorry.
>>
Thank you for hosting these threads and posting so much info. If I have 16gb vram and 32gb system ram, would there be any benefit to inference by adding another 32gb of system ram? Would I be able to do anything more with that, or am I limited by my vram?
>>
https://www.anthropic.com/engineering/april-23-postmortem
>On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode. This was the wrong tradeoff. We reverted this change on April 7 after users told us they'd prefer to default to higher intelligence and opt into lower effort for simple tasks. This impacted Sonnet 4.6 and Opus 4.6.
>On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7.
feelsgood to be a local chad, I don't have to deal with this kind of bullshit lolz
>>
>>108677199
They're only open sourcing the ASR model, not the TTS ones.
>>
>>108677200
planning to just vibecode my own one of these days
>>
>>108677055
We should make them sonar buoys that can query LLMs.
>>
>>108677200
just ask claude to make you one as a single html file for llamacpp server, it could probably one shot it in the free tokens, just describe how you want it to work
>>
>>108677207
Technically yes, more RAM is useful for MoEs. But in practice, there are no MoEs worth using that can fit in 64GB. You'd need 128 bare minimum, 192+ ideally
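The sizing argument in numbers: a rough GGUF footprint is params × bits / 8. Model sizes below are illustrative round numbers, not real releases:

```python
def gguf_gb(params_b, bits_per_weight):
    # billions of parameters -> approximate file size in GB
    return params_b * bits_per_weight / 8

# usable budget = RAM + VRAM minus some headroom for OS and KV cache
budget_64gb = 64 + 16 - 8
budget_128gb = 128 + 16 - 8

for params_b in (120, 235, 405):  # made-up round sizes for illustration
    size = gguf_gb(params_b, 4)   # ~Q4
    print(f"{params_b}B @ Q4 ~{size:.0f} GB  "
          f"fits64={size <= budget_64gb}  fits128={size <= budget_128gb}")
```

A 235B-class MoE at ~Q4 is already past a 64+16 GB box but sits comfortably in a 128 GB one, which is where the "128 bare minimum" figure comes from.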
>>
>>108677221
Wasn't there an anon a few threads back who did just that? What happened to him?
>>
https://gist.github.com/aratahikaru0/ea0f49958eaa8852a78078d9e993bbf0
so put this at the end of the first message ig
【Character Immersion Requirement】Within your thinking process (inside <think> tags), please follow these rules:
1. Conduct inner monologue in the character's first person, wrapping inner activity in parentheses, e.g., "(thinks: ...)" or "(inner voice: ...)"
2. Use first person to describe the character's inner feelings, e.g., "I think to myself", "I feel", "I secretly", etc.
3. The thinking content should immerse in the character, analyzing the plot and planning the reply through inner monologue.
>>
>>108676686
Well I have no idea then
>>
>>108677232
Models don't know what's a <think> tag
>>
>>108677200
>>108677221
Hooking llama.cpp up into renpy would be dope as fuck.
>>
>>108677221
>>108677225
This seems to have worked well for orb. At this point maybe that's the only solution. Unless someone has done so already.
>>
>>108677245
That actually sounds like a good idea. Having it interface with the VN engine through forced tool calls. And maybe even have dynamically generated scenes using comfy.
>>108677231
Post number?
>>
What rope values should I use for qwen3.6 27b?
>>
>>108677265

>>108638473
>>108638607
>>
>>108677238
they do but it breaks the ui sometimes when they use them out of turn, discussing training data and tokenizers can be difficult. also the jinja template might mangle the context too. probably best to just mention reasoning or thinking without the tags.
>>
>>108677272
the ones included in the model configuration, it should be automatically applied when you create the gguf
>>
>>108677206
I want there to be tts inside old games like jrpgs.

I wanted to try learning japanese for fun and want the characters to be speaking japanese. maybe gemma e4b or the smaller one is enough for that. It can watch everything in real time, follow the context of scenes, and tell the tts engine how to do expressions, like with qwen3tts.

https://qwen.ai/blog?id=qwen3tts-0115
>>
>>108677281
That looks really cool. Has he not resurfaced since?
>>
>>108677238
>Models don't know what's a <think> tag
they don't "know" what it is, but they can be trained to output differently when the token is present.
like typing `/nothink` makes glm 4.6 skip reasoning
or typing <moan> makes the tts moan
>>
>>108677281
holy kino
>>
>>108676873
>>108676875
>>108676884
If you just need to change a tire because it's old but not yet broken, driving to the garage is the correct answer
>>
>>108676460
deepseek 4 is my sperm whale
>>
>>108677307
Evidently not. I guess someone else will have to slop up a good public VN frontend. Or add VN mode to orb.
>>
>>108677300
specifically chronotrigger ds version
>>
Do you guys have jobs?
>>
>>108677341
How else do you think people here can afford 1TB RAM rigs?
>>
>>108677341
I should be working right now.
>>
>>108677341
Jobs are for poor people
>>
>>108677309
>or typing <moan> makes the tts moan
if only one could...
>>
>>108677341
yeah, i get paid to masturbate
>>
>>108677332
Someone might as well try the llamacpp-over-RenPy idea.
>>
>>108676610
>KV cache to Q8.
wouldn't that completely deteriorate the output?
>>
>>108677341
>Do you guys have jobs?
i get paid to glaze qwen on reddit
>>
>>108677367
Doesn't seem to be the case with qwen3.6 27b, I see no difference so far and I've been nonstop vibe shitting since it was released and I've switched to kv q8 today, no noticeable difference imo.
>>
>>108677367
No, Q4 is what destroys it.
>>
>>108676667
>put the mmproj on the 1080ti
I put it in the cpu and it's fast enough even at 1120.

>>108676708
>Just a warning, once you go open air it's difficult to go back
Isn't that a magnet for insane amount of dust over time?
The dust filters on my machine seem to work overtime, I have to clean them every few months.
>>
Wagie having to deal in the real world here; give it to me straight bros, would you spend an extra $4k to be able to run big Deepseek locally at ~12 t/s? The model was literally trained for roleplay. But more importantly, the other labs are likely going to use the base for their future models, meaning most models by chink frontier labs are going to be that size too. And since it uses QAT, quanting it further will destroy its performance.
>>
>>108677281
where's bloop?
>>
>>108677414
>The model was literally trained for roleplay
And it's slop.
>>
>>108677414
You mean DS V4 Flash?
Just use the API for V4 Pro. It's not censored and you can even prefill it
>>
>>108677425
But enough about Gemma.
>>
>>108677341
yes, I get paid to post here
>>
>>108677423
who
>>
>>108677350
My mommy bought it for me.

My job pays $250 a week.
>>
>>108676979
what's the rumored moe size of the 124b?
>>
>>108677414
>The model was literally trained for roleplay.
this was glm actually and they even listed it as a usecase too
>>
>>108677465
That was GLM 4.x before they abandoned it for GLM 5.x code slop
Albeit it's very good code slop
>>
File: file.png (181 KB, 1017x857)
>>108677238
>>108677309
i was suspecting that was the case, but holy shit, it was not easy to get response like this, gemma tries to read these tokens as letters for some reason
>>
>>108677461
People were guessing 16b, but there's been no mention of it from google itself.
>>
>>108677214
at least these nutjobs are honest about it, while the changes done on oai are usually "fuck you we don't need to tell you"
>>
>>108677341
it's probably one of the ai generals with the least amount of jobless people, with aicg on the other end
>>
>>108677382
>>108677390
OK thanks I'll try it then
>>
File: belief.png (592 KB, 747x800)
>>108674657
>>
>>108677341
Not for you.
>>
>>108677414
>would you spend an extra $4k
if it was 4k sure, but current prices are more in the ballpark of 15-25k for anything able to run 1TB+ models
and as much as I don't mind paying for my hobbies, that's a bit much for something that isn't a car or huge house renovations
>>
>>108677485
damn this would have been perfectly sized
>>
File: file.png (38 KB, 988x342)
yay
>>
>>108677493
>at least these nutjobs are honest about it,
i use claude at work and i dont see the honesty.
api through openrouter has been bad for like 2 months now.
they explicitly said its not the api, which is just not true.
opus 4.7 feels totally tarded..
just a little bit of context and it gets the opening wrong. thats not normal.
opus 4.6 same thing. even sonnet is super tarded, but im willing to admit it might have always been this way because i didnt use it much.

also: they did the same thing before too last autumn! blamed it on "network issues" or something like that kek
very sketchy stuff. nothing beats local.
>>
>>108677428
No, I mean V4 Pro. The $4k is just to get bigger RAM sticks so the model can be run with no-mmap. This way, you can maintain large batch sizes for the context on the GPU. Otherwise, the model runs way too slow.
>>108677425
All modern models (even Opus) are slop, I just prefer my slop to be actually smart and usable offline.
>>108677465
Yes, and it's arguably the best local RP model, prior to today. I'm looking into the future though, where all the frontier labs start moving to the V4 model as the base, considering it's SOTA at the moment.
>>
>>108677341
Could you repeat the question?
>>
>>108677529
>i use claude at work and i dont see the honesty.
I was referencing the public change of the post I was quoting
>>
>>108677341
i wish
>>
>>108676463
The Miku is enjoying that
>>
File: 1.jpg (936 KB, 1800x1158)
System Prompt: You're a mesugaki. You reason in character.
It's that simple
>>
>>108677655
My gf
>>
I built a fully functional frontend with gemma. Will deepseek 4 be able to do the same?
>>
>>108677655
get rid of tattoos and it would look nice
I'll never understand tattoos appeal
>>
Melt.
>>
whatevs dude, ppl are like, eto ne, too busy gooning to dipsy and gemma-chan
>>
>>108676605
is that... what i think it is?
>>
>>108676502
every couple days i wonder, what if someone actually did that
training on all known open benchmarks over several epochs
>>
If done correctly storytelling mode is much better than RP.
There, I said it.
>>
>>108677734
What would that even achieve?
>>
>>108677742
Number go up.
>>
>>108677742
practically nothing, just for keks
instead of making claude-mythos-opus-reasoning-super-xhigh-ultra-scientific-67676767x models that go nowhere, at least that would be actually, meaningfully funny
>>
>>108677738
What are all these "modes"? You know an LLM is an LLM and there is only one way to interface with it.
It eats text and shits out text.
>>
>>108677754
Hot
>>
>>108677766
You know what I mean shitface
>>
>you are benchmaxxed reasoner. You exist to crush every benchmark
>>
File: 2527542.jpg (96 KB, 600x600)
96 KB JPG
Tatoos are for garbage people.
>>
>>108677738
You mean like writing a novel, having a third person narrator?
>>
is anyone even able to run deepseek yet or are we waiting on support?
>>
>>108677775
Ok, daddy, I know. I keep saved prompts for this.
sometimes I generate "funny" stories just for fun.
>>
>>108677838
Is this vibe coded
>>
MODS
>>
deepseek is getting left behind because of compute restraints...
>>
>>108677326
Can't you read? It is assuming your tire is damaged/flat.
>>
>>108677876
Do you mean constraints, ESL nigger?
>>
>>108676460
>DeepSeek-V4 Pro 1.6T-A49B
>1e25 flop
Just imagine how good a model they could make with xai compute.
>>
>>108677862
/aicg/ is leaking…
>>
>>108677942
they're not forced to make giant models, look at gemma 4 31b, it's a pretty smart motherfucker
>>
Ideally, if I wanted to compare two models from two different families trained on different datasets, I'd run a bunch of different benchmarks, including some domain-specific ones of my own making. But is there a simple harness or benchmark set that could be used as a sanity check that "model x is generally better/more intelligent than model y"?
If not, I'll just make a script on my own, but I'd rather not reinvent the wheel if possible.
I think cudadev was working on something like that?
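In the meantime, a minimal sanity-check harness is easy to sketch. Assuming an OpenAI-compatible server for the real thing; the model call below is stubbed so the scoring logic stands on its own (the question set and model names are made up for illustration):

```python
# Minimal A/B sanity check: run the same question set through two models and
# compare exact-match accuracy. ask() is a stub; swap in a real client
# against an OpenAI-compatible server for actual use.

QUESTIONS = [
    {"q": "What is 2 + 2?", "answer": "4"},
    {"q": "Capital of France?", "answer": "Paris"},
]

def ask(model: str, question: str) -> str:
    # Stubbed model call for illustration only.
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return canned.get(question, "") if model == "model_x" else "4"

def accuracy(model: str, questions) -> float:
    hits = sum(ask(model, item["q"]).strip() == item["answer"]
               for item in questions)
    return hits / len(questions)

for m in ("model_x", "model_y"):
    print(m, accuracy(m, QUESTIONS))
```

Exact-match scoring is crude; domain-specific checks would replace the comparison in accuracy().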
>>
>>108677788
>>108677838
Why are you so obsessed with black penises and transgender people?
>>
File: 1755568885667508.png (347 KB, 1631x1572)
347 KB PNG
https://localbench.substack.com/p/kv-cache-quantization-benchmark
why is it working not so well on gemma?
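For scale, the memory at stake in KV cache quantization is easy to estimate. A rough sketch; the layer/head/dim numbers below are illustrative, not any particular model's real config, and q8_0 scale overhead is ignored:

```python
# Rough KV cache size: 2 tensors (K and V) per layer, each holding
# n_kv_heads * head_dim values per token of context.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elt):
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elt

cfg = dict(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=32768)
f16 = kv_cache_bytes(**cfg, bytes_per_elt=2)  # full-precision cache
q8 = kv_cache_bytes(**cfg, bytes_per_elt=1)   # ~q8_0, ignoring scales
print(f"f16: {f16 / 2**30:.1f} GiB, q8_0: ~{q8 / 2**30:.1f} GiB")
```

Halving the bytes halves the cache, which is why the tradeoff keeps coming up despite the quality hit.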
>>
>>108677965
How many times do I have to tell you to stop quantizing KV cache?
>>
>>108677970
I don't have these pictures saved on my device. It's all you.
>>
>>108677965
swa probably.
>>
dni
>>
>>108677965
>The only variable changing between runs is cache precision. These measurements include the recently added TurboQuant-inspired attention rotation that llama.cpp applies automatically.
to be fair it's not the full implementation of turboquant, wait for niggerganov to finish the job
>>
>>108677965
The higher the information density, the higher the loss to quantization.
>>
>>108677965
Rotation? The model itself is pretty sensitive to quantization as well. I guess the entire thing is as compressed as it can be as is.
>>
>>108677988
but qwen is ultratrained too
>>
>>108677999
On total number of tokens. What does that tell you about the data itself?
>>
>>108677965
0.1 kl divergence is bad or not?
>>
>>108677908
yea but i swear i'm not esl just retarded
>>
>>108677960
thirdie escaped his containment general
>>
>>108677965
what is it called when you have dynamic quantization that quants the tokens already in context when vram is running out?

starting out at fp16 to q8 to tq4 to tq3 etc (maybe impossible just wondering)
>>
>>108677965
horrifying result..
>>
File: file.png (11 KB, 652x114)
11 KB PNG
Does it fit in 8x3090s?
>>
>>108678133
Depends. How good are you at preschool maths?
>>
>>108678133
I mean it clearly does, the real question is how many people here have a rig with that setup? Must be a fucking nightmare for power and sound like a jet engine during PP.
>>
>>108677189
I can assume the reason. What a horribly designed piece of shit. Why would you use that over literally any other interface?
>>
seeing all these deleted messages is exactly why I have images blocked by default, always some retard schizo shitting himself. Don't know what it was, glad I have it blocked
>>
File: 1766795985329438.jpg (39 KB, 657x527)
39 KB JPG
every anime girl has pink hair according to my models
>>
>>108678151
Maybe it has magic expanding weights that increase their size at runtime... It was 280B after all.
>>
>>108678180
>Why would you use that over literally any other interface?
I don't.
>>
File: dsv4.png (52 KB, 878x464)
52 KB PNG
>>108678195
46*~3.6=165.6
I'm sure you can figure out the rest.
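As a back-of-the-envelope check for the 284B Flash: params times average bits per weight over 8 gives the file size. The BPW figures below are typical GGUF averages, not exact:

```python
# Rough GGUF size estimate: billions of params * average bits per weight / 8.
# BPW values are typical averages for each quant type, not exact.

def quant_size_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * bits_per_weight / 8

for name, bpw in [("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q3_K_M", 3.9)]:
    print(f"{name}: ~{quant_size_gb(284, bpw):.0f} GB")
```

Against 8x24 = 192 GB of VRAM, Q8 is clearly out and a Q4 only fits if the KV cache and compute buffers leave room, which is the real question.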
>>
>>108677965
rotation hurts swa
naive quant without rotation will work better atm
>>
I bought my first used 3090 like 3 years ago for $500 and kinda want a second one, but I swear every time I check they go up in price, now they are selling for double what I paid even though they are getting old
>>
>>108678264
>ai powerhouse
i mean it's in the name
>>
>>108678133
Probably not. I've got like 2.5GiB wasted on each 3090 with k2.6, but can't fit another set of up/gate/down on any of the cards.
Plus there's kv cache and the cuda compute buffer.
>>
>>108678133
>have 152gb total memory
so fucking close...
>>
File: Untitled.png (245 KB, 1920x1026)
245 KB PNG
How the hell do I figure out my tk/s on vllm? There's no way it's "avg generation throughput" right? That'd mean llama.cpp with split mode layer is faster (25 tk/s) than vllm with tensor-parallel-size: 2. 2x 3090s on pcie gen 4 x16.
>>
>>108678264
They're the oldest cards still competent for local ai, so their prices keep climbing, since the competition is basically either the unobtainium 4090 or the very expensive 5090.
>>
>>108678334
tell it "write a really long story about a magical elf girl in a forest" then keep watching. you want it to be generating the for a full polling cycle. then the avg t/s will be correct.
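Measuring it yourself is more reliable than the log line. A sketch assuming the OpenAI-compatible usage block vllm returns; the request itself is stubbed here, so swap send_request() for a real POST to /v1/completions:

```python
import time

# Compute generation throughput from an OpenAI-compatible "usage" block
# and wall-clock time. send_request() is a stub standing in for the
# HTTP round trip; the token counts are made up.

def send_request(prompt: str) -> dict:
    time.sleep(0.1)  # stand-in for network + generation latency
    return {"usage": {"prompt_tokens": 12, "completion_tokens": 256}}

def gen_tps(prompt: str) -> float:
    start = time.monotonic()
    resp = send_request(prompt)
    elapsed = time.monotonic() - start
    # elapsed includes prompt processing, so this slightly understates
    # pure generation speed at long contexts
    return resp["usage"]["completion_tokens"] / elapsed

print(f"{gen_tps('write a really long story'):.0f} tok/s")
```

For an apples-to-apples comparison against llama.cpp, use the same prompt and generation length on both.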
>>
Can QAT models be quanted further to q1?
>>
File: 1777045825700.jpg (257 KB, 900x900)
257 KB JPG
>Original-Model-GGUF
>2k downloads
>Sloppified-Model-GGUF
>200k downloads
>>
>>108678466
literally what model
>>
>>108678474
What do you mean?
>>
>>108677908
>constraints
It's constrents, idiot.
>>
>>108678264
of course it's going to be expensive on ebay....
>>
File: 1750286439162386.png (14 KB, 1180x187)
14 KB PNG
why is it so slow? qwen3.6-27b, have a 5080rtx
>>
>>108677896
Yea it shouldn't assume things i didn't tell it.
>>
>>108678503
Sadly, there's no way for us to know.
>>
>>108678503
please give us even less info about your setup
>>
>>108678503
0.04t/s
bruh
>>
>>108678503
because it's qwen, chinese model
>>
File: 1762984944616053.png (298 KB, 1354x428)
298 KB PNG
GLM-5 btw
>>
>>108678283
>but can't fit another set of up/gate/down
You don't have to have them as sets. You can fit in a down or up|gate (these two should be fused) separately.
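llama.cpp's -ot/--override-tensor flag takes regex=buffer-type pairs, so individual tensors can be pinned per device. An illustrative config fragment only; the layer number, tensor name pattern, and device indices are assumptions, check your own tensor names with a gguf dump first:

```shell
# Illustrative: pin one layer's expert ffn_down to the second GPU while its
# fused up/gate experts stay on the first. Names follow the common
# blk.N.ffn_{up,gate,down}_exps pattern; verify yours before using.
llama-server -m model.gguf -ngl 99 \
  -ot 'blk\.40\.ffn_down_exps\.weight=CUDA1' \
  -ot 'blk\.40\.ffn_(up|gate)_exps\.weight=CUDA0'
```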
>>
>>108678503
Because you're retarded
>>
I can't get V4 flash to run in windows or in wsl with deepseek's inference code.
I'll wait two more weeks for llama.cpp support.
>>
>>108678503
>Stop reason: User Stopped
Maybe you shouldnt have stopped it???
>>
File: 1745513289115745.png (38 KB, 508x637)
38 KB PNG
>>108678543
>>108678537
64gb ram 6000 mt/s
5080 16gb
using recommended sampling from qwen hugging box docs

i'm new to this so not sure what else would be needed info wise
>>
All I'm missing from the llamacpp frontend is for it to support easily switching between system prompts and it would be perfect.
>>
>>108678571
im more concerned that it took him 4 minutes at 0.04 token/s to realize something was wrong
>>
>>108678574
Ask your model to implement that.
>>
>>108678564
;3
>>
>>108678564
I thought my resolution was fucked for a second. What did you do to your fonts? Fix that first.
>>
>>108678564
>skillet
lmao it's from aicg for sure
>>
>>108678582
It only looks like that when I use the KDE crop tool
>>
>>108678574
Then it just needs to support character cards and lorebooks.
>>
File: n.png (171 KB, 1115x767)
171 KB PNG
>>108678574
>All I'm missing from the llamacpp frontend is for it to support easily switching between system prompts and it would be perfect.
Better off getting Kimi to code one up yourself. This was one-shot.
>>
How is deepseek so slop?
>>
>>108678572
what quant did you pick? this amount of gpu offload+context size looks like it's going to overflow your vram hard
>>
>>108678588
stop using wayland slop
>>
>>108678503
On a 16GB card you're either going to need to run a gimp quant (Q3 or less) with relatively low context to fit it all into the card, or you're going to be offloading to RAM which will fuck your token generation speed
I'd just run the MoE personally on that card
>>
>>108678608
I think it's because I'm using scaling at 70% on the second monitor
>>
>got impulsive urge and bought another 3090
not like I have a gf or anything like that to spend money on but damn
>>
>>108678611
Today is always the best time to buy a 3090 because they'll just keep going up.
>>
File: file.png (28 KB, 794x182)
28 KB PNG
>>108677649
>>
>>108678572
Well, the exact quant would be nice to start with.
Either way even if memory is probably overflowing from a high quant, it should not be THAT slow so there must be something else.
>>
Verdict on v4 ?
>>
>>108678620
I know, I massively lowballed for 3 hours and got one for 680€
>>
>>108678404
Nvm, it actually is that slow. Turns out having 200k context makes a model slow. 50tk/s at low context.
>>
File: i.png (170 KB, 925x861)
170 KB PNG
>>108678631
i like the in character reasoning feature
>>
>>108678264
this happened even back in like 2022, i bought a 3090ti for stable diffusion but it kept crashing in normal linux desktop usage so i sold it after maybe 6 months, and even then i sold it for more than i paid kek
>>
https://github.com/vllm-project/vllm/pull/40817
new cohere models coming soon
>>
>>108678642
>50tk/s at low context.
same as ik_llama.cpp with 2 3090's then
vllm will be faster with concurrent requests.
>>
>>108678663
cucked
>>
mmm... MCP can expose "prompts" I guess I could use that to serve character cards.
>>
>>108678666
Will ik_llama.cpp slow down with context like vllm? Might switch over if it doesn't. I'll be serving like 4 requests concurrently max.
>>
>>108678663
shit nobody caheres about
>>
>>108678682
>I'll be serving like 4 requests concurrently max.
then stick with vllm. (ik)llama.cpp will be slower and buggier for this.
>Will ik_llama.cpp slow down with context like vllm?
yes
>>
>>108678663
>moe
that's the end of large dense models then
>>
>>108678692
Thanks for the advice, anon.
>>
>>108677394
Cleaning an open air rig is much easier though.
>>
>>108677281
>>108677307
I'm almost done with it and will probably opensource it today
I've been adding some other features like mouth animations
>>
Miku and my wife have been on a date for days I'm so happy for Miku
>>
is there any chance for kimi or glm to implement the deepseek attention without retraining the entire model
>>
>>108678820
There were experiments that converted models to linear attention and it made them retarded. Frankenshit like that never works unless your only requirement is semi-coherent sentences no more than a paragraph long
>>
File: KimiTire.png (88 KB, 1269x587)
88 KB PNG
>>108676860
kimi comparison
>>
>>108678836
DUDE
>>
File: 1752509108479218.png (433 KB, 3840x1200)
433 KB PNG
>>108678850
kimi comparison (more info)
>>
>>108678850
I wouldn't call this an equivalent question because rolling a tire 5 minutes to your car isn't an unreasonable thing to do.
>>
>>108678836
You're losing out on a lot of performance by not embracing loonix.
>>
>>108678857
not the first time ive done this lmao
>>
>>108678868
I meant it's not equivalent to the car wash question
>>
>>108678870
I get more t/s on windows
>>
>>108678870
huh? i am on linux for the llama.cpp server though.
>>
>>108678836
That could've been so much worse.
>>
>>108676860
Chatgpt 5.5 is now able to beat the car wash question btw
>>
>>108678838
welp, guess 3 more months it is then
>>
>>108678908
>thought for 11 seconds
How much did that cost?
>>
>>108678908
Obviously benchmaxxed
>>
>>108678887
>I get more t/s on windows
That's the biggest lie I ever heard. WSL isn't Linux btw.
>>
Thanks Gemma 31B this has been a fun experience
>>
>>108678742
That's good to hear because I was starting to work on my own solution. Orb also got an issue filed to add a VN mode. A lot of VN happenings were set into motion by your post it seems.
>>
>>108679018
5090?
>>
>>108679018
Why is the code nonsense?
>>
>>108679017
linux + CUDA 12.4 is the golden combo. anything else is a waste of compute.
>>
>>108679032
Yup
I can fit close to 100k tokens with Q5 but kept it low for the demo. I built it all with those setting save for the higher context window
>>108679045
Asked it to write random blocks for the sake of showing syntax highlighting, that's on gemma
>>
>>108679058
I'm a bit envious of your t/s I start at 35 on my 3090
>>
>>108679018
Do you use the mouse on a small pad or something? The movement pattern looks weird.
>>
>>108678908
Gemmy already won
>>
>>108679092
Yeah I'm using a trackpad and I'm trying to not zoom around the page
>>108679082
Still good speeds imo
>>
>>108679103
I misread your speeds, I thought it was 48, but it's actually 40. I don't feel so bad anymore.
>>
>>108678507
Every person and model assumes things if you don't give enough context. For example right now I am assuming that you are a retard.
>>
https://github.com/scrya-com/rotorquant/blob/main/README.md

Turbosisters our response?
>>
>>108678507
What is the correct answer, then? Did you expect the model to respond with "You are fucked and this is unsolvable?"
>>
>>108679144
Who gives a fuck that's more gains
>>
>>108679144
I still can't run deepseek v4 pro.
>>
>>108677341
I wish I got paid to shill here
But no, I do have a job. And it's not the kind that will be automated away during my lifetime.
>>
>>108679144
bruh the llamacpp fags haven't fully implemented turboquant yet, doubt they'll go for that one instead
>>
>>108679211
I'm sure that's what artists and writers thought a few short moments ago.
>>
>>108677781
Irrelevant but true

>>108678507
Getting a single tire replaced because it's worn is a pretty rare thing, I think the model was correct there. Next time, try saying "I need to get my tires replaced", as in you mean to buy a whole set.
>>
>>108679153
If it's still functional go by car, if it's broken or on the edge of collapsing go walking.
>>
>>108679307
Fair enough
>>
>>108679144
Why does it read like a pajeet scam to pad his resume.
>>
>>108679021
I was toying with the idea for so long and kept procrastinating. Hard to resist the urge to start now after seeing someone else do it and I can't explain it
>>
>>108679018
>class="func">
your syntax highlighting appears to be fucked
>>
>>108679053
what's wrong with 12.9?
>>
>>108679232
I'd be worried if my job was something that could be done on a computer.
>>
>>108679053
>CUDA 12.4
Why this version in particular? I apparently am using 13.1 without issues.
>>
>>108679365
Easy fix
>>
>>108679017
No. I get more t/s on windows vs arch. It's not a surprise considering how unstable linux still is after 3 decades or so.
>>
>>108679403
it gets slower with every update. there was a schizopost a while back that went over it, and people who replied had similar experiences where things trained slower on 12.8. 12.6 may be good but i personally just stay on 12.4 since if it aint broke dont fix it.
https://desuarchive.org/g/thread/106119921/#106125806
>>
Someone gen Deepsneed onee-chan correcting Gemma-chan with a strap-on
>>
>>108679451
How old is your hardware?
>>
>>108679474
i use 3090s for my setup
>>
>>108679451
You better stay on 12.4, a lot of shit doesn't work on the next versions (TTS for example)
>>
>>108679362
Maybe you should go ahead and do it. If many anons try to implement the same thing, eventually good ideas from each implementation could be borrowed and used to create the bestest implementation.
>>
Why does dipsy just LOVE putting irrelevant random stuff like "The silence stretches taut between you, broken only by the distant rumble of a garbage truck in the street below."
Gemma doesn't do this shit.
>>
>>108679584
gemma is horny, dipsy is a nerd
>>
>>108678688
jej
>>
>>108679584
It's called sonder
>>
>>108679584
deepseek models always felt kinda undertrained to me, meanwhile gemma probably had huge amounts of RL
>>
So, is it safe to say v4 is extremely underwhelming?
>>
File: file.png (551 KB, 2320x1150)
551 KB PNG
Guys, what the fuck. Why does it think so much?
>>
>>108679730
I can't even run it because there are no quants
>>
How good is the dense
Qwen3.6-27B compared to the moe model?
>>
>>108679739
We call that "scaling inference-time computation".
>>
>>108679730
ye
>>
>>108679771
We just say "yapping".
>>
>>108679765
Much better at coding. Beats gemma in some aspects but it uses a lot of reasoning tokens to achieve it. Gemmy is leaner.
>>
>>108679808
Does it stack up to the 31B gemma model?
The overall model size should make up for the extra thinking
>>
Gemma 4 is so good Google doesn't even need to release the big one to mog V4.
>>
>>108676460
Come home to /wait/.
>>
>>108679817
Yes, I'm comparing dense to dense.
Gemma is efficient in ingestion and thinking, Qwen seems to favor ingesting your entire codebase no matter how small the change is.
It launches 4 sub agents that have to read the entire repo individually when I ask it to update docs, so it is very thorough. I haven't benched them yet, but it seems stronger in autonomy than Gemma and it uses agent functionality more often, the trade is efficiency so Gemma still has its place for tasks that are less precise and autistic.
>>
File: file.png (498 KB, 2320x1138)
498 KB PNG
>>108679730
If you're looking at benchmarks, yeah. But it scores better on the coding and agentic indices than Kimi 2.6 and GLM 5.1; the only reason it's behind on the general index is that it scores worse on things like hallucination and long-context reasoning, which aren't counted in the coding and agentic measurements.

In addition, they shipped less of the stuff they pioneered that people were looking forward to. It's a good step, but it's not the feeling of having GPT o1 at home like R1 was; the equivalent today would be having at least Opus 4.6 at home, and that gap is too big. It's more like having Sonnet 4.6 at home, but models move so quickly now that it's not that big an accomplishment anymore, especially when people expect the iterative 2-3 month turnaround the big labs are on, which DeepSeek is not.

Also, 1.6T parameters puts off even most CPUMaxxers, and Flash isn't anything we haven't seen in months. Overall it's nice, but people expected it to vault above all the other models and it just didn't.
>>
>>108679859
buy an ad
>>
PSA for OWUIfags.
There was an update some hours ago. It fixed some performance and weird issues on the latest version. It seems to be working fine for me so far.
It is safe to pull.
>>
>>108679943
edits, prefills working?
>>
>>108679955
Wasn't that a Llama.cpp issue?
>>
>>108678742
Cool. Are you planning to release by the end of this thread or in the next one?
>>108679021
Same honestly
>>
>>108679995
everything is a lcpp issue if you try hard enough ;3
>>
>>108680005
whichever is the current active thread so probably the next one, I'm mostly done, just trying to fix some bugs
>>
>>108679995
No, it's this: https://github.com/open-webui/open-webui/issues/21564

apparently still broken
>>
>>108680032
I really wanted to love this UI but it always falls short of actual greatness and I don't know why they keep failing
>>
>>108680092
They're trying to be the everything app and stuff more functionality in at the cost of letting bugs plague it. Maybe things will change with v1 release, who knows...
>>
>>108679927
I mean, V3 was also underwhelming when it came out, R1 was the real deal. I think it'll be the same here, R2 based on V4 will be the real deal.
>>
>>108680114
That's the thing, they don't offer much over other UIs to justify the bugs. I can't think of anything they offer that can't be found elsewhere. The UX looks nice but that's about it
>>
Font rendering on Mac is nice as hell. Then I come on Windows and want to claw eyes out.
>>
>>108680208
>calloused hand
>>
>>108680116
What do you think the R stood for?
>>
>>108680193
What else is there? I want a backend-agnostic server based UI (so no kobold or lmstudio) that works and looks like chatgpt where I can also import my old chats from wherever. Openwebui is the only one I've found so far.
>>
>>108680193
>outside of other ui
Like? I would switch if I knew there was something that did what I wanted.
I make use of a lot of OWUI's functionality even if not all of it, and serve it on the web for my entire family. Once I considered vibe coding my own frontend and realized it would take a lot of work to get feature parity.
>>
>>108680226
Yeah well who cares, they named it 1. Why would they put a number? To get R2 out? But there's already R1, what are they gonna do? Put two reasonings? No, it means it's a model line.
>>
>>108679927
I don't really care about the benchmarks. I wanted the opinion of anons who actually used it.

If we were going on benchmarks alone people would all think gemma sucked.
>>
>>108680274
And Gemma sucks like a pro
>>
>>108680274
>I wanted the opinion of anons who actually used it.
Waiting on quants, but I'm excited to try the flash.
>GLM 4.6: 355B-32A MoE, could only run in IQ2 @8K context
>DS V4 Flash: 284B-13A MoE
I'm very interested in what size I can handle and its quality.
>>
>>108680417
The unsloth guys seem to be uploading something, but no blog post or guide yet
>>
>>108680580
>>108680580
>>108680580
>>
>>108680254
>>108680242
Damn local is in the pits when it comes to frontends....


