/g/ - Technology


/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>103644379 & >>103638036

►News
>(12/25) DeepSeek-V3-Base 685B released: https://hf.co/deepseek-ai/DeepSeek-V3-Base
>(12/24) QVQ: 72B visual reasoning model released: https://qwenlm.github.io/blog/qvq-72b-preview
>(12/24) Infinity 2B, bitwise autoregressive text-to-image model: https://hf.co/FoundationVision/Infinity
>(12/20) RWKV-7 released: https://hf.co/BlinkDL/rwkv-7-world
>(12/19) Finally, a Replacement for BERT: https://hf.co/blog/modernbert

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/tldrhowtoquant

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/hsiehjackson/RULER
Japanese: https://hf.co/datasets/lmg-anon/vntl-leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>103644379

--Local vs cloud computing, API usage, and the future of model deployment:
>103647690 >103647701 >103647797 >103647883 >103647912 >103647967 >103648025 >103648067 >103648113 >103648126
--Testing Deepseek's capabilities and limitations, comparing to Claude:
>103644658 >103644698 >103644738 >103644754 >103644779 >103644816 >103644876 >103644835 >103644896 >103644923
--Discussion of language models and their capabilities:
>103648286 >103648296 >103648315 >103648321 >103648330 >103648355 >103648385
--DeepSeek V3 effectiveness and limitations:
>103644403 >103644423 >103644470 >103644559 >103644512 >103644543 >103644590
--Optimizing DeepSeek V3 performance and reducing repetition:
>103648755 >103648783 >103648811 >103648830 >103648857 >103648874 >103648883 >103649020
--Bots stopping mid-sentence during generation and potential solutions:
>103644937 >103644996 >103645178 >103645349 >103645723
--Anons discuss and compare SSDs for DeepSeeKV3:
>103644429 >103644507 >103644539 >103644691 >103644757 >103644630
--Anon weighs GPU options for coding use case:
>103646018 >103646042 >103646043 >103646067 >103646151
--Anon asks for AI model to tag large video collection:
>103644683 >103644708 >103644740 >103644751 >103645219
--Qwen 1.5 MoE vs non-MoE model comparison:
>103648423 >103648429
--DeepSeek V3 discussion and dragon story example:
>103646053 >103646086 >103646125 >103646138 >103646161 >103646193 >103646163 >103646192 >103646197 >103646201
--DeepSeek V3 usage and troubleshooting discussion:
>103646967 >103646992 >103647158 >103647221 >103648463 >103648469 >103648480 >103649124
--Anon wants to merge Constitutions using an LLM:
>103648078 >103648136 >103648165 >103648932
--Miku (free space):
>103644553 >103644661 >103644732 >103644887 >103644895 >103644911 >103646906 >103647932

►Recent Highlight Posts from the Previous Thread: >>103644382

Why?: 9 reply limit >>102478518
Fix: https://rentry.org/lmg-recap-script
>>
Let's take a moment to remember Qwen and how badly they got dunked on by Deepseek.
>>
>>103649789
>let's take a moment to remember chinks and chinkshit
Let's not.
>>
>>103649789
https://www.chinatalk.media/p/deepseek-ceo-interview-with-chinas
CEO seems based at least. Interesting interview. Gonna post a couple excerpts.

>China should gradually become a contributor instead of freeriding.
>In the past 30+ years of the IT wave, we basically didn’t participate in real technological innovation. We’re used to Moore’s Law falling out of the sky, lying at home waiting 18 months for better hardware and software to emerge.
>That’s how the Scaling Law is being treated.
>But in fact, this is something that has been created through the tireless efforts of generations of Western-led tech communities.
>It’s just because we weren’t previously involved in this process that we’ve ignored its existence.
>What we see is that Chinese AI can’t be in the position of following forever. We often say that there is a gap of one or two years between Chinese AI and the United States, but the real gap is the difference between originality and imitation. If this doesn’t change, China will always be only a follower — so some exploration is inescapable.

>Q:But you’re ultimately a business organization, not a public-interest research institution — so where do you build your moat when you choose to innovate and then open source your innovations?
>A:In the face of disruptive technologies, moats created by closed source are temporary. Even OpenAI’s closed source approach can’t prevent others from catching up. So we anchor our value in our team — our colleagues grow through this process, accumulate know-how, and form an organization and culture capable of innovation. That’s our moat.
>Open source, publishing papers, in fact, do not cost us anything. For technical talent, having others follow your innovation gives a great sense of accomplishment. In fact, open source is more of a cultural behavior than a commercial one, and contributing to it earns us respect. There is also a cultural attraction for a company to do this.
>>
>>103649824
Highly based.
>>
How does the cache work? Like, if I delete 90% of a chat and start over, is the cache still infecting its outputs?
Seems like if the slop infects your API key you need to get a new one.
>>
File: chinesemantyping.jpg (131 KB, 1255x837)
>Highly based
>>
>was going to get 2 5090 for 72b/123b models
>now debating for cpu maxxing instead
please tell me all the deepseek posts are just shills or a meme
>>
>>103649866
They're slant-eyed shills.
>>
>>103649866
they're state-operated shill agents
>>
>>103649866
CPUMAXXing is futureproof and there's nothing stopping you from slapping 3-4 5090s onto your dual socket Genoa-X board later when you need them.
>>
>>103649851
Yeah, I mean he could be lying of course, but it's a very surprising insight into Chinese AI.

>Why is Silicon Valley so innovative? Because they dare to do things. When ChatGPT came out, the tech community in China lacked confidence in frontier innovation. From investors to big tech, they all thought that the gap was too big and opted to focus on applications instead. But innovation starts with confidence, which we often see more from young people.
>Our hiring standard has always been passion and curiosity. Many of our team members have unusual experiences, and that is very interesting. Their desire to do research often comes before making money.
>DeepSeek is entirely bottom-up. We generally don’t predefine roles; instead, the division of labor occurs naturally. Everyone has their own unique journey, and they bring ideas with them, so there’s no need to push anyone.

>Q:Many LLM companies are obsessed with recruiting talents from overseas, and it’s often said that the top 50 talents in this field might not even be working for Chinese companies. Where are your team members from?
>A:There are no wizards. We are mostly fresh graduates from top universities, PhD candidates in their fourth or fifth year, and some young people who graduated just a few years ago.
>The team behind the V2 model doesn’t include anyone returning to China from overseas — they are all local. The top 50 experts might not be in China, but perhaps we can train such talents ourselves.

>Q:Once DeepSeek lowered its prices, ByteDance followed suit, which shows that they feel a certain level of threat. How do you view new approaches to competition between startups and big firms?
>A:Honestly, we don’t really care, because it was just something we did along the way. Providing cloud services isn’t our main goal. Our ultimate goal is still to achieve AGI.
>Big firms have existing customers, but their cash-flow businesses are also their burden, and this makes them vulnerable to disruption at any time.
>>
File: WEBP_Player.png (700 KB, 777x630)
I'm starting to get used to the AI-generated look and feel.
>>
>>103649866
You'll get fucked either way. Why not watch some TV shows, read a book, play some games? LLMs and hardware are in a weird spot right now; I'll just wait until the dust settles and then scoop up some cheap hardware
>>
>using UnslopNemo 4.1
>mention {{user}} has long, sharp, pointy, claw-like toenails
>well over 100 messages later
>{{user}} makes {{char}} lick her feet
>model mentions that the toenails are painful on her tongue
Whoa.
This is after the model demonstrated significant spatial awareness the other day.
Good model for a 13B. Continues to impress me in little ways like this that other models of the same size, or even Mixtrals back in the day, have not.
Drummer completely fucked up the Metharme implementation but the model is quite good with Mistral templates.
>>
>>103649893
Yeah I feel like local image gen is in a much better place right now than local LLMs.
Still flawed, but you can get great results with a bit of shooping and inpainting, and that feels satisfying, whereas editing prompts and messages feels more like tard wrangling.
>>
>>103649866
Qwen and DeepSeek are too dry and positivity-pozzed, unfortunately.
We have Mistral for RP and that's it. And even those get worse the bigger the B.
For coding I wouldn't use local. But if you must, I'd use Qwen Coder, and that's 32B.
Hope you at least tried out the big 70B+ models somewhere before buying. I was disappointed.
>>
Gemini is great.
>>
>>103649932
Which one? From my testing they all kinda seemed dumb.
>>
It's been a while since I used llama.cpp, as I'm an exl2 user.
I see that it now requires CMake to compile instead of "make".
It's painfully slow.

I used to run:
make clean && time GGML_CUDA=1 make -j$(nproc)

and it was faster. Is there a faster alternative to:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

I have an EPYC system.
>>
All Chinese LLMs are going to be CPUmaxxed from now on because they are sanctioned to death by the US and the sanctions are going to get worse under Trump.

With CPU inference they can at least skirt some of the requirements and make optimum use of available GPU power by putting 100% of the GPUs in the country to use purely for training.
>>
>>103649866
The approach is to wait and see what hardware and models will come out in the coming year. Everything moves too fast to just sink a ton of cash into something and regret it a few months later.
>>
>>103649926
I agree, even the biggest txt2img models can be run on a single xx90, but that's entry level for average intelligence LLMs. So much for "a picture says more than a thousand words"
Images are also far easier to understand and compare
>>
>>103649946
Check the docs https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md#cuda
Also use -j [threads] when compiling; it helps a lot, especially when you compile all the FA quants
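For example, something like this should be roughly equivalent to the old parallel make line (assuming CMake 3.12+ so that --build accepts -j; the configure step only runs once, rebuilds are incremental):

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)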
>>
>>103649956
something something "hurr durr poorfag cope"
>>
>>103649973
would using GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 be better than just offloading more layers to RAM?
>>
>>103649947
That's not so bad I guess.
Wasn't there a similar situation with the IBM monopoly in the 80s which forced optimizations?
A $2xxx Nvidia 5090 with 600 watts and 32GB is crazy.
>>
what kills me most about the current crop of models is that no matter how smart, they'll all fall apart eventually just far enough into the context. It is inevitable.
>>
>>103649928
>For coding I wouldn't use local.
Isn't the new DeepSeek competitive with the best models while being a lot cheaper?
>>
>>103649996
Oh yeah, don't enable that, at least not on Windows.
It tricks your GPU into thinking it has more VRAM, and when it spills into RAM the performance is going to tank HARD.
Regular layer offloading is much faster than unified memory
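A rough sketch of the two approaches (model path and layer count are placeholders, and the binary name assumes a recent llama.cpp build where it's called llama-cli):

# explicit offload: keep 30 layers on the GPU, the rest run from system RAM
./llama-cli -m model.gguf -ngl 30 -p "..."

# unified memory: let CUDA page VRAM out to RAM on demand (tends to tank hard once it spills)
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./llama-cli -m model.gguf -ngl 99 -p "..."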
>>
>>103650022
Yeah. But I still wouldn't make the jump.
Even Sonnet 3.5 sometimes fucks up hard, especially with loops. I'd rather take 3x if/else. You can't trust the output.
I wouldn't accept anything but the best. If it's a hobby project or something, it's more than enough though.
>>
>>103649926
flux is boring. I started shitposting here because there's only so much you can do unless you want to inpaint for hours. Let's face it, at the end of the day txt2img is just a 1girl generator and not much more
>>
has anyone been able to quantize deepseek v3 to GGUF? I think DeepseekV3ForCausalLM is not supported?
>>
>>103650092
doesn't quantizing sparse moe models fuck up their performance?
the original is fp8 already so there's not that much to trim down anyway
>>
File: 1707832964593536.jpg (94 KB, 1050x618)
>Performs on par with Claude 3.6 Sonnet as a web agent while having only 9B params
https://huggingface.co/THUDM/cogagent-9b-20241220
https://github.com/THUDM/CogAgent
Holy fuuuuuuuck. China won again!!!
>>
>>103650134
is there anything like this but for RP?
>>
>>103650124
Well, I just wanted to be able to try it. I have 4x3090 + 512GB RAM, so I'm not sure I'm able to load the original model even at the native FP8.
>>
>>103650134
those benchmarks are always cope
>>
>>103650156
Could've paid for billions of tokens with that investment
>>
>>103650134
(on a very very narrow task)

Small models are retarded. Always has been, always will be.
>>
>>103650181
Go away Sam, no one wants your oversized models
>>
>>103650162
I use the homelab for more stuff though, not only LLMs + it's still local
>>
>>103650046
Flux's big thing is the natural-language way of instructing it, but it only works if you hit something it actually knows, which is complete guesswork. The worst part is that, because of the nature of natural text, you have to guess how to write the prompt so it actually pays attention to all the parts, and then figure out whether it even really understands all of it. Tags are easier. I found SDXL with ControlNets to be much more versatile, personally
>>
>>103650195
Now I'm curious, what do you use it for?
>>
>>103650013
Well yeah but most of them are stable up till 32k context. Last year we were glad when we reached 8k max context on models. Keep in mind how quickly we're making progress.

The only thing that changed is that model makers straight up started lying about the max context the models can handle.
>>
>tts sucks
>every 3d model has hard requirements to cuda libraries that won't work with amd
/lmg/ was mistake
>>
>>103650248
>cuda libraries that won't work with amd
That's kind of a you problem though.
>>
>>103650215
VMs with multiple GPUs for passthrough, as a middle server for livestreaming with OBS using SRT to help with received packets since the source is on cellular 5G, then from the server to YouTube over RTMP on a wired gigabit connection.
Also remote gaming with Parsec or Moonlight. It's also handy having 4x Stable Diffusion Forge or ComfyUI loaded with Flux for parallel outputs
>>
File: Pro WS WRX90E-SAGE SE.png (699 KB, 692x692)
Should I get a Threadripper Pro or go straight for a server board?
>>
they uploaded the Deepseek v3 model card and paper:

https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/README.md

It has distilled reasoning capabilities:

"Post-Training: Knowledge Distillation from DeepSeek-R1

We introduce an innovative methodology to distill reasoning capabilities from the long-Chain-of-Thought (CoT) model, specifically from one of the DeepSeek R1 series models, into standard LLMs, particularly DeepSeek-V3. Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance. Meanwhile, we also maintain a control over the output style and length of DeepSeek-V3.
"
>>
>>103650306
big ass board lmfao
>>
>>103650306
>7 pcie slots
b r u h
>>
>>103649926
I kind of feel the opposite to this. Maybe I'm just retarded but imggen prompting and building refinement pipelines are a lot harder to intuit than tard wrangling LLMs imo. Competent image generation looks almost indistinguishable from the real deal, but it's a lot more finicky and getting there is harder than it is to get a sufficiently large LLM to match the quality of an average slop novel.
The main advantage of imggen models is that they're significantly easier to run on consumer hardware.
>>103650306
>7 slots
Why would anyone do this?
>>
File: file.png (1.17 MB, 2400x2400)
>>103650427
>Why would anyone do this?
You want them to just not connect all those pcie lanes the cpu has?
>>
Johnny Dep's Speed 3
>>
File: file.jpg (614 KB, 1376x2012)
>>103650427
>Why would anyone do this?
https://youtu.be/-nb_DZAH-TM?t=993
Chinese richfag mikubox
>>
>>103650393
That's the same number of slots as the server mainboards that CUDAdev and some others use in their mining-rig builds. It's probably the maximum that's supported in terms of PCI-E x16 lanes with these CPUs.
>>
File: 2024-12-26_04-34-18.png (8 KB, 618x49)
>>103650316
>ssd maxxing dream doa
so much for my panic shilling oh well time to look into cpumaxxing
>>
>>103650555
manic* fuck
>>
>github copilot has been out for 3 years
>free tier available now
>no open source/local alternatives
Why can't local keep up with proprietary shit?
>>
>deep repeat 3
>>
>>103650586
>why can't my home pc keep up with an entire industrial datacenter
gee I dunno anon
>>
>>103650586
This is the smartest kind of shitpost because it invites know-it-alls and shills to come defend their turf by giving an actual answer.
A truly efficient way to get a proper answer amidst the shit flinging.
>>
>>103650316
>FP8 native weight training: We design an FP8 mixed precision training framework and, for the first time, validate the feasibility and effectiveness of FP8 training on an extremely large-scale model.
>AMD GPU: Enables running the DeepSeek-V3 model on AMD GPUs via SGLang in both BF16 and FP8 modes.
>Huawei Ascend NPU: Supports running DeepSeek-V3 on Huawei Ascend devices.
>We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance. It can also be used for speculative decoding for inference acceleration.
cont
NOTE: The total size of DeepSeek-V3 models on HuggingFace is 685B, which includes 671B of the Main Model weights and 14B of the Multi-Token Prediction (MTP) Module weights.
>>
>>103650612
yeah well fuck you
>>
>>103650611
>entire industrial datacenter
They offer it for free, they're running this shit on raspberry pis and we can't even match it with the free 405B parameter models
>>
>>103650631
True
>>
>>103650555
What is ssdmaxxing? I've been away for a while.
>>
>>103649866
I got a 4090 for gaming and then got into AI. Get a 5090 for gaming if you game. If not, then none of this shit is worth it.
>>
>>103650659
The idea that you could run a moe model on pcie 5 ssds.
>>
Man, this necessity to have the exact same vocab between the main model and the draft model fucking sucks.
qwen2-57b-a14b-instruct isn't compatible with qwen2-0_5b-instruct as a draft model? What the hell.
What are some 40 to 60B-ish MoEs out there that I can use for my tests?
Mistral doesn't have a tiny v1 model that would be compatible with the original Mixtral; their smallest one is 7B, right?
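For reference, the llama.cpp pairing looks roughly like this (paths are placeholders; it refuses to start when the target and draft vocabs don't match, which is exactly the complaint):

./llama-speculative -m big-target.gguf -md small-draft.gguf -ngl 99 -p "..."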
>>
>>103650659
Level 2 cope
Level 1 cope is cpumaxxing because you're too poor for H100maxxing
Level 0 cope is pretending that current tech LLMs will magically stop being retarded after a magic parameter number while your house burns down around you because you forgot to install a better breaker
>>
>ITT promplets getting BTFO
It doesn't matter how many params you have if you don't know how to write.
>>
>>103650788
>uhm sweaty you are using it wrong!
*yawn*
>>
>>103649789
Releasing a 685b model that defeats existing smaller param models isn't impressive
It's like when Grok 1 (314B) was released and claimed it scores better on benchmarks than Llama 70B. Like yeah it does, but fuck off.
>>
>>103650815
A 685B sparse MoE isn't the exact same thing, but I get what you're saying. Still, the barrier to entry to run that model is way lower than for a 314B non-MoE model.
>>
>>103650586
continue.dev+LM studio, but /g/ will say LM studio is a proprietary solution or something.
>>
>>103650860
I like aider better. Works with any editor.
>>
>>103650860
Yeah retard, no need for another reskinned llama.cpp
>>
Am I the only one who's not excited about the new deepseek?
It's just more of the same just scaled up.
At least the QwQ meme was something different.
>>
>>103650870
>double click and run?
>no retard, you need to install Gentoo and get llama.cpp working or you're not running efficiently enough!!!
Thanks.
>>
>>103650860
It's proprietary and brings nothing to the table.
>>
>>103650871
I dread the implications of the fact that its reasoning is distilled from the new R1 model. If DeepSeek V3 is this big, how big is R1?
>>
Just messing around with Mistral Large after I got a new GPU. Is there a "recommended" token budget for the system instructions (I'm making my own) and character card together that you should not go above?
>>
>>103650631
>They offer it for free
no, you're testing their model for free
>>
Got myself a second 3060 so I have 24GB VRAM. Are there any good MoEs that work at this size or am I still a vramlet?
>>
>>103650918
1.4T dense, local will be saved!
Jokes aside, didn't people claim that it was <50B a few weeks ago?
>>
>>103650985
>didn't people claim
Guess it depends who those people were.
>>
>>103650983
did you actually get a 12GB 3060
also why would you buy 3060 when 3090 exists
>>
>>103650983
MOEs are optimized for cpu inference, but 24GB lets you run pretty much anything up to 30B at near-lossless quants and above reading speeds
70B at 3-4bpw is also doable if you don't mind waiting, with a decent draft model it'll be even better
Honestly, just wait until we get better, more efficient shit
>>
ok im going to sacrifice myself
im going to seduce m zuckerberg
>>
>>103650993
A few anons speculated as much since it was really fast (please don't tell me it's an even bigger MOE)
>>
>>103651013
>please don't tell me it's an even bigger MOE
okay, I won't
>>
>>103650973
>can use it for free (0 dollars)
>>um actually, if it's free then you're the product!
>>
>>103650998
>did you actually get a 12GB 3060
>also why would you buy 3060 when 3090 exists
$400 vs more than $400 for a toy, hmm hard choice
>>
>>103650918
R1 is probably 236B
Just because it can into CoT that deserves being distilled into V3 doesn't mean it's bigger than V3. This is the logic for non-reasoner distillations. Think of reasoners more like you think of reward models. QwQ is 32B and beats llama-405B on a ton of benchmarks after all.
>>
>>103651094
>236B
Oh yeah a measly 236B
>>
>>103651094
Nothingburger then
>>
>>103651099
>>103651111
you're hard to please digits guys
>>
>>103651094
"Distilling" a small model to create a big one is an oxymoron. R1 is bigger than V3.
>>
>>103651178
"Distilling" is literally just "tutoring" rebranded after retards tutoring Llama-1-7B on GPT-4 outputs gave it a bad name.
>>
>>103651178
puerile semantics
What matters is the quality of data. Small reasoners can generate data that large non-reasoners cannot. They explicitly say that R1 is kind of retarded, so they do a ton of work to rectify that; they use V2.5 a lot too. We have no reason to think R1 is bigger.
>>
>>103651062
>he paid $400 for a 3060
bro...
>>
>>103651062
You played yourself.
>>
>>103651229
AUD
>>
File: 1709058286818351.png (32 KB, 827x122)
>>103651196
>>103651214
So this is a different kind of process than the one that was used to "distill" LLaMA3.1 405B into the 3.1 70B and 8B variants?
https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
>>
>>103651256
This is a colloquial use of the term "distillation"; they don't match logprobs, they just train on outputs.
>>
>>103651246
How much is a used 3090 in aussieland?
>>
deepseek is really nice, finally a local that might be worth using over claude through open router
>>
It's a shame no one makes a local-first model: a small-active-set MoE with a TurboSparse-type predictor for further sparsity.

SSD-maxxing a large model designed to run on GPU clusters is going to work poorly at best.
>>
I think deepseek v3 is the first model I feel like I can wrangle into being a proper AI Dungeon Master with some prompting magic. It barely needs lorebooks to be fed information about the 3.5e version of the game, and there's a lot of information.
This is pretty fucking cool man.
Even Claude and GPT-4o would hallucinate Tome of Battle maneuver names.
>>
File: kill me baby;yasuna;.png (178 KB, 500x500)
I did some research on LLM support on a Linux phone (Oneplus 6, sdm845, Adreno 630, mesa 24.3).
>llama.cpp CPU
works without issues
>llama.cpp GPU - Vulkan
Mesa Freedreno Turnip doesn't support 16-bit storage on Adreno 630 (present only on Adreno 650+ for now), which is required by the Vulkan backend in llama.cpp
>llama.cpp GPU - experimental Qualcomm OpenCL
Mesa Freedreno Rusticl doesn't support subgroups yet, which are required by the OpenCL backend in llama.cpp
>mlc-llm GPU - Vulkan
From what I understand, same as llama.cpp: 16-bit storage is required
>mlc-llm GPU - OpenCL
f32 models run, but slow and output garbage

So far I'm stuck with the CPU; I'm wondering how performant the Adreno 630 can be once it supports the required features (I believe it has the hardware for it, but it's not implemented in Freedreno yet?).
Tale as old as time: GPUs other than Nvidia have dogshit support, while CPU always works.
Thanks for reading my blog.
>>
Looks like 3200MHz RAM would only get more like 2-3 t/s out of this. You would need a more recent server for good speeds. The price just doubled lol.
>>
So cpumaxxer was getting like 8t/s with v2.5 which has 21b active parameters, now v3 has 37b active parameters.
Remember that he has 12-channel ddr5 memory.
My mental calculations say that an 8-channel ddr4 machine would get about ~2t/s or less.
I'd say wait for people doing tests before buying a ddr4 server.
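Back-of-envelope behind that, assuming generation stays purely memory-bandwidth-bound and the quant is the same:
8 t/s x (21B / 37B) ≈ 4.5 t/s on the same 12-channel DDR5 box
4.5 t/s x (~205 GB/s for 8-channel DDR4-3200 ÷ ~460 GB/s for 12-channel DDR5-4800) ≈ 2 t/s
Those are nominal peak bandwidths, so real numbers will land a bit lower.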
>>
Personally I'm just waiting for a new architecture so I can finally run a non-lobotomized quant of a good model on my shit consumer hardware
>>
>>103651853
Didn't they release small v2 and v1 models back in the day?
I wonder if those could be used for speculative decoding to squeeze a couple more t/s.
>>
Apparently it was trained on only 2,000 H800s for less than 2 months, costing $5.6M. Why can't we crowdfund something like this?
>>
>37b
I thought it uses 8 experts by default so isn't it 16B + 21B constant and you can just load those 21B into vram?
>>
>>103651874
Different tokenizers.
>>
>>103651853
buying a ddr4 server seems super scuffed - it's old shit, so cheap but also slow and it's practically useless outside of big MoE models
>>
>>103651971
With 37B active you're gonna end up wanting a DDR5 server anyway for decent speeds. Though perhaps you could pair a DDR4 board with a 48GB card?
>>
>>103650998
>>103651062
Bro.... I paid $300 for my 3090
>>
>>103651853
>So cpumaxxer was getting like 8t/s with v2.5 which has 21b active parameters, now v3 has 37b active parameters.
>Remember that he has 12-channel ddr5 memory.
I just got https://github.com/kvcache-ai/ktransformers working and I'm getting 5 t/s with 2 channels. I have a 4090 and 192GB ram.
>>
>>103652084
? Does ktransformers support v3 already or do you mean 2.5?
>>
Can someone explain the Test Time Compute thing to a retard? How does it differ from CoT with the ability to see prior mistakes? Jewgle Flash Thinking's thought process doesn't seem that helpful for coding; it's just doing a similar thing to what older models were doing, just a bit longer.
>>
>>103652112
2.5
>>
>>103651897
Even if that was a realistic thing to say, I doubt we would get gold on our first try.
>>
>>103652117
So you would get about half that with v3 if you had enough ram. Might be very doable then with 8 channels
>>
>>103652114
The other difference is that Test Time Compute uses RL to pick next steps to consider
>>
>>103652131
I don't know how viable RAM overclocking is on 8-channel boards.
I have 4 sticks and I'm running them at 4800mhz because that's what worked with zero manual tweaking.
With only two sticks it goes up to 6400mhz. If those frequencies are possible on 8 channel boards then it would be even faster.
>>
No but really, is the SSD thing impossible with this? I would assume, like in >>103651928, you would need the 21B + context in your GPU and regular RAM constantly, and then you would load experts from the SSD, each still being ~2B, so ~1GB at 4bpw?
>>
>>103649764
electro magnet snow board is probably a cool product
>>
>>103652214
The big question that probably nobody here knows the answer to is how often do the experts change from one token to the next.
>>
>>103652190
You can get 5600mhz for.. well not cheap
>>
>>103652114
It's essentially a new type of finetune. Like how we have base models and then trained on instruction following made them instruct models. And then training on question/answering interactions made them into chatbots.

Now they also added Reinforcement Learning to train the models to "pick the best route" out of multiple different options.

Then what the "CoT" does is essentially make a couple of short drafts within its CoT and the RL training makes the model pick the best of these drafts and finish them up. So constantly while thinking the model makes a lot of "branches" where it can go and the RL finetuning then makes the model decide which of the branches is more likely to lead to a correct answer.

o3 is rumored to make 1000 branches, pick the best one and at every pivotal reasoning step again make 1000 branches.

It's extremely wasteful in terms of tokens and I think we will change how it's done severely. The way we do it today is extremely hacky with a lot of wasted tokens.
>>
>>103652214
You'd load 8 (it was 8 experts, right?) of the experts per token. Given how many total experts there are, you'll probably load new ones most of the time. At 4bpw, you'll load about 8gb per token. How long does it take you to load 8gb?
>>
>>103652283
The fastest SSDs can do about 14 GB/s sequential, so it might be doable
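Quick division for the best case: 14 GB/s ÷ ~8 GB of experts per token ≈ 1.75 t/s ceiling, before counting the always-active shared/attention weights, compute, or any read overhead.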
>>
DeepSeek v3 feels like it finally hits the ideal intelligence level I've been waiting for
The only thing now is optimizing that shit into a size that doesn't require a dedicated RAM farm
>>
Btw, for deepseek use 1.7 temp. It gets fun then. 1.8 starts making mistakes.
>>
>>103652299
Though that would pretty much bottleneck it to 2 tks or a bit less.
>>
>>103652299
If you have the fastest, sure. still, that'd limit your maximum possible generation to < 2t/s
>>
>>103652340
Maybe it is not that retarded if you run it at 4 experts instead of 8.
>>
>>103652282
Close, but afaik it doesn't generate the actual drafts. Otherwise, the Flash 2.0 thoughts wouldn't be generated token by token. The goal of RL is to effectively prune the search by training it on multiple branches during training time. When the time comes to do inference, RL enables the model to pick the most probable option without expanding the others, which is faster, but results in a loss of accuracy / thoroughness
>>
>>103652349
2 tks for like $200 is not bad at all for that intelligence though.
>>
>>103652320
That's not a very big workable range.
>>
>>103652368
We don't know how it actually works. I think o3 uses actual written-out drafts because of how many tokens it used to answer a single ARC-AGI question (110 MILLION tokens).
>>
>>103652371
MAXIMUM. That's just loading the experts under optimal circumstances and absolutely nothing else. You still need to run the ~20B params and deal with the overhead of swapping experts and all that. I won't speculate about the realistic speed.
>>
>>103651897
That amount of money could be crowd-sourced (though it wouldn't be easy) but
- you'd have to have knowledgeable people who aren't grifters heading the project
- you'd have to avoid breaking muh copyright in blatant ways when it comes to collecting training data - easy to do when you're a chink who doesn't have to care, but not so in le free west (otherwise ambulance chasing lawyers will eat all your crowdsourced money instead of training models)
>>
File: huh.png (370 KB, 1159x1037)
>>103652320
>>103652387
That's a creative / sanity balancing line for people who thought the model was too dry. This model goes almost crazy (but still perfectly coherent / without anatomical mistakes) around 1.7. This scene was not supposed to be sexual and look at where it took it. Wheeze...
>>
After really giving deepseek v3 a go, here's my review.
Works great if you have one question that needs an answer.
Once you do RP with this thing it shits the bed. It seems VERY sure of what the next token should be; like seemingly all Chinese models, once it is on a set direction, it cannot change course.
It has horrid repetition issues. After a few back and forth messages, it will just copy segments of previous messages wholesale, completely out of context.
It seems to get even weirder at higher contexts, which isn't really unexpected, but the purported 163840 context limit is being generous.
So yeah, not that great for RP.
But that's all modern models. Model creators are obsessed with benchmarks, math and programming. At this point, a model's ability to converse feels vestigial.

You want my opinion, the glory days of chat models have been over for months. Anything coming out now will be a sidegrade at best.
>>
>14.8 trillion tokens
Scaling is dead
>>
>>103650622
Wait you're telling me that one Meta paper is now in a production model? That's cool.
>>
>>103652406
I feel pretty confident that o1 doesn't generate the branches. While we don't technically have concrete proof of that, we have a good idea of what Google does given that they show the thoughts rather than hiding them, and it gives performance reasonably close to o1 preview with their Flash model, which, presumably, is not nearly the best they have
o3 is likely something entirely different from Flash and o1 which OP was talking about. My guess is also that it's actually doing a more thorough search in a tree-like fashion and evaluating paths somehow. OpenAI obviously won't divulge any details because they're thirsty for their moat, but I'd expect Google (or somebody else) will release something similar relatively soon which should give us a good baseline
>>
Drummer's models are dogshit.
>>
>>103652511
>horrid repetition issues
Still never ran into this. What's your setup?
>>
>>103652511
0.15 rep pen and I have not had any rep issues since. Make sure you have your formatting right as well. Use 1.2-1.7 temp for creative stuff. Yea, it seems very sure of itself but that is a sign of a smart model. I have had no issues up to about 50k context. And I would say some smut tunes probably still write better smut, but this model is the only option now if you want actually intelligent RP / writing.
>>
>>103652511
Oh, and don't use OpenRouter. Apparently you will get 2.5 half of the time, which is retarded compared to V3.
>>
>>103652540
>0.15 rep pen
Doesn't that make repetition more likely?
>>
>>103652554
No, it starts penalizing tokens based upon them already being in the context
>>
>>103652564
That's presence penalty
>>
I tried deepseek v3 in (((the cloud))) and it seems more of a leaderboard whore than a model you'd actually use
it's not anything groundbreaking even in coding, supposedly its strong suit
>>
>>103652540
>Yea, it seems very sure of itself but that is a sign of a smart model
Overcooked
>>
Considering training knowledge domains: how far is the domain of long multiturn dicksucking in multiple varied interesting ways, from the domain of answering a single turn safe question / riddle / coding problem with one objective truth?
>>
>>103652569
In ST I'm using frequency penalty; I'm just used to calling it rep pen
>>
>>103652595
Yeah, it's unfortunate since despite the similar names and functions, they gave them different scales for whatever reason
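For reference, the OpenAI-style definitions (which DeepSeek's OpenAI-compatible API presumably follows) are roughly:
frequency penalty: logit[t] -= alpha_freq * count(t), i.e. it scales with how many times t has already appeared
presence penalty: logit[t] -= alpha_pres * (1 if count(t) > 0 else 0), i.e. a flat hit once t has appeared at all
The classic llama.cpp rep pen instead divides positive logits (and multiplies negative ones) by the penalty value, which is why the usable ranges look so different (~1.0-1.3 vs ~0-2).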
>>
>>103652590
About 3.5.
>>
How does the new DeepSeek model compare to the Hunyuan-Large A52B MoE? Why are we getting hopeful over this when A52B still hasn't even gotten a q4 quant, despite being smaller than DeepSeek V3?
>>
Okay, anybody used DeepSeek V3 for incest smut and is it worth spending money on?
>>
What models do the best with large context windows for parsing long documents? I hear gemini apparently excels at the task but I'd rather not use it for obvious reasons.
>>
>>103652511
Hate to be that guy, but are you SURE you're using V3?
>>
>>103652620
Because no one can try Hunyuan-Large anywhere.
>>
>>103652244
>The big question that probably nobody here knows the answer to is how often do the experts change from one token to the next.
Just like quantization there's not even a real answer. The mixture of cache paper used only the top 2 experts for certain and the rest only if they were in cache. Unfortunately they didn't do any hitrate experiments with a small cache with that strategy.

A local model could be trained specifically for caching, loading only max 1 new expert per token with say 8GB worth of expert cache.
>>
>>103652717
I wonder how sensitive these models are to expert selection. What would happen if you picked, say, k arbitrary experts and just used those without swapping between them? Do individual experts have enough "general" capabilities to give decent outputs even if they aren't the best choice?
>>
>>103652702
If people aren't running A52B now, I don't see a chance for Deepseek V3 being local either.
>>
>>103652785
From what I remember Hunyuan-Large massively underperformed for its size and so was never mentioned again even by the creators.
>>
>>103652785
>>103652797
Also 52B active would start getting to the point where cpu only inference is just not gonna do it. 37B is more reasonable. Reading speed might be doable.
>>
>>103652692
>those cope benchmarks where qwen and 2.5 beats sonnet and 4o
what even is the target audience for it when it's so obviously bullshit
>>
Anyone know if I can run deepseek with 350GB RAM and 96GB VRAM?
>>
>>103652846
4 bit. Be sure to pass on the performance you get.
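Rough arithmetic, assuming a ~4.5 bit-per-weight quant (Q4_K_M-ish) once GGUFs exist: 671B x 4.5 / 8 ≈ 380GB of weights, plus KV cache and buffers, against your 350GB RAM + 96GB VRAM ≈ 446GB total, so it should just about fit.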
>>
>>103652846
I know.
>>
anyone tried the instructions to git clone https://github.com/deepseek-ai/DeepSeek-V3.git and run inference with torchrun?
>>
>>103652842
Kicks their asses on Livebench too from my understanding, and that one has a closed test set that is updated every few months
Next question
>>
>>103650612
back in my days this was called "bait" and you'd post an image of a fish and a hook to signify that the post was bait
>>
>>103652785
If the model was an actual breakthrough then people would build for it.
In reality it seems decent but not great, and the special requirements for it mean few people will bother
>>
>>103652858
I can only find the FP8 model. Can I make it 4bit?
>>
>>103652905
They say how to on the page.
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/README.md
>>
File: 1393574362984.jpg (5 KB, 200x200)
>>103652900
>>
>>103652905
>>103652916
That or wait. I'm sure the usual suspects will have quants eventually. Being 700B, it's prob gonna take days for them to quant and then upload it.
>>
>>103652688
how do you know you're using V3 through the api? the model always answers that it's a GPT model.
>>
>>103652916
>>103652941
Yeah I would wait for gguf. INT4 quantization is an extremely naive lobotomy quant.
>>
>>103652943
Ask it "what model of deekseek are you?"
>>
>>103652943
For what it's worth, it might work now since they went and disabled Hyperbolic in the API on OR
>>
You do realize that this 685B model is 256 experts, 8 of which is active, so only 21B activated parameters over 8 experts, roughly 3B per expert. How expert can a 3B model be for the area of expertise? It is amazing that the coding performance is almost o1 level, but you do realize that all china AI models are nothing but stolen synthetics data from OpenAI/Anthropic, they just shove it into a model able to hold that many tokens.

OpenAI, Grok, even later LLAMA 4/5 are going for 2Trillion parameters (dense or activated ?) by next year. There is a huge difference when your model doesn't have free stolen synthetics from others that you have to do the grunt work to actually think inside the model instead of being told.
>>
>>103653000
This poster is a chinese marketer pretending to be a retarded anti chinese poster.
>>
>>103653000
1. holy esl
2. that is not how moes work
3. pretty sure that is every model since gpt4, are you really gonna cry that everyone is stealing from openai? fuck them.
>>
>>103652951
thanks, but it doesn't seem to work. it's strange because i'm using the documentation curl

> I am an instance of OpenAI's language model, specifically **GPT-4**. My design is based on the GPT (Generative Pre-trained Transformer) architecture
>>
>>103653000
Anon, for fuck's sake, learn how MoEs work next time you decide to make another retarded post.
>>
File: bait.png (93 KB, 625x626)
>>103653000
>>
>>103653029
alright i made it work. I had to add a system message saying "You are a model by deepseek"
>>
File: deekseek2.png (69 KB, 1290x322)
>>103653029
Odd, that is with top k 1 or temp 0? It does it for me. Then maybe OR is fucked atm. Maybe they are in the process of changing to it and are using some fill in model for now? I have no clue.
>>
>>103653000
Do you realize that deepseek literally said that they are using 37b active parameters?
>>
>>103653000
If it's so easy to make a good model with synthetic data, why aren't Meta/OpenAI/etc doing it with their own "pure models"?
DS3 is revolutionary; it's the first open model we've gotten that's close to Sonnet 3.5
>>
>>103653029
That's the same response you get in their web chat interface after activating the web search.
>>
>Tfw they're falling for a Deepseek post made by Deepseek
>>
Guess im waiting for quants then.
>>
>>103653047
Wait, you have to TELL it it's a deepseek model, and you're trusting that it answers its version correctly? anon...
>>
>>103653048
maybe sillytavern adds some contextual info as input?

according to their documentation, you can't change top_k, only top_p, and it doesn't seem to affect the result. Anyway i'll leave the curl here

just changing the system prompt to
>You are a model by deepseek
seems to work

> I’m DeepSeek-V3, an artificial intelligence model created by DeepSeek
but sometimes it says it's just deepseek-chat.

curl --request POST \
  --url https://api.deepseek.com/chat/completions \
  --header 'authorization: Bearer KEY' \
  --header 'content-type: application/json' \
  --data '{
    "model": "deepseek-chat",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "what model of deekseek are you?"
      }
    ],
    "stream": false
  }'
>>
>>103653116
no, it's still fishy
>>
>>103653116
It's roleplaying.
>>
File: ds_v3_benchmark_table_en.jpg (178 KB, 1280x1091)
I know the official API is V3 at least; they even say so now: https://api-docs.deepseek.com/ under the first API call

Also another benchmark
>>
>>103653205
Also qwen2.5 still standing strong. Makes me wonder what qwen3 will be like. If they manage to pass this with 72B...
>>
>>103652842
>4o
Not hard to beat that one but yeah codeforces is a giant meme
>>
>>103650134
>Claude 3.6 Sonnet
3.6?
>>
>>103653233
It's a nickname
>>
>>103653205
Kinda wild how this model / pricepoint pretty much completely steamrolls every paid API out there except maybe Gemini and the CoT meme models that are too expensive and useless for most day-to-day things
I think I see why OpenAI and Anthropic were so afraid now
>>
File: 1726380837764390.png (844 KB, 800x582)
>>103653205
>Western cucks like Meta: "Noooo we can't give you our base model anymore it's too dangerous for the goyims!!"
>Based chinks: "Hey, you want a 680b base model? It's yours now"
>>
>>103653205
>Big shit beats small shit
damn I love Machine Learning!
>>
>>103653227
Livebench isn't and is closed. It seems to be beating everything else on there.
>>
>>103653277
That's why communism is superior to the oppressive capitalism.
>>
>>103653205
>>103653277
I should start learning Mandarin.
>>
>>103653289
Correction - council of retards beats shit way bigger than them
>>
>>103653289
Sonnet is very likely a big MoE as well, but with more active params, and so is more expensive to run ($15 per million output tokens compared to $1.10)
>>
>>103653277
you add the fully uncensored video model (Hunyuan) and you realize that the chinks are actually the good guys in this era
>>
>>103652553
I thought you were joking because OR says v3 but then I tested it and got picrel
>>
>>103653310
Still trying to wrap my head around how basically every Western tech company has become the nationalistic censorship ridden clusterfuck we feared China would be, and China straight up doesn't give a fuck
>>
>>103653329
I'm thinking OR just has some substitute model and hasn't swapped them yet. I could be wrong, but on the official API it told me it was DeepSeek.
>>
>>103653310
Elon will ban proprietary models and force OpenAI to release o3 Open Source. The west has a chance.
>>
>>103653349
>o3
>Thousand dollars per task
No thank you.
>>
>>103653349
Ummmmmm no
>>
>>103653332
they're waiting for the daddy state to give them candy and protect them
>>
>>103653277
Wrong company to serve as example. Meta is more like Mistral where they release some things and don't release others. Meanwhile OpenAI releases fucking nothing. Not an Instruct, not a base, not experimental research models. Just nothing.
>>
>>103653369
Will the AI at least learn how to suck dick before it genocides all humans?
>>
>>103653385
anon.. sucking dick IS genocide
- big tech
>>
>>103653375
Meta generally releases most shit of value though. In my opinion, the hierarchy goes something like
>Generally releases models
>Qwen, DeepSeek, Meta
>Releases some shit, keeps the best shit for themselves
>Mistral, XAI
>Releases research, gives us some model table scraps if we ask really, really nicely
>Google
>Completely closed models and generally closed research, wants other people to do the same
>Anthropic, OpenAI
>>
>>103653337
I simultaneously hope it's both OR's and the model's fault
OR's because the model didn't really impress me when compared to every other model (though it didn't make any logical mistakes during my short chat)
But I also hope it's the model because it'd be the final nail in the coffin for me, confirming that either I'm a retarded ESL incel freak piece of shit slop magnet (doubtful) or that LLMs are just convincing illusions that quickly break down when you probe them for a bit
>>
>>103653422
>>103653386
>>
>>103653421
What is Mistral keeping to themselves?
>>
>>103653444
desu I don't really care what Mistral is keeping to themselves, they don't release good models anymore
>>
>>103653294
china hasn't been communist in 40 years
>>
>>103652515
>Secondly, DeepSeek-V3 employs a multi-token prediction training objective, which we have observed to enhance the overall performance on evaluation benchmarks.
No, they seem to have only used it during training for fuck knows what, but it seems to have improved the model. They do mention that it should improve the speed of speculative decoding somehow.
Honestly I have no fucking clue what I'm reading here, but there seems to be lots of cool shit.
>>
>>103653443
That's cool and I might test it later, but I've never used jailbreaks because I think that a good model doesn't/shouldn't need them
>>
>>103653468
Even Claude needs it to write good. By good model do you mean one finetuned for creative writing / RP? Because that is the only way you're gonna get one to write good out of the box, at the cost of other areas.
>>
Congratulations on your first good model localbros. Too bad it's 600B but it has good outputs.
- /aicg/chad
>>
>>103653444
Not as much nowadays, but they did keep the larger models for themselves back when they started up (Mistral Medium and Mistral Large). They've gotten better about it and I debated it, but between that and the ass licensing I decided to drop them to that tier for now
>>
>>103653487
>first good model
>it's 600B
you can't have a good model without it being a giant goliath; the scaling laws prevent us from having nice little things
>>
>>103653385
Tulu-3 70B instruct at q8 is the most capable dick sucking llm ever created so I don't know what you're talking about.
>>
>>103653477
Ideally it should be able to handle everything perfectly out of the box, but since I mostly use them for RP/creative writing, I don't mind using a finetuned version that's worse in other areas. Unfortunately, even those run into the same problem(s) sooner or later
>>
>>103653504
Transformer scaling laws*
>>
>>103653205
Aider-polyglot has got to be a gamed benchmark. There's no way it's better at programming challenges but worse at general code editing compared to sonnet unless it is overfitted on exercism solutions.
>>
>>103653504
Yes bigger models will always be better than smaller models with equivalent training. That is why I think the future that local should optimally move towards are models/architecture that can have "sub-network extraction", or the idea of using only parts of the model for specialized tasks. For instance, MoE models where the experts are specialized towards subject areas. So if you simply just wanted RP, you could load only the most relevant RP experts to VRAM, the less relevant to RAM, and the even less relevant to possibly SSD, though something like Llama.cpp would need to be modified to be able to use all three at the same time.
>>
>>103653627
The future that local should move towards is shit that isn't static, something like liquid neural nets
But that's probably "too dangerous" in the hands of the public
I like your idea though, too bad that companies are moving away from "things you can run on a somewhat recent rig" to "things you can technically run locally (tm) but you need either a server motherboard or overpriced enterprise-grade GPUs"
>>
>>103650860
Continue can use ollama as well, or any OAI compatible API.
>>
>>103653332
China doesn't give a fuck about what the model generates, because they completely control speech in China anyway.
>>
>>103653978
>China doesn't give a fuck about what the model generates
as it should, can't believe I live in a world where China is the country of reason
>>
https://arxiv.org/html/2412.17846v1
https://github.com/alonso130r/knowledge-distillation/tree/main
>Large language models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing (NLP) tasks. However, these models are often difficult to deploy due to significant computational requirements and resource constraints. Knowledge distillation (KD) is an effective technique for transferring the performance of larger LLMs to smaller models. Traditional KD methods primarily focus on the direct output of the teacher model, with little emphasis on the role of prompting during knowledge transfer. In this paper, we propose a set of novel response-priming prompting strategies applied in the knowledge distillation pipeline to enhance the performance of student models. Our approach fine-tunes a smaller Llama 3.1 8B Instruct model by distilling knowledge from a quantized Llama 3.1 405B Instruct teacher model. We apply LoRA optimization and evaluate on the GSM8K benchmark. Experimental results demonstrate that integrating reasoning-eliciting prompting into the proposed KD pipeline significantly improves student model performance, offering an efficient way to deploy powerful models in resource-constrained environments. We find that Ground Truth prompting results in a 55% performance increase on GSM8K for a distilled Llama 3.1 8B Instruct compared to the same model distilled without prompting. A thorough investigation into the self-attention layers of the student models indicates that the more successful prompted models tend to exhibit certain positive behaviors inside their attention heads which can be tied to their increased accuracy.

This paper was published some weeks ago and didn't get enough attention; I think this sounds very cool. I will try to apply their code to Qwen2.5 7B / Qwen2.5 32B.
>>
>>103654006
Do you think "But think of the (fictional) children!" works as well in Chinas as in western countries?
>captcha 08YASS
Guess I got my answer
>>
>>103653833
>ollama
Wasn't that the one developed by that troon that's always 300 commits behind llama.cpp?
>>
>>103653467
>they seem to have only used it during training for fuck knows what, but it seems to have improved the model. They do mention that it should improve the speed of speculative decoding somehow

When it's built into the model like this, it's as much an early-exit scheme as speculative decoding. Sometimes it might only need to do one decode step to produce multiple tokens, but you need some kind of confidence metric that the extra tokens are good.
>>
>>103654006
It's not reasonable to call non self censored models with censored speech an improvement.

We get the best of both worlds this way, but the Chinks themselves would get tiger chaired if they talked like we do.
>>
>>103654026
Allons-y, Alonso!
>>
>>103653490
They don't release models so they can go fuck themselves
>>
>>103654248
Hehe
Great show, that one. Until they ruined it
>>
>>103653443
>>103653477
Can you use this to guide local models as well like mistral large / llama3? Does this go under system prompt or should I put it under the story format?

>>103653504
Just how much RAM would I need to run this thing locally, as a ballpark, and what speeds can I expect? I've got 48GB VRAM and 64GB regular but wouldn't mind getting more in the future.
>>
>>103654272
Yes? And where you put it just changes how much it affects things. If it's closer to the end of the context it will have a stronger effect.
>>
>>103654272
>how much RAM
https://huggingface.co/deepseek-ai/DeepSeek-V3-Base/tree/main
the fp8 is a 700gb model, if you have 1TB of ram maybe that'll do lol
>>
>>103653512
Prompt / settings? I remember trying it a while back when it first came out and I didn't find it anything special. Is this the updated Tulu?
>>
>>103654310
You will need something like 400GB total for 4bit and some context it looks like.
>>
File: file.png (31 KB, 338x230)
>>103654292
That makes sense, but I've tried putting things at the end with Llama 3 models and it tends to go schizo a lot more; I wasn't sure if it was still valid for them or not.

In the screenshot it should go before the <|eot_id|> EOS token, but Linux paint is ass
>>
>>103654340
And how fast do these models go when split? MOE models are supposed to be optimized for cpu but I'm pretty much a pure exl2 user so I don't want to put in all the effort of upgrading to get something like 2 t/s
>>
>>103654434
V3 is only 14B active parameters so running it purely on RAM is going to be decently fast if you have DDR5.
>>
>>103654434
Depends on the speed / number of channels of the ram.
>>
>>103654451
Sadly it looks like it's actually 37B active. Dual-channel DDR5 will prob only get like 3 t/s. With an 8-12 channel server board though, you might manage a usable 10+.
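(Back-of-envelope at ~4.5 bits per weight, assuming bandwidth-bound generation: 37B active ≈ ~21GB read per token, so ~90 GB/s of dual-channel DDR5 tops out around 4 t/s theoretical, ~3 real, while 300-460 GB/s of 8-12 channel server memory puts the ceiling in the 15-20 t/s range.)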
>>
>>103654451
>We present DeepSeek-V3, a strong Mixture-of Experts (MoE) language model with 671B total parameters with 37B activated for each token.
>>
>>103652511
>It seems to get even weirder at higher contexts, which isn't really unexpected, but the purported 163840 context limit is being generous.
You can physically do it but they don't make any claims that it works above 128K. From the paper:
>During pre-training, we train DeepSeek-V3 on 14.8T high-quality and diverse tokens[...] Next, we conduct a two-stage context length extension for DeepSeek-V3. In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K.
...
>DeepSeek-V3, following supervised fine-tuning, achieves notable performance on the "Needle In A Haystack" (NIAH) test, demonstrating consistent robustness across context window lengths up to 128K.
NIAH lol. We know that's not real.
>>
>>103654525
Pretty sure from the responses and what happened after that he was using OR when it had 2.5 on it.
>>
Wasn't there a thing where people tried to turn MoE models into a collection of LoRAs that get applied at runtime to save on space requirements? It was the big talk of the town for a bit back when Mixtral first got released.
I guess that didn't go anywhere in the full year since?
>>
DeepSeek V3 is extremely based. It has actually useful suggestions rather than "play with the temperature to figure out what works for your use case :-)"
We recommend users set the temperature according to their use case, listed below.

|USE CASE                      |TEMPERATURE|
|------------------------------|-----------|
|Coding / Math                 |0.0        |
|Data Cleaning / Data Analysis |1.0        |
|General Conversation          |1.3        |
|Translation                   |1.3        |
|Creative Writing / Poetry     |1.5        |
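If you're hitting it over an OpenAI-compatible API, applying those temperatures is just a parameter on the request. Minimal sketch; the base URL, key, and model name are placeholders for whatever endpoint or local server you actually use.
```python
# Sketch of passing the recommended temperature through an OpenAI-compatible client.
from openai import OpenAI

TEMPERATURES = {
    "coding": 0.0,
    "data_analysis": 1.0,
    "conversation": 1.3,
    "translation": 1.3,
    "creative_writing": 1.5,
}

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")  # placeholder endpoint/key

reply = client.chat.completions.create(
    model="deepseek-chat",
    temperature=TEMPERATURES["creative_writing"],
    messages=[{"role": "user", "content": "Write a short poem about a whale."}],
)
print(reply.choices[0].message.content)
```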
>>
>>103654583
Who would have done it? The retards complaining MoEs were too complicated to fine tune?
>>
>>103654583
I think this is pretty much the same thing, and DeepSeek did it:
https://github.com/deepseek-ai/ESFT
https://huggingface.co/collections/deepseek-ai/esft-669a1e800bc10b3460569c70
>>
>>103654592
>Translation: 1.3
What is the reasoning behind this...?
>>
>>103654583
Nah. I remember the idea: create a single baseline expert, then create adapters from the difference between this base and each of the experts, to be applied at runtime.
Something of the sort
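Roughly, the idea would look like this (purely illustrative, not code from any actual project): average the experts into one baseline, then approximate each expert's difference from that baseline with a truncated SVD, which is exactly the LoRA-style low-rank form.
```python
# Illustrative sketch: baseline expert + rank-r deltas per expert, LoRA-style.
import torch

def experts_to_lowrank(expert_weights: list[torch.Tensor], rank: int = 32):
    """expert_weights: one [out, in] matrix per expert for the same layer."""
    base = torch.stack(expert_weights).mean(dim=0)      # shared baseline expert
    adapters = []
    for w in expert_weights:
        delta = w - base
        u, s, vh = torch.linalg.svd(delta, full_matrices=False)
        # Keep only the top-`rank` singular directions: a @ b ~= delta
        a = u[:, :rank] * s[:rank]                       # [out, rank]
        b = vh[:rank, :]                                 # [rank, in]
        adapters.append((a, b))
    return base, adapters

# At runtime the i-th expert would be reconstructed as base + a_i @ b_i, trading
# exactness for a much smaller footprint when rank << min(out, in).
base, adapters = experts_to_lowrank([torch.randn(256, 512) for _ in range(8)], rank=16)
a, b = adapters[0]
print(base.shape, a.shape, b.shape)   # [256, 512], [256, 16], [16, 512]
```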
>>
>>103654583
I still feel like LoRAs have a bit of a placebo effect to them. Not sure if LoRAs would be powerful enough to substitute for full on experts.
>>
>>103654670
It's for translating japanese media to english
>>
>>103649764
I'm trying to find a proper setup for SillyTavern and FlowGPT. Any idea how? I downloaded a version that works with FlowGPT and POE, whatever that might be, but I have no idea how to connect with it.
>>
>>103655161
* localizing
>>
xitter fags deepthroating deepseek are getting to be a bit much
the model is not bad and what they did with limited resources is impressive, and I admire that they are one of the few labs willing to experiment with arch at scale. That said, their models are always bad at multiturn and generally fail to pass vibe checks compared to their benchmarks; I just don't find the quality is there.
le epic whale mandate of heaven posting is cringe
>>
>>103652084
what is this ktransformers magic? an alternative to llama.cpp?
>>
>>103655404
I think it's the closest we have come to Claude, and all it needs is some good RLHF to make it write as well as Claude does.
>>
>>103655055
they seem good for lowering the hallucinations created when you try the "as advertised" body of knowledge and find out it is missing. A LoRA can get a medically trained model to avoid talking about educational choices when asked about stem cells.

tl;dr LoRAs can improve, not add.
>>
>>103655425
>all it needs is some good RLHF
gonna be expensive as shit
>>
>>103655452
Yes but it has a fair and open license. Hopefully a company steps up.
>>
/aicg/ is loving DeepSeek and they have hated every local model before this.

This is the real deal. We are finally corpo level.
>>
>>103653460
you're dumb lol
>>
>>103654026
>I will try to apply their code to Qwen2.5 7B/Qwen2.5 32B.
Ok, forget about this. Each prompt is 600MB in logits.
>>
>>103655482
It feels good to finally have a good LLM
Also means that the others are going to need to step it up to stay competitive, so it's good for open models all around
>>
>>103655482
Deepseek will make a shit ton of money from coomers now.
>>
On llama.cpp one can override the expert count used during inference. So here's an idea that won't be implemented:
Use a single expert from a MoE model for speculative decoding.
Bye.
>>
4090 user, what's the current goto model for sloppy gooning? Need my holiday erp fix.
>>
Looks like there's now a DeepSeek proxy for people who want to try it: https://substitute-domains-pdas-specified.trycloudflare.com/
>>
>>103655482
which is perplexing because it is not good for RP at all, I really don't get the hype
it's nice it's cheap I guess but I'm more interested in using it for code than for its dry repetitive prose
>>
deeploop3
>>
>>103655834
Maybe you're not using it right? Chorbo is dry as fuck without a good system prompt, but with one it will drain your balls.
>>
>>103655808
Brief me on the aicg "proxy" meme.
Is it just stolen api keys and somebody set up a server that forwards requests?
>>
>deepseek proxy
>Chorbo
did i click on the wrong thread
>>
Total Jewish victory.
https://x.com/TheRabbitHole84/status/1872066576750616769
>>
>>103655873
Yep.
>>
>Deepseek is nigh free and first open model that isn’t a total joke compared to paypig ones
So what's the play here to bully OpenAI and the like? Build up good UI/UX for running things locally or connecting to cloud providers that run open models? All options rn are kinda mediocre and struggle with switching from local to cloud-provided (so you can't have your 32B local slop + 685B hosted in one place)
>>
>>103655834
they're used to that because the corpo models are exactly like this, smart but incredibly dry.

I feel you can probably do something to deeploop by clever prompting and some regex scripting, but as is, it's not special. Contrary to the other big ones it might actually be worth it to put in the effort tho, as it really doesn't seem to have censorship nor positivity bias, nor do I think we'll get a scenario like OpenAI or Anthropic where they'll start to hunt coomers down. I'm not religiously against APIs, especially when the APIs undershoot what even the electricity would cost if I had the hardware to run it
>>
>>103655873
When it's not a wholly different model on the other end, or they inject a hidden prompt, yes.
You can scrape, for example, public GitHub repos for people who committed their keys for Claude, OpenAI, DeepSeek, etc.
>>
>>103655856
What the fuck is a chorbo? It's ungoogleable.
>>
So why the fuck isn’t there a Cursor alternative that lets me connect to whatever endpoint I want instead of their shitty one that gives you like 50 requests a month to claude for $20
Seems like local llm is the perfect choice here, but the tooling for it is primitive
>>
>>103655900
it's barely worth it to steal deepseek. It basically costs nothing
>>
>>103655950
you underestimate aicg's poverty
>>
>>103655961
perceived poverty

I guarantee there are dual 4090 owners complaining they can't afford the electricity.
>>
>>103655942
>>
File: 1731650243193770.jpg (23 KB, 844x79)
23 KB
23 KB JPG
how could this happen non-local bros?
>>
>>103656069
>>103656082
go back
>>
>>103656069
aicg users deserve the rope
>>
>>103656087
*does cute 360 like miku!*
>>
>>103655943
True, there is literally nothing out there that even tries to address this. What's up with that?
>>
>>103656157
>>103655943
Because aider does the same thing and doesn't force you to use a shitty webpage as an editor
>>
>>103656069
...
>>
>>103654541
Man fuck off with your damage control. I was using the official API for my testing and that shit has massive repetition problems. I don't know how you all don't see it.
>>
>>103656190
Some genuinely don't notice repeating patterns, slop, and other things, I envy them a lot.
>>
>>103656190
Have you considered that it is far more likely that you fucked something up somewhere than that everyone else is failing to notice a massive repetition issue?
>>
File: Itseveryoneelsebutme.png (564 KB, 500x713)
>>103656190
>>
>>103656172
BTW you're retarded and there are several options available that even have feature parity with Cursor. Learn to use Google lmao.
>>
>>103656230
>options that have feature parity with cursor
You mean aider?
>>
DS3 is smart but it's dry AF, I don't know how you guys are finding this usable for RP or storywriting. For someone doing non-local there's no reason to use this over Claude except being broke.
>>
>>103656244
Aider? I don't even know 'er!
>>
>>103656258
>except being broke.
that's a pretty huge reason for a lot of folks.
>>
How many free tokens does deepseek give?
I have been trying to make it code something and I have spent like 20k tokens already.
>>
fellas who are using anubis
how is it for RP (fantasy and sci-fi)?
also, is it good for NSFW or is it only good for shit that doesn't include sex?
got any masters
thanks in advance
>>
>>103656299
>fellas who are using anubis
no one, every1 using deepsex now grandpa
>>
>>103656306
Using an open weight model through an api doesn't make it local
>>
>>103656299
>drummer models
ishygddt
>>
>>103656223
Post your logs so I can point out the very simple IQ test you're failing.
>>
>>103656324
what do you use then?
>>
>>103656284
>20k tokens already.
I had to use like a million before it charged me a cent
>>
>>103656360
/aicg has a bunch now
>>
>>103656360
check >>103656341
>>
>>103656375
It also enables prompt caching by default, which reduces the input price by an additional factor of ten
>>
Can this thread get any more local?
>>
>>103656417
It's not our fault you're too poor to run it
>>
>>103656417
We'll be right back to local talk once everyone finishes fomoing into building CPUMAXX rigs to run V3 locally before LLaMA4 drops and shits on it at 70B.
>>
>>103656428
Should be cheaper to run at home than large mistral
>>
>>103656400
and i thought this thread is bad
>>
>>103656400
That's the exact template I was using to test the model. It didn't work. Repetition city.

I am open to the model being good, but it is clearly prone to fall into repetition and I don't know how people can't see this.
>>
What do we do now?
>>
>>103656436
This is the big reason imo. Unless llama 4 ends up being a big moe as well people will prob just wait before buying a server.
>>
>>103656299
I didn't think it was anything special compared to the other 3.3 fine tunes.
>>
>>103656371
Wouldn't you like to know, weather boy?
>>
>>103656453
The only person saying they have rep issues across here, aicg, aids and reddit seems to be you
>>
the answer is always stacking more parameters. you cannot escape this truth.
>>
>>103656480
that's why I use BLOOM 175B
>>
>>103656460
I currently run Hanami. How do you think it compares?
>>
>>103656480
then why is sorc trash?
even MM70b mogs it
>>
>>103656475
>seems to be you
Just wait. You'll all see it soon.
I'm smarter than the average gooner. Eventually the patterns will reveal themselves to you too.

That said, post your settings (temp, freq and presence penalty) as well as what your cards include.
>>
>>103656436
>LLaMA4 drops and shits on it at 70B.
Holy fuck you're placing way too much faith in Llama
>>
What if we.... Merge two deepseeks together?
>>
>>103655881
Those all seem low
>>
>>103656495
Never tried that one desu.
>>
>>103656560
what is your favourite currently then?
I am tired of hanami
>>
>>103656546
This but 8.
>>
>>103656436
Anon, this is Meta. Going by previous releases, it will beat DeepSeek V3 with a 600B dense model and their 70B will be about on par with the latest Qwen 72B. There will be no MoE.
>>
>>103656436
The 70B of a Llama release is always the worst one though? Meta sucks at that size for some reason.
>>
>a 600B dense model
I sure hope not.
>>
File: mikustep.jpg (465 KB, 1150x1240)
>flammenai/Flammades-Mistral-Nemo-12B
>allura-org/MN-12b-RP-Ink
>PocketDoc/Dans-PersonalityEngine-V1.1.0-12b
Are any of these worth using over magnum v4 12b?
>>
>>103656608
Zucc already said that the L4 flagship will be smaller than 400B so the entire generation can be considered DOA.
>>
>>103656570
Personally, Llama 3.3 EVA v0.0. It's not perfect and can go a bit schizo sometimes, but I like it because it's more creative and kino than other models that output things that are more expected and not as interesting for me.
>>
>>103656574
This but prune each expert to 1b first
>>
>>103656633
What's your setup for running that?
>>
>>103656631
NTA, but sauce for this?
>>
>>103656436
I'm hoping for a 70B model specialized in coding.
>>
>>103656588
>70B model
bold of you to assume they will have one of those
L4 will be 3B and 800B
>>
>>103656692
If there is a god, L4 will be 3B, 30B and 300B
>>
>>103656258
>50x cheaper
>Uuuhhh... What are you? Poor?
retard
>>
I was trying to make an application with DeepSeek and it didn't work for a while. I was getting disappointed in it until I realized I had modified a JSON file the program needed and it was fucking everything up.
tfw dumber than an AI
>>
>>103656702
If there is a god, L4 will be
>draft model
>fits in 5090 with draft model
>fits in 2x5090 with draft model
>free space
>>
>>103656702
You WILL get 1B 3B and 400B and you WILL be sad
>>
>>103656764
fuckin better not be
i was told 3090s would never be obsolete
>>
>>103656780
Don't worry, your 3x3090 build will only be slower by a factor of 3.
>>
>>103656780
It's ok you can run the 1B 2d2raft model
>>
>>103649764
Hey OP, don't forget to mention in the news that DeepSeekV3 (the instruct one) got released too.
>>
>>103656642
Here.
files.catbox.moe/4kzr9w.json
I also have a setting for CoT which I think can be interesting in some situations, but it takes some time to adapt to certain cards and it also makes replies slower, so I stopped using it. If you'd like to try it out though, here:
files.catbox.moe/nyvz0p.json
If a model does not do the <Thinking> thing, just prefill it in and it'll learn to do it after every reply. Same for if it doesn't give a response after thinking and ends the reply, just prefill with an asterisk or something random.

Credits to anons who I stole most of this from and did some revisions on.
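For anyone confused about what "prefill it in" means mechanically, here's a minimal sketch against a llama.cpp-style /completion endpoint; the URL, chat template, and <Thinking> tag are assumptions, and in practice SillyTavern's prefill field does this for you.
```python
# Sketch: start the assistant turn with the <Thinking> tag so the model continues from it.
import requests

history = "### User:\nWhat should we do about the dragon?\n\n### Assistant:\n"
prefill = "<Thinking>"   # force the reply to start inside the thinking block

resp = requests.post("http://127.0.0.1:8080/completion",
                     json={"prompt": history + prefill, "n_predict": 512})
print(prefill + resp.json()["content"])   # model continues from the prefilled tag
```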
>>
>>103656871
I should have specified that I'm asking about hardware but I will try the thinking thing, thanks.
>>
>>103652243
>electro magnet snow board
sick, that's a perfect late xmas gift anon
8 extra RAM sticks to run deepseek works too
>>
looks like deepseek is popping off in aicg and aids. Guess I need to get me a damn server
>>
>>103656961
>and aids
Do you think nobody can go and read that thread?
>>
>>103656890
Oh I use a DDR5 rig with a 3090 and 3060 12G. IQ4_XS runs at like 3.7 t/s (0 context).
>>
Wait why are they both named aicg now
>>>/vg/507722579
>>103654413
>>
>>103656985
>now
>>
>>103656994
Sorry if I haven't frequented aids in...8 months?
>>
>>103656985
hi newfriend
>>
>>103656985
aicg split off into a /g/ and a /vg/ thread months ago; aids is mostly unrelated now
>>
wasn't aids the /vg one?
>>
>>103657009
why
>>
damn deepseek is bland as fuck. chinks are soulless bug people so we will get local models that solve the Riemann hypothesis but can't hold an interesting conversation. we're never getting Claude at home
>>
>>103657020
No, it's /vg/aicg.
>>
>>103657028
stupid drama, it's aicg we're talking about
>>103657020
again aids is another thing from the /vg/aicg thread
>>
>>103657034
skill issue of the highest order
>>
>>103657034
go to aicg and steal one of their JBs lol
>>
>go to /vg/aicg
>ctrl-f "deepseek"
>33 mentions
>literally every single one of them is shitting on it
Wow!!!
>>
>>103657020
/aids/ is the thinly veiled novelai general that's visited and kept alive by n.ai staff and n.ai imgen spam. It's not a place to actually talk about LLMs.
>>
File: not spam!.png (84 KB, 669x1083)
>>103657080
Damn thing kept saying it was spam; there are so many times you are wrong in just one thread.
>>
>>103657211
See first reply in this thread.
>>
>>103657227
I assume that was you?
>>
>>103657155
Oh right that's the thread the offline-nc schizo trolls 24/7.
>>
>>103657155
Yes, it becomes extremely obvious if you start mentioning anything besides n.ai. The general suddenly gets very active disparaging whatever you mentioned. NAI is a has-been service. Please nobody waste their money there. It'd be nicer if they actually had nice staff.

https://api-docs.deepseek.com/quick_start/pricing
lol, the chinks made it more expensive and pretend it's actually a "discounted price" right now. Never, ever trust API services.
>>
>>103657237
I don't know what "that" is but the answer is no. I meant this >>103649767 the 9 replies limit is a known thing.
>>
File: Untitled.png (68 KB, 1627x621)
>>103657227
you / he got instantly jumped on
>>
>>103657259
>9 replies limit
But I have logs I posted a good deal longer than 9 replies? And I'm not the only one either.
>>
>>103657276
On 4chan, dummy.
>>
>>103657252
>Never, ever trust API services.
ah yes, the old rug pull
>>
>>103657276
He means 9
>>postnumber
when posting on 4chan.
>>
>>103657211
You just have to divide them into batches of 9 quotes.
>>
>>103657252
lmao, doubled the input price and quadrupled the price for output tokens. Very nice.
>>
>>103657252
Still pretty cheap, but yeah.
>>
>do some RP with a card of an existing character from a moderately popular franchise using deepseek
>suddenly the character brings up another character from the same series that's not listed in the card and constructively uses details about them to complement the situation
How I missed this with local models. That's why trivia knowledge about popular franchises is such a nice thing to have in a model.
>>
>>103657155
>>103657252
The samefag attempt would work a bit better if you didn't use the same weird n.ai spelling.
>>
>>103657312
just rag dude
>>
>>103657252
>>103657284

That is the price for the 200B model, so yeah, it's a discount since their API is now using the 600B model.
>>
>>103657312
That is the biggest positive for me. Claude was legit the ONLY other model that knew enough random trivia about my fav fandoms to do spontaneous stuff like that. It brings stuff to life way better.

>>103657321
Even if we had 1M context you could not give it the same sort of understanding with RAG as training it on the entire fandom of something including ALL of its fanfiction.
>>
>>103657315
He's not wrong. Are you denying that it's NAI shill central?
>>
>>103657312
Yeah. This thing has amazing knowledge of Dungeons & Dragons material beyond the most recent stuff.
Really fucking dope.
>>
>>103657341
>He
kek
>>
>>103657312
I feel annoyed when this happens, because it's like the character isn't respecting MY canon.
>>
The level of difficulty to run things in this space: Imagegen (easiest) < LLM <<< TTS
>>
>>103657341
I think you need to get a fucking life dude
>>>/vg/507726019
>>
>>103657354
I need the feeling of living in the actual world of my fandom, not some shitty stage resembling it like all smaller models do.
>>
>>103656797
I also updated the param count.

>>103657359
>>103657359
>>103657359
>>
>>103657371
>no counter arguments
So are you denying it or not?
>>
>>103657371
Kek
>>
>>103657390
Why does NovelAI keep living rent free in your head? You localtards have been insecure about it since the start.
>>
>>103657412
Disappointment strikes deep, anon; you would know if your dad ever came back with cigs
>>
>talking to himself
>>
>>103657412
This is kind of the thing. NovelAI is not relevant and will likely never be relevant again. I don't understand the appeal of spending all day every day shitting up multiple generals to copy-paste the same post over and over.
Schizokun - nobody here gives a fuck about your mortal enemy. Either make an on-topic post or go somewhere else.
>>
>>103657412
Someone said "deepseek is popping off in aicg and aids" when he actually meant "aicg and vg/aicg" and he was rightfully informed that aids is a fake general meant to shill a singular product where no discussion takes place.
>>
>>103657473
that someone was 100% you
>>
>>103657494
There are, in fact, multiple people that can see /aids/ for what it really is.
>>
Any anons actually running Deepseek on a home setup?
>>
>>103657578
Not me, don't ask me anything.
>>
>>103657698
What is your shirt size?
>>
>>103657578
I'm trying, but adapting their torchrun generate.py script to CPU is a PITA. Lots of CUDA stuff baked in.
>>
>>103649866
Cpumaxx with dual 9334 CPUs, 384GB of DDR5-4800 RAM, and a 1x 4090 is snail's pace, requiring me to alt-tab while I wait for it to process my EVA(?) 3.3 6-bit model, which was 59GB. I mean it works and is powerful but, for the goon on the run, look towards just GPUs and cope.

I have 2 4090s but have yet to test both atm. I just say cpumaxx only if you are patient (you aren't), as it streams at a snail's pace.
>>
>>103658020
deepseekv3? What inference engine are you using for cpu?
>>
>>103658037
I had only tried the 3.3 model in the GGUF version (forgot the full name) that was shilled here in the last week and haven't had time to do much with any other tests.

I plan to test more and shitpost about it here later, like with DeepSeek etc, but I'm a wagecuck and I can't even get time to set up my server properly yet
>>
>>103657762
G
>>
>>103649866
It's by far the best local yet and is close to Claude. That said, a better model could come out next month from Meta, so who knows.
>>
>>103658381
>close
Is it really still just close?


