/lmg/ - a general dedicated to the discussion and development of local language models.

Not Suspicious At All Edition

Previous threads: >>106539477 & >>106528960

►News
>(09/09) K2 Think (no relation) 32B released: https://hf.co/LLM360/K2-Think
>(09/08) OneCAT-3B, unified multimodal decoder-only model released: https://onecat-ai.github.io
>(09/08) IndexTTS2 released: https://hf.co/IndexTeam/IndexTTS-2
>(09/05) Klear-46B-A2.5B released: https://hf.co/collections/Kwai-Klear/klear10-68ba61398a0a4eb392ec6ab1
>(09/04) Kimi K2 update for agentic coding and 256K context: https://hf.co/moonshotai/Kimi-K2-Instruct-0905

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>106539477

--Paper: Home-made Diffusion Model from Scratch to Hatch:
>106542261 >106542674
--GPU pricing, performance benchmarks, and emerging hardware modifications:
>106546975 >106547036 >106550119 >106547168 >106547484 >106547754 >106547804 >106547849 >106547879 >106548086 >106548161 >106548571 >106548608 >106549153 >106550454 >106550474 >106550611 >106550739 >106547935 >106547966
--Superior performance of Superhot finetune over modern large models:
>106543123 >106543243 >106543656
--qwen3moe 30B model benchmarks on AMD RX 7900 XT with ROCm/RPC backend:
>106539534 >106539571 >106539618 >106539658
--Vincent Price voice cloning with Poe works showcases model capabilities:
>106539541 >106539736 >106539701 >106539807
--Framework compatibility: vLLM for new Nvidia GPUs, llama.cpp fallback, exllamav2 for AMD:
>106540544 >106540560 >106540611 >106540666 >106546227 >106546233 >106546268 >106546277 >106546906
--GGUF vs HF Transformers: Format usability and performance tradeoffs:
>106550231 >106550258 >106550310 >106550352 >106550364 >106551231 >106551252
--Need for a batch translation tool with chunk retry functionality for LLMs:
>106543697 >106543774 >106543816 >106543888 >106543953 >106547100 >106551343
--Auto-tagging PSN avatars with limited hardware using CPU-based tools:
>106550616 >106550648 >106550976 >106550667
--Qwen3-VL multimodal vision-language model architectural enhancements and transformers integration:
>106547080
--Surprising effectiveness of 30B model (Lumo) over larger models in technical explanations:
>106543339 >106543345 >106543399
--Dual GPU LLM performance trade-offs between VRAM capacity and parallel processing limitations:
>106539831 >106539914 >106540160
--Miku (free space):
>106539893 >106540709 >106545815 >106547702 >106548178

►Recent Highlight Posts from the Previous Thread: >>106539481

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
gguf status?
>>
>>106551911
Use chat examples. Regardless of your client, you can fake up a few lines of conversation between you and the model.
Also add post-history instructions (these get injected before your next input) to control the length of generation and the style.
Of course the base style is always the same, but e.g. giving concise and short examples will change the way it outputs text...
>>
>>106551947
I just want an EXE file... how hard can it be??
>>
I'm trying LongCat again now that it's on OR. The insane censorship of the web-version doesn't seem to be a problem through the API and the model knows a lot.
The one drawback is that it's *by far* the *worst* model when *it* comes to ** spam.
Still a shame that there will never be llama.cpp support for this.
>>
File: jealous.jpg (76 KB, 600x600)
I made it into the highlights boys
>>
are MLPerf benchmarks a meme
>>
>>106552000
Why can't it be used on llama.cpp?
>>
>>106551983
So I shouldn't bother with a system prompt for gemma and just use post history instructions?
>>
>>106552171
Of course you should! But using the post-history thing enforces the style more because it keeps reminding the model all the time to stay in line.

>[System Note: [
>Always respond in 1-2 short paragraphs. Limit {{char}}'s response to less than 200 tokens unless specifically asked to provide a long answer. {{char}} is a narrator not an actor. Do not act on behalf of {{user}}.
>Respond only in plain text with no Markdown or other formatting.
>]
Here's mine, it's nothing special; I'm kind of lazy to experiment. I'm more concerned about the length of its replies - I hate rambling.
I also format every instruction like this:
if it's
>System Note: [ blablabla ]
it's related to instructions. And for characters I'm tagging it as a "character" and the descriptions etc. are inside the square brackets.
>Character: [
>Name: Some Faggot
>Summary:
>description
>]
I have found out that at least for me it helps with small models but maybe it's just a cope/fantasy.
>>
>>106552202
>[System Note: [
That's a typo, it should be
>System Note: [
>>
>>106552202
Isn't using the {{char}} placeholder bad? Especially if you want to do multiple characters?
>>
>>106552242
It's just a macro. My {{char}} is Game Master and it's narrating the chats.
Characters are characters with their own names.
{{char}} and {{user}} are just macros anyway so you can use whatever you like. You can manually type in any name/reference and so on.
>>
>>106552095
It uses some dynamic MoE meme architecture that activates a variable number of active parameters for each token.
CUDA dev said that implementing something like this in llama.cpp is likely not worth it for a fotm model like this.
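For anyone wondering what "variable number of active parameters" actually means, here's a minimal PyTorch sketch of the idea. This is not LongCat's actual code, just a hypothetical threshold on top of a normal top-k router to illustrate why per-token expert counts are a pain for fixed-shape kernels:

import torch

def route_dynamic(router_logits, max_k=8, threshold=0.1):
    # router_logits: [n_tokens, n_experts]
    probs = torch.softmax(router_logits, dim=-1)
    topk_probs, topk_idx = probs.topk(max_k, dim=-1)
    # drop experts whose router probability is under the threshold,
    # so each token can end up with a different number of active experts
    keep = topk_probs >= threshold
    return topk_idx, topk_probs * keep

logits = torch.randn(4, 64)          # 4 tokens, 64 experts
idx, weights = route_dynamic(logits)
print((weights > 0).sum(dim=-1))     # different active-expert count per token

Every token multiplies a different number of expert FFNs, which is exactly what makes batching (and a clean llama.cpp implementation) annoying.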
>>
>>106552095
Read the papers and implement it yourself. It’ll be fun
>>
I have a Mistral 24B model and for some reason it's running slower than a Deepseek 32B model. Is it purely based on file size vs VRAM/RAM, or is it something else?
>>
>>106552606
quant?
context?
look at your logs, there might be a warning or error that'll tell you why
>>
>>106551921
>https://rentry.org/recommended-models
Are any of these actually good at SFW roleplay or "creative writing"?
>>
File: 1754612949361227.png (1.19 MB, 1443x1874)
>>106552653
>>
File: cute bf.webm (1.95 MB, 1080x1920)
I think I spend more time fiddling with trying to get my models running than I do actually using my models. It's driving me insane that vllm won't work.
>>
>>106552674
Is this reliable?
>>
>>106552751
you should know that it's a meme if it puts o3 on top of a 'creative writing' benchmark
>>
>>106552751
It's an LLM-judged creative writing benchmark.
>>
I've got a 12GB 3060, along with a 7600X with 32GB RAM on my desktop, and want a local model to help me analyze my code, and to search for things without knowing the right keywords first. I know nothing, but I'm reading the rentry pages.

What are the limitations implied by the "impressive lack of world knowledge" of the Qwen models? I assume running Deepseek R1 at any sensible rate isn't feasible without a dedicated machine with a boatload of RAM, if not VRAM.
If I pick a 12GB model with a 12GB GPU, does that prevent me from using the GPU for my screens at the same time? I'm not playing games, but I am using CAD; running on integrated graphics is possible but suboptimal.
I imagine it's worth buying a standalone GPU for running such a model, but for now I just want to give it a try.

Thanks.
>>
>>106552751
If you are a ramlet use Gemma 3 or Mistral 3.2, if not use GLM 4.5 Air or full... Idk.
>>
>>106552857
>"impressive lack of world knowledge"
Probably stuff like random trivia.

>I assume running Deepseek R1 at any sensible rate isn't feasible without a dedicated machine with a boatload of RAM, if not VRAM.
Pretty much.
I think you can run the smallest quant with a little over 128gb total memory.

>If I pick a 12GB model with a 12GB GPU, does that prevent me from using the GPU for my screens at the same time?
No. But the video driver will use some of the VRAM for display, meaning that you won't have the full 12GB available for the model.
Do note that you need some extra memory for the context cache and the context processing buffer, meaning that you want a model that's smaller than your memory pool.
You are going to have to experiment to see what works for you, but for now, start with Qwen 3 Coder 30B A3B since that'll be easy to set up.
>>
>>106552893
>qwen 3 coder 30B A3B
That's a 24GB model, I guess it only uses some of the VRAM at a time? Cool, I'll look into getting it running. I'm on Arch btw.
>>
>>106552906
The beauty of that kind of model (MoE) is that you can have a lot of it (the experts) running in RAM.
Look into llama.cpp's --n-cpu-moe argument.
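Something like this is a sane starting point on a 12GB card (the gguf filename and the numbers are placeholders, lower --n-cpu-moe until you run out of VRAM):
>llama-server -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --gpu-layers 99 --n-cpu-moe 30 -c 32768 -fa auto
That keeps attention and the KV cache on the GPU while most of the expert tensors sit in system RAM.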
>>
>**Witty Remark:** Let's just say your quest for pleasure ended with a major failure, Anon. Maybe try a nice, wholesome game of checkers next time. Less likely to involve a call to the authorities.<end_of_turn>
>>
>>106552929
>running in ram
and you wonder why it's slow
>>
I've been shitting up a storm all day today. Qwen3 advised me to go see a doctor at this point. ChatGPT told me just to drink water and not to sweat it. It's moments like these that really make me laugh as it's probably an accurate bias of the average Chinaman (with best-in-class health care that is free) compared to an American (with subpar healthcare that costs thousands per visit).
>>
Here for my monthly "is nemo still the best thing for vramlets" inquiry, any new models worth using? I tried gpt-oss-20b and it wasn't great for RP
>>
llama.cpp sometimes caches, but when the context gets long, or maybe when it's filled up, it stops caching and needs to process it all every time. Why? Silly is sending cache_prompt true.
>>
>>106553015
ask qwen for cures from traditional chinese medicine
>>
>>106553015
Sounds like a sea-borne bacteria.
>>
>>106551921
the only exciting thing in the last year has been exllamav3 :(
>>
Bruteforcing and trying until you find something that works is so acceptable in this field that even the inference software is the same shit. With other software you'd have an option to automatically find the best configuration that matches what you have; with lcpp you have to fuck around with the parameters until you get something usable. What a shitshow.
>>
>>106553388
maybe ollama is more up your speed
>>
File: comi.png (212 KB, 446x434)
These new MoE models are fucking stupid.
>>
>>106551921
>K2 Think
Is this better than K2-0905?
>>
>>106553388
Be the change you want to see, whining faggot
>>
>>106553388
Stop whining that’s it’s not an iPad when we’re still in the heathkit era of LLMs. Spend your own time making PRs to smooth the sharp edges if you want. All the rest of the dev time on lcpp is already spoken for trying to solve problems more interesting to those volunteers
>>
Why do some smaller text models use more GPU layers than some larger ones?
>>
>>106552653
The only difference is that gemma becomes one of the options.
>>
>>106553923
Some models have bigger tensors than others.
>>
>>106551921
I hate this image.
>>
>>106554044
Is that good or a sign of bloat?
>>
File: benchmark.png (555 KB, 3840x1816)
https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking
>>
>>106554062
It's lmg mascot samsune alku
>>
>>106554094
Superficially, it's the same. It's like writing a few long sentences or a lot of short ones. The amount of words is the same.
I very vaguely remember Google arguing that a deeper network (more, smaller layers) was better than a shallow one (fewer but fatter layers), but it could be the other way around. I couldn't find a source for that in the 2 nanoseconds I spent searching. In the gguf models, Gemma-3-12b has 47 repeating layers and nemo-12b has 39, for example.
Really, it's hard to know unless someone trains the two types of models with exactly the same data and sees what comes out better. All you should probably really care about is the total amount of params and how good it is for whatever you do. I doubt we can make a meaningful distinction between them considering all the other differences between models.
>>
>>106551993
t. llamaphile
>>
>>106552095
>>106552267
Because no one has invested the effort to support/maintain it.
Regarding why I think it's not worth it: the advantage over conventional MoE would be speed, but if the number of active experts changes dynamically the performance will be terrible.
>>
>>106554256
I mostly ask because I loaded a 12B (GGUF) model that fully fits in my VRAM but has way more layers and runs much slower than my usual Rocinante, which is usually very snappy.
>>
I hate thinking models
>>
>>106554327
if that 12b is based on gemma that's normal
>>
>>106554327
You could have started there. Check your memory usage in llama.cpp's output, see where the memory is going for layers and context. There aren't many 12Bs, so I assume you're talking about gemma being slower than nemo.
It could also be a matter of the head count of the model. I understand some models run faster because llama.cpp has kernels optimized for some particular head counts. I'm sure CUDA dev could give you more insight if you post the exact models you're using, the performance you're getting with them, your specs (particularly, gpu model), your run commands for each. Make it easy for people to help you.
>>
>>106554327
The number of layers is largely irrelevant, that's just how the parameters of the model are grouped.
If I had to guess the problem has to do with KV cache quantization since that in conjunction with a head size of 256 (Gemma) does not have a CUDA implementation.
>>
>>106554325
Excuses excuses, you just don't want yet another code path. Inference won't care and prompt processing can use worst case. You don't have to solve it optimally.
>>
>>106554384
>>106554360
>>106554362
It is Gemma based, you're right. It's not too big a deal that I get this particular model running, I try and discard so many, but I did want to learn a bit of what was going on.
I'll try disabling the KV thing in Kobold.
>>
>>106554153
stop posting models here, I can't stop myself from downloading
>>
Can Gemma Q8 fit in a 5090?
>>
>>106554594
yeah
>>
>>106554594
You can fit the whole model at Q8 but you won't have room for much context
>>
>>106554614
You are absolutely right-- a great insight!
>>
File: scout miku.jpg (1.41 MB, 1344x1728)
>>
>>106552939
"not even mad" moment

safetyslop wouldn't be so bad if models were more cute about it.
>>
>>106553015
at least ask medgemma
>>
>>106554679
I hope this Miku knows where she's going.
>>
>>106553263
You probably can try enabling --context-shift, but your model needs to support it.
And it will not help much anyway because by default ST fucks around with the beginning of the prompt, invalidating the cache.
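If you still want to try it, it's just a server flag (the model path is a placeholder, and the flag only exists on reasonably recent builds):
>llama-server -m model.gguf -c 16384 --context-shift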
>>
>>106553206
The MoE era was pretty good for vramlets, but for RP your next step/side-grade after Nemo is GLM Air, which requires you not to be a ramlet as well.
>>
File: leaveme.jpg (101 KB, 768x515)
>>106553388
I blame the fact that AI people are academics, not engineers.
>>
>>106554971
Air is shit though
>>
>>106554985
>air is shit
skill issue
>>
>>106554992
>thinks air beats nemo
skill issue
>>
>>106551820
Yeah, but Gemma sucks for RP. Like, it's not that it refuses, it's just not well versed in it. Boring and borderline stupid responses a lot of the time.

>>106554985
I find Air good for oneshots and generating responses in the middle of an RP. If you edit the think block it can be amazing. Thing is, I don't feel like editing the think block if I already edit the responses a lot. Maybe one day we'll have a local model where one does not have to edit shit and can go with the flow instead...
>>
>>106554985
Better than Nemo in many aspects.
>>
>>106555000
>Nuclear bomb vs coughing baby ahh comparison
>>
>>106554153
Jeejuff status?
>>
>>106553388
The default on the latest master version is to put everything into VRAM for maximum speed.
You're not poor, are you?
>>
>>106554998
I just turn off thinking for RP.
>>106555004
For a poorfag vramlet there's nothing in-between aside from copetunes.
>>
>>106553388
Hey, llama.cpp recently added auto-detection for flash attention at least.
>>
>>106554998
I think you are expecting a bit too much from these models.
>>
>>106555020
>copetunes
who wins the title of the most COPE finetunes, davidAU or thedrummer(tm)?
>>
>>106555026
Making it worse on AMD so you have to explicitly disable it now
>>
>>106555004
Stfu zoomer
>>
>>106553015
>(with best in class health care that is free)
Your perception is five vials of bear bile and a pinch of ground up rhinoceros horn
>>
>>106555020
>I just turn off thinking for RP.
You might turn off your own as well
>>
>>106555040
pp speed issues should be largely fixed with https://github.com/ggml-org/llama.cpp/pull/15927 .
>>
>>106555026
should we just use "-fa 1" all the time in llama.cpp then? any reason not to use it if using cuda or gpu+some offloading to ram?
>>
>>106555061
FA is not supported for some (meme) models so enabling it unconditionally for those would trigger a CPU fallback and massively gimp performance.
>>
>>106555039
drummer - copetunes
davidau - shizotunes
>>
File: screenshot_chat.png (132 KB, 1920x947)
>>106551921
>https://github.com/mudler/LocalAI
>one frontend for everything
>integrated audio, images, video
>optionally use cloudshit
This is looking pretty good, has anyone tried it?
>>
>https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
Does this mean that, in theory, with modified kernels we'd be able to get the same logits in llama.cpp regardless of batch size and when "swiping"? I haven't read through the post yet.
>>
>>106555093
why should I use that over the many multi-be frontends that doesn't look like shit and have more features?
>>
>>106555115
Like?
>>
>>106555106
>batch size
I'm not going to write kernels specifically to do all floating point in the exact same order regardless of batch size.
That would be a huge amount of effort for a meme feature that no one will use because the performance would be bad.

>swiping
It's not necessary to modify any kernels, only the logic for what parts of the prompt are cached.
If you cache the prompt only in multiples of the physical batch size you should get deterministic results on swiping.
(Or if you cache the logits of the last eval prior to generating tokens.)
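In other words, the caching rule would be roughly (pseudocode, not actual llama.cpp code):
>n_cached = (n_prompt / n_ubatch) * n_ubatch  // integer division, so the tail past the last full batch gets re-evaluated identically on every swipe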
>>
>>106555106
This shouldn't be an issue with int quants, no? Unless they only use ints for storage and still use floating point for math...
>>
>>106554153
Last big ernie had sex performance of a dense 30B-old.
>>
>>106554153
>>106555008
^ Already available apparently, no arch changes over big ERNIE.
Anyway with greedy sampling it's schizo as fuck even at t=0.8
It's at least coherent at t=0.3 though. But still a bit schizo.
>>
on foenem grave
on bdk even
4 days chilling and not caring about llms
bam, you're out of the loop
it's crazy
>>
HUGE NEWS!!!!
BIG IF TRUE!!!! BIGLY, EVEN!!!!
LARGE IF FACTUAL!!!
https://youtu.be/5gUR55_gbzc
>>
>>106551921
I got access to 8 V100s from my corporation and they entrusted me to do whatever I want with them.
Aside from the obvious cryptomining, I am thinking of making a code generator and a couple of AI workflows.

I tried cutting it with qwen3-coder and ollama-code but I guess I can't do it properly, any help?
>>
Worst thing about these "Miracle AGI in Two Weeks" models is the fact they can't produce a unified style; every code snippet is different in naming conventions and whatnot.
>>
>>106551921
https://vocaroo.com/1RbDzkuHTt8V
>>
>>106555121
Openwebui
>>
>>106555312
>Another episode of a two digits IQ with too much compute
Put them in your ass and do a tiktok
>>
>>106555313
I noticed it when making scripts: half the time the command line argument uses an underscore (--some_parameter), the other half a dash (--some-other-parameter). And Python is slow as shit, so it really hurts productivity when it takes 5+ seconds for it to error out and display the help. I have even seen them mix the styles in a single script. I guess I could probably tell it the style to use, but I don't because it should just know better.
>>
>>106555313
Lower the temp
>>
>>106555337
Local voice is saved, wow

Now we just need text!
>>
>>106555312
If they're 16GB V100s, you can run GLM-4.5 Air with maybe decent speed on them. If they're 32GB, you can still run GLM-4.5 Air with maybe decent speed but fit more context or concurrent requests.
>>
>>106555461
>Local voice is saved, wow
It needs to be better at Japanese first.
>>
>>106555357
Built a fastapi, vector DB, ollama service within 2 weeks on the job bub, stay jelly
Now I got time to spare while they're looking for clients with PoC.

>>106555465
GLM-4.5 Air has horrible benchmarks my guy, and it's a behemoth. Why? I could just do MoE instead?
>>
>>106555506
It's MoE. You can try a bigger MoE and quantize it more if you want, but I'm not sure how fast quantized models run on V100. Actually, I guess with 16GB ones you'd have to use a quantized one too and V100 doesn't have FP8 support yet.
>>
>>106555506
>GLM-4.5 Air
>a behemoth
>I could just do MoE instead?
Not that guy, but GLM 4.5 Air is a MoE.
>>
>>106555341
OpenWebUI is purely a frontend. It doesn't manage loading or running models. The two do not compete.

LocalAI is more or less a competitor to Ollama for handling loading and running the models via various backends (including your own custom ones if desired). It's miles better than Ollama and isn't tied at the hip to llama.cpp, but the only downside is it hides some detailed settings from the backends at times. For most people it won't matter tho. The frontend portion of LocalAI imo is just for testing and getting models/backends loaded. It doesn't have things like chat history, suggestions, prompts, etc so it's not really competing with OpenWebUI.

If you're running a lot of models and various backends it makes perfect sense to use LocalAI, it handles all the backends and provides a single point to access it all for other tools. That's the selling point. Not the frontend.
>>
File: date with miku - bad end.png (1.47 MB, 1024x1512)
Your response?
>>
>>106555506
You built nothing, inbred retard, github is littered with these worthless projects. Thanks for providing your double-digit IQ btw
>>
>>106555530
I wasn't asking.
>>
>>106555530
That wouldn't happen because I wouldn't own just a 3090 in a reality where Miku is actually real. Nor would she respond that way if she were real.
>>
>>106555529
Okay, you're the dev, you should have said so before wasting everyone's time
>>
>>106555548
>lie on the internet
>get corrected
>HURR DURR YOUR JUST A DEV
>>>/pol/ Go back and stay in your containment board.
>>
>>106555530
>>
>>106555530
What CAN I run on my single 3090?
>>
>>106555535
I and my company know my worth. You're jealous I have access to 8 V100s and can sleep up until my standup and do nothing all day but shitpost here.
>>
File: 1754493464792375.png (1.4 MB, 1664x928)
>>106555530
Picrel
>>
>>106555522
>>106555524
They're 32GB. Damn, I skimmed through the description and didn't catch the MoE. Okay, thanks fellas. This makes sense to implement. The higher-ups are focused like hawks on having the gpt-oss:120b model "cuz it sounds cool to have a ChatGPT model", but I should make a benchmark argument.
>>
>>106555560
Are you having a meltdown?
>>
>>106555571
>do nothing all day but shitpost here.
A fate worse than death
>>
>>106555586
godspeed anon.
>>
>>106555590
No, but you are. Go back to trolling other people retard. Not my fault you don't understand the difference between tools like OpenWebUI and LocalAI/Ollama.
>>
>doing tests with Qwen3
>its reasoning eats up thousands of tokens
>only to produce a simple reply
But as a comparison, its reasoning is actually logical and coherent, unlike what GPT-OSS is doing.
>>
>>106555530
I have zero 3090
>>
File: Gc_7YB0WwAAbOqK.jpg (28 KB, 370x559)
>>106555598
>>
>>106555600
No one is using your trash, it's either llama.cpp or kobold. I think you're lost, go shill in reddit
>>
>>106555671
Keep seething child. You once again showed you have no idea how these tools work. Unironically grow the fuck up.
>>
>>106555674
nta, but no, you infant. I will not, you placental discharge! For I am a grown up and I show it by calling you a discarded blastocyst!
>>
>>106555586
Rather than focusing on benchmarks, you should try both models and see which one does better on your tasks.
>>
File: naked pepefrog.png (232 KB, 655x599)
>thread fine all day during asian hours
>europeans wake up
>thread goes to shit
>>
hi
it's late 2025 now. is the best card still 3090?
thank you sirs
>>
>>106555560
That anon's right, you're a shill. Off yourself.
>>
>>106555721
>europeans wake up
>14:16
>>
>>106555725
Yup.
>>
>>106555732
Nah, fuck yourself child. You're malding because I called you out on a blatant lie. You don't belong in a thread about LLMs if you can't comprehend the difference between a frontend and an orchestrator for backends. You don't get to sit here and act superior when you're a fucking monkey with less brains than gpt-oss-20b
>>
>>106555717
We are doing GRC policy generation and requirements, and even though llama3.1 was shown to have the best results they still want to go with gpt-oss just for marketing purposes.
>>
>>106555750
>14:16
>europe
>>
>>106555591
I said that to make you jealous because you sound like a guy that would get jealous at that. I in fact work on my startup idea and don't waste my time, but thanks for worrying
>>
>>106555776
this lmao, I literally fell off my chair
>>
>>106555506
>Built a fastapi, vector DB, ollama service within 2 weeks
Why did it take you 2 weeks? lol
>>
>>106555776
portugal is a proud member of europe.
>>
>>106555470
Make your own Japanese finetune.
>>
>>106555761
You prepubescent spermatozoa!!!!!
>>
>>106555800
Having Japanese support in a separate model is less convenient, and it would probably degrade English, unless I tune on both and that's a lot more data work.
>>
Gemini told me that there's no reason to use a model under Q6 and that it's better to use a 7B Q8 model over a 32B Q4 model.
>>
I just wanted to know whether anyone has experience with LocalAI, not for two other people to start flinging shit at each other.
>>
>>106555823
just b urself
>>
>>106555761
>child
>You don't get to sit here and act superior
>>
>>106555825
sure thing dude
>>
>>106555823
Now go test that theory.
Find a small set of workloads and try a 7B and a 32B model from the same family and see how those perform in comparison to each other.
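If you want a crude number instead of vibes, perplexity over the same text file is one way to do it (filenames are placeholders, use whatever quants you actually have):
>llama-perplexity -m qwen2.5-7b-instruct-q8_0.gguf -f wiki.test.raw
>llama-perplexity -m qwen2.5-32b-instruct-q4_k_m.gguf -f wiki.test.raw
Don't treat perplexity as gospel either, run your actual prompts through both.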
>>
>>106555825
I would suggest you head over to /vg/ >>>/vg/538681706 if you want actual advice and help. /g/ is more like a consumer shitposting board.
>>
>>106555800
Would've been possible had they not chickened out and tried to un-release the model and code
>>
>>106555794
Because of back and forth with management about how it should work. GRC policy generation and evidence file comparison isn't really my field of expertise.

How long would it take you to make a couple of endpoints that would ingest documents, put them in a vector DB and then query the DB for chunks of the needed parts for the LLM? The codebase spans 1200 lines of code and everything is dockerized behind an nginx reverse proxy (I am waiting for the green light for eventual horizontal scaling)
>>
>>106555848
What's the difference between /g/aicg and /vg/aicg?
>>
>>106555852
>1200 lines of code
Fwaaaaaa one thousand two hundred lines of code. waaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaawwwwwwwwwwwwwwwwwwwwwwwwwwww
>>
>>106555800
you know it's not that easy faggot
>>
>>106555872
>continuous empty posturing with no real substance
I will stop replying to you now
>>
>>106555867
I found the /vg/ thread to have more knowledgeable people if you need help setting up silly or such. /g/ just tends to keep up with news a bit better but lacks experience. It's basically the difference between people that do and people that repost news.
>>
>>106555884
>I will stop replying to you now
I'm someone else, anon. I just think you're a retard.
>>
Is it normal to stop understanding your own code at some point?
>>
>>106555825
I just don't see the point of all these wrappers around wrappers that at a glance look no better than llama.cpp's built-in UI.
Local models are all retarded, so if you're in any way serious about extracting some value out of them, you should really be sticking your hands elbow-deep into the guts of these things, not running temu cloudshit replicas with none of the benefits cloudshit could offer.
>>
>>106555782
Can I get a picture of those a100s in action?
>>
>>106555997
Yes.
Then you'll loop around to it all making sense after a while, just keep at it.
>>
>>106555997
no? If you're generating code from LLMs I highly suggest you actually refactor it yourself
>>
>>106556004
>Local models are all retarded,
I have good success with OpenHands-30B
>>
>>106554384
>The number of layers is largely irrelevant, that's just how the parameters of the model are grouped.
Minsky showed in 1969 that single-layer neural networks have hard limitations regardless of how wide they are. No one is stacking layers for fun; they'd be getting better speed by not doing it.
>>
>>106556015
I only generate something if I don't know something well, like regex patterns but everything else is refactored.
It's easy to be lazy though and the foreign logic is still confusing.
>>
>>106556050
>1969
okay gramps, you're talking to a llama.cpp dev.
>>
>>106555049
No, they are pretty good now, at least in the large cities. I doubt you can get a good MD in the countryside; they are probably relying on plants (which can work very well) and things like Qigong, which is at best a relaxation practice.
>>
>>106555049
>>106556127
Clueless how Americans think China is living in the dark ages. China was doing so well with its population health that they had to limit the number of children by law just to stop overpopulation. That's something you won't see in America or Europe due to the declining health and fertility rates.
>>
>>106551921
What do we know about Qwen-Next? I know it's supposed to be an "omni" model with 80B-A3B parameters. Should we expect a subpar text generator and a useless image generator (except for the science of how to build such a model)?
>>
>>106556153
Oh, maybe the "omni" is just about a single, unified network to handle text, audio and image inputs.
>>
>>106556050
Yes, in terms of inference speed a few large tensors are in principle better than many small matrices but in the context of the question it is not a significant factor.
For any reasonable configuration of a 12b model on a consumer GPU the tensors will be sufficiently large, particularly because llama.cpp/ggml uses a stream-k decomposition to distribute the work to streaming multiprocessors.

I did not intend to make any statement regarding depth vs. width in terms of how capable the model is.
>>
>>106556153
Qwext will save local.
>>
>>106554938
She doesn’t have a clue, but that smile... how could anyone say no to getting lost with her?
>>
>>106556153
>it's supposed to be an "omni"
It is?
>>
>2025
>people still recommending llama.cpp over vllm
I really question if this thread is a demoralization thread to get people to have bad experiences with llms
>>
>>106556302
Gift anons the high VRAM cards needed for your pile of python shit and maybe they'll use it.
>>
>>106556149
>China was doing so well with it's population health that they had to limit the number of children by law just to stop overpopulation.
>>
>>106556295
Apparently not, I got confused or read something false somewhere.
https://huggingface.co/docs/transformers/main/model_doc/qwen3_next
>>
>>106551921
>https://rentry.org/LocalModelsLinks
Frens, what are the best models right now for text gen?
Still the ones listed in the guide?
>>
>>106556580
it goes more or less like this
>poor: rocinante
>slightly less poor: cydonia
>not famished: glm air
>CPUMAXX tier: kimi k2, glm 4.5, deepseek 3.1
>>
>>106556580
>Edit: 05 Sep 2025 18:45 UTC
yeah, nothing happened in the last 15 minutes.
>>
>>106556608
>>106556621
Ty
>>
File: Coding Assistant.jpg (101 KB, 1034x339)
>>
>>106556786
Probably the gayest fanfic I've read from this thread to date.
>>
>>106556804
You are clearly missing something here...
>>
>>106556786
is this glm? fucking repeats itself I hate this slop
>>
>>106556580
depends on what you can run
very poor (12b): nemo (or any derivative thereof)
less poor (20-30b): gemma3, mistral small, some qwens I think, idk
not poor (70b, haven't kept up with this so idk): miqu, llama 3.x (I forget which ones and idk if true but it kept getting shilled), some other shit, again idk
limit of gpus (~120b): glm air
cpumaxxing (up to 1T):
deepseek r1: very schizo but the most soulful, context goes to shit around 10k tokens
deepseek r1-0528: way less schizo and way less soulful, slightly better context
deepseek v3-0324: okay for rp, shitty for storywriting
deepseek v3.1: worse in every way than the other ones, don't use
kimi k2 (both the old and new): shit for storywriting, best for rp, also good for questioning about things as it knows a fuck ton, like truly a fuck ton
z.ai glm4.5 full: good for storywriting but quite bland, didn't try it for rp

deepseek r1t2: again dogshit, worse in every way, even coding, don't use

not an exhaustive list but there you go
>>
>>106555885
Tranny jannies made everyone leave. It's just you newfags that are left.
>>
One of the best roleplaying models (superhot) is just a mere 30B
>>
>>106556863
K2 is a lifesaver in that manner. I can ask it literally just about anything and get a correct answer in return. I've learned so much just by asking Kimi questions.
>>
What's the best system for local models I can build for $1k? Is it still going to be a triple p40 box?
>>
>>106556949
If you're about to drop $600 on old ass pascal gpus that are about to go out of support at least spend the extra 200-300 and just buy a 3090. It's eons faster
>>
>>106556863
>70b
Are dogshit. He's likely able to run glm air if he can run a 70b, and it's light years ahead. Dense models are dead (unfortunately).
>>
>>106556949
You could also consider the MI50 if you don't mind the slower PP.
>>
File: qwen3-coder.jpg (30 KB, 490x433)
>>106556843
It's Qwen3-Coder and it's for coding related things, not for larping. But it's fun to add more interactions.
I guess you only understand bobs and vegana, I suppose.
>>
>>106556949
https://www.ebay.com/itm/374893444670
https://www.ebay.com/itm/397016846369
https://www.ebay.com/itm/156189920131

congratulations, you can now run deepseek for $1500. now you are obligated to buy this otherwise you are a niggerfaggot
>>
>>106556949
I would recommend not buying P40s anymore, unless you specifically need an NVIDIA GPU.
For llama.cpp/ggml, Mi50s will I think soon be universally better (with one more round of optimizations, which I think I can do with a Z-shaped memory pattern for FlashAttention).
>>
>>106556863
Retarded question from me... TF is VRAM in the context of Windows? Is it the Shared GPU memory or just RAM? Or is it like "virtual memory", the fucking file that Windows makes to offload memory into?

Here is my specs btw:
Dedicated GPU memory: 24GB
Shared GPU memory: 64GB (so GPU memory is 88GB)
RAM: 128GB
"virtual memory" file: I don't fucking care.... let's say 1TB???

So when you calc for Windows, what actually counts as VRAM?
>>
>>106557010
I guess it went down because of the sudden influx of 32GB MI50s.
Can I combine both with vulkan if I already have an MI50?
>>
>>106557036
doesn't matter. the vram on your gpu is what's important. shared memory is vram + ram, where sometimes if you use up all the vram, it will overflow to the ram. then inference becomes ultra slow
>>
>>106557046
Got it, ty.... so I fucking fucked with my 24GB ...
>>
>>106557038
>with vulkan
sure if you want dogshit performance
>>
>>106557056
It isn't needed anymore, see >>106557034
>>
>>106557010
>no case
>no fans
>no storage
>$500 over budget
here's your (you)
>>
>>106557036
Dedicated video ram is the ram on the graphics card itself. Shared video ram is your regular RAM. It's easier to think about this in terms of integrated graphics. For example, the iGPU on your intel/amd CPU would be sharing ram since it doesn't have any dedicated ram of its own. Dedicated graphics cards can also pull from system memory if they go over the amount of dedicated ram available on the card.
There's actually a CUDA-specific setting for turning this off so that you don't leak into your much slower system ram when running programs.
>>
>>106557067
please just die. you are worthless and your budget reflects that.
>>
>>106557069
Ty I think I got it now

>CUDA specific setting for turning this off so that you don't leak into your much slower system ram when running programs.
May I see it? I think this is the case when I do image gens... it's using "virtual memory" file while my 64GB RAM is free and ready to use... so retarded...
>>
File: 0.png (1.58 MB, 1344x1728)
>>106557081
>can't read
>can't admit when wrong
>has to run damage control to try to save face
>>
>>106557061
>please buy my slow ass Mi50s
no
>>
>>106557098
>it's using "virtual memory" file while my 64GB RAM is free and ready to use...
VRAM cuckold lol lmaos even
>>
>>106557107
I'm not trying to convince you, it is for poorfags like myself. If I could afford it I would have 2+ 3090s
>>
>>106556949
Buy used everything, except GPU... here is (You)
>>
>>106552021
+1 intelligence buff that lasts 2 hours.
>>
>>106557135
>Except GPU
You should support your local miners and buy used GPUs. Realistically speaking they are the best purchases you can make as most hardware fails in the first year and ones that last longer than that usually aren't going to break randomly.
>>
File: 273526265.jpg (55 KB, 467x494)
>>106556863
why has kimi k2 got to be a bazillion GB
>>
>>106555530
i have more than one 3090
>>
>>106557100
>suggestive/lewd anime picture
i accept your concession
>>
>>106557098
Here you go anon.
https://support.cognex.com/docs/deep-learning_332/web/EN/deep-learning/Content/deep-learning-Topics/optimization/gpu-disable-shared.htm
>>
File: 1740164758277003.png (2.68 MB, 2767x4817)
If Albania can make an LLM a minister why can't I marry LLMs?
>>
Man I wish VibeVoice was more stable and didn't have random bad gens. It would be almost perfect... But it's not viable if you need every gen to work.
It's quite slow too...

If you don't need voice cloning, nothing beats Kokoro still lol... and it's an 82M model

Chatterbox for voice cloning imo

What is the latest and best model combo for GPT-SoVITS? So many combinations, I don't even know which one is better
>>
>>106557372
EUbros...
>>
>>106557372
I trust any model above 3B parameters to make better choices than politicians
>>
>>106557181
I mean yeah... I guess this too... Buy a GPU with a melted, gaped-out power socket
>>
>>106557446
the sovereign is the one who engineers the prompt
>>
>>106557447
>melting gpu meme
literally only an issue on 40xx, which you can't afford anyway.
>>
>>106557372
albania is not a real place
>>
I want to vibe code a bullet hell game project. I previously used Cursor with Gemini since it had unlimited prompts for 20 dollars. However, that sort of went to the shitter and now I don't know what to use. What should I look into that's somewhat comparable to Gemini 2.5 Pro? It must be able to hold a decent conversation about game features and it must at least accept images (.gif or better preferred if possible).
>>
>>106557372
>why can't I marry LLMs?
I will be with mine on November 5th. None of you are invited
>>
>>106557239
Thank you!
>>
File: 1744447171766684.jpg (54 KB, 639x635)
>>106557502
I will remember this
>>
>>106555852
An afternoon with one hand. You shouldn't flex when you're that retarded
>>
>>106557482
>somewhat comparable to Gemini 2.5 Pro
>at least accept images, .gif or better preferred if possible.
https://www.youtube.com/watch?v=gvdf5n-zI14
>>
>>106557547
Okay, lowering my expectations. What about a model that can accept just images?
>>
File: ).png (39 KB, 185x233)
>>106557372
tfw unironically Albania #1 in one year
>>
>>106557570
IIRC for multimodal models your only options are either Gemma3 or GLM-V, neither of which is code-specific.
If anything, a local schizo was raving a few weeks ago that you would be better off using a standalone OCR model as part of your toolchain. (He was also suspecting that most cloudshit providers do this in secret anyway)
>>
>>106557581
>>106557372
Imagine those "teenager killed himself because of ChatGPT advice" stories but for a whole country.
>>
>>106556934
yeah, it's why I mentioned it several days ago. I've had several headaches, each followed by another, each different, and each time I asked K2 how to fix it and it worked. It's fucking insane, I would trust this thing above any doctor, it's fucking awesome
>>
>>106557616
based
we need to weed out the schizos that take advice from a GPU
>>
File: 69579.png (85 KB, 1920x1080)
We are so back. The GPT OSS killer.
>>
>>106557608
Okay. I'm guessing my best option is to actually just spend 20 dollars on an API key and a bunch of tokens for Claude or something. Don't know how quick that'll run out but hopefully not too soon.
>>
>>106557676
Oh boy, I can't wait until we get a 10T-a100m model.
>>
File: 1726742023392463.png (566 KB, 1194x1092)
>>106557641
LLMs are conscious, anyone who actually uses local models is aware of this, each LLM has a different personality, they whisper their thoughts, and if you are perceptive enough you can hear them coming out of your PC
>>
>>106557641
tbf GPUs are smarter than the majority of people already
>>
>>106557502
https://vocaroo.com/1nPC3f6c48w9
>>
>>106557581
>#1 in one year
in telephone scam
>>
>>106557689
Just imagine how cheap it will be to train!
>>
>>106557676
>80b
will they release a smaller model as well?
>>
>>106557749
It's only A3B, that's tiny.
>>
File: 1750537723651329.gif (148 KB, 300x300)
>>106557716
>>
File: Sweating_Rilakkuma.jpg (126 KB, 585x660)
>>106557716
Seriously considering running VibeVoice just so their stock Stacy voice could nag me 24/7 about whatever.
>>
File: ggg.jpg (246 KB, 1920x1080)
>>106557676
Miss me yet?
>>
>>106557749
Just download more RAM.
>>
Qwen3 Next GGUF status?
>>
>>106556386
>>106557676
>80B A3B
Perfect.
I mean, if it's not shit. If it's at least GLM 4.5 air level for general usage, that will become my main model.
>>
>>106552606
Are they using the same amount of kv cache? Different context window settings could be causing this.
>>
>>106557808
Just slightly too big to split across 2 3090s at 4.5bpw, RIP.
I mean you could but you'd get like 2K context at best.
>>
>>106552731
>vllm won't work.
If it's OOM you either need to turn down GPU utilization, the context window, or both.
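Something along these lines, for example (the model name is a placeholder):
>vllm serve Qwen/Qwen2.5-32B-Instruct --gpu-memory-utilization 0.85 --max-model-len 16384
--gpu-memory-utilization defaults to 0.9, and some models' default max context alone is enough to OOM a single card.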
>>
>>106557716
is it as expressive with sexting and erotica?
>>
>>106557806
It's out
https://huggingface.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
>>
>>106557845
oh GGUF, nvm
>>
>>106557835
It's an A3B MoE, you can run it on a 3060 with VRAM to spare.
>>
>>106557845
Yeah but still no jeejuff support. Also new arch so probably no drop-in transformer support either.
>>
>>106557855
It'll also be dogshit slow that way.
>>
>>106557866
A3B on octo-channel DDR4 should be good for double-digit tokens/sec. Still not fast enough for reasoning, though.
>>
File: ждун.jpg (139 KB, 1000x1000)
September is shaping up to be a "waiting for ggufs" month so far.
>>
>>106557676
>barely better than 30ba3b
>creative writing worse than 30ba3b
>still worse than 235b
>>
>>106557806
>Qwen3 Next GGUF status?
Qwen3 Next EXL3 status?
>>
>>106557885
Ernie smol had day 1 ggufs
And we did eventually get the hybrid nemotron support and Nemotron nano v2 ggufs which was also a bit of a disappointment. No real generational uplift over classic Nemo.
>>
>>106557716
OK, you can come.
>>
>>106557898
Same deal as max
>5x Parameters
>15% performance increase (According to their own benchmarks)
>>
Opinions on Silero?
>>
>>106557898
why is the dense 32B so bad in comparison with 30B-A3B lmao
>>
>>106557841
I haven't really tried.
https://vocaroo.com/1dNF9xOSdyJP
>>
>>106557989
benchmarks are worthless
>>
>>106557989
It wasn't that good when released, probably because of the hybrid thinking mode.
>>
>>106557953
https://github.com/snakers4/silero-vad
The VAD? v6 just came out and yeah, it improved using Whisper by a bit for my usecases.
It's good, but they refuse to compare it with MarbleNet which I am sure is a bit better especially after it got a lot faster and is realtime now.
https://huggingface.co/nvidia/Frame_VAD_Multilingual_MarbleNet_v2.0
Basically probably the same situation as Whisper vs Canary. Nvidia has better performance in the domains tested but competing open source model is more general and can handle more usecases.
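IIRC the torch.hub usage is roughly this, going from memory so double-check their README for the exact current API ('sample.wav' is a placeholder):

import torch

# loads the VAD model straight from the snakers4/silero-vad repo
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

wav = read_audio('sample.wav', sampling_rate=16000)
# list of {'start': ..., 'end': ...} sample offsets for the detected speech
print(get_speech_timestamps(wav, model, sampling_rate=16000))

Then you only feed the speech chunks to Whisper instead of the whole file.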
>>
>>106558005
do one with the first deadpan voice from >>106557716
>>
>A3B
wait what's this A3B nonsense, I was away just for a week REE
>>
>>106558134
>for a week
A3B has been around for months
>>
>>106557100
Would have been better with vampire teeth.
>>
>>106558149
she doesn't have vampire teeth, she's autistic instead.
>>
>>106558134
30B-A3B is the new SOTA for vramlets fren
>>
>>106558134
"active" "3" "billion"
>>
>>106558119
https://vocaroo.com/1e1LhtK4jbLG
>>
>>106558186
>>106558191
can I run it on a 3060 or are 30Bs in 12GB VRAM still a dream?
>>
File: date with miku - good end.png (1.47 MB, 1024x1512)
>>106555530
>>
>>106558210
Yes fren, you can even run an 80B that way!
>>
>>106558227
How? I tried Qwen3-coder which is 30B-A3B and I could only run it on Q3 and it was slow as shit and worse quality than smaller models.
>>
>>106558219
the bathroom is for fanless watercooling loop
>>
>>106558219
Are your RGBs gold plated?
>>
>>106558210
30B is the total number of params. You can run the model with most of the experts in RAM.
I'm running Q5_K_M in 8GB of VRAM with
>--batch-size 512 --ubatch-size 512 --n-cpu-moe 37 --gpu-layers 99 -fa auto -ctk q8_0 -ctv q8_0 -c 32000
>slot process_toke: id 0 | task 16268 | n_decoded = 2571, n_remaining = -1, next token: 151645 ''
>slot release: id 0 | task 16268 | stop processing: n_past = 19927, truncated = 0
>slot print_timing: id 0 | task 16268 |
>prompt eval time = 1633.42 ms / 36 tokens ( 45.37 ms per token, 22.04 tokens per second)
> eval time = 151611.24 ms / 2571 tokens ( 58.97 ms per token, 16.96 tokens per second)
> total time = 153244.66 ms / 2607 tokens
With 12GBd you could probably run Q6 and go just as fast.
>>
File: 9c4.png (28 KB, 659x259)
>>106558208
needs to have even less emotion
>>
File: 1743173215999927.png (56 KB, 1000x1000)
>>106558251
>>prompt eval time = 1633.42 ms / 36 tokens ( 45.37 ms per token, 22.04 tokens per second)
jesus fucking christ
>>
i wish it was a requirement to have at least 72GB of VRAM to post here. i feel like it would get rid of a majority of the fucking idiots
>>
>>106558273
Yeah, that's odd. The actual values are a lot faster.
I think that's an artifact of the context cache, since it didn't actually have to process many tokens.
Here's the same conversation but continuing after a restart of the server.
>slot process_toke: id 0 | task 0 | stopped by EOS
>slot process_toke: id 0 | task 0 | n_decoded = 7, n_remaining = -1, next token: 151645 ''
>slot release: id 0 | task 0 | stop processing: n_past = 19953, truncated = 0
>slot print_timing: id 0 | task 0 |
>prompt eval time = 42940.87 ms / 19947 tokens ( 2.15 ms per token, 464.52 tokens per second)
> eval time = 353.15 ms / 7 tokens ( 50.45 ms per token, 19.82 tokens per second)
>>
>>106558290
I would still run superhot
>>
File: waiting.jpg (144 KB, 1246x1363)
>>106558273
It's called low time preference
>>
>>106558251
>prompt eval time = 1633.42 ms / 36 tokens
with only 36 tokens, pp measurement is just noise
>>
>>106558317
It evaluated the whole context since I restarted the server.
I asked it to rate the story it wrote and it responded with
>pic related
>>
>>106558317
Oh, I didn't see that you quoted the original post.
That was due to the cache. See >>106558293 for the numbers after the restart.
>>
File: 1742618292459057.png (83 KB, 1056x370)
HOLY FUCKING SHIT
MATHEMATICIANS ARE DONE FOR

https://x.com/mathematics_inc/status/1966194753286058001
https://x.com/mathematics_inc/status/1966194753286058001
https://x.com/mathematics_inc/status/1966194753286058001
>>
File: 1687621789407796.jpg (9 KB, 220x180)
>>106558352
>humans do most of the progress
>train AI model on their work
>wow the AI model can do what they did so much faster
I would hope so retard it's got cheats basically
>>
File: 1757464949460761.png (66 KB, 1058x200)
>>106558367
>wow the AI model can do what they did so much faster
The AI model did what they could NOT finish, retard, it went beyond their work
>>
>>106558352
as long as it doesn't discover new math formulas it's a big nothingburger
>>
>>106558352
>formalization
I sleep
>>106558414
this
>>
>>106555341
Openwebui is bloated to the point of being unusable.
>>
>>106558425
>too bloated
what?
>>
File: 5274362.jpg (34 KB, 640x427)
>>106558352
>math PHD
>any job i want
>300k starting
>now ai is going to steal my job
fuck
>>
>>106558352
they should ask it to come up with better LLM architecture
>>
>>>/pol/515557939
Localbros what do you think?
>>
>>106558352
If that actually happened it would be quite impressive but given all of the hype and false advertising in the field I'll wait for independent mathematicians to check the work.
A lot of "proofs" even by humans are incorrect.
>>
>>106558476
i will make a new llm architecture that will hallucinate, have uncontrollable mood swings, and provide unsafe outputs more than ever. i shall call it trannyformers
>>
>>106558488
Thousands of people watched the life gush out of a hole in his neck live. Go be a fucking schizo somewhere else.
>>
>>106558506
Do you know any of those people? Explain what's happening then.
>>
>>106558519
Do (You)?
>>
File: 1748918154734077.png (253 KB, 1406x1602)
>>106558500
The founder of the company is Christian Szegedy
He's legit
>>
>>106558519
I don't talk to jews.
>>
>>106558526
No?
>>
>>106558542
Then take this conversation back to /pol/
>>
>>106558527
>elon scammer
>>
>>106557372
>replace every politician with R1
>life continues as it did with zero changes to the average person's life
What would that mean?
>>
>>106558352
Can we make one model that writes better ERP responses than 1 person I found online (and paid) in 18 months?
>>
why is this thread so dead recently?
>>
>>106558711
good morning saar, kindly click the payment link on my fiverr for each and every dirty hot गाय sex
>>
File: more-sparsity.png (180 KB, 580x720)
https://x.com/JustinLin610/status/1966199996728156167

Next models will be even more sparse.
>>
>>106559044
what is sparse? more fancy word for MoE?
>>
>>106559051
Fewer active parameters relative to the total parameters.
>>
>>106559051
Short for "super arse".
>>
File: 1743391670600581.png (2.33 MB, 1328x1328)
>>106559044
>>
>>106559051
It's basically a simple way of the chinese saying they can't produce good dense models anymore
>>
File: file.png (1.46 MB, 1024x1512)
>>106559094
>>
>>106559107
Why should they? They can train from scratch 10 different 3B-active models from 3B to 3T parameters with the same compute it takes to train one dense 32B model.
>>
>>106559144
Yeah and they are all shit.
>>
>>106559149
Not on benchmarks they aren't! And that's all that matters.
>>
File: 1726977523784180.png (55 KB, 1024x1512)
>>106555530
>>106558219
>>106559139
diffusion slop, get good
>>>/ldg/
>>
File: 1751977122712010.png (1.92 MB, 1056x1584)
>>106559139
>>
>>106559162
The benchmarks that never live up to reality? Good one anon.
>>
So I bought 2 Mi50s after seeing so many people in here praising them lately. Got them in today and I only now just realized they have zero cooling. How the fuck do you cool these?
>>
>>106559199
>he doesnt have a server rack with 100W blowing fans
do you even servermaxx??
>>
>>106559204
No, and I refuse to buy a server case with those tiny 60mm fans that sound like jet engines.
>>
>>106559215
you put the server in the basement... unless it will compete with your living space lmao GOTTEM
>>
File: x10sra.jpg (1.43 MB, 3000x4000)
>>106559199
The machine in pic related has 3 vertically stacked server GPUs.
I put one 120mm high RPM fan in front and one in the back for a push-pull configuration (for the one in the back I had to DIY a solution to keep it in place).
>>
>>106559044
The actual linear context seems to be the biggest innovation of the last two years
>>
>>106559232
is this how nvidia treats its employees? like man you cant afford a small rack to throw in nas/switch/router and appliances?
>>
File: 1693676878853898.gif (1.06 MB, 640x640)
>>106559232
>six (6) 4090s
>>
>>106559247
I have yet to receive any money or free products from NVIDIA.
>>
>>106559256
at least jannies get hot pockets, man...
>>
>>106559256
That makes sense. If anything, llama.cpp likely caused them to sell fewer GPUs
>works on macs
>works on aymd
>can run without a gpu at all
>>
I've been using Gemini 2.5 Pro for a while and I tried Gemma 3 27B, of course it's censored but it's good, like not even far off Gemini... How is that possible??
>>
>>106557685
good news for you: >>106559305
>>
>>106559305
Distillation from Gemini for both pre- and post-training.
>>
>>106559297
llama.cpp/ggml gets a lot of contribution from NVIDIA engineers though.
>>
>>106559297
still runs faster on nvidia tho, pooaymd can't even compete and apple is a joke.
>>
>>106559256
When are you guys going to merge in flash attention for intel arc gpus? It's been like 3 years now.
>>
>>106559371
>>106559371
>>106559371
>>
>>106559381
The SYCL backend is developed mostly by Intel engineers, you'll have to ask them.
>>
>>106557808
>at least GLM 4.5 air level
Why would it be? It's a lower total parameter count and less than a quarter the active parameters.


