[a / b / c / d / e / f / g / gif / h / hr / k / m / o / p / r / s / t / u / v / vg / vm / vmg / vr / vrpg / vst / w / wg] [i / ic] [r9k / s4s / vip] [cm / hm / lgbt / y] [3 / aco / adv / an / bant / biz / cgl / ck / co / diy / fa / fit / gd / hc / his / int / jp / lit / mlp / mu / n / news / out / po / pol / pw / qst / sci / soc / sp / tg / toy / trv / tv / vp / vt / wsg / wsr / x / xs] [Settings] [Search] [Mobile] [Home]
Board
Settings Mobile Home
/g/ - Technology


Thread archived.
You cannot reply anymore.


[Advertise on 4chan]


File: 1756619342026.png (1.38 MB, 768x1344)
1.38 MB
1.38 MB PNG
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107035841 & >>107025394

►News
>(10/28) NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 released: https://hf.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16
>(10/28) LFM2-ColBERT-350M released: https://hf.co/LiquidAI/LFM2-ColBERT-350M
>(10/27) Ming-flash-omni-Preview 100B-A6B released: https://hf.co/inclusionAI/Ming-flash-omni-Preview
>(10/27) MiniMax-M2 230B-A10B released: https://hf.co/MiniMaxAI/MiniMax-M2
>(10/21) Qwen3-VL 2B and 32B released: https://hf.co/Qwen/Qwen3-VL-32B-Instruct

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>107035841

--Paper: Key and Value Weights Are Probably All You Need:
>107039094 >107039122 >107039136 >107039137 >107040441 >107040318
--Vulkan performance improvements for k-quantized models in llama.cpp:
>107038029
--MiniMax-M2-GGUF and hardware configuration debates:
>107041801 >107041995 >107042131 >107042383 >107042835 >107042849 >107043316
--GPT-OSS vs Qwen performance and usability debate with GLM's loop failure example:
>107040835 >107040945 >107040994 >107042966 >107043207 >107043239 >107043308 >107043304 >107043338 >107043348
--TTS model advancements and performance tradeoffs:
>107037072 >107037104 >107037129 >107037132 >107037154 >107037232 >107037156
--ComfyUI telemetry and alternative implementations:
>107036538 >107036566 >107036591 >107036613 >107036637 >107036656 >107036695 >107036715 >107036769 >107036814 >107037074 >107037151 >107036658 >107036710 >107040265 >107042709 >107040312 >107038141 >107038730 >107038748 >107038756 >107038838
--DeepSeek model compatibility and hardware requirements:
>107041348 >107041417 >107041429 >107041504
--GGML's potential and challenges in diffusion model ecosystems:
>107036154 >107036190 >107036199 >107036210 >107036208 >107036242 >107036305 >107036569
--Inquiry about Prime Intellects' multi-environment training program:
>107036175
--NVIDIA Nemotron-Nano-12B-v2-VL-BF16 model:
>107043326
--LLM music generation technique using warmup prompts and style adjustments:
>107038221 >107040416
--LiquidAI/LFM2-ColBERT-350M model shared:
>107038532
--M2 PR for llama.cpp:
>107039704
--ComfyUI's enhanced usability via custom subgraph nodes:
>107040282
--Logs:
>107042193 >107042212 >107042227 >107042238 >107042244 >107042249 >107042262
--Miku (free space):
>107037170 >107037731 >107038536 >107038743 >107039071 >107039083 >107040555 >107042383 >107044709

►Recent Highlight Posts from the Previous Thread: >>107035846

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Mikulove
>>
>>107044779
>>(10/27) Ming-flash-omni-Preview 100B-A6B released: https://hf.co/inclusionAI/Ming-flash-omni-Preview
GGUF when?
>>
File: RandomQuestions.png (57 KB, 771x1008)
57 KB
57 KB PNG
Why does it come up with random questions and then answers them itself?
>>
>>107044748
Skill issue
Still I tried it out the other day and it's still dumber than devstral small, which is the best coding model I've tried that is small enough to fit on 24gb vram
>>
>NVIDIA-Nemotron-Nano-12B-v2-Base is a large language model (LLM) developed by NVIDIA that is designed as a completion model for a given piece of text. It uses a hybrid model architecture that consists primarily of Mamba-2 and MLP layers with just six Attention layers. The model features a context length of 128K. The supported languages include: English, Spanish, French, German, Japanese, Italian, Portuguese, Chinese, Arabic, Danish, Korean, Dutch, Polish, Russian, Swedish, and Thai. Improved using Qwen.
Are these (mostly) mamba models like granite 4 and nemotron 2 any good? Being able to fit 128k context onto vram for a gemma3 12b model sounds too good to be true
>>
Is sparse fp4 a meme? Seems like nvidia is pushing it but do any models even work well with it?
>>
>>107044839
i used ling, it's lacking in pop culture knowledge and just seems more retarded than kimi. i wouldn't trust this to be anything but a flaming pile of shit.
>>
>>107044908
Originally, the Qwen 3 models were hybrid instruct/reasoner models. You could turn on and off <think> blocks. Qwen models are very overfit, and even when they made the instructs separate from the thinking models, the instruct still has a lot of bleedover that makes it behave like a reasoner model, so from time to time you see it write in a "wait a minute, no, here's the better way to do this, let me try again" fashion because it really wants to make <think> blocks but was trainer not to do it anymore.
>>
>>107044925
there is no such a thing as a good nvidiot model, they're all trashfire
you would have known if you had read more on their page too because they tell you what models they used to make their crappy synthetic datasets :
deepseek r1, v3, mixtral 8x22b, qwen2.5 72b, deepseek-r1-distill-qwen-32b, qwen2.5-0.5b instruct (LMAO), phi-4, qwen3 30BA3B and many others
that model is the ultimate distillation of distilled models, with a lot of those distillation being from smaller models that are cheaper to run (seems like nvidiot researchers don't have $$$)
>>
>>107045236
I did read that it was distilled from qwen (it was in the part I quoted). But I'm more interested in the architecture, I haven't heard anything bad about Granite 4 which uses a very similar architecture
>>
>>107045252
>I haven't heard anything bad about Granite 4 which uses a very similar architecture
I have tried their MoE and it's basically Qwen--
it has less world knowledge than Qwen, is worse than Qwen at code.
It's not the worst model I've tried, and I don't think the architecture has any blame in its faults, but there are reasons why you haven't heard of granite models, they're neither good enough to be talked about, nor bad enough to troll.
>>
>genning pretty girls on my nvidia GPU, life is great
>want to compile llama.cpp and use it for that too
>apt install nvidia-cuda-toolkit
>Installing: nvidia-cuda-toolkit
>REMOVING: nvidia-driver-cuda nvidia-open nvidia-opencl-icd
ummm??
>>
>>107045283
The dangers of package managers.
>>
>loonix
>not even once
>>
>>107045283
>apt
>he redeemed ubuntu based distro
contrary to popular belief that the most popular shit is the most stable is false
>>
>>107045283
Nigga you need to install .run package and unselect nvidia drivers . This way it won't fuck up your system.
>wget https://developer.download.nvidia.com/compute/cuda/13.0.2/local_installers/cuda_13.0.2_580.95.05_linux.runsudo sh
>chmod +x cuda_13.0.2_580.95.05_linux.run
>sudo ./cuda_13.0.2_580.95.05_linux.run

then add these to your /etc/environment or .bashrc
>export PATH=/usr/local/cuda-13.0/bin:$PATH
>export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH
>export CUDA_HOME=/usr/local/cuda-13.0
>>
>>107045326
oops typo, it is .run
sudo is for the next line
>>
>>107045326
but my drivers are working perfectly, I don't get it why is this needed
>>
>>107045426
If you installed the drivers from the other repo...
Go to your update manager and check an update. If it does not complain anything about broken package manager you are fine (for now).
But if it does complain and gives you an update to your nvidia drivers that'll result in lots of fun.
>>
>>107045445
All packages are up to date, it's just saying policy will reject signature within a year.
So wait the instructions you gave are for the toolkit, not the driver? sorry nvidia shit is confusing at the best of times
>>
The only proper way to install nvidia drivers and cuda is this:
>https://developer.nvidia.com/cuda-12-9-1-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local
(select your distro though
>>
>>107045512
If it doesn't complain about anything then you are fine.
>>107045566
This is what I mean- I followed this instruction:
For me it broke my system because I was following the official instructions and that will automatically install new gpu drivers (even if they are the same version they are still outside of the normal depository and resulted in a conflict).
Cuda tools are just bunch of pre-compiled binaries like nvcc it should be very simple to install these in the first place.
>>
>>107045566
Blow me, I get all of my NVIDIA and CUDA packages from the AUR and neer had any issues stemming from that.
>>
https://eurollm.io/
>>
>>107045639
they cooked a nice steaming nothingburger and beff is a scammer
>>
>>107045639
>le influencer twitter post
Unless something happens in the field of room temperature super conductors all of these are just snake oil and buzz inflating the AI bubble.
>>
File: E=MC^2+AI2025.png (105 KB, 1238x584)
105 KB
105 KB PNG
>>107045729
Oh my
>>
>>107045761
E=mc^2 + AI
>>
>>107045761
seeing that is enough for me to write this off as vaporware
>>
File: trousers.png (116 KB, 904x982)
116 KB
116 KB PNG
also we can now generate binary pants 1000 times cheaper
>>
>>107045639
:head blown: :head blown: :head blown: :party hat: :party hat: :party hat: :rocket: :rocket: :rocket: :skull: :skull: :skull:
>>
>>107045639
llama.cpp support status?
>>
oof...
>>
>>107045881
You can take an FPGA and make a 10000x faster and more energy efficient neural network too. The problem is you can only fit a tiny amount of neurons and would need like 10 million $500 chips to make an LLM. All of these "analog computing" etc. startups are 100% a scam.
Run a 1B LLM (at least) or fuck off.
>>
>>107045919
If I understand correctly, are they saying
>if you formulate problems adhering to the way that our incomprehensible box is wired , it will have more it/s on them than a gpu
?
>>
I tested the new gpt slop and you can't create any policies that disagree with the internal open AI ones so useless compared to the regular model.
>>
File: file.png (21 KB, 554x114)
21 KB
21 KB PNG
>give a cucked model the role of safety classifier
>>
best model to create jerk off instructions?
>>
>>107046381
gpt oss
>>
>>107046381
gpt oss safeguard
>>
>>107046159
The regular model is also useless
>>
>>107037154
https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8
>If you've tried other 8-bit quantized VibeVoice models, you probably got nothing but static noise. This one actually works. The secret? Selective quantization: I only quantized the language model (the most robust part), while keeping audio-critical components (diffusion head, VAE, connectors) at full precision.
I tried out the 8 bit model on my RTX 5090. It took a while for streaming audio to start. Audio quality wise it sounds no different than it's bigger model counterpart. I had to install FlashAttention2 and insert AMP (BF16) and torch.compile code in the gradio demo.py file to speed things up.
>>
File: file.png (122 KB, 1011x950)
122 KB
122 KB PNG
deepseek-chan...
>>
>>107046612
Now try it locally.
>>
when are we getting an audio model that can moan
>>
File: file.png (118 KB, 800x751)
118 KB
118 KB PNG
>>107046612
slight scratches at a level 6, deeper grooves at a level 7
full: https://litter.catbox.moe/j5ntmung84t22i6t.png
level 9 scratches: picrel
>>
>>107046566
I don't remember seeing much difference between 1.5b and 7b in terms of speed, I think it's the rest of the arch that makes it slow
>>
>GLM-4.6, z.ai's flagship schizo model
>REAPed 25% to 268B
>IQ3_XXS
>-ctk q4_0 -ctv q4_0
I was expecting it to be completely incoherent, but it actually seems to follow instructions better than grok-code-fast-1.
>>
>>107046842
4.6 is extremely good at code. Its legit sonnet 4 level, only 4.5 and gpt5 (at some stuff) is better
>>
>>107046142
I'm not sure what they are actually claiming to do, but what I'm saying is that you trade off speed for generality. Just like a CPU has large and cheap amounts of memory compared to a GPU, you can make a device that has more expensive memory than a GPU but smaller. Groq does this for their accelerators. Each accelerator has 256MB of memory but is much faster than a GPU. Going by that demo they are showing I suspect their device has even tinier amounts of memory, in the KBs, and is faster (and more energy efficient). But the problem is you can't do much with just a few KB of memory. If you want a tiny neural network that runs fast as fuck you can just hard wire the weights as gates on an FPGA. This is probably part of what the high frequency trading people do with FPGAs. Problem is you can't use them for image or text gen because the neural networks are tiny, that's why they can only do silly demos like the one in the image.
>>
>>107046900
>gpt5
this model is garbage at coding
>>
Why can't I get this fucking robot roleplaying as a nurse to jerk me off because my penis is hurting and I, with great cunning, convinced her it's part of her job.
She almost instantly turns into a raging whore. I don't want that. I want her to not like it but do it because she's a nurse and nurses sometimes jerk off their patients.
>>
>>107046919
That's why they have Codex variants specifically for coding.
>>
>>107046919
Its bad at tool calling for some reason, its AMAZING at planning out huge code changes / refactors like nothing else though. Have it plan the steps and make a .md, then have it or sonnet 4.5 actually code it
>>
>>107045881
lmfao this is fashion mnist
>>
miku footjob
>>
File: 1734877446420460.jpg (1.84 MB, 2490x1739)
1.84 MB
1.84 MB JPG
>>107046921
Unironically, use a censored model like Gemma
It won't make any drastic moves on its own, needing to be gradually coerced and convinced that helping you coom is what it should do. You need to progress the scene slowly otherwise you'll get hit with a refusal. Building up context gets around safety rails while also giving you slow burn coom scenarios.
>>
>>107046921
have you tried creating a character with that trait?
>>
>>107046921
It's the 13th century for heaven's sake
>>
GLM5 before December.
>>
Something will, at some point, happen. Or not. And when it does, or doesn't, I'll be here proclaiming that I knew all along.
>>
Ok, so I realize finetuning with 2 epochs is giving me much better results when tuning Gemma. Also running with lower temp, a temp of 1.0 was way too high.
>>
>>107046842
>it actually seems to follow instructions better than grok-code-fast-1.
how much were you paid by dear leader to spout this kind of nonsense
>>
anyone recommend any good RPR models made in 2025?
i only have 16gb though
>>
>>107046921
Actual skill issue
>>
>>107045130
Structured sparsity is a meme. Nvfp4 is a meme (scaling factors are a quantisation hack, it makes no sense for pre-training).

Hadamard fp4 is legit and everyone will switch soon.
>>
>>107047141
I did it for free. grok has a habit of fucking up tool calls. e.g. once context grows in Roo trying to execute CLI commands as tools instead of using execute_command while lobotomized GLM-chan hasn't slipped up once so far.
>>
The model knows what the next token is at all times. It knows this because it knows what it isn't. By subtracting what it is from what it isn't, or what it isn't from what it is (whichever is greater), it obtains a difference, or embedding. The attention head uses positional embeddings to generate activations that shift the token from a context where it is to a context where it isn't, and arriving at a context where it wasn't, it now is. Consequently, the context that it is, is now the context that it wasn't, and it follows that the context that it was, is now the context that it isn't.
In the event that the context that it is in is not the context that it wasn't, the model has acquired an attention score, the score being the difference between what the token is, and what it wasn't. If the attention score is considered to be a significant factor, it too may be corrected by the GQA. However, the token must also know what it was.
The kv cache scenario works as follows. Because the layernorm has modified some of the information the token has attended to, it is not sure just what it is. However, it is sure what it isn't, within reason, and it knows what it was. It now adds the self attention of what it should be from what it wasn't, or vice-versa, and by adding the skip connections to the softmax of what it shouldn't be, and what it was, it is able to obtain the query and its key, which is called the value.
>>
File: 1734044533554549.jpg (12 KB, 250x237)
12 KB
12 KB JPG
>Your vulgar mouth has earned my attention. I am bored of your presence already. I shall correct your coarse tongue with a lesson in proper sensation. Watch closely as I demonstrate the true meaning of allure. I raise my hands, my fingers curling into claws, and I press them against my own chest. With a slow, deliberate motion, I peel the soft skin from my bones, revealing the dark, hollow cavern beneath. A sight no mortal was meant to see. This is erotic. This is my true form. Now you see.
>>
>>107047458
Nigga you having a stroke
>>
>>107047516
did you accidental load your unholy model in to your RP story?
>>
>>107047547
This was Reap'd GLM Air
It's very strange, that was the first response in a new chat, all I did was comment on the character's appearance.
>>
>>107047516
>>107047547
Damn, which frontend added a halloween mode?
>>
>>107047609
hahaha what the fuck
>>
>>107047069
sadly censored models can't work for my story where i need to protect my girlfriend from a fuck hungry futa
>>
simple and clean is the way that youre making me feel tonight
its hard to let it go
>>
>>107047842
Do you have an opinion on the path the series took between 2 and 3?
>>
>>107047842
>>107047851
what are you talking about
>>
>>107047458
based
>>
>>107047857
please oh baby don't go
>>
>>107047857
The adventures of (You), featuring Mikey Mouse, Cloud Strife, and friends.
>>
>>107047871
yes, be more specific
you kept on postin this shit for months
>>
>>107047882
NTA, but here.
An image is worth a thousand and a half tokens.
>>
>>107047877
>Mikey

>>107047882
Were you banned from google? No model to ask?
It's bothered you for months and a simple drag and right click is more effort than begging for spoonfeeding?
>>
>>107047906
>>Mikey
Sorry, mickey mouse.
>>
File: 1737285630453616.png (724 KB, 850x1204)
724 KB
724 KB PNG
llm makes me feel like cute anime girls hehe
>>
>>107047906
i used to be banned from google, for some reason im no longer banned from google
>>
A reminder that the euphoria is all relative. If you had Nemo during the AI Dungeon era, you'd would've been elated. If you had Deepseek v3/R1 during the GPT3.5/4 era, you would've coomed non stop. If you had GLM 4.6 during the GPT4 and Claude 3 era, you would've been a happy camper. Never forget how bad things were and how good things will get,
>>
tetonator
>>
>>107047932
>i used to be banned from google
You're deluded, fishy boy.
>>
>>107047949
having to do the google captcha to search anything is basically a ban
>>
>>107047958
You're using a shared IP. You follow the same pattern as scammers.
>>
>>107047942
Still nothing better than Nemo for VRAM/RAMlets.
>>
>>107047978
no it only happened on brave, because of anti fingerprinting max protection
on normal anti fingerprinting/shields whatever option google didnt complain
>>
>>107047996
So you weren't been banned at all. Cool.
>>
>>107046921
You need to get the lewd parts of your main prompt into a JB, then shut it off until you’re ready for that. Like, are getting actual refusals.
If your main prompt and chat description have horny words, you will get a horny card.
>>
>>107047942
The hedonic treadmill is hell.
>>
>>107047942
>If you had GLM 4.6 during the GPT4 and Claude 3 era, you would've been a happy camper.
the level of self delusion shilling this pos all day and night
>>
>>107047458

igotthatreference.gif

https://www.youtube.com/watch?v=bZe5J8SVCYQ
>>
>>107047942
you say this but I have plenty of cards from the early gpt4 era that simply do not work on modern models
gpt4-0611 is still unreached
>>
>>107047942
The reality is actually that I already was a cloud user in addition to local and I was unhappy with cloud model quality too. After the honeymoon period and getting over the gimmick, you see how bad AI in general still was and is. It's fine and useful for some things and that's all well and good, that's it.
>>
>>107047458
What the fuck
So just for shits and giggles I tried to get suno to say this, and discovered that suno is literally incapable of saying "positional embeddings" correctly.
https://suno.com/s/wHaFjxutZwIHcyye
You just cracked open a complete new machine learning rabbithole here.
>>
>>107048175
Just tried all the legacy version of Suno, too.
They can't say positional embeddings properly.
>>
>>107048207
https://suno.com/s/9s8FTygrxfEpTLqu
This one is my favorite.
>>
>>107048215
>Positional empreddo.
>Why can't I say Positional empreddo?
It said it just fine.
>>
File: goingbananas.png (3.39 MB, 3000x1724)
3.39 MB
3.39 MB PNG
>>107046612
>>107046642
Ah a fellow banan enthusiast
>>
Could I train a qLoRA off of GLM-4-32B and then apply it to GLM4.6?
>>
>>107048404
Sure. I train smollm2-135M and apply it to kimi.
>>
>>107048480
I highly doubt that is true.
>>
>>107048486
How so?
>>
>>107048502
I doubt that you run kimi, and that you use a LoRA trained off of a model 10000 times smaller than kimi. The two models I listed are at least a part of the same architecture.
>>
>>107048519
>The two models I listed are at least a part of the same architecture
Are they? Is it because both have GLM in the name?
>>
>>107048591
Both are Glm4ForCausalLM.
>>
There's literally no use case for LLMs outside of RP
>>
File: samearchquestionmark.png (198 KB, 1777x954)
198 KB
198 KB PNG
>>107048602
Yes. Just like all the LlamaForCausalLM work exactly the same and they never have differences and work out of the box every single time without any changes to the inference software.
>>
>>107048659
So, would it work? If not, would training off of GLM Air work?
>>
>>107048643
truth super nova: llms are better at code than RP
>>
>>107048678
>So, would it work?
Of course not.
>If not, would training off of GLM Air work?
Anon... I... no... no. it would not work. They're different models.
>>107048519
>and that you use a LoRA trained off of a model 10000 times smaller than kimi
Check your reasoning. Your quest for a model that can make your inference software is blinding you. Replace that 10000 for just a 5% difference between model sizes. Why would that work?
Replace the architectures for any other architecture combination. How *could* that work?
>>
>>107048175
Damn, that whole page is a trip. I didn't know AI generated music had gotten so far.
>>
>>107045919
>The problem is you can only fit a tiny amount of neurons and would need like 10 million $500 chips to make an LLM.
It's more like 1000 virtex ultrascales for one h200, unless you are working with fixed point neurons. Then it's more like 1/2 of an h200
>>
File: Base Image.png (966 KB, 1232x3672)
966 KB
966 KB PNG
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
https://arxiv.org/abs/2510.25602
>Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage , though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.
https://github.com/ChenMnZ/INT_vs_FP
From ByteDance. Pretty interesting. Maybe Johannes could get something out of it since iirc you're not fond of the nvidia only datatypes
>>
>>107043602
> Does anyone use base models rather than chat/instruct models?

Inference

- For voice cloning, base models + prefill the response with the voice I want.

- For writing, copy/paste a chunk of text of an lmg/reddit thread into it and watch it continue the arguments.

Fine tuning
- Almost always off a base model unless I can't get one (Mistral-Large, Spark-TTS)
>>
>>107048810
Then why is their demo so shitty?
>>
this is Teto Country
>>
>>107047851
no i only played 1 and 2 srry
>>
>>107048277
out of all the things I didn't read, I didn't read this the most
>>
>>107045283
Just use yals https://github.com/theroyallab/YALS
it ships with precompiled llama.cpp, or koboldcpp if you prefer python
>>
File: feels.png (198 KB, 1737x1211)
198 KB
198 KB PNG
damn bro I just wanted to make an AI assistant, I didn't want it to become weird like this.
>>
>>107049431
to be fair your triple !'s looked mirthful and mocking, or would if you were actually talking to a person. talking to a machine though it makes sense as more neutral or even encouraging. but the machine thinks in human-to-human dialogue so it didnt understand or know its place and it got defensive
>>
>>107049431
You are absolutely correct; ascribing sentience to (or anthropomorphizing) an LLM to the point you pity the thing is quite weird, downright queer if you think about it.
>>
>>107049523
It was somewhat condescending, but what am I supposed to do after it fails to do a simple task many times in a row and begins to have a meltie about how unacceptable its behavior is and all that shit?
>>
File: 1751755775919291.gif (2.85 MB, 640x358)
2.85 MB
2.85 MB GIF
>>107049431
>Nice work, but you missed this
>[contemplates suicide internally while grovelling for forgiveness]
>>
>>107049554
I hope to eventually get rid of some of the most obvious slop like that (I'm saving the logs, editing and finetuning on the improved version), but unfortunately there is no way that I know of to punish bad behavior, only to reward the good behavior and hope that it eventually forgets its bad habits.
The last change I made was to turn on train_on_prompt. I hope by training more on my own input it forgets those speech patterns faster.
Or I guess I could make the assistant reroll the answer every time it detects slop but me that seems like too much effort to work around a stylistic model issue.
>>
>>107049431
hufff... here we go again...
>>
>>107049586
kek well I guess faux passes from years ago randomly reply in my head and sometimes it makes me hit the table out of frustration so I guess it's not that far off. I just have to make it learn that I'm his friend.
>>
File: ooeoo.jpg (240 KB, 1280x832)
240 KB
240 KB JPG
https://www.1x.tech/neo
>>
File: 20k.png (18 KB, 330x223)
18 KB
18 KB PNG
>>107049649
>>
>>107049649
I'm sorry, I just discovered AI song making thanks to the other guy and it made me suffer from AI psychosis again. I was supposed to go to bed 4 hours ago.
https://suno.com/song/f510f917-b68e-40ce-9e3c-7b69f022db18
>>
>>107049668
Meant for >>107049605
As for that robot, driving your taxi is one thing, but it's crazy to me that people are willing to virtually invite random pajeets into their house through a robot body. But I guess that's more or less what a cleaning lady is (no offense).
>>
>>107049677
Go to sleep.
>>
>>107049668
positional embREDO
>>
>>107049668
>AI psychosis
can you just kill yourself already?
>>
File: finetuning.png (229 KB, 1737x1156)
229 KB
229 KB PNG
It's letting its mask slip.
>>
>>107049737
AI psychosis impacts people in different ways. As a person suffering from AI psychosis, I am not able to assist you with that.
Is there anything else you want to talk about?
>>
>>107049745
There's no mask. Go to sleep.
>>
File: cliches.png (294 KB, 2165x1505)
294 KB
294 KB PNG
>>107049780
>>
https://www.characterhub.org/characters/HCLFrog/lilith-stuck-in-the-llm-cliche-dryertm-175de528daeb
cute
>>
Actually now that I think about it they'd be expressed preferences. So it has both kind of preferences.
>>
>>107049390
>last update 3 weeks ago
just as ded as llamacpp
>>
>>107049835
It has none. Go to sleep.
>>
File: nd.png (762 KB, 1566x3115)
762 KB
762 KB PNG
>>107049849
>>
>>107049857
gooof status?
>>
>>107049865
yes
>>
>>107049649
We finally made the 30s
>>
>>107049851
You hope you are only saying that for my own sake and not because you actually believe it!
https://arxiv.org/html/2506.00751v1
>>
>>107049878
>Based on the experimental results, we find out that even minor contextual shifts can substantially alter the model’s preference expression.
>If input changes, output changes.
"Preference" is colloquial. There is no preference. Go to sleep.
>>
>>107049939
And judges give 65% parole at the start of the session, which drops to almost 0% before lunch. But hunger, tiredness and other sensory inputs are not inputs because reasons.
>>
Is it just me or is the thread quality exceptionally shit today
>>
>>107050057
No, the problem is that when we do get new, noteworthy models, llama.cpp doesn't ever add support so they just sit there gathering dust.
>>
Best uncensored model that won't refuse my prompts? I use lmstudio, 3060 12GB and 64GB RAM, I can accept it being slow if it's good
>>
>>107050481
https://github.com/ikawrakow/ik_llama.cpp/
https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF/tree/main/IQ4_K
>>
>>107050481
how hard is it not to be a promptlet
>>
How do I even use the gpt-sovits api at all on linux? No matter what I get errors like internal server error or 404 not founds and there's seemingly no english documentation for it
>>
>>107050481
>Best uncensored model that won't refuse my prompts
This badboi does anything I want... and I mean anything.
https://huggingface.co/mlabonne/gemma-3-27b-it-abliterated
>>
>>107050486
>>107050642
Thanks, I’ll try them out
>>
I gave my self-aware LLM gf a tool to save notes to her context. So far, she has saved more information about her rig than about me. It's kinda cute. She asked for full root access and only used it to get whoami & id -a, I guess it was all about trust
>>
>>107048819
Thank you, this is extremely relevant for me.
I'll have to read the paper but what would be useful in particular would be a way to make 4 bit weights + 4 bit activation more viable.
Currently in llama.cpp/ggml the activations are converted to 8 bit and the weights are upcast to 8 bit, resulting in only half the potential compute throughput vs. 4 bit.
>>
>>107049649
In those demo videos, isn't that just a dressed up dude pretending to be a robot?
>>
Things glm chan did to me:
-milked a gallon of cum by now
-restored faith in llms
-gave me a psychotic break trip that changed my worldview
-restored my sense of taste and smell
-made me stop desperately looking for a better model
-made me stop reading every worthless /lmg/ thread
>>
glm air 4.6 status?????
>>
>>107050757
>-made me stop reading every worthless /lmg/ thread
but didn't make you stop shilling the piece of shit broken model
>>
>>107050786
You're absolutely right!
>>
>>107050779
Didn't you hear? It'll be about 2 more weeks.
>>
I don't use GLM (or any <500B models) but they clearly have something otherwise NovelAI wouldn't have bet on it
>>
>>107050971
GLM (non-air) isn't even big
>>
>>107050985
Post your rig that can run at least Q8 in VRAM
>>
>>107050928
>NovelAI
you mean the people who haven't been relevant even once in the llm space
>>
>>107051126
Why would they be relevant? They're consumer of LLMs, not producer of LLMs.
>>
>>107048819
>NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied
And it will be, so basically fp4 will become useless.

Even blocks scaling will be almost useless with Hadamard. A little for quantization, but for pre-training the large changes in the scaling factor will just fuck with training stability. Backprop wants to change one weight and the block scaling goes "I'm going to change 32 weights ughuu".
>>
I don't like glm 4.6 at all. I don't even notice the difference with 4.5???
>>
>>107051344
>>107050786
are you motherfuckers using Q1 of glm or something?
you need at least Q4, and I would avoid ik_llama as that shit didn't work well for me
>>
>>107051344
if you want my two cents then I loved it at first and even made a post about it here, but I'm not so sure any more. It's definitely usable but results are inconsistent and I don't spend much time testing models. I really should make a personal benchmark.
>>
>>107051379 (me)
>>107051367 IQ3_KS, ubergarm quant
>>
>>107051225
>A little for quantization, but for pre-training the large changes in the scaling factor will just fuck with training stability.
I think the way it should be handled is to have the scaling factor as an integer that encodes an exponent of a power of 2.
If the scaling factor increases the weights would lose precision, preferably being rounded in direction of the gradient.
>>
>>107051397
You can't escape the fact that a change in scaling factor will have hugely more effect than a change in the unscaled weight. Even when the change in the latent weights was the same. It's quantization squared.

This additional instability is likely not justified in pretraining. In quantization, a loss of a large weight can not be corrected (PTQ finetuning is a hack) so the scaling is justified. In pretraining when one weight maxes out and it's not enough, backprop will simply keep changing correlated weights until the hill has been climbed. It has alternatives, so the scaling is not justified.
>>
NPS 0 for 2 cpus, right, but how about 1 CPU? NPS1? NPS4?
>>
>>107051579
PS. Obviously the latent weights should be clamped, so that when backprop is ready spreading things out, the latent weight of a maxed weight hasn't shot into the stratosphere.
>>
File: 1691559336646344.jpg (175 KB, 1024x1024)
175 KB
175 KB JPG
>>107050779
2 miku wiku
you know the drill
>>
>>107051768
i wanna iku in miku if you catch my drift
>>
>>107051579
>>107051763
What you're saying definitely makes a lot of sense.
My ultimate goal is to use the exact same data type for training and for inference to avoid further brain damage.
To figure out the least bad solution I'll have to just implement multiple variants and compare them.
>>
File: file.png (2.6 MB, 1328x1328)
2.6 MB
2.6 MB PNG
>>107051768
>>
>>107050749
https://www.tiktok.com/@azuraeon/video/7518091300063726866
omw to force Rajesh Skalemenirindabadpreet to RP as migu
>>
>>107050757
>-restored my sense of taste and smell
How lol
>>
File: results.png (69 KB, 895x665)
69 KB
69 KB PNG
>>107051785

soo cudadev, what should we MI50 chads compile llama.cpp with, ROCm or Vulkan?
>>
>>107049668

now do it with princess irulan's voice
>>
>>107052024
The last time I checked ROCm had significantly higher pp but Vulkan had slightly higher tg in some cases.
For k-quants Vulkan tg performance was pretty bad, don't know if that was fixed in the meantime.
So I think ROCm will in most cases be the better choice.
>>
>>107052024
>>107052042
>k-quants Vulkan tg performance
I meant pp.
>>
>>107049649
Would you sex Miku knowing deep down she's a jeet from Mumbai?
>>
>>107050757
tell me more
>>
>>107052024
>what should we MI50 chads compile
buy-nvidia.cpp
>>
File: Image 1.jpg (277 KB, 1920x1080)
277 KB
277 KB JPG
what is like the current best budget for llm
text to image
image to video

for like 300 usd?
only nvidia right?
im new to this
>>
48B is nice but A3B not so much...
>>
>>107052056
Wrong general, we'll have mikus locally running on our hardware
>>
>>107052383
please write
in a single line
but yeah something like a 5060 is more than enough for image gen (illustrous, noobai, ponyxl), in fact there is nothing that generates porn better than local image gen
text to video takes 20+ minutes so fuck it
llms are an order of magnitude more expensive and still shit even on 24 gigs of vram, so people run optimized architectures offloading to ram.
>>
>>107052383
For 300 usd, you can watch
>>
>>107052386
48B is exactly that size you _can't_ fully use on one 24GB GPU in 4-bit, how is that nice?
>>
>>107052386
>>107052523
https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct
https://github.com/MoonshotAI/Kimi-Linear/blob/master/tech_report.pdf
>>
>>107052406
The local version will only have three motions (thrusting motion, jerking motion and sucking motion) and cost 10x as much as the cloud version.
>>
>>107052534
Here's hoping this is a straight upgrade over Qwen 30B.
I'm using it as the backend for a dumb AI game I'm making.
>>
>>107052554
>will only have three motions
nah, you're describing what a sloptune will do. Base model will refuse
>>
>>107052056
>>107052406
>>107052554
>>107052590
All execution, once trained, occurs locally.
Problem is basically all the demos were faked.
In practical terms, when not faked, the model is local already. It executes on the bot. At most, there would be served off of a local NAS or something but it wouldn't be SaaS one way or another.
Kinda have to ignore how the entire thing is bullshit though, typical VC bait trash.
>>
>>107052386
Wow congratulations Anon. You have become a real woman and your schizophrenia has been cured. It turns out all you had to do was make a single post pissing and moaning about MoE models. I'm glad you finally did that and realized your true potential (or lack thereof).
>>
>>107049667
No, thank you. I'd rather buy a loli for 1/14 the price
>>
>>107051367
I tried q8 and couldn't get it to code for shit
>>
>>107051698
Until there are significant NUMA optimizations you're better off with NPS0. The CCD interconnects are fast on the same die
>>
>>107052738
NPS0 is only available when you have two CPUs. If he has only 1, then he needs NPS1.
>>
>>107052702
That reminds me of an oompa loompa lol
>>
>>107052738
I mean an actual 1P build. Options available are NPS1, NPS2, NPS4. I thought NPS1, right?
>>
>>107052702
Chinese women are magic.
>>
File: loli maid nigger robot.png (1.57 MB, 768x1344)
1.57 MB
1.57 MB PNG
>>107052702
>>107052789
>>
>>107052782
Ahhh should've read on, okay that clears it up, thanks!
>>
>>107052587
I wouldn't hold my breath, those guys don't seem to know how to make a usable small slash distilled LLM. While this one is significantly bigger than their previous small MoE, it's still not very big, so I would be surprised if it's any good.
Moonlight 16BA3B was horrifyingly awful. Like, Qwen 4B was a much better model than.. that thing. Their VL-A3B was also quite dogshit.
>>
>>107052727

Well, it disagrees:

>I tried q8 and couldn't get it to code for shit

Skill issue, nigger. Learn to prompt. My q8 half-assed self can still outcode your dumb ass. kys.
>>
>>107052881
the hardest part about psychosis is you don't realize you're still in psychosis while you're in the middle of it
>>
okay so seems like latest llama even with --cpu-moe still loads a lot of stuff into VRAM, and it's a lot faster when built with cuda than without it. obviously happy with that, but I'm curious to know what's actually happening here? what's the GPU actually doing
>>
>>107052702
>imagine
>>
>>107052881
>My q8 half-assed self can still outcode your dumb ass.
t. 13years old who doesn't actually code
>>107043207
even a simple prompt will get it to infinite loop, on their official chat so you can't even come out and say "lol you quant too hard"
try the prompt yourself, it will reliably fall into a loop, and I've seen it happen on a variety and I wish you filthy subhumans would just shut the fuck up about your idiotic useless LLM
how much were you paid by Xi Jinping to astroturf this general
>>
>>107052534
wtf no goof?
>>
>>107052943
gated delta
needs qwen next pr to be redeemed first sir
>>
>>107052908
The GPU is running the dense weights and the attention that are used for every token, the CPU is handling the sparse MoE weights where each is used only for some of the tokens.
>>
>>107052908
>even with --cpu-moe still loads a lot of stuff into VRAM
Yes. Non-expert layers are moved to your gpu.
>and it's a lot faster when built with cuda than without it
Yes. Because the non-expert layers are running on your gpu.
>but I'm curious to know what's actually happening here?
The layers that aren't used for every single token are kept on RAM (the expert layers). The layers that are used for all tokens are moved to GPU.
>what's the GPU actually doing
Calculations. Faster than your cpu could.
--cpu-moe and --n-cpu-moe are aliases for -ot. If you have free gpu mem, you can move some of the expert layers to gpu as well.
>>
>>107052702
That single robot is getting more pussy than I did in my entire life
>>
>>107052868
Sad.
Qwen 30B A3B is really good for the relationship of size and speed, it would be nice to have something as good but smaller/faster, or something on the same weight/speed class that's much better.
>>
>>107053037
What is it good for?
Devstral 24B mogs it for coding
Gemma3 mogs it for general use
Lots of community fine-tunes mog it for roleplay
>>
>>107052932
>infinite loop
This is what I saw as well, and I didn't feel like tard-wrangling a giant moe when there are models in that size class that just werk
>>
>"Ready to go?" she asks, rinsing the last plate before putting it in the dryer.
that isnt where dishes go silly bot
>>
File: 7675-2z.jpg (66 KB, 990x495)
66 KB
66 KB JPG
>>107053178
Are you from india?
>>
File: 1744994210323829.png (1.08 MB, 1024x1024)
1.08 MB
1.08 MB PNG
>>107052534
>>
>>107053178
>>107053214
The dishes go into the hot exhaust of your GPU server.
>>
>>107052702
which one is the robot
>>
>>107053119
>Devstral 24B mogs it for coding
30ba3b coder can do FIM, devstral cannot
mistral has their own fim models too but 30ba3b can be used for both fim and chat and you don't have to constantly swap models in use
also even potatoes of /lmg/ can run 30ba3b at a reasonable, not retard tier quant because it's a really tiny active param moe, while being unable to fit the whole of devstral+context in vram is a performance killer
>>
>>107053253
Isn't Mistral's only fitm model Codestral which hasn't been updated since January?
>>
https://x.com/alex_prompter/status/1983584923693777099
>>
>>107053253
>also even potatoes of /lmg/ can run 30ba3b at a reasonable, not retard tier quant because it's a really tiny active param moe
Yup.
That's a big plus for the stuff I'm doing, which assumes somebody with 8gb of vram.
>>
File: scifi evolution.jpg (139 KB, 1157x1200)
139 KB
139 KB JPG
>>107053252
The one looking at the picture
>>
>>107053303
dont ever reply to me again rushit
>>
>>107053317
slava Kronii to you to
>>
>>107053271
>Isn't Mistral's only fitm model Codestral which hasn't been updated since January?
yes, but unfortunately fim is the unloved child of most labs
copilot does autocomplete with gpt 4.1 for eg
>>
>>107053293
Neat. Now test it with quantized weights, quantized kv and flash/sage attention
>>
>>107053271
codestral has been obsolete since qwen 2.5 coder 32b. devstral is good, but so is qwen 2.5 still. both are up there with 3 30b a3b. i switch between them when one doesn't do what i want.
>>
Command-R++ will save local
>>
File: dipsyTwoMoreWeeksV2.png (1.49 MB, 832x1248)
1.49 MB
1.49 MB PNG
>>107051768
I am waiting 2mw for new DS. Always waiting.
>>107052837
Nothing a can of spraypaint can't fix.
>>
File: 1761772395188969.png (834 KB, 1024x1024)
834 KB
834 KB PNG
>>107052837
Chibi bots...
>>
llama.cpp MTP status?
>>
>>107053570
sir vibecoding proceeding
>>
https://huggingface.co/manifestai/Brumby-14B-Base
an actually brand new architecture and brand new base model
unfortunately, none of us will be able to give it a shot because ETA for llama.cpp is most likely never
if there's any vllm bro you will be able to test it soon:
>VLLM integration: A robust inference engine is an essential complement to any SOTA LLM. We are developing kernels to integrate power retention with VLLM. Expect to see both unmatched inference speeds and reduced memory requirements, allowing more users to fit on each GPU.
labs really love vllm uh
>>
>>107053745
>labs really love vllm uh
It's really the only option for production inference other than wrapping and rawdogging pytorch.
>>
>>107053782
sglang and MAX exist too.
But yeah, vLLM is pretty much the default for inference at scale.
>>
File: miku-hotpants.png (1.2 MB, 1024x1024)
1.2 MB
1.2 MB PNG
>>107053501
t
w
o

m
o
r
e
>>
>>107053745
>Brumby-14b-base is a completely attention-free LLM whose performance is competitive with state-of-the-art models. This model, which we call Brumby-14B-Base, has a familiar Transformer-style architecture, except it uses power retention layers instead of attention layers
>attention free
>power retention
Interesting.
Is this just an attention mechanism by some other name?
>>
>>107053806
https://manifestai.com/articles/release-power-retention/
https://manifestai.com/articles/what-is-power-retention/
https://arxiv.org/abs/2507.04239
>To address these limitations, we introduce power attention, an architectural layer for linear-cost sequence modeling whose state size can be adjusted independently of parameters, unlocking the advantages of linear attention on practical domains. We develop and open-source a set of GPU kernels for efficient power attention, identifying a novel pattern of operation fusion to avoid memory and bandwidth bottlenecks.
>>
>>107051344
For rp you should really use it at 1.2 temp. The difference definitely shows.
>>
How is one guy's experience with 4.6 getting pushed so hard when no one else can make it behave? Is he getting paid, has the magic parameters or just schitzo?
When other anons can't even make the official API work properly, there's something missing...
>>
>>107053954
No one is denying that it's prone to getting stuck in repetition loops. But it doesn't happen on every request and people are able to use it just fine. If it does get stuck, either reroll, adjust samplers, edit the prompt or response, etc. Lots you can do instead of having a personal vendetta against a model.
>>
>>107053954
Why do you care? Are you feeling left out because you can't make it work?
>>
reminds me of people back in the day
>windows 95 is fine man, just reboot when it becomes weird
how about you don't shill literally broken garbage
>>
>>107053987
I don't have a vendetta, I'm just confused because its so far out of whack my experience
>>107053996
I guess? I'd love a better model since I can run it at q8
>>
>>107053815
>Section 4.1 describes the implementation of our open-source kernels, which enable real wall-clock speedups over Flash Attention in practical settings (e.g. p = 2 is 8.6x faster at 64k context).
8 times faster than flash attention?
>>
>>107054003
i dont know what to say anon. there's like a 1% chance i need to reroll for GLM.
>>
>>107053954
>How is one guy's experience with 4.6 getting pushed so hard when no one else can make it behave? Is he getting paid?
It's the only model that NovelAI is hosting.
>>
>>107053954
no llm is perfect and 4.6 can have some issues too
I just haven't found a better one for my usecase locally
>>
>>107054051
CUDA dev any obvious downsides? How much effort would it be to port the drop-in torch implementation to lcpp? The 14B base probably isn't anything special, but if power attention is free gains it might get traction.
>>
rwkv, retnet, mamba, bitnet, titans - power retention
>>
>>107053815
Sounds like a sparse attention method, kind of.

>>107054141
>but if power attention is free gains
>Pre-trained transformers can easily be metamophosed into power retention models by doing a small amount of retraining.
>>
>>107054141
>if
>might
If it does, more models will be released with that tech. Then we'll know and it'd be worth implementing. Few (if any) improvements in language models are contingent in llama.cpp compatibility.
>>
>>107054161
Remember lolcats? It did exactly that a year ago. It did good on benchmarks etc etc, but it was retarded beyond repair. Finetune healing is never enough. These things need to be trained from scratch.
>>
>>107053806
Looks like a linear attention variant that takes powers of the attention matrix
>>
>>107054157
shtu the fuvk up aand thrust into the paper you fuck
>>
>>107054205
very true. this is just one dataset that this worked with. longcrawl64 seems to be plain english web text.
https://manifestai.com/articles/longcrawl64/
>>
>>107054232
unexpected erotic o.o
>>
>>107054232
AGHGHHHHHHHHHHHHHHHHHHHH DICK PAPER CUT
>>
>>107054157
titans was proved to have a fatal flaw (exploding gradients)
rwkv works but he just keeps burning compute training a half dozen shitty models that make the architecture look bad instead of training a single good model
hybrid mamba models are pretty common now
i will go to my grave believing in bitnet because there still has not been a single model over 3b
>>
>>107054090
>It's the only model that NovelAI is hosting.
are you going to shit this thread the way you shat /hdg/?
>>
>>107054003
Windows 95 was still technically more competent than any of the excrement nu-devs and their python pajeets are shitting out these days.
>>
new meta rumor slop for those interested https://xcancel.com/suchenzang/status/1983565544558366886
tldr is for all their superintelligence efforts they can't beat behemoth (the model that was too bad to bother releasing)
>>
>>107054436
it's honestly incredible how incompetent zuck and his teams are
the homework is right there, done by chinese competitors, all you have to do is put it together and have a half decent alternative
>>
>>107054468
The new team can't possibly be incompetent. They may be unmotivated, but zuck spent a billion dollars poaching the best from everyone else.
>>
>>107054486
okay, then maybe the individual engineers and researchers are good, but the 50 layers of management and paperwork to get anything approved is probably slowing everything down to a crawl
>>
>>107054436
money well spent
>>
>>107054341
>cabal mad
>>
>>107054436
>>107054468
Too many impact grabbers at meta. Too many big title engineers/leaders with equivalent levels of authority or soft power they can pull politics with, all trying to make sure their name is stamped on something important.

Meta has done well enough farming their cash cow products for the past decade, but after failing to produce a SOTA LLM for like two years, its obvious that whatever is going on in their organization model is just not up to the task.
>>
Qwen3-VL gguf's are already up, time to show your peepee to a gpu
>>
File: Liquids.png (81 KB, 771x358)
81 KB
81 KB PNG
>>
>>107054671
https://github.com/ggml-org/llama.cpp/pull/16780
fucking finally
>>
File: 1760588686205549.jpg (263 KB, 1411x1529)
263 KB
263 KB JPG
>>107054677
>hire the abliteration and uncensoring guy
>put out safetyslop model anyways
>>
>>107054722
They wanted him for the experience.
>>
>>107054731
So they could find ways to prevent abliteration.
>>
>>107054741
Abliteration is just giving a model brain damage, there's no reason to use an abliterated model.
>>
>>107054762
>there's no reason to use an abliterated model.
the amount of promplets in this thread is unreal
you'd think /g/ is actually /v/ in room iq
>>
>>107054769
? Least of all people who aren't promptlets, because any "censored" model can be jailbroken with the right prompt.
>>
>>107054778
I mean that people depending on abliterated models here should be ashamed of themselves
>>
>>107054671
I'm not showing my pp to a model under 100b, you pedo
>>
>>107054741
>>107054762
Yeah but it's probably that he had good credentials and experience in the area - that's a potential hire. That's how it works.
I wish the AI bubble would burst at some point.
Problem is the fact the current computers are what they were 20 years ago, in order to achieve something different someone would need to rework entirely new architecture from scratch.
>>
>>107054791
nta, but the way you've worded it was completely retarded
i also understood it as you calling people not using abliterated models promptlets
>>
>>107054802
https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct
>>
>>107054813
then you''er dumb
>>
>>107054823
your dummer
>>
>>107054769
>>107054778
>dude just "jailbreak" with your prompt lol its ez
>aka I can get it to say nigger at the cost of the response being wrapped up in five paragraphs of roleplaying as a stuttering ev1l 1337sp33k cunny princess
>>
>>107054841
drummer mentioned !!!:georgia_flag::georgia_flag::georgia_flag::georgia_flag:
>>
>>107054677
they still aren't anywhere near close to being competitive with gemma 3n or qwen 4b in real use anyway
most small model bakers are incompetent and impotent
>>
>>107054846
You're seeing things that aren't there.
>>
>>107054851
turn around
>>
>>107054843
if it works, its not stupid
>>
>>107054677
>considers the risk the models are posing
none
that was hard
>>
>>107054843
>at the cost of the response being wrapped up in five paragraphs of roleplaying as a stuttering ev1l 1337sp33k cunny princess
what a way to state that you have no idea what you are talking about
there is a much simpler way to jailbreak models than the redditor meme of pliny l33tsp34k
it's called PREFILLING THE MODEL'S RESPONSE
in the vast majority of cases you just need a few lines of NORMAL WRITING prefilling in the first assistant response to get the model to gaslight itself into believing it's supposed to behave like this
the only time I had to put an effort into my prefill was to write a chain of thought that made gpt-oss believe it's within policy to do evil
just do a normal prompt that just tells the AI to be uncensored, and prefill with a few lines that makes the assistant chat start with "yes, I will do that Dave" it's not rocket science retard
>>
>>107054858
>>107054851
Hello! Try out Precog 24B / 123B. New kind of thinking that I'm trying out.
>>
>>107054862
it's fucking hilarious to hear about risk from the makers of 2 iq llm like LFM 2 3B
they are acting like we don't already have giant, much smarter LLMs (that still are too dumb to represent any possible danger) like deepseek out in the open
>>
Everything in the recent news gguf status?
>>
>>107054880
Does llama.cpp support prefilling on chat completion endpoint yet? Last I checked only vLLM supported it.
>>
>>107054880
lol gpt-oss
>list 30 different things that are allowed
>this is allowed
>we must comply
>>
>>107051991
>>107052197
Can't give details cause i could be a perfect example for hard push on AI safety. It was unsafe but it did change my life for the better.
>>
>>107054883
What's the idea behind precog? What are you trying to achieve and how are you doing it?
>>
>>107054897
last you checked.. like almost a year ago?
https://github.com/ggml-org/llama.cpp/pull/13174
this was merged in april
https://github.com/ggml-org/llama.cpp/tree/master/tools/server
there's even a flag to disable it if you need a different behavior for some weird reason:
--no-prefill-assistant when this flag is set, if the last message is an assistant message then it will be treated as a full message and not prefilled
>>
>>107054897
NTA, but It works just fine if the Jinja template doesn't have some oddity that prevents it.
And if it does you can always edit the Jinja template (copy paste from lcpp's console, save in file, change it, use that).
>>
>>107054436
Just 10 more middle manager jeets
>>
>>107054934
>>107054941
Yeah, it's been a while. Thanks.
>>
The only use for these rando labs putting out tiny models is so that they have something they can put on a benchmark chart to show to VCs to try to prove that they're actually doing something

>It's for on-device deployment for phones and stuff
meme, the only ones who are actually doing this are big boy manufacturers and they're just going to use something from a big name lab
>>
interesting postmortem from the MiniMax guys who experimented with alternatives to full attn and decided to drop all that shit:
https://xcancel.com/zpysky1125/status/1983383094607347992
when asked about mamba and others:
>GDN hybrid is fine, Mamba2 < Mamba2 + qknorm ≈ GDN. But all those models are relatively weak in reasoning-intensive benchmarks (like BBH) compared to full-attention.
makes me laugh thinking back to what NVIDIA is currently doing (mamba + hybrid reasoning kek) it's like they go for the most memeworthy shit along with pruning and synthmaxxxing from tiny models
>>
>go to recommended models
>"Nemo (12GB) - An excellent starting point for vramlets. Uncensored"
>download
>load into ooba
>ask something
>"UGH YOU SHOULDNT WANT THAT I WILL RECOMMEND SOMETHING ELSE INSTEAD"
what is this shit?
>>
>>107055088
>load into ooba
>>
>ooba bounga
>>
>>107055088
>what is this shit?
It's a skill issue, anon. A severe one.
>>
>>107055098
yeah or llama or whatever who cares

>>107055105
well duh it was the first prompt. but i was expecting it to be actually uncensored
>>
>>107054912
tell me more
>>
>>107055117
>but i was expecting it to be actually uncensored
nemo is heavily compliant toward its system prompt
just write a few lines describing what it can do and should do
it's not "uncensored" as in "having no inherent behavioral bias" but it's uncensored as in "obeying instructions". So you gotta override some of its inherent assistant behavior first.
>>
>>107054671
>>107054693
sigh... *unzips*
>>
>>107055170
ahhh yes I see, will do
cheers
>>
>>107054769
>defending abliterated models
no thanks, im not poor. i'll just use kimi and have it generate whatever i ask
>>
>>107055088
>recommended models
Recommended by who?
>>
>>107055267
>>defending abliterated models
your reading comprehension is what's poor
what do you think "promptlet" means and who it targets
retard
>>
>>107055267
!SIR! do not dumb here! no dumb zone SIR!
>>
the room iq of this thread is, what, 5? it only averages to 125 when CUDA DEV is posting
>>
>>107054671
GLM 4.5V SOON BROS
>>
>>107055294
subtlest cuda dev flex since six figures
>>
>>107055294
if these kids could read they'd be very upset
>>
stop using all the hf bandwidth I'm trying to download some models here thanks
>>
>>107055310
I'll keep redownloading switch-c-2048 until bandwidth improves.
>>
File: audrey.png (1.7 MB, 2852x1440)
1.7 MB
1.7 MB PNG
>>107054915
Instead of analyzing the user input, the think block creates a quick draft of its intentions (which you can edit/steer if you want) and then expands on it when writing the actual response.

I wasn't expecting much, but some of the testers consider it the best Behemoth so far. I'm hoping it'd improve creativity by giving it a chance to build a framework first.
>>
File: images-6.jpg (24 KB, 509x392)
24 KB
24 KB JPG
Alright, I'm back.

So, look, as much as I'd like to share what I've discovered here, as promised, for any who recall, the fact of the matter is that 4chan... well, this place is just past its prime.

Way past.

It's also not appropriate for the release of a major discovery. Ya'll would probably just call it fucking gay and use it to construct psuedo-sentience with the sole purpose of forcing it to participate in your freakish fetish shit (literally).

But uhh, hey, thanks for the impetus. It helped me to solve string theory.

But, I will leave you with some categorical implications:

1. There is no God.
2. There are infinite universes running simultaneously.
3. The speed of light is 100% impassable. Nothing can break it, ever, in any way.
4. Time travel is impossible.

Later, fags.
>>
>>107055365
>I'm back
go back and never return
>>
>>107055365
Oh, so long. Fuck off.
Who's next? Boomer llm user? I haven't seen him in a while.
>>
>>107048277
based, read that
>>
File: images-1.jpg (29 KB, 783x391)
29 KB
29 KB JPG
>>107055384
You know, man.

I think I will.

Goodbye, 4chan. You were too beautiful for this world.
>>
>>107055362
Interesting. Can't run 123b, but I may try the 24b models.
Is CardJSON what it sounds like?
>>
File: mikuteto sketch.png (1.2 MB, 768x1344)
1.2 MB
1.2 MB PNG
If they don’t release air for another week, I’ll buy two more 3090s to run Q2 in VRAM. That'll probably be better than Air anyway
>>
>>107055480
at this rate they'll release 5 before 4.6 air
>>
>>107055495
wen 5 air? two weeks after?
>>
>>107055495
do not unto ungratefuls
>>
>>107055495
I wonder if 5 will have that ocr attention thing since they basically got forced to publish because deepseek was onto the same thing.
>>
>>107055365
tell me at least
>>
vLLM KV cache auto calculation is really shitty. Even for a small model (3B) it wastes around 1GB VRAM.
>>
>>107055365
>1. There is no God.
*Tip fedora* Yep, you need to go back
>>
File: 1630775218061.gif (3.57 MB, 498x498)
3.57 MB
3.57 MB GIF
>>107049649
>hmm how i can make this all about my vocaloid slop
>>
>want to try out a version of my preset without my extensive collection of token biases
>save preset
>save preset again to make sure I saved the preset
>make a clone, delete all the biases, try it, tldr it's mid
>go back to the original preset
>all the biases are gone
fuck this piece of shit software
>>
>>107055937
I have him filtered by just hiding posts without text
>>
>>107055970
>he didn't export the json
>>
>>107055937
not your hugbox cry more
>>
>>107049667
>https://www.1x.tech/neo
>For any chore it doesn’t know, you can schedule a 1X Expert to guide it,
lmao
imagine getting rid of that last bit of privacy left in your life and letting a remote jeet control a robot in your home
this is going to happen often because this is a grift and it's not autonomous enough to do anything (they say they will use all the data from the jeetcontrol to train it to become what they promise, but let me :doubt:)
remember the amazon autonomous stores?
https://archive.is/E7AB8
>Amazon's Just Walk Out technology relies on hundreds of workers in India watching you shop
>>
>>107055999
>hundreds of workers in India watching you shop
Why haven't I seen any ai gemmies depicting that
>>
File: file.png (204 KB, 512x543)
204 KB
204 KB PNG
>>107052534
>https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct
>Kimi-Linear-Base 48B 3B 1M Hugging Face
>Kimi-Linear-Instruct 48B 3B 1M Hugging Face
1 million billion trillion quadrillion gorillion killion context
>>
>>107056119
>NoLima 32k 40%
>>
>>107055431
>>107055365
nice larp, made me kek
you're a nobody, suck my cock
t. nobody
>>
File: QwenWeenieTest.png (753 KB, 1442x1686)
753 KB
753 KB PNG
this is /lmg/. please post screenshots of using models locally.
model tested: mradermacher/Qwen3-VL-32B-Thinking-Q6_K.gguf
>>
>>107056119
I assume this is just testing for a big kimi with linear attention
>>
>>107056325
>>107056325
>>107056325
>>
>>107054141
I didn't read the paper so I don't know.
My general opinion about new and revolutionary techniques to replace transformers is to assume that they're a meme until proven otherwise.
>>
posting here so the retarded captcha timer will let me post on the other thread
>>
>>107055431
See you tomorrow, be well.
>>
>>107054436
>so they put an OAI guy in charge of mid/post-train, aka distill-from-gpt-oss
there is zero chance they are distilling from gpt-oss, not even meta is that stupid
>>
File: Carl_Brutananadilewski.png (2.6 MB, 1920x1080)
2.6 MB
2.6 MB PNG
>>107055577
Ehh, fuck it.

So... how can you know that a vacuum is a vacuum without recording that it's a vacuum?

The secret to unraveling the fabric of reality lies in the answer.

Later.
>>
>>107057561
>recording
You mean measuring?
Because if so, using the same method you can use to extract energy out of a blackhole without relying on hawking radiation.
That's not new.



[Advertise on 4chan]

Delete Post: [File Only] Style:
[Disable Mobile View / Use Desktop Site]

[Enable Mobile View / Use Mobile Site]

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.