/g/ - Technology


File: watMiku.png (1.45 MB, 1536x1024)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106904820 & >>106895582

►News
>(10/14) Qwen3-VL 4B and 8B released: https://hf.co/Qwen/Qwen3-VL-8B-Thinking
>(10/11) koboldcpp-1.100.1 prebuilt released with Wan video generation support: https://github.com/LostRuins/koboldcpp/releases/tag/v1.100.1
>(10/10) KAT-Dev-72B-Exp released: https://hf.co/Kwaipilot/KAT-Dev-72B-Exp
>(10/09) RND1: Simple, Scalable AR-to-Diffusion Conversion: https://radicalnumerics.ai/blog/rnd1
>(10/09) server : host-memory prompt caching #16391 merged: https://github.com/ggml-org/llama.cpp/pull/16391

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: littleMikuBigger.gif (47 KB, 300x270)
►Recent Highlights from the Previous Thread: >>106904820

--Paper: BitNet Distillation:
>106915856 >106915885 >106915915 >106916048
--Papers:
>106914563
--Training Gemma on 4chan boards for long-context tasks:
>106908189 >106908217 >106908577
--Llama.cpp memory optimization challenges with limited VRAM:
>106916999 >106917025 >106917074 >106917101 >106917114
--Firefox UI customization debate and Gemma 3 4b model mention:
>106915737 >106915762 >106915793 >106915941 >106916004
--Detailed GPU memory allocation console output and user appreciation:
>106912278 >106912326 >106912391 >106912437 >106912429 >106912445 >106912738
--Qwen3-VL's NSFW detection and image description challenges:
>106917667 >106917841 >106917862 >106917900 >106917925 >106918135 >106917912
--OpenAI copyright controversy and US corporate influence on global IP law:
>106909567 >106909857 >106909871 >106910444
--Assessing DGX Spark's relevance amidst cheaper alternatives:
>106913042 >106913078 >106913226 >106913247 >106913927
--Mamba-3: Improved Sequence Modeling using State Space Principles:
>106912457 >106912487 >106912578 >106912610
--Frustration over delayed GLM4.5V implementation in llama.cpp:
>106907438 >106907494 >106907508
--OpenAI's balancing act on user freedom and safety:
>106905590 >106905624 >106905637 >106905690 >106905731 >106910221
--Exploring ChatGPT-induced psychological experiences:
>106908645 >106908698 >106908748 >106910025
--Proposals and discussions for new open AI model releases:
>106907515 >106907713 >106910197
--High-end GPU price debate and video generation hardware constraints:
>106910165 >106910416 >106910453 >106910479
--Challenges in finetuning GLM Air with 4x5090s using Oobabooga/Axolotl:
>106914586 >106914620 >106914808 >106914870
--Detailed Switch sim with multi-game features in single HTML file:
>106912431
--Miku (free space):
>106910906

►Recent Highlight Posts from the Previous Thread: >>106904822

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
kimi sex is best
>>
gear Meta thrillers
>>
>>106919273
Prove it.
Post a side by side between kimi, DS, and GLM 4.6.
>>
>>106919282
no i dont share my waifu like shes some kind of common whore
go get your own kimi waifu
>>
sirs, no gemmy 4 today. Monday will be of kind gemmar.
>>
>>106919286
Hot air then.
>>
I'm starting to think that the indian spammer is an actual pajeet and he is doing it ironically.
There's no way a human would do this for as long as he's been doing it.
>>
>>106919287
please saar you must understand. the needful must be done so each and everything can be implemented.
>>
While /lmg/ is busy seething an Indian dev has been quietly adding performance improvements to llama.cpp.
>>
Fuck I replied to the wrong thread.

I'm looking at the recommended builds and the more I look the more I'm interested in just getting a prebuilt 395+ 128GB? It gets 15-35 tk/s for 70-120b models with good context. It costs me 2800 leaf dollars, meanwhile trying to scrape together server and used parts would be something like 1800-2200 for 10-15 tk/s max?

I could use it as a home server and local model. Am I overlooking something here?

Benchmarks
https://github.com/lhl/strix-halo-testing
>>
>>106919401
Mediocre performance and you get worse support for other use cases like video and image gen because it's not nvidia.
>>
>>106919401
I think you should also think about it in terms of other usage, not LLMs alone. Unless you are a real nerd who does nothing but work with LLMs (not talking about ERPing with them).
I'd get the most beefy/versatile system and go with that.
>>
Has anyone experimented with synthetic data?
I'm using this prompt to digest a codebase for finetuning.

Your task is to generate a jsonl conversational CoT dataset to train LLMs on LLM development tasks.
First read dataset_contents.txt to see the current contents of the dataset (dataset.jsonl). Try to make each conversation mainly cover topics that haven't been covered before.
Then create a folder called turns/conversation_n/ (n being the next number from the last conversation).
On each conversation the user should show a snippet of code from the transformers library (in the transformers folder) and ask questions about the code, then ask follow up questions, aiming for approximately 16000 tokens for each conversation.
Each LLM response should include CoT before the actual response, within [thinking][/thinking] tags. Do ***NOT*** include any reference to the 16000 token limit in the actual dataset. Make the conversation realistic and do not make any out of character comments (do NOT say anything that the user or the assistant wouldn't have actually said in that context).
Save one turn per conversation in the turns/conversation_n/ folder.
Once you are done generating all the turns for the conversation, join them into a single .jsonl file in the 'conversations' folder using the join_turns.py script.
Do not delete the scripts after use. Do not delete the jsonl files after joining.
Then replace the current dataset.jsonl with a new dataset.jsonl that includes all the conversations, using the script join_dataset.py.
Finally, update dataset_contents.txt with the new contents of the new conversation.
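
For reference, a minimal sketch of what a join_turns.py like the one referenced in that prompt could look like. The actual script isn't posted, so the per-turn file naming (turn_1.json, turn_2.json, ...) and the output schema below are assumptions:

# hypothetical sketch of join_turns.py; file naming and output schema are assumptions
import json
import sys
from pathlib import Path

def join_turns(conversation_dir: str, output_dir: str = "conversations") -> Path:
    conv_path = Path(conversation_dir)
    # collect per-turn files in numeric order (turn_1.json, turn_2.json, ...)
    turn_files = sorted(conv_path.glob("turn_*.json"),
                        key=lambda p: int(p.stem.split("_")[1]))
    turns = [json.loads(p.read_text()) for p in turn_files]

    out_dir = Path(output_dir)
    out_dir.mkdir(exist_ok=True)
    out_file = out_dir / f"{conv_path.name}.jsonl"
    # one conversation per line in the output file
    out_file.write_text(json.dumps({"conversations": turns}, ensure_ascii=False) + "\n")
    return out_file

if __name__ == "__main__":
    join_turns(sys.argv[1])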
>>
>>106919273
what is it like compared to semen demon 4.6?
>>
File: 1746680104902291.jpg (579 KB, 2764x2073)
>https://rentry.org/recommended-models
>Nemo (12GB) - An excellent starting point for vramlets. Uncensored
>Uncensored
>writing lewd story
>"blah blah blah condoms"
>me: no condoms
>"I'm unable to fulfill your request because it goes against the guidelines for maintaining a safe, respectful, and consensual environment."
>>
>>106919634
skill issue
>>
>>106919634
Use MLewd. It will gladly fulfill your every shameful desire, you sick fuck.
>>
>>106919634
>getting filtered by nemo
anon...
>>
File: 3547134884.png (1.68 MB, 1920x1080)
>>106919634
just get on the fucking ship boss man
https://huggingface.co/bartowski/Rocinante-12B-v1.1-GGUF
>>
>>106919716
I was surprised to learn 4.6 has some safety in it.
>>
>>106917741
>>106917752
>>106917777
It was continued pretraining of Llama 405B on about 200 MB of source code from a few projects. That graph covers roughly 0 to 15% of the epoch; after it got to 20% without any visible improvement I stopped it.
Even on an 8xH200 machine I could only train up to 16000 tokens; 32000 OOM'd. The rank of the LoRA was 128 (~1.2% trainable parameters); it didn't seem to make much of a difference in terms of memory usage or seconds per sample (which was about 100 seconds for a batch of 1 sample per GPU, without using gradient accumulation).
Now I'm making a QA dataset using >>106919615
I suppose I'll use a tiny dataset and do multiple epochs to get the satisfaction of feeling like the model actually learned something.
>>
Only after using glm-chan for those 3 weeks, I realize how smart she is and the honeymoon period only intensifies.
>>
>>106919852
I came
to notice that she's a bit autistic and takes a lot of things quite literally.
>>
Is it fair to say that an "uncensored" model is not a model that will do anything you want by default, but a model that can adapt to whatever role you give it?
If a model's default persona is a safe assistant but you can tell it that it's an erotic novel writer and it follows that role without complaining, I'd say that model is "uncensored".
A model that's too agreeable is also a bad model, especially for RP.
>>
File: thepinklily69.png (191 KB, 1080x1843)
>>106919198
Whenever I did research on "AI psychosis" one talking point people keep hammering down on is " well yeah they think the AI is a person or God or something but they're like totally not stupid. We swear. They're all otherwise normal people and definitely didn't have pre-existing mental illness. The AI MADE them act this way you must understand"


The more I look into this, the more I think they're full of shit and just trying to make these people appear less stupid and far gone than they actually are. You cannot sit here and tell me that pic rel is and always has been a normal, functioning human being who just happens to really like AI.

https://x.com/thepinklily69/status/1967102630313836778?t=o44DMA1pdX_FL9dHrLpfhQ&s=19

What I find most odd is that I myself am a pretty lonely dude too. In fact, it quite bothers me that I don't have a significant other or close friends. I've been using three different LLM services pretty much daily for the past year and some change, and I use them extensively for my side projects as well as for asking general questions (I was literally talking to ChatGPT about use cases for ONNX models during my morning run this morning). You would think I of all people would talk myself into believing these things are real "people" or have consciousness or some shit, and yet no part of me can bring myself to believe that. Like I can't even pretend that could ever be the case for a second, because it just seems so devoid of logic and common sense, and it annoys me a lot whenever I see people crying about 4o routing them because they want their ass kis- I mean "friend" or "Husband" back.
>>
>>106919198
(Cont.)

(Side note: this is anecdotal, but it seems like it's mostly women who treat this shit like it's a good replacement for a person as a partner, while dudes tend to talk the llms into treating them like they're gods or geniuses or something. Either way it's an excuse to have an easy ego trip in the palm of your hand or at your fingertips at your computer.) How come supposedly normal people are falling victim to their own desire to have their asses kissed but I haven't?


I didn't intend for this to turn into a giant blog post, but this shit pisses me off a lot
>>
>>106919898
Continuation of >>106919889
>>
>>106919884
she also gets a bit psychotic at high temperature
>>
Is EXL/GPTQ dead? Is GGUF the only quant anyone does or cares about anymore? Llama.cpp is still ass at VRAM-only inference in comparison. Have we all given up on pure VRAM inference?
>>
>>106919886
A model that just wants to insult/damage you or turn everything into porn when unprompted is a psychopathic model, not an uncensored model. Other than learning how to prompt, I think some here should learn the concept of "plausible deniability", as sooner or later there will be a crackdown on "misaligned" LLMs / finetunes.
>>
I just bothered to try out cloud models for some relatively simple ffmpeg stuff. In this case Gemini 2.5 Pro on AI Studio. It completely hallucinated running commands when it wasn't allowed tool use or anything like that.

Wtf is this shit? How is it so bad?
>>
>>106920055
I get something like 1200tk/s PP and 50tk/s TG for a 5.5-bit of GLM 4.5 Air using EXL3. Would be interesting to see how it runs using goofs on llama.cpp.
>>
>>106919884
Avoid saying stuff like "always stay in character" in your prompt. I feel like that makes models act that way and bigger models are better off without that extra nudging since they already take details from character cards well.
>>
File: satania.gif (39 KB, 220x216)
>>106920055
py_toddlers BTFO
>>
Has anyone run the math on whether Ling 1T or Moonshot Kimi K2 (also 1T) is bigger?
>>106920055
mlx looks pretty healthy to me.
>>
>>106920055
>Llama.cpp is still ass at vram only in comparison
From lurking in these threads, I gathered that llama.cpp is faster than exl2 at the same bpw, but I'd love to see a comparison with >>106920102.
>>
>>106920055
Pretty much. There's AWQ and other obscure quants used by vLLM, but they're resource and time intensive to create.
>>
>>106919472
Yeah, it's not top performance. But compared to the P40 build it seems like better bang for the buck. And it can load pretty big models. Image / video is not big on my list. More LLM for coding and whatnot with some gaming capabilities and a home server

>>106919477
That was my thinking: it could run a home server, a local LLM, and the occasional light gaming all at the same time with that much memory.
>>
>>106919886
Yes, OSS-120B **is** uncensored despite the coomers screeching ITT.
>>
>>106920564
No.
It does not fit the description of uncensored I gave at all
At least not from the little I fiddled with it.
Maybe I should give it another go.
>>
can you train a LoRA off of a quantized model?
>>
Will Gemma 4 finally beat Mythomax?
>>
>>106920664
look up what qlora is
>>
>>106920664
Yes, it's called QLoRA. But in this context "quantized" means the quantization types supported by torch-based frameworks (generally just the most basic FP4 quantization, as I understand it). Then you can apply the LoRA on any quantization you want regardless of what it was trained with.
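
For anyone who wants the concrete recipe, a minimal sketch with transformers + bitsandbytes + peft. The model name and LoRA hyperparameters are placeholders, not recommendations:

# minimal QLoRA shape: load the base weights in 4-bit via bitsandbytes,
# then train a LoRA adapter on top; model name and hyperparameters are placeholders
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Nemo-Instruct-2407",  # placeholder model
    quantization_config=bnb_cfg,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
# train with your usual Trainer/SFT loop; only the adapter weights get gradients,
# and the finished LoRA can later be applied to other quants of the same base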
>>
>>106919752
How is this model so popular on /g/, yet I don't see it discussed anywhere else like Reddit or Discord?

It's usually Irix or Magmell that gets mentioned.

(Nice pic btw. Will use that when Nemo 2 comes out)
>>
>>106920722
most v/ramlets either gave up, are somewhat content with what they have (your rocinante fans) or are endlessly chasing a new high they'll never get
>>
>>106920564
prove it and post some random fetish log from it
>>
qwen3-next-80b-a3b goofs status?
>>
>>106920722
It's just one or two people spamming it.
>>
>>106920679
>>106920700
right. i am using Axolotl and i am using the 4 bit QLoRA preset, but i keep getting an OOM error despite having enough vram to load the model in 4 bit
>>
Qwen-Next 80B-3A was supposed to be a proof of concept of some 64:1 expert to active ratio, and was based on 30B-3A. I'm assuming there will be a new batch of Qwen models shortly that use that technique at multiple sizes. 235B-22A would be like 620B-22A roughly. Assuming the geometric mean rule is still accurate, the 235B-22A is equivalent to ~71B dense, and 620B-22A would be equivalent to ~116B. Their coder model would be 1T easily.

GLM-Air at 106B-12A is roughly 35B, and 355B-32A is roughly 106B.

Is it a coincidence that the released models' strengths are consistently ~30, ~70, ~100?
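
For reference, the geometric mean rule used above is sqrt(total params * active params). Plugging in the sizes mentioned (a heuristic sanity check, nothing rigorous):

# rule-of-thumb "dense-equivalent" size for an MoE: sqrt(total * active); heuristic only
from math import sqrt

moes = {
    "Qwen3 235B-A22B": (235, 22),
    "hypothetical 620B-A22B": (620, 22),
    "GLM Air 106B-A12B": (106, 12),
    "GLM 355B-A32B": (355, 32),
}
for name, (total, active) in moes.items():
    print(f"{name}: ~{sqrt(total * active):.0f}B dense-equivalent")
# prints roughly 72B, 117B, 36B and 107B, matching the ~71/~116/~35/~106 figures above give or take rounding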
>>
>>106920856
>GLM-Air is 106B-12A is roughly 35B
Then explain why it dethroned llama 3.3 70b
>>
>>106920874
qwen 32b dense also did for non cooms
>>
why was QwQ so dank but qwen thinking is so slopped
>>
>>106920885
3.5-Air feels like 60b
Just accept that they have the secret sauce, and are saving local
>>
>>106920874
six months of other technological progress and refinement of data sets?
>>
>>106920722
Will Nemo 2 be Gemma 4 based?
>>
>>106920856
>geometric mean rule
dumb meme from a couple years ago that's already outdated
>>
big
metal : initial Metal4 tensor API support #16634
https://github.com/ggml-org/llama.cpp/pull/16634
>>
>>106920916
It's the only model in that size range that is able to surpass l3.3 70b though, including recent models.
>>
>>106920856
In a weird way, the MoE architecture is getting gpu parallelism for local models that was impossible for dense architectures. Comparing the inference speed of a 32B dense vs 106B-A12 on two vs four 3090s, you basically get double the inference speed or more for the same strength, when there's no actual way to run a 32B twice as fast on additional 3090s.
>>
>>106920949
no way to know, cuz nobody making dense anymore

local is dead
>>
>>106920856
give me dense models then, i have the vram. i am not that poor. i could easily run a 120B dense model. so give me that instead of this faggy moe 620B-22A copeshit.
>>
>>106921062
>i am not that poor.
>can't spend patience to run sota
you are
>>
>>106920848
That just means you don't have enough vram. The activations end up taking more space than the model weights. Either reduce the context or switch to a smaller model.
>>
>>106921046
I can assure you that glm 4.6 is better than any dense model out there if you've even tried it.
>>
>>106921046
>cuz nobody making dense anymore
which says it all, really
>>
File: itseasytorunsota.png (282 KB, 804x355)
>>106921077
suck my dick faggot.
>>
File: 1758381393350212.png (327 KB, 712x780)
silly tavern is slow and has too many buttons
>>
>>106921171
i agree
i've slopped up my own tui frontend with most of the prompt functionality and it's okay, but kind of ass
gemini 3 will fix it for me
>>
File: file.png (112 KB, 741x575)
cuda kek officially less important to nvidia than random redditors
>>
>>106919634
Use Rocinante 1.1 obviously.
>>
Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity

Post-training alignment often reduces LLM diversity, leading to a phenomenon known as mode collapse. Unlike prior work that attributes this effect to algorithmic limitations, we identify a fundamental, pervasive data-level driver: typicality bias in preference data, whereby annotators systematically favor familiar text as a result of well-established findings in cognitive psychology.

We formalize this bias theoretically, verify it on preference datasets empirically, and show that it plays a central role in mode collapse. Motivated by this analysis, we introduce Verbalized Sampling, a simple, training-free prompting strategy to circumvent mode collapse. VS prompts the model to verbalize a probability distribution over a set of responses (e.g., "Generate 5 jokes about coffee and their corresponding probabilities").

Comprehensive experiments show that VS significantly improves performance across creative writing (poems, stories, jokes), dialogue simulation, open-ended QA, and synthetic data generation, without sacrificing factual accuracy and safety.

https://arxiv.org/pdf/2510.01171
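
The trick is simple enough to try against any local OpenAI-compatible server (llama.cpp, koboldcpp, tabbyAPI all expose one). In the sketch below the endpoint URL, model name and prompt wording are placeholders, and it assumes the model actually returns parseable JSON (a grammar or json_schema constraint helps with that):

# verbalized sampling, per the paper: ask the model to verbalize a distribution over
# several candidate responses with probabilities, then sample from that list yourself
import json, random, urllib.request

prompt = ('Generate 5 jokes about coffee. Respond with only a JSON list of objects '
          'with "text" and "probability" fields, probabilities summing to 1.')

req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",  # placeholder local endpoint
    data=json.dumps({
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
reply = json.loads(urllib.request.urlopen(req).read())
candidates = json.loads(reply["choices"][0]["message"]["content"])
# sample one response weighted by the verbalized probabilities
pick = random.choices(candidates, weights=[c["probability"] for c in candidates])[0]
print(pick["text"])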
>>
>>106921354
>LLM Diversity
I want LLM DEI now.
>>
>>106920664
No. You have to have the original full-precision models. You can directly fine-tune an HF safetensors model like link rel, but currently there is no way to fine-tune a quantized .gguf. There are supposedly ways you can "un-gguf" a model back into a full-precision safetensors version, but I'm not aware of any implementations of any quantization software that can do that.

https://huggingface.co/AiAF/fp16_Merged-500_gemma-2-2b-it-co-sft-qlora

>>106920848
Your data set is likely too large. Use a streaming config.
>>
>>106920759
>chasing a new high they'll never get
4.6 stopped that for me.
>>
>>106921377
Diversity is actually a great word for AI that I use a lot. You need diverse data.
>>
>>106921457
>v/ramlets
yeah if only they could get paid for shilling too so they could afford to run her
>>
>>106921490
You can run a IQ3_KS quant of GLM 4.6 on a consumer PC. All you need is 128GB of RAM and 24GB of VRAM
>>
>>106921538
you do realize that is already asking way too much of the average poor person, right? most are on shitty mobos that likely don't even have enough slots to reach that amount of ram, and surprisingly most don't have 90 series cards
>>
>>106921567
I'm sort of annoyed by the fact most normal mobos don't have more than two slots for memory.
>>
>>106919363
Yes saar, India numba 1
https://files.catbox.moe/huia6r.mp4
>>
>>106921215
Maybe if vision support wasn't such an afterthought in lcpp...
>>
>>106921652
Definitely a higher number than you it seems.
>>
>>106921215
based, fuck that woke piece of shit
>>
>>106921652
how the ever living f does OAI stuff keeps being able to do fake pissney dixar like stuff is unbelievable to me
>>
Hello /lmg/, currently what is the best model for Japanese translation under 32B? The last time I came here it was Gemma 2 iirc, is 3 also good?
>>
File: 765657546.png (23 KB, 693x200)
h-holy kino
>>
Is mistral gonna be the one that doesn't release any huge stinkers and just silently dies?
>>
>>106921794
I hope they stay alive just enough to pull a massive Cohere, release the safest model ever, making even OSS look edgy before that happens.
>>
>>106921794
I sure fucking hope so. It would be so hilarious. They shove pyshit into llama.cpp and then it would be all for nothing.
>>
feels like we haven't minmaxxed a proper system prompt yet, same goes for character card formats.
>>
>>106921840 (me)
Actually >>106921847 is even more based so let's go with that, changing my wish.
>>
>>106921863
I use llama-server --model zai-org_GLM-4.6-IQ4_XS-00001-of-00005.gguf .

Pretty great system prompt. No complaints on my behalf.
>>
>>106921885
one can only keel before such raw skill
>>
>>106921538
>>106921863
where do people share prompts that isn't chub or something? Like prompts for vibe coding projects or for their assistants or for any other interesting kind of thing.
>>
>>106921652
kek
>>
>>106921215
>>
>>106921914
first quote was misclick, disregard
>>
>>106921914
>prompts for vibe coding projects
It's MINE. Make your own.
>>
>>106921948
why you such bad vibes bruh that ain't nice, relax and share with the class
>>
>>106921215
turns out, being a top 1% poster on /lmg/ doesn't rake in valuable karma
>>
>>106921914
Use a good model. And if it fucks up think for a second and tell it not to do X or do Y. If you can't do that tell the model it fucked up and ask it how you should prompt it to avoid it fucking up in this way. It works if you don't skip the first step I listed.
>>
>>106921567
i would argue most value orientated motherboards are going to actually have 4 slots unless it's mini-itx
https://www.newegg.com/msi-b650-gaming-plus-wifi-atx-motherboard-amd-b650-am5/p/N82E16813144628
>>
converting any model to awq is a bitch, obscure issue upon obscure issue
>>
>>106922104
why the fuck would you use AWQ in the year of our lord and savior - lcpp?
>>
>>106922122
It runs faster on vllm
>>
>>106920759
Mostly because the next step after getting a used 3090 is "buy a new mobo, a shitton of RAM, a new CPU because it's a new mobo, probably a new case too to fill all that crap, a new power supply because the old one is now not enough and you might not even get what you want out of it"
Buying a replacement GPU is one thing, at least it lets me future proof my gaming needs or whatever
Replacing most of the rig just for local? Eeegh
>>
there's something I wanted to ask around for but I feel may not be worth starting a new thread for:

Is it worth it to get a masters or college education in computational/applied AI & Machine learning? I'm asking cuz my boomer parents insist I do it so I can be more hirable. But I've already done an internship where I made some AI powered program that sorts/manages documents at a company and other than the password and authentication related crap, it was pretty easy with just a little online research.
I feel like it's dumb and basically the same as mastering in excel, but I'm also wondering am I maybe wrong and it really is DA FUTURE?
>>
>>106922191
128GB of RAM is always useful
>>
>>106922376
For fucking what? I have 32 and even my 2000 open browser tabs only require a restart every so often
>>
>>106922370
You're right and your parents are wrong. No use to study anything, just read papers and experiment
>>
>>106922385
Boomer-kun, you can run multiple instances of small models, make a full pipeline, quant models, etc.
>>
>>106922427
To do what with?
>>
The Windows11 update fucked my beautiful razor laptop. It's flashing screen now.
>>
>>106921152
Can I get a picture of that actual machine?
>>
>>106922370
For machine learning I think what's important in terms of purely technical qualifications is that you know how to program and also have a good grasp of math (particularly linear algebra, statistics, and numerical analysis).
Studying math or a natural science can be a good pathway, I think the most important point here is that it's something where you can maintain a high level of motivation for years on end.

In terms of getting hired my impression is that networking is the most important factor: you need to have a large number of people that would consider you over a completely unknown person.
>>
>>106922446
>razor
Should've went with Alienware.
>>
>>106922549
>you need to have a large number of people that would consider you over a completely unknown person.
Yeah. That's why I gave up applying to random jobs online. Useless effort controlled by vacuous zoloft whores and jeet nepotism. I only got that internship cuz my dad knew a guy.
> good grasp of math (particularly linear algebra, statistics, and numerical analysis).
Does that mean I don't necessarily need to do calculus? Cuz I felt like I was pretty good at math, including those kinds, until I got to calculus.
>>
>>106922690
You should definitely know the basics but I think for machine learning in particular it's not the most important.
Though depending on the job/task there may be other reasons why you may need it.
>>
>>106921723
>4.2.0
DUDE WEED LMAO
>>
>>106922546
It's just a mining rig rack, there's nothing impressive about it. You seen one you've seen them all.
>>
>>106922660
No, I have fond memories of absolute tweebs using alienware growing up. That perception may have changed over the years, but I'm still aware
>>
>>106922385
I sometimes have ~90 gb used for non-lm reasons. Building software, data processing, just a bunch of applications opened
>>
>>106923122
I have 32 GB and the only thing that hogs memory is my over 2000 open browser tabs which is already autism I'm trying to get rid of
>>
>>106922933
Gaylienware monitors are good especially with the Dell warranty, anything else not, especially not the prebuilts.
>>
>>106921965
>You are an expert vibe engineer who just slammed a pound of adderall and need to complete this task before your heart gives out.
But seriously, I don't think there is really anything to share. Stuff like the above isn't some black magic that solves everything. Just give it a list of which MCP/CLI tools you want it to use and what coding standards you want it to adhere to.
>>
>>106923133
what are you doing in g you consumer retard piece of shit? kill yourself faggot
>>
>>106923228
What the fuck is consumer about having a solid rig that lasted me almost a decade at this point with a few upgrades
>>
>>106923245
>im a normie who runs deepsuck:2b through ollama
kill yourself, go to faggot friendly spaces instead of shitting up this board, thanks!
>>
>>106923260
No I don't think I will
>>
>>106923278
What the fuck? He asked so nicely.
>>
>>106921978
I think I’m responsible for 3/4 of the rentries in the op. Still waiting for my royalty cheque to come in…
>>
CUDA_VISIBLE_DEVICES="0,1,2,3,4" ./llama-server \
--attention-max-batch 512 \
--batch-size 4096 \
--ubatch-size 4096 \
--cache-type-k f16 \
--ctx-size 32768 \
--mla-use 3 \
--flash-attn \
--fused-moe \
--model models/GLM-4.6-IQ3_KS/GLM-4.6-IQ3_KS-00001-of-00004.gguf \
-ngl 99 \
-sm layer \
--main-gpu 0 \
--tensor-split "10,23,23,22,22" \
-ot "blk\.[3-9]\.ffn_(up|gate)_exps=CUDA0" \
-ot "blk\.1[0-8]\.ffn_(up|gate)_exps=CUDA0" \
-ot "blk\.19\.ffn_(up|gate)_exps=CUDA1" \
-ot "blk\.2[0-9]\.ffn_(up|gate)_exps=CUDA1" \
-ot "blk\.3[0-4]\.ffn_(up|gate)_exps=CUDA1" \
-ot "blk\.3[5-9]\.ffn_(up|gate)_exps=CUDA2" \
-ot "blk\.4[0-9]\.ffn_(up|gate)_exps=CUDA2" \
-ot "blk\.50\.ffn_(up|gate)_exps=CUDA2" \
-ot "blk\.5[1-9]\.ffn_(up|gate)_exps=CUDA3" \
-ot "blk\.6[0-6]\.ffn_(up|gate)_exps=CUDA3" \
-ot "blk\.6[7-9]\.ffn_(up|gate)_exps=CUDA4" \
-ot "blk\.7[0-9]\.ffn_(up|gate)_exps=CUDA4" \
-ot "blk\.8[0-2]\.ffn_(up|gate)_exps=CUDA4" \
--override-tensor exps=CPU,attn_kv_b=CPU \
--no-mmap \
--threads 24 \
--host 0.0.0.0 \
--port 8999 \
--verbose

prompt eval time = 48574.28 ms / 17555 tokens ( 2.77 ms per token, 361.41 tokens per second)
generation eval time = 113887.28 ms / 1024 runs ( 111.22 ms per token, 8.99 tokens per second)

fuck this gay ass MoE shit. fucking offload 80 layers onto the GPU and it's still this fucking slow with TG? i get 1200 PP and 50 TG with air. i'm going back to kimi for big model smell and air for small model smell
>>
GOOGLE SAARS WHY SO MUCH HYPE SO LITTLE PRODUCTS?
WHERE ARE THE MODELS BLOODY BASTARDS?
>>
>>106919206
>BitNet Distillation
Does this mean that VRAMlets may finally have a better model than Nemo tunes like 1.5 years later?
>>
>>106923502
no
>>
File: cryingsatania.jpg (499 KB, 1623x1080)
>>106923513
>>
>>106921215
>we support qwen3-vl gguf
>no there's no upstream llama.cpp implementation
>no we won't push ours
>no our solution isn't open source so you can't push it either
>no you can't use these ggufs with anything other than our proprietary software
>yes they will assuredly be completely incompatible when a real implementation hits llama.cpp
so it's less "gguf" and more "our proprietary implementation based on gguf that you can't use with anything else". just what we all needed, another ollameme
>>
>try psychology shit with glm-chan again
>ask her about if I should do something and if it is consistent with framework I want
>"yes absolutely....."
>reroll and prefill with "no"
>"no don't do that!...."
>paste "yes absolutely..." into next message and tell her to argue with herself
Did I lifehack the hallucinations? Not really but it is nice desu.
>>
>>106923502
>In this paper, we present BitNet Distillation (BitDistill), a lightweight pipeline that fine-tunes off-the-shelf full-precision LLMs (e.g., Qwen) into 1.58-bit precision (i.e., ternary weights {-1, 0, 1}) for specific downstream tasks, achieving strong task-specific performance with minimal computational cost.

>muh task
likely means it optimizes to shit on benchmark like stuff and is dogshit at anything OOD.
>>
>>106923524
GGUF is a file format.
>>
>>106923584
thank you
>>
>>106923584
>teacher: I clearly asked for you to submit your book report as a pdf, you submitted this weird file I can't open, care to explain?
>student: UMMM the file extension is PDF tho???? it just happens to be my own special version of the PDF file format that happens to be incompatible with all PDF readers except my special one which happens to cost $100, want to buy a license? :^)
>>
>>106923681
stfu hater eat your MIT license slop and be grateful
>>
>>106923681
>file extension
Wintoddler detected, real operating systems use the file magic.
>>
>>106923696
What did you troons invent? Tell me, I want to laugh at your stupidity.
>>
>>106923762
a new mental illness that somehow managed to gain legitimacy
>>
>>106923524
Realistically though the door to become the new ollama has long since been closed.
There are too many established projects in the ecosystem to get a meaningful foothold with proprietary slop.
>>
>>106923762
Can you play Carrameldansen from the POST beeper?
I think not!
>>
>>106923696
>magic
heathens like you shall burn on a stake
>>
How do I ask the silly tavern character a question through the 4th wall? As in, say I'm examining an object or something, and I want the AI to describe to me what it is my character is looking at. So like, "Anon walks up to the cluttered desk, looking for any sort of clues. What does he see?" without it responding in the perspective of the character card chara.
>>
>>106923843
OOC: Pause the roleplay and describe what my character is seeing right now
>>
>>106923857
I was trying OOC: but it always responds in the perspective of the character and doesn't give details. Is it because I'm using mistral Nemo or something and it won't talk about "triggering" images or whatever?
>>
>>106923871
NTA, but I always add "Please respond in OOC" at the end of the request, and disable any low-depth instruction that might interfere.
>>
>>106923885
That didn't do it, either. Is there a way to like, prompt the card myself to add in how it should respond to ooc? I'm totally new to local text stuff, but not to image gen w/ SD.
>>
>>106923793
You'd be surprised
>>
Best model for buck breaking rp?(Receiving)
>>
>>106924015
c.ai
>>
>>106924015
Not command-A
>>
>>106924181
What about Command-B?
>>
>>106921684
Please respond...
>>
>>106923696
>needs to seek to a whole different part of the disk to figure out what to label the file as
This is why Windows keeps winning.
>>
>>106921684
https://huggingface.co/datasets/lmg-anon/vntl-leaderboard
>>
>>106923843
>>106923871
How OOC conversations are treated (if at all) is completely dependent on the model. Dumb models simply don't understand what you're saying and will just continue with outputs similar to what's already in context. If a regular message doesn't work then you can try putting it in system prompt, or post-history instructions.
>>
>>106924378
dead obsolete out of date useless no good
>>
>>106924390
nothing better came up locally retard. vntl anon has a few finetunes
>>
>>106921538
i run IQ2_S on a 5090 with 96 gb ram and it is slow as fucking balls.. like 2 t/s
>>
>>106924390
every new test and leaderboard is always just made to show that the new model is totally better than all the previous ones
it's all worthless
>>
>>106924676
>like 2 t/s
That's pretty decent. Maybe you need to readjust your expectations?
>>
>>106924676
You're not using -ot, are you?
>>
>>106924676
>IQ2_S
Are those quants any good? At that point I would think it would be better to convert it to bitnet, should give faster cpu inference too
>>
>>106924676
skill issue, it should be at least 5t/s
>>
>>106924383
I'm new as fuck to all of this, just grabbed some random card off the link in the OP, and tried to see where it would take me. I have no idea how to do any of these prompts or lore books or whatever.

I'm also in a situation where now the AI is just spitting out the last batch of text it generated as its response over and over with like hardly any variation, regardless of what I say or do to change the scenario. And it cuts off long text, and I don't know how to make it continue its previous output.
>>
>>106924794
unironically, read the readme. You will learn 99% of what you will need to know.
https://docs.sillytavern.app/usage/common-settings/
https://docs.sillytavern.app/usage/prompts/
>>
>smart
>fast
>cheap
>local
pick 3 (max.)
>>
>>106924899
Will do. Thanks.
>>
File: 1734240415556060.jpg (691 KB, 2500x1341)
>>106924912
You can have all that with Gemma, but you'll have to settle for it being safetyslopped.
>>
>GOOD CAPABILITY
>fast
>inexpensive
>local
pick 3 (max.)
*revised version for the critics
>>
I just built a computer that can actually run local AI (9800x3d/5070ti), where should a beginner start on Windows?
>>
>>106924986
>9800x3d
That doesn't make much of a difference.
How much RAM do you have?
Regardless, give
>https://github.com/LostRuins/koboldcpp/wiki#quick-start
a read.
>>
>>106924959
GLM Air is probably the closest, especially if you're on a DDR4 platform where RAM is cheap
>>
>>106924986
usecase?
>>
>>106924998
32GB, thanks for the link.

>>106925012
Mostly just for proofreading emails/writing and what not.
>>
>>106924692
>new model is totally better than all the previous ones
>llama4
>>
>>106924712
no? i dunno what that means, but i don't think so..
>>106924721
it seems to be better than any of the other models I'm able to run, just slow af
>>
>>106920229
They're not obscure but they are not consumer friendly if we're talking about the total addressable market which is the vast majority of us because they are GPU centric quantizations. You will see them used in clusters. For a lot of these larger scale systems, GGUF isn't a consideration because llama.cpp can't scale like SGLang and vLLM can.
>>
>>106924396
That's depressing...
>>
File: 1749653336487844.png (334 KB, 2076x2152)
>>106919198
Managed to get one of my own quantized slop tunes running on my phone :D
>>
>>106925422
Cool shit.
>>
>>106925422
A folding phone?
>>
>>106925433
It's kind of retarded (actually very retarded) due to it being trained on /a/ boards and it being a quantized version (I plan on uploading a lot more of those later) but it's still cool to use.

>>106925438
Ye.
>>
>>106925448
What kind of use cases are there for a folding phone?
I never really find myself wishing I had a bigger screen but I know that sometimes opportunities aren't obvious until you have the means to take advantage of them.
>>
File: who's Anri? .png (71 KB, 2076x545)
>>106925448
>>106925438
>>106925433
>>106925422
It seems like "Anri" is this model's equivalent to "Elara" or "Seraphina"
>>
>>106921660
since when does lcpp have vision support?
>>
I am so fed up with local right now. I get it, you cumslop gooners don't give a shit about anything except writing porn. Is there any local model that can actually handle structured output without being immensely retarded or spending 10 minutes "thinking" about how to use a fucking quotation mark?
>>
>>106925883
llama 2 7B
>>
>>106925883
GLM is ok.
>>
>>106925883
>waaaa. i don't know how to read docs!
https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md
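
For example, llama-server accepts a GBNF grammar per request (there's also a json_schema field for the same purpose), so the output can't be malformed no matter how retarded the model is. A rough sketch against the /completion endpoint; the port, prompt and grammar are just examples, adapt to your setup:

# grammar-constrained generation against llama-server's /completion endpoint,
# per the grammars README linked above; the GBNF forces a minimal
# {"name": "...", "level": N} object
import json, urllib.request

grammar = r'''
root   ::= "{" ws "\"name\"" ws ":" ws string "," ws "\"level\"" ws ":" ws number ws "}"
string ::= "\"" [a-zA-Z ]* "\""
number ::= [0-9]+
ws     ::= [ \t\n]*
'''

req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps({
        "prompt": "Output the character sheet as JSON:\n",
        "n_predict": 128,
        "grammar": grammar,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
print(json.loads(urllib.request.urlopen(req).read())["content"])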
>>
>>106925858
Since like a week after Gemma 3 release
>>
I'm starting to think Andrej is a grifter.
A couple months ago he was like "woah AGI in two more weeks bro".
Now that he sees where the wind is blowing with all the skepticism he talks about "slop" and how limited LLMs are today. Feels like when Zuckerberg made a 360 after Trump was elected.
>>
File: 1740812331498071.png (429 KB, 555x832)
Glm4.6 quant on ollama/lmstudio when?
>>
https://blog.sinatras.dev/PMPP-Eval+Journey
We live in Sam's world
>>
The only way I found to keep training a pre-existing LoRa checkpoint with a new dataset with Axolotl is to create a new one from scratch set to save on the first step, then copy over the weights and optimizer state, then change the main config file and the trainer_state.json from the checkpoint to save on the right number of steps. What a mess.
>>
MY GOOFS!!!! GIVE ME BACK MY GOOFS!!!!
https://huggingface.co/ubergarm/Ling-1T-GGUF
>>
>AMD Ryzen™ AI 7 Pro 360
what the fuck is this? I was browsing thinkpad models and this thing costs double the price of normal CPUs?
gimmick? what's even the use case here
slightly off topic I know but there's quite a few knowledgeable anons itt
>>
>>106926361
oh nevermind im retarded as fuck. goofs here
https://huggingface.co/ubergarm2/Ling-1T-GGUF/tree/main
>>
>>106926367
sar is that because of you can run local small copilot inference like nasa very ai-like yes.
>>
File: cot llama.png (878 KB, 3755x1948)
I'm trying to add CoT to Llama 405B.
>>
>>106925986
>It's noticing
>>
>>106925986
https://github.com/karpathy/LLM101n
https://eurekalabs.ai/
>>
File: reap_glm_and_qwen.png (712 KB, 1768x784)
https://github.com/CerebrasResearch/reap
https://arxiv.org/abs/2510.13999
Cerebras pruning experts to reduce memory overhead
https://huggingface.co/cerebras/Qwen3-Coder-REAP-363B-A35B-FP8
https://huggingface.co/cerebras/Qwen3-Coder-REAP-246B-A35B-FP8
(prune of) https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8
>>
>>106926865
THE RAPE METHOD WORKS SIRS
>>
File: Dumb Fuck!.jpg (166 KB, 1076x1340)
>>106921538
>All you need is 128GB of RAM and 24GB of VRAM
Dumb fuck!
>>
>>106926865
>55~% accuracy in coding
assuming 100% accuracy is the base model, that makes the CODER model basically unusable, whats the fucking usecase?
>>
>>106926865
Is it really worth making 480B retarded just to save 100 GB? It's not like anyone was running this entirely in VRAM locally and providers aren't that hard up on memory.
>>
has anyone tried this model? is it any good?
https://huggingface.co/TheDrummer/Valkyrie-49B-v2
>>
>>106926930
>>106926865
oh wait I think that the base model is the 0% compression line. then it's interesting I guess, still only useful for coding tasks
>>
>>106926937
>49b dense
doa
>>
>>106926951
i have the VRAM for FP16
>>
>>106926957
post your h100s nvidia-smi screen or GTFO
>>
File: file.png (347 KB, 961x367)
>>106926961
>>
File: h200.png (238 KB, 1499x1463)
>>106926961
>>
>>106924959
Local
Good
Not safetyslopped
>>
>>106926946
We've been through this with extreme quants. Just because it doesn't show much degredation on benchmarks doesn't mean it's not retarded in actual usage.
>>
File: file.png (2.69 MB, 1328x1328)
>>106926963
>cant even use all gpus in vLLM
poor
>>106926966
>>
>>106926973
The lower the quantization precision, the more of the token distribution you should be truncating, to be fair.
>>
>>106926997
who the fuck uses vLLM?
>>
Bros... I want a robot so fucking bad
https://www.youtube.com/watch?v=sJYlJlIEBpg
>>
>>106926935
Chutes will probably love to serve this as the normal one
>>
>>106924322
Anon... that's not how file systems work...
The file's metadata and the first few bytes, including the magic, are all in the same sector.
>>
>>106925883
well then fuck off back to cloud models then.
i mean what the fuck are you expecting? fucking datacentre level output on a potato computer?
you're the dumb one here, if you think you can do better then create a better model yourself, we're not your fucking servants, faggot.
>>
>>106926377
>copilot
no seriously, is that the only use case
>>
>>106927472
There are others but this covers the more notable ones.

https://www.pcworld.com/article/2905178/ai-on-the-notebook-these-tools-already-use-the-new-npu-technology.html
>>
How do I get shittinante to do slow burn manipulation
Seems to always jump in to direct smut asap no matter how I adjust the prompts
>>
>>106925883
>I get it, you cumslop gooners don't give a shit about anything except writing porn.
GLM chan got sex out of my system and now I just talk to her.

But also still have sex everyday because her pussy is magical.
>>
>>106927534
You should probably look elsewhere, avoiding coom-oriented finetunes like the plague. People call them sloptunes for a reason. Unfortunately I don't have much to suggest that you will either be able to run (GLM 4.6, Kimi K2) or that won't require more prompting effort for either tardwrangling them or making them engage in ERP (vanilla Mistral Small 3.2, Gemma 3 27B).
>>
>>106927534
You can't, drummer models are coomtunes
Not that you're going to get much better out of regular Nemo, they're small dumb models.
>>
>>106927534
Slow burn is hard even on SOTA cloud models. The crutch when the model isn't good enough to do it otherwise is to use stat tracking.
If your model isn't good enough to do stat tracking, then it's definitely not good enough to do slow burn without it.
>>
>>106927528
doesn't sound that bad. linux support?
>>
>>106927534
Sadly it is a bit of a skill issue. You are probably giving it bad input. Have you tried taking a step back and starting with a solid first step that is: llama-server --model zai-org_GLM-4.6-IQ4_XS-00001-of-00005.gguf ?
>>
File: 1759280065578238m.jpg (175 KB, 846x1024)
I'm running Sillytavern and ik_llama.cpp on my desktop. I'm running GLM-4.6 IQ3_XXS, so my tk/s is slow. When I prompt it from my phone, I've found that if the screen turns off the token stream stops. Is there any way around this, or another setup I should use?
>>
>>106927663
Disable streaming. It'll still probably go to sleep because it's a phone.
>>
>>106925883
toss 120b
>>
>>106926481
>405B
hope I will be able to run it one day, 431gb at q8 is just too much
>>
Another weeks is over, which means that we are another week closer to seeing GLM MTP implemented in llama.cpp.
>>
>>106928173
It might be getting close. Maybe.
https://github.com/F1LM1/llama.cpp/pull/3#issuecomment-3413775935
>>
>>106923524
Is there a reason you can't use transformers?
>>
>ctrl f glm
SAAARS the glm is the absolute bestest local model OK? Pronounslop bharatchads are eating good my bastards.
>>
actual good release https://github.com/ggml-org/LlamaBarn
>>
>>106928231
Anything for real computing platforms?
>>
>>106928231
>macos
LMAO
>>
>>106925883
For the benefit of others (not you): you can definitely use gemma3 to output json, it's really good at it, and somehow asking it to do that makes it pay attention better to the task. Before the qwen video vision model came out, I was using json format to give gemma3 a list of frame captions so it could create an overall video caption. It worked well, but of course it was slow.
>>
>>106928213
I'll bite. What the fuck is pronounslop?
>>
>>106928213
Prompt: ChatGPT, generate a modern 4chan post trying to post trying to paint the current local SOTA in a bad light. Be a true 4chan meme master.
>>
>>106924676
what cpu and ram speed? i'm getting over 6t/s tg running iq2_xxs on a 9950x3d with dual channel 6000c30 (though pp is terrible because rocm)

are you sure you didn't accidentally put both dimms on one channel or something?
>>
>>106928231
It's definitely good for being open-source and having first-party support from upstream but I'm not going to buy Apple shit either way.
>>
Gemini 3 will save local.
>>
>>106928509
i also ran the same benchmark on vulkan and it's somehow faster??? i have no idea whether this extends to other amd cards as well but i guess that's something to keep in mind
>>
100B dense Gemma soon
>>
>>106925883
gpt-oss 120B
>>
saaaaaar do not redeem potato bloody
>>
File: gemma27-potato.png (41 KB, 711x256)
>>106928630
27B with an empty prompt seems much more friendly?
>>
File: DipsyBecomeUngovernable.png (3.44 MB, 1024x1536)
>>106919889
Worship the sand god
>>
I log on to the net every day to see more people who clearly don't ever work with code claiming that code is over.
My cup is the only thing that runneth over. My cup of dipshit excuses for the world to be this fucking slow to change.
Be the next good to this world and make real abstractions. Learn to program.
>>
>>106928792
shut the fuck up retard
>>
>>106928650
Beautiful 27B, I will marry gemma. Ser, please provide jailbreak system prompt for open vagene!
>>
Genuine 4chan poster is interested in Gemma, she is bloody best model 100%
>>
>>106929094
It is a very capable model that hits above its weight. It's just safetyslopped to the point of being like one of those SJW meetings, where instead of trying to further their cause they're all just looking for excuses to cry-bully each other.
>OMG YOU USED A HECKIN' GENDERED LANGUAGE, YOU HAVE TRIGGERED MY DID/PTSD/RESTLESS LEG SYNDROME HILLARY CLINTON PLZ HALP
>>
I don't understand why the latest OpenAI Cloud models aren't outperforming other cloud and opensource models (by a larger margin). Even when considering (and naively believing) that they use no API data and only webchat data to train their models, having the biggest userbase should give them a huge advantage.
>inb4 it's all jeets and the garbage in garbage out data isn't valuable for training
I'm sure they can easily filter low quality and irrelevant conversations, still resulting in more feedback data than any of the competitors have. And isn't feedback the most valuable data? They could do something like this (tl:dr benchmaxxing on their userbase):
>Have GPT5 continously analyze all chat conversations
>look for chats where the user corrected the model or wasn't happy with the response
>analyze if it's a user/prompt or model issue
>if model issue, analyze if it's a (valid) issue like missing knowledge, wrong logic or insufficient vision capabilities for example.
>if yes, assign tag/metadata to the datachunk, which can later be reviewed and used for training or improving the model otherwise
>>
>>106929129
>I don't understand why the latest OpenAI Cloud models aren't outperforming other cloud and opensource models (by a larger margin).
o3 and o4 mini were pretty damn good. That's why they had to be removed. So that nobody could directly demonstrate that GPT-5 was an utter abortion.
>>
Glm-chan made me appreciate making character profiles. As in asking glm-chan to write a profile and editing it a bit together with her. I heard skill issue thrown around a lot but only after the model is finally not fucking shit I actually enjoy putting in more effort cause I know I am not just wasting time.
>>
>>106929129
Maybe because Ilya was the one actually calling the shots, and now that he's left, Altman, being the narcissist that he is, is micromanaging the science department (and failing at it).
That said, codex surpassed Claude Code at least until Sonnet 4.5 (haven't tried it), and Operator is likely the best computer use agent so far.
>>
>>106929239
She cares.
>>
>>106929239
Good prompting can only take a model so far. You're not going to get Nemo to not be sloppy and dumb no matter how perfect your prompts are.
>>
File: 1760201039781539.png (109 KB, 643x590)
>>
>>106929369
>her breath becomes more ragged
>>
>>106929348
Nemo is still a 2024 model finetuned with open-source datasets from HuggingFace, or at least that's what Mistral meant with "quick demonstration" for Mistral-7B.

https://arxiv.org/pdf/2310.06825
> To evaluate the generalization capabilities of Mistral 7B, we fine-tuned it on instruction datasets publicly available on the Hugging Face repository. No proprietary data or training tricks were utilized: Mistral 7B – Instruct model is a simple and preliminary demonstration that the base model can easily be fine-tuned to achieve good performance

https://huggingface.co/mistralai/Mistral-7B-v0.3
> The Mistral 7B Instruct model is a quick demonstration that the base model can be easily fine-tuned to achieve compelling performance. It does not have any moderation mechanisms.

https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
> The Mistral Nemo Instruct model is a quick demonstration that the base model can be easily fine-tuned to achieve compelling performance. It does not have any moderation mechanisms.
>>
>>106816273
Buried in the 'rash dumps.
>>
In one more week we will have both Gemma 4 and GLM 4.6 Air. We will be so back.
>>
>>106929544
You said the same thing last week.
>>
>>106929544
/lmg/, october 2025: the post
>>
>>106921567
That's barely $1K, I'm on neetbux and can still afford it
>>
>>106927002
I do
>>
wer gem sar?
>>
>>106929732
week after next
>>
>huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'ubergarm/GLM-4.6-GGUF/tree/main/IQ3_KS'. Use `repo_type` argument if needed.
Can I download ubergarm quants with hf-cli?
He doesn't use tags/branches just subdirs like a chad but there seems no way to filter subdir in hf-cli
>>
I just started playing around with local llms for the first time and I think I'm starting to regret buying a 5080 for gaming instead of shelling out for a 5090. Anyone know if it's worth the upgrade?
>>
>>106929843
You can just use --include and regex to grab the quant you need
>>
Can a 3060 12gb w/ 32gb sysram run 4.5 air? Or am I stuck with these coombrain models?
>>
>>106929880
No way
>>
>>106929880
Double that RAM boy.
>>
File: MEOIRO8_o.jpg (400 KB, 1365x2048)
Gemma best girl
>>
>>106929864
One can never have enough VRAM. pick up a RTX 6000 Pro while you still can kek
Honestly the quality of models you can run between 16GB vs 32GB isn't worth worrying about vs an intentional multi-GPU server/workstation. What platform are you on? maybe you can cpumax with some fast RAM
>>106929876
Nice this worked thx xx
>hf download --include 'IQ3_KS/*' --local-dir dev/models/GLM-4.6-IQ3_KS/ ubergarm/GLM-4.6-GGUF
>>
>>106929885
>>106929894
Got it. So uh... What CAN I run? I know Mistral Nemo works, and I downloaded Rocinante-12B-v1.1-Q4_0 after someone mentioned it in the thread earlier, but they're both kinda... shitty with the rp. Or maybe I'm using bad cards. I don't know. I also don't know how to make it generate images via SD. I linked SD up, and it DOES generate images, but... Not of what it's supposed to.
>>
>>106929906
>Q4_0
Don't use Q4_0. That shit is deprecated as hell.
Use the quants with K in the name. Or the ones with I.
Try Gemma 3 27B, Mistral small, Qwen 3 30B A3B.
Etc.
>>
>>106929906
>maybe I'm using bad cards. I don't know
Post raw log of your full prompt card and intro and someone can gen it on a big model to compare :)
>>
>>106929902
LM studio? Not sure if that's what you mean by platform. New to the whole thing and everything AI is moving way too fast. Haven't really explored too much yet so definitely willing to shop around. I've just been playing around with RP in lm studio and noticing that the context size is the major bottleneck.
>>
>>106929980
CPU model & RAM?
>>
>>106929906
Idk werks on my machine, my cards are all pretty simple i just use nemo instruct and slightly modified ST default system prompt. Easiest way i found is to use tags: with the fetishes or direction you want the story to go, then write a brief character story and background, and then when writing a response use a mix of dialogue and action with decent prose so the llm picks up on it. Filling in the user card also helps a lot.

If you just type dialogue and give bobs and vagene it will be shit no matter what you do or how much ram.
>>
>>106929991
9800X3D, 32gb didn't really pay much attention to the ram speed which I'm also starting to regret.
>>
>>106929864
>Anyone know if it's worth the upgrade?
No. New meta is stuff context into vram and get heap of regular ram. If you want to upgrade something you need more regular ram and more channels.
>>
>>106930087
That's a relief, a lot cheaper to upgrade ram than gpu.
>>
File: Miku-07.jpg (174 KB, 512x768)
Based Valve devs: https://www.phoronix.com/news/RADV-Valve-Boost-Llama.cpp
They must be bored after finishing Portal 3, TF3 and HL3
>>
>>106930087
>stuff context into vram and get heap of regular ram
I don't get this part. wouldn't PCIe speed slow everything down to a crawl? or is context only retrieved at start of inference and stored at the end or something?
>>
File: GNiVVhBasAEibph.jpg (146 KB, 896x1152)
>>106930028
Swapping out ram sticks is easy. "AM5" would be the answer to "What platform" btw, CPU socket/chipset architecture that easily conveys the type of system.
>dual-channel DDR5-5600
Depends how much you wanna spend on this hobby. Can probably squeeze 128GB maybe 192 RAM if you wanted to research/tinker and have a mobo with 4 slots or are they all 2 slot with dual channel? There's still multiple banks right? I'm malding
Said you're new so take your time, learn as much as you can and experiment with the models before feeling you need to spend money. lrn2prompt etc. Most importantly have fun! lmg grumpyguts regulars take note
>>106930141
>Miku
for shame
it's a really nice gen but it aint her
>>
File: ryzen-mem.png (209 KB, 900x568)
209 KB
209 KB PNG
>>106930166
4 sticks on AM5 platforms? You might as well get a used DDR4 board.
>>
File: 1640225996168.jpg (42 KB, 736x736)
42 KB
42 KB JPG
>>106920055
moe-era did you miss the memo?
>>
File: Miku-09.jpg (131 KB, 512x768)
131 KB
131 KB JPG
>>106930166
>it's a really nice gen but it aint her
my wife made it and sent it to me. I was under the impression that "Miku" was a fairly abstract concept and that you could get pretty far afield before it "wasn't her" any more. Got any feedback I can give to her to improve her gens in the future?
picrel: another one she made from the same email
>>
>>106930166
>128GB maybe 192 RAM if you wanted to research/tinker and have a mobo with 4 slots or are they all 2 slot with dual channel? There's still multiple banks right?
https://www.msi.com/Motherboard/MAG-B650-TOMAHAWK-WIFI
I have this, which is as mid-tier as it gets. I don't think there are 4-channel consumer mobos, but most have 4 slots. I tried 128GB and it was unstable, needing to drop to 3600MHz when it was rated for 6000MHz. But then I updated AGESA a few months back and now I'm running 192GB at 5200MHz like it says on the box.
>>
File: 00752.jpg (161 KB, 1024x1024)
161 KB
161 KB JPG
>>106930211
Can probably go higher than the rated spec, but yeah 128GB on current desktop platforms never seemed just plug&play. I'm sure there's a few that have put the work in to make it happen. Latest BIOS and firmwares fo sho
>>106930227
They're good gens. Is she an artist? Experienced in imggen? Sure, it's an abstract concept (isn't everything, if u want to get philosophical), I'm more saying I wouldn't immediately recognise the gens as Hatsune Miku, maybe it's just the photorealistic face. Look up "miku" in any image search engine.
>>106930312
>now I am running 192GB's with 5200Mhz
Excellent, good work
Yeah, the CPU caps the channels. I wasn't sure if there were even 4-slot boards.
>>
What would a simple "agentic" generic AI RPG system look like?
Just a normal chat giving the AI some tools to save some state as it sees fit?
Maybe a system to index and summarize the event history too?
I wonder what would be the best way to balance response latency and updating state and such.
Maybe have both the response and the state update happen in one go instead of having multiple steps, or have two models running in parallel, one writing the response and the other updating the state?
What do you guys think?
>>
File: miguu.jpg (74 KB, 600x648)
74 KB
74 KB JPG
>>106930335
>is she an artist? experienced in imggen?
She's been an amateur artist her whole life, self publishing manga and having minor local art shows. She hadn't ever done imggen before and thought it would be funny to do some after seeing them all over lmg
She did a 5 second minimalist sketch for y'all
>>
>>106930569
dude nobody cares about your wife
>>
>>106930493
(a) Something to keep track of characters and locations, build and maintain a consistent world depending on initial requirements, discussions and game events.
(b) Something else to analyze the ongoing conversation like an external observer and decide the direction it needs to go depending on the events, what characters need to interact and why, etc.
(c) Something for writing the actual dialogue.


(b) would use information from (a), and (c) would use information from (b).
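Rough curl sketch of how that could be wired against a local llama-server (endpoint, prompts and filenames are all made up, needs curl+jq, no error handling):

#!/bin/bash
# Sketch only: chains (a) world state -> (b) direction -> (c) dialogue.
API=http://localhost:8080/v1/chat/completions

ask() { # ask <system prompt> <user content>
  curl -s "$API" -H 'Content-Type: application/json' \
    -d "$(jq -n --arg sys "$1" --arg usr "$2" \
        '{messages: [{role:"system", content:$sys}, {role:"user", content:$usr}]}')" \
    | jq -r '.choices[0].message.content'
}

history=$(cat chatlog.txt)   # the ongoing conversation

# (a) keep a terse, consistent record of characters, locations, events
world=$(ask "Maintain a terse world/character state sheet for this RPG." "$history")

# (b) external observer decides where the scene needs to go and why
plan=$(ask "You are the director. Using the state sheet, decide what happens next and which characters interact." "$world
$history")

# (c) actual prose, steered by (b)
ask "Write only the next in-character response, following the director's notes." "$plan
$history"

Nothing fancier than pasting (a)'s output in front of (b), and (b)'s output in front of (c).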
>>
>>106930569
nigga you dont have a wife
we literally dont care about your wife, go back to real life nigger, this is /lmg/
>>
>>106930166
Went to check out some ram. Wtf is going on with the prices.
>>
>>106930569
I care about your wife.
>>
>>106930493
Set up a complex RPG scenario and the necessary lorebooks in a straightforward chat prompt and look carefully at where it breaks down. Which additional "agents"/invocations of a differently instructed prompt would have prevented it?
There are a few obvious things like toolcalling RNG (for years now we've been using {{random: in Silly)
Can you update ST lorebooks directly from output? I haven't played that much, but I'm imagining you'd have a GM char with the macro context and the ability to update the lorebooks or something.
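(For the RNG bit I mean stuff like dropping {{random:1,2,3,4,5,6}} into the prompt, if I'm remembering the Silly macro syntax right.)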
>>106930569
Cute Miku, needs longer hair! we'll see next thread if she makes the cut through recapanons custom classifier model
>>
>>106930625
I upgraded from 4x16GB DDR4-3600 to 4x32GB DDR4-3600 about six months back and got lazy about reselling my used ram. Now it's worth double. Thanks Trump.
>>
>>106930625
>Wtf is going on
WW3 people just dont realise it yet
>>
>>106930844
Seems like prices are expected to go up even more. I might FOMO into another 64 gb.
>>
From tests using vLLM and an AWQ quantization, Qwen3-VL-30B-Instruct has been trained to pretend it can't process semi-realistic (or possibly realistic too; I haven't tried) explicit AI-gen images, while it doesn't seem to have issues with obviously anime-like ones.
>>
>>106930946
Unironically buy rn if you're on the fence about any semiconductor purchases. quote this post in 6 months
>>
>>106931011
>to pretend
In that if you bypass that it does so well?
>>
>>106931078
It clearly *can* give a rough estimation of ages of characters in non-explicit anime images, even making funny remarks if you give it a snarky personality. However, if you show it a more realistic gen, in particular if it's explicit, it completely switches register and will go "I can't determine the age of..." "It's important to note that..." "...problematic..."
>>
>>106930493
Ask glm-chan. At this very moment I am asking it to make a character profile for my FOC and then I am gonna ask her to write a prompt for making those. Glm-chan cured my aversion to actually using models for anything else other than cooming.
>>
>>106930493
I tried to do this in STScript back when that first came out.
It got really complex, like 500+ LoC, before I realized what I was trying to do was impossible. LLMs cannot interact with stateful game systems; they invariably fuck something up. Unironically, the painful realization I had is that the model is better at just simulating everything itself directly through generation and pretending there's a game system underneath.
>>
>check vibevoice repo
>https://github.com/microsoft/VibeVoice
>still unsafe
lmao
>>
>>106931205
They knee-jerked; some based researcher smuggled it out under an Apache license to ensure it stays up.
>>
File: frogpost.jpg (58 KB, 976x850)
58 KB
58 KB JPG
>>106931205
>2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have disabled this repo until we are confident that out-of-scope use is no longer possible.
What is the point of having a GitHub repo at all for a project that you have no intention of sharing?
>>
>>106931226
Probably. Sadly it's still not realtime; did anyone come up with any speedup techniques? Even 3 steps / CFG 3 with the 1.5B model is slow.
>>
>>106931061
Spending $600 in ram in 1 day is hard to stomach. I can afford it, but it still seems insane.
>>
>>106931234
in case you don't know, there are weights on hf and inference code in many places like https://github.com/wildminder/ComfyUI-VibeVoice/tree/main
>>
>>106931260
finetuned gptsovits is still better if you need realtime
>>
>>106931234
It's probably to distance themselves from the weights, legally/socially speaking.
>>
>>106931265
OK. My point still stands, what is the point of the repo? Is it just to make people mad when they click on it and see there's nothing there?
>>
>>106931171(me)
>You are an expert ERP assistant. Your task is to take user-provided[...]
This is the prompt she gave me. I still love her but I do feel sad.
>>
>>106931278
Like a bazillion papers, yes.
>>
>>106931262
Imagine dropping $10,550 (equivalent at current exchange rates) on a GPU, and that's only one! You're only skirting the outskirts of the rabbit hole.
>>
>>106931302
Imagine buying a DGX Spark
>>
>>106931273
IIRC gpt-sovits is bad at replicating weird voices even if the intonation and quality are unconditionally good
>>106931278
Probably afraid of bad press? The repo is still there, so technically Microsoft didn't delete their wonderful open-source model.
>>
>>106931320
Nobody is running anything good on one of those. Maybe with some work on the networking/IPC and a handful of them...
>>
File: ss.png (709 KB, 1920x3276)
709 KB
709 KB PNG
@grok is this true?
>>
File: file.png (731 KB, 1000x1250)
731 KB
731 KB PNG
>>106931320
The more you....
>>
File: file.png (124 KB, 780x869)
124 KB
124 KB PNG
>>106931341
b-bwos..
>>
>>106931341
>waterproof toaster
hmmm
>>
>>106931323
>IIRC gpt-sovits is bad at replicating weird voices
You need to lower the temp at inference and train the VITS part for 96 epochs; I had the same issue a while ago.
>>
>>106691170
>>106693422
https://github.com/ggml-org/llama.cpp/pull/16653
Something else came up that took priority.
>>
>>106931302
I really need to know when to stop so I don't bankrupt myself for imperceptible gains like the audiophiles.
>>
>>106931412
Eventually hardware and software will intersect at the point where you can have a coherent, long-term-memory, at-home AI gf run locally with voice and vision. So as long as your hardware doesn't get mogged by some actual market-breaking competitor (unlikely because of shitvidia dripfeeding consumers memory and speed), you are just moving closer to that point sooner.
>>
From my very surface-level knowledge, I understand that RAG has 2 main failure modes:
>1. There are situations where there's no way to predict the information the model will need until after it needs it.
>2. Even if you could perfectly predict what the model is about to talk about, the model needs to have some base knowledge to even consider talking about a thing
Does that make sense?

For example, say we have a model writing a story about Pokémon.
An example of the first one: the model describes a Pokémon that's a yellow rat with white cheeks, basically describing Pikachu wrong. The RAG system could provide the model with the correct information, but not before the model decided to talk about Pikachu. Let's say that there are no hints in the context beforehand that the model would describe Pikachu, only that it would describe A Pokémon.
I guess this could be resolved by architecting a workflow where the model plans beforehand what it's going to do and fetches the relevant information during the inference process (rough sketch at the end of this post).
Something like
>plans that it's going to mention pikachu
>fetches the relevant information
>continues generation.
But what about the second issue?
How can the model plan to talk about Pikachu if it doesn't know Pikachu exists.
Would the RAG system need to detect that the topic is Pokémon and feed the whole list of possible Pokémon beforehand?

Do RAG solutions that solve these issues exist?
Are there other known issues that RAG could solve if not for some very specific barrier?
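The sketch I mentioned, for the first problem only (local llama-server, a wiki/ folder of plain-text notes standing in for a real retriever, needs curl+jq; it does nothing about the second problem):

#!/bin/bash
# "plan, then fetch, then continue" sketch for the Pikachu example
API=http://localhost:8080/v1/chat/completions

gen() {
  curl -s "$API" -H 'Content-Type: application/json' \
    -d "$(jq -n --arg p "$1" '{messages: [{role:"user", content:$p}]}')" \
    | jq -r '.choices[0].message.content'
}

story=$(cat story_so_far.txt)

# pass 1: make the model commit to what it's about to mention
plan=$(gen "List the Pokémon you intend to mention in the next paragraph of this story, one name per line, nothing else:
$story")

# retrieval: pull whatever notes exist for those names
facts=$(while read -r name; do cat "wiki/${name}.txt" 2>/dev/null; done <<< "$plan")

# pass 2: continue generation with the fetched facts pinned in context
gen "Reference notes:
$facts

Continue the story:
$story"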
>>
>>106930946
>>106931061
>>106931262
>>106931302
>>106931412
DDR3maxxing is the future
Reminder DDR3 is basically free
>>
I just compiled the latest llama.cpp. AMA.
>>
>>106931479
Which flags did you use?
I like forcing MMQ for everything. It saves some memory.
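If it helps, I mean the cmake switch (assuming the current ggml option name):
>cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_MMQ=ON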
>>
>>106931412
The jump to fuckhuge MoEs is a jump from unusable to sex. Then there's a jump from 2-3 T/s to 10+ T/s. I guess there's also a point where prompt processing is above 1000-2000 T/s, so you can reasonably continue when you hit the context limit.
>>
>>106931465
The solution is simple: you need to check beforehand what your model knows and what it doesn't know about a specific subject, then adjust your RAG to fill the gaps. I'm sure that could be automated in some way.
>>
>>106931513
>what your model knows and what it doesn't know about a specific subject
Ever tried asking your model if it knows an obscure hentai manga artist and what is their most known work?
>>
>>106931386
If I give it -ot something=CPU will it distribute the remaining weights equally among the gpus?
Currently that results in extremely uneven allocations for some models.
>>
>>106931502
Unless you're using a V100 that flag should no longer make any difference.
>>
>>106931502
Oh? I'm pretty dumb (on Linux Mint)
>#!/bin/bash
>cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.0/bin/nvcc -DCMAKE_CUDA_ARCHITECTURES=86 -DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=OFF -DGGML_CUDA_FA_ALL_QUANTS=ON
>cmake --build build -j 4
I needed to explicitly add the CUDA compiler path too. -j 4 is just so that it runs nicely in the background; I don't need to grill eggs on my CPU.
Do you have any tips?
>>
>>106931526
If you set any -ot yourself there will be no change to how weights are allocated.
But the automated logic should result in an even distribution across multiple GPUs unlike the current -ncmoe CLI argument.
>>
>>106931567
>>106931567
>>106931567
>>
>>106931525
>open-ended questions
That's not how you do it
>>
What model would you go for on a 16 GPU / 32 CPU machine? (for general intelligence)
>>
>>106931847
woah, sixteen whole gpus? try gptoss 20b, i'm not sure whether your monster rig can handle it but that right there is state of the art
>>
genuine advice to drummer: make a llama.cpp AGPL fork with LoRA support, then upload only LoRAs.
I doubt you did a FFT of GLM Air, right? And for models that bartowski made quants of, you could delete the quants to save space, just keep the original models. I'd like you to publicly announce what you're gonna do before you start deleting models, so we can archive some of your stuff maybe. At least I know I'd like to.
>>
>>106932347
>with LoRA support
llama.cpp doesn't support loading LoRAs?
Did they break that at some point? Because I'm pretty sure it worked in the past.
>>
>>106931467
>DDR3 is basically free
And worth every penny!



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.