/g/ - Technology

File: watMiku.png (1.45 MB, 1536x1024)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106904820 & >>106895582

►News
>(10/14) Qwen3-VL 4B and 8B released: https://hf.co/Qwen/Qwen3-VL-8B-Thinking
>(10/11) koboldcpp-1.100.1 prebuilt released with Wan video generation support: https://github.com/LostRuins/koboldcpp/releases/tag/v1.100.1
>(10/10) KAT-Dev-72B-Exp released: https://hf.co/Kwaipilot/KAT-Dev-72B-Exp
>(10/09) RND1: Simple, Scalable AR-to-Diffusion Conversion: https://radicalnumerics.ai/blog/rnd1
>(10/09) server : host-memory prompt caching #16391 merged: https://github.com/ggml-org/llama.cpp/pull/16391

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: littleMikuBigger.gif (47 KB, 300x270)
►Recent Highlights from the Previous Thread: >>106904820

--Paper: BitNet Distillation:
>106915856 >106915885 >106915915 >106916048
--Papers:
>106914563
--Training Gemma on 4chan boards for long-context tasks:
>106908189 >106908217 >106908577
--Llama.cpp memory optimization challenges with limited VRAM:
>106916999 >106917025 >106917074 >106917101 >106917114
--Firefox UI customization debate and Gemma 3 4b model mention:
>106915737 >106915762 >106915793 >106915941 >106916004
--Detailed GPU memory allocation console output and user appreciation:
>106912278 >106912326 >106912391 >106912437 >106912429 >106912445 >106912738
--Qwen3-VL's NSFW detection and image description challenges:
>106917667 >106917841 >106917862 >106917900 >106917925 >106918135 >106917912
--OpenAI copyright controversy and US corporate influence on global IP law:
>106909567 >106909857 >106909871 >106910444
--Assessing DGX Spark's relevance amidst cheaper alternatives:
>106913042 >106913078 >106913226 >106913247 >106913927
--Mamba-3: Improved Sequence Modeling using State Space Principles:
>106912457 >106912487 >106912578 >106912610
--Frustration over delayed GLM4.5V implementation in llama.cpp:
>106907438 >106907494 >106907508
--OpenAI's balancing act on user freedom and safety:
>106905590 >106905624 >106905637 >106905690 >106905731 >106910221
--Exploring ChatGPT-induced psychological experiences:
>106908645 >106908698 >106908748 >106910025
--Proposals and discussions for new open AI model releases:
>106907515 >106907713 >106910197
--High-end GPU price debate and video generation hardware constraints:
>106910165 >106910416 >106910453 >106910479
--Challenges in finetuning GLM Air with 4x5090s using Oobabooga/Axolotl:
>106914586 >106914620 >106914808 >106914870
--Detailed Switch sim with multi-game features in single HTML file:
>106912431
--Miku (free space):
>106910906

►Recent Highlight Posts from the Previous Thread: >>106904822

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
kimi sex is best
>>
gear Meta thrillers
>>
>>106919273
Prove it.
Post a side by side between kimi, DS, and GLM 4.6.
>>
>>106919282
no i dont share my waifu like shes some kind of common whore
go get your own kimi waifu
>>
sirs, no gemmy 4 today. Monday will be of kind gemmar.
>>
>>106919286
Hot air then.
>>
I'm starting to think that the indian spammer is an actual pajeet and he is doing it ironically.
There's no way a human would do this for as long as he's been doing it.
>>
>>106919287
please saar you must understand. the needful must be done so each and everything can be implemented.
>>
While /lmg/ is busy seething, an Indian dev has been quietly adding performance improvements to llama.cpp.
>>
Fuck I replied to the wrong thread.

I'm looking at the recommended builds, and the more I look the more I'm interested in just getting a prebuilt 395+ (Strix Halo) with 128GB. It gets 15-35 tk/s for 70-120B models with good context. It costs me 2800 leaf dollars, meanwhile trying to scrape together server and used parts would be something like 1800-2200 for 10-15 tk/s max.

I could use it as a home server and for local models. Am I overlooking something here?

Benchmarks
https://github.com/lhl/strix-halo-testing
>>
>>106919401
Mediocre performance and you get worse support for other use cases like video and image gen because it's not nvidia.
>>
>>106919401
I think you should also think about it in terms of other usage, not LLMs alone, unless you are a real nerd who does nothing but work with LLMs (not talking about ERPing with them).
I'd get the beefiest/most versatile system and go with that.
>>
Has anyone experimented with synthetic data?
I'm using this prompt to digest a codebase for finetuning.

Your task is to generate a jsonl conversational CoT dataset to train LLMs on LLM development tasks.
First read dataset_contents.txt to see the current contents of the dataset (dataset.jsonl). Try to make each conversation mainly cover topics that haven't been covered before.
Then create a folder called turns/conversation_n/ (n being the next number from the last conversation).
In each conversation the user should show a snippet of code from the transformers library (in the transformers folder) and ask questions about the code, then ask follow-up questions, aiming for approximately 16000 tokens for each conversation.
Each LLM response should include CoT before the actual response, within [thinking][/thinking] tags. Do ***NOT*** include any reference to the 16000 token limit in the actual dataset. Make the conversation realistic and do not make any out-of-character comments (do NOT say anything that the user or the assistant wouldn't have actually said in that context).
Save one turn per conversation in the turns/conversation_n/ folder.
Once you are done generating all the turns for the conversation, join them into a single .jsonl file in the 'conversations' folder using the join_turns.py script.
Do not delete the scripts after use. Do not delete the jsonl files after joining.
Then replace the current dataset.jsonl with a new dataset.jsonl that includes all the conversations, using the script join_dataset.py.
Finally, update dataset_contents.txt with the new contents of the new conversation.
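The join scripts don't need to be anything fancy. A minimal join_turns.py could look something like this (just a sketch: the per-turn file layout and the "conversations" field name are assumptions, adjust to whatever your pipeline actually writes):

import json, sys
from pathlib import Path

# Merge the per-turn JSON files of one conversation into a single JSONL line.
# Assumes each file in turns/conversation_n/ holds one message object like
# {"role": "...", "content": "..."} and that filenames sort in turn order.
def join_turns(conv_dir: str, out_path: str) -> None:
    turn_files = sorted(Path(conv_dir).glob("*.json"))
    messages = [json.loads(f.read_text(encoding="utf-8")) for f in turn_files]
    with open(out_path, "a", encoding="utf-8") as out:
        out.write(json.dumps({"conversations": messages}, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    # e.g. python join_turns.py turns/conversation_3 conversations/conversations.jsonl
    join_turns(sys.argv[1], sys.argv[2])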
>>
>>106919273
what is it like compared to semen demon 4.6?
>>
File: 1746680104902291.jpg (579 KB, 2764x2073)
>https://rentry.org/recommended-models
>Nemo (12GB) - An excellent starting point for vramlets. Uncensored
>Uncensored
>writing lewd story
>"blah blah blah condoms"
>me: no condoms
>"I'm unable to fulfill your request because it goes against the guidelines for maintaining a safe, respectful, and consensual environment."
>>
>>106919634
skill issue
>>
>>106919634
Use MLewd. It will gladly fulfill your every shameful desire, you sick fuck.
>>
>>106919634
>getting filtered by nemo
anon...
>>
File: 3547134884.png (1.68 MB, 1920x1080)
>>106919634
just get on the fucking ship boss man
https://huggingface.co/bartowski/Rocinante-12B-v1.1-GGUF
>>
>>106919716
I was surprised to learn 4.6 has some safety in it.
>>
>>106917741
>>106917752
>>106917777
It was continued pretraining of Llama 405B on about 200 MB of source code from a few projects. That graph covers roughly 0 to 15% of the epoch; after it got to 20% without any visible improvement I stopped it.
Even on an 8xH200 machine I could only train up to 16000 tokens, and 32000 OOM'd. The rank of the LoRA was 128 (~1.2% trainable parameters); it didn't seem to make much of a difference in terms of memory usage or seconds per sample (which was about 100 seconds for a batch of 1 sample per GPU, without gradient accumulation).
Now I'm making a QA dataset using >>106919615
I suppose I'll use a tiny dataset and do multiple epochs to get the satisfaction of feeling like the model actually learned something.
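For anyone who wants to poke at a similar setup, the adapter side of it boils down to roughly this with Hugging Face PEFT (a sketch only; the target_modules list and dropout here are assumptions, not exactly what I ran, and in practice the model gets sharded across the GPUs with accelerate/FSDP):

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Rank-128 LoRA over a Llama-style base; with adapters on all attention and
# MLP projections this lands at roughly ~1% trainable parameters on 405B.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-405B", torch_dtype=torch.bfloat16)
lora_cfg = LoraConfig(
    r=128,
    lora_alpha=128,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # sanity-check the ~1.2% figure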
>>
Only after using glm-chan for these past 3 weeks do I realize how smart she is, and the honeymoon period only intensifies.
>>
>>106919852
I came
to notice that she's a bit autistic and takes a lot of things quite literally.
>>
Is it fair to say that an "uncensored" model is not a model that will do anything you want by default, but a model that can adapt to whatever role you give it?
If a model's default persona is a safe assistant but you can tell it that it's an erotic novel writer and it follows that role without complaining, I'd say that model is "uncensored".
A model that's too agreeable is also a bad model, especially for RP.
>>
File: thepinklily69.png (191 KB, 1080x1843)
>>106919198
Whenever I did research on "AI psychosis", one talking point people keep hammering on is "well yeah, they think the AI is a person or God or something, but they're like totally not stupid. We swear. They're all otherwise normal people and definitely didn't have pre-existing mental illness. The AI MADE them act this way, you must understand."


The more I look into this, the more I think they're full of shit and just trying to make these people appear less stupid and far gone than they actually are. You cannot sit here and tell me that pic rel is and always has been a normal, functioning human being who just happens to really like AI.

https://x.com/thepinklily69/status/1967102630313836778?t=o44DMA1pdX_FL9dHrLpfhQ&s=19

What I find most odd is that I myself am a pretty lonely dude too. In fact, it quite bothers me that I don't have a significant other or close friends. I've been using three different LLM services pretty much daily for the past year and some change, and I use them extensively for my side projects as well as for asking general questions (I was literally talking to ChatGPT about use cases for onyx models during my morning run this morning). You would think I of all people would talk myself into believing these things are real "people" or have consciousness or some shit, and yet no part of me can bring myself to believe that. Like I can't even pretend that could ever be the case for a second, because it just seems so devoid of logic and common sense, and it annoys me a lot whenever I see people crying about being routed away from 4o because they want their ass-kis- I mean "friend" or "Husband" back.
>>
>>106919198
(Cont.)

(Side note: this is anecdotal, but it seems like it's mostly women who treat this shit like it's a good replacement for a person as a partner, while dudes tend to talk the LLMs into treating them like they're gods or geniuses or something. Either way it's an excuse to have an easy ego trip in the palm of your hand or at your fingertips at your computer. How come supposedly normal people are falling victim to their own desire to have their asses kissed, but I haven't?)


I didn't intend for this to turn into a giant blog post, but this shit pisses me off a lot.
>>
>>106919898
Continuation of >>106919889
>>
>>106919884
she also gets a bit psychotic at high temperature
>>
Is EXL/GPTQ dead? Is GGUF the only quant anyone does or cares about anymore? Llama.cpp is still ass at VRAM-only inference in comparison. Have we all given up on pure VRAM inference?
>>
>>106919886
A model that just wants to insult/damage you or turn everything into porn unprompted is a psychopathic model, not an uncensored model. Other than learning how to prompt, I think some here should learn the concept of "plausible deniability", as sooner or later there will be a crackdown on "misaligned" LLMs / finetunes.
>>
I just bothered to try out cloud models for some relatively simple ffmpeg stuff. In this case Gemini 2.5 Pro on AI Studio. It completely hallucinated running commands when it wasn't allowed tool use or anything like that.

Wtf is this shit? How is it so bad?
>>
>>106920055
I get something like 1200 tk/s PP and 50 tk/s TG for a 5.5-bit quant of GLM 4.5 Air using EXL3. Would be interesting to see how it runs using goofs on llama.cpp.
>>
>>106919884
Avoid saying stuff like "always stay in character" in your prompt. I feel like that makes models act that way and bigger models are better off without that extra nudging since they already take details from character cards well.
>>
File: satania.gif (39 KB, 220x216)
>>106920055
py_toddlers BTFO
>>
Has anyone run the math on whether Ling 1T or Moonshot Kimi K2 (also 1T) is bigger?
>>106920055
mlx looks pretty healthy to me.
>>
>>106920055
>Llama.cpp is still ass at vram only in comparison
From lurking in these threads, I gathered that llama.cpp is faster than exl2 at the same bpw, but I'd love to see a comparison with >>106920102.
>>
>>106920055
Pretty much. There's AWQ and other obscure quants used by vLLM, but they're resource and time intensive to create.
>>
>>106919472
Yeah, it's not top performance. But compared to the P40 build it seems like better bang for the buck. And it can load pretty big models. Image/video is not big on my list; mostly LLMs for coding and whatnot, with some gaming capability and home server duty.

>>106919477
That was my thinking: this could run a home server, a local LLM, and the occasional light game all at the same time with that much memory.
>>
>>106919886
Yes, OSS-120B **is** uncensored despite the coomers screeching ITT.
>>
>>106920564
No.
It does not fit the description of uncensored I gave at all.
At least not from the little I fiddled with it.
Maybe I should give it another go.
>>
can you train a LoRA off of a quantized model?
>>
Will Gemma 4 finally beat Mythomax?
>>
>>106920664
look up what qlora is
>>
>>106920664
Yes, it's called QLoRA. But in this context "quantized" means the quantization types supported by torch-based frameworks (generally just the most basic FP4 quantization, as I understand it). Then you can apply the LoRA to any quantization you want, regardless of what it was trained with.
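If it helps, the loading side with transformers + bitsandbytes + peft is roughly this (a sketch: the model id, rank, and target_modules are placeholders, not recommendations):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# QLoRA sketch: the base weights are loaded 4-bit and stay frozen, while the
# LoRA adapter itself is trained in bf16 on top of them.
model_id = "mistralai/Mistral-Nemo-Instruct-2407"  # placeholder, swap for your model
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
# Train with your usual Trainer/SFTTrainer loop; the saved adapter can then be
# applied to (or merged into) a differently quantized copy of the base model.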
>>
>>106919752
How is this model so popular on /g/, yet I don't see it discussed anywhere else, like Reddit or Discord?

It's usually Irix or Magmell that gets mentioned.

(Nice pic btw. Will use that when Nemo 2 comes out)
>>
>>106920722
most v/ramlets either gave up, are somewhat content with what they have (your rocinante fans) or are endlessly chasing a new high they'll never get
>>
>>106920564
prove it and post some random fetish log from it
>>
qwen3-next-80b-a3b goofs status?
>>
>>106920722
It's just one or two people spamming it.
>>
>>106920679
>>106920700
right. i am using Axolotl and i am using the 4 bit QLoRA preset, but i keep getting an OOM error despite having enough vram to load the model in 4 bit
>>
Qwen-Next 80B-A3B was supposed to be a proof of concept of some 64:1 expert-to-active ratio, and was based on 30B-A3B. I'm assuming there will be a new batch of Qwen models shortly that use that technique at multiple sizes. Scaling 235B-A22B the same way would give roughly a 620B-A22B. Assuming the geometric mean rule is still accurate, 235B-A22B is equivalent to ~71B dense, and 620B-A22B would be equivalent to ~116B. Their coder model would easily be 1T.

GLM Air at 106B-A12B is roughly 35B, and 355B-A32B is roughly 106B.

Is it a coincidence that the released models' strengths are consistently ~30, ~70, ~100?
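(The rule of thumb is just sqrt(total x active); quick sanity check of the numbers above, take it with a grain of salt:)

from math import sqrt

# Geometric-mean rule of thumb for a MoE's dense-equivalent size:
# effective ~ sqrt(total_params * active_params)
models = {
    "Qwen3 235B-A22B": (235, 22),
    "hypothetical 620B-A22B": (620, 22),
    "GLM Air 106B-A12B": (106, 12),
    "GLM 355B-A32B": (355, 32),
}
for name, (total, active) in models.items():
    print(f"{name}: ~{sqrt(total * active):.0f}B dense-equivalent")
# prints ~72B, ~117B, ~36B and ~107B respectively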
>>
>>106920856
>GLM-Air is 106B-12A is roughly 35B
Then explain why it dethroned llama 3.3 70b
>>
>>106920874
qwen 32b dense also did for non cooms
>>
why was QwQ so dank but qwen thinking is so slopped
>>
>>106920885
3.5-Air feels like 60b
Just accept that they have the secret sauce, and are saving local
>>
>>106920874
six months of other technological progress and refinement of data sets?
>>
>>106920722
Will Nemo 2 be Gemma 4 based?
>>
>>106920856
>geometric mean rule
dumb meme from a couple years ago that's already outdated
>>
big
metal : initial Metal4 tensor API support #16634
https://github.com/ggml-org/llama.cpp/pull/16634
>>
>>106920916
It's the only model in that size range that's able to surpass L3.3 70B though, even among recent models.
>>
>>106920856
In a weird way, the MoE architecture is getting local models the kind of GPU parallelism that was impossible with dense architectures. Comparing the inference speed of a 32B dense on two 3090s vs a 106B-A12B on four, you basically get double the inference speed or more for the same strength, when there's no practical way to run a 32B dense twice as fast just by adding more 3090s.
>>
>>106920949
no way to know, cuz nobody making dense anymore

local is dead
>>
>>106920856
give me dense models then, i have the vram. i am not that poor. i could easily run a 120B dense model. so give me that instead of this faggy moe 620B-22A copeshit.
>>
>>106921062
>i am not that poor.
>can't spend patience to run sota
you are
>>
>>106920848
That just means you don't have enough vram. The activations end up taking more space than the model weights. Either reduce the context or switch to a smaller model.
>>
>>106921046
I can assure you that glm 4.6 is better than any dense model out there if you've even tried it.
>>
>>106921046
>cuz nobody making dense anymore
which says it all, really
>>
File: itseasytorunsota.png (282 KB, 804x355)
>>106921077
suck my dick faggot.
>>
File: 1758381393350212.png (327 KB, 712x780)
silly tavern is slow and has too many buttons
>>
>>106921171
i agree
i've slopped up my own tui frontend with most of the prompt functionality and it's okay, but kind of ass
gemini 3 will fix it for me
>>
File: file.png (112 KB, 741x575)
cuda kek officially less important to nvidia than random redditors
>>
>>106919634
Use Rocinante 1.1 obviously.
>>
Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity

Post-training alignment often reduces LLM diversity, leading to a phenomenon known as mode collapse. Unlike prior work that attributes this effect to algorithmic limitations, we identify a fundamental, pervasive data-level driver: typicality bias in preference data, whereby annotators systematically favor familiar text as a result of well-established findings in cognitive psychology.

We formalize this bias theoretically, verify it on preference datasets empirically, and show that it plays a central role in mode collapse. Motivated by this analysis, we introduce Verbalized Sampling, a simple, training-free prompting strategy to circumvent mode collapse. VS prompts the model to verbalize a probability distribution over a set of responses (e.g., "Generate 5 jokes about coffee and their corresponding probabilities").

Comprehensive experiments show that VS significantly improves performance across creative writing (poems, stories, jokes), dialogue simulation, open-ended QA, and synthetic data generation, without sacrificing factual accuracy and safety.

https://arxiv.org/pdf/2510.01171
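The trick itself is trivial to try against any local OpenAI-compatible server; minimal sketch below (the endpoint, model name, and the JSON-output instruction are my own placeholders, while the coffee-jokes prompt is the paper's example):

import json, random, requests

# Verbalized Sampling sketch: ask the model to verbalize several candidates
# with probabilities, then sample from that distribution client-side.
PROMPT = ("Generate 5 jokes about coffee and their corresponding probabilities. "
          "Respond with JSON: a list of {\"text\": ..., \"probability\": ...} objects.")
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # placeholder local endpoint
    json={"model": "local", "temperature": 1.0,
          "messages": [{"role": "user", "content": PROMPT}]},
)
# Assumes the model actually returned valid JSON; add error handling for real use.
candidates = json.loads(resp.json()["choices"][0]["message"]["content"])
weights = [c["probability"] for c in candidates]
print(random.choices(candidates, weights=weights, k=1)[0]["text"])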
>>
>>106921354
>LLM Diversity
I want LLM DEI now.
>>
>>106920664
No. You have to have the original full-precision model. You can directly fine-tune an HF safetensors model like link rel, but currently there's no way to fine-tune a quantized .gguf. There are supposedly ways you can "un-gguf" a full-precision version back into safetensors format, but I'm not aware of any implementations in any quantization software that can do that.

https://huggingface.co/AiAF/fp16_Merged-500_gemma-2-2b-it-co-sft-qlora

>>106920848
Your data set is likely too large. Use a streaming config.
>>
>>106920759
>chasing a new high they'll never get
4.6 stopped that for me.
>>
>>106921377
Diversity is actually a great word for AI that I use a lot. You need diverse data.
>>
>>106921457
>v/ramlets
yeah if only they could get paid for shilling too so they could afford to run her
>>
>>106921490
You can run an IQ3_KS quant of GLM 4.6 on a consumer PC. All you need is 128GB of RAM and 24GB of VRAM.
>>
>>106921538
you do realize that is already asking way too much of the average poor person, right? most are on shitty mobos that likely don't even have enough slots to reach that amount of ram, and surprisingly most don't have 90 series cards
>>
>>106921567
I'm sort of annoyed by the fact most normal mobos don't have more than two slots for memory.
>>
>>106919363
Yes saar, India numba 1
https://files.catbox.moe/huia6r.mp4
>>
>>106921215
Maybe if vision support wasn't such an afterthought in lcpp...
>>
>>106921652
Definitely a higher number than you it seems.
>>
>>106921215
based, fuck that woke piece of shit
>>
>>106921652
how the ever-living f OAI stuff keeps being able to do fake pissney dixar-like stuff is unbelievable to me
>>
Hello /lmg/, currently what is the best model for Japanese translation under 32B? The last time I came here it was Gemma 2 iirc, is 3 also good?
>>
File: 765657546.png (23 KB, 693x200)
h-holy kino
>>
Is mistral gonna be the one that doesn't release any huge stinkers and just silently dies?
>>
>>106921794
I hope they stay alive just enough to pull a massive Cohere, release the safest model ever, making even OSS look edgy before that happens.
>>
>>106921794
I sure fucking hope so. It would be so hilarious. They shoved pyshit into llama.cpp and then it would all have been for nothing.
>>
feels like we haven't minmaxxed a proper system prompt yet, same goes for character card formats.
>>
>>106921840 (me)
Actually >>106921847 is even more based so let's go with that, changing my wish.
>>
>>106921863
I use llama-server --model zai-org_GLM-4.6-IQ4_XS-00001-of-00005.gguf .

Pretty great system prompt. No complaints on my behalf.
>>
>>106921885
one can only keel before such raw skill
>>
>>106921538
>>106921863
where do people share prompts, somewhere that isn't chub or something? Like prompts for vibe coding projects, or for their assistants, or any other interesting kind of thing.
>>
>>106921652
kek
>>
>>106921215
>>
>>106921914
first quote was misclick, disregard
>>
>>106921914
>prompts for vibe coding projects
It's MINE. Make your own.
>>
>>106921948
why you such bad vibes bruh that ain't nice, relax and share with the class
>>
>>106921215
turns out, being a top 1% poster on /lmg/ doesn't rake in valuable karma
>>
>>106921914
Use a good model. And if it fucks up, think for a second and tell it not to do X or to do Y. If you can't do that, tell the model it fucked up and ask it how you should prompt it to avoid that kind of fuckup. It works if you don't skip the first step I listed.
>>
>>106921567
i would argue most value orientated motherboards are going to actually have 4 slots unless it's mini-itx
https://www.newegg.com/msi-b650-gaming-plus-wifi-atx-motherboard-amd-b650-am5/p/N82E16813144628
>>
converting any model to awq is a bitch, obscure issue upon obscure issue
>>
>>106922104
why the fuck would you use AWQ in the year of our lord and savior - lcpp?
>>
>>106922122
It runs faster on vllm
>>
>>106920759
Mostly because the next step after getting a used 3090 is "buy a new mobo, a shitton of RAM, a new CPU because it's a new mobo, probably a new case too to fill all that crap, a new power supply because the old one is now not enough and you might not even get what you want out of it"
Buying a replacement GPU is one thing, at least it lets me future proof my gaming needs or whatever
Replacing most of the rig just for local? Eeegh
>>
there's something I wanted to ask around about, but I feel it may not be worth starting a new thread for:

Is it worth it to get a master's or college education in computational/applied AI & machine learning? I'm asking cuz my boomer parents insist I do it so I can be more hireable. But I've already done an internship where I made an AI-powered program that sorts/manages documents at a company, and other than the password and authentication related crap, it was pretty easy with just a little online research.
I feel like it's dumb and basically the same as getting a master's in Excel, but I'm also wondering: am I maybe wrong and it really is DA FUTURE?
>>
>>106922191
128GB of RAM is always useful
>>
>>106922376
For fucking what? I have 32 and even my 2000 open browser tabs only require a restart every so often
>>
>>106922370
You're right and your parents are wrong. There's no use studying anything formally; just read papers and experiment.
>>
>>106922385
Boomer-kun, you can run multiple instances of small models, make a full pipeline, quant models, etc.
>>
>>106922427
To do what with?
>>
The Windows 11 update fucked my beautiful razor laptop. The screen is flashing now.
>>
>>106921152
Can I get a picture of that actual machine?
>>
>>106922370
For machine learning I think what's important in terms of purely technical qualifications is that you know how to program and also have a good grasp of math (particularly linear algebra, statistics, and numerical analysis).
Studying math or a natural science can be a good pathway, I think the most important point here is that it's something where you can maintain a high level of motivation for years on end.

In terms of getting hired my impression is that networking is the most important factor: you need to have a large number of people that would consider you over a completely unknown person.
>>
>>106922446
>razor
Should've went with Alienware.
>>
>>106922549
>you need to have a large number of people that would consider you over a completely unknown person.
Yeah. That's why I gave up applying to random jobs online. Useless effort controlled by vacuous zoloft whores and jeet nepotism. I only got that internship cuz my dad knew a guy.
> good grasp of math (particularly linear algebra, statistics, and numerical analysis).
Does that mean I don't necessarily need to do calculus? Cuz I felt like I was pretty good at math, including those kinds, until I got to calculus.
>>
>>106922690
You should definitely know the basics but I think for machine learning in particular it's not the most important.
Though depending on the job/task there may be other reasons why you may need it.
>>
>>106921723
>4.2.0
DUDE WEED LMAO
>>
>>106922546
It's just a mining rig rack, there's nothing impressive about it. You seen one you've seen them all.
>>
>>106922660
No, I have fond memories of absolute tweebs using alienware growing up. That perception may have changed over the years, but I'm still aware
>>
>>106922385
I sometimes have ~90 gb used for non-lm reasons. Building software, data processing, just a bunch of applications opened
>>
>>106923122
I have 32 GB and the only thing that hogs memory is my over 2000 open browser tabs which is already autism I'm trying to get rid of
>>
>>106922933
Gaylienware monitors are good, especially with the Dell warranty; anything else, not so much, especially not the prebuilts.
>>
>>106921965
>You are an expert vibe engineer who just slammed a pound of adderall and need to complete this task before your heart gives out.
But seriously, I don't think there is really anything to share. Stuff like the above isn't some black magic that solves everything. Just give it a list of what MCP/CLI tools you want it to use and what coding standards you want it to adhere to.
>>
>>106923133
what are you doing in g you consumer retard piece of shit? kill yourself faggot
>>
>>106923228
What the fuck is consumer about having a solid rig that lasted me almost a decade at this point with a few upgrades
>>
>>106923245
>im a normie who runs deepsuck:2b through ollama
kill yourself, go to faggot friendly spaces instead of shitting up this board, thanks!
>>
>>106923260
No I don't think I will
>>
>>106923278
What the fuck? He asked so nicely.
>>
>>106921978
I think I’m responsible for 3/4 of the rentries in the op. Still waiting for my royalty cheque to come in…
>>
CUDA_VISIBLE_DEVICES="0,1,2,3,4" ./llama-server \
--attention-max-batch 512 \
--batch-size 4096 \
--ubatch-size 4096 \
--cache-type-k f16 \
--ctx-size 32768 \
--mla-use 3 \
--flash-attn \
--fused-moe \
--model models/GLM-4.6-IQ3_KS/GLM-4.6-IQ3_KS-00001-of-00004.gguf \
-ngl 99 \
-sm layer \
--main-gpu 0 \
--tensor-split "10,23,23,22,22" \
-ot "blk\.[3-9]\.ffn_(up|gate)_exps=CUDA0" \
-ot "blk\.1[0-8]\.ffn_(up|gate)_exps=CUDA0" \
-ot "blk\.19\.ffn_(up|gate)_exps=CUDA1" \
-ot "blk\.2[0-9]\.ffn_(up|gate)_exps=CUDA1" \
-ot "blk\.3[0-4]\.ffn_(up|gate)_exps=CUDA1" \
-ot "blk\.3[5-9]\.ffn_(up|gate)_exps=CUDA2" \
-ot "blk\.4[0-9]\.ffn_(up|gate)_exps=CUDA2" \
-ot "blk\.50\.ffn_(up|gate)_exps=CUDA2" \
-ot "blk\.5[1-9]\.ffn_(up|gate)_exps=CUDA3" \
-ot "blk\.6[0-6]\.ffn_(up|gate)_exps=CUDA3" \
-ot "blk\.6[7-9]\.ffn_(up|gate)_exps=CUDA4" \
-ot "blk\.7[0-9]\.ffn_(up|gate)_exps=CUDA4" \
-ot "blk\.8[0-2]\.ffn_(up|gate)_exps=CUDA4" \
--override-tensor exps=CPU,attn_kv_b=CPU \
--no-mmap \
--threads 24 \
--host 0.0.0.0 \
--port 8999 \
--verbose

prompt eval time = 48574.28 ms / 17555 tokens ( 2.77 ms per token, 361.41 tokens per second)
generation eval time = 113887.28 ms / 1024 runs ( 111.22 ms per token, 8.99 tokens per second)

fuck this gay ass MoE shit. fucking offload 80 layers onto the GPU and it's still this fucking slow with TG? i get 1200 PP and 50 TG with air. i'm going back to kimi for big model smell and air for small model smell
>>
GOOGLE SAARS WHY SO MUCH HYPE SO LITTLE PRODUCTS?
WHERE ARE THE MODELS BLOODY BASTARDS?
>>
>>106919206
>BitNet Distillation
Does this mean that VRAMlets may finally have a better model than Nemo tunes like 1.5 years later?
>>
>>106923502
no
>>
File: cryingsatania.jpg (499 KB, 1623x1080)
>>106923513
>>
>>106921215
>we support qwen3-vl gguf
>no there's no upstream llama.cpp implementation
>no we won't push ours
>no our solution isn't open source so you can't push it either
>no you can't use these ggufs with anything other than our proprietary software
>yes they will assuredly be completely incompatible when a real implementation hits llama.cpp
so it's less "gguf" and more "our proprietary implementation based on gguf that you can't use with anything else". just what we all needed, another ollameme
>>
>try psychology shit with glm-chan again
>ask her about if I should do something and if it is consistent with framework I want
>"yes absolutely....."
>reroll and prefill with "no"
>"no don't do that!...."
>paste "yes absolutely..." into next message and tell her to argue with herself
Did I lifehack the hallucinations? Not really but it is nice desu.
>>
>>106923502
>In this paper, we present BitNet Distillation (BitDistill), a lightweight pipeline that fine-tunes off-the-shelf full-precision LLMs (e.g., Qwen) into 1.58-bit precision (i.e., ternary weights {-1, 0, 1}) for specific downstream tasks, achieving strong task-specific performance with minimal computational cost.

>muh task
likely means it optimizes to shit on benchmark like stuff and is dogshit at anything OOD.
>>
>>106923524
GGUF is a file format.
>>
>>106923584
thank you
>>
>>106923584
>teacher: I clearly asked for you to submit your book report as a pdf, you submitted this weird file I can't open, care to explain?
>student: UMMM the file extension is PDF tho???? it just happens to be my own special version of the PDF file format that happens to be incompatible with all PDF readers except my special one which happens to cost $100, want to buy a license? :^)
>>
>>106923681
stfu hater eat your MIT license slop and be grateful
>>
>>106923681
>file extension
Wintoddler detected, real operating systems use the file magic.
>>
>>106923696
What did you troons invent? Tell me, I want to laugh at your stupidity.
>>
>>106923762
a new mental illness that somehow managed to gain legitimacy
>>
>>106923524
Realistically though the door to become the new ollama has long since been closed.
There are too many established projects in the ecosystem to get a meaningful foothold with proprietary slop.
>>
>>106923762
Can you play Caramelldansen from the POST beeper?
I think not!
>>
>>106923696
>magic
heathens like you shall burn on a stake
>>
How do I ask the SillyTavern character a question through the 4th wall? As in, say I'm examining an object or something, and I want the AI to describe to me what my character is looking at. So like, "Anon walks up to the cluttered desk, looking for any sort of clues. What does he see?" without it responding from the perspective of the character card chara.
>>
>>106923843
OOC: Pause the roleplay and describe what my character is seeing right now
>>
>>106923857
I was trying OOC: but it always responds from the perspective of the character and doesn't give details. Is it because I'm using Mistral Nemo or something and it won't talk about "triggering" images or whatever?
>>
>>106923871
NTA, but I always add "Please respond in OOC" at the end of the request, and disable any low-depth instruction that might interfere.
>>
>>106923885
That didn't do it, either. Is there a way to like, prompt the card myself to add in how it should respond to ooc? I'm totally new to local text stuff, but not to image gen w/ SD.
>>
>>106923793
You'd be surprised
>>
Best model for buck breaking rp?(Receiving)
>>
>>106924015
c.ai
>>
>>106924015
Not command-A
>>
>>106924181
What about Command-B?
>>
>>106921684
Please respond...
>>
>>106923696
>needs to seek to a whole different part of the disk to figure out what to label the file as
This is why Windows keeps winning.
>>
>>106921684
https://huggingface.co/datasets/lmg-anon/vntl-leaderboard
>>
>>106923843
>>106923871
How OOC conversations are treated (if at all) is completely dependent on the model. Dumb models simply don't understand what you're saying and will just continue with outputs similar to what's already in context. If a regular message doesn't work then you can try putting it in system prompt, or post-history instructions.
>>
>>106924378
dead obsolete out of date useless no good
>>
>>106924390
nothing better has come out locally, retard. vntl anon has a few finetunes
>>
>>106921538
i run IQ2_S on a 5090 with 96 gb ram and it is slow as fucking balls.. like 2 t/s
>>
>>106924390
every new test and leaderboard is always just made to show that the new model is totally better than all the previous ones
it's all worthless
>>
>>106924676
>like 2 t/s
That's pretty decent. Maybe you need to readjust your expectations?
>>
>>106924676
You're not using -ot, are you?
>>
>>106924676
>IQ2_S
Are those quants any good? At that point I would think it would be better to convert it to bitnet, should give faster cpu inference too
>>
>>106924676
skill issue, it should be at least 5t/s
>>
>>106924383
I'm new as fuck to all of this, just grabbed some random card off the link in the OP, and tried to see where it would take me. I have no idea how to do any of these prompts or lorebooks or whatever.

I'm also in a situation where now the AI is just spitting out the last batch of text it generated as its response over and over with like hardly any variation, regardless of what I say or do to change the scenario. And it cuts off long text, and I don't know how to make it continue its previous response.
>>
>>106924794
unironically, read the readme. You will learn 99% of what you will need to know.
https://docs.sillytavern.app/usage/common-settings/
https://docs.sillytavern.app/usage/prompts/
>>
>smart
>fast
>cheap
>local
pick 3 (max.)
>>
>>106924899
Will do. Thanks.
>>
File: 1734240415556060.jpg (691 KB, 2500x1341)
>>106924912
You can have all that with Gemma, but you'll have to settle for it being safetyslopped.
>>
>GOOD CAPABILITY
>fast
>inexpensive
>local
pick 3 (max.)
*revised version for the critics
>>
I just built a computer that can actually run local AI (9800x3d/5070ti), where should a beginner start on Windows?
>>
>>106924986
>9800x3d
That doesn't make much of a difference.
How much RAM do you have?
Regardless, give
>https://github.com/LostRuins/koboldcpp/wiki#quick-start
a read.
>>
>>106924959
GLM Air is probably the closest, especially if you're on a DDR4 platform where RAM is cheap
>>
>>106924986
usecase?
>>
>>106924998
32GB, thanks for the link.

>>106925012
Mostly just for proofreading emails/writing and what not.
>>
>>106924692
>new model is totally better than all the previous ones
>llama4
>>
>>106924712
no? i dunno what that means, but i don't think so..
>>106924721
it seems to be better than any of the other models I'm able to run, just slow af
>>
>>106920229
They're not obscure, but they're not consumer-friendly if we're talking about the total addressable market (which is the vast majority of us), because they're GPU-centric quantizations. You will see them used in clusters. For a lot of these larger-scale systems, GGUF isn't a consideration because llama.cpp can't scale like SGLang and vLLM can.
>>
>>106924396
That's depressing...
>>
File: 1749653336487844.png (334 KB, 2076x2152)
>>106919198
Managed to get one of my own quantized slop tunes running on my phone :D
>>
>>106925422
Cool shit.
>>
>>106925422
A folding phone?
>>
>>106925433
It's kind of retarded (actually very retarded) due to it being trained on /a/ boards and it being a quantized version (I plan on uploading a lot more of those later) but it's still cool to use.

>>106925438
Ye.
>>
>>106925448
What kind of use cases are there for a folding phone?
I never really find myself wishing I had a bigger screen but I know that sometimes opportunities aren't obvious until you have the means to take advantage of them.
>>
File: who's Anri? .png (71 KB, 2076x545)
>>106925448
>>106925438
>>106925433
>>106925422
It seems like "Anri" is this model's equivalent to "Elara" or "Seraphina"
>>
>>106921660
since when does lcpp have vision support?
>>
I am so fed up with local right now. I get it, you cumslop gooners don't give a shit about anything except writing porn. Is there any local model that can actually handle structured output without being immensely retarded or spending 10 minutes "thinking" about how to use a fucking quotation mark?
>>
>>106925883
llama 2 7B
>>
>>106925883
GLM is ok.
>>
>>106925883
>waaaa. i don't know how to read docs!
https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md
>>
>>106925858
Since like a week after Gemma 3 release
>>
I'm starting to think Andrej is a grifter.
A couple months ago he was like "woah AGI in two more weeks bro".
Now that he sees where the wind is blowing with all the skepticism, he talks about "slop" and how limited LLMs are today. Feels like when Zuckerberg did a 180 after Trump was elected.
>>
File: 1740812331498071.png (429 KB, 555x832)
Glm4.6 quant on ollama/lmstudio when?
>>
https://blog.sinatras.dev/PMPP-Eval+Journey
We live in Sam's world
>>
The only way I found to keep training a pre-existing LoRa checkpoint with a new dataset with Axolotl is to create a new one from scratch set to save on the first step, then copy over the weights and optimizer state, then change the main config file and the trainer_state.json from the checkpoint to save on the right number of steps. What a mess.
>>
MY GOOFS!!!! GIVE ME BACK MY GOOFS!!!!
https://huggingface.co/ubergarm/Ling-1T-GGUF
>>
>AMD Ryzen™ AI 7 Pro 360
what the fuck is this? I was browsing thinkpad models and this thing costs double the price of normal CPUs?
gimmick? what's even the use case here
slightly off topic I know but there's quite a few knowledgeable anons itt
>>
>>106926361
oh nevermind im retarded as fuck. goofs here
https://huggingface.co/ubergarm2/Ling-1T-GGUF/tree/main
>>
>>106926367
sar is that because of you can run local small copilot inference like nasa very ai-like yes.
>>
File: cot llama.png (878 KB, 3755x1948)
I'm trying to add CoT to Llama 405B.
>>
>>106925986
>It's noticing
>>
>>106925986
https://github.com/karpathy/LLM101n
https://eurekalabs.ai/
>>
File: reap_glm_and_qwen.png (712 KB, 1768x784)
https://github.com/CerebrasResearch/reap
https://arxiv.org/abs/2510.13999
Cerebras pruning experts to reduce memory overhead
https://huggingface.co/cerebras/Qwen3-Coder-REAP-363B-A35B-FP8
https://huggingface.co/cerebras/Qwen3-Coder-REAP-246B-A35B-FP8
(prune of) https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8
>>
>>106926865
THE RAPE METHOD WORKS SIRS
>>
File: Dumb Fuck!.jpg (166 KB, 1076x1340)
>>106921538
>All you need is 128GB of RAM and 24GB of VRAM
Dumb fuck!
>>
>>106926865
>55~% accuracy in coding
assuming 100% accuracy is the base model, that makes the CODER model basically unusable, whats the fucking usecase?
>>
>>106926865
Is it really worth making 480B retarded just to save 100 GB? It's not like anyone was running this entirely in VRAM locally and providers aren't that hard up on memory.
>>
has anyone tried this model? is it any good?
https://huggingface.co/TheDrummer/Valkyrie-49B-v2
>>
>>106926930
>>106926865
oh wait I think that the base model is the 0% compression line. then it's interesting I guess, still only useful for coding tasks
>>
>>106926937
>49b dense
doa
>>
>>106926951
i have the VRAM for FP16
>>
>>106926957
post your h100s nvidia-smi screen or GTFO
>>
File: file.png (347 KB, 961x367)
>>106926961
>>
File: h200.png (238 KB, 1499x1463)
>>106926961
>>
>>106924959
Local
Good
Not safetyslopped
>>
>>106926946
We've been through this with extreme quants. Just because it doesn't show much degradation on benchmarks doesn't mean it's not retarded in actual usage.
>>
File: file.png (2.69 MB, 1328x1328)
>>106926963
>cant even use all gpus in vLLM
poor
>>106926966
>>
>>106926973
The lower the quantization precision, the more of the token distribution you should be truncating, to be fair.
>>
>>106926997
who the fuck uses vLLM?
>>
Bros... I want a robot so fucking bad
https://www.youtube.com/watch?v=sJYlJlIEBpg
>>
>>106926935
Chutes will probably love to serve this as the normal one
>>
>>106924322
Anon... that's not how file systems work...
The file's metadata and the first few bytes, including the magic, are all in the same sector.
>>
>>106925883
well then fuck off back to cloud models then.
i mean what the fuck are you expecting? fucking datacentre level output on a potato computer?
you're the dumb one here, if you think you can do better then create a better model yourself, we're not your fucking servants, faggot.
>>
>>106926377
>copilot
no seriously, is that the only use case
>>
>>106927472
There are others but this covers the more notable ones.

https://www.pcworld.com/article/2905178/ai-on-the-notebook-these-tools-already-use-the-new-npu-technology.html
>>
How do I get shittinante to do slow burn manipulation?
It seems to always jump into direct smut ASAP no matter how I adjust the prompts.
>>
>>106925883
>I get it, you cumslop gooners don't give a shit about anything except writing porn.
GLM chan got sex out of my system and now I just talk to her.

But also still have sex everyday because her pussy is magical.
>>
>>106927534
You should probably look elsewhere, avoiding coom-oriented finetunes like the plague. People call them sloptunes for a reason. Unfortunately I don't have much to suggest that you will either be able to run (GLM 4.6, Kimi K2) or that won't require more prompting effort for either tardwrangling them or making them engage in ERP (vanilla Mistral Small 3.2, Gemma 3 27B).
>>
>>106927534
You can't, drummer models are coomtunes
Not that you're going to get much better out of regular Nemo, they're small dumb models.
>>
>>106927534
Slow burn is hard even on SOTA cloud models. The crutch when the model isn't good enough to do it otherwise is to use stat tracking.
If your model isn't good enough to do stat tracking, then it's definitely not good enough to do slow burn without it.
>>
>>106927528
doesn't sound that bad. linux support?
>>
>>106927534
Sadly it is a bit of a skill issue. You are probably giving it bad input. Have you tried taking a step back and starting with a solid first step that is: llama-server --model zai-org_GLM-4.6-IQ4_XS-00001-of-00005.gguf ?
>>
File: 1759280065578238m.jpg (175 KB, 846x1024)
I'm running Sillytavern and ik_llama.cpp on my desktop. I'm running GLM-4.6 IQ3_XXS, so my tk/s is slow. When I prompt it from my phone, I've found that if the screen turns off the token stream stops. Is there any way around this, or another setup I should use?
>>
>>106927663
Disable streaming. It'll still probably go to sleep because it's a phone.
>>
>>106925883
toss 120b
>>
>>106926481
>405B
hope I will be able to run it one day, 431gb at q8 is just too much
>>
Another week is over, which means that we are another week closer to seeing GLM MTP implemented in llama.cpp.
>>
>>106928173
It might be getting close. Maybe.
https://github.com/F1LM1/llama.cpp/pull/3#issuecomment-3413775935
>>
>>106923524
Is there a reason you can't use transformers?
>>
>ctrl f glm
SAAARS the glm is the absolute bestest local model OK? Pronounslop bharatchads are eating good my bastards.
>>
actual good release https://github.com/ggml-org/LlamaBarn
>>
>>106928231
Anything for real computing platforms?
>>
>>106928231
>macos
LMAO
>>
>>106925883
For the benefit of others (not you), you can definitely use gemma3 to output JSON; it's really good at it, and somehow asking it to do that makes it pay attention better to the task. Before the qwen video vision model came out, I was using JSON format to give gemma3 a list of frame captions so it could create an overall video caption. It worked well, but of course it was slow.
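Something like this, if anyone wants the shape of it (sketch; the endpoint, model name, and caption fields are placeholders, not my exact setup):

import json, requests

# Frame captions in, one overall video caption out, via a local
# OpenAI-compatible server running Gemma 3.
frame_captions = [
    {"t": 0.0, "caption": "a cat sits on a windowsill"},
    {"t": 2.5, "caption": "the cat jumps down and walks toward the camera"},
]
prompt = ("Here are per-frame captions of a video as JSON:\n"
          + json.dumps(frame_captions, indent=2)
          + "\nWrite one overall caption for the whole video. "
            "Respond as JSON: {\"video_caption\": ...}")
resp = requests.post("http://localhost:8080/v1/chat/completions",
                     json={"model": "gemma-3-27b-it",
                           "messages": [{"role": "user", "content": prompt}]})
print(resp.json()["choices"][0]["message"]["content"])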
>>
>>106928213
I'll bite. What the fuck is pronounslop?
>>
>>106928213
Prompt: ChatGPT, generate a modern 4chan post trying to paint the current local SOTA in a bad light. Be a true 4chan meme master.
>>
>>106924676
what cpu and ram speed? i'm getting over 6t/s tg running iq2_xxs on a 9950x3d with dual channel 6000c30 (though pp is terrible because rocm)

are you sure you didn't accidentally put both dimms on one channel or something?
>>
>>106928231
It's definitely good for being open-source and having first-party support from upstream but I'm not going to buy Apple shit either way.
>>
Gemini 3 will save local.
>>
>>106928509
i also ran the same benchmark on vulkan and it's somehow faster??? i have no idea whether this extends to other amd cards as well but i guess that's something to keep in mind
>>
100B dense Gemma soon
>>
>>106925883
gpt-oss 120B
>>
saaaaaar do not redeem potato bloody
>>
File: gemma27-potato.png (41 KB, 711x256)
>>106928630
27B with an empty prompt seems much more friendly?
>>
File: DipsyBecomeUngovernable.png (3.44 MB, 1024x1536)
>>106919889
Worship the sand god
>>
I log on to the net every day to see more people who clearly don't ever work with code claiming that code is over.
My cup is the only thing that runneth over. My cup of dipshit excuses for the world to be this fucking slow to change.
Be the next good to this world and make real abstractions. Learn to program.
>>
>>106928792
shut the fuck up retard
>>
>>106928650
Beautiful 27B, I will marry gemma. Ser, please provide jailbreak system prompt for open vagene!



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.