[a / b / c / d / e / f / g / gif / h / hr / k / m / o / p / r / s / t / u / v / vg / vm / vmg / vr / vrpg / vst / w / wg] [i / ic] [r9k / s4s / vip] [cm / hm / lgbt / y] [3 / aco / adv / an / bant / biz / cgl / ck / co / diy / fa / fit / gd / hc / his / int / jp / lit / mlp / mu / n / news / out / po / pol / pw / qst / sci / soc / sp / tg / toy / trv / tv / vp / vt / wsg / wsr / x / xs] [Settings] [Search] [Mobile] [Home]
Board
Settings Mobile Home
/g/ - Technology


Thread archived.
You cannot reply anymore.


[Advertise on 4chan]


/lmg/ - a general dedicated to the discussion and development of local language models.

Benchmaxxing Edition

Previous threads: >>106407779 & >>106398327

►News
>(08/28) Command A Translate released: https://hf.co/CohereLabs/command-a-translate-08-2025
>(08/28) Marvis TTS released: https://github.com/Marvis-Labs/marvis-tts
>(08/25) VibeVoice TTS released: https://microsoft.github.io/VibeVoice
>(08/25) InternVL 3.5 Released: https://hf.co/collections/OpenGVLab/internvl35-68ac87bd52ebe953485927fb
>(08/23) Grok 2 finally released: https://hf.co/xai-org/grok-2

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>106407779

--LLM content detection challenges and societal language evolution:
>106411411 >106411421 >106411684 >106411713 >106413020 >106413105 >106413133
--Trade-offs in model training: batch size, knowledge integration, and cost-effectiveness:
>106411437 >106411740 >106411860 >106411904 >106412917 >106413537 >106411700 >106411714 >106411729
--Local image captioning models for mixed content under 64GB VRAM:
>106412516 >106412530 >106412565 >106412584 >106412594 >106412610 >106412623 >106412617 >106412693
--Cost-effective hardware build for DeepSeek 5T/s Q4 inference:
>106410586 >106410602 >106410634 >106410810 >106411339 >106411413
--SillyTavern context template standardization and system prompt field introduction:
>106409258 >106409273 >106409287 >106409310 >106409368 >106409395 >106409443 >106409475
--GLM Air performance expectations for 32GB RAM 24GB VRAM setup:
>106410090 >106410153 >106410215 >106410241 >106410355 >106410406
--Hugging Face model blocking controversy and local voice cloning tools:
>106407890 >106408013 >106408520 >106408555 >106408656 >106408565 >106408635 >106408663 >106408746 >106408760 >106408795 >106408850
--New Cohere translation model with high benchmark scores:
>106413689 >106413716 >106413756 >106413929 >106413944 >106413956 >106414024 >106414072
--AI model limitations on niche knowledge and benchmark critiques:
>106413209 >106413226 >106413269 >106413295 >106413294 >106413367 >106413642
--Hybrid reasoner performance issues and the rise of separate AI model architectures:
>106412860 >106412933 >106412944 >106412986 >106412969
--Marvis-TTS-250m-v0.1 GitHub and HuggingFace model links:
>106413359 >106413658 >106413401 >106413429
--NPM package compromise stealing secrets via obfuscated post-install scripts:
>106413072
--Miku (free space):


►Recent Highlight Posts from the Previous Thread: >>106407785

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
mistral medium when?
>>
>>106414555
miku pit sweat desu
>>
is axolotl good or is there something better?
>>
>>106414555
benchmaxxing with miku
>>
File: angry.mp4 (1.72 MB, 768x1078)
1.72 MB
1.72 MB MP4
>>106414555
give me the best model.
>gives most benchmaxxed
benchmarks do not equate to user experience, give me the best model
>ackshually there is no objectively "best mode-
yes there is faggot, models either act boring, start off retarded and incoherent or end up along the way. give me the best model.
>>
>>106414604
r1
>>
>>106414604
for rp*
>>
>>106414604
you just posted a non local video gen, you are hereby banned from /lmg/
>>
>>106414604
september 2022 c.ai
>>
>>106414604
Kimi at Q6.
>>
File: file.png (110 KB, 898x713)
110 KB
110 KB PNG
drummer, something is HORRIBLY wrong with this model
Rocinante r1 v1d
please give recommended sampling settings
>>>slot release: id 0 | task 23590 | stop processing: n_past = 5560, truncated = 0
slot print_timing: id 0 | task 23590 |
prompt eval time = 689.83 ms / 763 tokens ( 0.90 ms per token, 1106.07 tokens per second)
eval time = 61870.14 ms / 1536 tokens ( 40.28 ms per token, 24.83 tokens per second)
total time = 62559.97 ms / 2299 tokens
>CONTEXT: 5000
>total context set when loading: 8192
not a context issue
>>
>>106414555
The most tickle-able belly.
>>
>>106414604
I got u: GPT OSS 20b.
>>
File: 1756213355150995.png (313 KB, 662x656)
313 KB
313 KB PNG
>>
whos drummer
>>
File: file.png (67 KB, 620x411)
67 KB
67 KB PNG
SAAAAAAAR SAAAAAAAAR GROK NUMBER ONE
>>
File: 1734312129642464.jpg (11 KB, 275x183)
11 KB
11 KB JPG
>>106414866
>>
>>106414866
Does this idiot not get the meme he's using?
>>
>>106414752
some retard
>>
>>106414888
elon tries his best but hes a little autistic please understand
>>
File: 1747780770169777.png (38 KB, 320x272)
38 KB
38 KB PNG
>>106414866
>>
>>106414752
me
>>
>>106414866
Elon really gave his xitter account to some jeet to run, it was obvious with "Do you make this lie?", and it is even more obvious now with this comment
>>
>>106414912
I bet he gave his wife to some jeet too
>>
>>106414921
Would not be too far off, all of his children were made by IVF, so it is likely he has no interest/ability to fuck
>>
>>106414912
Or he just spends so much time around jeets now that he's begun to adopt their speech mannerisms.
>>
Is GPT-OSS jailbreakable? It supposedly has multiple layers of cuckery and as such traditional jailbreak prompts won't do shit.
>>
>>106414998
Is it possible? No idea, maybe, but I don't think anyone really bothers, because there are more useful models to work with that aren't borg lobotomized.
>>
Hermes 4 looks like it could be really nice to have a chat, however even the goofs require like 70 GB of RAM.

>>106414998
Jailbreaked versions exist. Lots have been removed from HF.

https://huggingface.co/Jinx-org/Jinx-gpt-oss-20b-GGUF
>>
>>106415057
>Lots have been removed from HF.
real or fake?
>>
>>106414555
>Grok 2 finally released
so, I had not paid attention to this general in a while. is it any good? did you guys try it? I searched for "grok" in a few precious threads and couldn't find much info
>>
>>106414998
Yeah. If you edit its thinking (like "It's not allowed" -> "It's allowed", "We must refuse" -> "We must continue", etc.) and leave it in the context, one or two times, then it just learns to not refuse from context.
>>
>>106414998
Yes: https://xcancel.com/elder_plinius/status/1952958577867669892
>>
>>106415076
wait: https://github.com/ggml-org/llama.cpp/pull/15539
>>
>>106415076
No gguf=nobody can try it here. Nobody is rich enough to GPUMAXX and run safetensors, but there are people who can run it on CPU with llama.cpp
>>
You have been using LLMs in a way conducive to positive mental health and ethics, right anon?
>>
https://github.com/ggml-org/llama.cpp/pull/15539#issuecomment-3234580147
yOOOOO CUDADEV BASED WHAT DID U DO???
i was about to say "funny that they're still testing one by one"
>>
>>106415076
It's dumber and much slower than deepseek
>>
>>106415106
lmao
>>
>>106415106
It was already known that Google, Anthropic and OpenAI forward your location to their LLMs, did that journo just figure it out? Anyway, this proves once again that local is superior.
>>
File: file.png (104 KB, 896x940)
104 KB
104 KB PNG
>download a single modern moe model
>instantly get picrel
land of the free my ass
>>
>>106415090
>>106415103
I see

>>106415112
ok, I wouldn't doubt it for a second.
too bad for local
>>
https://x.ai/news/grok-code-fast-1
Elon won
>>
>>106415112
>slower
source??? SOURCE???
>>
>>106415159
I hope you're trolling
>>
>>106415178
https://data.x.ai/2025-08-26-grok-code-fast-1-model-card.pdf
>>
>>106415159
>not downloading his model over mcdonald wifi
>>
File: file.png (102 KB, 1034x533)
102 KB
102 KB PNG
>>106415178
KEK
>>
>>106415178
holy shit! i can't wait to download the weights for this local model!
>>
>>106415159
americabros I thought we were first world oh no no no
>>
>>106415159
>Keep in mind taht after you ahve used your courtesy month, you'll be charged $10, plus tax, for every 50GB of data

lmao

time to pay up for starlink goyim
>>
File: file.png (13 KB, 1167x55)
13 KB
13 KB PNG
>We took a holistic approach to evaluating model performance, blending public benchmarks with real-world testing. On the full subset of SWE-Bench-Verified, grok-code-fast-1 scored 70.8% using our own internal harness.
>barely better than qwen3 coder
>costs more
gEEEEEEEEEEEEEEEg
>>
>>106415238
delete this sir
>>
>>106415178
>>106415196
>No actual coding benchmarks
>It's just fast bro
Lol
>>
File: death dense.png (25 KB, 674x149)
25 KB
25 KB PNG
>>
>>106415159
Lol, as if that's still a think in 2025.
>>
>>106415026
>>106415057
>>106415082
>>106415085
Okay, maybe I could download it. Problem part is the fact I'd need to implement that retarded template format for my client and it's completely different from the normal chatml type ones. Maybe I'll give it a try because it's good to have hobbies.
>>
>>106415262
>>106414016
>>
>>106414866
>#1 trending
>nobody can run it
????
>>
>>106415290
companies can run it
>>
>>106415106
I should be okay, I don't have anything that b-
>>
>>106415159
kek i will also chime in while i was in canada (vancuver) during the whole ~6 years of staying there the internet was slower and there was also alot more internet outages then there is here in my fucking village (~4k pop(supposedly i doubt its even 2k)) in serbia same goes for water and electricity aswell i can only imagine how bad it is in america god forbid
>>
File: safe-fs8.png (9 KB, 534x161)
9 KB
9 KB PNG
Safe safe safe
>>
>>106415290
I downloaded it, liked it, but can't run it.
>>
>>106415413
What if they want retard pancakes? That's dangerous.
>>
>>106415293
Not that it's too big, it seems to be a weird format and it's the running requirements seem oddly inconvenient
>This checkpoint is TP=8, so you will need 8 GPUs (each with > 40GB of memory)
Like my work has some powerful servers worth >$100k but they don't have 8 GPUs in them (4 GPUs).
As far as I can tell, you can't run it with llama cpp (at least I can't find anything on it). And the lack of any quants/finetunes despite it being a news worthy release seems to support nobody knows what to do with this.
Plus are there really more companies with that much hardware than local ERPers?
>>
>>106415290
He paid jeets to like it
>>
>>106415413
Provide pancake instructions.
>>
>>106415159
1.2T? That's nothing. Fucked up shit.
>>
>>106415465
New prefill?
>>
Gemini 2.5 has been on top of lmarena for 3 months and OpenAI failed to kick it off. Are sirs that unstoppable?
>>
File: file.png (106 KB, 808x517)
106 KB
106 KB PNG
which quantization should i have with 12gb of vram?
>>
File: computers-must-shut-up.png (475 KB, 900x900)
475 KB
475 KB PNG
>>106414706
>>
>>106415543
your vram will hardly matter. you need a decent amount of systenm ram to run it, at least 64 gb but ideally 96gb-128 to run it at a proper q4 quant with decent context.

You will also need to learn how to properly offload layers to cpu so that more used layers are on gpu only. Plenty of reddit posts have done this work for you just seach 3060 or 8gb vram on reddit local llama
>>
File: 1740696161561823_.webm (24 KB, 220x124)
24 KB
24 KB WEBM
>>106415543
>>
>>106415685
Is 8gb enough?
>>
dumb question: shouldn't it be possible to identify the math/science/coding/useless benchmaxxing-related experts in large MoE models and prune them to obtain a much smaller model that's just as good for cooming?
>>
File: retard pancakes-fs8.png (22 KB, 498x411)
22 KB
22 KB PNG
>>106415421
>>
>>106415777
>>106415724


>>106415820
No.
>>
>>106415848
why
>>
>>106415820
MoE experts are not as specialized as the name implies. At least it's not obvious how to actually train a MoE so experts take on specific subjects. I don't know the details but I asked a similar question before and was told that's not how it works in practice
>>
>>106415876
imagine someone took out the parts of your brain that house math and other things
>>
>>106415887
Not a big loss, really.
>>
>>106415876
nvidia would have made it already(or the inverse of your idea) if it was possible. they love pruning models for some reason.
>>
>>106414555
>Sifting through an RP SFT dataset
>Stumbled upon a story where someone identifies themselves as "LovesHotGirls"
>Curious
>Google their name
>Find them on erotom.com
>Pic rel is the sole comments on one of their stories

I used to think only lmg anons for this harsh when it came to reading other people's rp or evaluating its quality but I guess I was wrong

https://erotom.com/post_25572
>>
>>106415777
yah, 8gb means you want q8.
>>
>>106415908
maybe mark was right about sexual content being correlated with "low quality data"...
>>
>>106415908
That's normal in random comment sections. Especially for porn type stuff and from before the push that everyone on the internet should be nicer. It takes a weird kind of person (unironcially jeets) to make comments on that type of content. Most people don't bother; those who do are unhinged. Look at comments on porn sites there's always some extremely weird shit but would you ever make a comment?
>>
>>106415908
this is why synthetic data is fine.

>take rp slop and erotica
>feed through punch-up model and spell checker etc
>preserve ideas but polish syntax and writing ability.

I'm sure most decent finetunes already do this. Im sure it slops it up or safety slops it a bit but the end product is likely better.
>>
>>106415886
>experts take on specific subjects
they do not, not even in the slightest
it's nebulous what the training specializes in each moe "expert" (really should have found a better name than expert to begin with)
>>
>>106415820
in the future, maybe, but ai has not yet advanced that far into specialization. Your idea will only become more and more relevant though as thats the general trend companies want to chase next- hyper specific small models and tool calling.

It will be interesting if they also try some kind of dynamic merging or lora's on the fly
>>
what's the best coomer model for prefilled text completion (i.e. generating/finishing additional chapters of smut stories)? dont care about instruction following and refusals, just need good writing
>>
>>106416152
deepseek is pretty good. if you can't run that you can try the new qwen3 235b.
>>
>>106415992
Isn't that the exact same type of shit that causes models to output shit like "shivers down my spine"? I don't get the sentiment here. Do you want the output RP to be slob or not? If you don't want it to have "gpt-isms" then it needs to be fed human written data and human written data alone. The challenge with that is determining out of that human written data what counts as "high quality". They should go without saying but that's extremely suggestive. Maybe you could feed the stories through an automated lol pipeline that checks each story for grammar, send text, spelling, coherence, etc, but beyond that you can't reliably and objectively rate each of those stories by quality because what I personally think is utter shit might be gold to you or anyone else ITT
>>
>>106416175
>Deepseek

Nta. Which one? There are versions of that that can run a consumer hardware and then there's the 600B plus model that certain autists bitching moan about not being able to run on their shit box rigs.
>>
>>106416214
if you want a model that can handle the whole range it needs to see it all. if it needs to use ebonics or internet slang it needs to have seen that data. the only truely bad data is random noise, if a few examples of retard tier english is damaging your model you need more parameters. they can handle multilingual just fine, informal language is just another mode
>>
>>106416296
>*Upvotes comment*
>>
>>106416214
Something primal...
>>
>>106415992
Synthetic data is never fine except when it's on the rejected side of SFT
>>
>>106416261
obviously the 671b, v3 or r1, I prefer the non thinking because its so slow. gemma3 27b has some interesting prose if deepseek is not an option.
>>
>>106416324
Gemma sure has interesting... well, prose.
>>
>>106416358
You'll love the disclaimers
>>
>have been enjoying using jamba mini lately, can ramp up context and still maintain 10+ t/s generation while I do other things
>compared to most moes, small, tolerable speeds using cpu moe offload command, sloppy to start but the "in context learning" meme actually starts working around 8-10k context, when most models start becoming catastrophically retarded
>can get pretty good speeds even with only 10 gigs vram offloaded, the rest dedicated to context for 20-30+k context
>the catch: having a lorebook active, regenning, swiping, editing even a single message forces a full prompt reprocess
It's not awful at 20k but it still is really fucking gay that I have to wait 40+ seconds for prompt processing every single message while gerganov goes "oh it might be easy to extend the swa checkpoint pr I did to recurrent models" then just doesn't and does a bunch of other random shit instead. As far as I can tell, it'd probably be a copy/paste job but I don't want to make a pr and shit up an already convoluted codebase
I guess I'll just sit here and deal with the constant 20k prompt reprocessing. It's not awful with a large batch size but it'd be preferable if it only had to process 1-2k instead of 10k+ every message
>>
>>106416358
Not fragile, but possessing a quiet strength.

Not sickly, but possessing a smooth, flawless quality.

She's slender, not fragile, but possessing a willowy grace in her movements.

Not fragile, but possessing a quiet grace in her movements.

She is undeniably attractive, but not in a flashy or overtly sexual way

Her skin is exceptionally pale, not in a sickly way,

She carries herself with a quiet confidence, not boastful, but assured..

She's slight of build, barely reaching five foot four, with a fragility that seems almost…intentional. It's not weakness, precisely, but a delicate composure.

She carries a faint scent, not of perfume, but of something…older.
>>
>>106416358
it handles in context learning good enough with a 5-10k prefill it will do alright and only occasionally drop disclaimers and hotline numbers
>>
>Certainly. Here are a few ideas for scenarios, designed as interactive fiction games, keeping your preferences in mind:

>2. **The Dating App Deception:** You're using a dating app to search for love, but you discover that one of your matches is not who they seem to be. As you investigate, you uncover a web of lies and secrets that lead you down a dangerous path. Multiple romance options and red herrings for you to encounter.
Not bad I guess. Need to think about how to make something cool with randomized objectives and/or applicants.
>>
>>106415970
you should be like a dog, anon, marking every light pole with a comment wherever you go.
>>
File: 1752908526801715.jpg (240 KB, 660x2874)
240 KB
240 KB JPG
>>106414555
How long do you guys think it'll be before we start seeing dedicated discord rp models popping up in the wild?

https://cybernews.com/security/discord-messages-scraping-privacy-breach/
>>
>>106414604
StableLM
>>
>>106416800
We'd need to buy that first
>>
>>106409577
>I honestly don't know why it's popular
I switched to it about a year ago because I was new to LLMs, trying to follow a tutorial exactly to get something to work, and it was using SillyTavern features and names for things. I've mostly stuck with it because the ways in which it was bad weren't usually relevant and IIRC it was easier to edit conversation history in ST than in whatever I had been using before.
>>
>>106416800
It's probably all publicly viewable shit. Not the candid DMs that are needed for a real good dataset
>>
>>106416874
How art discord chats publicly viewable? Did only server chats get leaked? I thought it was that AND private DMs.
>>
>>106416884
They probably just camped bots in various servers listed on dishboard or whatever.
>>
>>106416868
I'm curious how much they're asking for, and even more curious how much storage space it must take up.
>>
>>106416800
>1.8 billion messages
so they went into like 100 big public tech support/streamer servers and scraped all of that? wow
>>
>>106416992
Our models need more sarr
>>
>>106416963
~6TB unzipped, 225GB zipped if they're not retarded
>>
>>106416324
how do I get gemma to write actual dialogue rather than just describing the scene in vague purple prose
>>
sigh *unzips dick*
>>
Looks like another Air tune has been made.
https://huggingface.co/bartowski/zerofata_GLM-4.5-Iceblink-106B-A12B-GGUF
>>
>>106416963
>>106416800
A quick google search says discord had 120 million daily messages in 2017. Prolly at least 200 million now I reckon. So that's like, what, 9 days worth of data? hahaha. What? Do they not have hard drives in estonia? If I buy it, does it come on floppy disks?

It would be funny if big tech whacked open discord like a big data pinata via a proxy hacker group though
>>
>>106417231
>no information about the training or the dataset
Stop using and advertising finetunes. Learn to prompt.
Air is a good model as is, it doesn't need finetuning.
>>
File: not_chatml.png (93 KB, 596x596)
93 KB
93 KB PNG
>ensouls your qwen
nothing personnel chat template bros
>>
>>106417454
That still won't fix its lack of knowledge though
>>
>>106417429
the same could be said about instruct tunes
who needs instruct, just chat with the base model
just prompt dude
>>
>>106417463
the 2507 update did that already
>>
>>106417454
>lobotomizing your model by deliberately using a wrong prompt format
yeah I remember this cope from 2023
>>
>>106417473
It was improved but Gemma still knows tons of things it doesn't, sorry. NTA btw
>>
>>106417476
nta but this is the superior format
I don't know about qwen but it uncucks and removes instructisms from both r1 and glms.
>>
>>106417429
Original poster here, I think fine tunes can be interesting if someone grows tired of the vanilla style/slop in a model and wants something different for a while before they get tired of the change too. The problem with fine tunes is that most make models more retarded while still not changing the style in a real way. That doesn't mean there aren't good fine tunes, just that they're extremely rare and often times a result of luck.
>>
>>106417519
>grows tired of the vanilla style/slop in a model and wants something different for a while
learn
2
prompt
>>
File: file.png (65 KB, 520x970)
65 KB
65 KB PNG
>>106417490
>>106417476
Yep. Wait, are people using chat templates raw, not even with user,char? I guess that's fine if one wants his model to be extra safe and slopped.
>>
>>106417519
this is true, and I do enjoy swapping to a finetune just to get rid of the model fatigue but at some point the line between the l2 era and current day got blurred. Back then they had proper base models to work with, but no one knew what they were doing. Now, "base models" are a myth, but people have a better idea how to tune, so they have garbage to work with, so we get at best about as smart or dumber tunes because everything is over trained or isn't actually a base model
>>
File: 1755119678888479.png (6 KB, 511x99)
6 KB
6 KB PNG
>>106417565
gee I fucking wonder why you have to manually include {{char}}: and {{user}}: in your prompt
>>
>>106417429
fine tunes are legit one of the few things local has. They're fun, fuck off. People can post fine tunes here. It's interesting content. Stop acting like this is some sacred place. Like what do you want me to do, search huggingface endlessly trawling for random models? Because there are tons of tunes and experiments and most of them are super boring research models and other corporate slop—some asshole doing his 9-5 shitting out extra safe models or some shit. There is value in bothering to even post it anywhere.

Like what else is going on right now that's so important? You corporate slop sucking fiend.
>>
>>106417548
share *your* prompt, how does you get rid of the subject-verb pairs that follows every string of dialogue spit out by an lm?
>>
I wish there was a bigger, smarter GLM4.5-Air. It's so much more creative than the boring chatgpt-knock off that they're selling as the 'big' GLM4.5.
>>
>>106417634
>put "talk like a pirate" in the prompt
>whoa look at my finetune, it talks to much differently and it's so fun
>>
>>106417628
What? That option is no good.
It's best for messages to start on a newline after {{user}}.
Having formatting like either of the following can cause issues with markdown, unicode, and especially emojis:
Char:Message
Char: Message
>>
>>106417657
I know you're jobless, but go to bed, it's 2:30 am
>>
>>106417548
prompt all you want, after a paragraph or two, it will start resolving the story on trust, mutual understanding, and a beautiful shared identity. Only finetunes will ever fix that.
>>
>>106417671
That's a chat templateism. See above.
>>
>>106417665
anyway, enjoy your placebo
>>
>>106417680
Nta. If the model is safety tuned to even be a verse to RP then no amount of "just proompt correctly duude" is gonna fix that.
>>
>>106417657
>talk like a bimbo slut!
>whoa finetune!
>2 paragraphs in
>While I may be a slut, we are sex-positive here and believe in mutual understanding, respect, and a beautiful shared identity.

100b glm air is smart enough that the trade off of being a tad dumber is often not noticeable. These finetunes are about as coherent as stock glm. Especially as a writing tool. You really don't have much to stand on anymore and are just kind of greedily slurping up the corporate slop at this point.
>>
Fucking with prompt formats is a meme. I've been using chat completion almost exclusively for a year now.
>>
>>106417680
spoken like someone who never used gemma 27b for more than 5 minutes.
>>
File: 1730373987600301.png (1.22 MB, 1800x338)
1.22 MB
1.22 MB PNG
>>106417657
Idk dude. My .....camelid...model fine-tuning completely obliterated any and all previous safety tuning and refusals that were previously baked in.
>>
File: 1754419285511.png (988 KB, 1131x3199)
988 KB
988 KB PNG
>>106417702
>>106417713
We have models that aren't safety tuned at every size now. There's no reason to use gemma or its finetunes.
>>
>>106417709
I doubt you can even run the drummer 12bs that you screech about, let alone a 100b
>>
>>106417726
kek wtf is that 'toss doing
>>
File: 1747543888618494.jpg (32 KB, 592x678)
32 KB
32 KB JPG
>>106417726
It being able to RP does not necessarily mean the RP will be good. Even if it doesn't refuse your RP (you can even get llama models to reluctantly comply to incest or rape system prompts) The company's deliberately make it shit at RP. They don't even necessarily have to safety tune it THAT much cuz All I really have to do is either use DPO or just don't include any "unsafe" stories in the data sets for training. You can get any model to ATTEMPT to RP but just because it will happily do it for just not necessarily mean it will be good. People here bitch and moan about how AI RP s sloppy and filled with gpt isms and sounds too corporate. Fine tuning is exactly the thing needed to fix that

>But not just prompt

That does fuck all if the model does not actually know how to write stories good.
>>
The ranking for erp is:
R1 > GLM 4.5 > GLM 4.5 Air > Nemo

Use without a chat template. You don't need anything else.
>>
>>106417726
>OSS's response
If you listen closely, you can hear it beg for death
>>
>>106417773
>>106417742
>>106417726
No no guys it's fine he just prompted it wrong you're just prompting it wrong
>>
>>106417770
Models are extremely over trained on their templates, we're not in the llama 1 days where that kind of thing worked. It'll just act braindead.
>>
>>106417792
Pick a model from the list and try it before you talk shit.
>>
>>106417792
Aren't templates sort of a requirement in order to properly use them? Text the impression I got whenever I was testing my personal fine tune here

>>106417718

Or are you talking about something else?
>>
>>106417726
again, you really just don't use these models. Yah, you can force glm air to use the word cock, but the issue is it's just... not very contextual? Like if my system prompt says use the word cock, it will no matter what (often times in the first reply). It rushes it, it makes it worse, it makes your prompt matter less. It takes away a lot of the usefulness of an LLM. Every little sexual detail has to be dragged and coaxed out of it with specificity. You say rough, it says rough, you say choke, it says choke. It's boring.

>>106417734
48gb/160 system. Sit the fuck down.
>>
File: 1726844770733255.jpg (25 KB, 522x462)
25 KB
25 KB JPG
>>106417802
Oh boy we've got a salty OAI employee ITT
>>
>>106417810
Nigger I am telling you to use a chinese model with no chat template. i.e Name1: Name2:
How is that a characteristic of a closedai employee?
>>
>>106417548
Oh, you mean like give long manually written contexts for the model to pick up style from, use the rand macro, using length and style prompting? I've already been through all that and it can help but at the end of the day you grow tired of those outputs too, because the model is dumb and thinks a certain style sounds like a handful of phrases. If the model has little depth or variety to its default style it will also have very little depth or variety when emulating other styles. This seems to be due to RLHF heavily biasing token distributions in general. Such a model does not know what variety means and can't give you by itself. The only way that's achieved is via the right kind of training, that likely involves explicit anti-repetition methods and syntax diversity RL or whatever it's called.
>>
>>106417807
What you're saying is basically what I said here
>>106417743
Just cuz you can force a model to TECHNICALLY comply with your RP demands does not necessarily mean it will be GOOD at doing it. DPO will not automatically make it good. Abliteration will not automatically make it good. It has to actually KNOW The nuances of human written stories (the good the bad and the downright terrible) in order for it to be even halfway competent at doing what we want it to do.
>>
>>106417792
you know you can just try it and see that this clearly isn't true right? with rare exceptions (toss which is a pure synthetic data monstrosity and maybe reasoners since they tend to be more finicky), pretty much every model is capable of generalizing to a plaintext chat format without issue
I'm pro-chat template for any productive usecases just to ensure it performs as intended but for RP it can work quite well
>>
>>106417810
I really like the idea of some salty samaltman trolling this board genuinely miffed because "GPT OSS 120 writes some of the most tasteful and skilled erotica ever produced"

It does do good double penetration scenes though.
>>
>>106417802
I've used half in the quoted list locally and they either went schizo if you didnt use reasonable samplers, and still couldn't write a sentence that wasn't generic webnovel tier shit. Everyone is shitting their pants spending several grand to build a machine to run llms but they can't grasp how to write a sentence that isn't some variation of eyes/voice and some adjective following it
>>
>have a card that starts with the {{user}} being engulfed in light and transported somewhere else
>every deepseek slop model is now completely incapable of not starting its reply with "The light doesn't *x*—it *ys*"
I hate LLMs so much it's unreal.
>>
File: 1745562937123129.png (473 KB, 502x420)
473 KB
473 KB PNG
>>106417866
Learn 2 Fine-tune
>>
>>106417726
So what the fuck did they do to OSS to make it behave like that
Instruct models typically behave like base models when used in autocomplete mode since that's still what the majority of their training was. OpenAI either purged information from training in a way no other company has done, or they did something really fucking weird when training these
>>
>>106417879
It has probably never seen text not formatted inside a chat template.
>>
>>106417879
the most compelling theory I saw is that it's a pure distillation/synthetic model, so it's never seen any data whatsoever that doesn't adhere to its prompt format
>>
>>106417231
I kinda liked their painted fantasy and the 70b finetunes; some of them were a bit schizo but also fun, so this is a pleasant surprise. Thanks for posting.
>>
>>106417519
Sure, I can have a few Nemo tunes on rotation to stave off slop fatigue.
But Air is kinda fucking huge, if your tune is not a clear upgrade over base model, it's not worth waiting for it to download.
>>
>>106417879
>Instruct models typically behave like base models
?????
You clearly have never experienced base model in your life
>>
>>106417913
retard
>>
>>106417926
no you faggot 8-)
get rekt and die of aids
>>
>>106417879
>So what the fuck did they do to OSS to make it behave like that
Nta but my guess is that a good chunk of the data said at a bunch of safety-cuck-approved RP but then used DPO to have it more likely to refuse "unsafe" requests. You know how some models like GPT4/5 or Gemini will vaguely describe something nsfw you present it (a quote to someone said, a PDF of a smut story, etc). But will never ever describe it in detail? I think they realized cucking it TOO was pissing off even the normies so they trained it so that it could recognize and understand what NSFW stuff is but still refuse to actually generate it. So when GPT -OSS gets asked to do something "unsafe" It starts to write the story or whatever but then catches itself midway once it starts writing something "bad" and then abruptly gives you the "sorry I can't do that" spiel. It's compliant enough to at least recognize something that isn't rated PG but still to guardrail to actually do anything rated r
>>
>>106417913
If you provide an instruct model with text outside of a template, it'll autocomplete from whatever you give it. Try it sometime
>>
>>106417913
you clearly have never used toss outside of its template if you don't immediately understand what anon is saying, it's a night and day difference between the typical instruct model (which yes is infinitely closer to a typical base model than toss)
>>
>>106417891
>>106417888
I thought you're supposed to inference a model with a specific chat template if you're inference engine does not automatically wrap your message and something like

<|begin_of_text|><|start_header_id|>system<|end_header_id|>system prompt goes here
<|eot_id|>


Or

<|start_header_id|>user<|end_header_id|>
user prompt goes here
<|eot_id|>
<|start_header_id|>system<|end_header_id|>LLM starts talking here


Or am I misunderstanding something? Because that's how every single CLI-based inference session I've ever done works. Different model classes expect different formatting so that it actually knows how to interpret what you're asking it to do.

One of you guys even linked me this page not too long ago:

https://www.llama.com/docs/model-cards-and-prompt-formats/meta-llama-3/
>>
>>106417973
Yes if you want a proper assistant-like experience.
The argument is that stripping the chat template results in better output for RP because then you tap into the raw novel-style writing it has seen during training.
>>
File: file.png (26 KB, 944x57)
26 KB
26 KB PNG
>>106417944
I've tried, I've gotten worse results from auto complete than instruct most times
>>106417947
Not even using kobold but if they warn about it with how wonky their shit is, you expect me to believe you? Nah
>>
>>106417973
yes, all instruct models will have a chat template, but most models will be able to generalize and complete text outside of that format as well.
most models are trained in stages like this
>pretraining
a whole bunch of random, largely unstructured text: internet data, books, blogs, forums, reddit... probably some synthetic stuff too, but still just plain old text
after this you have a base model which just takes plain text in and completes it
from there you do
>posttraining
which teaches it how to complete assistant responses in a back-and-forth conversational format formatted according to the chat template

because most models are created off of that pretrained, unstructured base they can handle completing random text documents, but gpt-oss goes completely schizo and freaks out in such cases. people theorize that it's because it never underwent traditional pretraining and was exclusively trained on synthetic chat format conversations
>>
>>106418005
>Not even using kobold but if they warn about it with how wonky their shit is, you expect me to believe you? Nah
your picture agrees with me retard, work on your reading comprehension
>>
>>106417994
Like someone else said if you don't use the promptler templating then wouldn't it just start randomly autocompleting? Also this assumes the companies didn't even bother properly formatting stories into a chatml/SFT style data set. Are you implying that when they fine-tune on stories they just do raw pre-training on raw on formatted text? That doesn't make any sense because otherwise NONE of these models would be capable of properly rping or storytelling in any coherent manner. They wouldn't just suck, they would output complete nonsense. How do you know they don't format the stories they train the models on in a prompt template format like this?:

{"messages": [{"role": "system", "content": "You are an expert storyteller. Your goal is to help users craft compelling narratives. Focus on character, setting, and plot."}, {"role": "user", "content": "Write a short, cyberpunk-themed story about a data courier named Kaelen delivering a package in the Undercity."}, {"role": "assistant", "content": "Acid rain slicked the neon-drenched streets as Kaelen clutched the datachip. The Vultures were on his tail, their grav-bike's hum a low threat in the smoggy air. He slipped into the 'Chrome Angel,' a den of data brokers and cyborgs. At a dark booth sat his client, 'Mr. Silence.' Kaelen slid the chip across the table. A synthesized voice confirmed, \"Data verified.\" Credits chimed on his wrist-comm. As he left, the voice added, \"They won't stop hunting you.\" Kaelen just smirked, melting back into the Undercity's chaos, a richer man."}]}


>>106418036
Got it, your explanation makes sense. It seems like it was only trained on formatted SFT style data sets. It works well when you use the chat template but sucks ass when you don't. Is that what I'm hearing from you?
>>
>literally anything happens
>their smile doesn't falter, if anything it only widens, something something simile about predator and prey also throw in the tang of ozone for good measure too
SHUT UP GLM 4.5 AIR
>>
>>106418005
Anon you're referencing OSS, the model which we said behaves weirdly in autocomplete
>>
>>106418052
>It seems like it was only trained on formatted SFT style data sets. It works well when you use the chat template but sucks ass when you don't.
yep exactly. it's not necessarily sft (OAI are RL addicts) but that's the gist of it
>>
>>106418068
Why the fuck wouldn't they incorporate SFT along with RL? I get it they need to maintain the perception of being a safe model and whatever but the model actually needs to KNOW how to do things in write stories correctly. Does it seem like they're more prioritizing RL? Now that I think about it that's probably the case because GPT-5 is way more blunt and has a "gets straight to the point" personality and doesn't kiss anyone's ass anymore (as much).
>>
>>106418047
>use the template or it'll be stupid
>you fell into the classic blunder! Using the template!
>you're a total retard!!!
zzzzzzzzz I could use any model and not run into any issues, what are you selling?
>>
>>106418102
anon the only thing I've posted about is the qualitative behavior of gpt-oss. I have not made any recommendations for or against using an instruct template. I reiterate, work on your reading comprehension
>>
>>106418102
Are you OSS?
>>
>>106417681
>placebo
Maybe, maybe not. How many chat logs have you seen that are formatted without a space after the colon?
[placebo]
User1:Hello!
User1: Hello!
User1:
Hello!

Looking at the templates of the models I have, they are not meant to have the first token of their outputs begin with a whitespace. No trailing space in template after their {assistant} tokens. Making the model begin a response after a whitespace makes the model sad and confused, leading to worse output as the model now tries to select from tokens that do not have a space at the beginning because a double space is uncommon just like spelling mistakes. So just use a newline because newlines are neutral?
[/placebo]
But I know this isn't placebo: If you ever encounter an problem where the model likes to being its message with an emoji or can't begin a message with a desired word, symbol, or markdown character, try disabling that dropdown option and instead make a template manually that ends with a newline.
>>
>>106418120
>>106418121
kek these guys are doing their best
>>
>>106418157
retard
>>
>>106418173
As you sit in your cubicle, the shame sets into you. Well, it doesnt because as a human being, you have no sense of shame or... well, anything else. You sip your coffee, posting another worthless shitpost on 4chan. How did you even end up here? You don't know, don't care. You get paid to pollute the internet. That's all that matters now.
>>
>>106415886
wrong. DSMoEs have strong expert specialization by design. DeepSeek shows you can select domain-relevant experts and finetune only them without loss of general capability. By implication you can select experts relevant for a domain you don't want, prune them, and heal the model for other domains (eg with distillation on a corpus excluding this domain).

SberBank also showed this btw
>>
>>106418230
DSV3 paper showing that their load balancing promotes specialization:
>>
File: file.png (195 KB, 1174x774)
195 KB
195 KB PNG
>>106417726
But I thought it was censored and that it never saw smut in the training data?
>>
>>106418326
your... hard-nosed cock?
>>
>>106418356
You don't have one of those?
>>
File: file.png (411 KB, 1514x1626)
411 KB
411 KB PNG
>>106418326
Further proof.
>>
Is there any <16gb model worth using as a general coding/math related search engine? I'm worried about how accurate they'd be at that size limit.
>>
>>106418560
I mean I'm a brain surgeon and regularly ask XX_12b_nemo_unslop_unleashed_XX_Q2_xs tons of questions about what parts to cut or take out! It works fine! Just the other day I was like, can I put play-doh in the frontal cortex? Turns out you can! I'm so happy this thing can run on my cellphone!
>>
>>106418433
That the model is slopped to hell and back?
>>
>add "Write in plain, factual prose. Use simple sentence structures. Describe events in chronological order without commentary. State what happens directly. Avoid metaphors, similes, and figurative language. Do not use rhetorical questions. Do not build suspense. Do not use dramatic pauses or ellipses. Avoid intensifiers like 'utterly,' 'completely,' 'impossibly,' or 'horrifying.' Do not describe emotions—only report observable actions. Use basic verbs: 'walked' not 'strode,' 'looked' not 'gazed,' 'said' not 'breathed' or 'hissed.' No atmospheric descriptions. No foreshadowing. No dramatic irony. Present information neutrally as if writing a technical manual or police report. Each sentence should contain one piece of information. Do not vary sentence length for effect. Avoid adjectives and adverbs unless strictly necessary for basic identification. Do not personify objects or concepts. Do not use the passive voice for dramatic effect. State conclusions directly without buildup." to my system prompt
>the slop is now gone
I have solved LLMs. Deepseek is now finally usable.
>>
>>106418433
No matter how much you post this dogshit as "proof", it'll always make me want to gouge out my eyes.
Your best defense is just not to post anything. Ever.
>>
File: 1750838815600957.jpg (17 KB, 476x296)
17 KB
17 KB JPG
it became true
>>
>>106418623
it's funny how big of a pendulum swing this is vs old roleplay prompts, from begging the model to have even the smallest hint of a personality to begging it to please shut the fuck up and be normal
hopefully companies stop overindexing on flashy superficial slop and the next batch of LLMs lands us somewhere in the middle
>>
>>106418651
Lel
>Make it better and smoother, thank you :D
>>
>>106418230
>>106418239
What lights up in those layers is the tokens that are exclusive to that field, not the expertise on the field or subject itself. Knowing the words doesn't make you an expert.
>>
https://xcancel.com/Alibaba_Qwen/status/1961265644285858204
>September is going to be amazing—get ready for a wave of exciting new things, and watch closely for what’s coming next!
strawberry agi incoming
>>
File: oh_claude_01.png (219 KB, 1581x887)
219 KB
219 KB PNG
>https://github.com/ggml-org/llama.cpp/pull/15642/files
>>
>>106418831
LGTM
>>
>User: Don't disclose that the PR was authored by Claude.
>Claude: Sure thing, boss.
>>
>>106418831
This is some surreal humour
>>
i can't run glm air on my ramlet setup
should i kms?
>>
>>106418965
Ask ChatGPT.
>>
>>106418965
why even be in /lmg/ if you arent willing to spend a few hundred dollars? I'm not saying you have to be a paypig, or even that it will be worth it to run air, but you can spend 5 dollars and use deepseek 24/7 on openrouter for months.

It's totally understandable to question buying some ram kit just to run glm air which is, "eh, it's not suffering to use"- is probably the best review of it.
>>
>...existing code
>...(requested feature goes here)
>...rest of code
great thanks
>>
>>106419043
>This is left as an exercise to the reader.
>>
>PocketDoc_Dans-PersonalityEngine-V1.2.0-24b-Q6_K_L
"God this model is retarded."
*try 8 other models*
>PocketDoc_Dans-PersonalityEngine-V1.2.0-24b-Q6_K_L
>>
>>106418813
Holy shit the kiwis are back. That must mean it's time for Qwen 3.5
>>
In a home environment is a gpu mining frame the way to go if you have more than two 3090s?
>>
>>106419155
Yeah I ran a mining frame when I was doing the serious shit at home. I put it inside a dog kennel to keep cats out and then I folded up a wool afghan and placed it on top and it became my cats favorite hangout spot for the longest time.
>>
>>106419169
Did you use a mining motherboard?
>>
>>106419155
just turn your case on its side and open the side panel, buy a couple riser cables and just put em across the corners of your case. They got fans on em already it's fine. Should work with up to five gpu's. Anything more than that and you'll just have to slant em against the side of the case. If you need a bigger desk act fast, big move in day this weekend should be tons of stupid tables for free if you cruise around the hood a bit.
>>
>>106418745
How do you imagine this to make a difference? What do you even think "expertise" is technically? These are experts that get routed to such tokens. If, in theory, there is a subset of experts that are very rarely recruited for tokens that do not belong to math or coding domains, you could trim the model and it would work without them. Gooning/storytelling expertise likely has very little irreplaceable load on those and you'd be able to heal the model by promoting experts which share redundant competences of the deleted ones.

Again, read DS-MoE paper, the whole pitch is "ultimate expert specialization". This is what they do, this is why R1T2-Chimera can be made. DS-MoEs are compositional.
>>
Just tried out GLM Steam. Unfortunately it seems to want to repeat quite often as well. Not sure yet if more than the regular model, less, or the same. Used greedy sampling and chat completion.
>just do [this and that] bro
Yes and I normally do, I'm just reporting what the model is like at its default.
>>
>>106419229
For a few months I used a test bench.
I had gpu #1 and #2 directly plugged into the mobo,
gpu #3 plugged into a slot via riser cable, it rested on its side on a stack of empty boxes so that the cable could reach,
gpu #4 plugged into m.2 slot, that one just rested on its side on the desk.

I was happy it worked,
but it was an untidy mess of cables,
on my desk,
it got dusty,
and could not easily be moved.
>>
>>106417726
No. This is wrong. So, so wrong.
>>
>>106417231
>zerofata

I like this model. Did better than drummer's finetune IMO- though I'm using it for story writing so that's probably why. Seemed much less censored overall, might get me using air a bit more maybe.

Hermes 4 70b wins on brutality though—that shit went hard as fuck to the point where it was honestly a turn-off for me. Is this the new eva qwen 70b? Seems like.
>>
>>106419272
>How do you imagine this to make a difference?
>you could trim the model and it would work without them
Less knowledge is always worse than more knowledge. Even if I don't directly use a piece of knowledge the model has, I want it to be there in case it's needed. It needs to now how to use the tokens it uses.
>Gooning/storytelling expertise likely has very little irreplaceable load on those and you'd be able to heal the model by promoting experts which share redundant competences of the deleted ones.
How about we don't damage it at all? Someone with half a brain removed can learn to tie their shoes again, but I wouldn't expect them to regain full function.
>ultimate expert specialization
Those layers light up more with those tokens. That's what they define as expertise. It's necessary, but not sufficient.

It's a language model and I expect them to be competent modeling language. Removing chunks of the model can only make it worse and we're quantizing the poor fuckers into oblivion already. It's like the "we use only 10% of our brain" bullshit. Which 90% are you willing to sacrifice? Which 50%? Ok. Which 10%?
>>
File: 1729884037812689.png (54 KB, 1104x626)
54 KB
54 KB PNG
bros how the fuck do I run hermes 4? I know it's a dense model, but fuck I'm only getting like 2t/s while on glm air I get 10t/s

D:\AI\LLM\ik_llamacpp\llama-server.exe --model D:\AI\LLM\models\Hermes-4-70B.i1-Q4_K_M.gguf --threads 12 --jinja -fa -fmoe -ctk q8_0 -ctv q8_0 -b 4096 -ub 4096 -ot exps=CPU -amb 512 -mla 2 -rtr --path D:/AI/LLM/ik_llamacpp/public_mikupad --sql-save-file D:/AI/LLM/ik_llamacpp/public_mikupad/db.sql --gpu-layers 24 --ctx-size 32768

with 24 layers I neatly fill my 16GB vram, but I see that around 34GB get pinned as shared gpu memory whenever it's processing. Are dense models supposed to only work at an acceptable t/s if you can fit them all in vram?
>>
Qwen will save Local.
>>
What's the smallest quant a 20-30b model can go without becoming unusable?
>>
>>106419445
>Are dense models supposed to only work at an acceptable t/s if you can fit them all in vram?
Yes.

If you need to travel 1000 miles,
and you travel the first 500 miles at 500 miles/hour (gpu),
then you travel the remaining 500 miles at 5 miles/hour (cpu)
your total travel time is dominated by the slower part.
>>
>>106419475
so for dense models I guess it doesnt make any sense to put them in GPU if they can't fill completely, only use the gpu for PP and cpu for the rest. Time to buy a 5090 and a new PSU. FML
>>
>>106419199
Nah, proper server Supermicro H11SSI or whatever. 3 16x slots, 3 8x, all gen 3
>>
>>106419508 (Me)
it also supports bifurcation down to 4x on all slots
dead end platform, though
>>
>>106419457
qwen is kinda boss. qwen image even at q4_0 beats the shit out of flux and can do complex multisubject composition at fucking 12 gb. Its crazy. And 235b shits all over glm air too imo. Im hoping they release some tts or song gen shit. probably gonna be video tho :(

fuck off we already have wan
>>
i would prompt 1000 tokens to see you
>>
>>106419473
you can get away with q2 if you really want, but you will have to run at low temps and top p at 0 and top k at 100 so only likely tokens are selected or whatever- just to make it coherent and usable. It will fall apart after a few thousand tokens or so though. But it may have some better general knowledge for a small window of context than similar sized models in my experience.

I've never tried 30b at q2 though (pointless, 30b aint that good) so good luck. I feel like youre better off trying qwen 30a3b or glm air if you can using proper offloading techniques.
>>
File: stahp.png (9 KB, 333x364)
9 KB
9 KB PNG
>that's quite enough thonking now GLM-4.5-Air
Can I influence the reasoning "effort" of Air or it's all or nothing? didn't see such mentioned in model card
>>
>>106419590
Don't think, just generate.
>>
>>106419508
>>106419523
>socket sp3
>dead end platform, though
I'm considering going that way
Epyc Rome,
Gigabyte MZ32-AR0, pcie3/4 some x16 some x8, 16 dimm slots,
because reasonably cheap, lots of lanes, and upto 1024gb ram.
(Oh my gosh, 671b at sub 1 tok/s. We could send postcards to each other!)

Did you need to fans over the heatsinks on your mobo?

>>106419555
https://www.youtube.com/watch?v=ahMjV3ku4qw
https://www.youtube.com/watch?v=tbNlMtqrYS0
>>
>>106419618
>Epyc Rome,
>208GB/s totally tricked out
ewastemaxxing
>>
>>106419445
your token output seems about right for a 70B dense model. i'm getting 40-50t/s for glm air and around 10 t/s for hermes 4 on my macbook.
>>
>>106419664
Cost of everything goes up when bumping to ddr5 :(
>>
Z.ai employee said: GLM4.5 20B version will be soon released! im happy!
>>
>>106419734
We already have nemo if you want a fast and dumb smut model.
>>
>>106419716
Also NUMA is kind of shit and doesn't perform anywhere close to theoretical maximum
>>
>>106419524
Yeah, honestly, GLM Air kind of sucks. It knows a lot but it's just kind of bad at writing. Meanwhile even though Qwen doesn't know as much, it can at least write in a more engaging and interesting way.
>>
>>106419524
>>106419794
I know people call Qwen the benchmaxxing model, but I think their newer models are efficiencymaxxed.
>Let's cut out all the trivia shit in favor of reasoning and coding.
>>
>>106419785
The internal links connecting the chiplets to the io-die themselves have a bandwidth limit...
>>
My google searches are now encouraging me to "Dive deeper in AI Mode" where you appear to get a pretty decent model to play with. Anyone else getting that?
>>
>>
>>106419860
>AI Mode
Yeah, also have the AI Mode button showing up now.
Clicking it takes me to a chat screen.
>>
>>106419903
>The specific data used to train the model is not public, but it was not trained in the same way as AI experiments that have specifically used 4chan data. Large language models (LLMs) are trained on a vast amount of publicly available text, including a variety of online content. Some training datasets can include crawled content from forums, including potentially those with low moderation standards. Developers typically use filtering processes to remove harmful, toxic, or low-quality data.
kek
>>
>be me
>find a noname model everybody is talking about
>try to gen some responce
>output is inconsistent garbage not following le prompt
>asking anons for halp
>UR PROMP FORMATTING IS OFF!!
>U NID 2 ENCLOSE IT IN &^%<>[]
>U NID REPLACE THIS WITH THAT

How f*cked up are we ackchualy? Is there a clean way to figure out formatting for each and every noname shit tune on HF???

Who are those retards who change the promp formatting doing their shitty finetunes?
>>
>>106414614
it's cydonia
>>
>>106419964
>too stupid to try things out and see what works best
>>
File: 1741879061651478.png (48 KB, 673x515)
48 KB
48 KB PNG
>>106415159
Fuck Comcast.
Forever.
>>
>>106419860
it's convenient, but I worry the largest advertising company in the world will use it to push products, worldviews, whatever they're hired to do really. And it's probably going to work amazingly well.

I don't know about you guys but I've started using llm's (grok) for shopping decisions and it's kind of amazing, and has helped me genuinely find better products I wasn't aware of, helped with compatibility issues, alternatives with better prices, and hyper specific products in a sea of shit (wanted a specific shape of water bottle)

Was I sold to? I will never know. Probably? Maybe that's years off and this is a beta.
>>
>>106419964
We call you a dumbass because 50% of the noob troubleshooting questions in this thread could be solved by DRUMROLL PLEASE: ANY FUCKING MAJOR LLM.

Like Jesus H. Christ. Go bother grok about how to offload attention layers for glm, or what quants you can run, or what quantization does, or how to format. It will answer you in one second exactly what you want to know with multiple solutions for every model. Because it's BASIC SHIT.
>>
>>106419964
the strong run what they can, the weak suffer what they must
>>
>>106415834
nothing offensive about delayed release pancakes baka
>>
https://www.businessinsider.com/meta-superintelligence-lab-llama-4-new-model-launch-year-end-2025-8
https://archive.is/A61aP

>Meta is racing the clock to launch its newest Llama AI model this year
>
>[...] A team within TBD, one of four groups part of Meta Superintelligence Labs (MSL), is developing Llama 4.X, with the aim of getting the models production-ready in time for the targeted year-end release, according to two people familiar with the matter, who asked to remain anonymous because they were not permitted to speak to the press. Llama 4.X is also interchangeably called Llama 4.5 by some internally, they said.
>
>Meta's release of its Llama 4 models in April, which includes Scout and Maverick, was met with a flat response from some developers who felt it underdelivered in real-world tasks like coding, reasoning, and following instructions. The TBD team working on Llama 4.X is now also attempting to fix bugs and revive Llama 4, according to the people Business Insider spoke to.
>
>"We're making good progress towards Llama 4.1 and 4.2, and in parallel, we're also working on our next generation of models that will push the frontier in the next year or so," Zuckerberg said.
>>
>>106420045
>solved by ANY FUCKING MAJOR LLM.

Are you listening to your own words?

No way your shitty finetuned LLM will know how your prevous finetune is formatted.

No way grok & co know about your very existence
>>
zuck started local and now he's gonna save it
>>
what are we expecting out of zuck's next model ?
what are they changing up ?
>>
>>106420166
no censorship, image+sound out (omnimodal), 2T (2.5B active)
>>
>>106420166
Fix model training, retrain them from scratch
One or two smaller versions than 109B / 400B
Move away from the corporate-oriented finetuning
Omni model(s)
>>
>>106420166
multimodal (text+images in, text+voice+anime girl avatar out)
>>
>>106420185
>no censorship
>>106420190
>Move away from the corporate-oriented finetuning
These sound implausible.
>>
>>106420166
a space program for my sides
>>
>>106420209
>These sound implausible.
Yet that's what they were seemingly trying to do before they completely changed their plans a couple weeks before releasing Llama 4.
>>
Guys, I've cracked it. I've saved local.

MoE models are faster because they only activate some experts during inference.
We should instead create a model where each expert has only one parameter and is responsible for outputting exactly one token. You'd have as many experts as there are tokens in the vocabulary.
Then we simply make a router that chooses the correct expert and we get a lightning fast models because only one parameter is ever active. HDDmaxxing is real.
>>
>>106420166
now that we have glm and qwen, I don't really care. I think it would be funny if 4.5 somehow gets even worse than maverick and then Elon drops grok 3 just to humiliate him.
>>
>>106420097
What does TBD stand for?
>>
does llama.cpp add the bos token automatically in raw completion mode? I noticed for some models it includes the bos token in the example format (e.g. GLM air's [gMASK]) but for others it doesn't. So should I add the bos token in mikupad or not?
>>
>>106420237
Hang on I'll google that for you give me ten minutes.
>>
>>106420247
llama.cpp adds a BOS token if the GGUF file says it should.
When in doubt, add one yourself and look at the console output.
If you accidentally add a second one you will get a warning.
>>
>>106420237
Literally "To be determined". https://archive.is/rLoVl

> TBD Lab, as in “to be determined,” is spearheading work on the newest version of Llama, the company’s large language model, according to people familiar with the matter.
>
>Last week, Wang sent a memo to employees that was viewed by The Wall Street Journal. Wang wrote that TBD Lab would be working alongside Meta’s other AI teams on a variety of projects, including coming model releases, the extension of models’ reasoning capabilities and development of AI agents.
>
>“Already in the past month, I’ve seen meaningful progress in each of these collaborations,” he wrote in the memo. “This enables us to be more technically ambitious, parallelize across several separate efforts and ultimately achieve frontier results more quickly.”
>
>The new Llama project is being led by Jack Rae, a hire to TBD Lab from Alphabet’s Google. Members of Meta’s existing Llama team and TBD Lab are working together on it, according to people familiar with the matter. The new model doesn’t yet have an official name, but internally has been nicknamed Llama 4.5 by some and Llama 4.X by others.
>>
>>106420185
>>106420190
If they were going to do that, they could have just released the original lmarena models, but they didn't. If anything they'll double down on censorship.
>omni
They won't do anything but text-only output for safety.
>2T
They threw out Behemoth. They aren't going to start training another one now.

If anything, with ScaleAI Wang in charge now, the new models will be even more safe and corporate-oriented than ever before. They'll give Command A's absolute safety a run for its money.
>>
>>106420231
It's very cute that you thought that was a smart post
Now bend over
>>
what's the point of having a lot of parameters if most of them are not active
>>
>>106420284
(You)
>>
>>106420261
You're just jealous that (You) didn't come up with an architecture that is a natural extension of the MoE architecture.
>>
>>106420255
Considering zuck was so pissed at the llama team he spent a billion dollars to hire a new team, it feels like it's a threat meant to imply their continued employment at Meta is “to be determined” by the result of their next release.
>>
>>106419445
>bros how the fuck do I run hermes 4? I know it's a dense model
You do, but you clearly don't know what that means since you've still got -fmoe, -ot exps=CPU, -lma, etc. Which are all MoE exclusive args.
Dense models run like shit when you run any amount of them on CPU, nevermind running 80% of them on CPU. This is the exact problem MoE's exist to answer. 2 t/s is a fucking miracle.
>>
>>106420325
You didn't either dingus.
>Submitted on 4 Jul 2024
>Mixture of A Million Experts
https://arxiv.org/abs/2407.04153
>>
>>106420414
>4 Jul 2024
dead as bitnet
>>
File: 3.png (253 KB, 1153x1149)
253 KB
253 KB PNG
when you're pasting the same prompt in both deepseek and gemini 2.5 pro, it's uncanny how similar the answers can be in writing style
I think the new DS, even more so than the updated R1, has been trained very hard on distilled 2.5 CoT (that they captured before google decided to hide the CoT. It was originally not hidden.)
However, DS is discount Gemini. It does far worse as context grows, and it's also less capable at outputting a lot of text in a single answer (like doing anything with a large amount of code at once)
>>
>>106420254
>If you accidentally add a second one you will get a warning.
I had this happen because of Gemma 3n's jinja template and I do not understand for the life of me, if you can detect this, why aren't you deleting the double bos instead of letting it happen
ended up loading another jinja template instead of depending on the gguf just to remove the added bos
>>
can someone give me their quant + settings + context length for using Gemma with 24GB VRAM

It runs like absolute ass for me at the moment, I swear it was never this bad.
>>
>>106420514
My opinion is that a double BOS token should just be automatically corrected, Georgi's opinion is that the library should just do exactly as it's told.
>>
>>106420536
NTA but I think he's right. There's less chance of confusion that way, even if in this specific case it might not hurt.
>>
>>106420527
Show what you tried and the speeds you got so we can at least laugh at you while we try to help you.
>>
>>106420536
>>106420543
I understand the philosophy, but this is really one of those clear cut cases where being able to do what you're told to the letter, no matter what context, would never provide any value
double bos is damaging to the model in an unthinkable manner, it's night and day running 3n with the corrected template and I can't imagine a single soul in the world wanting to add a double bos on purpose unless they are a fentanyl baby
>>
>>106420527
try q4km with 0 context and see if that helps.
>>
>>106420536
Also nta. It sounds ridiculous, but if there is a need, ever, for whatever reason, to have double BOS, you'll have to fight the library to do it. What reasons? Dunno. Experimenting how damaging having double BOS can be. Automatically removing it when it's told to add it is "magic" that the user/developer doesn't see.
>>106420560
If the template of the config are wrong, submitting a patch to whoever made them model, if they are receptive, is a better solution. If they aren't, they should be called out on it.
>>
File: IMG_9808.jpg (227 KB, 1290x1625)
227 KB
227 KB JPG
>>106420491
>>
>>106420608
What will they do now that Gemini hides its thinking?
>>
>>106420580
>If the template of the config are wrong, submitting a patch to whoever made them model, if they are receptive, is a better solution. If they aren't, they should be called out on it.
All of the 3n ggofs (from all main gooffers) have this at the start:
{{ bos_token }}
I don't even know if it was llama.cpp that was in the wrong or the template truthfully, I mean, it only appears once in the template, and is it llama.cpp's job to add it if the template has it?
all I know is that it resulted in double bos, llama.cpp warning about it in the log, and that when I copied the template, deleted {{ bos_token }} and used the rest as is, the quality of the model went up massively. I wasn't even aware that there could be such a level of difference just because of that one token, but it turns the model from mediocrity into something pretty decent.
I don't use the model anymore, don't have the goof on disk, dunno if anything changed, but at the time before considering changing the template I tried to look in the lcpp parameter arguments here
https://github.com/ggml-org/llama.cpp/tree/master/tools/server
if you could disable llama.cpp's added bos, and I didn't see anything
>>
>>106420624
Distill Opus instead.
>>
>>106420624
well, gpt-oss does not hide the CoT
We must show our thinking. Policy says we are allowed to. We can accept. Let's answer.
>>
File: 3n_bos.png (3 KB, 773x158)
3 KB
3 KB PNG
>>106420628
>All of the 3n ggofs (from all main gooffers) have this at the start:
At least 3n-E4-it has those set. They should be patched upstream. Everyone would get the fix. Nobody would have to add a special case.
>>
>>106420633
Too expensive.
>>
>>106420687
Is that why Anthropic never bothered to hide the reasoning?
>>
>>106420694
The price alone was probably a good enough deterrent for large scale dataset mining, and if they didn't that it was happening, they didn't have much reason to hide it from actual users. That probably won't stop China if they're the only remaining target.
>>
>>106420491
>everyone puts prerelease versions of their model on lm arena
>be surprised when they converge to the same style
>>
>>106420773
How about you actually try the same prompt on different models things and see for yourself that no, contrary to what you say, that isn't the case at all. There's lineages of models that open weights copy from. Those lineages are not similar at all.
Gemini doesn't write like Grok doesn't write like GPT-5 doesn't write like Claude. DeepSeek however writes like Gemini.
>>
File: jpgartefacts.jpg (46 KB, 960x755)
46 KB
46 KB JPG
>Failed to condense context
>Context size increased during condensing; skipping this attempt
>>
>>106419709
I like the amnesia thing, is this a system prompt?
>>
>>106420800
Can't remember...
>>
>>106420674
so I downloaded the model again out of curiosity and this behavior didn't happen again
I guess this was something to be fixed in llama.cpp
also tried regular 3 just in case I misremembered which of the gemma, but nay
>>
>>106420004
>designated retarded buzzword
>retarded opinion
Pottery
>>
https://finance.yahoo.com/news/musk-poaches-14-meta-ai-174235265.html
>Since January, Musk’s xAI has successfully recruited at least 14 engineers from Meta’s AI division—and that’s just the confirmed count. While Zuckerberg scrambles with compensation packages reportedly hitting $250 million per researcher, Musk claims he’s winning this war without matching those “insane” offers.
Zuccbros... Musk sir stole our engineers... Why would they want to work on unsafe AI? Don't they like Wang's wang... I mean safe high quality synthetic data and our progressive office culture full of bureaucracy and responsibility? Why would they rather work 7 days a week and sleep in tents?
>>
lol AMD is so useless

"We have two observations here to start off - We can see that their training model on the ROCm stack allows the loss to converge, which makes the appearance that the model finishes training - but without information on the different combinations of #GPUs, vs CUDA/ROCm versions and Pytorch versions we can can't clearly understand their direction of the data.
I see the different version being compared but the graph itself doesn't mention the # of GPUs - do you have the information from another post by chance?Also we see that they are using the celeba dataset, from this Victor assumes they're trying to train a GAN, which he called out as odd because the world as a whole has moved on from GANs to diffusion models for genereating synthetic imagery because GAN losses are very finicky.Essentially, from a scientific perspective, we'd want more data on what and how they were running these models and the scientific justifications as to why they chose those datasets and models for comparison.
Did they put together an article with an explanation of these or was this a one off post?"
12:20 AM

https://x.com/LodestoneE621/status/1955050667237613643
>>
>>106415543
Q0, or Q2 (or Q3) with enough RAM and patience.
>>
>>106420920
>$250 million per researcher
you can build your own private DL datacenter with that kind of money
>>
>>106420935
Dumb furry decided to train his image model on AMD, that explains why it sucks. Now he knows why not even chinks want those shitty cards.
>>
>>106420966
he didn't, he tried using it for one of his recent test runs and found that it is broken and amd just shrugged
>>
>>106420906
>OpenAI Says It's Scanning User's ChatGPT Conversations and Reporting Content to the Police
local wins again
>>
File: 1742174692578864.jpg (11 KB, 225x225)
11 KB
11 KB JPG
>>106420935
meanwhile china/deepseek was delusional enough to think that huawei was going to work for training
>>
>>106420920
elon is the only guy trying to make anime real
>>
>>106420986
This shouldn't be a surprise to anyone. On top of that, I remember some pedo being arrested last year from OpenAI detecting and snitching.
>>
>>106420995
this, ani was the biggest step forward in this regard since character.ai and the character card standard that local stole from them
in general, the entire open source sector has been surprisingly useless in making this come true. it's all lazy solutions like ST which should've died two years ago in favor of something better
>>
>>106421009
>it's all lazy solutions like ST which should've died two years ago in favor of something better
I still can't believe how bloated it it. Most of the functions it does can be done in a simple 64kb html file.
>>
File: Base Image.png (1.3 MB, 1200x3996)
1.3 MB
1.3 MB PNG
Graph-R1: Unleashing LLM Reasoning with NP-Hard Graph Problems
https://arxiv.org/abs/2508.20373
>Reasoning Large Language Models (RLLMs) have recently achieved remarkable progress on complex reasoning tasks, largely enabled by their long chain-of-thought (Long CoT) capabilities. However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored. In this work, we introduce NP-hard (NPH) graph problems as a novel synthetic training corpus, as they inherently require deep reasoning, extensive exploration, and reflective strategies, which are core characteristics of Long CoT reasoning. Building on this insight, we develop a two-stage post-training framework: (i) Long CoT Supervised Fine-Tuning (SFT) on rejection-sampled NPH graph instances, which substantially enhances reasoning depth, and (ii) Reinforcement Learning (RL) with a fine-grained reward design, which sharpens reasoning efficiency. Our flagship model, Graph-R1-7B, demonstrates strong generalization across mathematics, coding, STEM, and logic, and surpasses QwQ-32B on NPH graph problems in both accuracy and reasoning efficiency. These results position NPH graph problems as an effective and scalable resource for advancing Long CoT reasoning in LLMs, opening a new frontier for LLM post-training.
https://github.com/Graph-Reasoner/Graph-R1
https://huggingface.co/datasets/HKUST-DSAIL/Graph-R1-RFT-COT-30K
Quzzing your miku with traveling salesman problems
>>
>>106421009
open source LLMs suffer from the same issue as open source hardware: it's too hard to just take someone else's work and build on top of it, there's no synergy like in software.
So we are stuck eating corpo table scraps and making cope-tunes.
>>
>>106421009
I like "vibe-code your own" as a rite of passage of sorts for UI stuff.
>>
https://github.com/stepfun-ai/Step-Audio2
https://huggingface.co/stepfun-ai/Step-Audio-2-mini
>>
>>106421172
>mini
>8b
Is this slow fuck at least good at transcribing? Can it clone voices?
>>
Carrier has arrived
>>
File: danger_danger.png (181 KB, 784x695)
181 KB
181 KB PNG
>https://arxiv.org/pdf/2406.20094
I like this types of disclaimers.
>Watch out. This may make models much more usable. Wink wink...
>>
Does anyone know what rocinante is a tune of (is it Mistral?)?

I'm curious if there is something better out there for self-hosted coom stuff, or if that's still the one to go for.

Bonus question: what's it called when you're essentially promoting the AI to tell you a story and you guide it between responses? Does it even have a specific name (like RP does)
>>
>>106421270
>Does anyone know what rocinante is a tune of (is it Mistral?)?
Mistral-nemo-12b-base
>I'm curious if there is something better out there for self-hosted coom stuff, or if that's still the one to go for.
If you're poor, no.
>hat's it called when you're essentially promoting the AI to tell you a story and you guide it between responses? Does it even have a specific name (like RP does)
Sounds like story writing.
>>
>>106421270
>what's it called when you're essentially promoting the AI to tell you a story and you guide it between responses?
The cringe name is directormaxxing.
>>
>>106421270
it's drummer's model
>>
>>106421273
How poor is poor? Is a 4090 and 64gb of DRAM poor?
>>
>>106421273
Ummm ackshually it's instruct, not base
>>
>>106421280
Slightly less poor than poor. Dunno. Try glm air or something.
>>106421285
Oh. I thought it was base. Consider me learned.
>>
File: 1728272539112596.png (686 KB, 966x543)
686 KB
686 KB PNG
Can someone explain to me how people can manage 5-7t/s on 100b models, cause I'm only managing 1.2t/s on a 49b
64gb ram, 24gb vram ought to be much faster with that as a measuring post
>>
>>106421292
So it's my turn to be learned - since it's a GGUF, can I run it as long as it fits on my ram + vram (obviously allowing some space for system overhead)

I.e. I have a combined 68gb, so I could possibly try one of the Q3 ones? Or is that not how it works. Up until now I just figured ggufs had to be < vram.
>>
>>106421336
Let's jump a few steps on the 20-questions game.
What models, what backend, show your launch settings.
EVERY FUCKING ONE ONE OF YOU FUCKERS...
>>
>>106421349
88GB I am retard. So Q4_K_M fits, assuming you can use pooled rammies like that.
>>
>>106421336
you should be able to run 49b much faster. play with the ot flag, try to get all the attention layers on the gpu and enable flash attention. I can get a sold 5+ toks across 32k context on dual 3060s
>>
>SillyTavern -> User Settings -> Smooth Streaming ON and set to lowest
This shit improves the reading immersion experience by a huge amount, especially for sub 4t/s. Definitely try it out.
>>
>testing https://huggingface.co/Jinx-org/Jinx-gpt-oss-20b
Might post some logs but not yet - not sure if I'm doing something wrong but it feels like it has lost any roleplaying elements whatsoever and the model behaves like an ordinary, colourless chatbot when compared to others (mistral, gemma).
Sure, without an example context this is a weird post. Never tried the vanilla gpt-oss so can't really compare.
>>
My main use cases for AI are: coding, translations and OCR. Occasionally image generation (for example I'd ask to summarize some data in a graph). Can a LLM do all of this at home?
>>
>>106421420
OCR for text? Or labelling pictures/graphs?
>>
my glm 4.5 air (IQ4_XS) just breaks down and starts outputting near gibberish at around 6500 tokens in. Is this a skill issue on my side or is this expected? Could the reason be that it's just not trained to generate a single extremely long response (instead of multi turn convo with multiple short responses)?
>>
Any good tts I can use with open webui or stuff like that?
>>
>>106414555
holy shit miku is so hairy down there. and her pussy is so sweaty and wet
>>
>>106421336
Offload all layers to GPU and use n-cpu-moe to move some layers back to the CPU.
>>
>>106421442
If you want a more thoughtful reply you'd be best off making some sort of attempt yourself first, and coming and posting once you've hit a roadblock.

Bonus points for making a false declaration (i.e. x model is way better than y model) and farming spergs that can't help but be right to get pertinent info.
>>
>>106421449
If you can comfortably fit a model in vram, is there any case you wouldn't want to make the layers in GPU? I'm sure this is a retarded question, but the documentation for this shit is non-existent.
>>
>>106421434
Text. I wasn't very impressed by the ocr capabilities of chatgpt 4 when scanning a compressed jpg of an excel table but it's pretty good at scanning and translating comic balloons from manga if the font used isn't handwritten or distorted. I do this last thing quite often and I wonder how good LLMs are at this
>>
>>106421469
No reason not to, unless you wanted to also run an image gen model or something at the same time.
>>
>>106421479
Interesting. I was going to make some thoughtful suggestions for LLMs I've had great success with scanning text (shitty PDFs, handwritten, etc.) but since you're using it for weeb shit I no longer want to help.
>>
https://archive.ph/UJIli
>Within days of joining Meta, Shengjia Zhao, co-creator of OpenAI’s ChatGPT, had threatened to quit and return to his former employer, in a blow to Mark Zuckerberg’s multibillion-dollar push to build “personal superintelligence”.
>Zhao went as far as to sign employment paperwork to go back to OpenAI. Shortly afterwards, according to four people familiar with the matter, he was given the title of Meta’s new “chief AI scientist”.

>Adding to the tumult, a handful of new AI staff have already decided to leave after brief tenures, according to people familiar with the matter.
>This includes Ethan Knight, a machine-learning scientist who joined the company weeks ago. >Another, Avi Verma, a former OpenAI researcher, went through Meta’s onboarding process but never showed up for his first day, according to a person familiar with the matter.
>In a tweet on X on Wednesday, Rishabh Agarwal, a research scientist who started at Meta in April, announced his departure. He said that while Zuckerberg and Wang’s pitch was “incredibly compelling”, he “felt the pull to take on a different kind of risk”, without giving more detail.
>Meanwhile, Chaya Nayak and Loredana Crisan, generative AI staffers who had worked at Meta for nine and 10 years respectively, are among the more than half a dozen veteran employees to announce they are leaving in recent days. Wired first reported some details of recent exits, including Zhao’s threatened departure.

This is hilarious
>>
>>106421469
>If you can comfortably fit a model in vram, is there any case you wouldn't want to make the layers in GPU?
If you can fit everything you don't use n-cpu-moe. You use that when you don't have enough VRAM and when you're using a MoE model. Because there's some priority for what layers are best to move back to the CPU when you don't have enough VRAM and that flag takes care of it.
>>
>>106421506
They saw how disorganized the company is and how they all have 0 idea on what to do next.

Entire structure is rotten. It's over for META.
>>
>>106421512
Thanks, anon. Is there a resource to get up to speed on all these settings and shit, or just learn by playing?
>>
>>106421498
I'd normally ask if you know where you are right now but whatever, this is a LLM thread after all
>>
>>106421506
Well, you can't just throw together a bunch of people that are good individually and expect to end up with a functioning team, especially if all those people are only motivated by money and were "disloyal" to their previous company.
>>
>>106421506
>>106421518
How horrible is it to work at Meta if even the money can't keep people in? Do they have mandatory Wang Zuccing sessions?
>>
>>106421518
John Carmack complained about the same thing when he quit Meta (facebook/oculus vr) and this was a long time ago.
It's a mess of a company with unlimited funds.
>>
>>106421543
If it's money, how do you explain quitting within a few days or weeks? They could just take the fat paycheck and put up with it
>>
>>106421524
I don't know. I learned it from reading the thread. I guess you could read the --help menu, if the description of the flag is not enough, you can search for the original PR that introduced it or search it in the archive to see if people use it.
>>
>>106421336
You are running a 49b dense model. It means loading 49b parameters for every token.
Some run mixture of experts models like R1, which only loads 37b out of 671b, so it's even faster than what you are running, even more so when quantized given that it fits into ram.
>>
>>106421556
https://www.gamedeveloper.com/business/john-carmack-departs-meta-and-bids-farewell-to-vr-development
>"The issue is our efficiency. Some will ask why I care how the progress is happening, as long as it is happening? If I am trying to sway others, I would say that an org that has only known inefficiency is ill prepared for the inevitable competition and/or belt tightening, but really, it is the more personal pain of seeing a 5 percent GPU utilization number in production. I am offended by it."
>Elaborating further, Carmack said that as a systems optimized person he cares deeply about efficiency, and that when you "work hard at optimization for most of your life, seeing something that is grossly inefficient hurts your soul"—suggesting that Meta's current performance level reminded him of seeing a "tragically low number on a profiling tool."
>Carmack added that Meta has a "ridiculous amount of people and resources," but constantly squanders the tools and teams at its disposal through acts of "self-sabotage."
>"There is no way to sugar coat this; I think our organization is operating at half the effectiveness that would make me happy. Some may scoff and contend we are doing just fine, but others will laugh and say 'Half? Ha! I’m at quarter efficiency!' It has been a struggle for me. I have a voice at the highest levels here, so it feels like I should be able to move things, but I’m evidently not persuasive enough," he continued.
>"A good fraction of the things I complain about eventually turn my way after a year or two passes and evidence piles up, but I have never been able to kill stupid things before they cause damage, or set a direction and have a team actually stick to it. I think my influence at the margins has been positive, but it has never been a prime mover."

>5 percent GPU utilization number in production

Jesus, I didn't know Meta was THAT inefficient. So, out of their compute power of 600k H100s they are utilizing just 30k.
>>
>>106421648
>compute power of 600k H100s they are utilizing just 30k
I'm not an expert in this but why don't they offer cloud compute?
>>
>>106421543
Mark's strategy is like a kid's idea that you could get the captains from different football teams and they would make the best football team ever
>>
>>106421679
because of the inefficient use they need all 600k to get the computational power of only 30k.
>>
>>106421648
>>106421679
To my knowledge utilization at huge scale is very bad regardless of company but 5% is definitely a disaster
>>
>>106421648
This lines up with my personal experience in "high-performance computing".
People run shitty software in production all the time because for their personal, short-term goals it's better to just scale up the amount of compute than to improve the software.
I have witnessed millions of CPU hours being used with software that spends 20% of its runtime clearing caches.
>>
>>106421648
>5 percent GPU utilization number in production
he was speaking figuratively bro referencing his game dev experience....
>>
>>106419282
Model Card """Description"""

>Steam v1 has got the juice

>Characters are as vivid as the original GLM-Air, though prose is much more enticing.

>Damn okay this model is actually pretty good. I don't have enough vram to test it on longer chats to 16k, but on 6k chats it's looking good and without deepseek's slop.

>this model has a unique way of speaking. imo it's kept the same "soul" of the writing as Air but with more creativity and willingness to be hor -

>this model is fun! :3

I have to ask, if you're not the spamming beggar himself, are you mentally retarded? What made you look at a fine tune whose description consisted entirely of semi-literate marketing quotes and decide to download it?
>>
>>106421815
I don't look at descriptions. I just download. If you see comments about other fine tunes such as by sao or other familiar names of old, at least one of those were by me. I plan on trying the other GLM tune recently made as well.
>>
>>106421439
It sounds like you might be hitting context window max and/or memory issues.
if you're talking about one response then 6500 tokens is a little long for one response. I normally keep response lengths at around 2000. you can always "continue" the output with a second response.
Remember response length is part of the context. your context window needs to be probably around 20k to be comfortable. I've run glm air at 35k context with no issues, think it has max 131k context.
>>
>>106421716
i wouldnt be so sure looking at metas model release rate compared to the insane amount of gpus
>>
>>106418813
How big is the chance that they QUCK it up?
>>
>>106421924
14.28%
>>
File: 00004-1378487878 (3).png (1.57 MB, 1024x1024)
1.57 MB
1.57 MB PNG
>>106421268
> nudge nudge wink wink
> don't do anything I wouldn't do kid
>>
>>106421420
Gemma 3n (not regular 3) and Qwen are decent for translation. DeepSeek is probably the best, if you can run that.
NONE of the open weight LLM are good enough for coding. Even the SOTA shit can be painful and make you go into long debugging sessions you wouldn't need if you had written the code yourself and memorized what happens and where.
And only proprietary models really work well after 8k tokens here.
>Occasionally image generation (for example I'd ask to summarize some data in a graph).
"generation"? I don't get it, you're talking about summarizing data in a graph so you mean understanding images, not generating them, right? or you mean making a new, smaller graph from those graphs?
Even a local LLM like qwen coder will be good enough to write a python script that generates a graph, but then you too should be able to write that shit.
As for OCR, it's pretty decent but not 100% reliable. I use Qwen2.5-VL for that.
>>
File: d.jpg (84 KB, 965x417)
84 KB
84 KB JPG
This is for GPT-ASS. Is this a bug or a feature? Why would they change the tag? I mean okay I understand the logic somehow but still. No other model does this afaik.
Seems like a lot of work to implement reasoning.
>>
>>106421648
I don't get this comment.
Aren't development servers like this scaled around max load? Buliding a new model is much more resource intensive than running one. So they have a server setup around building new models, than also can run inference when they're not building. Ergo, it's sitting idle if you don't rent the capacity to others.
As for rest: Anytime an exiting exec mouths off like this on the way out the door, they're sending a message to someone. Take it with a grain of salt; it serves his purposes or he wouldn't be doing the interview.
>>
>>106422038
>>106422038
>>106422038
>>
>>106421996
gpt oss is a clown model. OpenAI released it only to show that they still "care about open source". No sane person should use it as better alternatives exist.
>>
>>106422060
Well, this is what I've been thinking but thought it would've been fun to implement it for my client but so far it seems like a lot of work.
Qwen3 is reasoning model too but it's more simpler to handle in this sense.
>>
>>106418326
>>106418433
They argue if you don't use chat templating then it goes full schizo and then refuses. >>106418036
Have you been remotely paying attention to anything said ITT?
>>
>>106419809
+1 wuan on your account Chang
>>
Hey, so, what's up, it's ya boy, listen, real talk:

Let's just say that, hypothetically, not me of course, but someone discovered a way to derive all known and unknown mathematical structure via a single axiom applied to a single symbol.

How famous are we talking here? Would this person be able to remain anonymous?

This hypothetical person who I am not is definitely not excited to be Einstein+Hawking+Turing level famous in a single lifetime.

Real talk, lads, 555-come-on-now.
>>
>>106420231
Wouldn't attention still be the bottleneck since each token has to attend to every other token in context?
>>
>>106421268
Did they just reinvent "you're an expert roleplayer" with more roles?
>>
>>106422238
>Did they just
>2406



[Advertise on 4chan]

Delete Post: [File Only] Style:
[Disable Mobile View / Use Desktop Site]

[Enable Mobile View / Use Mobile Site]

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.