/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106691703 & >>106683141

►News
>(09/25) Japanese Stockmark-2-100B-Instruct released: https://hf.co/stockmark/Stockmark-2-100B-Instruct
>(09/24) Meta FAIR releases 32B Code World Model: https://hf.co/facebook/cwm
>(09/23) Qwen3-VL released: https://hf.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe
>(09/22) RIP Miku.sh: https://github.com/ggml-org/llama.cpp/pull/16174
>(09/22) Qwen3-Omni released: https://hf.co/collections/Qwen/qwen3-omni-68d100a86cd0906843ceccbe

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>106691703

--Paper: Video models are zero-shot learners and reasoners:
>106693137 >106693742 >106693747 >106693815
--Paper: OpenAI's GDPval benchmark for real-world task evaluation:
>106697551 >106697750
--Paper (old): LoRA vs full finetuning tradeoffs in memory efficiency and catastrophic forgetting:
>106694498 >106694516 >106694544 >106694577 >106694791 >106694754 >106694767 >106694783 >106694806 >106694997 >106694769 >106694608
--Qwen30b performance validation and UI preference debate:
>106692077 >106692209 >106692243 >106692309 >106692435 >106692216 >106694370 >106694404 >106694745
--Skepticism over extreme AI model scaling and context length claims:
>106694254 >106694280 >106694483 >106694504 >106694310 >106694317 >106694344 >106694379 >106694871 >106695760 >106695741
--Japanese-focused 100B parameter LLM with mixed language training and synthetic fine-tuning:
>106696218 >106696286
--China's new CUDA-compatible GPU with 112GB HBM memory:
>106695558 >106695680
--Character roleplay finetuning: LoRA tradeoffs vs full finetuning feasibility:
>106693437 >106693460 >106694001 >106694071 >106694142 >106694159 >106694169 >106694177 >106694187 >106694354 >106694384 >106694394 >106694412 >106695772 >106695995 >106696073 >106696857
--Local model viability concerns amid growing parameter sizes:
>106694931 >106694955 >106694972 >106694993 >106695014 >106695159 >106695157 >106695214 >106695818 >106695888 >106695929 >106696168 >106696188 >106696342 >106696515 >106694966
--Evaluating model quantization performance:
>106697433 >106697475 >106697834 >106697871 >106698144 >106697938 >106697975 >106698355 >106697981 >106697493 >106697796
--New model evaluations and prompt template updates:
>106693183 >106693189 >106693400 >106693527 >106693770
--Miku (free space):
>106695552 >106696627 >106700430

►Recent Highlight Posts from the Previous Thread: >>106691706

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
only mistral large 3 can save me now
>>106700424
>Stockmark-2-100B-Instruct
>it's dense
unfortunate
nonetheless, resident densefag, please post cockbench
>>106700488*tries to save u*
How do I get a footjob
>>106700644
sorry, we only have robo-arms, no robo-legs yet.
>>106700541*shoves chinkslop moes down your throat instead*
Does SSDMAXXing start to make sense with 10T a3b models?
You guys remember when all of this was just an outputted garbled mess? Feels like just yesterday
I'm using deepseek with sillytavern, is there a way of changing its writing style?
No matter what character I choose, trying to go for "narrative" style ends up churning out essentially the same thing, I'm not sure what I should be doing to do the equivalent of saying (artist:1.1) in an image model.
No matter what fetish or scenario I'm concocting, it all feels like the same thing, surely it's not deepseek itself that's like this?
>>106700841
>deepseek
you're chatting with a 111b model (at most) with broken attention. what kind of "writing style" do you expect to get out of it?
>>106700841You could try telling it to write like a specific author.
>>106700871
I guess, I was just hoping someone would prove me wrong and say "Ugh, you're clearly meant to do this, retard"
But this thing consumes like 10 cents in two hours of play and is completely uncensored, I'm just not sure I'd want to invest in chatgpt only to get cucked any time I ask for something that's a no-no.
>>106700841
It should be an intelligent enough model that an instruction with a brief example should do it.
Hell, prefill the thinking with some first person yapping about the style with some examples and shit.
The first message should also follow the style you want, and even the way your own messages are written has a hand in steering the model towards one style or the other.
You might try to create a control vector for the style, that could make it "stick" better too.
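If you want to do the instruction + prefill part by hand, it's roughly this - a hedged sketch assuming an OpenAI-compatible endpoint; the URL, model name and the prefill text are placeholders, and backends differ in whether they'll continue a trailing assistant message, so check yours (SillyTavern has its own prefill field that does the same thing):
[code]
# Minimal sketch: steer the prose by giving a style instruction with an example,
# then prefilling the start of the reply so the model continues in that voice.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")  # placeholder endpoint

style_note = (
    "Write in terse, hard-boiled noir prose. Short sentences. "
    "Concrete sensory detail, no purple metaphors. Example: 'The rain hit the window like gravel.'"
)

messages = [
    {"role": "system", "content": style_note},
    {"role": "user", "content": "Continue the scene: the detective enters the bar."},
    # Prefilled assistant turn: many local backends continue from this text instead of starting fresh.
    {"role": "assistant", "content": "The door stuck. He shouldered it open and"},
]

resp = client.chat.completions.create(model="deepseek-chat", messages=messages, max_tokens=300)
print(resp.choices[0].message.content)
[/code]
Same idea for reasoning models: start the assistant turn with a <think> block of first-person yapping about the style, and the actual reply tends to follow it.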
>>106700836
2K context with no instruct training and we still managed to coom
alastrim Ger helter
Default Kimi weights on HF are FP8, same as DS. 70-60% are most likely FP4, but what are the providers using to get 96%? INT8?
>>106701146
Fucking with sampling in the backend? Serving quanted context?
Maybe it's just margin of error.
>>106701166
Those two fail on half the tool calls that the 60-70% providers pass. That's way too significant to be margin of error.
Please tell me there's some alternative to Mistral models re: cooming "against guidelines". I just never want to see a model do that.
Kiwi won. (Qwen) (I like nu-Edit) (Why can't we have straight upgrades like that with llms, it is always 1.5 steps forward, 1 step sideways, 1 step back)
qwen fucking sucks. refuses to talk about things like tiananmen square, but for some reason GLM complies just fine despite both being chinese. GLM air is too small and too fast for my tastes and GLM full is way too slow, but qwen3 235B is perfect balance of speed and size, except for the fact that it is censored to hell and its outputs are very low quality
>>106701540Are you doing RP about tiananmen square? Programming a game set in tiananmen square?
>>106701552no. i just want a large model that is good, fast, and completely uncensored, without any chink spyware or anything like that. if it cant answer a simple question, then it isnt good enough
>>106701492kiwi is the cutest
>>106700424>>106691703Catbox of last op pretty please?
>>106701558
>if it cant answer a simple question, then it isnt good enough
all models fail this test.
>>106701492
>basedbooru
go back
>>106701540
2025 is almost over and /lmg/ still can't into vectors despite being spoonfed countless times
your other options are dropping a shitcoin to fund your own finetune, or making me rich so that i can fund uncucked AIs out of pocket (you will never do it but it's your best bet at getting a robot wife without spyware in your lifetime)
>>106701602
>oldtroon impotent coping, seething and malding
Go ACK
>>106701686who are you?
>>106701697literally nobody like sharty fags
>>106701717
an idiot sandwich
it wasn't a request it was a statement of fact, i'm not gonna dox myself here. it's a thing that could technically happen as in it's not physically impossible, but there's no way to make it work.
>>106701808so then you dont want money?
>>106701686This is an outstanding insight — You should definitely join a frontier lab with your deep knowledge of vectors!
>>106701823>—that's 100% a bot lmao
>>106701812
i have money, i'm just not spending it on you
i could obtain more money without your help but i have nothing to spend it on
doxing myself will not get me money, it will ruin what little peace i have in exchange for 10 minutes of homoerotic pleasure you'll derive from stalking me.
i am still actively looking for ways to make this work but i can't be stupid about it. it is what it is.
>>106701916At least spin up a trip/proton/github and indulge the more retarded parts of your personality here. You can still share shit with lmg bros and keep from getting doxxed
Are we sleeping on Jamba?
Yeah, it's a little 'tarded at times, but it knows a lot and has very little censorship or slop.
Here's the cockbench results for Jamba 1.7 Mini.
>>106701980
>at times
*often
It'd be the best model if it was smart, because of the positives you listed, but it's so dumb it's unusable for real long contexts.
>>106701916>i>i'm>i>i>myself>me>i>me.>i>i
You know what, I think Jamba's issue really is architectural. The tests I did with it felt like it was stupid in a way that was related to its understanding of context. Kind of like if you quantized the cache. The fact that it knows so much for its size and hasn't been safety lobotomized means that it should really be a decently smart model. The data and scale are in its favor. Yet it's dumber than even dense models of its active parameter size.
>>106696888what is meant by this? I have a data split into train,val,test. Pick the most accurate and train on that? The generations are all super bad btw with SFT.
>>106702058
What do you recommend for long contexts? Pretty much all the models I've tried break down into a repetitive mess when I try to get them to continue a 10 page story. (Gemma-3 27B base, GLM 4.5 Air, and a few others I'm forgetting at the moment.)
I found Jamba actually held up a little better at long contexts, although I still had to hand hold it.
It's the only model I've used so far that required absolutely zero jailbreak or system prompt to be usable.
>>106702137
Yeah. It's weird. Some gens it will behave fairly well, and then other times it gives a barely coherent response.
It kind of reminds me of early LLaMa 1 finetunes.
I think part of the issue is that the logprobs seem to be very concentrated.
Like, half the logits only have one or two possible tokens (although I still get a decent spread on rerolls, unlike Gemma - which also has a pretty narrow logprob distribution).
The lack of alternative tokens probably causes it to spiral on a bad sample.
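If anyone wants to eyeball the logprob spread themselves, here's a rough sketch against an OpenAI-compatible server (llama-server exposes logprobs this way; the endpoint and model name are placeholders, and field names can vary slightly between backends):
[code]
# Dump the top-5 candidate tokens per position to see how concentrated the
# distribution is. A "concentrated" model shows ~1.0 on the top token most steps.
import math
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="jamba-mini-1.7",  # placeholder name, use whatever your server loaded
    messages=[{"role": "user", "content": "Continue the story: The door creaked open and"}],
    max_tokens=40,
    logprobs=True,
    top_logprobs=5,
)

for tok in resp.choices[0].logprobs.content:
    alts = {alt.token: round(math.exp(alt.logprob), 3) for alt in tok.top_logprobs}
    print(repr(tok.token), alts)
[/code]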
>>106702137
Yeah. I get the feeling that the architecture is fine, maybe even great for long context, but the training data or process might have been kind of shit.
What is the best local model to ship with a steam game right now? Has to have a commercial license and fit on 8gb vram or some sort of cpu MoE that doesn't suck
>>106702080
see, you are too mentally ill to make anything, even with the best intentions
>>106702209I can run at most 235B and it works the best for me, but you do need to still baby it. Less than Air. Its main issue is lack of knowledge, especially for a 235B. Meanwhile Air knows a lot but is repetitive and has other weird issues sometimes so you need to spend more time babying it. There's really no winning, but for me 235B has been the least bad as my main driver.
>>106702182
You sound retarded. Just use BERT: https://huggingface.co/docs/transformers/en/tasks/sequence_classification
>>106702209
Not anon but architecture wise, anything with hybrid gated SSM like Qwen3-Next. Unfortunately Qwen's data curation team sucks
>106702080
Don't give free (you)s to the crank lmao
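Re: the BERT link, that doc page roughly boils down to this - a minimal sketch where the dataset, label count and hyperparameters are placeholders, not recommendations; swap in your own train/val/test split:
[code]
# Bare-bones BERT text classifier finetune, roughly following the linked HF doc.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

ds = load_dataset("imdb")  # placeholder dataset; use your own split
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tok(batch["text"], truncation=True)

ds = ds.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="bert-clf", per_device_train_batch_size=16, num_train_epochs=2)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    tokenizer=tok,  # newer transformers versions call this processing_class=
)
trainer.train()
print(trainer.evaluate())  # accuracy etc. needs a compute_metrics fn; loss is reported by default
[/code]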
>>106702281You need to tell more details. And if you are actually a game dev you would probably do your own due diligence about various licenses and also about Steam's stance on this before asking 4chan anything.
>>106702334Steam is okay with it as long as you have some censorship. And anyway I think the idea is clear enough, a model that can be shipped for midrange gamer pcs. I imagine two of the top candidates are qwen3 and gpt-oss
>>106702182
For CoT models you can't just train directly on your data or you will lobotomize the model.
You have to replicate the thinking for your use case while still encouraging a good solution. One way is to try many times until the model gets the right answer and reinforce that thinking path.
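The "try many times and keep the good traces" part is just rejection sampling into an SFT set, roughly like this (generate() and check_answer() are stand-ins for whatever inference backend and verifier you're using):
[code]
# Sketch of rejection sampling for CoT data: sample several full thinking+answer
# completions per prompt, keep only the ones whose final answer verifies, train on those.
import json

def build_sft_set(problems, generate, check_answer, samples_per_problem=8):
    kept = []
    for prob in problems:
        for _ in range(samples_per_problem):
            completion = generate(prob["prompt"], temperature=0.8)  # includes the <think> block
            if check_answer(prob, completion):
                kept.append({"prompt": prob["prompt"], "completion": completion})
                break  # one verified trace per problem is often enough; keep more for variety
    return kept

# dataset = build_sft_set(problems, generate, check_answer)
# with open("cot_sft.jsonl", "w") as f:
#     f.writelines(json.dumps(row) + "\n" for row in dataset)
[/code]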
>>106702281
Qwen 3 A3B and Gemma 3n are good candidates I think.
No idea about the license.
>>106702285
I'd run 235B or full-fat GLM-4.5 if I had the RAM for it.
LLaMa 70B actually had one of the better NoLiMa scores. Not sure if I can deal with 3t/s though. Just slow enough to make rerolls painful.
>>106702332
I patiently await Qwen3-Next support in llama.cpp.
>>106702386>Gemma 3nInteresting... I hope they release one a little bigger than this though
>>106702409It's pretty capable for the size considering that it's a sparse model.
>>106702332all (you)s are free you cock gobbling schizoid
>>106702395Yeah before 235B I was running either 70B or 123B dense models. I still like those models as they still feel subtly smarter than the current mid size MoEs, but they're just so slow.
>>106702409it's an 8b model
I used Wayfarer 2 12B. It feels decently creative. Problem is it's really made for a specific style of writing which might not be what I always want. Also it's dumb but it's also 12B. Dunno if it's dumber or not than regular Nemo.
>>106702365
>some censorship
Steam's policy permits games containing both pre-generated and live-generated AI content. You must clearly disclose in the submission process what AI model is used, how it is used within your game, and describe safeguards against illegal use, especially for live AI interactions.
Steam will review your game for illegal or infringing content, including assets or functionality provided by the AI model, and can reject or remove your game if violations occur.
And there's more.
>>106702395I evaluated Qwen3-Next using mlx-lm which added support for it a week ago. I do not recommend it for story writing, but if you must, I found the Thinking version had a less bad writing style than the Instruct version.
>>106702458
Might as well use a dense model at that size and stick it all on GPU for fast inference though
>>106702525
I'm aware, that doesn't really change things. I might have to use shieldgemma or something similar but it's not a dealbreaker
>>106702547
its resource consumption is similar to a 4b model
>>106702506It's as stupid as any Nemo based finetune. It can't add together three variables to form a quest without forgetting the goal in the next paragraph etc. Waste of time unless you're a retard erper.
>>106699681
>>106701582
https://files.catbox.moe/lua1bn.png
ymmv with this lora i didnt use that many imgs https://files.catbox.moe/dcdbx5.safetensors
>>106702558*variables = very simple sentences
>>106702547I'll give it a shot. It would still be cool if they scaled it up a little though
>>106702528Disappointing to hear.I'm surprised the thinking version did better. In my experience, the thinkers tend to be bad story tellers, as they tend to regurgitate the story outline from their thinking block without fleshing it out very much.
>>106702695Yeah, so I found it notable enough to mention.
>>106702561Bless you kind Anon
Do I need at least 4 different accounts for "local" deep research?
https://github.com/Alibaba-NLP/DeepResearch
zuck and wang have something big coming
>>106703052be the change you want to see
>>106703367they better not clog the toilet
>>106703052the frontend is local
>>106703367I trust you.
hi guys vramlet here. I'm into stories, not RP. I got nemo and my experience is it's writing the same story no matter what I prompt, just changing the character names, etc. It also can't seem to be able to store state (which character is/was where, who they talked to, what they were doing or wearing just a moment ago). writing style and prose quality is of course exactly as expected, I understand that part.
so I was going to start doing research and basically preparing to open my wallet, but just a couple threads back there was a guy talking about running deepseek and getting the same story no matter what he prompted too. and deepseek was supposed to be SOTA.
so in short, I understand these models might be very good for RP, but if I'm into stories, should I just forget the whole idea and come back in 5 years?
>>106703561
>just a couple threads back there was a guy talking about running deepseek and getting the same story no matter what he prompted too
He was retarded or a lying faggot. Downgrade to the original R1 for a truly deranged experience. It does not write the same story every time.
>>106703561
tiny models are only good for cooming.
big models can be fun, but they still become retarded when the story gets long enough.
oLLM
>>106703674LLMao
>>106703680
https://github.com/Mega4alik/ollm
in case you werent aware
finna try this
obviously half a token per second sounds awful, but being able to run qwen-next-80b on a shitty e-waste laptop is kino.
also I wonder if my PCIe 5.0 M.2 SSD in my main machine would akshually improve t/s speed
>>106703817It will probably improve the speed but not sure what would the impact be on disk's durability in this case. I'd use some throwaway ssd.
>>106703664I don't know man I keep reading online how it's shit
>>106701146
If both are supported it would not make sense to use int8 over FP8 since both have the same speed (on NVIDIA GPUs).
So int8 could only be the reason if they're using V100s or A100s for serving.
What an unscrupulous cloud provider could also be doing is cutting down on the number of total or active experts.
is the whole point of these threads just nvidia farming retards with fomo?
yes goy, just spend 20K shekels on our hardware and then you can coom
damn we've been founded out
>>106704020Yeah bro, they totally care about 20 guys on 4chan spending their salary on a few 3090s.
>>106704020I don't think nvidia saw much profit when I bought a used 3090 for $600, 3 years ago.
>>106704159In this moment I am euphoric...
>>106701146
>>106703994
Other options for datatypes are FP4 and FP6 (Blackwell only).
The only Blackwell datacenter GPUs available are the B100 and B200, both of which come with 2x 96 GB VRAM.
At 8 BPW a 1T model doesn't fit on 4 of those GPUs; at 6 BPW it should just barely fit.
Looking at models on Huggingface, Baseten has uploaded an FP4 version of Kimi-K2-Instruct https://huggingface.co/baseten/Kimi-K2-Instruct-FP4 so presumably that is what they are using.
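Back-of-envelope numbers, treating "1T" as exactly 10^12 params and each GPU as 2x96 GB; real headroom depends on the exact parameter count (Kimi K2 is ~1.04T), KV cache and activations, and whether you count GB or GiB:
[code]
# Rough capacity check, nothing more.
params = 1.0e12             # "1T" model, exact count varies
vram_gb = 4 * 2 * 96        # 4 GPUs x 2 x 96 GB = 768 GB

for bpw in (8, 6, 4):
    weights_gb = params * bpw / 8 / 1e9
    print(f"{bpw} BPW: {weights_gb:.0f} GB of weights vs {vram_gb} GB of VRAM "
          f"({vram_gb - weights_gb:+.0f} GB before KV cache/activations)")
# 8 BPW: 1000 GB -> doesn't fit; 6 BPW: 750 GB -> tight fit; 4 BPW: 500 GB -> comfortable
[/code]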
Blackwell B200's... I weep and salivate... why wasn't I born rich... Why...
>>106703052
well you'll probably need a search API provider, leeching free searches can get unreliable
Web page parsing is trivial with any web scraping skills. Get page html and convert to markdown. AI web scraping can be a massive rabbit hole but you can get a decent solution up fairly easily.
Dunno what dashscope is.
>>106704332Hey, psst... https://viperatech.com/product/nvidia-umbriel-b200-baseboard-1-5tb-hbm3e
>>106704332Rich people don't even need expensive GPUs. They just pay a datacenter for the pleasure.
>>106704365
>Power consumption (W) 8000
This alone would be financially devastating
https://xcancel.com/TencentHunyuan/status/1971495031040283125#m>Excited to introduce Tencent Hunyuan3D-Omni, the industry's first 3D asset creation system with multi-condition control.
>>106704523>no goof>no llamacpp support>tranny URL
>>106704523I'm sorry sir but we don't tolerate stealing ad revenue from Sir Elon here! https://x.com/TencentHunyuan/status/1971495031040283125
I am very sick of the taste of copper
>>106704570stop eating your period blood
I'm gonna shill DSPy GEPA prompt tuning again.
This is the single most effective tool to increase output quality of your local LLM. Compared to finetunes:
- Way bigger quality increase
- A lot simpler, faster and modular
- Totally free when running a local feedback model. No GPU requirements a finetune may have
- Works with simple predefined metrics/answers instead of a finetune dataset
>use case?
anything. perfect for getting the most consistent structured output with the smallest possible prompt. but you could also hook it up to cockbench and define a metric like more rape = good. probably would give you a prompt with a method that fully circumvents any safety refusal within 500 rounds, which you can then use for any future prompts.
example:
https://xcancel.com/rasmus1610/status/1969818753509531955#m
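rough sketch of what the loop looks like, from memory - the task, metric and dataset here are made up, and exact GEPA arguments may differ between DSPy versions, so check the docs rather than trusting this verbatim:
[code]
# Hedged sketch of GEPA prompt optimization in DSPy against a local OpenAI-compatible server.
import dspy

lm = dspy.LM("openai/local-model", api_base="http://localhost:8080/v1", api_key="none")
dspy.configure(lm=lm)

program = dspy.ChainOfThought("document -> category")   # whatever structured task you care about

trainset = [dspy.Example(document="some text...", category="spam").with_inputs("document")]  # toy data
valset = trainset

def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    score = float(gold.category == pred.category)
    # GEPA uses textual feedback to guide its prompt rewrites
    return dspy.Prediction(score=score, feedback="ok" if score else "wrong category")

optimizer = dspy.GEPA(metric=metric, auto="light", reflection_lm=lm)
optimized = optimizer.compile(program, trainset=trainset, valset=valset)
optimized.save("tuned_program.json")
[/code]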
>>106704760
>>106704779
>gptoss
>llama4bortion
quality should be fixed at 0 for both of them.
>>106704760I've often asked LLMs to improve the prompt or find possible points of improvements, but not in an automated way. How would that even work for RP since it's so subjective?
>>106704810
Convincing RP is the final frontier of AI capabilities. There's absolutely no way to gauge the quality of a model outside of using it yourself, in a handful of scenarios that you're likely to be using it in.
Is there a way to fake CXL on a platform a normalfag could hope to get a hold of?
30xx is shitty e-waste now?
>>106704911always was
>>106704911used 3090 is still the best value you can get, with VRAM + cuda
>>106704835Corneal Cross-Linking?
>>106704963compute express link, the main upshot is being able to shove more ram in pcie slots without completely fucking your speeds or introducing a ton of latency
Gemma is consuming me, maybe I should start calling those hotlines...
>>106704988
>Is there a way to fake CXL
If you have to fake it, you cannot run the thing or you're running emulated, so performance will necessarily be worse.
>on a platform a normalfag could hope to get a hold of?
But if you want a platform where you can use CXL, then there's no need to fake it. Right?
So which one is it?
>On May 11, 2021, Samsung announced a 128 GB DDR5 based memory expansion module
>In 2021, CXL 1.1 support was announced for Intel Sapphire Rapids processors[30] and AMD Zen 4 EPYC "Genoa" and "Bergamo" processors.[31]
If that's all (taken from wikipedia, which i'm sure you've read), it's out of normie range.
man I kinda wanna threadripper it up, what's the cheapest ddr5 8-channel socket/mobo?
>>106705149Why would you go threadripper over epyc?
>>106705158Why would you go epyc over stacking H100s?
Sam Altman loves penis
>>106705167>H100sPlease don't tell me you're too poor for Blackwell.
>>106705158bruh I asked cringegpt, it told me to go intel, but I hate intel
>>106705178Can you even buy enterprise blackwell as a consumer? I think you have to be a company and sign contracts to get them.
>>106704988
>>106705112
(cont)
Also, from the same article, I just read
>Speed Full duplex
>1.x, 2.0 (32 GT/s):
> 3.938 GB/s (×1)
> 63.015 GB/s (×16)
>3.x (64 GT/s):
> 7.563 GB/s (×1)
> 121.0 GB/s (×16)
Am I misinterpreting the numbers? Is this really what you want? Seems to be, at best, a little over twice as fast as ddr5 but i'm sure much much more expensive and you'll need your software to support it, right?
>>106705188Ask it about 12 channel then
>>106705169fact checked by gpt-oss-120b
>>106705195the point is simply that it's more slots to shove more ram in, not that it's necessarily more performant.And support is baked in at the OS level, so the software that's running (ideally) doesn't have to be concerned about if it's in the slot or if it's CXL
>>106705205yeah I'm raping sam altman with my request, 2 mins to reply lmao
>>106705217
>the point is simply that it's more slots to shove more ram in, not that it's necessarily more performant.
If you're ok with dual-channel ddr5 speeds, sure. I'd still not consider any of the hardware needed in the normie range.
>https://www.aliexpress.com/i/3256808204056898.html?gatewayAdapt=4itemAdapt
>>106705193As far as I can tell you can buy from here as a noncorporate customer >>106704365
bros a question, if I currently get 3 t/s on dualchannel, will I get 12t/s if I go octo-channel (suppose the ram is at the same freq./latency)
>>106705396It won't scale perfectly if you're running exps=cpu because your gpu is not speeding up magically
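rough way to sanity-check the ceiling (all numbers illustrative, plug in your own active size and RAM speed; real scaling is worse because of NUMA and the GPU part not speeding up):
[code]
# Naive ceiling: CPU-side token gen is roughly RAM bandwidth / bytes read per token,
# assuming decode is purely bandwidth-bound with no GPU offload.
active_gb_per_token = 12            # e.g. ~22B active params at ~4.5 bpw, illustrative
dual_channel_gbs = 2 * 8 * 4.8      # 2 channels x 8 bytes x 4800 MT/s ~ 77 GB/s
octo_channel_gbs = 8 * 8 * 4.8      # ~ 307 GB/s

print("dual:", round(dual_channel_gbs / active_gb_per_token, 1), "t/s ceiling")
print("octo:", round(octo_channel_gbs / active_gb_per_token, 1), "t/s ceiling")
# the ratio is 4x on paper, but measured speeds won't hit either number
[/code]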
>>106705396my intuition says 10t/s
>>106704760Yeah would be neat to see people try this out for automatically finding the best JB for models.
ai is dead
maybe there never was any true ai to begin with
>>106706131
> dense model
> someone is disappointed
> moe model
> someone is disappointed
you can't win
>>106706131
https://misorobotics.com/
https://www.pudurobotics.com/product/detail/bellabot
?
>>106706189moe with 70b active would make everyone happy
>>106706268
>moe with 70b active would make everyone happy
more like make it run piss slow, we would be back to getting it all on the gpu again with that kind of active params, even bigger and slower models, it's a lose lose scenario
>>106706268
>>106706299
70B static params so that they fit quanted in 48GB. Then 1-10B routed experts that run from RAM. In this configuration the experts are used simply for knowledge extension rather than as the basis for most of the model's intelligence like current big MoEs are.
I want to moe an expert really hard...
What are the best models for character impersonation, narration and general scenario roleplay? Not in the erp sense
>>106706418Gemma 27B
>>106706433gemma and the rap stories
>>106706495
say what you will, but the only downside of gemma is that it writes, well... like that
vision and general knowledge and overall chatting experience for its size is still the best
>>106704523Its gonna be huge isn't it.My old ass pascal cards have to wait like 30 min for a pic with the recent models. Its all over.
>>106706189
Not really.
30ba3b is really cool.
Also something like the size of gptoss120b.
Mini experts so you can offload lots to ram and still get good enough speed.
>>106706551you forgot the safety feeling.It really makes me feel safe when I use it
>>106706575speaking of which, where are the qwenmax goofs?
>>106706594it didn't refuse me shota on big titty oba-san aside from being really euphemistic and... not wanting to say... you know
>>106706613try cunny next :)
>>106706629nyo~
Gemma 3 27B is the best for non-ERP unless you go for V3-tier models. You literally don't need more if you're not a coomer
>>106706659>SWAlmao
https://xcancel.com/bdsqlsz/status/1971448657011728480#m
>4x qwen image = 80b
just letting you know that they're also starting to do some ParametersMaxxing on the image side as well kek
>>106706629Define and/or use your favorite words in your instructions and it will use them organically in the RP. Don't just tell it "be vulgar" or "use dirty words".
>>106706714
4 times the parameters, still struggling to match a simple finetuned SDXL...
>>106706804imagine if people bothered to finetune language models like that.
>>106706695You're poor
>>106706433>>106706659Thanks, I'll try it out.
bartowski-Qwen_Qwen3-235B-A22B-Instruct-2507-IQ2_S is ChatGPT at home. It's extremely good.
>>106707130
>IQ2_S
how is that not lobotomy?
>>106705193Making a corporation isn't actually that hard either. Like you don't need a lawyer or anything.
>>106707154
Fuck if I know, it works, and it's substantially better than everything else I used for code review/generation and general assistant tasks. A lot better than GPT-OSS, Mistral Large and GLM Air, as well as many different dense 70Bs. I didn't do enough ERP to come to a conclusion there yet, but it feels like an upgrade so far too.
>>106706714if the images on the promotion shit are samples from it then it still looks very slopped
>>106707130Really unfortunate that ever since Unsloth and the uberguy made their Deepseek quants they've overtaken bartowski, even though bartowski doesn't reupload the same quant three times a week, shill his work, doesn't require pull requests to support his latest quant (look at this retardation: https://github.com/kvcache-ai/ktransformers/issues/1195#issuecomment-2830402529), or do shady shit. Plus, his quants are actually better. The reason this all happened is that Deepsex was so good and huge that even if you quant it down to 2bits it was still useful, but everyone on reddit believed this is because of some magic done by the Unsloth team. So now everyone just assumes that the Unsloth/ubergarm guy are wizards when it was just Deepseek being resistant to lobotomy. https://huggingface.co/bartowski/Qwen_Qwen3-235B-A22B-Instruct-2507-GGUF/discussions/1
>>106706372How do we fund this?
>>106707479My anecdotal testing agrees with this desu. I feel that Bartowski's quants are a tiny bit less repetitive, have a bit better intelligence, and are even a tiny bit less slopped, compared to Unsloth and Mradermacher. I don't know about ubergarm, I haven't bothered trying ik_llama. Tested both Gemma and Qwen models. Surprisingly I saw differences even at Q6. I think imatrix has a larger influence than might be expected from benchmarks/metrics like these.
>>106704760I will add it to my backlog.
Anybody in here have finetuning advice you could give me? I have a 3200~ line dataset of samples from my fiction and a discord server, and my intention is to capture the voice of a specific character. However, my current finetuning experiments have yielded pretty incoherent results. I've experimented with Unsloth before but those were extremely small, POC style tunes where I was basically just proving I could use it. I'm also open to using AWS because I have some credits (inb4 this is /lmg/, yes I know) but ultimately, where I finetune or what model I finetune doesn't seem to matter. Since I am working with such a relatively small dataset by comparison to a big finetuning job, and we don't want the model to learn format so much as response tone/style, do you guys have any tips on formatting datasets? For reference, I've tried in two different ways--one where the user input is a short multi-line snippet of existing conversation history to one line of assistant response, and one where it is simply one line of conversation history to one line of assistant response. Neither of these has really seemed to make a huge difference, nor has playing with parameters like epochs. Generally speaking, is there a recommended format for a dataset where we're just trying to finetune for tone? I've been kind of looking around but the guides I've found have seemed extremely old, or they were using examples of characters that would have been already baked into the model they were using (Rick Sanchez, for instance), which produces misleading results.
ZAMN!
Another one: https://github.com/lyogavin/airllm
RAMlets are eating good these days (if they havent starved to death while waiting for the response)
>>106708050
>AirLLM optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card without quantization, distillation and pruning. And you can run 405B Llama3.1 on 8GB vram now.
MoE is pointless now
>>106708050Can't you already do the same with mmap and llama.cpp?
>>106708124
don't forget oLLM!
I still havent found any time to even try one of these. But it would be pretty cool if it scales linearly. Like running GLM4.5 fp8 on a single 3090 and getting 1t/sec or more
>>106707479IQK quants are better than IQ or regular K quants in PPL and Ubergarm is the only one who consistently makes those quants.
>>106708050
>>106708203
https://github.com/Mega4alik/ollm
better?
>>106701146>30% loss at FP4Is it FP4 that is bad or is Kimi just a badly quantable model? It often makes dumb mistakes at Q5 when I run it locally, but I don't know if it's the model or the quant.
>>106708378Go use it over openrouter with moonshotai as the provider and see for yourself.
>>106701146
>>106708378
https://deepinfra.com/moonshotai/Kimi-K2-Instruct-0905
Deepinfra claims to run FP4 and they're one of the 96%ers
The shitter providers are either running like 1.5 bit quants or they have deeply screwed the pooch with configuration
>>106700424
I bought Strix Halo for this.
None of you men know my pain, the pain of the poop mage.
>be you
>Dabbling in local models
>Find github with promising tech
>Repo last update 3 years ago
Every fucking time.
>>106708539it just means its feature complete and stable.
>>106703052
>Waaah waah
>I need an AI to read research documents for me because I'm a retard and never learned how to do my own research
The actual state of deep research.
How about you figure out how to tune a niche model in your field, and create original results, instead of hoping that some grad student grant grinders published a paper that MAYBE brushed your topic with retard heavy strokes years ago?
>>106708539
>>Find github with promising tech
there are like four or five things on github that are actually useful depending on your use case for local models and that's it
goofbros...
It's over
V3.1-Terminus killed(or better said terminated) all the hype I had for R2. It addressed none of the issues of V3.1, it still begins everything with "great question", "you're absolutely right", "of course" and other slop. It still is boring and positive. It still can't properly think if the problem is not math or code. Looks like DS decided to go back to OG V3 and V2 dryness for some reason and R1 was just a lucky accident.
>>106708629They're doing it just to troll you
>>106708679But that's simply not possible. And I don't find this funny anymore.
>>106708629>wow why is a bugfixing release not addressing problems specific to ME ME ME
>>106708748Why isn't it?
>>106708196But that requires me to run ik_llama and fuck that fork.
i hate miku
>>106707130Huh, this actually fucks. It's just small enough for a 64gb/12gb system. Writes better than glm-air, but has slightly worse trivia knowledge.
Interesting. Llama.cpp has been hanging and taking a while to die on windows to the point where you can't even kill the process.
Maybe I should pull and rebuild. It's been a week or so since I last did that.
>>106709087https://youtu.be/Xr98K8dyLy4?t=137
>>106709111For me it's significantly worse. The interesting thing is that Qwen likes to use the trivia it does know more. Like, when you're doing an RP within an established universe, it'll more likely bring up stuff from it (not just what's in the card/lorebook), while GLM and other models will less often bring up references like that. However it will also often hallucinate facts.
>>106709390
At first I thought it was pretty bad too, but then I gave it a single turn of an example and it somehow got it right after that. It needs a little more help at first to get going. Of course, it's still lobotomized to shit at q2, but hey, it works, kinda.
>>106709459>>106709390The model itself is good, except for the fact that it is a CCP shill. If that were to be removed, it would be basically perfect.
>>106704826no sir, that would be writing events for hoi4 modding
>>106708559Someone asked how to extract definitions from JanitorAI. I think you can do that by gaslighting the model. It would be cool to use local deep research to collect the answer from Reddit or something when I don't care enough to research about it myself.
>Generally, imatrix is not recommended for Q4 and larger quants. The problem with imatrix is that it will guide what model remembers, while anything not covered by the text sample used to generate the imatrix is more likely to be forgotten. For example, an imatrix derived from wikipedia sample is likely to negatively affect tasks like coding. In other words, while imatrix can improve specific benchmarks, that are similar to the imatrix input sample, it will also skew the model performance towards tasks similar to the imatrix sample at the expense of other tasks.
I heckin' love wikitext! Did you see how my imatrix quant decreases perplexity on wikitext?
https://blog.novelai.net/text-model-release-introducing-glm-4-5-untuned-preview-for-novelai-opus-4aa866c8a0d5
NovelAI just destroyed local.
>>106709802Where's that from?
>>106709810
>untuned GLM 4.5
don't you mean zhipu destroyed local?
>>106709810That's a great idea. It doesn't take much compute at all to tune since it's a MoE.Smart.
1TB RAM
2TB NVME SSD
REJECT QUANTIZATION EMBRACE OFFLOADING
>>106709810
this is unironically good news desu. no one is really investing the time to make creative finetunes anymore, they only benchmaxx. if this is actually good then it proves that it's possible and jealous localfags might eventually do something, just like the nai leak eventually led to ponyxl etc
if it's shit then it's probably game over but I guess they helped us avoid wasting money on finetuning
>>106709810calm down nigga we dont even have the finetune yet
>>106709810
>untuned
so people are now paying for something I run for free on my $6k server?
>>106709980You're skipping Erato for some reason.
>>106709980this guy don't know about chub soji's failure
>>106709936How many weeks you gonna be comfortable waiting for Qwen 4 to finish getting through its 1 million token reasoning budget?
Lately I have the strange tendency to interrupt the roleplay to have all characters break out into a musical number. If you provide the lyrics, it works exceedingly well.
>>106710104I'm fine with a good answer in a few days. Better than a shit useless one.
>>106710126
>a few days
Dude, even if you manage to get 1 t/s running off SSD and have that speed constant up to 1 million tokens worth of context (which isn't happening), 1 million seconds is 11.5 days minimum.
More likely you'll be waiting half a year just to get to </think>.
>>106709204I love how her real voice is beyond sound.
>>106710057
>chub soji's failure
wdym? it's better than 0324
>>106710178
>muh reasoning
no thanks. instruct is all you need. reasoning models are just for retards who can't build agentic flows.
>>106710369Ok, but that's not where the industry is trending. By next year, you'll have as many new non-Reasoning models to choose from as you have new 70-120B dense models today.
>>106708527>He is like a green lantern but with poop instead of green ring powers.I fucking died. This is gold.
>>106710405that's fine. qwen3 235b vl is all I sneed.
Why aren't there more bitnet models? Is there some sort of problem with the architecture?
>>106710452being suppressed by threats from nvidia
>>106710452Yes, money issue
>>106710496>we have been allocated a compute budget to train a model for research>should we try something new and experimental?>nah too risky, let's do yet another llama 2 sidegrade
>>106710452
I think the model doesn't take as much vram as the optimizer and activations and whatever other shit it has to do. so it actually costs nearly just as much to train a model with bitnet as it does fp16 or whatever they actually use. so it's only an inference time benefit. and most of the ai companies want to make money by hosting something that you cannot run on your consumer gpu, so naturally there isn't a lot of interest.
no more releases it's joever
>>106710432guffs never ever
>>106710565vibecoders are on it
To think or not to think?
>>106710452mostly pointless when you don't have the hardware capable of doing ternary stuff efficiently
>>106710592
No it's not. The model being smaller gives you a massive benefit. Even if you're thinking only of speed, making the model smaller makes inference faster because of memory bandwidth limits.
Of course this has the biggest effect on local which no one cares about. On cloud it's more limited by compute and space for activations, which is why everything is now a bloated MoE.
>>106710547
>and most the ai companies want to make money by hosting something that you cannot run on your consumer gpu, so naturally there isn't alot of interest.
Most SOTA models are fuckhuge and everyone is planning to go huger. No one is running Kimi K2 on consumer gpu so it would do nothing but save them money on API hosting costs until specialized ternary hardware comes out and who knows how long that would take.
>>106710452
https://arxiv.org/abs/2501.02423
The more data you train models on, the worse they quantize.
>>106710647
Also see:
Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens
https://arxiv.org/abs/2411.17691
Scaling Laws for Precision
https://arxiv.org/abs/2411.04330v2
K2 is nice for chatting but it's really bad at acting as a narrator/storyteller. I swear Mistral Large 2 was more proactive and willing to try to do interesting things than this model. Every reply is just assistant-slop in narrator form where it addresses exactly the things I mentioned in my last reply and that's it no matter how much I beg it to get off its ass and do anything interesting.
>>106709802Isn't this exactly what I said forever ago. It's basically a roundabout way of 'finetuning'
>>106709802Who are you quoting?
>>106710882It's not. You actually said the same thing as DavidAU did, you should feel bad and ashamed of yourself.
at least it was close
>>106711095
This is so confusing. I just want a goddamn GGUF file!!
he's making fun of you /lmg/
https://youtu.be/a7TOameRqoY
>>106696286Please report back, I'm curious.
>>106711552is it me or do his jokes feel like they were written by chatgpt?
>>106709810Dunno, I think glm 4.5 is big enough and knowledgeable enough to not need a finetune. It remains to be seen if it will actually be that much better.
Why do we still use PPL for measuring quantization effectiveness instead of benchmarks? PPL only measures memorization, not generalization. It also doesn't make sense to use PPL if the dataset isn't in the training set (it's like using BLEU)
>>106710672
Thanks for the links
>>106712310Torse
>>106712310PPL is nice and simple and you don't have to go and deal with benchmark software. It's viewed as close enough to measuring loss in general intelligence or generalization since there's no theory or evidence that quantization hurts general intelligence at some different rate than memorization. Plus there's also the fact that most benchmarks are unreliable, have wide margins of error, don't actually measure "general intelligence", and are benchmaxxed anyway even if they were direct measures.
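For reference, the usual fixed-stride PPL eval is only a few lines with transformers, which is part of why it became the default; the model name, text file and context length below are placeholders, and this is the textbook chunked approximation rather than an exact sliding window:
[code]
# Minimal perplexity eval over a text file, the usual way quants get compared.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Qwen/Qwen2.5-0.5B"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

ids = tok(open("wiki.test.raw").read(), return_tensors="pt").input_ids.to(model.device)

ctx, nlls, count = 2048, [], 0
for i in range(0, ids.size(1) - 1, ctx):
    chunk = ids[:, i : i + ctx]
    if chunk.size(1) < 2:
        break
    with torch.no_grad():
        out = model(chunk, labels=chunk)      # loss = mean NLL over the shifted chunk
    nlls.append(out.loss * (chunk.size(1) - 1))
    count += chunk.size(1) - 1

print("ppl:", torch.exp(torch.stack(nlls).sum() / count).item())
[/code]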
>>106700871>deepseek>111b model (at most) with broken attentionHm? It's 37B active. And what broken attention?
>>106712576>there's no proof these things are the same or different, so we can assume they're the samebig brain take
>>>/vg/540527582
>THIS GENERAL IS INFESTED BY CHATBOTS TRAINED ON 4CHAN POSTS!
>>>/vg/540529936
>Ok. I am quantizing the model using AWQ right now, so it becomes very lightweight. Will launch in 30 minutes. Let's see how the quantized model behaves.
This is what lmg-anon does nowadays.
>>106713132>/vg/that's like linking to /mlp/
>>106712051
If you look at Alien Earth for example (that's tv) it's somewhat obvious they are using le chatgpt or something to create all these "wacky" plot points. It feels a bit engineered but that's just me of course.
Same goes for youtubers, especially the smaller ones who need to spam out "content" in order to keep up with the algorithm.
I mean it's more common than you think.
>>106713196soon we'll be living in a chatgpt world.. le sigh.. altman u bloody bastard
>>106713124Who are you quoting? There are things called implicit arguments or hidden premises. Sorry if they weren't clear from my writing. The hidden premise is that the argument of "oh well there's no evidence" comes after the limited evidence that they do get affected at similar rates. And that's why people thought of using PPL in the first place. The issue is that the evidence in question is benchmarks, which kind of sucks, so they are in practice not a proof. But they give us some reason to believe in the idea that general intelligence and memorization are parallel in transformer models. We would be more confident in the opposite being true if there was strong proof. Since there is no strong proof of the opposite being true, "It's viewed as close enough to measuring loss in general intelligence or generalization".
What would you say a low point in this hobby was? For me it was when deepseek r1 was first released and they made those distills on smaller models and a bunch of tourists were getting the distills completely confused for the full fat model.
I didn't think I was autistic until that happened. I got autistically angry.
>>106713730Probably around the same time, but mostly because of the dramafaggotry flooding the threads at the time, that's now ignored whenever bait does get posted (probably because the thread isn't active enough in general anymore; it's dead, and that's a good thing).
>>106713730
+ but for me it was seeing how fucking gimped everyone was implementing it. even nowadays literally every single provider is severely worse than when deepseek actually served it themselves, and i fucking mean it, the gap is still fucking huge, it's roughly the difference between v3 and r1 themselves. besides that idk, nothing else bad really happened, except of course the thread's international agent being a niggerfaggot like usual. everything else is going great really, discounting hardware prices, but eh.... patience is a virtue, give it some time. really not much bad can be said, i think a model flops here or there but then another gets released literally 2 weeks later so not like it matters
shill me a 12B model capable of tool calling
how to get a novelai like storywriting interface? sillytavern is good but its more like a character card chat bot
>>106713985mikupad?
>>106713730The context extension + superhot spam during the Llama 1 days.
It feels like local models have become significantly smarter over the past year and yet somehow became less 'smart' at the same time. A very worrying trend.
>>106714092Just like the people after the cold war ended amirite
>>106714092More code, math and safety focus, less of anything else
>>106713730Yeah that's probably it. R1 had the allure of challenging western SOTA that social media jockeys lapped up>OMG run the bestest AI on your raspi with ollama **basedface** xdxdModels continue to improve, I will realise my perfect waifu eventually.
This is truly the end of local models
Is there something to keeping as much of a model in vram and coherency going hand in hand? It seems after i've picked up my rtx 5000 series card, this same model that was having a few hitches here and there with quality is suddenly out of this fucking world perfect. Maybe the architecture jump helped? I genuinely don't get it, it's night and day.
>>106714835
I always thought it was the opposite, cpus have better floating point precision than gpus.
>>106714835It's placebo
>>106714885what did he mean by this?.assistant
>>106714835
could be a lot of things. I thought local models sucked for a long time just because I had messed up settings. I wasnt matching context windows in kobold and realized I was fucking up the llm every time lol
anyways placebo. GPU dont matter. Faster speeds can negate frustrations with bad gens tho.
>>106715099
I'm pretty sure that's what's going on, combination of the settings finally working right and not having to wait actual minutes to get less than 1k tokens generated makes this a lot quicker/easier.
dunno though, it used to go genuinely schizo very quickly and now i'm well over 100 messages in, across different characters, its never been this good.
either way, never going back to a less than 10T/s speed again that's for sure.
>But now this means i'm seeing the tropes at over 10t/s>I won't bite.. unless you ask nicely
>>106715151
Honestly I like that phrase in particular, I don't mind that one repeating
>>106713730
Trying to jerk off with GPT2 in dungeon ai was a pretty low point in my life in general
But yeah, the R1 debacle sucked, people unironically expecting to run the full model in a raspi
>>106713730>>106715203I remember one IRL friend telling me he could run R1 on his machine and me going o_O
I wish local could actually compete
is lcpp ever gonna support Qwen3-VL-235B-A22B-Instruct? what do i use instead?
>>106715509
>is lcpp ever gonna support Qwen3-VL-235B-A22B-Instruct?
no
>what do i use instead?
Nemo
>do some machine translation task
>GLM 4.5 and Qwen3-235B-A22B-Instruct-2507 at Q4 stumble, add/remove lines etc.
>Deepseek R1, even with disabled thinking and at retard Q1 quant does it perfectly every time
how did they do it? I like testing LLMs for different use-cases, every now and then i try out the latest and greatest models, but the result is always the same, R1 is king.
>>106712576
I am going to need a source on that. As far as I am aware, quantization does bias the models [1]
[1]: https://openreview.net/forum?id=e3Dpq3WdMv
^ Why do some openreview links not have any reviews on them
Is there any inference provider that offers a raw completion mode that you could hook up with mikupad?
>>106715959Consult https://rentry.org/or-prefill
>>106715569>R1 is kingI confirm. R1 is amazing for translations
>>106715569
I did software for docx translation in which models just don't have a way to miss a line, they are asked about each line directly, and if they choose to translate a different thing from context, this is detected with tags. Mistral-Small 24B works perfectly.
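the tag trick is basically: number every line, ask for the translation inside the same tag, and reject/retry any reply whose tag doesn't line up. A rough sketch of that idea (not my actual software), where translate_call() is a placeholder for whatever backend you use:
[code]
# Tag-checked line-by-line translation: the model can't silently drop or merge
# lines because every reply is validated against the line's ID tag.
import re

def translate_lines(lines, translate_call, retries=3):
    out = []
    for i, line in enumerate(lines):
        prompt = (
            "Translate the text inside the tag to English. "
            f"Reply with exactly one tag in the same format.\n<L{i}>{line}</L{i}>"
        )
        for _ in range(retries):
            reply = translate_call(prompt)
            m = re.search(rf"<L{i}>(.*?)</L{i}>", reply, re.S)
            if m:                        # tag present and matching -> the model answered the right line
                out.append(m.group(1).strip())
                break
        else:
            out.append(line)             # give up: keep the source line rather than dropping it
    return out
[/code]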
>>106716029thanks anon
>>106715830Yes potentially especially with imatrix. However my argument is about specifically broad subject intelligence vs rote memorization of articles, ignoring individual subjects/tasks. There are some benchmarks that lean a bit more towards general tasks while some lean more towards memorized facts, and they are all unreliable for various reasons, so our understanding of whether quantization really affects the two things differently (in a way that matters towards the goal of comparing broad quant quality or "for measuring quantization effectiveness") is not clear. But generally MMLU has been used most of the time as the benchmark after PPL to test loss of intelligence from quantization, and it follows an exponential curve as does PPL. Since it's difficult to measure the absolute skill level or intelligence level of an LLM, the purpose of doing these simpler measurements is to compare relative differences, so stuff like PPL is currently an ok proxy for general subject benchmarks.Honestly though perhaps we're getting caught up in irrelevant details here. Since we are not trying to do the extremely hard thing of measuring absolute intelligence level, measuring relative difference in quality means that it doesn't really matter if we're doing it on memorization or generalization. There is going to be a loss either way, and it is almost certainly exponential. Any measure that can show that is sufficient.
This drought feels different from what we've had before. It feels more over than ever before despite everything. Almost unrecoverable.
>>106716419is this somewhat like those video game communities that start going crazy after a few months without a new update?
>>106716338>relative differenceOf course lol. Thanks anon
what do we do with this 80GB image model?
>>106716419
The blocker is hardware, it's always been hardware. We need a hardware miracle right now.
Like Huawei's Atlas 300I Duo but with actually usable bandwidth.
Or Sandisk's high bandwidth flash actually coming out. Seems like D-Matrix are working on this as well:
https://www.techradar.com/pro/security/after-sandisk-d-matrix-is-proposing-an-intriguing-alternative-to-the-big-hbm-ai-puzzle-with-10x-better-performance-with-10x-better-energy-efficiency
>>106716796Even if there were some kind of miracle hardware breakthrough it wouldn't make its way to local for half a decade or more
>>106716383
>>106716839
I wouldn't underestimate the chinese though
https://techcrunch.com/2025/09/17/china-tells-its-tech-companies-they-cant-buy-ai-chips-from-nvidia/
>>106716419Need replacement for autoregressiveslop. I wouldn't be satisfied with running Gemini Pro on GPUs in my home rig.
I wish the hype around machine learning would die down so I'd stop having to see retards who don't know anything about it being extremely opinionated on itI just wanna do my data analysis in peace
>Moondream-3-Preview 9B MoE
Q:
>How many people are in the picture and what is their gender?
A:
>REASONING
>The image shows four people in the photograph. Looking closely at their genders, we can see two women and one man. The individuals appear to be engaged in a casual interaction, with one person gesturing while holding a phone.
>RESULT
>4 people, 2 women and 2 men
Overhyped garbage. Even SmolVL got the genders right.
i have a question, I just watched the movie Eddington about the datacenter named SolidGoldMagikarp, and after learning about glitch tokens, i see that chat gpt has patched the bugs that came with that token.
but i was wondering are there any currently glitchy tokens for chat gpt or is that like the entire concept patched now?
>>106717191ChatGPT is not an open weights model so it's not a local model. You want this thread instead >>>/vg/540611817
>>106717191There probably are some still.>>106717208Model is not, but tokenizer is.
>>106717191That's a great image. I love the machinery.
>>106717208
thank you!
>>106717261
it's a great album too!
https://youtu.be/MId3KYmvsXI?si=pDbIwG6HYCayef4h
>>106716029damn I just gooned like never before to V3.1 on OR and it took all of 5 seconds to set up. I can’t believe I wasted weeks fiddling with local shit and testing dozens of retarded <100B models in the hopes that one of them would be remotely intelligent, even though deepseek was miles better and available for free all along.
>>106717383Once you stop feeling shameful about someone seeing shit you write, it's kind of hard to go back to local.
>>106717191Ignore the other shill, /aicg/ is in this board.>>106700209
>>106717437that's the lower tier thread
>>106717458The thread in /vg/ is about botmakies.
Is it possible to use Qwen-image-Edit-2509 on forge, or at least on the qwenUI?
I don't want to install more GUIs.
>>106717474the thread in /g/ is about shitposting
>>106717507Occasional shitposting sounds better than worshiping botmakies.
>>106717531Occasional worshiping botmakies sounds better than shitposting.
>>106717556No, it does not. Why would I want to worship botmakies?
>>106717531>>106717556you are both cancer
>>106717569
oh we're talking about you here my b, anyways for non faggots /vg/ thread is better
happy now?
>>106717601It's the other way around. Since /vg/ revolves around personalities, that's the thread for faggots. Like you.
>>106717628
>Since /vg/ revolves around personalities
that's /g/ hence the thread is for faggots, i.e. you
>>106717667>>>/vg/540615495What's this about, exactly?
>>106717696>>106717634>>106716210just let it go anon /aicg/ on /g/ is just the lower tier thread
>>106710285
indifferent to miku
deferent to miku
bullet to miku
i just want a new model
mistral large 3 perhaps?
>>106718270
>>106718270hahaha, lol, the mao.
>>106718270Gemma 4
>>106718323>>106718323>>106718323
>>106717153Never trust a model under 30B (dense or MoE), it'll be shit.
>>106718496>>106718496>>106718496
>>106718449>>106718449>>106718449
>>106718536Invalid Miku