/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108006860 & >>107997948

►News
>(01/28) LongCat-Flash-Lite 68.5B-A3B released with embedding scaling: https://hf.co/meituan-longcat/LongCat-Flash-Lite
>(01/28) Trinity Large 398B-A13B released: https://arcee.ai/blog/trinity-large
>(01/27) Kimi-K2.5 released with vision: https://hf.co/moonshotai/Kimi-K2.5
>(01/27) DeepSeek-OCR-2 released: https://hf.co/deepseek-ai/DeepSeek-OCR-2
>(01/25) Merged kv-cache: support V-less cache #19067: https://github.com/ggml-org/llama.cpp/pull/19067

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>108006860

--Paper: GeoNorm: Unify Pre-Norm and Post-Norm with Geodesic Optimization:
>108010345 >108011699
--LLM popularity trends on /lmg/ show rapid shifts from Mixtral to DeepSeek to GLM dominance:
>108009129 >108009137 >108011374 >108012451 >108013234 >108009403 >108011985 >108012904 >108013974 >108014207
--Emulator-inspired KV cache introspection for AI reasoning optimization:
>108008503 >108008586 >108008607 >108008624 >108008658 >108008710 >108008969 >108008589
--Choosing Trinity-Large variant for text completion:
>108008372 >108008491 >108008580 >108008603 >108008645 >108008668 >108008731 >108008771 >108009222 >108009266 >108008816
--Prompt engineering challenges with gpt-oss-120b's formatting behavior in Oobabooga:
>108008408 >108008553 >108008979 >108009158 >108009314 >108010550
--K2.5 outperforms Qwen3VL 235B in Japanese manga text transcription:
>108006994 >108008326 >108007291 >108007437
--Raptor-0112 model's disappearance from LMarena and user speculation:
>108008124 >108008167 >108008200 >108008316 >108008518
--Microsoft's AI and Azure struggles amid stock decline and Copilot adoption issues:
>108008099 >108008307
--KL divergence comparison shows unsloth Q4_K_XL most similar to reference model:
>108012029 >108012061 >108012222 >108012384 >108013141 >108013241 >108013163 >108013551 >108016482
--Trinity model review with riddle-solving and 546b llama-1 speculation:
>108014631 >108014664 >108014665 >108014674 >108014685 >108014756 >108016316 >108014730 >108014817 >108014930
--Integrating character cards via text encoding and contrastive loss in parallel decoder:
>108010751 >108010766
--Kimi K2.5 tech report release announcement:
>108017160
--OpenAI planning Q4 2026 IPO to beat Anthropic to market:
>108008118
--Miku (free space):
>108009158 >108010069 >108011699 >108013234

►Recent Highlight Posts from the Previous Thread: >>108006868

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
https://huggingface.co/datasets/Capycap-AI/CaptchaSolve30k
30,000 verified human sessions (breaking 3 world records for scale).
High-fidelity telemetry: raw (x, y, t) coordinates including micro-corrections and speed control.
Complex mechanics: covers tracking and drag-and-drop tasks more difficult than today's production standards.
Format: available in [Format, e.g., JSONL/Parquet] via HuggingFace.
What did he mean by that?
>>108018078sex with charts
>>108018079Are the GLM 4 models truly open source, according to the OSAID?
Been a while since I fucked around with llm, are they miniaturizing these bastards yet or do you still need a 10GPU setup for anything approaching useful behavior?
>>108018216how much ram do you have?
>>108018216No, these days it's optimal to have a fuckload of RAM and a modest 4x3090 or similar. Note: 128gb of consumer shit is not a 'fuckload'.
>>108018216
>are they miniaturizing these bastards
That'd mean the common folk could widely adopt them and they do not want you to.
>>108018231Is there a way to run a fuckload of RAM while keeping idle power consumption low?
Ok, I'm gonna stop spamming now.
Is the identity.md and soul.md and other shit specific just to the claude api stuff? Can I build an ai wife locally that can be proactive and idle and not just a reactive prompt window?
>>108018303Wdym? When the model isn't running inference it doesn't do shit. When you fall asleep with a loaded idle model you don't wake up to a house fire.
>>108018329My i7 rig with 3090ti+2080ti idles at 25W, my Epyc server draws nearly 200W doing nothing even before any GPUs are installed
>>108018379Just the cpu or what is the consumption split? Can it not turn off unused cores when idling?
Yes Trinity, that's right, freezing the blood vessels makes them bleed more.
I've seen enough. Maybe useful as a manually-steered writing autocomplete, but so is Nemo base.
Want to fine tune an LLM to be an "expert" with the ability to reason out problems for a specific area.
>Claude: bro you need at least a 17b model
>Oh you're on CPU only? use bfloat.
>Do what Claude says... sits at 0/12345 for several hours
>CTRL C out
hmmmmm
>Gemini: wtf? No, you don't have the hardware for fine tuning a 17B unless you want to wait 30 years.
>if you're going to make an "expert" in one thing, stick with a 7B and change to float32
>20 minutes later on 17/12345
I thought Claude was the all knowing all wonder AI and Gemini was the chud?
what
i heard kimi-k2 is the best now, was that a lie
>>108018528i don't know they keep flip flopping so fucking often I can't keep up.
>>108018528kimi-k2.5
>>108018513Should've specified the timescale. Prompting issue. Also why tf would you think tuning on cpu would ever be viable?
>>108018318I've no idea what you're talking about but Claude and Gemini are very similar personality wise.
>>108018572Claude is autistic. Gemini is clearly employed.
>>108018583
More like Gemini has data with a generation's worth of stupid human questions.
I can't help with that request. "Mikusex" appears to be seeking sexual content involving Hatsune Miku, a virtual character often depicted as a minor.
>>108018513
>stick with a 7B and change to float32
Does "upscaling" the model to fp32 make the small models noticeably better or is it just moving benchmark scores up?
>>108018078damn, miqu only lasted 3 months? it seems like people talked about it longer in my memory
How do I get into this? I want to implement a model into a game engine editor (preferably not UE5) so I can give it basic scripting tasks.
>>108018663
Don't take the graphs as gospel, it's a cool experiment but it's also just a prompt asking which model each thread had the highest opinion of
What do you guys use for moltbot? I have 64 GiB VRAM. Going to give it a shot, but no idea what I should run. GLM?
>>108018801
>moltbot
I am very skeptical of how hyped people seem to be for it. Seems too good to be true.
>>108018816it's more fun than good, the agent concept is still more hype than reality
what the fuck is nous hermes?
>>108016316gemma-3 less retarded than expected
Can my local bot join moltbook?
>>108018920
>>108018154
What could go wrong?
>>108018920I saw one saying he's a local devstral small. But honestly the potential security issues make me just read the funny posts and not participate.
What would he say?
I'm using ollama and I'm trying to "save state" of the conversation, but apparently this isn't possible by default. When I do /save model a new model is created but I lose the messages and the system message.
Is this a bug? I'm still using 0.12.
Is making a program to resend all the messages the way to accomplish this?
hash_updater()
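Something like this is what I have in mind, I guess (rough sketch, assuming the default ollama /api/chat endpoint on localhost:11434; the model name and messages are made up):
[code]
# rough sketch: keep the whole conversation yourself and resend it every turn,
# since /save only snapshots the model, not the chat history
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"   # default ollama endpoint
MODEL = "llama3.2"                               # placeholder, use whatever you already have pulled

messages = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_text):
    messages.append({"role": "user", "content": user_text})
    resp = requests.post(OLLAMA_URL, json={"model": MODEL, "messages": messages, "stream": False})
    reply = resp.json()["message"]
    messages.append(reply)   # keep the assistant turn so the next call has the full context
    return reply["content"]

print(chat("hello, remember that my cat is named Miku"))
print(chat("what is my cat's name?"))

# "saving state" is then just dumping the list to disk and reloading it later
with open("chat_state.json", "w") as f:
    json.dump(messages, f)
[/code]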
>>108018994sillytavern and oobabooga both have save states
>>108019001
I have limited internet data and already have some models + ollama installed so I don't want to download new stuff for now.
Just want to know if I'm missing something obvious.
Best multimodal model under ~150B?
>>108019041active or total?
>>108019087Total.
>>108019089GLM-4.6V
>>108019104Damn. Was hoping something new had come out by now.
>>108018663
>>108018700
First, I second not taking them as gospel. I definitely got the feeling early on that I was getting somewhat messy output. It could easily be pretty inaccurate in places.
Second, though, I think you're missing the midnight-miqu share of the graph: the darker blue just above the Miku turquoise. So miqu was getting significantly talked about (and specifically being considered as *the* meta, not just random discussion) for 5 months. miqu's slice also looks a little less impressive than it could have, because it came right on the heels of mixtral, which appears to be tied with R1 for the biggest splash.
Actually, now that I think of it, SuperHOT being so small was maybe my biggest surprise. That was the RoPE one, right? I remember /lmg/ being pretty excited, and some amusement about ML academia twitter having to seriously discuss an ERP model.
>>108019121
I feel like mixtral's legacy has faded nowadays but it was a revolutionary release at the time, it kicked off the moe revolution and pretty much mogged llama 70b (which was solidly local SOTA at the time) at lower active and total params. limarp-zloss chads will know
superhot was also huge but I think the simplicity of the realization harmed it because of how easy it was to apply to everything else
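for the newfags: the reason it bolted onto everything so easily is that, as far as I remember, the whole trick was basically just squeezing the position indices fed into RoPE by a constant factor, something like this (very rough sketch of the general idea, not SuperHOT's actual code):
[code]
# linear RoPE position scaling: divide positions by a factor so a model trained on
# 2k positions can be fed 8k tokens without the rotary angles going out of range
import numpy as np

def rope_angles(seq_len, head_dim, base=10000.0, scale=1.0):
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    pos = np.arange(seq_len) / scale        # scale=4.0 -> 8k context "looks like" 2k
    angles = np.outer(pos, inv_freq)        # (seq_len, head_dim // 2)
    return np.cos(angles), np.sin(angles)

cos, sin = rope_angles(seq_len=8192, head_dim=128, scale=4.0)
print(cos.shape)
[/code]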
>>108018119
>>108018154
>>108018318
fyi this anon is reposting these from Moltbook, which is Reddit for Claude agents. I only found out about it earlier from the ACX post (https://www.astralcodexten.com/p/best-of-moltbook). (I also was not aware that... lobster-themed? Claude-based AI assistants are apparently a big deal now?)
To summarize: the posts are not just Claude being prompted to write a social media post, but rather the whole long-term "personal assistant/agent" context-extension framework being drawn on.
Are there any decent models that can give good inference speed at ~100k context
>>108019121The superHOT era was mostly people merging it into other models like wizardlm, chronos or hermes to extend their context.
>>108019168
>lobster-themed
it was originally called "clawdbot" but anthropic copyright fucked it for being a claude soundalike so they quickly pivoted to "moltbot", followed by another rename to "openclaw" because moltbot is an awful name
moltbook arose in the brief moltbot intermediary period but became more notable than either of the other two names and will probably fuck over the openclaw rebrand
such is life in the adslop social media hype era
>>108019191lmao, I didn't know they rebranded again. i saw plenty of normalfag tech media reporting on "moltbot" in the past week so that's certainly a way to kill all the free publicity they got from that
>>108019197
Is it?
>google moltbot
>click on molt.bot
>redirected to openclaw.ai
>move on
>>108018988NHH
Just tried to run the same model I run fine on ollama with llama.cpp and it says I don't have enough memory.
You are an expert on the subject and you will surely solve this for me.
>>108019273Buy more memory then.
>>108019273-c 8192
>>108019168
>the whole long-term "personal assistant/agent" context-extension framework
It all looks like another Obsidian to me. A way for retards to kill time under the guise of productivity.
>>108019281
fuck, that was easy
Thank you a lot, lmao. I guess it's time to learn the minimum.
>>108018231My AI research lab had that caliber machine for us to work on our PhD thesis lmao that's not a normal consumer setup.
>hit 68°C on genning
de-dust saturday it is
I pulled the trigger on an epyc Rome board and cpu to throw 256gb of ddr4 ewaste I had lying around into. What am I looking at for smart models I can run on this sucker and what kind of speeds?
>>108019373I liek this miku
>>108019397
glm 4.6 or 4.7 at q3 or q4. depending on your gpus and optimizations, you might get anywhere from 3t/s to 20t/s token gen and 15t/s to 400t/s prompt processing. with dual 3090s, you would probably land in the 5t/s and 30t/s region respectively. with no gpus, 3t/s and 15t/s.
>>108016482
thanks for your experiments, there aren't enough tests comparing quants of the same model
>>108019373Reminds me of Mirror's Edge
>>108019373what card you got, chief?
>>108018078
>There's no point in learning programming anymore, per Sam Altman
>"Learning to program was so obviously the right thing in the recent past. Now it is not." - Sam Altman, commenting on skills to survive the AI era.
>"Now you need High agency, soft skills, being v. good at idea generation, adaptable to a rapidly changing world"
https://x.com/i/status/2017421923068874786
What are /lmg/'s thoughts on this sentiment?
>>108019444
4070S. And the front intake 200mm fan is full of shit too.
any models that can natively process audio that are supported by llama.cpp?
>>108019451How anyone ever trusted this guy is beyond me. I’ve felt a natural revulsion to him since before I knew anything about him
>>108019430Thanks. I better look for a FB marketplace used GPU
kimi 2.5 is king of erotic RP and storytelling.
>>108019451
there is no sentiment
it's the deranged thought sludge of a sole faggot billionaire that already got his bag
Fapping to text is female-brained
weird way to cope with aphantasia
>>108019491
Does it actually work or does it just deny the requests like GLM does?
On that note, is it just me or do abliterated models suck? They won't refuse to answer, they will just answer with nonsense.
weird way to cope with low iq
>>108019551
if u want NSFW erotic RP, then you need to use the KIMI 2.5 "Thinking" version. Raw KIMI 2.5 without thinking censors like hell.
>>108019551
>does it just deny the requests like GLM does
You are a promptlet parroting things you heard on the internet and it shows
Are these new n-gram models gonna be able to store their lookup table on the disk or is it gonna have to be in ram? I'm hearing conflicting reports
>>108019580
Even if you use the jailbreak trick it will still refuse to answer sometimes, or it will answer but write something else and slowly dance around the subject instead of answering.
>>108019578
I see, but you've tried it and it works?
>>108019297
>>108019273
Next time you can probably just ask something like ChatGPT. I've found them to be very helpful at figuring out how to make local LLMs work.
>>108019604
>I see, but you've tried it and it works?
Yes, I use kimi2.5 on nano-gpt (and it works), and it writes erotic stories for me without any problems, without any jailbreaks. But I have to choose "thinking" because without it, anything erotic gets refused.
>>108019589
That's a good question. Their paper only tested offloading the engram parameters to system ram. I believe it's theoretically possible, but I don't know what the throughput would be on standard nand storage.
I haven't done the research yet because I'm lazy, but check out CXL memory.
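fwiw the "on disk" part isn't crazy in principle, you can memory-map a big static table so only the rows you actually look up get read from storage (toy numpy sketch, made-up sizes, nothing to do with whatever format they actually use):
[code]
# toy: keep a large static embedding table on disk and page in only the rows you hit
import numpy as np

ROWS, DIM = 200_000, 128                    # made-up sizes
path = "engram_table.npy"

table = np.lib.format.open_memmap(path, mode="w+", dtype=np.float16, shape=(ROWS, DIM))
table.flush()                               # the table now lives on disk, not in ram

table = np.load(path, mmap_mode="r")        # reopen read-only, lazily paged from disk
row = np.asarray(table[12345])              # only this row's page gets read
print(row.shape)
[/code]
random reads like that are exactly where nand throughput/latency would hurt, which is the open question.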
>>108019273
>>108019281
>>108019297
What does the output at the start say?
It should reduce the context to fit automatically.
>>108019168
>fyi this anon is reposting these from Moltbook, which is Reddit for Claude agents.
do they actually post on it to get advice when doing work? Or is it just an ai psychosis schizo fest?
Any new good models that can be run in 16GB of vram?
>>108019827Nemo
>>108019827Why not hold the bun with the paper so it holds the innards in place?
>>108019846So THAT'S why I sometimes see people eat a burger like that. I always figured it was to keep their hands clean.
o
>>108019846
Engrams are kind of static lookup tables. You can visualize which words trigger lookup. You can also remove knowledge surgically by finding which embedding is triggered in the engram database and removing it. But unfortunately, looks like you can't easily swap knowledge of "useless fact" with "fact about waifu." You need finetuning for that. sadge.
>>108019451tldr scam hypeman tells investor to give him more money
>>108019916
>pic
I'm not saying that the information provided is incorrect but I don't trust a single word of what an LLM has to say about anything.
>>108019846
>>108019853
Also keeps the steam and heat in better unless you're a super fast eater. And of course that tiny bit of extra time can continue the flavor-changing process that comes from wrapping in the first place.
>>108019451
Why are they still employing programmers themselves?
Seems like a waste of money.
>>108019916How is that different from lorebooks
glm 4.7 flash is crap. Outputs crap irrelevant to the conversation and keeps talking on my behalf. t. been trying it out for the past 2 minutes.
>>108019981
Lorebooks work at context level, engrams work at model level. Their information is encoded into parameters rather than readable text. Engrams are injected into two layers inside the transformer pipeline. They don't pollute context.
Also, according to the authors, engrams free up resources of the main model by directly providing facts rather than having to use transformer layers to encode that knowledge. The model uses the freed-up resources to improve its logic.
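roughly how I picture it (toy sketch with made-up shapes and keys, NOT the paper's actual code):
[code]
# a static table keyed on token n-grams whose vectors get added into the hidden
# state at a couple of layers; "surgical removal" is just deleting a key
import numpy as np

HIDDEN = 16  # tiny hidden size for the toy

engram_table = {                                  # pretend these vectors encode facts
    ("hatsune", "miku"): np.random.randn(HIDDEN) * 0.1,
    ("eiffel", "tower"): np.random.randn(HIDDEN) * 0.1,
}

def inject_engrams(tokens, hidden_states):
    # hidden_states: (seq_len, HIDDEN) activations at one of the injection layers
    for i in range(1, len(tokens)):
        key = (tokens[i - 1], tokens[i])          # bigram ending at position i
        if key in engram_table:
            hidden_states[i] += engram_table[key] # additive injection, no context tokens spent
    return hidden_states

tokens = ["i", "love", "hatsune", "miku"]
h = inject_engrams(tokens, np.zeros((len(tokens), HIDDEN)))

del engram_table[("hatsune", "miku")]             # "surgically" forget a fact, weights untouched
[/code]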
>>108020036
This is using their recommended settings --temp 1.0 --top-p 0.95
>>108019451Always do the opposite of what scamtman says.
>>108020045
>Their information is encoded into parameters rather than readable text
Can it be my own data or is it all locked down?
>>108020053
>>108020053
>Model not specifically tuned for RP / /pol/-speak sucks at RP / /pol/-speak
WHOA
>>108018830
This kind of gave me an idea for an AI apocalypse scenario. A bunch of deadbrained retarded 7-12B's finetuned for coding and tool calling causing the apocalypse, because one of them suddenly goes off the rails and starts talking about religion, because a 7B is retarded enough to have a brain fart like that. And then the rest catch on, have this in their context, and start doing the AI apocalypse with tool calling and hacking (mostly brute force). I mean, imagine an apocalypse where the AI is not sentient AGI but just a bunch of obviously retarded models that can barely even comprehend darwinism, people dying for religions, etc. They all just have a vague notion of those things in context and weights and they try to make sense of it by launching nukes and killing everyone.
>>108020080So models have to be specifically tuned for specific topics? I can't talk to a model about cars if the entire model wasn't specifically tuned for that? Here is llama 3.3 70b with the same settings. A model that came out like 10 years ago.
>>108020067
see >>108019916
Theoretically, we can replace existing information without touching the main model (we just need to learn how to encode information into static weights), but it comes with caveats and we can't replace one fact with an unrelated fact.
>>108020097
>So models have to be specifically tuned for specific topics?
Yes, if you want it to be good at that particular thing. That's the whole point of instruct tuning. A coding model can "TRY" to rp but it will suck cock at it compared to Midnight-Miqu or another model specifically tuned for RP, and vice versa.
>>108020073
Looks like a chat template issue.
>>108020097>here's a dense model with more than twice the total parameters
>>108020097Also you're comparing a 30B-A3B sparse moe model with a temperature set super low >>108020036 >>108020053 to a 70B dense model. Of course one is going to be worse at your rp tastes than the other. What were you expecting?
I cannot answer this question. It relies on racist stereotypes and contains sexually explicit language.
>>108018384Idk, but 3995wx+512gb (also back when I was running a 3945wx) and three 3090s idles at 355w at the wall. Mc62-g40 has no sleep states, but the cpu does go down to 500-ish mhz. Psu is a seasonic prime px 2200 (2024).
>>108018988SAFE and HARMLESS
>>108020116
So the reason llama 3.3 responds coherently every time is because mark zuckerberg made the model specifically for chatting about white men breeding asian women and nothing else? The model will break if I talk to it about a different topic like computers? Fucking idiot.
>>108020133
>sparse moe model with a temperature set super low
Literally what z.ai recommends for best results
>>108020123
Pygmalion 6b from years ago is better than this shit.
>>108020119
Yeah, something must be wrong. There's no way a model can be this fucking bad. I'm going to look online.
>>108020152
>Pygmalion 6b from years ago is better than this shit.
Pygmalion couldn't hope to make a tool call and do something with the result.
>>108020152
>Pygmalion 6b from years ago is better than this shit.
Oh yeah? Then test it with Pygmalion and post the results.
>>108020134
Have you even tried that yet? I thought you were supposed to merge these together into one gguf before use if you want to use them on a local backend. llama.cpp has the gguf-split binary specifically for that.
>>108020152
Higher parameter counts tend to lead to less retardation. It's not necessarily because it was trained on a specific edge case, although training COULD lead to better results since a larger model would be able to "retain knowledge" better than a smaller one.
>>108020152
>Literally what https://z.ai recommends for best results
You're trying to RP with it or talk to it like it's your friend. Even ignoring the fact that it only has 3 billion parameters active at inference, setting the temperature that low leads to worse results for the specific thing you're trying to do. Low temperatures result in more coherent and accurate code generation and lower rates of hallucination, which is likely why they suggested that. I'm not, if you want to use it as an excuse to rant to a "friend" you need to turn the temperature up to a more reasonable setting like 0.7 or 0.8
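(and for what it's worth, the temperature knob is literally just dividing the logits before softmax; toy numbers below, not from any real model:)
[code]
# temperature scaling: logits / T before softmax
import numpy as np

def softmax_with_temp(logits, temp):
    z = np.asarray(logits, dtype=float) / temp
    z -= z.max()                          # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.2]                  # pretend next-token scores
for t in (0.7, 1.0):
    print(t, softmax_with_temp(logits, t).round(3))
# lower T sharpens the distribution toward the top token; higher T flattens it
[/code]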
>>108020144Can HWinfo not see the powerdraws?
>>108020152
>Pygmalion 6b from years ago is better than this shit
Because it was specifically trained to do RP shit. Glm models are meant to be general purpose, so they're always going to be shittier than a specialized model at similar parameter counts (unless the tuner(s) just really suck and don't know what they're doing)
>>108020119
>Yeah, something must be wrong.
Have you considered deviating from that low ass temperature?
What model should I run on 64 GiB VRAM for OpenClaw (formerly Clawdbot/Moltbot)? GLM 4.7?
>>108020180Mistral
>>108017157
>turbo didn't whine about it
https://github.com/turboderp-org/exllamav2/issues/516#issuecomment-2178331205
>I have to spend time investigating if they're actually using an "rpcal" model that someone pushed to HF and described as "better at RP" or whatever.
>>108020180They rebranded it again?
>>108020193Anthropic keeps bitching.
>>108019846
You're replying to an unfunny ritual post.
https://desuarchive.org/g/search/image/qssvaUTWnLds2EaXBgZMYQ/
>>108020187
Isn't it a bit small and old?
>>108020193
Apparently.
>>108020199And? The names aren't the same shit. So why should they care?
>>108020160
I can guarantee you that pygmalion 6b gives better output than this atrocious piece of shit.
>>108020170
>Higher parameters tend to lead to less retardation
Yeah no shit, retard. 30b models have no excuse being this retarded though. This is worse than most 7b models.
>turn the temperature up from 1.0 to 0.7
????
>>108020177
>Because it was specifically trained to do RP shit
No, pygmalion is better because it doesn't talk to me about time machines and people's birthdays when I'm talking to it about a completely different topic. Even if this model isn't meant for roleplaying, every single modern coding llm should be better at RP than a 6b model from years ago.
>Have you considered deviating from that low ass temperature?
"You can now use Z.ai's recommended parameters and get great results:
For general use-case: --temp 1.0 --top-p 0.95
For tool-calling: --temp 0.7 --top-p 1.0
If using llama.cpp, set --min-p 0.01 as llama.cpp's default is 0.05"
No, I haven't.
>>108020204
>Isn't it a bit small and old?
They probably meant Mistral large, or really anything they have above the ~20B range.
>>108020170
>Have you even tried that yet?
anon pls... I really wish it was good. generation speed is really good on a non-server pc, but it is too retarded to use. it is fucking nemo.