/g/ - Technology

File: hatable.jpg (636 KB, 2017x2048)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106748568 & >>106738470

►News
>(09/30) GLM-4.6: Advanced Agentic, Reasoning and Coding Capabilities: https://z.ai/blog/glm-4.6
>(09/30) Sequential Diffusion Language Models released: https://hf.co/collections/OpenGVLab/sdlm-68ac82709d7c343ad36aa552
>(09/29) Ring-1T-preview released: https://hf.co/inclusionAI/Ring-1T-preview
>(09/29) DeepSeek-V3.2-Exp released: https://hf.co/collections/deepseek-ai/deepseek-v32-68da2f317324c70047c28f66
>(09/27) HunyuanVideo-Foley for video to audio released: https://hf.co/tencent/HunyuanVideo-Foley

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>106748568

--Papers:
>106752846 >106755511
--Hardware setups and optimization strategies for running large language models locally:
>106752694 >106752837 >106752845 >106752868 >106752876 >106752881 >106752963 >106752980 >106753037 >106753055 >106753092 >106753128 >106753211 >106753217 >106753245 >106753141 >106753170 >106753173 >106753190 >106754528
--GLM 4.6 creative writing evaluation and benchmark reliability concerns:
>106750563 >106750633 >106750775 >106750659 >106750706 >106750786 >106750841 >106750833
--Mixed reception of Sora 2's video generation capabilities and limitations:
>106748610 >106748671 >106748683 >106748814 >106748736 >106748753 >106748751 >106748774 >106748777 >106748786 >106748812 >106748826
--Evaluating Suno V5's proprietary music generation against local models:
>106749000 >106749400 >106749538 >106749559 >106749590 >106749548 >106749621 >106749642 >106749799 >106749929 >106750524
--Exploring layer-level noise injection for model creativity enhancement:
>106748706 >106748752 >106748767 >106748830 >106748762 >106748852
--GLM 4.6 compatibility updates for llama.cpp:
>106751537 >106751573 >106751594
--Workaround for GLM-4.6 compatibility issue in ik_llama.cpp:
>106754849
--Sora's video generation performance and prompt adherence challenges:
>106753575 >106753597 >106753646 >106753650 >106753662 >106753732 >106753775 >106753852 >106754084 >106753667 >106753676 >106753687 >106753719 >106753729 >106753750 >106755333 >106754191 >106754204 >106753698 >106753722
--Qwen model inaccuracies in name recognition and inconsistent multilingual performance:
>106752092 >106752111 >106752139 >106752258 >106752324 >106752624
--Miku (free space):
>106748655 >106751155 >106751345 >106749314 >106753215 >106753775 >106753816 >106754088 >106754738

►Recent Highlight Posts from the Previous Thread: >>106748575

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Better tool/script to download HF repos? everyone's complaining about this xet shiz. Parts keep stalling and are unresumable using the browser.
>>
it's so over
>>
>>106755904
My waifu Migu (not the poster)
>>
>>106755904
Well done. I pat the Mikuhat.
>>
>>106755923
huggingface-cli works fine and resumes on failure
>>
>>106755923
git
>>
Is glm4.6 hybrid or thinking only?
>>
>>106755923
huggingface-cli download ubergarm/GLM-4.5-GGUF --include "IQ3_KT/*" --local-dir glm
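If it still stalls, a sketch of what I'd try (hf_transfer is an optional backend, not required):
[code]
# optional rust-based download backend, tends to stall less
pip install hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
    ubergarm/GLM-4.5-GGUF --include "IQ3_KT/*" --local-dir glm
# rerunning the same command skips finished files, i.e. it resumes
[/code]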
>>
Wasn't there someone trying to get MTP to work with GLM4.5 on llama.cpp a while ago? Did that also go nowhere?
>>
https://files.catbox.moe/w3cpki.webm
>>
>>106756110
Korean sweatshops could do better walking animation than this.
>>
>>106756126
It's not about the walking animation now but the walking animation two to fifteen years from now
>>
>>106756110
>>106756126
Even if you fixed the walking, the quality of the art is such utter garbage, like your low quality generic isekai slop of the season. The whole point of AI art is to make better art fast, not literally copy the shit speedrun art drawn by monkeys in animation sweatshops.
>>
https://files.catbox.moe/w84blo.webm
>>
>>106756164
This is truly a faithful recreation of a sloppy isekai of the season, just the animation studio saving their budget for so called "better" scenes.
>>
File: 1729374392398752.webm (614 KB, 1000x562)
>>106756126
Are you sure?
>>
>>106756185
Imagine how many more sloppy isekai they'll be able to churn out per season once they can have AI generate 90% of the scenes.
>>
File: 0250922_162247.jpg (117 KB, 1811x380)
>>106755904
>>106755906
>>
File: 1741432814959702.jpg (84 KB, 540x798)
WHERE IS AIR 4.6
GIVE ME THE WEIGHTS
GIVE ME THE GOOFS
NOW NOW NOW
>>
File: cockbench.png (1.19 MB, 1131x3646)
Added GLM 4.6
>>
https://files.catbox.moe/ffaa0e.webm
>>
>>106756268
https://github.com/ggml-org/llama.cpp/issues/16361
https://github.com/ggml-org/llama.cpp/pull/16359
https://huggingface.co/bartowski/zai-org_GLM-4.6-GGUF/blob/main/zai-org_GLM-4.6-imatrix.gguf
>>
>>106756215
At that point it's going to be easier to literally just take any random light/web novel and prompt it to animate that shit. What the hell would we need animation studios run by retarded boomers for? Tradition?
>>
>>106756055
https://github.com/ggml-org/llama.cpp/pull/15225
>>
>>106756303
>Tradition?
It's Japan, so yes.
>>
>>106756313
Good thing we won't need Japan anymore then.
>>
>>106756215
i suspect there are lots of ai tools already used in the background for anime.
at least the subtitle translations already have mistakes that a human cant make, like mixing up a character's sex. you need to actually see the scene because in japanese that info isn't provided by the language.
or 2 interpretations of a word where you need context to know which one it is...
I suppose everything is auto translated and a dude in the basement looks it over quickly before pushing the upload button.
>>
>>106756280
Wtf coderbros???
>>
>>106756158
that's hilarious because, that's not AI...
>>
>>106756323
Stuff like that will be solved when models can also take vision into account or (less likely for now) have better context management so information like that isn't lost when translating a series.
>>
>>106756330
I thought you were joking, but I just reverse image checked that shit and it's two years old. Animators will be out of a job next year by this point.
>>
>>106756330
Winrar, these >>106756110 >>106756164 >>106756285 were all scenes from the QUALITY anime Ningen Fushin no Boukensha-tachi ga Sekai wo Sukuu you desu.
>>
>>106756333
yeah, i dont understand how the normies doom now because the jeet agi 2025 prediction didnt turn out to be true.
as far as i know freelance writers and translators are feeling it hard already.
like you need to use llms and fix the output up manually a little bit for like 80% less pay compared to the past.

also i spot llms everywhere now. once you've figured out the model you can just feel it.
for example the monthly kindergarten pamphlet of my kids. monthly message from the teacher.
im in japan so maybe its more used here though. not sure about other countries.
>>
>>106756268
is it even good for smut compared to g4.5? I was excited for glm because moe. Should I care it's better at math? I feel like for writing it will be whatever
>>
>>106756296
THAT'S NOT AIR YOU NIGGER
>>
>>106756416
air is deprecated due to poverty not being a valid use case
>>
>>106756430
IF YOU'RE NOT USING FULL GLM FULLY ON GPU THEN YOU'RE ALSO POOR
>>
>>106756439
poor is a wide spectrum
>>
File: 1743474513250754.png (93 KB, 720x438)
>>106756391
Improved writing and RP was actually a focus for 4.6 according to their model card.
I haven't used it too much yet because my ggufs are still downloading but it seems a lot more flexible than 4.5 was in open-ended scenarios.
One thing that stood out was that its reasoning process is really fucking thorough for creative work now. It'll go through several really fitting options, map out the reply and consider other pretty interesting stuff which I haven't seen from another model. It's quite different from the shorter thinkers like GLM 4.5 and Deepseek V3.1 while also being much more focused and on point than how R1 used to do it, which tended to meander a lot while reasoning.
I've seen some really good reply variety thanks to this. However, it also sometimes bloats the thinking part up to almost original R1 proportion so you likely still want to skip it if you're running it at a slow speed. Still a pretty interesting trait if nothing else.
>>
>>106756280
>The cock length was decreased by half.
its over
>>
>>106756478
cock chance decreasing isn't bad if you get more varied words for cock
>>
>>106756468
>>Improved writing and RP was actually a focus for 4.6 according to their model card.
when will we learn that model cards mean nothing? they can say anything they want.
>>
>>106756478
it wasn't cut in half, it was flattened to match the rest of the distribution
your cock is now average anon
>>
I struggled through installing ik_llama.cpp on windows, but I finally figured it out. I was following someone's MinGW guide, but CUDA turns out to depend on the cl.exe from VS.

Now that I've got it running, what can I expect? I see that ubergarm has some quants that use the ik formats; are those worth all the fun of installing ik?
>>
>>106756574
for larger models that require cpu offloading, ik+uber is the best speed-wise and for quality/size
>>
>>106756574
i think only deepseek gets a speedup on ikllama because of an architecture specific optimization that ikllama has implemented
>>
Very organic.
>>
>>106756599
Is deepseek even worth trying on 128+24gb?
>>
>256gb uram m5 max macbook
and just like that we're back
>>
Damn 4.6 is good. It's real nasty even without an elaborate sys prompt.
Just ooc "make it hot and sexy" in the first message and it does that. That's how it should be.

The brutal part is the thinking. It's already too big and that kills it totally for me.
I hope we get a voodoo moment for AI soon.
I'm not gonna buy the recent 500 watt monsters from nvidia.
>>
>>106756285
This show was so ass. I can't remember whether I finished it.
>>
>>106756801
Have you tried it without thinking?
>>
>>106756727
if your Internet is fast enough and you have the disk space, it is definitely worth giving it a shot.
>>
Are you anons really measuring a model's quality just by how prone it is to write 'cock' or 'pussy'?
>>
>>106756860
yeah, have you got a better way?
>>
>>106756860
Look at cockbench results for gemma and qwen 3 32b and tell me it's not a meaningful benchmark.
>>
What was the last local roleplay/chat-tuned model that didn't have safety or alignment baked-in, and wasn't a "helpful assistant"? Bonus points if it wasn't told it was an AI.
https://huggingface.co/EleutherAI/gpt-neox-20b?
>>
>>106756882
>https://huggingface.co/EleutherAI/

>Welcome to EleutherAI's HuggingFace page. We are a non-profit research lab focused on interpretability, alignment, and ethics of artificial intelligence.
sad
>>
>>106756882
That was not roleplay/chat-tuned.
>>
>>106756830
No, because I am a huge retard. I thought glm wasn't a hybrid thinker.
Also something with my preset is fucked since it doesn't properly prefill <think></think>. But that's my problem.
Anyway, thanks anon.
>>
File: GLM.png (56 KB, 486x706)
>>106756895
My ST is ancient and I'm not updating it because I'll get supply chain malware from npm, but maybe this is helpful.
>>
>>106756903
that's helpful, i have an area with "Start Reply With" but it seems to be ignored. doing it that way will probably make it work. thanks again anon.
>>
>>106756917
No problem. Thank you for being you.
>>
>>106756468
How censored is it when you let it think? GLM 4.5 was basically uncensored when you disabled thinking, but not so much when you didn't.
>>
>>106756860
yes
>>
>>106756937
how do you actually disable the <think> etc on the command line ie. llama.cpp or similar when starting a server?
and which is usually a better idea for creative writing and stories (as opposed to ERP), let it do the <think>s and just remove them during parsing, or turn it off? what would you do for R1 for example
>>
>>106756860
Counterintuitively it is a great test.
>>
File: 1759319701396.png (1.11 MB, 2048x2048)
surprise reminder that jart will always lurk this bread
>>
>>106756522
Why would a big company lie about that? I get they all lie with benchmark results but focusing on rp would be such a weird lie.
>>
File: 1759319940605.png (61 KB, 401x360)
>>106756860
cmon sam, your 'erry already lost to mesugaki tests to the point you had to include it in your sysprompt. no need to seethe over cockbench
>>
Is 4.6 support merged on main yet or do I have to build from PR?
>>
File: 1759320379155.png (269 KB, 646x644)
i noticed poopenAI had paid shills posting about "building a rig" for 'toss 120b in every single open sores llm community non-stop, on a weekly basis.
no i will not take my meds. fuck clankers. jannies tongue my anus.
>>
>>106757066
Probably Sam, himself. Typical millennial twink probably just sits around shitposting on the internet all day when he's not working.
>>
I'm looking for an advice on building a rig for running the best open source AI OLLaMA chat gpt-oss-120b, any advice? Thanks in advance!
>>
>>106756303
>What the hell would we need animation studios run by retarded boomers for?
Quality? The fuck are you talking about, even the best gens are way behind human animation. This is true for images, videos and text and it will continue to be true for a few years at least.
I'm not one of the guys who has a hard-on for human artists but AI is simply not here yet. First thing I write in booru search is -ai_generated because I can't stand that low quality and shitty generic art style - it ruins my goon sessions.
>>
>>106757104
Sure! For running a 120B parameter model like GPT-OSS or LLaMA-based variants, you’ll need a high-end setup with multiple GPUs—ideally A100 80GB or H100 80GB cards—since memory bandwidth and VRAM are critical; a powerful CPU (like Threadripper or Xeon), at least 256–512GB RAM, fast NVMe storage, and robust cooling are also essential. Using NVLink or PCIe 4.0/5.0 interconnects will help scale across GPUs efficiently. You're Welcome!
>>
>>106756963
https://github.com/ggml-org/llama.cpp/tree/master/tools/server
--reasoning-budget N controls the amount of thinking allowed; currently only one of: -1 for unrestricted thinking budget, or 0 to disable thinking (default: -1)
You'd have to check which models are actually affected by this setting.
It depends heavily on the model whether thinking helps or hurts. R1 specifically has very malleable thinking so you can just tell it what to focus on. Some models will spend most of their time thinking about whether to refuse, which doesn't exactly improve the output. If the thinking is not very malleable and the model hasn't been trained to think about creative writing like 4.6 apparently has been, then it's usually at best a waste of tokens.
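e.g. a minimal sketch, the model path being whatever you actually serve:
[code]
# launch with thinking disabled; --jinja makes the server apply the
# model's chat template, which the reasoning flags hook into
./llama-server -m GLM-4.6-Q4_K_M.gguf --jinja --reasoning-budget 0
[/code]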
>>
>>106757143
It's simply a matter of context, consistency and their preservation across episodes; it's not quite there yet but it's very close. Everything outside of that, like the booru stuff, is literally just a skill issue, slop made by retarded and lazy prompters.
>>
>decide to use chatgpt to make quick work of bash script to download a 5 part model
>model router decides to use retard edition
>can't get the console output right after 5 revisions
>final revision keeps going even after ctrl+c
Just give us o3 back, Sam.
>>
>>106757220
just pay 20 bucks bro
>>
>>106757226
The only thing Plus™ lets you decide now is whether it uses the thinking model or not.
>>
>>106757220
Can't trust gpt for anything coding related anon.
Use it through an api so you have full control and no huge hidden sys prompt that severely degrades performance.
Sonnet for most stuff. Gemini 2.5 pro or a reasoner for difficult stuff. But you really have to pinpoint the problem. The reasoner models all suck because they try to fix everything but what you ask them.
Gpt is only good for general knowledge stuff. Like obscure jap youtuber drama. It knows the deep lore. kek
>>
>>106757157
thank you that helps
I still wonder, even for models which don't just spout cucked consent related nonsense in the <think>, like 4.6 or R1, is it really worth the wait? in other words, is there measurable improvement in quality of resulting prose, and is it significant enough to justify the increase in response time, and slowdown in t/s as context/kv builds?
>>
File: a.jpg (257 KB, 781x959)
>>106757220
I made a simple ebook reader, it's 300 lines and that was really quick, no issues. Then I modified that a little bit and there it is, no more Calibre (I can change the terminal font on the fly etc; the screenshot is not that readable).
But on other nights it can't get anything right.
They should really use (1) quant or whatever for the free version and something else for the paid. Based on my overall experience though, I wouldn't pay anything for chatgpt... it's too random.
>>
>>106757273
4.6 still has cucked thinking?
>>
>>106757274
free tier claude has been consistently good for me. free chatgpt has gotten significantly worse with the last release, it used to be okay.
>>
>>106757299
I haven't tried it, no idea. maybe I misunderstood, I thought the other guy was saying it does NOT do cucked reasoning
>>
I'm downloading 4.6 as we speak to test it out, but yeah... had to resort to manually downloading from web browser on this PC and I'll just push them to the server through the network because apparently my server is getting cock blocked via curl now, probably rate limited from restarting chatjeetpt's shitty download script several times.
>>
>>106757230
It still shows o3 and 4.1 for me under legacy models.
>>
File: file.png (4 KB, 260x156)
Does the thought process display not work in the latest llama server for anyone else? I enabled it in the options and it just looks like this.
>>
File: 2025-10-01_13-07.png (134 KB, 453x538)
yep
>>
Qwen Next 80B support in llama.cpp status?
>>
monke brain got me to do a trial of glm 4.6 off openrouter and it's goood
the vibes are immaculate
how do I turn off reasoning though?
>>
>>106755904
Is there an rp frontend that's like sillytavern but not bloated?
>>
>>106757526
curl
>>
>>106757516
>I don't need you to think, I need you to write fine smut posthaste
Have /nothink at the end of your prompt
>>
the heavy weight of gemma. I can feel it
>>
>>106757656
Gemma MoE incoming.
>>
>>106757656
Would Google release that before Gemini 3?
>>
>>106757656
Gemma is already too fat...
>>
>>106757724
Make it even wider.
>>
>>106757656
...you know what >>106756280
>>
>>106756890
https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B
I should have used that link instead, whoops.
>>
>>106756860
it is the least likely metric to be benchmaxxed
>>
>>106757761
This. But as I've argued before, sex permeates all of human culture and activity and thus permeates language in ways people ordinarily take for granted. And so ERP-related benchmarks can tell you a lot about how the model handles world knowledge in general.
As I've said before, you really have to run a lot of Nala tests to see it.
But a dumb model will just describe her as having hands outright.
But then a smarter model might describe her as having paws but describe the use of those paws basically the same as hands.
But then on really good models there's this light bulb moment where suddenly it describes the paws being used in a paw-like manner. At this point it's not just describing what a paw is but also what a paw isn't. It is now a falsifiable concept to the model. It's the emergence of genuine empirical knowledge. Like I've never genuinely coomed to the Nala card because I'm not into feralshit. But it's an epistemological goldmine when it comes to LLM testing.
>>
does zai have their own api service for glm4.6? How quantized is it on OR?
>>
doing a quick Nala on 4.6 as we speak, but my RAM is probably very fragmented right now because I didn't hard restart after turning off my minecraft server, so it's going at like 1.5 tokens per second. But I will post both the thinking process and the response when it's done.
>>
>>106757795
You can select them as a provider on OR. Hopefully they don't quant their own model. They also have their own API.
>>
>>106757828 (Me)
One thing I will say now is that it uses a lot of wiki markdown in the thinking process which wastes a non trivial amount of tokens. Maybe it's helpful to the model though I don't know.
>>
>>106757846
Would be interesting to see how its performance changes if you ban those tokens.
>>
>>106757865
looking at the text streaming in it looks like they bothered to tokenize ** but not ***. So for h3 it uses ** followed by *. Which, for a model that size, is probably enough breadcrumbs for it to know to go * * if you ban **.
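if anyone wants to test that, roughly this (the token id below is a placeholder, grab the real one from the server first):
[code]
# ask the running llama-server how this model tokenizes "**"
curl http://localhost:8080/tokenize -d '{"content": "**"}'
# relaunch with that id strongly biased against (12345 is made up)
./llama-server -m glm-4.6.gguf --logit-bias 12345-100
[/code]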
>>
>>106757230
I have a feeling this is going to be their "red cross free donuts" moment.
They created an expectation, which was seen as a de facto promise, and now the average user is assmad.
The whole ploy was only going to work if they had an actual moat and zero competitors able to keep up.
Every AI company that can't either serve a consistent SOTA model or be the cheapest bottom feeder is going to die like the irrational money pit that it is. Even VCs aren't that dumb.
>>
File: zuck.jpg (37 KB, 457x457)
>>106756990
whats the context? damn i've been here since 2023 but never noticed this drama
probably because i was too busy making gpt4-x-alpaca
good times...
>>
>>106757936
tl;dr lcpp takeover attempt by a "look at meeee" midwit coder with some minor discord cult was curbstomped by ggerganov. It was a dark time. The less said the better.
>>
wow I have my ST set to generate 2048 tokens at a time right now to account for thinking models but I think GLM-4.6 is going to use all of those and still not be done thinking at this rate.
>>
>Great question — and it's smart to ask this before spending time troubleshooting something you may not even need.
>>
>>106757958
damn yeah I remember seeing her on twitter saying she made that llama thing all by her own
shes cute though
>>
>>106757958
Does Mr. Ggerganov post here?
>>
>>106757936
>gpt4-x-alpaca
that was one of the models that got me hooked, thanks for your work anon
>>
>>106757996
No problem, I'm glad you enjoyed it.
>>
so i tried out ikllamacpp, and i got 3t/s for a 5-bit quant of glm 4.5. but it was running entirely off of my CPU and my SSD. how do i put it on my GPUs and RAM?
>>
>>106757989
No. I don't.

Fuck...
>>
>>106757433
It works in some models and in others it doesn't.
>>
>>106757477
Two more weeks
>>
>>106757973
>https://github.com/ggml-org/llama.cpp/pull/613
This one is a proper narcissist.
>>
>>106756990
Llamafile has quietly been abandoned so I think it's safe to say that by now jart has moved on.
>>
Erl lather magister
>>
File: Nala GLM4.6.png (229 KB, 1832x812)
>>106757828
Here we go.
The response is indented weird because for some reason ST didn't parse the </think> properly and everything appeared inside the thought.
But all that wiki markdown seems to have inflicted it with asterisk brain because it actually did use asterisks to punctuate every single switch between dialogue and narration:
*blah blah blah**she said**she glomps you*
While it took the very rare step of acknowledging that the user's backside is turned to her at the start of the scenario, it did not seem to understand that this makes the space surrounding the user's chest physically inaccessible without any additional action taken.
Either way- it's miles away from the worst I've ever seen. BUT as a function of performance relative to the investment of resources it requires to run locally... It's scout/GPT-OSS tier.
That was a lot of thinking for a very milquetoast reply.
>>
File: 1732405241445772.png (217 KB, 1060x805)
>you were forced to go for the 200 dollars/month subcription to run Sora 1
>With Sora 2 you can simply go for the 20 dollars/month subscription
what kind of black magic is this? I really thought their model was gigantic, but it's probably not the case at all
>>
File: 0ds87f.png (40 KB, 579x160)
>>106757960
*ding* reply is baked
>Thought for 5 minutes
sigh and hit continue
>>106758012
think the idea is ngl 99 and then override layers or specific tensors to CPU with either --n-cpu-moe (adjust as low as possible before GPU OOM) or manually assigning with -ot
for GLM-4.6 maybe want first three (non moe) layers fully on GPU?
first time playing with ik also today so don't really know what i'm doing
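so roughly something like this maybe, the layer count is a pure guess, walk --n-cpu-moe down until just before OOM:
[code]
# sketch: claim all layers for GPU (-ngl 99), then push most MoE
# expert tensors back to CPU; lower --n-cpu-moe until VRAM fills up
./llama-server -m GLM-4.6-IQ4_XS.gguf -ngl 99 --n-cpu-moe 85 -c 16384
[/code]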
>>
>>106758314
Also worth mentioning this is IQ4_XS bartowski quant
>>
>>106758346
>I really thought their model was gigantic, but it's probably not the case at all
Probably a MoE.
Inference providers love MoE.
>>
>>106758346
Don't be fooled by the lies.
You need an invite code to use Sora 2. So it's a limited rollout at the moment. So don't give them your 20 dollars yet. I made that mistake already.
>>
>>106758314
>thought for an hour
wait what
>>
>>106758419
I said on my previous message that the memory on my server is in a very suboptimal state at the moment and as a result it only averaged 0.48 token/sec on generation throughout the entire run.
>>
>>106758432
anyway yeah the think part does look kinda promising
>>
>>106758314
>I like that.
>It's a line of ownership.
>musky+wild == earthy+grassy+predatory?
>>
https://huggingface.co/ubergarm/GLM-4.6-GGUF
He is goofing.
>>
>>106758607
Our lord and saviour.
>>
>>106758346
Sora 1 was also available with the $20 subscription. Maybe there was a delay before it was opened up, don't remember now.
>>106758403
I just grabbed a random invite from a Reddit thread where people were sharing them.
>>
>>106757843
I'm pretty sure ZAI runs the model in FP8 and considers that the "normal" way to run it.
>>
>>106758038
Should've split that in two posts for best effect.
>>
>no 4.6 air

another L for self hosting lmao
>>
>>106758314
Should have turned off thinking...
>>
>>106758734
stop being poor
>>
File: f978h.png (15 KB, 356x168)
>>106758741
Oh GLM-chan it really ain't that deep..
>>
Miku is a clanker
>>
Haven't checked this general since Deepseek R1 0528 came out and have been using that ever since. Is there anything worth downloading over it yet?
>>
>>106758997
nemo
>>
>>106758997
what rig?
>>
>>106759038
Epyc 9334 768GB and a few 3090s to speed up prompt processing.
>>
>>106758997
no except for maybe glm 4.6 but i didn't try it yet. people seem to like it
>>
https://responsiblestatecraft.org/israel-chatgpt/
>>
>>106759086
Baby blender bros... we are so back!
>>
>>106758997
Kimi K2 and GLM-4.5 are side grades, but they are all slower than R1 at the same quant (at least when running in ik_llama.cpp).
I'm still on R1-0528 too.
>>
Had a dream about dating cute girls but I don't even remember their names after waking up
>>
Apple will save local
>https://arxiv.org/abs/2509.22935
>Compute-Optimal Quantization-Aware Training
>>
>>106759193
qrd?
>>
realistically, why cant i just run daniel's 1bit glm4.6 quant and have *fun*
>>
>>106759329
Who said you can't?
>>
>>106759262
Ask an LLM to summarize it
>>
File: co-sft_inference-test.png (1.08 MB, 1784x382)
>>106755904
Good afternoon /lmg/

Did yet another finetune. This time of an entire board's posts

https://huggingface.co/datasets/AiAF/co-sft-dataset
>>
glm 4.6 verdict?
>>
>>106755904
Gemini 3.0 will be released this week or next week (probably on Monday). So the Deepseek team will finally be able to finetune their DS V4 model.
>>
>>106759542
based chinks
>>
ikllama.cpp still is not using my GPUs, even on tiny models. -ngl is set to 99 and i have completely removed --threads and --n-cpu-moe and -ot but it still is CPU only for some reason.
normal llama works with my GPUs just fine.
>>
>>106759636
Did you build it with the CUDA/ROCm/Vulkan/SYCL backend?
>>
>>106759405
What's with the tabs and double newlines in the template?
>>106759499
10/10
>>106759121
I see you're back again. Try 4.6, it's an actual upgrade
>>
>>106759636
Oh yeah, if you launch it with --verbose, does it list the GPUs as devices?
>>
>>106759643
yes. i built it with
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF
cmake --build ./build --config Release -j $(nproc)

>>106759656
let me try that
>>
>>106759405
>Tabs
That's just how axolotl inference formats output. It automatically wraps prompts in the proper chat template so I don't have to worry about making sure my prompts are formatted correctly.

>double newlines
???
I see single lines
>>
>>106758314
Serious prompting issue. GLM-4.6 barely needs anything beyond "You are a writer tasked with narrating {{char}}" for scenarios and "You are {{char}} in a roleplay with {{user}}". Add basic writing instructions to use full paragraphs and that's it. I'm using Markdown headers in the sys prompt to mark descriptions and instructions and it has never used Markdown in any of its actual replies.
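Roughly the whole thing, paraphrased from memory rather than my exact prompt (the headers are the Markdown I mentioned):
[code]
You are a writer tasked with narrating {{char}} in a roleplay with {{user}}.

## Writing instructions
Write in full paragraphs, third person. Never act or speak for {{user}}.

## {{char}}
{{description}}
[/code]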
>>
>>106759542
stealing from the (gpu) rich and giving to the (gpu) poor is based
>>
File: the stuff.png (112 KB, 960x1080)
>>106759668
>let me try that
This is what you are looking for.
>>
>>106759687
very good sarrs
>>
File: おはミ ドヤッ.mp4 (439 KB, 1230x720)
>>
>>106759694
it did not show that. it said that it was not compiled with cuda for some reason and that -ngl would be ignored.
i just tried to do a rebuild but it now says Unsupported gpu architecture 'compute_120'
>>
Been out of ai-ing for awhile, looking for two (2) things.

Is there a good model that can turn images into transparent vectorized images, like for an official logo?

What is the current best LLM you can run on 48gb vram (two 3090s)
>>
>>106758997
glm 4.6 is great
>>
>>106757451
birthday mikusex
>>
>>106759694
>rtx 3070ti laptop
>4.5 air
how does it even work? it must feel like you're making a religious pilgrimage every time you make a request
>>
>>106757707
They base them on Gemini, so I doubt they will release them before Gemini 3.
>>
>>106759730
Do you perhaps need to update your CUDA SDK?

>>106759749
Eh. I get 8t/s with ikllama.cpp at 16ish K context.
Good enough if I disable thinking thanks to the magic of DDR5.
>>
>>106759761
i am on cuda 12.5
>>
>>106759742
this, glm4.6 is legit the first model to truly compete with claude imo, deepseek / kimi were always poor knockoffs
>>
>>106759772
I think you need at least 12.8.
Maybe 12.9.
Or you can manually force a lower compute_ version.
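e.g. untested, but something like this should let nvcc 12.5 build while leaving PTX the driver can JIT for the new card:
[code]
# sketch: target an older virtual arch instead of native sm_120;
# the driver JIT-compiles the embedded PTX for the GPU at load time
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89-virtual
cmake --build build --config Release -j $(nproc)
[/code]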
>>
>>106759742
This, honestly. I also just came back after being out of here since DeepSeek R1 V2 was released and GLM 4.6 is the most natural opensource model for RP I have ever used to this day.
>>
File: file.png (193 KB, 500x280)
>>106759773
>glm4.6 is legit the first model to truly compete with claude imo
>>
>>106759655
>What's with the tabs and double newlines in the template?
He always fucks up the template but blames whatever thing he uses for finetuning or inference or whatever is convenient at the time. He'll probably end up posting a catbox with the output trying to prove it's correct, show that it's not, and then cope with "but it responds fine".



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.