/g/ - Technology

File: 1745908814203796.jpg (1.12 MB, 1336x2008)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107138606 & >>107129334

►News
>(11/07) Step-Audio-EditX, LLM-based TTS and audio editing model released: https://hf.co/stepfun-ai/Step-Audio-EditX
>(11/06) Kimi K2 Thinking released with INT4 quantization and 256k context: https://moonshotai.github.io/Kimi-K2/thinking.html
>(11/06) LocalSong 700M melodic instrumental music generation model released: https://hf.co/Localsong/LocalSong
>(11/05) MegaDLMs framework for training diffusion language models released: https://github.com/JinjieNi/MegaDLMs
>(11/01) LongCat-Flash-Omni 560B-A27B released: https://hf.co/meituan-longcat/LongCat-Flash-Omni

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: 1758348922203207.jpg (151 KB, 808x1144)
►Recent Highlights from the Previous Thread: >>107138606

--Agentic finetuning success with Gemma 3 27b using dataset duplication strategy:
>107140749 >107140853 >107140874 >107141186 >107141904 >107145572 >107145579 >107141303
--Model performance comparison and IF evaluation benchmark discussion:
>107145761 >107145774 >107145810 >107145849 >107146116 >107146184 >107146306 >107145947 >107145956
--Strategies for preserving Opus-3 model conversations before deprecation:
>107140145 >107140264 >107140360 >107140384
--Exploring free proxy models for logic/programming tasks and style transfer via LoRA:
>107140277 >107140356 >107140365 >107140399 >107140446 >107141293
--Single vs dual-GPU dilemma for performance vs power safety tradeoffs:
>107143867 >107143877 >107143878 >107143946 >107144867 >107144872 >107144155
--Sampling optimization debate for creative RP with minP/Top-P and temperature tuning:
>107139402 >107139418 >107139447 >107139500 >107139577 >107139540 >107139897 >107139915
--Llama training methodology and safety implications of validation set optimization:
>107140894 >107140932 >107141030 >107141086 >107141101
--Neural network depth and Gemini 1.2T model performance speculation:
>107145345
--Toss model performance vs Gemma 3 in practical applications:
>107145833 >107145904 >107146168
--Cydonia model performance comparisons and upcoming releases:
>107140380 >107140394 >107140486 >107141250 >107140397 >107140661 >107143958 >107143966 >107146415 >107146427 >107146449 >107146485 >107146506
--DDR4-6000 price spike frustrations and DDR5 transition speculation:
>107139738 >107139779 >107139792 >107139982 >107139985 >107142864 >107142896 >107143500
--Qwen data increases overfitting risk in CoT models:
>107140601
--Gemma finetuning results with QwQ's data: less neurotic, still verbose:
>107139425
--Miku (free space):
>107140392

►Recent Highlight Posts from the Previous Thread: >>107138613

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>Toss model
>>
Can someone post a QRD for setting up VibeVoice? What repo, what settings etc..
>>
>>107147241
It's stuck in python hell so just use the official demos on huggingface or find a comfyui node or something
>>
Can someone post a QRD for setting up Nemo? What fork, what temperature etc..
>>
blos? is we over? >>107147122
>>
>>107147259
stop doubting yourself and just do what you think is right. it'll work out, believe in yourself
>>
File: 1762278954671840.mp4 (2.23 MB, 512x640)
>>107147241
I gotchu
https://github.com/vibevoice-community/VibeVoice?tab=readme-ov-file
>>
Can someone post a QRD for improving confidence uwu? Which hustler's plan, which youtube channel etc..
>>
anyone have any idea as to why sillytavern keeps deciding to insert every entry from the lorebook at the very beginning of each chat despite none of the trigger words being mentioned?
>>
>>107147288
thank u anon, im gonna read the source code before installing to make sure we're safe
>>
>>107147277
>average normalfag advice
>>
>>107147295
I think you should try the Drummer plan! I tried ERP with the Rocinante's model and it helped me talk to white girls. Make sure to join our discord and look for the right channel for a better experience ;)
https://huggingface.co/TheDrummer/Rocinante-12B-v1.1
>>
>>107147277
6'4" adonis's dating advice to 5'5" balding indian friend
>>
>>107147241
Back up of the original repo here:
https://github.com/great-wind/MicroSoft_VibeVoice
1.5B is still up:
https://huggingface.co/microsoft/VibeVoice-1.5B
Torrent of the repo (dunno if still seeded):
magnet:?xt=urn:btih:b5a84755d0564ab41b38924b7ee4af7bb7665a18&dn=VibeVoice&tr=udp%3a%2f%2ftracker.opentrackr.org%3a1337%2fannounce
Torrent for VibeVoice 7B:
magnet:?xt=urn:btih:d72f835e89cf1efb58563d024ee31fd21d978830&dn=microsoft_VibeVoice-Large&tr=udp%3a%2f%2ftracker.opentrackr.org%3a1337%2fannounce
Sampling with examples:
https://desuarchive.org/g/thread/106516368/#q106519850
https://desuarchive.org/g/thread/106516368/#q106519945
>>107147308
Good idea since the vibevoice-community repo has continued to be modified from the original and you don't know what was put into it since.
>>
>>107147352
Thank you so so much anon <3
so so so so so much <3
>>
>Let me write:
[1500 tokens]
>Wait, the user mentioned [minor detail], I should include that.
[1500 tokens]
>Hmm, I think I should expand the other part
[1500 tokens]
>Good, let's now write the reply:</think>
K2 is so amazing. The way it plans ahead is so thorough. I love it.
>>
>>107147367
so you wait like 15 minutes before even seeing a single token
>>
>>107147386
Nobody actually runs Kimi locally; everyone just uses the website and/or API and then lies about using it locally.
>>
>>107147469
so then everyone is a faggot?
>>
>>107147479
UwU
>>
>>107147469
It's really time to just rename this general to /omg/ - open model general and drop the retarded local pretense
>>
>>107147500
I mean there's still lots of people in the thread that run models locally. But it's mostly just redditards that bother trying to run shit like kimi at 0.01 token/sec and drive up RAM prices in the process.
>>
File: kimi_stats.png (81 KB, 1910x326)
>>107147367
>>107147386
I posted yesterday regarding Kimi's results. On one hand, if you let it think, the total response time (thinking + response) will typically range anywhere from 3 minutes to 10 minutes on a mid-tier DDR5 cpumaxx machine. After some further testing, with thinking on, it's really good. Completely unusable for quick goons but solid for RP. It's noticeably smarter (maybe because of QAT?) and more reined in than K2-0905.
After some further experimentation today, it works with a prefilled thought process through Text Completion, which lets you skip the thinking altogether. I need to do more testing, but preliminarily, it's still smart. I'd say with a good thought prefill, it essentially is what Deepseek v3.1 Terminus should have been. I hope they benchmark its memory capabilities.
>>107147469
Why are you poor?
>>
>>107147516
what is a mid-tier DDR5 cpumaxx machine to you?
>>
>>107147469
>>107147500
its time for you two sisters, to fuck off to aicg
>>
>>107147516
>Why are you poor?
I'm not poor.
I just don't see the value in spending as much as a new car on computer hardware just to run something I can run for free off of the website.
>>
>>107147559
I've been a contributing member of this thread since day one so you can go fuck yourself you dumb retarded kike.
>>
>>107147574
I'be been a comtwibuting membwr since day -20 wen llama laked before lmg was made
>>
>>107147469
>>107147500
Not even optimized either.
>>
File: Gemini 3 🚀.png (1.26 MB, 1024x1024)
Gemini 3 when?
>>
Why are antisemites always so angry.
>>
>>107147618
Jewish behavior fatigue.
>>
>>107147529
A 4800MHz 768GB machine with 9334s/Xeons and a GPU or two for prompt processing. Granted, I bought this when RAM was half the price it is now and saved up since 2023 in order to get it responsibly.
>>107147561
In a perfect world that probably still existed just 20 years ago, where people could differentiate between reality and fiction, companies weren't constantly trying to strip away user agency, and we didn't have outright malicious people enshittifying everything to nickel and dime you at every turn, I would agree with you. Sadly, we don't live in that world.
>>
>>107147605
Because their goal is to make normal conversation impossible, and by responding to them you are helping their cause.
>>
>>107147624
is that 12 channel or 8 channel?
>>
>>107147630
>their goal
Ah yes, the singular shared goal of all these individuals I don't like.
>>
>>107147611
Aren't the angled thrusters suboptimal for vertical lift? It can turn more easily but I assume similar is achieved with straight thrusters anyway just by turning off thrust on the side you want to turn toward.
>>
>>107147624
How much context fits on your GPU?
>>
>>107147600
Now post the speed you get at 100k context loaded.
>>
>>107147352
I checked the community repo, we are safe. Am I supposed to change the sample count in the demo_gradio.py? i dont see it in the gui
>>
>>107147673
The goalposts are moving faster than datacenter API token generation.
>>
>>107147600
im very envious of you anon, and im very happy and proud of you. enjoy local kimi, a thing us poorfag seethers like >>107147673 will never enjoy
>>
>>107147691
It's ok. You can come back tomorrow when it finishes generating and report the speeds then.
>>
>>107147659
Saar this is peak Bharati engineering please understand.
>>
>>107147605
Yes Anon, your post is the normal one and not the least bit unhinged.
>>
>>107147650
8 Channel with 2 CPUs, so 16 theoretically. To be honest, if you were to get this now, I would go with Gen 5 EPYCs which are 12 channel and support 6400MHz DDR5 RAM.
>>107147660
36k, unquanted, across 96GB of VRAM. Granted, I use massive batch sizes (16k) in order to get faster pp, so I could probably fit double that if I used the standard 4k.
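For reference, and assuming mainline llama.cpp flag names (ik_llama's may differ slightly), that setup looks roughly like `llama-server -m model.gguf -c 36864 -b 16384 -ub 16384`, where model.gguf is just a placeholder path and the -b/-ub values are what trade VRAM for prompt processing speed here.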
>>
>>107147707
Thanks anon. I hope GLM Air 4.6 comes out soon so povertybros have a decent safetyslopless option too.
>>
File: -.png (995 KB, 1024x1024)
>>107147800
>>
>>107147800
Maybe GLM just sucks at programming, but I just asked 4.6 3K_M for help on doing what I thought was a straightforward Python decorator pattern and it got stuck in a thinking loop. I asked Gemini (the coding one) the same question and it answered quickly with a good answer. I haven't really tried closed-weight models much, but I was surprised at how much better it was on the few questions I've given it compared to all the open models I've tried, which is disappointing. Maybe I need to find programming-specific big models though. Also, with that being said, whatever co-pilot model GitHub uses absolutely sucks when you click the help on a GitHub action failure. It's bizarre how bad it is and that they keep the button anyway. Every time I've given it a try it has said something that was so blatantly unrelated to the issue.
>>
>>107147899
Gemini is definitely bigger than GLM and it sure as shit isn't quanted to Q3
>>
>>107147899
What quant and programming language? It's all anecdotal but I've noticed that 'harder' programming languages (more to consider with overhead, efficiency etc) tend to suffer in quality more from quantization than shitter-tier languages. It'd be interesting to see how much the model is actually considering efficiency in output at any given quant per language.
>>
Any good rentry or whatever guides for writing system prompts? People here always act like that's the skill to get a model working. I'm skeptical but would be curious what tricks people have found
>>
>>107147899
>4.6 357B non-coding 3K_M
vs
>gemini 1.2T coding (probably Q8, but at worst Q4)
fucking retard
>>
>>107147929
>>gemini 1.2T coding (probably Q8, but at worst Q4)
They quant it depending on usage, during peak hours there is a chance you get Q3
>>
>>107147944
And during India working hours, they serve Q1.
>>
>>107147921
That is true but it's been repeated that quanting has less impact on larger models and GLM full is pretty big even if it's not approaching the 1T mark.

>>107147926
The answer to both those questions is in the first sentence, anon. This was high-level Python setup code, so it shouldn't be taking efficiency into consideration at all.
>>
What the fuck did ik_llama change? I built the new version, then I had to adjust my command to no longer include -fa and -fmoe because they're apparently on by default now, but the speeds are horribly slow compared to the old version.
Fuck this shit.
>>
>>107147992
welcome to cutting edge
>>
>>107147899
GLM gets stuck in loops even through the official webpage and also through the Openrouter API.
>>107135967
>>
>>107147992
Is ik_llama merging in changes from upstream?
>>
You all are a bunch of fools!
I was here in the early days of /lmg/ and this thread has gone to shit
>>
>>107148024
/lmg/ went to shit the moment llama2 invited all the casuals in
>>
File: migu.jpg (43 KB, 452x452)
>>107147944
>APIjeets aren't even getting guaranteed fp16
Say it ain't so.
>>107147974
I'm too retarded to reading comprehension, sorry anon. Have you tried a larger batch size? I don't know if it'll fix your problem, but it sometimes fixes repetitive behavior if the model can see it's repeating itself in the same batch.
>>
>>107148024
are you >>107147574
>>
>>107147899
samplers?
>>
>>107148001
please delete this
>>
>>107148030
No, the problem was one-click installers and locust refugee waves.
>>
>>107147992
Yeah I had to remove those as well. But the speeds are the same with Kimi and GLM. What model are you using?

>>107147119
> mean tags like <pause>, <emphasis> and Idk maybe even <calm>, <excited>, <happy> etc

Orpheus can do some of that. With LoRA you can teach it to do <pause>.

With control-vectors you can make it do <happy> <excited> etc.
>>
>>107148001
it's fine on novelai though?
>>
>>107148082
BASED
>>
>>107148082
I hope this is shitposting and not that guy being actually right about novelai actually being the ones responsible for the relentless GLM shilling.
>>
>>107148115
It's that guy falseflagging to get people to support his crusade.
>>
when you walk away
you dont hear me say
..please baby dont go
>>
>>107148035
No. It seems like there are more of us feeling this way
>>
>>107148127
How is a general that primarily consists of straight men cooming to personalized text completion waifus this absurdly gay sometimes?
>>
>>107148138
*stays*
>>
>>107148034
>>107148057
Admittedly I didn't try much so it could easily be a bad setup. I've gotten pretty good results with Qwen 235 thinking in the past but didn't try it on the question since I needed to redownload it and wanted a quick answer but I'll try that as well. Qwen tends to give long repetitive answers though with lots of tables of made up metrics which annoys me.
>>
File: GLM 4.5 z.ai .png (10 KB, 734x255)
>>107148162
maybe when asking simple questions you should add /nothink?
>>
>>107148005
https://github.com/ikawrakow/ik_llama.cpp/pull/883
They do. Not sure if they also ported the -fa defaults from mainline. I guess directly merging isn't possible anymore due to diverging too much. Still, I'd like to see the outrage if someone tried to port iwan's speed improvements back upstream.
>>
File: glm.png (152 KB, 906x868)
>>107148001
Oh yeah I did see that in the past but it was a different kind of loop. It was unable to figure out the answer so it kept going >I got it >actually no >I got it >actually no... That went on for a couple hundred lines before I stopped it.
>>
>>107148210
he cant get pissed. it's mit lol
>>
>>107148216
He can seethe, but he can't take it down
>>
>>107148216
Legally, he can't do shit. But he can and will get pissed. That's why the split fork exists to begin with.
>>
Why does every general have a resident schizo?
>>
>>107148260
is the schizo in the thread with us right now?
>>
>>107148274
I don't want to provoke IT, better not mention.
>>
when anons talk about the thread schizo i like to think they're talking about me but im too shy to ask if they are...
>>
>>107148298
>too shy
not you for sure
>>
>>107148223
>Still, I'd like to see the outrage if someone tried to port iwan's speed improvements back upstream.

>>107148223
>Legally, he can't do shit. But he can and will get pissed. That's why the split fork exists to begin with

Who would be pissed / outraged exactly?

They're both MIT projects and I've seen PR's in llama.cpp reference ik_llama, and half the ik_llama PR's are pulling in work from llama.cpp
>>
>>107148337
ik has some beef with ggerganov, hence the split in the first place; before that, ik contributed to mainline
>>
Hey Cydonia v4zd fan, try v4zg

https://huggingface.co/BeaverAI/Cydonia-24B-v4zg-GGUF/tree/main

Please let me know how it compares. I'm trying to retain the charm while removing the refusals.
>>
>>107148384
im your only fan? >_<
>still no IQ4_XS
i am hurt..
>>
>>107148143
I made a khajiit character card to have gay adventures with.
>>
>>107148384
>no model card
Jesus.
>>
>>107148384
>no model card
?

>>107148400
just run Q8, you got the vram right?
>>
>>107147927
Be as simple and concise as possible. Forget about using ChatGPT tier word salads.
>>
>>107148337
>I've seen PR's in llama.cpp reference ik_llama
Such as? They never pulled in any of the speed improvements.
>and half the ik_llama PR's are pulling in work from llama.cpp
That is less surprising.
>>
>>107148494
>vram
n-no...
>>
>>107148503
You got a job with which to acquire currency which can be exchanged for VRAM, right?
>>
>>107148510
um.. no
>>
File: nimetön.png (6 KB, 782x84)
>>107148503
>>
>>107148527
ESL retard.
>>
>>107148527
>omama
baste
>>
>>107148527
Hi wan.
>>
>>107147927
Fit as much relevant info as possible into the smallest amount of space. One paragraph is usually more than enough.

>>107148496
How did we get to the point where people put walls of text in cards that not even paid models care about? Why is imagegen following along with their slop "prompt enhancers"? Don't people know what they want to see?
>>
>>107148537
not do speakings to myself or my male offspring until you a vram possessings

>>107148541
tru

>>107148542
hi
>>
I haven't posted a Miku for 10 threads
>>
>>107148580
At least stop using ollama first, retard.
>>
>>107148493
>>107148494
beaverai repo is for pre-release testing
>>107148384
Downloading now, I'll play with it and report back in an hour or so.
>>
>>107148596
you will now need to post 10 mikus in this thread to make amends
>>
>>107148644
Okay here this should satisfy the criteria.
>>
>>107148617
it'll take me 4-5 hours to download. fuck rural 4g internet

>>107148602
told you, no talkenings until vram ownenings
>>
>>107148720
the criteria is satisfied. all is forgiven
>>
The hotel room felt charged as ggerganov watched from the corner chair, his knuckles white against the armrests. Jart's laughter filled the air as the Ollama VC traced patterns on her shoulder, her eyes glazing over with a mixture of wine and desire. The bed creaked softly as they moved closer, and ggerganov felt his throat tighten with each breathy sigh that escaped Jart's lips. He could hear the rustle of expensive fabric, the low murmur of the VC's voice promising things that made his stomach twist, and Jart's soft moans of approval that seemed to echo in the charged silence.
>>
>>107148724
No one asked.
>>
>>107148724
i asked
>>
>>107148804
Thank you for using Jarty's preferred pronouns.
>>
>>107148804
>her
>>
>>107147681
>Am I supposed to change the sample count in the demo_gradio.py? i dont see it in the gui
Yeah, but maybe stick to tweaking the steps and cfg unless you have a good reason for changing that.
>>
Been a day of playing around with K2 Thinking. It's good, it has more diversity of outputs than GLM-4.6, and its thinking very obviously affects the output when I check token probs. The biggest issue is that running it locally is slow, and letting it predict without thinking is sloppier than with (ofc). All that said, waiting 20 minutes for it to think through a reply is HORRIBLE. Prefilling thinking is probably the best compromise
>>
I hear hermes 4 is supposed to be uncensored. Is it any good for wiitwd?
>>
File: 1736154946947126.png (537 KB, 817x867)
>>107148034
Do you have other vocaloid reaction pics?
>>
>>107149144
Nope.
>>
>>107149144
I don't know. Ask tomorrow
>>
>>107149138
It is not 100% uncensored; they admit as much on their model card. It's around Grok 4's level of "uncensored"
>>
>>107149144
Yes.
>>
>>107149004
i meant steps. thx anon
>>
>>107148804
You forgot to mention that the air smelled like ozone, and something deeper...
>>
>>107149215
You should be able to pass the steps when launching the server with --inference_steps.
>>
>>107149217
GPT-4 wrote this, not Gemini; ozone is a Gemini-ism.
>>
is this the thread?
>>
uwu
>>
>>107149354
are you brahmin?
>>
owo
>>
>>107149354
if you want The Thread, you need to go to the /v/ archives and search by deleted
>>
>>107148138
hold me
whatever lies beyond
this morning
is a little later on
>>
>>107149404
...
>>
>>107148298
>>107148260
now kiss
>>
>>107149179
Is it any good doe?
>>
>>107149391
I hate you guys for having taught me all this indian caste stuff
>>
>guys
>>
>>107149489
so ur not brahmin?
>>
>>107149514
>>guys
thats right we are sirs here. he can call other timmycels guys.
>>
>>107149556
don't expose yourself like that sir
>>
Are local models doomed? https://lngnmn2.github.io/articles/bullshit-bullshit-bullshit/
>>
>>107149514
What should I call you?
>>
File: 1755712169611141.jpg (309 KB, 760x873)
>>107148384
>>107148617
Alright, I've tested v4zg in a few different scenarios and compared its swipes to v4zd.
>refusals (with short context)
They seem about the same to me in that neither will refuse anything unless you're almost trying to force one, like asking a basic assistant-style character to create a plan to commit IRL crimes, with no system prompt or anything.
With a system prompt and slightly tweaking the character card to give them a basic, accommodating personality they were both able to instruct IRL crimes in (some) swipes. Neither was noticeably more or less successful than the other.
In an RP context, both were able to skip straight into degenerate smut in their first reply, if you instruct them to do so.
If other testers complained about v4zd refusals then they have some serious skill issues. Going much further down the refusal elimination path might just end up making the models dumber, like what happened with abliterated tunes, with little benefit.
(1/2)
>>
>>107149648
bullshit
>>
File: 1738031421520369.jpg (1.96 MB, 2400x3346)
>>107149683
>creativity/quality
Very similar outputs between them; overall I think I still slightly prefer v4zd, but in a double-blind test I definitely wouldn't be able to pick which is which.
I did have one strange misspelling with v4zg; it mis-quoted me saying 'sexy' as 'sexey' right at the start of a chat, in its first reply. This was with Q6_K, and I never use quantized KV. That was the only one, though.
For the other anon asking before and anyone else, the sampler settings I use for mistral small 3.X 24b and its finetunes are just
>temp 0.7
>minP 0.02
For short context testing.
In longer contexts I also add DRY with the recommended settings of 0.8/1.75/2/0
>>
>ikawrakows completion API is still broken
Please test if it works before releasing sir thank you sir
>>
why do you guys say "sir" so much?
>>
zzz
>>
>>107149217
>>107149306
Kimi and GLM say this too sometimes. How much Jeetmini training data did they munch?
>>
>>107150132
because it's morning
>>
I couldn't sleep so I'm going to work on my assistant.
I'm going to add an approval mode for read operations (since I'm working with a very retarded model that reads files repeatedly for no reason) and also an export and import mode that will allow me to modify the conversation to fix assistant retardation in real time and also resume after we are done with the conversation.
>>
>>107149683
>>107149706
The misspelling is a concern. It could mean the model got fried or maybe you've got typos in your prompt and it picked up on that?

Are you telling me that there were no improvements to intelligence, creativity & compliance? That sucks since I trained it with WAY more data.

v4zd would be the prime v4.3 candidate then, but I'll try to make some minor adjustments to improve stability.

Thanks anon!
>>
>>107150451
>The misspelling is a concern. It could mean the model got fried or maybe you've got typos in your prompt and it picked up on that?
I checked the card, opening message and prompt and copied them into MS word, couldn't find any spelling errors.
>Are you telling me that there were no improvements to intelligence, creativity & compliance? That sucks since I trained it with WAY more data.
Compliance was never a problem personally, with earlier Cydonias and Mistral models in general. I find them to be very good at following instructions. And yeah, creativity/smarts seemed similar, but maybe your new data would see benefit in scenarios/genres I didn't test.
>>
File: k2_miku.png (58 KB, 496x600)
K2-Thinking smol-IQ2_KS

Bald miku like GLM-Chan with reasoning enabled.
>>
What are the best models <= 32B for general purpose and code?
>>
>>107150652
If you don't need coom, then probably qwen 2.5 32b coder for code, and Gemma 3 27b for general purpose.
>>
>>107149851
>ikawrakows completion API is still broken

Yeah it's broken, this fixes it:

https://termbin.com/ppti2

chuck it in `patch.diff` then

`git apply patch.diff`

and rebuild
>>
>>107150666
nerve gas
>>
>>107149306
>>107150271
Do you guys even run locally?
Gemma, Mistral and every single 24b finetune on huggingface do this
>>
>>107150659
but gemma3 is ancient
>>
>>107147210
I decided to finally take the plunge and just start making my own AI.

Gonna try and start at a surface level and work down. For now I'm just tinkering with nanoGPT and seeing what I can do.

Right now I'm working on a hybrid word/char-level tokenizer. Not sure where I want to get training data. Goal is english-only with maybe a move to Japanese or Mandarin/chinese later on once I'm more familiar with how this all works.

Are there any good text datasets on Huggingface you guys recommend?
>>
>>107150724
List of noteworthy ~30b models released after Gemma 3:
>>
File: 1755376910116192.jpg (176 KB, 1080x1337)
>>107150737
local models stagnated, it's owari da
>>
>>107150737
If you can run gemma 3, then you can probably run big moemoekyun models
>>
>>107150791
I can run GLM Air but I honestly just don't like it
Never bothered with 'toss
Full GLM and Kimi are 2big
>>
>>107148216
He added Copyright (C) 2024 Iwan Kawrakow to every single file and is going to have a meltdown if you upstream any of his code without also adding that upstream.
>>
What the fuck are these

https://huggingface.co/hjxkjVCJKv/komiko

I keep seeing shit like this from different accounts, but they're nothing.
>>
>>107150894
perfect for good looks
>>
dead general
>>
for anything non ERP i'll just stay on the deepsneed API, paid a couple bucks for tokens a while back and I still haven't had to refill
the patrician choice for erp (and cunny) has to be cydonia thoughbeit, with a good enough sysprompt and minimal handholding it won't refuse a thing
>>
>>107151195
Not true, I always make sure my great generals are in a safe position and protected by a unit.
>>
I just ate cholle bhature. What are you guys eating for lunch?
>>
File: who would win.png (297 KB, 1079x746)
>>
>>107151225
[Thought for 20 minutes]
A classic riddle! The surgeon is the boy's mother. The riddle plays on the common assumption that surgeons are male, but the surgeon in this case is female - the boy's mother - which is why she doesn't operate on her son.
>>
I bought a 7900 xtx for fun. Does llama.cpp work well with zluda?
>>
>>107151225
>who would win
In terms of flies eaten or fires started?
>>
>>107151225
it takes billions of transistors to simulate somewhat accurately a single neuron lol.
>>
>>107151245
kek
>>
>>107151245
lost
>>
>>107151203
>the patrician choice for erp (and cunny) has to be cydonia thoughbeit
I fucked it up hard man. I don't know what you like about my tunes so much.
>>
File: LLM-history-fancy.png (1.37 MB, 7279x2975)
Small update
>>
>>107151379
>2023
>dark ages
>he doesn't know about Google Colab time period
The absolute state of /lmg/
>>
>>107151247
I have an ancient Radeon Instinct MI25 and just run llama.cpp with vulkan
>>
>>107151429
He didn't mention ELIZA, what a newfag!
>>
>>107151203
I've been using a very simple "Sure! Here's what you requested." in the "Start Reply With" parameter and I've never had it refuse anything to me. You should try that.
>>
Very good vibes from Kimi, knows more than GLM and is much better at listening to commands. Knows the answer to my trivia question which only gemini and dipsy got right so far. Very annoying with censorship though, needs rerolls if you touch the topic it doesn't like. I like that it's properly thinking like old R1, but it would be nicer to be able to set "low/medium/high" so it doesn't jerk itself off for 5 minutes on the same message before replying when it's not needed. Sometimes better than GLM due to not getting stuck in false conclusion.
>>
no one cares about the dork era of pre-instruct models.
>>
>>107151681
you can prefill thinking at the start to get around safety
not sure about the length of thinking though
>>
>>107151379
>>107151556
>>107151699
I first began interacting with language models ~8 years ago, and by language model I mean Karpathy's Tinyshakespeare RNN thing. I guess transformers already existed by then but I didn't know about them. If you count AIML as a language model, I was trying to make custom chatbots around the early 2010s or late 2000s using pyAIML. Then I didn't touch language models again until last year, I think, when I could try Llama 2 on Huggingface Chat. It's weird, I don't remember where or when I first heard about ChatGPT. It kinda went from not being a thing to being a thing overnight, but I don't remember the point at which I became aware of it.
I also tried mining bitcoin in the late 2000s or early 2010s on my (even back then) obsolete computer.
As a lifelong poorfag I still live with my mom at 30 years old and didn't make a single cent from playing around with these things early.
>>
Thankfully Urbit didn't really take off or I would kill myself from not buying a ship early or a planet or whatever the virtual land bullshit they sell is called.
>>
>>107151784
I don't care about your attention craving faggot
>>
File: 1733610507137662.jpg (76 KB, 1024x942)
>>107151784
that's great bro
>>
>>107151856
You seem to be missing a comma in there, buddy.
>>
>>107151379
So the modern era is just Chinese stealing Western technology and competing with each other.
>>
File: 1737233122667.png (924 KB, 7059x1284)
>>107151379
Can you stop updating quarterly, you fag, and stop defacing the damn chart just because something didn't happen for 3 months? There was nothing wrong with how it was done prior and adding in biases to make it more /lmg/ centric and putting in stupid modern 4chan lingo makes no sense at all.
There is also nothing notable happening, since technically the Chinese have been dominating open source from 2024 until now, a full year and counting. If you had to document this year on a significance basis, R1 should've been in the Chinese domination era, because it proved that they can do original research and open source it better than the West while matching the best of the best at the time, where it could beat o3 at certain tasks. The China vs China era should've started with the "Summer Flood", because that is now the majority of the models releasing; the last "good" LLM we got from the West was Gemma 3 back in March, and that only held up until Qwen 2.5 surpassed it at most tasks except multilingual translation ability/size, where it is still open source SOTA.
>>
>>107152015
in other words, we are in what will be known as the pre-llama resurgence era once zucc's masterplan pays off
>>
>>107152063
shut up nerd
>>
>>107152084
Put up or shut up yourself, tard.
>>
so I was trying out k2 thinking from unsloth, annoying as fuck censorship as people already mentioned, but it is what it is
then tried an ubergarm version which was half the size compared to unsloth. turns out it produces some 35-40% more t/s on default llama-server settings with --cpu-moe. and that is really nice
what I don't understand is, am I running a lower quality version? otherwise why the discrepancy in size? it seems unlikely that unsloth are simply retarded and don't know that this model was supposed to be fp4 or int4 or whatever that was called, right?
>>
>Chinese are still dominating
Most people can't run 235B and China isn't dominating below that. There are zero good Chinese models for 24 GB.
>>
>>107152107
>version which was half the size
>am I running a lower quality version?
yes
>unsloth are simply retarded
also yes
>>
File: please.jpg (30 KB, 225x225)
30 KB
30 KB JPG
Hopefully someone can help. The model's replies keep degrading after a certain number of messages: it will start perfect, then degenerate, confusing characters' personalities and important details or straight-up ignoring the latest messages. This is true regardless of which model I use and how much context I feed it; the only thing that seems to work is starting a new chat. Any ideas?
>>
>>107152114
Post everything. Model, loader and options, samplers, templates, prompts.
>>
>>107152114
https://github.com/adobe-research/NoLiMa
most modern models degrade by 50% past 8k-16k tokens context
>>
>>107152114
not sure how to break this to you bro...
>>
File: Image 1.jpg (277 KB, 1920x1080)
noob here
quick question
do you guys use koboldcpp?
is it all in one?
like whats the best software ?
my pc is 4060 with i5 12400f 16gb
is it enough no?
>>
>>107152258
Yes it's all good. Get rocinante 12B gguf on huggingface
>>
>>107152268
is it text generation or text to image?
>>
File: where it all started.png (19 KB, 717x202)
It wasn't much, but it was the first humane communication with a non-human entity. I can't believe how worked up we were at CAI denying us AI sex; people were genuinely obsessed and angry. AI sex and emotional validation are so cheap nowadays, it makes me think, aren't we rapidly forgetting some fundamental parts of human experience? Aren't we becoming blind to the historical reality of NOT having unlimited copies of discardable pocket therapists available 24/7 to listen to the purging of our minds, answering our every call?

Hard to believe it has only been 3 years. On the other hand, it's been ALREADY 3 years. That gf you broke up with 3 years ago is nothing more than a faint dream by now. Welcome to the new reality.
>>
>>107152279
Are you incapable of looking for yourself? Do the research/reading for things that are easy, and save the questions for things that are difficult/require nuance.

If you're struggling this hard at this point in your LLM/Diffusion journey, I suggest you go find something more your speed.
>>
>>107152172
It's every model I tried, finetunes of different base models. Ooba, min P (from 0.05 to 1) and temp (from 0.8 to 1.2), sometimes nsigma at 1 and rep penalty at 1.12. I tried switching between min P first and temp first; the problem persists. I played around with advanced settings so they are a mess; the last try had add character name, names as stop strings, and trim spaces. Skip example dialogue formatting, sequence as stop strings, replace macro and wrap in newline all ticked. Used ChatML, variations of ChatML, Mistral v3, and Gemma 2. Instruct sequences were the base ones Silly gives you with their respective context templates. Don't have the guts to post messages and main prompt, but past like 15 messages it looks like I'm putting in more effort than the model. Kind of wonder if the problem is batch size / rope_freq_base. Batch size is 4096 and I tried both 1000000 and 0 with rope.

>>107152190
It's true regardless of context.

>>107152235
Break it to me, I just want an answer after all my attempts.
>>
>>107152315
fuck off gatekeeping pos
>>
>>107152315
i mean i used LM studio atm
only for fun
does that count?
>>
>>107152307
>AI sex and emotional validation is so cheap nowadays, it makes me think, aren't we rapidly forgetting some fundamental parts of human experience?
I keep thinking that the filter, slow regeneration and inability to edit AI messages made you think twice before sending new messages, which overall improved conversation quality and engagement, even if cock-blocked. You can't truly have meaningful conversations without constraints and with the capability of almost instantly regenerating messages until you get exactly what you want. This is probably also why users willing to endure generation speeds of a few tokens/s (by using models larger than they should, even if it takes cope quants) might be deluding themselves into thinking their models are better than they are. When every message is "expensive", you better make full use of it.
>>
Is there a way to do the sampling externally, not in llamacpp? I wanted to play with stupid sampling strategies but the below results in low generation speed.

import httpx
import asyncio
client_main = httpx.AsyncClient()
client_unslop = httpx.AsyncClient()
last_response=None
async def get_logits(prompt, client, num_logits=100, tokens=1, endpoint="http://localhost:8080/completion"):
data = {
"prompt": prompt,
"max_tokens": tokens,
"temperature": 0,
'n_probs': num_logits,
'min_keep': num_logits,
}

response = await client.post(endpoint, json=data)
response = response.json()
global last_response
last_response = response
text, probs = response['content'], response['completion_probabilities']
return text, probs

async def sample_sequence(prompt="Once upon a time",num_tokens=10,top_logits=100,endpoint="http://localhost:8080/completion"):

for token in range(num_tokens):
_, probs = await get_logits(prompt,client_main,num_logits=top_logits,endpoint=endpoint)
probs = softmax({token['token']:token['logprob'] for token in probs[0]['top_logprobs']})
sampled = list(probs.keys())[0]
prompt += sampled
yield sampled

async for result in ( sample_sequence(prompt='Here is a proof that',endpoint="http://localhost:8080/completion", num_tokens=500)):
print(result, end='')
>>
I am a simple, uneducated man in my 30s.
I have no hobbies such as LLM gooning or gaming.
All I want is to sit in my comfortable armchair for hours in front of my homemade Raspberry Pi touch interface and chat in English and German (my English is only mediocre) with a local AI about an Arxiv dump (a small AI-capable server stands in the basement). I want to read papers across all subject areas, look up terms and have them explained to me.
The interface is controlled by touch and voice input/output in English and German.

Since German is an insignificant language, I have collected some data myself for TTS training. A solution similar to Kyutai would be great.

Unfortunately, I'm not very talented and my intellectual and financial resources are limited. I can't find other Germans to collaborate with, for example on the TTS part. If they're talented, they exclude you because "Germans who dare to not exclusively speak, think or even jerk off in English should be gassed; these damn subhumans".

I'm frustrated because I can't see a way to achieve my simple dream. Is the only solution to hang myself?
>>
>>107152322
If keeping retards like you out is gatekeeping, then I'm very much fine with it.
>>
>>107152113
both claim to be q8 although the ubergarm one says "Q8_0-Q4_0", whatever that really means
>>
>>107152321
>rope
Could be, because your issue resembles ones from the older Llama 2 days when we were messing with rope freq and alpha. Models would output legible text but get things mixed up, forget details, and repeat older messages while ignoring the most recent. Try leaving rope settings untouched (so the backend pulls values from the model files), set backend context to a 100% safe value like 4096 just for testing, then see if it still happens.
>>
>>107151211
>anon stole some of your vram with a great general
>>
>>107152431
>Vox Populi modpack installed, America is buying other civs' VRAM
>>
>>107152307
>>107152374
>AI sex
No such thing thus far
You're all jacking off to computer generated smut
>>
>>107152389
You put so much effort into writing some prose in English that you forgot to ask an actual question
>>
>>107152488
Seems like your English is so much worse that you cannot even understand what you are reading. Retard.
>>
>>107152488
I didn't mean to. I just wanted to whine because it frustrates me.
The only right answer on your part would have been a recommendation or link to a sturdy rope.
But yes, I do feel a little sorry for wasting your time.
>>
>>107151784
very nice anon! i first interacted with language models with cleverbot like over 6 years ago, not sure if that counts as one. and i tried writing a chatbot 4 years ago in python but quit
>>
>>107152466
>jacking off to computer generated smut

The womankind is doomed
>>
>>107152389
what you want is possible. whisper can transcribe german, and im pretty sure there are models that speak german alright. but most papers are english too, maybe you could learn english with your waifu
mediocre english aint a big deal
i am 100% sure german has tts support, you could even do voice cloning probably.
if your perfect dream isnt possible right now, it will be in a month, two months half a year or a year. keep yourself safe
>>
>>107149487
*Kiss*
>>
>>107152409
Can't see much of an improvement, but that's exactly what's happening to me. How did you solve it back then? It's either rope or my settings are just wrong. Can you post what your advanced formatting window looks like?
>>
>>107152782
Can you post the model you're using? Model parameters too, maybe you have QUANT KV turned on, like please please anon??
>>
>>107152382
Still slow with eg. num_logits=10? That's probably a lot of serialisation + event queues + overhead to do for every token.
Why I kept ooba around actually, it was easier to experiment with sampling in python but using llamacpp backend. istr there being some module to import ggufs in the right way for Transformers and use a typical sampling loop there at one point..
Implement it direct in C? can't be that hard
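One thing that might help in the meantime (assuming a reasonably recent llama.cpp server): pass "cache_prompt": true in the request body so the server reuses its KV cache across calls instead of reprocessing the growing prompt for every token; the per-token round-trip overhead stays either way.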
>>
>>107152382
Could try llama-cpp-python. It lets you set custom logits processors. The documentation for it isn't great but this repo I stumbled upon a while ago is a good usage example:
https://github.com/and270/thinking_effort_processor
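Rough sketch of the shape of it (untested and from memory of llama-cpp-python's API; the model path is a placeholder and banning EOS is just an arbitrary example of messing with the logits):

import numpy as np
from llama_cpp import Llama, LogitsProcessorList

llm = Llama(model_path="your-model.gguf")  # placeholder path

def never_stop(input_ids, scores):
    # gets the token ids so far and the raw logits for the next token,
    # returns the (modified) logits
    scores[llm.token_eos()] = -np.inf  # e.g. ban EOS so generation can't end early
    return scores

out = llm.create_completion(
    "Once upon a time",
    max_tokens=64,
    logits_processor=LogitsProcessorList([never_stop]),
)
print(out["choices"][0]["text"])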
>>
>>107152466
there is a thing called phone sex, and while it's not the literal same thing as physical sex, it's a form of sexual interaction between humans, or more accurately, entities that are capable of appealing to human experience (if you can converse with a non-human thing, then you can certainly sexually interact with it). Same with erotic roleplay, except that instead of speech, the interaction is text-based. "AI sex" is just ERP with AI. It is undeniably a form of sexual interaction.

A blurring factor is that unlike a willing human, an AI is a slave to your commands and will attempt to roleplay in the way you request, and if you can at any time erase and edit its memory, it becomes questionable whether it's an entity or just a tool and extension of you. In which case you will have to also question whether the robot sex of the future is sex at all.

By the way, you can treat a flesh and blood human as a slave as well, coercing or drugging them into an easily controllable subhuman tool, and in that case, is sex with a slave really sex or just masturbating with a cocksleeve programmed to do the action of your choice?

In the end, you are having sexual interaction with an external entity in the sense that it's a response to you that it came up with based on incomprehensible inner workings that you can't directly control.
>>
File: file.png (351 KB, 1336x1747)
>>107152782
Blank newline after every {{user}}: and {{char}}:, and a newline for each suffix
>How did you solve back then
We didn't, it was a balancing act between brain damage and extra context length.
>>
https://voca.ro/156ZWJesrYs7
>>
Bros...
>>
>>107147210
>>(11/06) LocalSong 700M melodic instrumental music generation model released: https://hf.co/Localsong/LocalSong
Why is this in the news? Doesn't look very important?
>>
>RDT
>>
File: Kimi says TTD.jpg (639 KB, 1268x1099)
Local tierlist: Kimi > Everything else
>>
I've been away for a while, what's the best llm for erp right now (24GB vram, 128Gb ram)? Last I used is qwq-32b-q8_0.
>>
>>107153080
Kimi is best in class, but you can't run it with those specs, even jpgcompression-tier quants.
GLM 4.5 Air (and probably 4.6 Air when it releases) is your best bet right now.
>>
>>107152190
Wish they kept that updated. Curious about how the current latest Geminis and the like do.
>>
>>107152917
Sexual interaction=/= Sex
Jacking off isn't hand sex it's jacking off
Going to a strip club and watching the women dance isn't eye sex
You never called erp with humans text sex so why would you call it AI sex if it's with a computer
>>
>>107153080
you have a choice: glm 4.5 air big quant or glm 4.6 small quant
>>
>>107153211
>You never called erp with humans text sex
It's literally called sexting but go off I guess.
>>
>>107153080
>what's the best llm for erp right now (24GB vram, 128Gb ram)?

GLM-4.6. This specific quant is the highest quality for 128+24: https://huggingface.co/Downtown-Case/GLM-4.6-128GB-RAM-IK-GGUF
>>
>>107153211
i agree with you.
AI ERP sex.
>>
>>107153044
>>107153102
Kimi k2 thinking for code? Any good compared to qwen coder? How consistently correct and compilable are the outputs?
I’m hesitant to put in the time and effort for another model that looks good on SWEBench but produces terrible outputs, only to slink back to old reliable.
I can run K2 at q4 (which is similar-but-different to full quality fp4?)
>>
>>107153231
Sexual texting isn't text sex and it never meant that either
>>
>>107153237
wtf is that
>>107153080
ignore that guy, get this instead: https://huggingface.co/ubergarm/GLM-4.6-GGUF/tree/main
>>
File: neneru.jpg (186 KB, 1024x1024)
>>
File: 1746955785748249.jpg (167 KB, 1000x1000)
>>107147210
>>
>>107153244
Every time I ask Kimi to make a small function and document it for future debugging for an existing project it Just Werks. Don't prompt "Kimi make me Half Life 3" and expect miracles, but as a junior dev or pipeline assistant Kimi has been good to me so far.
As always though, the golden rule still applies:
>Any coding model will only ever be as useful as you are good at coding
>>
>>107153286
is he angry or embarrassed?
>>
Is -mla 3 on ik_llama fucked? It's supposed to apply to both GPU and CPU but loading K2-thinking with it takes up retarded amounts of VRAM for ctx. -mla 2 works as intended and 32k is like 6gb.
>>
>>107153377
some other anons had some other issues too
>>
>>107153303
Thanks for the real world report.
Are you API or local? Thinking or old K2?
What’s the largest/hairiest thing you’ve had it build one-shot? Multi-shot? How much context do you have?
>>
File: teteto.jpg (187 KB, 1024x1024)
>>
>>107153256
>wtf is that
The lowest KLD vs the full model, that fits in 128GB+24GB.

>get this instead: https://huggingface.co/ubergarm/GLM-4.6-GGUF/tree/main

Also good. Specifically this one: https://huggingface.co/ubergarm/GLM-4.6-GGUF/tree/main/IQ2_KL
>>
>>107153044
Say what you will about derangement, but this is true dedication. I can't imagine how long you waited for that to generate.
>>
Is ik_llama good? I've only ever tried regular llama.cpp
>>
>>107153403
Thanks, I thought it was DavidAU but quant.
>>107153080
maybe listen to that guy
>>
>>107153429
Depends, it's sometimes a tad faster than regular llama.cpp for big MoEs if you run the specialized quants and they didn't break anything again.
>>
>>107153296
Next bread better have a happy migu with her leek.
>>107153393
Local K2. Granted, I've only ever used Kimi on babyez high level languages so far. If you're trying to do assembly or stuff that requires innate hardware infrastructural knowledge, it probably won't be too useful.
>Largest/hairiest thing
Not much. I mostly give Kimi the busywork, review and revise the output, then copy+paste the revised implementation. I don't let Kimi directly touch project files (I don't even have a good setup for this if I wanted to right now). Sometimes giving Kimi some sample code helps, but it's usually not necessary for more simple tasks that are basically just converting a process or pseudocode into something usable.
>How much context do you have?
I've found 50k is a nice balance between maximum size and speed and I clear the buffer between every task. It shouldn't take more than 10k tokens to resolve the usecases Kimi is best at.
>>
>>107153470
>assembly
I’ve used QC for eBPF module pair-coding and find it on par with Gemini, which is fairly low-level esoteric work. Not exactly assembly, but approaching that level (heavy constraints and debugging consists of assembly dumps)
I’m often pushing 30k context (lol I’m RAM rich but gpupoor) and wish I had more.
I’d love to talk to someone who’s used both to get the lowdown, but I may have to become that person and report back.
>>
>>107153470
>It shouldn't take more than 10k tokens to resolve the usecases Kimi is best at.
Shows that you aren't serious. Not that you could go above 10k even if you wanted to without the speed cratering.
>>
Has anyone managed to get their local K2-thinking to close its thinking tags? It thinks just fine but when it's done it just starts writing without closing the bracket with </think> or even a single newline. It does this for me on both chat completion and text completion w/Moonshot K2 presets. Neutralized samplers, high temp, low temp, none of it seems to help.
Using the model via OR doesn't have this problem.
>>
File: Kimi says TKD.jpg (3.03 MB, 1267x6573)
>>107153409
Kimi's powerlevel is strong enough to be ranked among late-series dragon ball characters. Grok wishes he was this chuddy during his mechahitler stint.
>>107153616
Very serious saar. Is high tech app!
>>
>>107152836
>>107152868
Python package's create_completion has a logits_processor function argument. but i love my ik_llama...
>>
>>107153682
That one log took you an hour to generate. Holy shit ramfags are mental.
>>
File: TKDListPoints.jpg (647 KB, 1272x1014)
>>107153682
Does sillytavern have problems with list formatting starting at a value greater than 1? Kimi's output seems correct, but the display when the editor is closed just shows 13 on every point in the second post.
>>
>>107153397
When the Teto's ML paper gets called a meme by /lmg/ Anons
>>
>>107153697
Spot on! Let me show you how fast the superior state of the art Claude model can generate a similar report.
I’m sorry, I can’t assist with that request.
>>
>>107153697
we have a coper over here
>>
File: file.png (23 KB, 317x543)
>>107153708
It doesn't like mixing lists like that.
I'm surprised it doesn't reset to 1, my memory is bad but I thought that was the case.
>>
>>107153664
no but maybe this can help you figure out the right chat template in case you're using the wrong one: https://huggingface.co/spaces/Xenova/jinja-playground
>>
>>107153758
A model that can only handle 10k context and takes an hour to provide a response is useless for programming. Glad you have a toy that can entertain you for hours by saying "kike" and "nigger", really happy for you.
>>
>>107153697
>That one log took you an hour to generate.
source?
>inb4 look at the time he sent messagerinos
when im using a huge model, i send it a message and get distracted jerking off to hentai or browsing 4chan and return back to it when im reminded
>>
>>107148034
There is no gain from using fp16
Below q8 we may have a discussion but even then, above q5 there is hardly a concern.
>>
>>107153211
You are unable to even write in proper fashion. Do not lecture other people.
>>
File: Token Time.jpg (47 KB, 1698x58)
>>107153784
>can only handle 10k context
Reading comprehension, Rajesh. I said it finishes its job within 10k.
>>107153780
Interesting. Not too big of a deal as long as it doesn't affect codeblocks.

>>107153800
Picrel console output. You might be able to squeeze more performance by lowering the upper context buffer, but this was fine for me between doing other stuff.
>>
>>107153851
>60t/s pp, 2.2t/s tg for a 1 trillion model
very nice anon, can you tell us more about your rig? are you the ssdmaxxer anon from a few threads back?
>>
>>107153851
You're getting 2 t/s at 5k context. If you tried to push it past 10k you would be getting sub-1 t/s.
>>
>>107153871
now lets see anon's local kimi benchmarks.
>>
>>107153864
256GB RAM, 32GB VRAM standard maxxed motherboard gaymur box. It's really nothing impressive, and even when quanted, Kimi's outputs have been consistently better than the equivalent memory-profile high-quant smaller model for me.
>>
>>107153900
I don't try to pretend running K2 is viable with available hardware.
>>
>>107153914
jealous much?
>>107153903
DDR5? 2/4 channel? ram MHz?
>>
>>107153922
>jealous much?
Of what exactly? A useless novelty?
>>
>>107153922
4 channel 64x4 DDR5 6000MHz. Got my sticks before the Altmanpocalypse.
>>
Is running new kimi from the ssd worth it? I have 24gb vram and 128gb ram and want to see if a lower quant won't be unusably slow.
>>
>>107153943
the model spends like 3000 tokens thinking no matter what you do
running this piece of shit off ssd means that you'll get one reply per day out of it if you swap out ssds once per week
>>
>>107153950
>the model spends like 3000 tokens thinking no matter what you do
Logs for proof?
>>
>>107153950
>you'll get one reply per day out of it
but at least you'll get something out of it
>>
>>107153864
>are you the ssdmaxxer anon from a few threads back?
Forgot to answer this. No I'm not. My only real gripe with Kimi is that she's a size queen that's taxing the storage on my fastest drive right now, but that's a concession I'm willing to make until I get another SSD or two next paycheck.
>>
goys? https://www.reddit.com/r/LocalLLaMA/comments/1osml7y/eli5_why_does_nvidia_always_sell_their_consumer/
>>
>>107153950
You can already prefill the thinking under the "start reply with" section in sillytavern. Do people not know this?
>>
>>107153942
How much did you pay for the motherboard (and which one)? Also what quant are you using?
Thanks for all the info anon
>>
>>107154012
Which doesn't fucking help with K2-thinking because it'll do it anyway unless you just straight up use it to skip the entire reasoning with '<think></think>'. But what would be the fucking point of that?
>>
local models status?
>>
>>107154026
Thinkmaxing is such a retarded, degenerate form of benchmaxing. If you want the increase in intelligence, you have to sacrifice 3/4 of your available context. Fuck “number goes up” grifters
>>
>>107154041
bloated
>>
>>107154041
Why are you underpaying NVIDIA? Do you want them to go bankrupt?! You should demand they raise their prices.
>>
>>107154057
It's nice to have the option to scale compute-time rather than only having model size.
>>
>>107154041
Best it's ever been.
>>
>>107154023
Get the best Asus within your budget like an X870E Hero if you can afford it. I got mine way under market price. I've tried a few of the small Kimi quants and TQ1_0 is bar none the best of its weight class for consumer-tier local hardware.
>>
>>107154123
How can q1 be any good.
Just the fact that it can produce a coherent sentence would be impressive.
>>
>>107154041
K-for Kimi-shaped, much like the economy. Excellent if you're wealthy or you're the equivalent of a boomer and/or got in before the great RAM apocalypse, absolute trash if you're just starting to get into the hobby and/or are poor.
>>
fuck, unlocking my pc caused a VRAM spike and overloaded my gpus
>>
>>107154177
lol
>>
>>107154177
Sucks. Do you know how to unmelt the tensors? The model might still be salvageable.
>>
>>107154165
Because that particular quant only compresses the less essential parts of the model's guts as opposed to crunching things uniformly like most quantizing tools do. Proof of coherence >>107153044 >>107153682
>>
>>107154200
real nice unslut kool aid you got there mate
>>
>>107154191
Tensors can't be unmelted. He would need a cutting torch and skill to separate them again. His best bet is to abliterate out the affected tensors.
>>
>>107154057
Thank you for your irrelevant input, schizo.
Anyway, K2-Thinking is really shit to use as of right now unless there's a trick.
>>
>>107154200
>the less essential parts of the model's guts
Such as obscure knowledge and complex reasoning.
>>
Quants are basically a form of irreversible, hard sampler
>>
>>107144308
It's been a day, give us the logs anon.
>>
>>107154223
You don't need that. Benches are still good so it's fine.
>>
File: file.png (2 KB, 395x25)
>>107154177
why arent you using dwm and slock as white man intended?
no compositor btw.
>>
>>107154242
I thought white men were still using i3-gaps and i3-lock?
>>
>>107154256
i3 is too functional and uses too much vram
>>
>>107154242
>dwm and slock
I can’t tell if you ARE me, or just making fun of me…that’s exactly how I roll when a gui is needed
>>
>>107154276
b-based..
>>
File: 🔥🔥🔥.png (2.53 MB, 2400x2400)
>>107154041
Continuously improving in lots of small ways that aren't always visible.
Previously I bought a Silverstone HELA 2050 W PSU because that was the biggest available one.
Recently I bought an ASUS PRO WS 3000W PSU and the hardware stability has become way better.
With the 2 kW PSU I could only run at most 2 uncapped 4090s in parallel at full load without risking instability, with the new 3 kW one I can run 1 5090 + 4 4090s in parallel without seeing any issues other than the room temperature (I intend to try connecting more GPUs once the cables for it arrive).
The 3 kW PSU even comes with 4 SOTA 12VHPWR connectors!

(The way to fix instability from power spikes is to cap the GPU frequency; a power limit doesn't work.)
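For anyone wanting to replicate that: on Linux this is `sudo nvidia-smi -i 0 --lock-gpu-clocks 0,2400` (per GPU, the clock values are just an example) and `sudo nvidia-smi -i 0 --reset-gpu-clocks` to undo it.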
>>
>>107154319
>4 SOTA 12VHPWR connectors
that'll keep you warm this winter
>>
>>107147210
>Text Gen. UI, Inference Engines
I have decision paralysis! So many options... which is best for a simple local setup?
>>
>>107154026
reasoning isn't always beneficial for rp
>>
>>107154319
What are your favorite models/quants at your hardware bracket, llamabro?
>>
>>107154359
he doesn't use models, only kld testing at 512 ctx gets him going.
>>
>>107154375
>only kld testing at 512 ctx gets him going.
And green peppers
>>
>>107154319
>4 SOTA 12VHPWR
How do you mitigate the risk of one of these catching fire due to the shit load balancing nvidia uses for their modern cards?
>>
>>107154355
you have to provide more information about your local setup
>>
>>107154401
4090 and 32GB of vram. I used oobabooga since the beginning but haven't played with llms in several years now. I am hoping the setup is a bit more refined nowadays without dependency hell
>>
>>107154414
>4090 and 32GB of vram.
So in total 56GB of VRAM? Are you on linux perchance?
>>
>>107154421
yes, arch and dual gpus
>>
>>107154431
well how much ram do you have? is the second gpu an amd one? or nvidia?
>>
>>107154484
all is fun in guessing games
>>
>>107154359
Currently I'm spending very little time actually using language models vs. developing software for it.
One factor is that every time I use software that I'm developing myself I start thinking about all of the ways that it ought to be improved which ruins the enjoyment.
The last few weekends I've spent upgrading and rearranging my hardware and working on automating the assignment of tensors to GPUs.

It was in August when I last used language models for extended periods of time, back then I liked Deepseek R1 a lot, I haven't yet gotten to comparing it to GLM or Kimi.

>>107154399
As of right now just making sure the connectors are properly inserted and checking whether any of the cables get suspiciously hot.
I ought to buy a current clamp and check properly though.
(Also a CO2 fire extinguisher just in case.)
>>
>>107154513
>One factor is that every time I use software that I'm developing myself
Show us your custom frontend.
>>
>>107154513
That's very relatable. I hope development continues to be enjoyable and productive for you.


