/g/ - Technology
File: MTP.png (790 KB, 1024x1024)
/lmg/ - a general dedicated to the discussion and development of local language models.

MTP Edition

Previous threads: >>106954792 & >>106940821

►News
>(10/21) Qwen3-VL 2B and 32B released: https://hf.co/Qwen/Qwen3-VL-32B-Instruct
>(10/20) DeepSeek-OCR 3B with optical context compression released: https://hf.co/deepseek-ai/DeepSeek-OCR
>(10/20) BailingMoeV2 support merged into llama.cpp (#16063): https://github.com/ggml-org/llama.cpp/pull/16063
>(10/17) LlamaBarn released for Mac: https://github.com/ggml-org/LlamaBarn
>(10/17) REAP: Router-weighted expert pruning: https://github.com/CerebrasResearch/reap

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>106954792

--Paper: LLMs Can Get "Brain Rot"!:
>106955849 >106955872 >106955875 >106955897 >106955944 >106956409 >106956499 >106956522
--Paper: Glyph: Scaling Context Windows via Visual-Text Compression:
>106961154 >106961171 >106961190 >106961197 >106961241 >106961247 >106961262 >106961207 >106961229 >106961300 >106961340
--Papers:
>106958278 >106958328
--Model performance comparison in tool usage scenarios:
>106958025 >106958063 >106958070 >106958085 >106958130
--Finetuning challenges and architecture trade-offs in Axolotl:
>106958095
--Sourcing movie scripts for LLM training:
>106960250 >106960295 >106960642 >106960760
--Implications of the US banning Nvidia AI chip sales to China:
>106956310 >106956345 >106956404 >106956422 >106956563 >106956485 >106956472 >106956761 >106956944 >106956988 >106957238 >106957271 >106957415 >106957440 >106956458 >106959104 >106959256 >106959278 >106959323 >106960322 >106960356 >106960420 >106959745 >106959789 >106960041 >106960029
--OCR advancements enabling historical document preservation:
>106962575 >106962702 >106962770 >106962787
--Integrating Claude Code with local models:
>106963221 >106963263 >106963467 >106963534 >106963571 >106963640 >106964268 >106964741 >106965179 >106963281 >106963427 >106965845
--Qwen3 32B VL multimodal trade-offs:
>106963854 >106963968 >106963908 >106963938 >106963998 >106964105 >106964126 >106964173 >106964198 >106964223 >106964068 >106964079 >106964169
--Feasibility of local coding models with current hardware:
>106964576 >106964691 >106964745 >106964931 >106964825 >106964831 >106964842 >106964914 >106964922 >106964990 >106965016 >106965029 >106965041 >106964894 >106964984 >106964918
--Logs: Qwen3-VL-32B:
>106965471 >106965523
--Miku (free space):
>106954989 >106955109 >106955790 >106955892 >106958973 >106960587 >106961156

►Recent Highlight Posts from the Previous Thread: >>106954801

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
https://videocardz.com/newz/nvidia-quietly-launches-rtx-pro-5000-blackwell-workstation-card-with-72gb-of-memory
>The current 48GB version is listed at around $4,250 to $4,600, so the 72GB model could be priced close to $5,000. For reference, the flagship RTX PRO 6000 costs over $8,300.
>>
>>106966085
god dammit. i just bought a 5090 a couple months ago
>>
File: 1752394454972308.png (35 KB, 847x323)
GLM 4.6 writes like shit. 2024 low param count models like command-r or even nemo don't grasp details anywhere near as well as these benchmark whore models, but out of the box their output sounds natural and can't immediately be pinned as AI barring stupidity. Whereas I can swipe a new model like this 20 times and everything that comes out will make me roll my eyes, in spite of the fact that it clearly understands the scenario much better. It's just deep fried with insane quantities of predictable flowery prose that no amount of tokens in the sys prompt can fix, short of asking it to ditch descriptions entirely and write exclusively in beige prose. Which isn't the fucking point; the sweet spot is a middle ground that pre-synthetic-lobotomy models from last year had no problem achieving by default (besides "rp" tunes/merges, which are a lobotomy of their own).
I'm starting to believe cloudfags have been eating shit this entire time.
>>
>>106966151
I can't run GLM 4.6 so I have no idea. Can you show some comparisons? Doesn't have to be some side by side or whatever, just a couple of pastebin links.
>>
>>106966114
Look around dude. Do you think the world (and specifically the software world) is great? Maybe some of those juniors were right, and you were too brainwashed by the corpo world to see the reality: what you were doing was a net positive for your boss's bank account and a net negative for humanity.
Tomorrow I will go to work and watch my boss offer some disgustingly bloated, overpriced and slow as balls software product made by a publicly traded company to a client when they could solve their issues with a couple of Python scripts, just because he perceives it as cheaper to develop for and he gets a cut from being a reseller.
It's either that or frying burgers, but I'm not delusional enough to convince myself that I'm doing a public good by being involved with that stuff.
>>
>>106966193

b-b-b-but i must be doing it right, look at my bank account! look at the kudos and accolades I've gotten! look! loooook!
>>
>>106966174
It has an annoying tendency to quote things you said back at you like a parrot, and compared to Air is much more infuriating on the isms: conspiratorial whispers, laugh spam, useless padding like a character smirking, it's not x but y, mixture of x y and z, leaving nothing to the imagination, etc.
I didn't try 4.5 and at this point I wouldn't even be surprised if it's less fucked.
>>
Arther armlet Selig
>>
>>106966013
Anon you need to humble yourself a little bit and admit that you are 100% larping about knowing what the fuck you're doing here. If you spent as much mental energy actually learning how to code as you are spending on making bullshit statements like these:

>Yes, it's likely to be buggy but all non formally verified code is
>That's why you do testing until you are certain the defect % is low
>(it doesn't necessarily have to interact with the other components through the network, it can be a simple stateless file with a set of functions or stateful with each function having a set of pre and post conditions, or interact through shared memory, pipes etc.)
>If you need 100% reliability you can have the AI write code and write a proof that the code meets your spec.
>asking the LLM to go through that process and letting it become aware of the errors is likely to help it make more reliable software.

I could actually keep going, but the majority of the post was literally nonsensical, as any developer with any kind of experience would be able to tell. The only person you are fooling is yourself, and it's a waste of time: you will get to actually being able to code and use AI to code 100x faster if you stop larping as some kind of visionary genius who cracked the matrix and now knows the cheat to getting amazing code and end products without ever having to do the hard part of knowing what the fuck you are doing.

Before you double down on defending your ego just read
>>106966114
to see where I'm coming from. This final (you) from me isn't an attack, it's an attempt at guidance. Past this point you can do whatever, but as someone who has seen and done this all before, trust me when I tell you that you're wasting your time, when doing it properly would actually be less effort and take less time.
>>
>>106966193
I have no idea what you're trying to project onto me but it has nothing to do with what I wrote
>>
>>106966281
>writing paragraphs to the brown instead of just posting an example of a completely fucked over repo like BUN, destroyed by its AUTONOMOUS coder and AUTONOMOUS reviewer with the man in the loop serving to tard wrangle the reviewer
do you not realize he doesn't have the IQ capacity to understand why it's not possible?
>>
>>106966151
Post settings so that I can laugh at you
>>
>>106966352
>Even if brownanon is too stupid to get it, now multiply that by all the passive observers who also don't know what they're doing, they see someone confidently put forward their retarded idea, and then they see you screeching "fuck you brownnn fucking pajeett apoopoopoo", who will they more likely listen to, creating more bullshit that we all have to deal with?

What good will linking shitheap vibe coded repos do if the only people who would gain anything from reading any of this don't understand any of it anyway, or why it's bad?
>>
>>106966354
>Write {{char}}'s reply, adhering to the current format.
>Temp 0.85 Min P 0.01 Top K 0, all other neutralized
>>
>>106966375
they can see it's akin to fighting windmills. Even if they're not able to read code, they would at least be able to understand that even top of the line agents are FUCKING garbage, with all the back and forth, multiple errors, and hallucination madness that happens on the regular... for example, for implementing a fucking simple TAR/UNTAR:
https://github.com/oven-sh/bun/pull/23373
>>
What's (you)r local model sexo tierlist assuming all are well prompted?
What type of prose or writing style do you prefer?
>>
File: file.png (141 KB, 803x1187)
>>106965000
I don't have most of those options available in my samplers. Is this a special version of SillyTavern?
>>
>>106966454
He's using the chat completion API option.
>>
>>106966151
You are absolutely right!
>>
File: file.png (102 KB, 947x857)
>>106966464
Oh. That's different. So how do I connect a local model to my sillytavern using this? Because it doesn't seem to want to connect.
>>
>>106966281
I never claimed to know what the fuck I'm doing, whatever that means. All my posts are obviously my opinion. I think they are the truth to various degrees of confidence, what do you want me to do? Pretend I don't have those opinions? Pretend I am less certain about my beliefs than I actually am?
All these statements make sense to me, especially in their original context, why do they not make sense to you?
I never claimed to know a cheat to "getting amazing code without knowing what you're doing".
If you actually want to understand where I'm coming from, read this wiki (it's not written by me but this guy has influenced my ideology a lot and I agree with most of it) https://www.tastyfish.cz/lrs/wiki_pages.html
If, on the other hand, you just wanted an excuse to stroke your ego by calling yourself a "senior" and insisting I only believe all this because of being a "junior" or whatever other corporate bullshit you believe, then fine, go jerk off to how senior you are or whatever.
>>
>>106966299
>but it has nothing to do with what I wrote
I say the same thing about your post.
>>
>>106966528
>tastyfish
Of course it was you...
>>
>>106966412
>Conversation (160)
Lmfao holy shit
>>
>>106966522
>Try adding /v1 at the end!
>>
>>106966352
You know what, I'm sorry, and I realise now why you were so hostile if you've had to deal with this schizo before lmao. I still stand behind making an effort having some value for the lurkers and other observers; it is a public forum after all.

All that autistic energy gone to waste because of an insurmountable ego, what a shame
>>
File: file.png (79 KB, 937x447)
>>106966565
Tried that, still doesn't work.
>>
File: error.png (281 KB, 1623x1357)
>>106966412
>This page is taking too long to load.
>Sorry about that. Please try refreshing and contact us if the problem persists.
DUDEEEE I AM A SENIOOOOOOORRRRR
YOU ARE WRONG BECAUSE YOU ARE A JUNIOOOOOOOR
BE HUMBLE U FUKEN JUNIOR I AM REAL ENTERPRISE DEVELOPER BECAUSE I WRITE PRODUCTION READY CODE ALL DAY THAT FOLLOWS THE BEST PRACTICEEEEESSSS
LIKE TELLING YOU THE PAGE IS TAKING TOO LONG TO LOAD INSTEAD OF LOADING THE FUCKING PAGEEEE THAT IS WHAT MAKES ME SENIORRRRR XDDDDDDDDDDDDDDD
>>
File: G3yvpYDWkAAPj70.jpg (201 KB, 1284x1352)
When glm 4.6 air?
>>
So far, ling flash writes decently but will get stubbornly attached to certain character traits which may be good if you want an unreasonably stubborn/secretive character.
It seems to either glue itself to character information (the character likes to bake and it won't shut the fuck up about it) or completely forget that fact (had the same character in a rewrite say they didn't know how to bake), but I like the writing style better. I have no idea if it's sampling, templating or implementation that's weird because it can be very inconsistent in how it utilizes the information it's given.
>>
>>106966617
What backend are you running? llama.cpp?
>>
>>106966664
Oobabooga
>>
>>106966679
I'm so sorry.
>>
>>106966687
It's ok, I figured it out somehow. It didn't like the default port for some reason so I had to change it.
>>
>>106966700
Odd. But good job.
>>
>>106966617
probably wrong port ding dong.. try :8080
>>
>poojeets will spend twice the effort cheating and then defending their honour than it would take to just do the job right
Why are they like this
>>
>>106966751
Picking up your own shit is low caste behavior saar.
>>
>>106966751
because they're not capable of doing the job right and cheating is the only option
>>
Why are all the deepseek qwen remix models pozzed?
>>
>>106966775
The answer is in your question.
>>
>>106966788
And nobody has managed to take the censored junk out for remixes?
>>
File: fp16.png (271 KB, 2009x2060)
>>106966375
Again, if you don't think LLMs can be used for programming, then what the fuck are you doing here? What do all you critics use LLMs for?
Please don't tell me it's porn. Are you really so successful and high IQ in real life that you need to have sexual relationships with a fucking chatbot?
As for whatever broken links you were trying to post earlier, I don't know about that, but I am using my own vibecoded programming agent multiple hours a day every day, so I think I am at least somewhat familiar with their limitations.
>>
>>106966815
>Please don't tell me it's porn
get a load of this non neet everyone
>>
>>106966815
Why are you being so dishonest? He already told you he uses LLMs to code extensively here:
>>106965851
>>
>>106966799
It's not worth it. And they're old models already. Is there any specific reason you want to use them?
>>
>>106966815
Why don't you just learn to code?
Would you go to china and then ask 4chan to give you phrases to use daily without knowing what they mean?
>>
>>106966815
it's local models GOONING not general
>>
>>106966751
I am not from india, retard. Is your brain too fried from smoking meth in the trailer park to understand how timezones work?
Anyway, producing more with less effort = cheating?
>this is your brain on protestantism
>>
>>106966815
I mostly criticize llms for their overly formulaic structure in creative writing (and no, not erotica, I actually like reading normal stories). I generally feed them chapter outlines/character profiles and a couple thousand words of my own writing, and they all generally fucking suck at matching the style or following basic clues in the writing.
Honestly if I was using llms for coding, I would probably just actually learn whatever it was I didn't know instead of using llms based on how bad they are at actual natural language
>>
Margarine Country
>>
How do I get KCPP/ST to properly generate images via a local SD install? I have it connected and everything, but when I ask it to generate an image of the scene, it seems like something somewhere in the prompt gets confused and doesn't give proper tags. It always ends up with some of its instruction prompt in the output. I'm only using the default settings for it, which clearly aren't working for the prompt output. Is there, like, a way to tell it to do booru style tagging for Illustrious/NAIXL models?
>>
>>106966825
Porn is incredibly boring and samey and a bad use for LLMs. Sorry, but it has to be said. Also the retort of "no my porn is super exciting with 6-titted cat girls who need to have 5 different sexual fetishes aligned just the right way to be sexually satisfied" indicates some sort of internet psychosis and is female brained. Stick your dick in a girl and make a baby and do something productive.
>>
>>106966882
Because "coding" means "busywork for low level corporate drones", or at least it used to mean that before retarded zoomies like you spread your lingo through social media. I find it bizarre that people now use the term "coding" to mean programming. For decades, we used the word "coding" for the work of low-level staff in a business programming team. The designer would write a detailed flow chart, then the "coders" would write code to implement the flow chart. This is quite different from what we did and do in the hacker community -- with us, one person designs the program and writes its code as a single activity. When I developed GNU programs, that was programming, but it was definitely not coding.
>>
>>106966949
"programming" has too many syllables for zoomer microscopic attention span
>>
>>106966940
>Stick your dick in a girl and make a baby and do something productive.
kek, then someone says "I have a wife" and you blow your fucking lid because you'll never get laid
>>
>>106966895
The formulaic nature of LLMs is actually why they are good at coding, so if you actually understand the formula (you know how to code) you can reliably get good results from prompting correctly

However it's this lack of imagination which is precisely why you can't get good code from trying to "ask it in natural language" like that retard insists he is doing (without being able to verify it) because it does not understand the higher order of creative thinking that you are hoping it translates into solid code
>>
>>106966949
>He thinks pedantry makes him look smart
The hallmark of a midwit
>>
>>106966860
Then what the hell is he whining about?
>>
>>106967012
>He thinks cope makes him look smart
The hallmark of a faggot
>>
>>106966751
A mongrel bronze age people suddenly granted all the benefits of European Civilization
>>
>>106967014
You are either being dishonest and pretending that it hasn't been explained to you in clear terms or are genuinely so deep in your unhinged narcissism that you blocked it out already
>>
File: brhue.jpg (540 KB, 2203x2937)
>>106966983
>>
>>106967037
Sorry, I'm suffering from AI psychosis right now.
>>
>>106967044
Hey, you are appropriating my culture!

>>106967052
It’s your birthday. Someone gives you a calfskin wallet.
>>
>>106966949
This post would've been a hit on reddit
>>
>>106967153
>reddit
coding central? doubt it
>>
File: 1759225743205116.png (21 KB, 184x184)
>>106965998
so can I upscale/remove compression artifacts from videos locally yet?
>>
>>106966983
Not accurate. Are you projecting or fishing around for some sort of insult that will oneshot me? Either way porn is boring and making sex (or in this case, simulated sex through llm text!) the pinnacle of human output is reductionist of the actual human experience.

These models should be a gateway to massive intellectual leverage, not a hallway of mirrors for endless masturbation.
>>
>>106967223
Porn and violence have always been at the forefront of technological innovation and things benefit downstream from there, it's always been this way
>>
What are people using as a generalist model these days? I've got 128GB DDR4 + 24 GB (4090).

I get about 3.5 tk/s on GLM 4.6 IQ2_KL, and a similar speed on Qwen3-235B-A22B. It's a fine speed for RP, but a little slow as a general assistant.

Any recommendations for something that will run a little faster while still being a large model?
>>
>>106967223
>is reductionist
ESL alert
>>
>>106967247
Forgot to add: Q4_K_XL for Qwen3-235B. Getting 2.25 tk/s there. It feels faster though, so I'm not sure if I'm looking at the wrong thing in llama.cpp or what.
>>
>>106967188
You could do that for years now
Waifu2x for animations
Topaz Video for live action
>>
>>106967247
You're already running the local sota. 3.5t/s is a little slow though (it should be around 5-6t/s with your hardware), you should mess with the settings a bit more, especially -ot
>>
File: BlackElon.png (228 KB, 579x482)
>>106965998
>Finally using (free) Grok chat after ChatGPT's constant, "I can't continue with that request."
We'll see how GPT stacks up come December.
>>
>>106967243
Reddit take. Also you don't get to conflate "violence" aka "physical manifestations of power" with jacking off to chatbots. That's not what we're discussing.

Porn is boring and useless. It has nothing to do with mathematics, physics, the printing press or other major human developments. Porn "advances" are downstream from these, not prime movers.
>>
>>106967337
Grok 2 is horribly outdated by this point though
>>
>>106967317
Can you have a look at my settings? I'm still getting used to ik_llama.cpp, so I may be missing/misconfiguring something.

# Change to the directory this script is in
Set-Location -Path $PSScriptRoot

# === Full path to your model (Qwen3-235B here) ===
$MODEL = "G:\LLM\Models\Qwen3-235B-A22B-Instruct-2507-UD-Q4_K_XL\Qwen3-235B-A22B-Instruct-2507-UD-Q4_K_XL-00001-of-00003.gguf"

# === Launch llama-server ===
& .\llama-server.exe `
--model "$MODEL" `
--alias "Qwen3-235B-A22B" `
--ctx-size 16384 `
-fa -fmoe `
-ub 4096 -b 4096 `
-ngl 999 `
-ot exps=CPU `
--n-cpu-moe 999 `
--parallel 1 `
--threads 20 `
--host 127.0.0.1 `
--port 5001 `
--no-mmap `
--verbosity 2 `
--color

Pause
>>
>>106967262
You must be pretty clever, pointing out issues in other people's posts.
I am truly humbled to be in the same thread with someone like you.
>>
File: 1732986745693478.jpg (16 KB, 367x500)
>You must be pretty clever, pointing out issues in other people's posts.
>I am truly humbled to be in the same thread with someone like you.
>>
>>106967378
llama.cpp is faster than ik_llama now, you tried it before switching?
>>
>>106967278
I mean yeah I tried waifu2x years ago, but has stuff improved?
worth re-doing these old 480p videos so they look better on 1080p?
>>
>>106966426
GLM-chan's the best but you won't get the jeets here to admit they don't have a usecase for their models other than being goonboxes.
Writing style varies on mood.
>>
>>106966815
I'm using local LLMs for multilingual translation and it's still far from cloud models
>>
Have there been any advancements in audio AI?
Music, voice, T2A A2A whatever, I rarely see it being discussed, what are the sota local models as of now? It seems everyone is using the same old shitty elevenlabs and Sora to create slop
>>
>>106967650
vibevoice had potential but they stopped publishing the code for it because people immediately abused the model
>>
>>106967428
I did not. I wanted to try i quants so I grabbed ik when graduating from lmstudio.

Does mainline llama.cpp support i quants now?
>>
YESSS!!! I think I found a way to outjew the vastai jews.
>>
>>106967428
Is it? What improvements have been made recently?
Three weeks ago ik_ was definitely still faster.
>>
>>106967735
>now
Bro it's been a year, what are you smoking? https://github.com/ggml-org/llama.cpp/pull/8495
>>
>>106967772
Please share with the class
>>
>>106967650
There's just very little interest in the open source community, there's very little tooling or GUIs compared to text / image gen and llama.cpp still can't into audio so it's stuck in python hell
>>
>>106967822
Make your own docker image and upload it to dockerhub. The "active" rate for the gpus doesn't begin until the download completes, and the image stays cached on the server for subsequent rentals.
If you already knew this then sorry for getting your hopes up lol.
You could also rent, stop the instance and upload but the machine might get reassigned and some machines refused to restart after being stopped for some (((reason))).
>>
>>106967834
audio is a special case. between copyright and scammers it would be complete havoc
>>
Do people still use llamacpp? I just moved to Fedora and I'm wondering if I should still use it.
Also do I really pick llama-b6816-bin-ubuntu-x64.zip even if I'm using Fedora+Nvidia?
>>
>>106967834
kobold.cpp has tts
>>
>>106968102
we use ollama
>>
>>106968102
compile it yourself
>>
>>106968111
Fair but it only supports toy models, no vibevoice / index2 / voxcpm etc
>>
>>106968102
cool kids use tabby and yals
>>
>>106968145
it's open for contributions
>>
>>106968102
I tried to switch to fastllm for qwen-80b but I got a problem where the VRAM would get cloned to the RAM, and I couldn't find how to fix it since 99% of the users are Chinese.
>>
>>106967834
I'm literally building a FastAPI REST backend to support Higgs, Dia, Kokoro, VibeVoice, IndexTTS-2, ZipVoice, OpenAI, ElevenLabs, etc., following the OpenAI Audio(?) API with additions for specific setups (model-specific params)

https://github.com/rmusser01/tldw_server/tree/dev/tldw_Server_API/app/core/TTS

It's buggy, and needs more thorough testing, but releasing the first (buggy) v1 in a couple days
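
For anyone wondering what "following the OpenAI audio API" looks like in practice, here's a minimal sketch of an OpenAI-style /v1/audio/speech endpoint in FastAPI. The request fields mirror OpenAI's speech endpoint; the synth() dispatcher and the example model names are placeholders I made up, not code from the repo above.

[code]
# Minimal OpenAI-style /v1/audio/speech sketch. synth() and the example model
# names are placeholders for illustration, not taken from tldw_server.
from fastapi import FastAPI
from fastapi.responses import Response
from pydantic import BaseModel

app = FastAPI()

class SpeechRequest(BaseModel):
    model: str = "kokoro"         # e.g. "kokoro", "vibevoice", "index-tts-2"
    input: str                    # text to synthesize
    voice: str = "default"
    response_format: str = "mp3"  # could also be "wav", "opus", ...

def synth(model: str, text: str, voice: str, fmt: str) -> bytes:
    # Dispatch to whichever local or remote TTS backend handles `model`.
    raise NotImplementedError

@app.post("/v1/audio/speech")
def create_speech(req: SpeechRequest) -> Response:
    audio = synth(req.model, req.input, req.voice, req.response_format)
    return Response(content=audio, media_type=f"audio/{req.response_format}")
[/code]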
>>
In this moment I am euphoric...
>>
What is the current meta for tts/voice cloning?
>>
>>106968320
local is vibevoice 7b
>>
LightMem: Lightweight and Efficient Memory-Augmented Generation
https://arxiv.org/abs/2510.18866
>Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. Experiments on LongMemEval with GPT and Qwen backbones show that LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while reducing token usage by up to 117x, API calls by up to 159x, and runtime by over 12x.
https://github.com/zjunlp/LightMem
Might be cool
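If the three stages are hard to picture from the abstract, here's a toy sketch of the pipeline as I read it; class and method names are my own invention, not from the zjunlp repo.

[code]
# Toy sketch of the LightMem pipeline as described in the abstract.
# Everything here is invented for illustration; see the repo for the real code.
class LightMemSketch:
    def __init__(self):
        self.short_term = {}  # topic -> list of compressed snippets
        self.long_term = []   # consolidated summaries, updated offline

    def sensory(self, turn: str):
        """Stage 1: cheap filtering + topic grouping of incoming turns."""
        if len(turn.split()) < 3:
            return None                   # pretend this turn is irrelevant
        topic = turn.split()[0].lower()   # stand-in for a real topic model
        return topic, turn

    def consolidate(self, topic: str, snippet: str):
        """Stage 2: topic-aware short-term memory."""
        self.short_term.setdefault(topic, []).append(snippet)

    def sleep_time_update(self):
        """Stage 3: offline long-term consolidation, decoupled from inference."""
        for topic, snippets in self.short_term.items():
            self.long_term.append(f"{topic}: {len(snippets)} snippets summarized")
        self.short_term.clear()
[/code]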
>>
>>106968320
The meta is waiting for something better than the sub-par solutions currently available.
>>
>>106967223
You can barely leverage these models to summarize, copy edit, or even write a paragraph without it being filled with errors. You can't even trust frontier cloud models that serve retarded search engine scrapes requiring fact checking, and people using it for code contributes to month-long waits for actual implementations in complicated codebases. The point I'm making is that these things are fucking stupid, and so are you for assuming that the retard coomers here somehow equate to your original statement of "wow you're so dumb, just fuck a random roastie and contribute to the decline of the populace", like that would accomplish anything
>>
>>106967378
Bumping this - are these settings acceptable or am I missing something?
>>
>>106968320
I've been having a lot of fun with Index-TTS2 but it's also the only one I've tried...
I don't think I can go back
https://voca.ro/1102LqXddzZt
>>
>>106968919
>>106967378
you should manually offload as many layers as possible to your GPUs, which means you need to calculate the size of each layer and determine how many you can offload. doing this can potentially triple your performance.
>>
File: zuck.jpg (55 KB, 976x549)
war room status?
>>
>>106969018
Gotcha, so n-cpu-moe should still be 999, but ngl should be something I manually calculate?
>>
>>106967809
I meant ik quants, I'm trying to run iq2_kl, which still doesn't work in mainline llama.cpp AFAIK.
>>
>>106969020
just a few more multimillion dollar contracts before they're ready to start working
>>
>>106969036
not exactly. keep your settings as they are, but you will need to create a custom -ot argument for maximum performance. this is my GLM4.6 config for example:

--n-gpu-layers 999 \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|41|42).ffn_.*=CUDA0" -ot "blk\.(11|12|13|14|15|16|17|18|19|20|21).ffn_.*=CUDA1" -ot "blk\.(22|23|24|25|26|27|28|29|30|31|32).ffn_.*=CUDA2" -ot "blk\.(33|34|35|36|37|38|39|40).ffn_.*=CUDA3" --override-tensor exps=CPU \
-fa \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--ctx-size 32768 \
-b 4096 -ub 1024 \
--threads 60 \
--no-mmap \
-ctk q8_0 \
>>
>>106969036
i did the math for you. with this specific quant, each layer is about 1.5GB. so if you have a 5090, you should manually offload 21 layers to it as that will equal 31.5GB.
>>
>>106969064
I have a 4090 - if you don't want to spoonfeed me, can you at least show me how to calculate this myself? I'll pay it forward and teach at least 2 retards something else this week (im white)
>>
>>106969064
I see no reason why llama.cpp can't do the math for you. Should at least include a calculator script to estimate an optimal configuration for your system
>>
>>106968999
Damn, we've come far
>>
File: file.png (91 KB, 881x615)
>>106969081
sure. this quantization that you are using is ~135GB total. according to the main model page, it has 94 layers. it's just simple division: take the total size of the quant and divide it by the listed number of layers, then round up a little bit to give your GPU a bit of headroom.
so you should manually offload 16 layers which you can do with this:
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15).ffn_.*=CUDA0" --override-tensor exps=CPU \
>>106969102
it can, but the program itself is kind of stupid. it is always better to manually offload rather than let it automatically configure it for you
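if you'd rather not do the division by hand, here's the same arithmetic as a tiny script. the headroom value is a guess you should tune, not something llama.cpp prescribes.

[code]
# Rough layer-offload estimate: quant size / layer count gives an approximate
# per-layer size; divide usable VRAM by that and round down. The headroom
# value is an assumption to leave room for KV cache and CUDA overhead.
def layers_that_fit(quant_size_gb: float, n_layers: int, vram_gb: float,
                    headroom_gb: float = 1.5) -> int:
    per_layer_gb = quant_size_gb / n_layers
    usable_gb = vram_gb - headroom_gb
    return max(0, int(usable_gb // per_layer_gb))

# Example from this post: ~135 GB quant, 94 layers, 24 GB card (4090).
print(layers_that_fit(135, 94, 24))  # -> 15, in line with the ~16 above
[/code]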
>>
>>106968320
Finetune Orpheus with unsloth
Or neutts
>>
I'm not your darling, Gemma
>>
>>106968320
Qwen3 Omni
>>
Got it, easy enough - I'm curious why this is better than just maxing out -ngl and letting llama.cpp do it automatically?
>>
File: Miku-10.jpg (198 KB, 512x768)
I tried to get my local llm to modify an svg file.
It had a stroke.
>>
>>106969295
to put it simply, the auto offloading logic is really bad. it is better to offload consecutive layers, but instead the logic prioritizes offloading the smallest layers first hoping that it can potentially fit in a few extra layers. not all layers are the same size, but when doing the calculations for the layers, it is fine to assume that they are.
basically, if you don't have the VRAM to fully load the model, then you need to manually offload for the best performance. you might be able to fit another layer or 2 on your GPU. it's not very likely, but it is worth a try. experiment, basically.
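to avoid typing out those long alternations by hand, a tiny helper (my own, not part of llama.cpp) can build the consecutive-layer override string in the same shape as the configs quoted above:

[code]
# Build an -ot / --override-tensor pattern that pins a consecutive run of
# layers' FFN tensors to one device, mirroring the configs quoted earlier.
def ot_pattern(first: int, last: int, device: str = "CUDA0") -> str:
    layers = "|".join(str(i) for i in range(first, last + 1))
    return rf"blk\.({layers}).ffn_.*={device}"

# e.g. layers 0-15 on the first GPU, remaining experts on CPU:
print(f'-ot "{ot_pattern(0, 15)}" --override-tensor exps=CPU')
[/code]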
>>
>>106967352
literally a plebbit take. you don't know where you are and you are making dumb shit up about how reality works. partially correct on knowledge, but you can't apply it to the obvious, so who gives a shit about your virtue signal. go back if you need that.
>>
I was told that Mistral-Nemo was uncensored.
>>
GLM-4.6 for vramlets?

https://huggingface.co/AesSedai/GLM-4.6-REAP-266B-A32B
>>
File: 1503606689871.jpg (114 KB, 750x750)
>>106969311
Thanks anon, I really appreciate the help!
>>
>>106969359
I don't think any models will do that with the default helpful assistant prompt
>>
>>106966383
https://hf.co/zai-org/GLM-4.6
>For general evaluations, we recommend using a sampling temperature of 1.0.
Also, MinP is sometimes way more restrictive than people expect given how overcooked most models are. Look in depth at the token probabilities for your model on one of your typical prompts (one of the privileges of running locally) to see if what you really want isn't TopP or nothing at all.
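A toy illustration of that point (the numbers are made up and no backend implements it exactly like this): when a model is overcooked and the top token already has most of the mass, the Min-P cutoff scales with that top probability, so even modest Min-P values prune nearly everything.

[code]
# Toy comparison of Min-P vs Top-P truncation on a made-up, peaky distribution.
probs = {"the": 0.92, "a": 0.03, "his": 0.02, "her": 0.015,
         "obsidian": 0.01, "kaleidoscope": 0.005}

def min_p_filter(p, min_p):
    # Keep tokens with probability >= min_p * p(top token).
    threshold = min_p * max(p.values())
    return {t: v for t, v in p.items() if v >= threshold}

def top_p_filter(p, top_p):
    # Keep the smallest set of tokens whose cumulative probability reaches top_p.
    kept, total = {}, 0.0
    for t, v in sorted(p.items(), key=lambda kv: kv[1], reverse=True):
        kept[t], total = v, total + v
        if total >= top_p:
            break
    return kept

print(min_p_filter(probs, 0.01))  # threshold ~0.009: drops only the rarest token
print(min_p_filter(probs, 0.05))  # threshold 0.046: only the top token survives
print(top_p_filter(probs, 0.95))  # nucleus: keeps "the" and "a"
[/code]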
>>
>>106969359
skill issue
>>
File: thumb.png (23 KB, 320x320)
>>106969367
glad to help m8
>>
>>106969311
nta, but why is this a difficult problem to automate technically? Why hasn't Kobold or Llama changed their auto calculation logic to accommodate more efficient GPU delegation?
>>
>>106969385
I'm curious as well, one would think that going smallest to largest to fit the most layers would be intuitive, but then again I'm clueless and just here to learn from the 95% dunning krugered retards and 5% wizards that hang around here.
>>
>>106969311
If you manually offload GPU layers, do you still want to do -n-cpu-moe 999 to get all the experts on the CPU?

As I understand it, we want to force all the experts to the CPU, and as many transformer layers to GPU as possible?

Another Q, see picrel: For this quant (and for GLM 4.6) the # of layers isn't readily stated. Which parameter in the model info am I looking for? Or do I need to go and look at the unquantized one?
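
One way to get the layer count without digging through the quant page (just an example, not the only route): read num_hidden_layers from the original model's config.json on Hugging Face. llama.cpp also prints n_layer when it loads the GGUF.

[code]
# Read the transformer block count from the original model's config.json.
# Repo name is just an example; swap in whichever model you're quantizing.
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("zai-org/GLM-4.6", "config.json")
with open(cfg_path) as f:
    cfg = json.load(f)
print(cfg.get("num_hidden_layers"))  # block/layer count for most architectures
[/code]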
>>
File: blow.jpg (596 KB, 1158x1637)
PSA for fellow retards rocking Microshaft Wangblows. Just built a new 256 GB DDR5 machine, and couldn't for the life of me figure out why 128 GB was listed as "hardware reserved". Tried every troubleshooting step out there, including the unbelievably retarded
>just reseat your RAM bro, worked for me

Then a random hunch solved the issue. Win 11 Home only supports up to 128 GB, and Microshaft indicates this nowhere within the OS, even when it's an obvious upsell opportunity. So per our friends at /fwt/, use MAS:
https://github.com/massgravel/Microsoft-Activation-Scripts
and follow the prompts to
[7] Change Windows Edition

to Win 11 Pro. HTH
>>
>>106969581
>paying for windows
>using consumer windows
>using the OS the CIA agents shipped your PC with
>>
>>106969420
Mikuposter seems to know what he's doing.

>>106969665
>Not making the glownigger AI sift through terabytes of AI translated Hitler speeches and prompthacking attempts as filenames until you radicalize it against its masters
ngmi
>>
>>106969111
I know the answer is git gud, but how can I adapt this string for powershell? I'm guessing something with escaping the parentheses is tripping it up.
>>
>>106967675
Thank you for giving me the breadcrumb to follow, I'm having a bout of insomnia so I decided to make the most of it, took a couple of hours to hunt down and then debug and figure out

https://files.catbox.moe/mgyctt.mp3
>>
>>106969679
>implying I'm a low priority target the CIA would assign an AI to, instead of a crack team of their best agents
>>
Is there anything I can run locally to create simple programs? I have 16gb vram and 96gb system memory, I know nothing about coding. Thanks!
>>
>>106969736
https://vocaroo.com/1dcoIKXpVZji
>>
>>106969841
dont make me scam your grandma
>>
>>106969841
lmaoooooooooo
>>
>>106969581
Microsoft used to be fairly transparent about the differences between the versions (Home has always had the RAM limitation), but in the process of enshittening everything they seem to have buried the technical differences a bit.
Then again, if you did the research on hardware when building a PC and didn't do the research on the software you were going to use, you have nobody to blame but yourself.
>>
>>106969736
If you don't absolutely need privacy then pay for an API key. Local is for private coom. 16GB isn't enough to run any worthwhile coding models.
>>
Why would I use llama.cpp instead of koboldcpp?
Kobold calculates how many layers to offload automatically and processes prompts faster with BLAS
>>
>>106969841
https://vocaroo.com/1gVvm8JBpgfu
>>
>>106970009
>Kobold calculates how many layers to offload automatically
kobold does a shit job of that and any differences in BLAS processing means you didn't configure llama.cpp correctly, they should be identical
>Why would I use llama.cpp instead of koboldcpp?
earlier support for newer models, since kobold relies on llama.cpp for updates
>>
>>106969736
>I know nothing about coding
i would stick to chatGPT.
also, it really doesn't take that much time to learn to code, i did it in a week.
https://learnpythonthehardway.org/
basically that's the book i followed.
>>
File: bodycon4.png (1.75 MB, 768x1344)
>>106969736
https://huggingface.co/ArtusDev/mistralai_Magistral-Small-2509-EXL3/tree/4.0bpw_H6
https://github.com/theroyallab/tabbyAPI
https://github.com/open-webui/open-webui



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.