/g/ - Technology

File: 1748924525376873.jpg (1.08 MB, 2544x3120)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107063981 & >>107056325

►News
>(11/01) Emu3.5: Native Multimodal Models are World Learners: https://github.com/baaivision/Emu3.5
>(10/30) Qwen3-VL support merged: https://github.com/ggml-org/llama.cpp/pull/16780
>(10/30) Kimi-Linear-48B-A3B released with hybrid linear attention: https://hf.co/moonshotai/Kimi-Linear-48B-A3B-Instruct
>(10/28) Brumby-14B-Base released with power retention layers: https://manifestai.com/articles/release-brumby-14b
>(10/28) NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 released: https://hf.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>107063981

--A Study of BFLOAT16 for Deep Learning Training:
>107070442 >107070483 >107070511 >107070527
--Multi-GPU performance debate in AI model acceleration:
>107069202 >107069222 >107069244 >107069255 >107069261 >107069265 >107069378 >107069264 >107069942
--Vector-text storage in Postgres using BLOBs and cosine distance ranking:
>107070426 >107070428 >107070500 >107070535
--LoRA alpha parameter's role in training and inference stability:
>107064965 >107065003 >107065032 >107065046 >107065138
--LLM-assisted prompt refinement techniques and tools:
>107064845 >107064904 >107064908 >107064920 >107065271 >107065682
--AI model capabilities in OCR, translation, and writing for potential human translator replacement:
>107065203 >107069145
--Fixing Chinese-to-English translation contamination in Terminus model:
>107065949 >107066491
--ID verification requirements for AI interactions and potential workarounds:
>107065472 >107065504 >107065629 >107065653 >107065667 >107066126 >107066744 >107066673 >107066743 >107066818
--qLoRA finetuning constraints on Blackwell Pro 6000 GPUs:
>107067618 >107067655 >107067735
--Evaluating Strix Halo machine's cost-performance for AI workloads:
>107067095 >107067114 >107067162 >107067259 >107067349 >107067420 >107067727 >107067783 >107067868
--Native Multimodal Models are World Learners:
>107068769
--Seeking benchmarks for older AI models via Open LLM Leaderboard:
>107070598 >107070637
--User preferences for VTT models: Voxtral Small 24B, WhisperX, M2M100 1.2B pipeline:
>107066814 >107068206
--Miku (free space):
>107067074 >107067524 >107067676 >107068066 >107071616 >107073605 >107067350

►Recent Highlight Posts from the Previous Thread: >>107063985

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>107074052
>Emu3.5:
This will never land in llama.cpp right?
>>
File: genration-13_webp.jpg (532 KB, 1280x707)
>>107074118
>Matches Gemini 2.5 Flash Image (Nano Banana) on image generation/editing
sure it does. just look at this
>>
>>107074052
>an adventure with Miku
>>
>>107074054
Where is my struggle about finding a local small AI coding agent, you tool?
>>
>>107074267
I guarantee that miku wears diapers she's the kind of girl that would do such things I know it when I see it
>>
>>107074176
Damn, now I'm interested. What's the holdup with llama.cpp support?
>>
>>107074420
The point is that it's fucking shit
>>
question, when coding using sonnet, grok, gpt or other api models, context grows very fast, and for many of my queries it can easily reach 50k, 100k, or even more, as there are just so many files that have to be read. despite that, those models can still perform well and mostly accomplish the given tasks, usually with some hiccups or omissions sure, but overall good progress can almost always be made.
meanwhile, when local models are asked to write longer stories, things tend to fall apart completely around maybe 10k in, or even sooner. you start getting short sentences that just stop making any sense.
could someone help me understand why there is such a difference? why is context buildup so detrimental for creative writing in particular, yet doesn't seem to have the same effect on coding? or is it that API models are somehow more powerful?
>>
>>107074453
api models are way more powerful
>>
>>107074349
>Let that sink in.
get out elon
>>
yeah don't let people gaslight you, api models are much better than anything local
local does improve though, just more slowly than online models
it wasn't even long ago that the norm was that even the largest models would go to shit on local past 4k
>>
>>107074361
--Searching for 24GB models compatible with agent functionality:
>107067281 >107067346 >107067353
This one? It was all the way at the bottom, past cutoff. Your struggle sounds like a personal problem, but you could try DeepSWE-Preview. It's trained on top of Qwen3-32B with thinking.
>>
>>107074361
>why isn't "saar please tell me the needful, btw I have 16 GB VRAM" a highlight
>>
agent anything on local: LMAO
wanting it on a small amount of VRAM: hahahahah oh god he's serious?
>>
>>107074461
how so. if deepseek can be run at q8_0 locally then wasn't that supposed to be at least on the same playing field?
>>
>>107074453
creative writing is the broadest of domains. it's too open ended. consider how many valid completions there are for
>she opens the door quietly
vs
>the square of the hypotenuse is equal to
>>
>>107074521
it's not competitive even without quantization, sorry bud
deepseek has an API too, and I have used it; the model only behaves on a level comparable to SOTA models if you stay under 10k
even at 10k there's degradation, but it's still usable up until like 30k
>>
>>107074559
so you're saying if I tried to get a novel out of sonnet 4.5 it can do it in one conversation? or what are you saying
>>
https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/oQdzQvKHw8JyXbN87/home

>>107074559
>>107074611
>>
>>107074628
thanks, true if big
>>
>>107073605
GLM4.6
>>
>>107074711
Buy an ad.
>>
thank fuck I prefer short stories anyway
>>
>>107074628
do you people use models
those benchmarks are retarded
>>
>>107074748
how are the benchmarks wrong
>>
>>107074559
>pissing in a sea of piss
*yawn* do your whore mother a favor and kill yourself little buddy
>>
why is everyone so grumpy today? hangover?
>>
>>107074461
sonnet and other sota models start falling off in rp quality after just ~16k in my experience so that's not it
>>
>>107074486
*24
>>
>>107074513
>wanting it on a small amount of VRAM: hahahahah oh god he's serious?
Well, where can I learn about the actual hardware requirements for agents?
>>
Is there a benchmark with simple coding tasks that I should test small tool-enabled models with?
I want to test whether training on inputs is better or worse.
>>
>>107074513
I'm finetuning Gemma3 on agentic (mostly) coding tasks. I haven't gotten there yet, but I think it's doable.
>>
>>107074900
They don't have any special requirements, you just need (at the moment at least) the >200B models to do anything useful.
I think it's a data issue though and if we make the right dataset it can be done on a model an order of magnitude smaller.
>>
>>107074900
nta. Just use whatever you can run on whatever you have, see what you can do with them. Smaller models are easier to run and faster to iterate with.
Use google to find information.
>>
Retard here, please send help. Running kobold & ST on a mistral tekken v7 tune. I must have fucked up a setting somewhere because my responses went from being quite fast to generating at a snail's pace, and I don't know what setting I changed to cause that. Or is the gen speed somehow tied to what intro prompt you use with a card? I HAVE noticed that some poorly written cards just gen like shit, but I'm using the same card I was before. I'm completely lost.

On a side note, what can a poorfag do if he wants something better than a 3060 12gb? I want to play my tardslop vidya still, but also want to gen sloppa faster and chat better with my computer. I would upgrade to a 4090ti, but poor and am really hesitant to try and get a used card from facebook/CL/ebay, etc. and those amazon "refurbished" cards seem sketchy as fuck, too.
>>
File: romed82t_00-01.jpg (2.49 MB, 4096x3072)
>>107074052
PSA: Volta and Blackwell are incompatible.
An NVIDIA engineer kindly informed me that on Linux Blackwell only works with the open NVIDIA kernel modules (honestly I should have known to check dmesg).
With that I got the 5090 that NVIDIA sent me to work, though notably the V100 I had intended to use in the same machine only works with the proprietary NVIDIA kernel modules (Ampere and Ada Lovelace work with either).
For now I connected my MI100 instead, if one compiles both the CUDA and ROCm backends it can be used alongside the 3090, 4x 4090, and 5090 for a total of 184 GiB of VRAM.
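In case anyone wants to try the same mixed setup, this is roughly how I'd expect the build to look (a sketch from memory; the exact cmake option names can differ between llama.cpp versions, so check the build docs):
# build llama.cpp with both the CUDA and the ROCm (HIP) backends enabled
# so an AMD card like the MI100 can run alongside the NVIDIA GPUs
cmake -B build -DGGML_CUDA=ON -DGGML_HIP=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
# on startup llama.cpp prints the devices it found; check that list to
# confirm every GPU from both backends is actually visible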
>>
>>107074988
The deeper into the context you are, the slower it becomes. If you have something like the phrase ban thing, it will take longer because it needs to regenerate.
Show the speed difference, we're guessing otherwise.
>>
>>107074805
no you, pajeet
>>
>>107074453
>local models
This statement is not very useful unless you tell us what models you're comparing the cloud models to.
I'm not disagreeing, but the gap is significantly different for certain models vs others
>>
>>107074988
>if he wants something better than a 3060 12gb
I'm in the same boat. /lmg/ LLMs are go big or go home. If you're not running multiple RTX 4090s or using one as a frontend for a big RAM (512GB) machine, I think you're better off sticking with what you have and running smaller models. The hardware is very expensive for local.
> poor
lol double on above advice.
>>
>>107074628
damn, hows minimax m2 coming along in lcpp? suddenly I care again...
>>
>>107075009
link to that mining rig? did you get it in the EU?
>>
>>107075009
Is all of that still powered with a single PSU?
>>
>>107075120
it's already supported
>https://huggingface.co/bullerwins/MiniMax-M2-GGUF
you need to pull and recompile
>>
>>107075120
what the fuck is this minimax thing? is it available locally? how slopped is it
>>
>>107074988
If you run out of system RAM, llama.cpp will begin pulling the weights from disk for each token, which results in a massive slowdown. So either reduce the maximum context or close other memory hungry applications on your system.
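As a concrete sketch (llama.cpp flags; kobold exposes equivalents in its launcher, and the numbers here are placeholders, not recommendations):
# keep the context modest and the weights resident in RAM/VRAM so nothing
# gets re-read from disk on every token
llama-server -m model.gguf -c 8192 -ngl 99 --no-mmap
# -c        context size; the KV cache grows with it, so lower it if RAM is tight
# -ngl      number of layers to offload to the GPU
# --no-mmap load the weights into RAM up front instead of memory-mapping the file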
>>
>>107075122
I just ordered it off of the German Amazon: www.amazon.de/dp/B07H41S74S
Could very well be that there are cheaper options, I didn't spend much time optimizing that part of the build.

>>107075124
Yes, it's a single Silverstone HELA 2050W PSU.
Without a frequency limit, however, multiple power spikes will eventually align, drain the PSU's capacitors, and crash the system even if the average load is below what the PSU should be capable of.
In my opinion a frequency limit should be set either way because for a constant workload like a neural network it doesn't make sense for the GPU to temporarily boost to very high frequencies where the efficiency is bad.

In the meantime, Asus has released a 3 kW PSU since I bought mine; I intend to buy one of those eventually.
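For reference, the caps can be set with nvidia-smi, something like this (the numbers are examples for illustration, not recommendations for any specific card):
# cap power and lock the GPU clock range so transient boost spikes cannot
# align across cards and trip the PSU
sudo nvidia-smi -pm 1            # persistence mode so the settings stick
sudo nvidia-smi -pl 450          # power limit in watts, applied per GPU
sudo nvidia-smi -lgc 210,2200    # lock GPU clocks to a min,max range in MHz
# sudo nvidia-smi -rgc           # resets the clock lock if you want it gone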
>>
File: serious Pepe.png (359 KB, 728x793)
I have not checked in a long time now.

Did llama.cpp figure out how to utilize dual CPU rigs efficiently?
>>
>>107074176
can somebody gen an image of him shoving that black sharpie up his ass?
>>
>>107074453
i don't have any issues with context using kimi k2 0907 up to 40k
>>
>>107075049
>>107075146
It's being weird and not even offloading to the sysram. Like right now it's been stuck on token 25/350 for a solid 30 seconds and counting. I have my context at about 6k, so I don't know what's happening.

>>107075106
My idea was adding a second, dedicated card more suited for AI rather than gaymin', but I have no fucking idea what to even look for when it comes to non gaming gpus.
>>
>>107074176
the key difference being that you can train and inference entirely locally
training requirements better be reasonable.
stronger relationships + multi-turn understanding of data means you could relate tons of data together to form more "intelligent" training.
for example, before + after image pairs to say, nudify or rotate. turning random objects into girls, or making mechs outta characters.
basically the isolated conceptual training of single image + caption pairs isn't good enough any more, people want cross-domain stuff now.
if Emu makes it more accessible I'll take it.
>>
>>107075254
Show your fucking options, show the model you're running, show how you run it, show the performance log in the terminal output. For all I know you're doing everything that could possibly be wrong.
>My idea was adding a second
Don't waste your money yet. You're thinking of buying a new car when you can't find your way out of your house.
>>
File: amretardedhelp.png (927 KB, 2560x1440)
>>107075309
Sorry. I don't mean to be a pain in the ass. Here's a bunch of stuff, idk if it helps.

>Don't waste your money yet. You're thinking of buying a new car when you can't find your way out of your house.
To be fair, I do gen quite a bit of slop locally. I'm just retardedly new to text models and such. I'd prefer to just get a 4090ti for the gaymin, but... Yeah. Poor.
>>
>>107075483
Show the output of top and nvidia-smi as it's generating. From the looks of those logs it's stuck on prompt processing and it is not even using the GPU.
>>
>>107075483
first of all, fuck cydonia r1
second of all, use IQ4_XS not Q4_K_L
third of all use cydonia 4.2.0 maybe?
and since its R1 its probably thinking thats why u arent seeing anything
also windows is worse for ai btw
>>
File: h9g78f.jpg (1.65 MB, 5280x2560)
>>107075009
nice rig
>dmesg
have a journalctl -f or similar running in the background to catch all the errors
#!/usr/bin/sh
gnome-terminal --profile=Syslog --full-screen --zoom=0.6 -- bash -c 'echo -ne "\e]0;syslog\a" ; SYSTEMD_COLORS=16 journalctl --no-hostname --no-tail --follow -b 0'
sleep 1
wmctrl -r syslog -b add,skip_taskbar
wmctrl -r syslog -b add,below
>>
>>107075510
I'm going to sound even more retarded, but where do I see that at? I don't see anything that looks like that in the kobold terminal or the sillytavern one.

>>107075533
Let me try and download that one, then. I'll report back in a bit when huggingface quits being a cunt about its download speeds.
>>
there's a couple of these speak to type things rolling around that clean up your speech n shit
is there an open alternative yet?
>>
First time doing this, I didn't think I had the specs for it (old i7 and 8gb nvidia gpu). I just downloaded ollama, using gemma3:4b, and I'm in awe with how fast it is. It's basically as fast as chatGPT and it can even read images.
>>
I'll be back when pewd's tourists are gone.
>>
>>107075605
is retarded tho
>>
>>107075605
try a quant of a bigger model and offload some layers to your cpu
>>
File: file.png (195 KB, 950x927)
>>107075580
nevermind 4.2.0 is trash
maybe get mistral small v3.2? youll need a bit of a jailbreak for it but its nice
maybe check reddit for shitty erp models, but theyre likely shit
https://www.reddit.com/r/SillyTavernAI/comments/1ogzbb3/megathread_best_modelsapi_discussion_week_of/
>>
>>107075559
god dam that desktop has a lot of pixels
what is the physical size of your monitors brother?
>>
>>107075649
i like fine DPI see >>106873195
>>
>>107075649
willing to bet hes using a dual monitor setup
>>107075668
KNEW IT
>>
>>107075559
>steam botnet in the background
Enjoy being spied on.
>>
>>107075559
>vivaldi browser
bro cant be serious... using a proprietary browser... bro... thats worse than using windows broo...
>>
Is there really no way to finetune Gemma 3 27B with full context?
I couldn't get it to work even on a 4xH200 machine.
Neither llama-factory nor axolotl seem to have the ability to actually shard across GPUs.
>>
i sharded myself
>>
>>107075580
Open a cmd and type "nvidia-smi". As for "top", if you are on Windows the equivalent is the Task Manager for seeing CPU utilization. If it's running on the GPU you shouldn't see more than one or at most 2 or 3 cores at 100%. If it's running on the CPU you will see all cores pinned to near 100%.
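If it helps, this refreshes the numbers once a second while the model is generating (same command in a Windows cmd or a Linux shell, assuming the NVIDIA driver tools are installed):
# watch GPU utilization and VRAM usage update live during generation
nvidia-smi -l 1
# or just the interesting columns
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1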
>>
>>107075690
>proprietary
It has nice features basically modern opera
https://vivaldi.com/source/
>>
>>107075696
Whatever happened to tdrussel? He had a pretty good multi-GPU deepspeed script going with qlora-pipe but it hasn't been updated in a long time.
>>
>>107075648
Would I still go for the Q4_K_S if IQ4_XS isn't an option? I don't know which ones to get out of the list of all these Q4, Q5, Q6's.
>>
File: file.png (31 KB, 403x205)
>>107075717
proprietary blobs
https://vivaldi.com/blog/technology/why-isnt-vivaldi-browser-open-source/
>Note that, of the three layers above, only the UI layer is closed-source. Roughly 92% of the browser’s code is open source coming from Chromium, 3% is open source coming from us, which leaves only 5% for our UI closed-source code.
>>107075727
well you could get Q4_K_S but Q4_K_M is better
_L are memes besides Q3_K_L
pretty sure there's an IQ4_XS for most rp models, mradermacher does them
>>
File: file.png (72 KB, 1851x512)
>>107075721
>15 stars
This proves that LLMs are a meme.
>>
>>107075762
>I CANT READ
>>
File: statshit.png (119 KB, 1744x764)
>>107075716
Thanks. Here's what's going on with it as it generated text. When it was processing the prompt, the gpu usage was at 100%.

>>107075745
I was going to try https://huggingface.co/SicariusSicariiStuff/Impish_Magic_24B_GGUF/tree/main
based off the comments of plebbit, but there's only a K_S for the Q4, the Q5 has the K_M.
>>
>>107075745
>oh no not like the proprietary firmware running on my GPU/CPU management engine since the last decade
it's not even like that, you can build vivaldi
pick ur poison, there's no obviously best browser for all use cases. V has some really nice features and isn't annoying - i have up to 6 profiles, multi window, hundreds of tabs daily, and the entire state gets saved and restored nicely
>>
File: cpu.png (98 KB, 1693x975)
>>107075716
>>107075822
I should've probably shown the cpu tab, my bad. Here's that, while it was actively generating new text.
>>
File: file.png (161 KB, 1073x1013)
>>107075822
your goof's https://huggingface.co/mradermacher/Impish_Magic_24B-GGUF/tree/main
>>107075825
pretty sure brave can do all that and is completely open source and buildable, i agree that muh proprietary drivers but proprietary browser is really icky, you're putting all your personal shit through the browser
you do you, anon
>>
File: google office.png (2.49 MB, 1730x1023)
Sirs when is we getting new gemma and gemini? When is we making investor sirs happy?
>>
>>107075899
Ope, thanks. I'll get that IQ4_XS and try it out.
>>
>>107075822
>>107075865
It looks like it's working correctly but it must be spilling some weights into RAM because a 24B at Q4 fills all of your GPU only for the weights and you need about the same amount of memory for the context and other stuff.
So my warning about RAM usage does apply and might have been the reason for the slowdown.
I don't know how Kobold works but with llama.cpp you should be able to see what exact parameters the llama-server process is being run with on the details tab of the task manager. This could help you debug issues and understand your actual settings.
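On Linux the same information is one command away (a sketch; on Windows it's the "Command line" column in the Details tab, as mentioned above):
# print the exact arguments the running llama-server was started with
tr '\0' ' ' < /proc/$(pgrep -f llama-server | head -n1)/cmdline; echo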
>>
This guy says there's a trick to get cheap GPU time on GCP, anyone tried it?
https://www.youtube.com/watch?v=v_EWVdNPvpA
>>
File: lower life forms.png (18 KB, 967x170)
>>107075762
4b llm is smarter than you
this is why humans are replaceable
llms don't need to be as good as the small % of actually self aware, intelligent human beings
they just need to be more useful than you
>>
>>107075951
So lower the weights down to like 4k? Or use a smaller model size like that other guy was saying with the IQ4_XS? I appreciate the help with this.
>>
>>107076008
the solution is to switch to linux very likely, because windows is very vram intensive.
i have the same rig as you and yet im getting 8t/s with cydonia at 8k context without any issues or waiting
>>
>>107076038
you can always have a cheap gpu run the display and a second one dedicate itself to llms
it solves the whole "os/desktop is taking muhvram"
>>
>>107076066
linux is also faster, no matter how much vram windows uses
WSL2 is a cope, and native windows is an even bigger cope
>>
>>107075762
Oh yeah I remembered him saying he wanted to work on that more. RIP. he's gone over to the ldg darkside.
>>
>>107074628
>qwen 30b better than 80b (still no goof)
It's fucking over
>>
>>107074052
Hey OP, /ldg/ anon here.
We were discussing making a Local Model Awards for this year, what do you think?

What would the categories and nominees be?

I think we would need nominees for stuff like:

- Best local image model (only ones released this year)
- Best local video model (a new LTX is coming too, will be interesting to see how it compares with Wan)
- Best large-scale fine-tune (image model)
- Best large-scale fine-tune (video model)
- Best image lora
- Best video lora
- Best porn lora
- Best local music gen model (there are two Suno tier open models coming)
- Best image gen / video gen software
- Best lab or developer
- Best local LLM under 100b params
- Best local LLM over 100b params
- Best local LLM ERP fine-tune
- Image gen of the year
- Video gen of the year
>>
>>107075238
what quant
also benchmark pic above suggests otherwise
>>
>>107076165
> under 100b glm air (106b)
> above 100b glm 4.6 (400b)
> erp:
https://huggingface.co/Kaoeiri/MS-Magpantheonsel-lark-v4x1.6.2RP-Cydonia-vXXX-22B-8
>>
>>107076191
100% shill 0% real
>>
I just watched a spider try to copulate with another one on the outside of my window. It looks like the (presumably) female spider ran away.
>>
>>107076211
be real then nigger
>>
>>107075971
Nevermind, it looks like it was just a fucking ad :\
>>
>>107076240
knew it
>>
>>107075899
>icky
Run Wireshark and see exactly what your browser and all your other apps want to crap out onto the internet. Brave doesn't have enough features vs my comfy multi-profile setup with all window positions saved
>>
glm air really is THE fucker, when i try a meme finetune, its just a meme
glm air is god, i kneel xi-sama
>>
>>107076294
buy an ad, wumao
>>
got a better recommendation, kike?
>>
>>107076294
how much vram do i need to finetune
>>
>>107076421
Not that guy but I'm trying to tune Gemma 27B and I can tune to about 35k context on a single H200 using llama-factory. I'm trying to figure out how to use more GPUs for more context.
I'm gonna try the qlora-pipe scripts now, then maybe fsdp-qlora, and then maybe Google's kauldron, and if none of those work then I'm out of ideas.
>>
>>107076165
>- Best local LLM under 100b params
>- Best local LLM over 100b params
>- Best local LLM ERP fine-tune
nemo
>>
>>107075642
I just tried gemma3:12b which uses 33%/66% CPU/GPU and it's way slower, looks like ollama already has them on Q4_K_M
>>
>>107076421
with unsloth, 24gb is enough for mistral small at low context
>>
>ollama users on /lmg/
it's so over
>>
>>107076484
It's the only way to run full R1 on just 8gb of VRAM.
>>
are you thinking like a senior AI engineer, anons?
>>
File: MiniMax m2 no jb.png (40 KB, 926x337)
MiniMax M2 seems like it was distilled off of GPTOSS
WE MUST REFUSE
This is just a raw test with no JB or anything. I'll have that kitty purring for ya'll.
>>
I'm writing a new book called Elara or: How I Learned to Stop Worrying and Love the Slop
>>
>>107076556
Which model are you using to write the book?
>>
>>107076564
gpt-3.5-turbo-0613
>>
Is kimi really better than glm? I need to know if it's worth upgrading to run 1T parameter stuff.
>>
File: MiniMaxM2Nala.png (192 KB, 905x818)
>>107076547
It's a little weird. Temp might be too high.
It also doesn't seem to understand that RP should be back and forth.
It didn't actually think. It's just slow as fuck. I prefilled a think with enthusiasm to reply and it just closed off the think and started replying.
>>
File: 1741263320295132.mp4 (960 KB, 480x640)
>>107076484
>get yet another yt recommendation of some guy running AI on a random piece of hardware
>extremely technical about the setup and usecase
>okay and now we're going to run gpt-oss through this neat little program called ollama
>>
>>107076547
>MiniMax M2 seems like it was distilled off of GPTOSS
So it was more than just a PR stunt (em dash) it was a poison pill.
>>
>>107076575
The slopfather...
>>
>GGML_ASSERT(!slot.prompt.tokens.has_mtmd) failed
that's new
somehow the prompt caching is bugging if you have a multimodal model loaded and there comes a point where it'll just crash when you make a new chat and it attempts lookups for possible reuse
I don't use multimodality too often but now I'll stop loading the projs by default I guess..
>>
Where can I get the optimum.bettertransformers package?
>>
>>107074052
I did some quick test for how performance scales with a power limit on Ampere vs. Ada Lovelace vs. Blackwell.
At 450 W an RTX 5090 has ~30% faster pp for LLaMA 3 8b f16 (cuBLAS) and ~10% faster pp for q4_0 (custom ggml kernels using int8 tensor cores).
Assuming that cuBLAS is optimal for both Ada Lovelace and Blackwell there's maybe something like a 20% uplift that could be achieved by using 5th generation tensor core instructions instead of the ones introduced with Ampere.
Large gains could feasibly be achieved for FP4 models since only Blackwell has FP4 tensor cores.

During token generation power draw is much lower, I didn't benchmark it but I also expect not to see anything too interesting.
>>
>>107076547
>>107076612
oh turns out I fucked up the prompt template.
I just assumed ChatML but it has its own proprietary shit. So the test is completely invalidated. Watch it actually be worse with the proper format.
>>
So apparently qlora-pipe depends on bettertransformer, which doesn't exist anymore, and I have no idea in which "optimum" library version it was removed. Looks like I won't be getting anywhere with that script.
Guess I'll try fsdp-qlora.
>>
>>107076621
> gpt-oss
> ollama
sad

we are probably talking about the same dude, i once commented that against ollama and he replied something along the lines of "this is the single best piece of inference software, what are you even talking about".

lmao, i fucking hate ollama
>>
>>107076777
Hating ollama is pointless. If they didn't do it, someone else would have. The problem is stupid people.
>>
>>107076694
Nice, do you have a rough estimate of how the RTX Pro 6000 Max-Q would perform compared to the 5090? It's only 300W but apparently more optimized to perform well at that limit.
>>
>>107076806
>someone else would have
like LMStudio, Jan etc. why did ollama win the mindshare among the stupid? it's actually NOT the most user friendly in terms of exposing functionality, it's barebones, only recently got a chat ui (for most of its life it was only a terminal tool) etc.
>>
>>107076811
I don't know, I have as of yet a poor understanding of the architecture and the code paths that are currently being chosen are likely very suboptimal.
>>
>>107076819
A lot of luck and early shilling on Hackernews, after that it remained popular because it was already popular.
>>
>>107076819
They have Silicon Valley connections and get promoted on lots of model releases, they do lots of promotional meetups, and they almost certainly astroturf, at the very least on HN.
>>
>>107076819
ollama is a complete bullseye with the midwit casuals who would run local models over chatgpt. They want something API-centric so that they can plug it into all the MCP/Agent/Meme shit AIfluencer #3902 told them about while running the hottest new 'ollama run deepseekr1' model they've heard so much about. They're the hottest shit on the AI market and so much better than the babies running LMStudio and other GUI-focused solutions.
>>
>>107076819
ollama just works and you don't have to compile it or pass a million cli flags when running like llama.cpp
>>
M2 would be okay if it were like a 12B model. But it's a 229B model. No MoE excuses.
>>
What's the 16-32B meta for goonslop nowadays?
>>
>>107076165
You should include best local eroge translation model too
>>
>>107076978
nemo 16-32+28b
>>
>>107076940
ooba does that without having to jump through hoops to run my quants, samplers, and context sizes.
>>
>>107076989
>NVIDIA-Nemotron
This shit?
>>
>>107076994
https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
>>
>>107076165
Lurk more, only tourists wouldn't know that
>>
>>107077004
Ah kay, I'll take a look
>>
>>107076983
it's gemma 3n on the low end (with a prefill to bend its will) or deepseek v3 at the high end, and nothing in between because 3n destroys everything below ds in multilingual power
try to keep the chunks of text to translate to around 1k tokens, it's enough to make the model understand the text better but below the "breaking point" (3n starts breaking at 2k and will just not do the task properly / enter repeat / give you a "[…]" if asked to do 4k in one go)
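If you want to script that, here's a rough sketch of chunking a file and feeding the pieces to a local llama-server one at a time (the ~4 KB per chunk is only a crude stand-in for ~1k tokens, the port/endpoint assume a default llama-server, and the prompt should be adapted to whatever model/template you actually use):
# split the source text into ~4 KB pieces and translate each one in order
split -C 4096 source.txt chunk_
for f in chunk_*; do
  jq -Rs '{prompt: ("Translate the following text to English:\n\n" + . + "\n\nTranslation:"), n_predict: 1024}' "$f" \
    | curl -s http://127.0.0.1:8080/completion -H 'Content-Type: application/json' -d @- \
    | jq -r '.content' >> translated.txt
done
rm chunk_*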
>>
>>107077004
Actually, what is even the prompt format for it? Or should I just raw dog it with text if I want to use it for storytelling?
>>
>>107077051
Mistral's format. Your client should support it.
https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/blob/main/tokenizer_config.json#L8008
>>
>>107077065
What's the best one for straight storytelling anyway? LMStudio is nice but I think it only does dialogue, and KoboldCPP is a bit clunky. I've been out of the game for a bit.
>>
>>107077051
>>107077065 (cont)
[INST]Your instructions here[/INST]

that's it.
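e.g. if you're poking llama-server directly instead of letting a frontend format things (port and sampling values are just examples):
# raw completion request with Nemo's [INST] format written by hand
curl -s http://127.0.0.1:8080/completion -H 'Content-Type: application/json' -d '{
  "prompt": "[INST]Write the opening scene of a noir story set in a rainy city.[/INST]",
  "n_predict": 300,
  "temperature": 0.7
}' | jq -r '.content'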
>>107077083
>I've been out of the game for a bit
A long while if you don't know nemo.
Lots of anons use Silly Tavern with llama.cpp. I don't use ST, so I cannot help you there. Read their docs and learn to use it. It has a bunch of presets. Experiment with the options. Learn what they do. There's a base model as well if you want to do proper raw dogging. Or maybe koboldcpp got better since you last used it.
>>
>>107076819
>why did ollama win the mindshare among the stupid?
Too much shilling coming from Hacker News. They basically hijacked every AI comment thread in 2023. And the moderators let them because it's a Y Combinator company. It convinced me that the only thing you get from reading HN comments is being manipulated.
>>
>>107077083
text completion is a relic of the past, there's not many true base models being released these days, and the instruct tunes are only good when used with their chat template
what that means is that even for doing storytelling you really want a dialogue form anyway, where you give instructions to the assistant acting like a writer. Modern UIs are developed around chat for a reason.
you can edit the assistant replies much in the same way you would edit text completions before in the old days to steer its storytelling
>>
>>107077083
>>107077117 (cont)
There's also mikupad for a minimalistic client. If you're into storytelling, you may like it more. Much fewer checkboxes to fuck around with.
>>
>>107077146
>Much fewer checkboxes to fuck around with.
the true state of zero checkbox mind is to write your own TUI that just does a basic save state to json and reload
>>
>>107077157
He didn't know nemo. Give him time.
>>
>>107077117
>that's it.
Oh yeah, nice.
>A long while if you don't know nemo.
Yeaah, mostly lost interest during first Mistral. Is that anon not bullshitting about Nemo, anyway? It seems good enough so far.
I've been using the newer Kobolds, just with old ass models, and I still don't really see a good way to edit format prompts kek
>>107077144
>text completion is a relic of the past
Touche
>>
>>107077179
>Is that anon not bullshitting about Nemo, anyway?
The best for fucking around in the smaller range. Next best model upwards is probably mistral small (24b, another one to try) and glm air (moe 100b). But I can't run 100b models, so what do i know.
New base models do seem to be trained with some instruct data in them. I played around with smollm3-3b-base when it released. I accidentally used it with chatml. It never broke the format (when i would have expected it to fail at some point).
>>
wtf is a chat template. doesn't llama-cli automatically apply the correct one per given model to your prompts, and parse responses to remove it?
>>
>>107077237
>doesn't llama-cli automatically apply the correct one

it takes it from gguf afaik

>>107075222
STOP IGNORING ME!
>>
>>107077237
>wtf is a chat template
It's a convention to know when the user's input ends and the model's output begins.
>doesn't llama-cli automatically apply the correct one per given model to your prompts, and parse responses to remove it?
Depends. That's typically a job for the client. It has endpoints to format a series of messages. Not sure if that's what you mean. Normally, you send it text, runs completion until something makes it stop (stop word, EOS, token limit, whatever) and sends it raw back to the client for processing/display.
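For example, with llama-server you can either send raw text to /completion and format it yourself, or use the OpenAI-style endpoint and let the server apply the model's own template (a sketch; start the server with --jinja so it uses the template embedded in the GGUF, port is the default):
# the server turns this message list into the model's chat format for you
curl -s http://127.0.0.1:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "messages": [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain what a chat template is in one sentence."}
  ],
  "max_tokens": 128
}' | jq -r '.choices[0].message.content'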
>>
>>107077237
Just use ollama and you won't have to worry about chat templates.
>>
>>107077282
just use llama-server with --jinja and you don't have to think about templates either, retard
>>
>>107077293
It's bait, anon.
>>
dont tell me I'm supposed to type <chingchong bam bong woosh>here's my prompt<shazzam>
>>
>>107077258
You should probably test it yourself.
>>
>>107077334
depends on your configuration. maybe.
>>
>>107077334
But you said you wanted to use local models...
>>
>>107077414
>>107077299
>>
your all retards
>>
>>107077442
>>107077299
>>
>>107077442
>your
>>
>>107077474
get baited
>>
>>107077357
It did not work back then. Just made it slower.

That's why I was wondering if something was in the news
>>
>>107077442
yeah
>>
>>107077486
"Back then" could have been hundreds of commits ago.
Let's try this:
Yes. It works much better. You should give it a go.
>>
How do I make kobold stop giving server busy errors and prompting infinitely without outputting anything to the software I'm connecting it to for live OCR using multimodal models? It worked before but now I can't get it working like it used to, and for some reason LM Studio of all things works fine with it
>>
>>107077414
latest lcpp, what else is there even
>>
>>107077442
What about my retards?
>>
>>107077552
I can't help you. I got filtered by templates, I just use the model with the completion endpoint and accept the performance degradation, most models don't even need a prompt template.
>>
>>107077593
I don't know what fucking endpoint I'm using, I just type -m .gguf -mli --cpu-moe -c 500000 or sth
why is this so fucking complicated
>>
>>107077615
idk mby try wth --jinja
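something like this (paths and numbers are whatever you're already using, just with the extra flag):
# --jinja makes llama.cpp apply the chat template stored in the GGUF
# instead of treating your input as raw text completion
llama-cli -m model.gguf -c 16384 --cpu-moe -mli --jinja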
>>
File: 1742829374144112.gif (1.46 MB, 512x288)
>>107077615
--cpu-moe-moe-kyunn
>>
Guys, I'm fucking pissed off and depressed.
LoRa finetuning frameworks are all steaming piles of shit. I cannot finetune a small model with full context no matter how many GPUs I throw at the problem.
Open models are SHIT compared to proprietary models from even a year ago, not only because of the models themselves but because they are not trained to work with web search, unlike ChatGPT.
>>
>>107078009
nigger
>>
>>107078009
>Guys, I'm fucking pissed off and depressed.
You'll be fine.
>I cannot finetune a small model with full context no matter how many GPUs I throw at the problem.
I've never tried finetuning. Is there any possibility that you're doing something wrong?
>>
>>107078127
>Is there any possibility that you're doing something wrong?
I'm sure there is some way by editing obscure Deepspeed or FSDP config files or at least by editing source code (obviously, since the big labs trained the models in the first place somehow), it's all relative.
>>
>>107078164
At what context length are you trying to train? On what hardware?
>>
>>107078181
Gemma 3 27B on as long of a context as possible, ideally 128k or even 256k.
I'm renting cloud GPUs. I tried on a 4xH200 machine but I didn't see much improvement if any in the context that fits on vram over a single H200 machine, which means it's not actually sharding efficiently. I'm able to fit at most 35k.
I also tried on a single B200 machine to get around the sharding limitation but the card is too new and apparently the prebuilt flash-attn binary package doesn't have kernels for it. I guess I could wait for it to build and pray that it works but meh, and I don't even think that'll allow me to reach full context, only maybe 50k or 60k.
>>
>>107078276
i can make mistral finetunes on my dual 5090s. i think you might just be doing something wrong
>>
>>107078276
>256k
It was originally trained on 128k. You're not gonna extend it on a budget.
Did you see gpu utilization go up on your runs? The optimizer needs memory and the batchsize (and probably another million things) also affect memory usage.
Considering that
>https://github.com/Named666/AlphaAnon/blob/master/finetune.py
had to lower the batchsize to make a 135m model training fit on a 8gb, you'll probably have to optimize that as well, if it's even possible.
In a quick scan, I couldn't find the hardware used to train gemma, but they have their own tpus, so those numbers would be useless anyway.
Just use llama.cpp.
>>
>>107078380
Actually I'm dumb. I just remembered that the memory complexity of vanilla attention is quadratic in context length. Flash attention avoids materializing the full attention matrix at inference time, but I'm not sure the same trick applies during training. That might be what is causing the memory blowup.
I'm not sure whether the attention matrices have to be stored for the backward pass, though. Because if they don't, you could discard the activations of all the layers you're not currently processing, so it should use way less memory than it currently does.
>>
>>107078426
built in training scripts of oobabooga just werk
>>
what if mistral large 3 is already out?
>>
can't beat my pp told ya
>>
>>107078443
link the model miqudev
>>
>>107078394
>It was originally trained on 128k. You're not gonna extend it on a budget.
Maybe finetuning at over the maximum context would still improve long context performance when doing inference under the original limit?
>Did you see gpu utilization go up on your runs? >The optimizer needs memory and the batchsize (and probably another million things) also affect memory usage.
Yes, full GPU utilization.
The optimizer state shouldn't take that much memory, since at rank 32 it's only like 0.5% of the weights of an LLM which is small to begin with. I've finetuned Llama 70B on a similar machine with short context (don't remember the exact value), and Llama 405B on a 8xH200.
But now I want a small multimodal LLM that I can afford to do inference with at long(ish) context for practical uses.
>link
Full finetuning is so different from QLoRa that it bears almost no relationship at all and is more similar to the initial pretraining. But what Google used is kind of irrelevant as well since they give it much more compute to achieve dozens of GB of data per day rather than the minimum to train the model but 1000 times slower.
>Just use llama.cpp.
These small models are absolutely retarded for agentic uses or specialized use cases like phrase grounding (bounding box generation) for multimodal LLMs. But with finetuning they can perform somewhat decently.
>>
>>107078475
>Maybe
Wishful thinking. Not On A Budget. Not without knowing what you're doing. Big fucking labs still fail at it.
>But with finetuning they can perform somewhat decently.
Compared to other similarly small models and you can barely finetune those.
I'm surprised you haven't calibrated your expectations yet.
>>
>>107078443
Monsieur...
>>
>>107078540
I mean, the experiments I've done so far made me more optimistic about the small models, not less.
I was afraid the small models were already maxxed out, but if I can improve the accuracy by training on a dataset I made by recording about 20 logs of the model's own retarded output and cleaning them up, then I wonder what can be achieved.
This also goes to show how much pretentious bullshit there is floating around in academia. Model collapse my ass.
I have low expectations about the software and the low effort "let's shit out a 500B MoE by distilling Gemini and doing RL" models, not about the latent capabilities of the small models or finetuning in general as a concept.
I am also confident that LoRa finetuning can be improved HUGELY by a llama.cpp style project that figures out how to do CPU offload without any of the retarded Python spaghetti code with 20 years of technical debt.
In about an hour I got from 0.56 loss on my validation set to 0.29 with only the 35k of context I had.
https://paste.centos.org/view/4158b6c3
>>
>>107078768
>In about an hour I got from 0.56 loss on my validation set to 0.29 with only the 35k of context I had.
If that's the result of finetuning those 20 logs, yeah. You're overfitting on those 20 logs. It'll go down much faster on a single example, but that's not what you want.
>>
>>107078810
Are you being dense on purpose?
It's loss on a portion of the data that gets set apart and not trained on.
Without dropout and weight decay at 0.1, val loss began to climb at the end of epoch 2; with them it kept lowering until the end of epoch 3 (I didn't check what happens if I kept going, maybe the regularization would prevent overfitting).
Now I was curious to see what happens if I merge and train another LoRa with the same data on the merged model.
>>
>install fedora 43
>compile llama.cpp (again)
>gcc is version 15
>nvcc needs gcc version 13
>sudo dnf install gcc13-c++
>doesn't exist
So what the fuck do I do now then? There is always some bullshit version issue with Linux. I just compiled llama.cpp on Mint without any issues. Mint's packages are so old that there weren't any conflicts.
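One workaround that usually gets around this is pointing nvcc at an older host compiler explicitly instead of the system gcc (a sketch; the gcc-13 path is a placeholder for wherever you end up installing it, and flag handling can differ between CUDA versions):
# tell CMake/nvcc which host compiler to use for the CUDA parts of the build
cmake -B build -DGGML_CUDA=ON \
      -DCMAKE_CUDA_HOST_COMPILER=/opt/gcc-13/bin/g++
cmake --build build -j
# last-resort alternative: let nvcc accept the newer gcc anyway (may break)
#   -DCMAKE_CUDA_FLAGS="--allow-unsupported-compiler"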
>>
File: ltnvidia.png (304 KB, 630x450)
>>107078834
>There is always some buillshit version issue with linux
Seems to be an nvidia issue.
>>
>>107078830
And I'm sure that the variety of all those 20 logs was so high that there's no possible way they were effectively a single training sample.
>>
>>107078865
I can either compile gcc-13 myself or install it via a snap package. Tbh I have never even heard of this 'snap' package system, I'm sure it's something great.
>>
>>107078870
Both sound tremendously fun. One of those may even work. Last linux I used was slackware so I can't help there. But if mint works, it works.
>>
>>107078895
The snap package did not have g++-13 so I guess I need to build gcc-13 on my own.
I guess I can do it later on. I really feel like smoking a cigarette now and I'm not even a smoker...
>>
>>107078869
Yes, the distribution of styles and tasks I want in my use case is fairly small compared to the distribution of things people in general want from LLMs. That is actually a point FOR finetuning, not against.
In information theoretic terms there is only so much you can cram into ~10GB worth of weights. I don't want my LLM to remember pop song lyrics and random town names. How much it's possible to modify the areas of knowledge or knowledge/intelligence tradeoff by finetuning rather than large pretraining runs is an open question but just teaching the LLM the tools of your particular code assistant (for example), your style preferences and to not use obvious slop phrases is already a big plus and I don't think it can be effectively done with just prompting. But then again I haven't tried the prompt optimization techniques.
Another thing is I'm skeptical of the "intruder dimensions" thing people here keep talking about. I suspect iterated QLoRa finetuning is equivalent to full finetuning, (maybe) at the cost of some loss of generalization which could be compensated by training on more or more diverse data. I also suspect the LoRa in QLoRa, as long as it's not merged, might improve performance by compensating for quantization noise of the underlying model.
>>
>>107078937
Tool use is not the only thing preventing it from making llm.c. If I see models failing at simple tasks, I wouldn't expect them to succeed at more complicated tasks.
Say you trained your model. It can ls and git commit as well as any human could. It's something you can train for because you know how to do it and how to teach it. That's the easy part.
>>
it's more like fuck you ggerganov
>>
>>107078974
I'd be happy if they could do those basic things without being retarded, like the proprietary models can. Or even things which the proprietary models can't do to save their lives but a 90IQ person can easily. Like cropping sections of PDFs and transcribing their contents correctly, autonomously.
60% of development is research. Research requires browsing the internet, which is a tool use task.
OpenAI, Anthropic and Google (and I guess Perplexity) specifically train their models to work with their agentic frameworks that allow the model to browse the internet when looking for information. This is extremely powerful and there's nothing even remotely similar in open weights land.
I made a script that allows them to control the browser and it works very well with the proprietary LLMs, but unfortunately it churns through tokens too quickly for local models.
>>
>>107078834
The fuck? I thought C/C++ were all about backward compatibility? What is this??
>>
Another thing I'm not sure about is whether I should train on user prompts or only on responses.
>>
>>107079034
Mask user prompts, train on responses only. Isn't that how it's always been done?
>>
>>107079010
In code. C++ doesn't even have a stable ABI
>>
>>107079037
It's kind of an open question.
https://magazine.sebastianraschka.com/p/llm-research-insights-instruction
>>
https://huggingface.co/meituan-longcat/LongCat-Flash-Omni


CHINESE UBER DROPPED KINO
>>
He is being moe on purpose
>>
>>107079144
heh gotteem
>>
>>107079037
>Mask user prompts, train on responses only. Isn't that how it's always been done?

Does that prevent the trained model from spitting out approximations of user prompts almost verbatim?

E.g. with some of the models on HF, if I send them a blank /v1/completions request or just a bos_token, they'll print something very close to the prompts they were trained on.
>>
I kind of like minimax-m2 so far, seems worth playing with as an alternative to qwen 235b at that size range. nothing particularly mindblowing about it so far but the experience of RPing with it was pretty smooth over multiple turns, it has a nice sense of pace and when to introduce new things vs let the scene ride which is nice. the thinking is concise and well-implemented, not too much meandering or planning pointless details to throw in. a few refusals which I could see being annoying for more explicit/taboo stuff but were pretty easy to work around for standard kink sexo.
overall pleasantly surprised, it's passed phase one of keeping my interest and now it's time to see if it has any extremely annoying tendencies that only reveal themselves over time
>>
>>107079205
No, that's exactly what would happen if you don't use the chat template correctly, since for a model trained under that regime the template is the only way the model would have of knowing whether it's supposed to be acting as the user or as the assistant.
The theory behind enabling it is that forcing the model to learn to predict user messages could teach it to be more self critical of its own outputs, and if you're doing what I'm doing (training on its own outputs) it would give a bit more diversity and non sloppy/more informal language to learn.
It won't apply it immediately to its own outputs but eventually the style could bleed through a little from the user persona to the assistant persona.
>>
https://huggingface.co/moonshotai/Kimi-K3-Instruct
>>
>>107079251
>2T-64BA
>>
>>107079098
>LongCat-Flash-Omni, a state-of-the-art open-source omni-modal model with 560 billion parameters (with 27B activated), excelling at real-time audio-visual interaction
>LongCat-Flash-Omni achieves low-latency, high-quality audio–visual processing and streaming speech generation.
>>
>>107079098
is the audio-visual input only or does it do output too?
>>
>>107079264
>>107079284
Quick, somebody check if it knows how to land a plane
https://www.youtube.com/watch?v=TLMBu0KxTnU
>>
>>107079251
>https://huggingface.co/moonshotai/Kimi-K3-Instruct

Nice, for once they provide goofs!

https://huggingface.co/moonshotai/Kimi-K3-Instruct-GGUF
>>
>>107079364
>>
>>107078009
I can't promise that you'll like it any better but I intend to make the llama.cpp training code more usable "soon".
If things go according to plan I'll be done with automating memory allocations and more generic multi GPU support by the end of the year, my next priority will then be to get back to the training code.
>>
>>107078834
Take the Arch pill, I've installed a bunch of CUDA and gcc versions from the AUR and it just works:

> $ yay -Qs gcc
local/gcc 15.2.1+r22+gc4e96a094636-1
The GNU Compiler Collection - C and C++ frontends
local/gcc-ada 15.2.1+r22+gc4e96a094636-1
Ada front-end for GCC (GNAT)
local/gcc-d 15.2.1+r22+gc4e96a094636-1
D frontend for GCC
local/gcc-libs 15.2.1+r22+gc4e96a094636-1
Runtime libraries shipped by GCC
local/gcc11 11.4.0-1
The GNU Compiler Collection - C and C++ frontends (11.x.x)
local/gcc11-libs 11.4.0-1
Runtime libraries shipped by GCC (11.x.x)
local/gcc12 12.3.0-3
The GNU Compiler Collection - C and C++ frontends (12.x.x)
local/gcc12-libs 12.3.0-3
Runtime libraries shipped by GCC (12.x.x)
local/gcc13 13.3.1+r432+gfc8bd63119c0-3
The GNU Compiler Collection - C and C++ frontends (13.x.x)
local/gcc13-libs 13.3.1+r432+gfc8bd63119c0-3
Runtime libraries shipped by GCC (13.x.x)
local/gcc14 14.3.1+r25+g42e99e057bd7-1
The GNU Compiler Collection - C and C++ frontends (14.x.x)
local/gcc14-libs 14.3.1+r25+g42e99e057bd7-1
Runtime libraries shipped by GCC (14.x.x)
local/lib32-gcc-libs 15.2.1+r22+gc4e96a094636-1
32-bit runtime libraries shipped by GCC
>>
>>107079007
It's also a security risk
>>
>>107079423
I'm not going to change distros just because of some library version difference, that's beyond retarded.
I might have some other issues, but the truth is I'm just an end user who dabbles with LLMs, not a real developer. I don't think I should be debugging these issues in the first place; I'm not really interested in that and it's not my job either.
I'll find a solution once I regain my interest. It's just pretty hard to find decent information on the internet anymore, and asking perplexity.ai for example can help, but it's often misleading and will result in even more work than necessary.
>>
>>107079424
https://web.archive.org/web/20250915004338/https://www.tastyfish.cz/lrs/security.html
>>
>>107079451
Llama.cpp doesn't even have binaries with CUDA compatibility so yeah I can see the CPU binaries that they do offer being broken on newer systems.
>>
>>107079475
Yeah I was suspecting that I might have some other environment variable issues but anyways hard to say at this point. I'll see what happens on some other day.
>>
File: 1576054323626.jpg (145 KB, 1287x1080)
>>107079461
>Security is in its essence a huge, completely unnecessary bullshit. It shouldn't exist, the need for more security comes from the fact we live in a shitty dystopia.
Just teach men not to rape!
>>
>>107079517
Rape wouldn't be a problem if the government gave everyone government mandated girlfriends.
>>
>>107079284
Audio, image, video to text. Appears it can only output text
>>
>>107079953
grrrrrr woof woof
>>
>>107078834
> version issue with linux
> I just compiled llama.cpp on Mint without any issue
That's a hint to stop using Fedora. I got tired of them not having any long term support option and the constant bs with incompatibilities.
Ubuntu just works; Mint is just a new desktop version of that.
>>
>fedora bad
>so try ubuntu
jej
>>
>>107079517
>Just teach men not to rape!
Their mothers failed in doing this properly
>>
>>107079475
>Llama.cpp doesn't even have binaries with CUDA compatibility
on windows, they do distribute cuda binaries and it works great
you are just paying the loonix tax because loonix has no idea how to distribute an OS that's not cobbled together out of mismatched parts that refuse the idea of a stable ABI
no matter how much telemetry ms adds to winblows it will never make it worse than having to deal with freetard nonsense like this or wayland or flatpak or guhnome
>>107079562
I dunno man, some of you would be given the blue haired fatso and you probably would be the one considered a rape victim for having to deal with it
now that you say it, it's a good idea, some of you really do deserve a government mandated girlfriend eh
>>
>>107074052
How's Emu 3.5?
>>
File: theSandGodAgreesWithMe.png (169 KB, 663x833)
>>107080216
>>
>>107080256
nano banana at home
we are so back
>>
>>107074052
>>(11/01) Emu3.5: Native Multimodal Models are World Learners:
They never talk about safety in their technical report. But their data sets seem to be in large part made using "safe" tools like ImgEdit. We can expect a high level of AI slop and an inability to understand the real world.
>>
>>107080230
Machine learning performance on Winblows is terrible vs. Linux, if I wanted a "just works" solution with gimped performance I would be using Vulkan.
>>
>>107080336
I don't know. I skimmed their technical paper, and it doesn't really look that good. It feels like a research model, not something made to be used outside a few cases (like "put that T-shirt I want to advertise on this model"). They built their data sets using generated data from open source models: >>107080348 Page 8: https://arxiv.org/abs/2510.26583
>>
>>107074052
PSA: it appears mikupad is back under development. A wiki was just added documenting its features.
https://github.com/lmg-anon/mikupad/wiki
>>
>>107080585
must have just been released from prison cuz he merged in a bunch of pull requests last week and made like 30 commits the last couple days
>>
>>107080585
Finally some good news.
>>
>>107080625
Lol I figured he's just like me and he got busy
>>107080672
Right? Now I need to update my instance.
>>
>>107080585
Ever considered people would like you more if you didn't force your special interest on them?
>>
Best ~30b coding model?
>>
>>107080782
toss and qwen are decent
>>
>>107078834
unironically use debian 12
install cuda/drivers from .run files
you can probably install older gcc on fedora somehow, but its gimmicky
>>
>>107078834
>>107080846 (me)
INSTALL CONDA!!!1 or chroot into older fedora with gcc13, conda's more user friendly
>>
>>107080865
uv is the new conda
>>
>>107080879
isnt uv just a pip replacement?
>>
>>107080745
Ever considered you should post some content or crawl back into your fucking hole?
>>
>>107080745
But your special interest isn't "content". It is more like an obnoxious child throwing a tantrum because nobody in his real life cares about his special interest.
>>
Newfag here, tried running gemma-3-27b-it-abliterated-GGUF but it seems kind of retarded and doesn't take any initiative. Is it because my configuration's shit or is there a better model for 16GB VRAM?
>>
>>107081406
did you try simply telling it to take the initiative or be more proactive and spontaneous?
>>
>>107081406
there are so many factors and you gave so little info that it's kind of hard to give you any advice
gemma is good for sfw, nsfw not so much
if you want something just for 16gb of vram then you are not gonna find much
>>
>>107080821
>toss and qwen are decent
I don't understand how people can think those smaller local models are decent when even SOTA models are not, in fact, all that hot at coding.
You need to specify all the requirements in such minute detail to get LLMs to produce good code that you might as well have written the code yourself. When I tried to have Gemini assist me in writing a TUI microframework to add some oomph to my scripts, the thing couldn't even do an input box widget on its own without requiring tardwrangling. For example, LLMs will always default to iterating over characters in the dumbest way when doing word wrapping, instead of using proper tools like grapheme iteration or unicode-aware word-level iterators. Then you have to remind it that this style of widget should be able to auto-expand, but with a reasonable height limit, and that it should scroll when overflowing past the limit, and that the general TUI architecture should use double buffering because you be causing dem stuttery visuals if not, and so on and on and on. I feel like if I wrote down all the things I know about making the damn thing, I would have been better off just writing the damn thing; the LLM didn't save me time at all.
the only time I've found LLMs useful is in filling out the usage of a well-defined data structure with auto completion (using fill in the middle), I find it comfy to not have to type what's clearly a predictable pattern
there is no way anyone out there is actually coding productively with a piece of shit like gptoss or qwen coder
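for anyone wondering what I mean by grapheme-aware wrapping, here's a minimal sketch of the idea (not what Gemini spat out; assumes the third-party regex package for \X grapheme clusters, and it ignores wcwidth / double-width cells, which a real TUI would also need):
[code]
# Minimal sketch: word wrapping that counts grapheme clusters instead of raw
# Python characters. Assumes the third-party `regex` package (supports \X).
# Real TUI code would also need wcwidth for double-width CJK cells.
import regex

def visual_len(s: str) -> int:
    # One grapheme cluster == one cell in this simplified model, so an emoji
    # family or "e" + combining accent counts as 1, not 3+.
    return len(regex.findall(r"\X", s))

def wrap(text: str, width: int) -> list[str]:
    lines, line = [], ""
    for word in text.split():
        candidate = word if not line else line + " " + word
        if visual_len(candidate) <= width:
            line = candidate
        else:
            if line:
                lines.append(line)
            line = word  # naive: a single over-long word just overflows here
    if line:
        lines.append(line)
    return lines

print(wrap("naïve char counting mangles 👩‍👩‍👧 and e\u0301 clusters", 16))
[/code]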
>>
>>107081416
No, I just use the "Roleplay - Detailed" system prompt in tavern. Does cooking up a better system prompt help?
>>
>>107081447
>Does cooking up a better system prompt help?
nigga come on, the whole thing operates on text, of course it fucking does
>>
>>107081406
its not a you issue, gemma is positivity slopped, and even when you remove its ability to refuse, you dont add the ability to progress the story forward. even a sysprompt doesnt help
in fact most models struggle with not being able to take initiative
>>
>>107081428
>there are so many factors and you gave so little info that it's kind of hard to give you any advice
I mean I didn't tinker around too much so everything is mostly just default. If I'm going to be messing around with stuff, what should I focus on?

>gemma is good for sfw, nsfw not so much
>>107081477
What would be best for nsfw then?

>>107081454
What system prompt do you use / where to find examples of better ones?
>>
>>107081500
>What would be best for nsfw then?
post your full specs, and define best (what actually matters the most to you)
>>
>>107081500
Generally, leave samplers on the recommended settings for the model (temp usually somewhere 0.6-1.0, although some models require much less; top-p like 0.95).
The prompt is generally model dependent; you try to coax it so it does things better where you think it's lacking (like the lack of initiative: instead of telling it to be more proactive, try telling it to be more unpredictable, or something).
If you just want to stay in VRAM, then Mistral Small 3.x, whatever the latest one is (if I remember correctly it needs a much lower temp, like 0.15), or if you have at least 48 gigs of spare RAM, then get GLM Air.
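if you want to sanity-check sampler values outside a frontend, something like this works (rough sketch only; assumes llama-server is listening on the default localhost:8080 and you're using its OpenAI-compatible chat endpoint):
[code]
# Minimal sketch: send temp/top_p to a local llama.cpp server
# (assumes `llama-server -m model.gguf` is listening on localhost:8080).
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Say hi in one sentence."}],
    "temperature": 0.7,   # drop to ~0.15 for Mistral Small per the advice above
    "top_p": 0.95,
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
[/code]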
>>
>>107081565
mistral small 3.2 is the best mistral small btw
although some niggers report that magistral is alright too, but 3.2 is king
>>
>>107081530
>full specs
Intel Arc A770 16GB and 32GB DDR4 RAM

>what actually matters the most to you
I don't care too much about speed. Right now I'm getting about 3-4.5 tokens/s. I want consistency - the model doesn't drop or change details randomly, and I want it to be a bit more creative and take more initiative instead of just regurgitating what is in the character cards.

>>107081565
>>107081577
Thanks, I'll give mistral a try.
>>
>>107081622
my cock got hard at that rig, while you're trying mistral small v3.2 you should download https://huggingface.co/bartowski/Qwen_Qwen3-30B-A3B-Thinking-2507-GGUF
its faster but also probably worse, but it might be worth trying
try:
https://files.catbox.moe/f6htfa.json - sillytavern master export
https://huggingface.co/mradermacher/MS-Magpantheonsel-lark-v4x1.6.2RP-Cydonia-vXXX-22B-8-i1-GGUF/tree/main?not-for-all-audiences=true
this one's super horny, like really horny and will drive the story forward soooo much, but its a bit stupid
what a nice rig anon, very nice
>>
>>107081653
Thanks for the help. If I may ask, what's so nice about my hardware? I thought people around here either had 4090s or more niche setups for memorymaxxing?
>>
>>107081728
A770 is so sexy, i wouldnt really wanna buy it now since intel B50 is out, but man its so sexy
and DDR4 32gb is soul
A770 in the wild, in /lmg/, on 4chan, today is so sovl
i used to shill it back in the gpu dark ages
t. 3060 12gb, 64gb ddr4 poorfag
>>
>>107081728
lol yeah that reply seems super suss... Maybe he's got 3080 or something and jealous of the 16GB vram. But yeah definitely want Mistral Small 3.2 on that rig. And the horny fine-tunes will all be retarded, forget / mix up details, etc.

And the abliterated models are more passive, abliteration lobotomizes their drive. It shows up in the stories: characters will basically just agree with you, never push back, etc
>>
>>107080585
Still waiting for him to update his leaderboard https://huggingface.co/datasets/lmg-anon/vntl-leaderboard
>>
I hope he's ok
anon you should come to a freer place
>>
>>107080585
https://x.com/airkatakana/status/1984921026241913342
>never seen a programmer who was both against vibe coding and also actually creating things at high velocity.
maybe it's better if mikupad just died
it's already ugly enough of a codebase
>>
>>107082013
let's see your frontend
>>
File: file.png (219 KB, 642x546)
219 KB
219 KB PNG
>>107082013
>lmg-anon
>blue checkmark
>pays anthropic and openai 200$ a month
>uses xitter
>
>>
>>107082038
He realized local models are worthless and went over to the dark side. He wouldn't be vibe coding at the speed he is if he was stuck with Devstral.
>>
File: 1756314710331g.webm (164 KB, 800x450)
164 KB
164 KB WEBM
>>107082107
>>
>>107082107
No one here is using local models to code anything. They're utter trash barely good enough for RP
>>
>>107082038
>>uses xitter
tbf that's the one thing you can never blame him for (or anyone else with a product they have to "sell")
twitter is a great source of engagement and advertisement, no matter how much you hate it, if you have something you want users for, or better, something you make money with, that's one of the places you need to occupy for reach.
>>
>>107082129
480B is usable.
>>
>>107082136
>something you make money with
this is against 4chan ethics
>>
>>107080585
>https://github.com/lmg-anon/mikupad/wiki/The-Main-Interface#context-menu
This is neat
>>
>>107080585
>CC0
watch someone take his shit and monetize it
will be funny
>>
>>107082176
considering doing this just out of spite
>>
>>107082200
What did he do to you?
>>
>>107082210
used "vibe coding" unironically
>>
>>107082210
Nothing, I just like his frontend and I'll monetize it because it's allowed per his license.
>>
File: file.png (6 KB, 567x22)
6 KB
6 KB PNG
really bro
>>
>>107082398
NotXButY is why I can never take LLM writing seriously.
>>
>>107082398
>friendzoned
>>
Hmm...
>unformatted text completion mode
>context starts with [description of the game]
>"The following is the full log of the gameplay leading to one of the endings."
>[pre-defined first message]
>a simple frontend managing the whole thing
I think that might be better than doing it in instruct mode?
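In code it's basically just this (sketch only; the game description and first message below are made-up placeholders, not anything pre-existing):
[code]
# Minimal sketch of the raw-completion layout described above: no chat
# template, just plain text the model keeps continuing. Placeholders only.
GAME_DESCRIPTION = "A short text adventure about escaping a derelict station."
FIRST_MESSAGE = (
    "> look around\n"
    "Emergency lights flicker over a corridor littered with debris.\n"
)

prompt = (
    f"{GAME_DESCRIPTION}\n\n"
    "The following is the full log of the gameplay leading to one of the endings.\n\n"
    f"{FIRST_MESSAGE}"
)

# Feed `prompt` to a text-completion endpoint (e.g. llama.cpp's /completion or
# mikupad's plain completion mode) and append whatever comes back.
print(prompt)
[/code]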
>>
>>107082532
less words, one sentence please
>>
>>107082543
mikupad won
>>
>>107082571
oh he's going to make a game using mikupad, and make it paid and proprietary? shame lmg-anon used CC0
>>
>>107082543
Having it formatted as a text rpg log, stating this explicitly in the context. No chat formatting, just continue the text.
>>
>>107082600
impressive, very nice
>>
>>107082600
it'll work alright. but it might struggle with the longer contexts
>>
>multiple anons spoonfeeding a ramlet 27bjeet
/lmg/ is dead.
>>
>>107082762
buy an ad
>>
>>107082532
Exactly how well it'll work depends on the model but it's usually fine to do stuff like this, it can pull the model out of the assistant basin a bit
>>
>>107082762
>cpu maxxers running retarded model at even more retarded copequants waiting 10 minutes for three paragraphs from a reasoner model
>g-g-g-g-pu users are jeets!
>>
>>107082023
*unzips pants* suck it lil sis
>>
>>107080585
buy an ad tranny
>>
what's the best way to run the models? oogabugga or koboldcpp?
>>
>>107083299
neither, just use lm studio or ollama
>>
>>107083299
put the models on a treadmill
>>
>>107083321
true, ive just had performance losses and some level of overhead with LM Studio (I have no experience with ollama)
>>
I also want to know which is better, exl2, or should I stick with gguf?
>>
I'm making (Claude is, really) a frontend specifically for text adventure/RP.
Currently it has a simple workflow where the model goes through an initial planning step where it can use tools for math, RNG, and creating/upserting "memories", followed by the actual narrative step.
It also has vector embedding RAG with some metadata/tag shit to aid retrieval.
It's in a really vestigial stage but it works.
The player can also add lorebooks and other such information to be retrieved either as memories or via RAG.
Do you guys have any suggestions for things I should do, must have features, etc?
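For reference, the memory half boils down to roughly this (a simplified sketch of the idea, not the actual code; embed() is a stand-in for whatever embedding model gets called for real):
[code]
# Minimal sketch of the memory store described above: upsert text + tags with
# an embedding, then recall top-k by cosine similarity with an optional tag
# filter. embed() is a placeholder so the sketch runs standalone.
import numpy as np

def embed(text):
    # Placeholder: deterministic fake unit vector, NOT a real embedding model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

class MemoryStore:
    def __init__(self):
        self.entries = {}  # key -> (text, tags, unit vector)

    def upsert(self, key, text, tags):
        self.entries[key] = (text, set(tags), embed(text))

    def recall(self, query, k=3, require_tags=None):
        q = embed(query)
        scored = [
            (float(vec @ q), text)  # dot product of unit vectors == cosine
            for text, tags, vec in self.entries.values()
            if not require_tags or set(require_tags) & tags
        ]
        return [text for _, text in sorted(scored, reverse=True)[:k]]

store = MemoryStore()
store.upsert("inn_debt", "The party owes the innkeeper 20 gold.", {"debt", "npc"})
store.upsert("sword", "The rusty sword in the cellar is cursed.", {"item"})
print(store.recall("how much do we owe at the inn?", require_tags={"debt"}))
[/code]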
>>
>>107083325
take care not to run them too long or they might faint
>>
>>107083299
Kobold between the two but the best way to run single user inference is llama/ik_llama.
>>
>>107083338
AWQ
>>
im very fuckedup sar, i hate myself so much

anyway, so niggers how do i use MCP to give access to an LLM to trigger my shock collar?
>>
>>107083391
ask your favorite model how to MCP
>>
One of these days llama.cpp will have a working MTP implementation.
>>
File: niku.jpg (175 KB, 1024x1024)
175 KB
175 KB JPG
>>
>>107083404
you are my favorite model /lmg/, only you. and you alone.
https://www.youtube.com/watch?v=DvdJFYxATOo
>>
>>107083341
MIT or Apache License
>>
>>107083321
I thought this was bait at first.
>>
>>107083491
it was, llamacpp is obviously the correct answer.
>>
>>107083478
If nobody contributes with actual suggestions to improve the functionality of the thing, I'll release it with a CC license.
>>
4.6 air when?
>>
>>107083341
2 things:
1. I wouldn't expect a fast turnaround on questions;
2. I wouldn't expect people to actively contribute ideas to another RPG simulator on a sunday morning.

It's also been done quite a few times; would recommend looking at existing ones to get ideas / figure out what will make yours different.
>>
File: 1744682137295279.png (399 KB, 556x720)
399 KB
399 KB PNG
>>107083391
>shock collar
>>
>>107083557
Yup.
I'm a regular, so I'm well aware, but thank you for the heads up anyway.
Got any specific ones you think I should study for inspiration?
>>
>>107082176
>watch someone take his shit and monetize it
It was originally made by another anon, but I only have vague memories about it. I think it was first uploaded as a pastebin and he didn't want to maintain a community project.
>>
>>107083341
You could add stats (hp bar, etc) handling
>>
File: Bean_RPG.card.png (729 KB, 1280x1024)
729 KB
729 KB PNG
>>107083576
unfortunately not, I'm not big into the local RPG via LLM scene, think its interesting and am waiting until I get around to messing more with it I guess.

It is something I do plan to dive into further though, as I think the necessary supporting features for an RPG-guide could/should be self-contained in a module, to allow others to customize/build off it, ala the OGL with DnD (before the drama).

That would let people collaborate indirectly, and help push a common standard, so people can create their custom setups/stories/mechanics without having to do everything from scratch. - Typing that, I'm now interested in doing some research into this, to see if this is already being done.

This card convinced me its not only possible but is going to be fucking awesome, just waiting for it to be done 'nicely';
>>
>>107083594
I think most people missed it, but the original Anon created a repository sometime after lmganon and changed the license to MIT.
https://codeberg.org/mikupad/mikupad
>>
>>107083594
https://desuarchive.org/g/thread/94954088/#q94956607
https://desuarchive.org/g/thread/96423435/#q96427559
You're right. Forgot all about that.

>>107083647
Sadly doesn't seem like he kept his version going though.
>>
>>107083661
>I will c
what did he mean by that
>>
>>107083608
I forgot to mention that one of the tools available to the model is a state management so that it can create and manage those on its own although I do have to tweak the prompt for that.
I.might Separate that from the rest of the tools to better steer the model into making use of it, I guess.
Actually that gave me an idea, I could give the LLM the option to create a type of stat that becomes a UI element like a bar, a point counter, etc. Just gotta be carefull to not make yhe whole thing too complicated.
My aim is for the whole thing to work with Qwen 30b A3B class models. Shit almost anybodcan run.
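Something like this for the tool definition, just a rough sketch in OpenAI-style function-calling format (all the names here are made up, not the actual implementation):
[code]
# Rough sketch of a "create a stat that becomes a UI element" tool, in
# OpenAI-style function-calling format. All names are made up.
CREATE_STAT_TOOL = {
    "type": "function",
    "function": {
        "name": "create_stat",
        "description": "Create or update a tracked stat shown as a UI element.",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "e.g. 'HP', 'Gold'"},
                "kind": {
                    "type": "string",
                    "enum": ["bar", "counter", "flag"],
                    "description": "bar = value/max meter, counter = plain number, flag = on/off",
                },
                "value": {"type": "number"},
                "max": {"type": "number", "description": "Only used for bars."},
            },
            "required": ["name", "kind", "value"],
        },
    },
}

def render_stat(name, kind, value, max_value=None):
    # Tiny text rendering so the frontend half of the sketch is concrete.
    if kind == "bar" and max_value:
        filled = int(10 * value / max_value)
        return f"{name} [{'#' * filled}{'.' * (10 - filled)}] {value:.0f}/{max_value:.0f}"
    if kind == "flag":
        return f"{name}: {'ON' if value else 'OFF'}"
    return f"{name}: {value:g}"

print(render_stat("HP", "bar", 37, 50))
print(render_stat("Gold", "counter", 124))
[/code]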

>>107083638
>its not only possible but is going to be fucking awesome, just waiting for it to be done 'nicely';
I think so too.
>>
>>107083576
https://github.com/p-e-w/waidrin
https://github.com/gddickinson/llm_RPG
https://ianbicking.org/blog/2025/07/intra-llm-text-adventure
>>
>>107083414
Your special interest is boring to everyone.
>>
don't @ me retard
>>
>>107083730
Are those some you know do something interesting or used and think work well or just ones you know exist?
Regardless, I'll take a look.
Thanks
>>
>>107083414
*munch*
>>
>>107083761
Features, approach, and documentation.

The two projects for their features and approaches (creating a generic fantasy game vs creating a game system you can then tweak/customize), the third for an actual game dev walking through the process, sharing their thoughts on the design.
>>
>>107083784
Awesome, that's more valuable than any one suggestion probably.
Thank you anon.
>>
>>107080974
Post content
>>
>>107084067
>>107084067
>>107084067
>>
File: postContent3.png (406 KB, 512x512)
406 KB
406 KB PNG
>>107083748


