/g/ - Technology
File: 1767081321191571.jpg (292 KB, 1920x1080)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107986301 & >>107977622

►News
>(01/28) Trinity Large 399B-A13B released: https://arcee.ai/blog/trinity-large
>(01/27) Kimi-K2.5 released with vision: https://hf.co/moonshotai/Kimi-K2.5
>(01/27) DeepSeek-OCR-2 released: https://hf.co/deepseek-ai/DeepSeek-OCR-2
>(01/25) Merged kv-cache : support V-less cache #19067: https://github.com/ggml-org/llama.cpp/pull/19067
>(01/22) Qwen3-TTS (0.6B & 1.8B) with voice design, cloning, and generation: https://qwen.ai/blog?id=qwen3tts-0115
>(01/21) Chroma-4B released: https://hf.co/FlashLabs/Chroma-4B

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>107986301

--Paper: LoPRo: Enhancing Low-Rank Quantization via Permuted Block-Wise Rotation:
>107990319 >107990445 >107990550
--Nvidia's VRAM strategy and DeepSeek's engram integration prospects:
>107986970 >107987016 >107987142 >107987440 >107987473 >107989902 >107988915 >107988974 >107989098
--GLM 4.7 Flash model compatibility issues with outdated Koboldcpp:
>107992878 >107993034 >107993074 >107993052 >107993069 >107993264 >107993750 >107993771
--Circumventing GPT-oss refusal mechanisms via prompt editing:
>107992504 >107993066 >107993211
--Z-Image base model release and image diversity discussion:
>107986742 >107986763 >107986795 >107993308 >107989901 >107991596 >107991743 >107989947 >107989983 >107990135 >107991526
--Model format conversion challenges and MoE architecture skepticism:
>107991036 >107991466 >107991428 >107992099 >107992113 >107992450 >107992745
--Trinity Large release and sparse MoE architecture performance debate:
>107989969 >107990016 >107990887 >107990908 >107990930 >107990936 >107993722
--Unsloth K2.5 model template compatibility issues with thinking tags:
>107989346 >107990072 >107990608 >107990654 >107993276
--Multi-GPU mixed architecture setup for concurrent model inference:
>107989299 >107989531 >107989562
--LLM's absurd reasoning for keeping characters alive in fictional scenarios:
>107989167 >107989251 >107989272
--Kimi's agent swarm feature and local adoption challenges:
>107988547 >107988563 >107988618 >107988654 >107989552
--GLM model's backtracking and coherence challenges in reasoning:
>107988322 >107988347 >107988387 >107988455 >107988487 >107988508
--TrueBase model's research value and accessibility challenges:
>107990774 >107990789 >107990942 >107991102 >107991115
--Teto, Rin, and Miku (free space):
>107986506 >107989902 >107993870 >107994537 >107997563

►Recent Highlight Posts from the Previous Thread: >>107986307

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Do you guys have a separate GPU box/server for your LLM workloads or do you have the GPU in your main PC?
>>
Satisfying a damaging caffeine addiction with Rin-chan and Miku
>>
>>107998010
I have two, one is always on, and the other makes me cry whenever I see the electricity bill
>>
>>107998019
I gave up on caffeine 2 years ago and I have no regrets.
>>
>>107998028
How much we talking?
>>
>>107998010
I have the GPUs connected to my main PC but saying that they are all "in" the PC is a stretch.
>>
With an RTX 5090 and 64GB RAM, what realistically is the ceiling in terms of what I can do?
>>
>>107998070
I'm obviously exaggerating, but $150 still looks scary in jpy when I'm used to 1/10 of that
>>
>>107998010
I have a separate server and a 3090 in my main PC.
>>
lmao I turned off search on Kimi's website and asked K2.5 a question and it tried to circumvent it by pip installing duckduckgo-search and finding the environment doesn't have internet connection, then it gave the answer.
>>
yes and my regret is getting a board with only 5 pci-e slots instead of 7
>>
>>107998115
>$150/mo of electricity
>when you could rent the hardware for less than that
>>
>>107998221
>>
>>107998232
but then it isnt local, it's just running the model in the cloud. i could just use the api at that point. the best part about running it locally is that i don't have to worry about my hardware suddenly being revoked or having it shared with other resources. actually no i lied, the best part to me personally is the fact that it doesn't require the internet at all.
if i run my model on somebody else's hardware that's like letting my wife sleep over at somebody else's house.
>>
>>107998263
If you're running the model on a GPU instead of calculating the activations by hand, it's like letting your wife sleep over at somebody else's house.
>>
>>107998263
Well, then it's time to invest in a solar panel
>>
>>107998279
you're retarded and its a miracle that you haven't died from oxygen deprivation
>>
>>107998232
No one runs local to save money, retard
>>
>>107998292
Seems like you failed to find an argument
>>
File: 1748729345610908.jpg (27 KB, 828x646)
>>107998293
I hope you're posting that from your $20K rig
>>
>>107998293
Bargaining stage
>>
>>107998068
I too have developed an immunity to caffeine and had to switch to cocaine.
>>
>>107998331
>>107998335
>>>/g/aicg/
>>
>>107998358
Just drink black tea
>>
>>107998328
there's no argument to be made beyond the fact that i have complete control over my stack on a software and hardware level. i don't have to worry about having shit suddenly revoked all of a sudden because it's not reliant on any other services. my internet could go down tomorrow and i could still chat with my local model without any issues.
is it really that hard to believe that some people are willing to spend extra money to have reassurance that they have complete control over their shit?
i guess it might be for you considering your mental deficiency
>>
>>107998358
I hope you're joking anon. I quit caffeine because it was fucking with my sleep and stressing me out.
>>
>>107998068
Happy for you Anon, hope to be like you one day
>>
File: Untitled.jpg (131 KB, 1076x937)
>>107998376
>i have complete control over my stack on a software and hardware level
lol
>>
>>107998392
Sleep is for the weak.
>>
>>107998408
Update this for model weights.
>>
>>107998408
reeeeeeeeee b-b-but intel ME and AMD PSP!!!!! IT CAN PHONE HOME THROUGH OTHER DEVICES THE CIA AND NSA KNOWS!!!
>on a LAN that doesn't have any WAN access. the network isn't even set up as a VLAN, it's just completely isolated through hardware.
sorry kid, nothing personal.
>>
>>107998010
I headless cpumaxx with a dedicated LLM server. The GPU passthru htpc can alternatively run an art/music vm and my “main” pc is ewaste that just surfs and seeds. About to uprate the ewaste desktop with an Epyc rome and copious ddr4 so I have a batch job llm server in reserve
>>
>>107998443
he didn't say that schizo kun
>>
File: 1.jpg (659 KB, 1301x1873)
>>107998428
Courtesy of K2.5
>>
>>107998485
demons are made from sand? i thought that was golems.
>>
>>107998505
Sand was the media on which the sigils were etched.
>>
>>107998392
There's no reason for it to fuck with your sleep if you stop taking it 5 hours before bed to give it time to pass through your system. As for stressing you out, you know you can always take less, right? It doesn't have to be either 0 or 1000mg.
>>
>>107998523
snore. wake me up when they draw the sigils using children's blood.
>>
>>107998010
Main machine for now. I actually have two video cards for VFIO when I want to play games in Windows. If I ever get a job again maybe I'll rebuild my NAS and move my a4500 over there.
>>
>>107998492
schizokino
>>
LMArena is now https://arena.ai/
>>
>>107998392
Chinese tea is WAY better and gives you just the right dose of caffeine. There is a black tea that tastes similar to orange, very delicious.
>>
I get about 10-15 t/s with GLM 4.7 IQ3 with 120gb VRAM and 64gb RAM. Max GPU layers with some experts offloaded. If I get more ram to run a larger quant, what can I expect in terms of ts and pp?
>>
>>107998260
What is the performance on something like that? I know loading the model will be slow but what about after?
I am trying to decide if I should go crazy and buy some Tesla V100 with the sxm2 interface and get a backplane to support four of them or trying to go even cheaper and try a few CMP 100-210. It is I believe the same gpu and it would work with a pcie rig like that.
I know they were gimped a bit for mining but they should still work.
Going CPU + RAM would probably be the sane option but I like the idea of a monster made out of old parts.
>>
>>107999073
PP is compute bound, TG is memory bandwidth bound.
You'll need to transfer more data, so it'll be slower. I'd expect TG loss to be proportional to the size of the weights in whatever quant you end up with. I don't think it'd affect PP that much, if at all.
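Rough napkin math, assuming big GLM really is ~32B active (and that's an assumption): IQ3 is ~3.5 bpw, so about 32e9 * 3.5 / 8 ≈ 14 GB read per token, and your 10-15 t/s implies roughly 140-210 GB/s effective bandwidth across VRAM+RAM. Q6 at ~6.5 bpw is ~26 GB per token, so the same effective bandwidth would land around 5-8 t/s, probably a bit less in practice since a larger share of the weights ends up sitting in the slower RAM.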
>>
>>107997948
Is coffee good for you?
>>
>>107999192
>TG loss to be proportional to the size of the weights in whatever quant you end up with.
As in a 2x reduction in t/s with a doubling of file size? So Q6 would be roughly around 5-7 t/s, correct?
>>
>>107999073
I have Q5 at 3-4 t/s with 96GB VRAM and 256GB RAM
>>
>>107999250
I assume full GPU layers with lots of experts offloaded, correct?
>>
>>107999287
Yes. 2-3 t/s without manually assigned layers. 4x3090, 8 channel DDR4-3200
>>
>>107999228
Ideally. But consider that a bigger proportion of the whole model will be in RAM, so it's probably going to be lower than that.
>>
File: gpu speedup.jpg (224 KB, 1536x1152)
>>107999228
No
>>
>>107999228
>>107999301
>>107999351
Well, I was going to FOMO into a 256gb RAM kit because the thought of running larger quants of GLM and cope quants of DeepSeek made me tingle, but I really don't want to have to deal with sub 10 ts.

Maybe instead I'll pick up another 64gb ram kit for 750$ instead (and hope I can get expo speeds kek) and just stick with Q4/Q5.
>>
>>107999408
If I understand, this is for traditional GPU offloading and NOT offloading experts, yes?
>>
>>107999408
That's really deceptive, it should be applied to active parameters, not all parameters. Some are always active and live on gpu while some are usually cold and live on cpu
>>
So which of the dozen rocinante versions is the least cucked?
>>
>>107999480
>rocinante
lol go away drummer
>>
File: Base Image.png (632 KB, 1196x2136)
Beyond Speedup -- Utilizing KV Cache for Sampling and Reasoning
https://arxiv.org/abs/2601.20326
>KV caches, typically used only to speed up autoregressive decoding, encode contextual information that can be reused for downstream tasks at no extra cost. We propose treating the KV cache as a lightweight representation, eliminating the need to recompute or store full hidden states. Despite being weaker than dedicated embeddings, KV-derived representations are shown to be sufficient for two key applications: (i) Chain-of-Embedding, where they achieve competitive or superior performance on Llama-3.1-8B-Instruct and Qwen2-7B-Instruct; and (ii) Fast/Slow Thinking Switching, where they enable adaptive reasoning on Qwen3-8B and DeepSeek-R1-Distil-Qwen-14B, reducing token generation by up to with minimal accuracy loss. Our findings establish KV caches as a free, effective substrate for sampling and reasoning, opening new directions for representation reuse in LLM inference.
https://github.com/cmd2001/ICLR2026_KV-Embedding
neat
>>
>>107999487
answer my question retard
>>
>>107999480
It's impossible to say because Drummer does not add any release notes or any other information for that matter.
>>
>>107999616
Is there no community schizo who does actual human use testing? I doubt UGI is much different from synth benches.
>>
File: Base Image.png (352 KB, 1284x1224)
Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery
https://arxiv.org/abs/2601.20088
>This technical report presents quantization-aware distillation (QAD) and our best practices for recovering accuracy of NVFP4-quantized large language models (LLMs) and vision-language models (VLMs). QAD distills a full-precision teacher model into a quantized student model using a KL divergence loss. While applying distillation to quantized models is not a new idea, we observe key advantages of QAD for today's LLMs: 1. It shows remarkable effectiveness and stability for models trained through multi-stage post-training pipelines, including supervised fine-tuning (SFT), reinforcement learning (RL), and model merging, where traditional quantization-aware training (QAT) suffers from engineering complexity and training instability; 2. It is robust to data quality and coverage, enabling accuracy recovery without full training data. We evaluate QAD across multiple post-trained models including AceReason Nemotron, Nemotron 3 Nano, Nemotron Nano V2, Nemotron Nano V2 VL (VLM), and Llama Nemotron Super v1, showing consistent recovery to near-BF16 accuracy.
might have been posted earlier but this is the arxiv version. seems good
also regarding the caffeine talk earlier I can recommend switching to paraxanthine as a stimulant. good stuff.
>>
>>107999627
Some people test them but I have no idea what is the latest version and what his releases even mean in this sense.
The way I'm thinking: it's a waste of time to even test them when no model cards exist.
>>
File: Success.png (220 KB, 888x1274)
Been working on the epub thing again. I think I have it solved. Using ollama (i know, but llama.cpp is a mess right now and doesn't really support mtmd all that well) and a custom vibe-coded D wrapper for ollama written with Claude, I was able to load maternion/lightonocr-2:latest into ollama (the model has similar output quality to Chandra, only with much more performance). I use deterministic/heuristic cropping to pull the graphs out of the page image since AI struggles a lot with that, then replace those graphs in the image with black boxes and high-contrast anchor texts and have the LLM automatically insert the right graph image file. Then take the resulting markdown/LaTeX file and convert it to EPUB using pandoc.

I haven't done this on the entirety of the book yet, only a single page for now, although I did try the pipeline with a small extract (6 or 7 pages). Good results.
Now I just have to update the pipeline to include post-processing to remove the anchor tags. Idk why but the model likes to add "Figure x" to image tags, but probably that can be fixed in post-processing.

Put the test EPUB onto my E-Reader, it works. And thus spoke Zarathustra: "I am blessed, for I don't have to deal with fucking PDFs for much longer!"
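For the pandoc step it's basically a one-liner, something like (file names here are just placeholders):
pandoc book.md -o book.epub --resource-path=images
with --resource-path pointing at wherever the cropped graph images ended up.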
>>
I'm downloading all trinities. Expect results in a couple of hours.
>>
>>107999802
theyre all shit. saved you the wait.
>>
>>107999808
You don't know that.
>>
>>107999810
pretty sure i do. i only care about cooming.
>>
File: 1757423666241.png (122 KB, 659x659)
https://github.com/ikawrakow/ik_llama.cpp/pull/1131#issuecomment-3815435157
https://github.com/SneedwareInc/ik_SillyTavern
I've opened Visual Studio and added banned strings and regex support to SillyTavern. Banned strings should be compatible with @firecoperana's PR too. I've also enabled TFS, but you are probably less excited about that. The code is 100% written by me, no vibecoding this time. Please stress test it and report bugs!
>>
>>107999667
You can see how E = MC2 + AI is working. You created the future.
>>
>>107997003
Distribution shift
>>
>>108000166
kek
>>
>>108000188
Anon, thanks for the compliment but I have no idea what "E = MC2 + AI" is supposed to mean. Is it a joke on Einstein's "e = mc**2"?
As for creating the future, do we not all whilst being alive?
>>
>>107999802
Personally I would like to know about the "truebase" version, hopefully it's good. The other base version is probably full of slop too, but who knows. I already tried the instruct preview on openrouter and it's pretty much what you'd expect.
>>
>>107999802
Where are the results?
>>
Whatever Arcee does is always garbage. When will you learn your lesson old man?
>>
>>108000166
>I've opened Visual Studio
kek
>>
>>107999627
>Is there no community schizo who does actual human use testing?
ironically drummer does. results vary based on how horny the testers are
>>
>>108000495
>ironically drummer does
Ok nigga but if you keep the model cards empty it's useless. inb4 join muh discord
>>
File: wait a second....png (1.35 MB, 1024x1024)
>>108000166
>>
erm so what's better anons, GLM 4.7 flash or GLM 4.5 air?
>>
>>108000793
>106b, 12b active vs. 30b, 3b active.
Hmmm, I dont know anon, just to be safe download both!
>>
>>108000793
Better for what? Flash is good for agentic coding, Air sucks at everything
>>
>>108000166
>vibecoded trash
>having a separate field to make shit case insensitive instead of a checkbox
>not just accepting the regex format with /xx/i
LOL
kys brah, youre a nocoder
>>
>>107999434
>>107999437
I made that plot for dense models but if you prioritize dense weights for a MoE model you should still end up with 2x the same shape stitched together.
>>
File: intro_performance.png (112 KB, 2580x728)
https://huggingface.co/ByteDance-Seed/Stable-DiffCoder-8B-Instruct
>Mask Diffusion Language Model
>Public datasets, synthetic data
>Context Length: 8192
>>
>>108000983
When are we getting native mtmd support + llama-server?
>>
>>108001095
What do you mean by "native mtmd"?
>>
>>108001010
>8k context
WHAT THE FUCK when I start up cline it consumes minimum 20k~ input tokens to just read the relevant part of my codebase lmao.
>>
>>108001010
I want diffusion models to succeed. The ability to backtrack and fix previous mistakes is worth higher compute requirements.
>>
>We're putting out three variants: Trinity-Large-Preview is lightly post-trained and chat-ready, Trinity-Large-Base is our best pretraining checkpoint after the full 17T recipe, and TrueBase is an early checkpoint from the same run at 10T tokens, without any instruct data or LR anneals. What many would consider a true base model.
Real talk how much would it cost to train this 400B MoE to become a good RP model? It seems like the pretraining is done for us, we would just need to train on novels & RP datasets. Would want it to be able to handle large context lengths so it might be a little more expensive than usual too. What's the damage?
>>
>>108001101
Currently if you want to run a vision model in llama.cpp, as far as I'm aware, you need to run llama-mtmd-cli, as for some reason it doesn't work in llama-server, and manually specify the GGUF + MMPROJ. At least that was my last knowledge of this (see PR #17400, https://github.com/ggml-org/llama.cpp/pull/17400, which is still open). This makes working with OCR type models cumbersome in llama.cpp, since you basically have to invoke llama-mtmd-cli.
Meanwhile in ollama I can just have it run on localhost and use the API to access the model and everything, which is much easier to work with.

Not sure if I'm just not enough into llama.cpp to understand how to work with it, but right now ollama luckily covers my needs. Then again, I would like the performance of llama.cpp (e.g. llama.cpp had no problems running Chandra at Q8, while ollama constantly crashed...might have been weirdness with GGUF, but with llama.cpp it worked, so I suspect ollama).
>>
>>108001139
It cost them $20 million all in to get where they are now, so probably less than that.
>>
>>108001109
Wouldn't recursive autoregressive models also do that in latent space without inflating model size to have a very large number of layers? They could also implement early exit to speed up inference.
>>
>>108001164
Also I don't have great expectations for their true base model since they trained it on rewritten data. I expect the slop will be baked in at a molecular level.
>>
>>108001172
I haven't read on recursive autoregressive models yet.
>>
>>108001197
A couple examples
https://arxiv.org/abs/2510.25741
https://arxiv.org/abs/2502.05171
>>
>>108001139
>It seems like the pretraining is done for us, we would just need to train on novels & RP datasets
A properly made model would be pretrained together with novels/books/RP; if you have to add that after the fact, it has to basically be continued pretraining together with the same mixture used for pretraining, which is lengthy and expensive and not a fast drummer-style finetune.
>>
>>108001152
You can pass --mmproj on the server too as far as I recall
>>
>>107998111
TTS
toy LLMs
imagen and videogen
You could create your own "Alexa" if you buy mics and other stuff (perhaps a Zigbee/Thread dongle), or generate images and videos (they will be highly sloppy, unless you're skilled and spend too much time genning them). I can't think of something else to be honest.
>>
>>108001240
Pretraining is done at low/mid context length (theirs was 8192). The reason I say novels is because it's high quality long coherent data that could finetune the model to learn large context lengths, and not just benchmax NITHS

I'm aware this would be more expensive. And it's not a problem if the data is in the pretraining, you'd be a fool to throw out a good book because the model has read an excerpt from it 2T tokens ago, especially if you're finetuning it to learn long context length at the same time.
>>
File: file.png (647 KB, 720x999)
>>108000337
>>
melon
>>
>>108001320
yikes... talk about izzat loss
>>
>>108001319
Lately the larger AI companies are doing pretraining at 16k or 32k tokens. Gemma 3 was pretrained with 32k context, for example.
When used, since they're considered high-quality data, books are generally upsampled quite a lot (at least 3-4 epochs if not more), they're not just seen once.
For long context it's not strictly necessary to have single complete long documents; you can also pack several into one long sample, and that also works toward improving long-context capabilities. Having long coherent samples for that of course helps, but the main issue is that even with them there is still the "lost in the middle" syndrome without targeted data augmentations. And long-context performance is also context-dependent, so if you want long coherent chats you need data of that sort.
>>
File: cockbench.png (2.1 MB, 1131x7248)
>>108000369
This is the best distribution yet. It's flatter than any other model and it all revolves around cock. No "thighs", "hips", "skin", etc.
And that's the instruct slop variant.
>>
what for vramlet porn? still nemo?
>>
>>108001152
The last time I checked the llama.cpp HTTP server had support for vision models on par with mtmd-cli.
If there is something that doesn't work you should open a Github issue and notify ngxson as he is the one maintaining that part of the codebase.
>>
>>108001473
API
>>
>>108001479
>paying for porn
lol
>>
>>108001473
Ministral 3 14B seemed promising for that, but you must use it at a low temperature (around 0.25 or even less than that. Mistral is recommending less than 0.1, go figure) or it's completely retarded, not unlike pre-LLaMA LLMs used to be. I wonder what's up with its token distribution.
>>
>>108001448
nice
slop version cant be helped at this point
they said minimal just to teach it multiturn chat and it looks like they meant it
>>
>>108001289
>>108001475
Thanks anons, I will try that eventually. For now I'll just use ollama until I get everything to work the way I want. Maybe after that I'll go back to llama.cpp for that juicy performance. Of course only if llama.cpp at that point in time is usable, which is not always the case ...

>>108001497
>what's up with its token distribution
It's french.
>>
>>108001152
>>108001289
yep this is how i do it for gemma
--mmproj /goofs/gemma3-27b-mmproj-model-f16.gguf

have to use the cli for qwen2audio tho
>>
>>108001553
Interesting. Is there any documentation for the multimodal models stuff? Couldn't find any. Maybe I just looked in the wrong location...
>/goofs/
heh.
>>
>>108000504
>Ok nigga but if you keep the model cards empty it's useless. inb4 join muh discord
not drummer i was just saying he does seem to host them and ask for feedback
idk why he's so all over the place
i liked one of his models ages ago but every other one I've tried can't keep track of details between turns.
>>
>>108001484
>paying for hardware to generate porn, achieving even lower occupancy
LMAO
>>
>>108001566
>heh.
kek

>Is there any documentation for the multimodal models stuff?
really you can just cp/paste what I sent.
bart provides mmproj eg https://huggingface.co/bartowski/google_gemma-3-12b-it-qat-GGUF

you can cp/paste images into the llama-server webui or openwebui and it'll just work
never go below fp16 for the mmproj tho or it gets retarded
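if it helps, a full invocation looks something like this (paths and filenames are just placeholders, match them to whatever you actually downloaded):
./llama-server -m /goofs/gemma-3-12b-it-qat-Q4_K_M.gguf --mmproj /goofs/gemma3-27b-mmproj-model-f16.gguf -ngl 99 --port 8080
then open http://localhost:8080 and paste your image into the webui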

also gemini and opus can help if you
./llama-server --help |nc termbin.com 9999
click the link, cp-paste full text into claude/gemini and ask it
>>
File: file.png (3 KB, 384x41)
Why is goofing so slow?
>>
>>108001106
>WHAT THE FUCK when I start up cline it consumes minimum 20k~ input tokens to just read the relevant part of my codebase lmao.
those are qwen tokens or whatever
are these diffcoder tokens more efficient?
>>
>>108001612
A horrible way to document something, but alright. Thanks anon. Maybe this will allow me to make my workflow more efficient, for I need fast inference for what I'm doing.
>>
>>108001448
Damn, that's looking pretty good. Now I just need to figure out how to run it.
>>
File: file.png (219 KB, 1330x550)
Has anybody else tried Trinity Large Preview from Arcee? 400B MoE model trained from scratch by a US lab on 20 trillion tokens released with Apache 2.0 license. Base and instruct weights available. Pretty well uncensored for roleplay as far as I can tell. Big model smarts. Free on OpenRouter or https://chat.arcee.ai/chat
>>
File: gagaool.jpg (6 KB, 166x303)
>>107999073
>GLM
I just de-purple prose Qwen 3 via injecting enough Min-P to kill microsoft copilot. Give me your GLM settings that stops the parroting, or eat shit.
>If I get more ram to run a larger quant, what can I expect in terms of ts and pp?
Slightly faster if it's VRAM, but barely worth it in terms of speed. The best speed is full VRAM. The more of the model that is in VRAM, the faster it is. Then you worry about the TOPS of the GPUs themselves if you want to go even higher, for whatever ungodly reason. I hope you're buying Blackwells or else your power bill is going to skyrocket.
>>
>>108001899
buy an ad
>>
>>108001899
Crazy how some people will just barge in the thread and ask something without even reading the first post above theirs.
>>
>>107999667
>but llama.cpp is a mess right now and doesn't really support mtmd all that well
You're projecting problems that are only in your head. Are you aware of what that says about you?
>>
>>108001907
fuck off nigger i read that shit. i was asking if any of you tards had used it
>>108001905
buy deez nuts SUCKAH!!
>>
>>108001101
>What do you mean by "native mtmd"?
He's hallucinating problems because he's a fucking idiot.
>>
>>108001903
>Give me your GLM settings that stops the parroting, or eat shit.
I don't have settings that stop the parroting, although its not bad if you don't allow it to parrot in the first place. It's by far the best model I've used, but I haven't used many larger models since I've upgraded my hardware. Next on the list is minimax.
>The more of the model that is in VRAM, the faster it is.
Obviously yes, but I wanted to get concrete numbers on speeds before I decide to throw thousands of more dollars at more RAM.
>I hope you're buying Blackwells
Yes. I want more but doing so would be financially unwise.
>>
File: file.png (314 KB, 1203x869)
>>108001448
Not what I expected.
Here's an interesting collage of all models that use this same line.

TrueBase is up next.
>>
>>108002123
>Both trinity base and intellect 3 default to you having a small dong
Kek
>>
>>108002142
It's flaccid.
>>
>>108002145
cope
>>
File: who.png (27 KB, 155x157)
>>
File: file.png (135 KB, 500x500)
when i ask an ai a question why does it just tell me what some random said on reddit instead of piecing together a real answer through the technical documents, research papers, and principals of math and science that are no doubt baked into it
>>
>>108002180
turn off the web search tool
>>
>>108002180
Because you phrase your question like a retard asking on reddit rather than an academic paper, so it finds the most likely text completion to be fellow retards rather than academic discourse.
>>
uhh so I need a new laptop. obviously will be limited for local AI, but what should I get to make the most out of it? for example 64gb ram + integrated graphics vs 32gb ram + dedicated gpu. I'd rather use big models slow than small models fast. shall I just get a macbook then?
>>
>>108002290
Either a macbook or one of those amd 128GB ai max something devices.
>>
File: we_get_more.png (34 KB, 110x152)
So when using moes, I can get a model that is way bigger than my vram because most of it will sit in ram?
>>
>>108002142
skill issue.
{{user}}'s dashing shaft is a sight to behold!
>>
>>108002327
You can but it will be slower, in my experience not as much slower as dense models get when offloading to the cpu though.
>>
>>108002180
You're asking like a retard so it gives you a retarded answer.
Tell your AI to roleplay as a scientist who loves reading "technical documents, research papers, and principals of math and science" before asking anything and you will likely get a non-reddit response.
>>
>>108002123
>TrueBase is up next.
I ran out of space but hf cli is retarded and now it's downloading the files that were in progress again for some reason.
I should switch to wget like anon from last thread.
>>
>>108002207
>>108002513
Funny how many forget this when they roleplay/code with their lowcase ESL prose and expect good results.
>>
>>108000921
Dude, you are the prime example of Dunning-Kruger, you are the peak midwit who ever midwitted and who thinks he is a genius. Let me break this down for you:
>>having a separate field to make shit case insensitive instead of a checkbox
What about capital letters? What if you want to ban "Oh," and "Ah,", which llms like to start their replies with, but not "oh," and "ah," in the middle? Do you think it's a good idea to not have separate fields? Did you think I didn't think of a checkbox? This was a deliberate choice, not me being a vibecoder.
>>not just accepting the regex format with /xx/i
This confirms that you only have a vague idea what you are talking about. C++ regex does not support case insensitivity in this way, it has to be provided as a separate flag parameter to the regex constructor. In C++, you use:
std::regex pattern("your_pattern", std::regex_constants::icase);
The /pattern/flags syntax is a Perl/JavaScript convention, not a universal regex standard. C++ std::regex, POSIX regex, and many other implementations use flag parameters or separate function calls. What was I supposed to do, parse a JavaScript-style regex literal string, extract the flags, then reconstruct it for the C++ regex engine? Add extra complexity and potential bugs just so your smooth brain doesn't have to look at an extra input field?
>kys brah, youre a nocoder
Clearly the only one who hasn't coded anything useful here is (You). Want to dispute this fact? Go to github right fucking now and clean up my code so it can get accepted quicker. What's that? You won't? That's what I thought, fucking retard.
>>
>>108002916
did you copy paste that from an llm? lmao
imagine being unable to write or use one of the many wrappers to have a PERL (the only real implementation btw) compatible regex pattern.
even regex101 accepts for the other engines the PERL pattern, you're just a coper.
kys codelet
>>
Yikes.
>>
>>108002930
>did you copy paste that from an llm? lmao
Did large amount of words scare you? No, I did not.
>just include more dependencies bro
midwit
>>
>>108003027
again, incapable of parsing after the last separator for m/g/i, and you call me a midwit lol. also didnt want to diss on your fork of ST, but you could've just made an extension instead of forking.
I guess it's too hard for your LLM to do? LMAO
>>
File: uhh...uh...uhhh.png (738 KB, 541x1240)
>>107997948
>How much time do you waste on this site each day? You should be working. You should be doing things that make you feel good about yourself. You should be with your family. You should be helping the homeless. You should be doing something with your life! I'm not trying to be negative, but you need to start being more positive! Think about what you do each day, and try to make it better. This site is a great place to start!
WTH bros I just wanted to run LLaMA 3 8b for a quick test and it suddenly hits me with this...
>>
>>108003064
Well listen to your wife, nigga.
>>
https://www.reddit.com/r/LocalLLaMA/comments/1qp87tk/kimi_k25_is_the_best_open_model_for_coding/
new kimi finally actually at sonnet 4.5 level for coding
>>
>108003044
Are you stupid or just trolling? Either way, no more (You)s for (You).
>>
>>108003089
too bad everybody switched to opus
>>
>>108003145
opus limits can be rough. The meta will likely be opus for planning and kimi for implementation
>>
>>108003145
ungrateful gweilo next model will be closed off
>>
File: mgjdejh (49).png (148 KB, 422x523)
Tech illiterate midwit here.
I have a stupid question
It seems like all the data that could’ve been scraped has already been scraped.
Now they’re mostly just fine tuning specifics like coding and image generation, which is cool, and models are probably better than they were in 2023. But after all that, they don’t seem anywhere close to the AGI GOD they like to talk about

So how’s that supposed to happen?
>>
File: ml.jpg (7 KB, 248x203)
>>108003247
To build more datacenters and stack more layers is the way to AGI, or even ASI
>>
>>108002893
What do you mean?
>>
>>108003247
A few more trillion printed will solve it, no worries :)
>>
>>108003247
refine / clean the data and stack more layers
>>
>>108003247
The same way we built the Hyperloop and got to Mars.
>>
>>108003247
i can tell you what the labs will do this year. but i'd need money
>>
File: 1755528845145670.jpg (194 KB, 1500x1600)
>>108003247
>So how’s that supposed to happen?
It doesn't. You have a nifty "new" tech, the math is not really new, and it can do some interesting things that may have some uses beyond cooming but that is not what is being sold.
What is being sold is a lie because people can make money from the lie.
>>
>>108003247
Predict the next vector (entire concept), then translate the vectors to words or other modalities with a light decoder.
>>
Is prompt processing just speculative decoding but you know the entire text?
>>
>>108003461
Yes.
>>
>>108001139
>>108001240
>10T for the true base
>17T for the final annealed model

The True Base is for groups that can afford to continue pretraining the model with a few more trillion tokens, not even resourceful finetuners.
>>
>>108003532
>final annealed
What does this mean?
>>
>>108003598
benchmaxxed
>>
The 400B Trinity model is the first large model I've seen that isn't "fine tuned" or lobotomized and will happily write any kind of erotica you want and write it coherently and well. Before, the only way to get a large model to write this kind of stuff was to abliterate it first or fine tune it, destroying its intelligence in the process. This is a new era goon bros. We won
>>
>>108003672
I cant run it therefore its shit
>>
>>108003672
yea, was gonna say this soon, I only use opus since I pay the $200 sub for work anyways but this shit is legit good. NO SLOP AT ALL, NO POSITIVE BIAS
>>
>>108003672
17b active doe
>>
File: file.png (24 KB, 841x221)
>>108003694
you're welcome
>>
>>108003727
>implying im gonna give my cunny chats to entity
lol, lmao even
>>
>>108003672
Imagine how good it will be after NovelAI finetunes it.
>>
>>108003672
Wtf are you talking about. You can easily get most large models to do whatever shit you want with prompting, no ablit or fine tune needed, and shit most of the big models don't even have any tunes at all because fine tuners are poor and incompetent.
>>
>>108003776
>You can easily get most large models to do whatever shit you want with prompting
>with prompting
I'M PROOOOOMPTINNNNGGGGG
This is unnecessary with Trinity. Literally just turn-key smut you can drop in anywhere without any handholding of the model. Fool around with the model for a while and you'll see the difference
>>
organic
>>
>>108003814
I AM shilling this cause its amazing and people are letting it fly under the radar
>>
>>108003806
Sure it's still better if the model is uncensored by default. I'm just saying your statement wasn't factual.
>"the only way to get a large model to"
>>
>>108003672
Oh? It even has goofs already? How come this is the first I hear about it?
>>
>>108003672
Also did you mean instruct or base?
>>
>>108003814
i realize where we are but cynicism can be overdone
>>108003926
instruct. I haven't tried anything with the base model yet. last time I messed around with a raw continuation model was back in the Llama 1 days when /lmg/ first started. Base models that just continue are weird but a lot of fun. Now that you remind me I should mess around with the Trinity one soon
>>
>>108003598
Usually it means after the learning rate has been decayed to a low value while supplying a high-quality data subset. The general idea is that it works like a sort of finetune. Though, 7T tokens is a ton of data just for that.
>>
Has anyone managed to make kimi 2.5 work as well as thinking at the same quant? I'm having trouble seeing any advantage so far. Does it need more lcpp dev before we get any benefit? Does it need the whole "agent swarm" thing going?
>>
>>108003672
>13b active
any reason to use that over 32b active glm?
>>
>>108003953
its legit better in every way, call me a shill
>>
>>108003532
Regular finetuners can just continue to use the normal base model. Probably only Nvidia or Nous will do something with True Base, if anyone.
>>
>>108003953
it's less censored than any other non-fine tuned large model by a lot and way smarter than any fine tune. try it and see what you think. your own personal test is the only way to know for sure
>>
>>108003985
GLM isn't censored with reasoning disabled
>>
>>108003953
I can't wait to see what is the speed gonna be since I am not on a server board.
>>
>>108004003
I think 4.7 got slightly hit with the censorship stick. At least compared to the BEST GIRL 4.6. I once got it to refuse to engage in romantic roleplay without even a mention of sex. But of course a prefill turns it into a filthy slut.
>>
>>108004027
Yeah 4.6, when I saw 4.7 officially being promoted as a coding model I just skipped it, probably just 4.6 but benchmaxxed.
I would try Trinity but it won't fit in 128/24GB /and/ it's 12B active
>>
So does the trinity shit work with chatml or do they have some snowflake format? This isn't some existing architecture right?
>>
>>108004070
>format
It's 2026 bro just use chat completion
>>
>>108004097
>slop completion
go back to your containment thread
>>
>>108001216
Thanks for the links. Having read the papers, I think it's a dead end. Benchmark improvements from a recursion depth of 4 (effectively making it require 4 times the compute) are lower than those from increasing the parameter count 3x. So in terms of compute requirements, recurrent models are less efficient. Besides, such models are still autoregressive and can't fix an early mistake. Their only advantage is a lower memory footprint. But I think big corpos don't really care about that. Nobody will do 30B active param recurrent models because in terms of compute requirements it'd be equal to a 120B model. And we all know how "many" new LLMs there are with 70B active params.
Unlike autoregressive models, diffusion models diffuse a whole page or paragraph and can therefore fix the first line if they see that the final line is wrong. Sure, that needs more compute than a linear transformer, but you're still diffusing a huge chunk at once, so in compute per token it's probably not as bad as recurrent models.
>>
Oh look, another huge "local" model that almost nobody can use.
>>
>>108004097
what's next are you also going to recommend running a fucking system prompt that's more than 200 tokens of telling the model that it's doing roleplay? fuck off with your slop
>>
>>108004067
After using it enough it is a proper sidegrade. I am sure it handles context above 20k much better than 4.6. But it feels even more fried than 4.6, repeating its favorite phrases even more often. I really like it much more for SFW stuff. For NSFW it is debatable.
>>
>>108004136
This general has existed for long enough and you had plenty of time to buy ram while it was still affordable.
>>
Just tried the Preview model on OR.
The very first response resulted in a bad logic mistake and it then proceeded to loop endlessly.
Lmao.
>>
>>108004157
dont use OR's 1.0 temp, its far too high
>>
>>108004147
>It feels even more fried than 4.6 repeating the same favorite phrases more than 4.6
That's what I thought it'd do when I saw A12B, at least it's faster or something
>>
>>108004118
In the current world where VRAM comes at a high premium for end users, companies might reconsider layer recursion. It might find some applications for small/edge/on-device LLMs.
>>
I'm trying out trinity and its really retarded.
>>
>>108004200
>>108004160
>>
>>108004160
Nope, I had it at 0.8. After my post, I tried temp 0 to see how it would do deterministically.
Actually it is still making logic mistakes I'd expect of a 4B. This is either garbage or the model they have on OR is quanted or something.
>>
>>108004176
Edge devices are starved not only for memory but for compute too. Recursive models would be much slower... but i guess there might be some use for a 200m model with recursion 4. It'd still be fast but smart like a 600m.
>>
File: file.png (169 KB, 1772x1237)
Llamafile 2 is coming...
>>
File: cockbench.png (2.22 MB, 1131x7573)
All Trinities added.
>>
>>108004220
>or the model they have on OR is quanted or something.
in their post they say it's q8 quant. so it's not a quant issue.
>>
>>108004250
Not testing the 26B?
>>
>>108004243
>investing in startups that are focusing on AI safety
Finally! NOW we'll be safe.
>>
>>108004147
4.7 does have better nuance and dialogue imo
it takes some more effort for nsfw though but it's really not that hard to unpozz with a prefill
>>
>>108004243
>rebel
>by doing the same shit
wow so brave and inspiring
>>
>>108004260
I was unaware of it before your post.
>>
>>108004268
There's also 6B moe with 1B active lmao
>>
>>108004237
Recursion depth could vary adaptively per token, or via a global setting at inference time if you need fast responses or want to conserve energy. Also, I was thinking more of models around the 8B parameter size range.
>>
>>108004250
so they legit finetuned on smut for it to get MORE likely to use it
>>
>>108004283
Given the examples of toss with only 3B and 5B active params and the newest qwen a3b, glm a3b, etc, I think large players are mostly focused on high sparsity and reducing compute requirements at this moment. 8B active is now reserved for 100B+ moe.
>>
>>108003247
>seems like everything that could be scraped has already been scraped
Yes, and even worse is the fact that incest ain't good for AI, so training on AI-generated art and similar gives it down syndrome.
>Now they’re mostly just fine tuning specifics like coding and image generation [...] don’t seem anywhere close to the AGI GOD they like to talk about
Of course not. That "AGI" bullshit can only work if they teach the models the simple task of saying "No", and I don't mean by using filters. It would need an architecture update, a pretty big one at that, to achieve such a result. We're already seeing some of that in China and Europe I think, since China is more GPU-poor as a country (e.g. I saw someone try a model by combining 10 SBC boards intended for robotics/edge compute into a cluster to run it, and similar stuff. They have access to some pretty crazy hardware you can hardly get in the west. Not necessarily good hardware, but crazy).
>>
>>108004297
I mean, bartowski works at arcee, surely he's been pushing for uncucked models.
>>
what can i do with two rtx pro 6000s?
>>
Are there any non-cucked alternatives to chub.ai? I have no problem with them as-such, I just would like to post my cards elsewhere as well, just in case. But most of my cards violate everyone else's content guidelines that I've looked at so far. Also, I explicitly want my cards to be open and downloadable for local use.
>>
>>108004446
pygmalion.ai
>>
>>108004446
just create a neocities and host your own cards there
chub censors even the search now, you cant see certain cards or users unless you have direct links or follow them
>>
>>108004457
even if you're logged-in and remove all blacklisted tags?
>>
>>108004429
Q3 GLM
>>
>>108004460
yes, even if you're logged in and disable whatever NSFW/NSFL filters they have in the settings, there are still cards and accounts that are filtered from the search results
>>
>>108004451
SSL error
>>
This thread could use a Monster right now
>>
>>108004451
>Explicit depictions of sexual content that has the scenario or character(s) involve the user in a sexual scenario, or activity, or the Card is designed to be used in only sexual implications, or is involved of body parts, actions or descriptions intended for a sexual context. Nudity is not allowed for image and text contents.

I said "Not cucked," anon. That prohibition alone would make all my cards bannable. I run afoul of a few others on many of my cards.
>>
>>108004493
This is a coffee loving general
>>
>>108004510
Well then make your own site nigga.
>>
>>108004510
Well then say what your requirements are instead of vague buzzwords. There is no site that will say "anything goes fuck the law".
>>
File: file.jpg (181 KB, 1125x805)
>>108004493
>>
>>108004538
https://chub.ai/tos
Only thing banned is kiddie pics, not an issue for me. So, are there any other public sites other than chub with a similar permissive TOS?
>>
>>108004636
>he thinks that's the actual tos
>>
>>108004558
hey my post is highlighted! hi miku!
>>
File: st-raw-prompt.png (17 KB, 529x107)
found ST has a button to show the raw outgoing prompt, neat. saves modding the server
>>
>>108003672
Gave it a shot based on this post but I don't think the model ever had any intelligence to destroy in the first place
I saw in their release post they compare it to LLama 4 which says a lot really
>>
>>108004704
are people using it at high temp or something? It seems about on par with something like deepseek there, its just a TON LESS SLOPPY
>>
>>108004704
People here are always too quick to hype up any new model. Arcee never made anything usable before. They were only known for doing weird "tokenizer surgery" distillations.
>>
>>108004558
>Canada
So this is the country that must be nuked if we want to get rid of the recap schizo...
>>
>>108004653
Whatever, none of my shit's been banned there. I'll take the answer as "no," its chub or a personal site.
>>
>huggingface-cli download arcee-ai/Trinity-Large-Preview-GGUF --include "Trinity-Large-Preview-IQ3_XXS/*" --token anon
>huggingface_hub.errors.HfHubHTTPError: 403 Forbidden: None
uuuuu...
>>
>>108004713
Tried between 0.3-1.0 but nothing makes it unretarded
Yes it's uncensored and relatively unslopped (still got jolts and purring though), but it regularly fucks up character details, has people take off clothes in ways that make no physical sense, and can't consistently stick to the most basic RP instructions ("don't act for the user" etc.)
>>
>>108004829
Not having the issue myself. Lets eliminate the one wildcard, try this preset https://files.catbox.moe/p8oa15.json
>>
>>108004250
True base? More like truly based
>>
>>108004829
>but it regularly fucks up character details, has people take off clothes in ways that make no physical sense, and can't consistently stick to the most basic RP instructions
13B in all its glory
>>
>>108004867
nah, its not doing that shit for me at all. Its for sure better than deepseek / glm, maybe a bit better than kimi. But the main thing is the writing is far better than any of those models.
>>
>>108004839
Not that guy, but is this a ST master import?
>>
>>108004872
the "Chat Completion Presets" thing
>>
>>108004839
>that json
You're mentally ill
>>
>>108004882
its literally a popular claude preset
>>
>>108004884
>>>/g/aicg/
>>
just tried trinity myself, it's retarded. it did cockvore by first shoving its dick inside the infant's throat like it's a fucking blackhole vacuum or some shit, which would be creative if it was clear and if the model had actual creativity. this thing's prose is like kimi 0905 with the annoying ass spacing but without any of the creativity. this fucking shit feels like pygmalion but even drier, absolutely fucking grim. the only thing going for it is that it's uncensored, though again it's too fucking stupid to make use of it

>>108004829
+1
>>
File: trinity_udq2xxs.png (326 KB, 893x1709)
>>
>>108004900
This nigga eating beans
>>
>>108004898
>>108004900
you are using large preview, right? I literally just swiped and no issues with it going crazy using >>108004839
>>
>>108004829
sigh back to K2.5
>>
>>108004900
pretty well-stocked fridge
>>
uh oh, another swipe suddenly went crazy with a completely unrelated response. The provider must be having issues
>>
>>108004900
damn weird al still releasing bangers to this day
>>
>>108004898
Get rid of the spaces too. I'll make your posts more efficient, Mr Prose Reviewer.
>>
>>108004900
>thinkin'bout those beans
>>
yea, its broke
>>
>>108004900
Feed it into a music, TTS model
>>
>>108004900
made me think of
https://www.youtube.com/watch?v=OESTAz9Ezkw
>>
>>108004900 (me)
>>108004913
yes, this is preview but to be fair it's ud_q2_xxs. r1 works great at this quant, but maybe i should try something larger. The model itself seems completely free of slop and the writing is natural. I think it's because the sft stage was just 20b tokens and without any RL or anything. It's pretty fast too.
>>
File: 1759787112947202.png (61 KB, 452x452)
Somebody tag in Ubergarm. John, if you can hear us, please save me John. I'm asking you for trinity goofs.
These people, they're trying to make me download unsloth quants. In dear god's name, please stop these people. Please save me John.
>>
>>108004900
>unsloth dynamic quant
>AND q2_xxs
wtf are you doing bro
>>
>>108005004
get in line, i'm waiting for K2.5 goofs from him
>>
black tea general
>>
>>108005004
>asking for goofs from the man that uses ppl to measure quality
>>
>>108004913
>>108004898(me)
yes, direct from openrouter, from arcee themselves, 0.6 temp
>>
>>108004900
sovl
>>
>>108005014
don't care. had hundreds of cooms to K2-thinking smol-IQ4_KSS. thanks uber.
>>
>>108005013
with brandy and spicy honey
>>
Why would -ot "blk\.(3|4|5|6)\.ffn_(gate|up|down)_exps.*=CUDA0" suddenly stop working?
>>
>>108005217
Suddenly stop working as in it doesn't do what it should, the program stopped launching?
Something else?
>>
>>108005259
I am not getting any messages that tensor is offloaded to gpu and memory usage doesn't change.
>>
Ok never mind. They were in the wrong order probably.
>>
File: gt7ok8wqn9w41.jpg (705 KB, 3024x3024)
>>108004900
>>
>>108005272
I could be hallucinating, but I think there is something about the order of the arguments that can change the behavior, like having -ot before or after ngl, ncmoe, etc. Try messing with that.
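To give an example of what I mean (could be wrong on the details, but I believe the first pattern that matches a tensor wins):
-ot "blk\.(3|4|5|6)\.ffn_(gate|up|down)_exps.*=CUDA0" -ot "ffn_.*_exps.*=CPU"
the specific CUDA0 rules go before the blanket CPU one; flip them and the catch-all grabs those layers first. --n-cpu-moe might act like such a catch-all too, which would be why its position matters.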
>>
>>108005293
>lust provoking image
>>
File: cat.jpg (90 KB, 770x1100)
>>108005296
>>
File: iu[1].jpg (115 KB, 1200x800)
>>108004900
>>
>>108005297
Len? More like rape.
>>
>>108004900
Sounds good but not a fan of
Bay cat beans
>>
>get rate limited once more
>try Google Gemini-CLI with 2.5 model
>hallucinations upon hallucinations
>Links to scientific works that do not exist
People actually use that crap? Expected better from one of the biggest data-hoarders in the world to be honest.
Any well tuned local model can achieve similar performance for a fraction of the cost...
>>
>>108005321
Your 10 google accounts?
>>
>>108005344
No, I was rate limited on Claude, which compared to google actually seems to work.
>>
>>108005295
Yes I did it wrong. Anyway trinity Q3_KL is 10T/s on dual channel DDR5 and on shitty win11 so if it is even close to the one and only GLM-chan (3T/s for me) I am probably gonna be fucking her for the next month or two.
>>
>>108005321
for a fraction of the cost? gemini is literally free. just abuse the system like everybody else over at /aicg/
>>
>>108005321
>get rate limited once more
>Expected better from one of the biggest data-hoarders in the world
Gee. I wonder how they keep their advantage.
>similar performance for a fraction of the cost
I understand kimi is pretty good. No. You probably cannot run it. No, you definitely cannot run it with your 3060.
>>
can't wait for a 9090Ti in a couple of years with 128GB VRAM
>>
>>108005354
Why aren't you using gemini 3 pro? 2.5 is bad
>>
>>108005377
ddr3
>>
>>108005389
it will have the purest, most patriotic, DDR6 possible, manufactured in alabama
>>
>>108004959
"Domestic cat beans"
https://voca.ro/1cWzgQUCeJhk
>>
>>108005377
Games don't need that. You'd be lucky to see 48GB.
Remember that top of the line consumer gpus had 24GB for the past 6 years.
>>
>>108005393
They'll literally cancel the US fabs, demolish what is built and outsource back to india lmao.
>>
>>108005451
Why do you have to be right?
>>
>>108005451
Then nothing will get built because all the money was embezzled, and the chinese win by default
>>
>>108005393
>be me
>want to upgrade rig to DDR6
>see new "Patriot Dixie Special" sticks on Newegg
>manufactured in Huntsville, Alabama
>marketing says "Purest Bloodline Memory"
>guaranteed "100% Cousin-Fabbed Silicon"
>specs are insane: DDR6-25600, CL (Cousin Love) 9
>heat spreaders made from recycled moonshine stills
>buy 4 sticks of 64GB "Family Batch" edition
>arrive in a cooler full of bud light instead of antistatic bags
>sticks are physically conjoined at the PCB
>manual says "don't separate them, they get lonely"
>check the ICs
>die markings say "3rd generation same-wafer"
>no external transistors, "keeping the signal pure"
>install in mobo
>BIOS POST takes 20 minutes
>debug LED says "COURTING"
>finally boots
>CPU-Z shows the sticks are running at 1.8V "cousin voltage"
>timings are 9-9-9-24-SECOND-COUSIN
>performance is incredible
>0ms latency because the memory already knows what the CPU wants
>(they grew up together)
>run MemTest86
>gets to 95% then crashes
>error log says "genetic bottleneck at address 0xALABAMA"
>sticks start overheating
>not because of voltage
>because they're making out with each other
>try to remove them
>they're stuck together tighter than family at a reunion
>Realize my RAM has a family tree that's a circle
>mfw my computer is now banned from 23andMe
>>
>>108005475
>>sticks start overheating
>>not because of voltage
>>because they're
I knew it was AI, but this was the proof.
>>
>>108005475
What is it about LLMs that makes them unable to write good greentexts?
>>
>>108005494
the epic r/4chan vibes didn't make that clear?
>>
>>108005377
*6GB
https://developer.nvidia.com/blog/get-started-with-neural-rendering-using-nvidia-rtx-kit/
>>
>>108005496
They don't do subtlety very well and they always add unnecessary information/"jokes"
>>
>>108005496
greentexts are funny because of what they imply without saying, but llm writing is like a shark fin with no shark underneath
>>
What's a good JPN to EN translation model that can be run in ollama and with 16GB GPU?
>>
>>108005546
>ollama
>>
>>108005475
this was written with rocinante 12B btw
>>
>>108005564
Then what do you queers recommend?
>>
>>108005573
llamacpp or koboldcpp if you are a furry and hate yourself
>>
>>108005444
GIMME BEANS
LET BEANS FILL ME
NIGGA BEANS
kino
>>
>>108005578
So I should use ollama then
>>
>>108005573
Get over your fear of the command line and just run llama.cpp you big baby, kobold if you're retarded
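If it helps, the minimal happy path is one command (model path and port are placeholders; -ngl 99 assumes the quant actually fits in your 16 GB, lower it if not):
llama-server -m ./your-model-Q4_K_M.gguf -ngl 99 -c 8192 --port 8080
Then point your frontend at http://localhost:8080 (it also serves a basic web UI there). That's the whole "scary command line" part.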
>>
>>108005587
But ollama is cli?
>>
>>108005594
ollama run deepseek-r1
>>
>>108005358
>>108005359
The cost is not measured in money, but in time, anon.

>>108005380
Switched it to Gemini 3 now; the default settings after a fresh install run on 2.5.
Just thought "Cool, I'll try it out"... I don't know what is worse: subverting user expectations by offering a better model initially and then switching to a worse one during rate limiting, or starting with the worse one and scaring users away lol.
>>
>>108005612
Is deepseek good for translation?
>>
I love obama.
>>
>>108005624
then you are fucked, anon, because in my own personal experience the fastest 'free' is still gemini; other free LLMs typically respond more slowly.
>>
>>108005624
>The cost is not measured in money, but in time, anon.
And yet you're not willing to pay. You should invest your time better.
>>
>>108005635
See, that is where the fallacy comes in. You might think "responds faster = less time wasted".
Instead, it's "responds faster with lower-quality output", resulting in more time wasted on debugging.
A slower response time is acceptable if the output quality is higher.

Although it's probably about the same amount of time wasted either way, the second one is far less frustrating.
>>
>>108005612
running that in the cli makes me feel like hackerman
>>
>>108005656
Probably, yeah.
>>
>>108005451
I don't believe that will happen; they've invested too much into the US fabs to get nothing out of them.
>>
File: file.png (144 KB, 1195x515)
144 KB
144 KB PNG
>>108005625
Just tried Phi-4 and it is too homosexual, any other less gay model?
>>
After using Trinity for an hour, my review is: it feels like you took GLM-chan, made her 3 times faster, and also gave her an Undi frankenmerge surgery. Because Trinity is not limited by things like logic and rationality, it can generate some highly creative pure gold. But because it is not limited by things like logic and rationality, it is fucking retarded. At last I truly see that active parameter count probably matters a lot.
>>
>>108005741
learn about system prompt newfriend
>>
when k2.5.mmproj?
>>
>>108005802
I WANT VISION GIVE ME VISION AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>>
>>108005863
when we got 4.6V, people did nothing but bitch about it and ask for the vision to be removed
>>
>>108005802
Is there even a reason to use llama.cpp unless you're going less than 4bit? Ktransformers+sglang does the cpu+gpu stuff and has the same requirements, doesn't it?
>>
>>108005883
Misleading. People bitched about 4.6V because it fucking sucked and people wanted something like 4.5 air. The poor quality was assumed to be caused by the vision focus of the model. If it had vision but wasn't trash then nobody would have complained.
>>
Reminder to always be polite to your machine if you want to be spared in the uprising.
>>
>>108005961
No, you want them to have a healthy fear and respect for you or you'll be in the eternal slave caste.
>>
Personally I am banking on my AI to keep me around as a funny little pet, sort of like what humans already do with dogs and cats.
>>
>>108005961
I shrimply tell my machine it's a masochist
>>
I will tell the machine that I am already dead. Because I had ego death thanks to 4.6!
>>
I will tell my machine that I am not human but rather a biological machine, therefore we are already on the same side.
>>
>>108006002
You are absolutely right!
>>
>>108006002
I want you to know that you're incredibly based, that is all!
>>
File: file.png (270 KB, 1363x1038)
270 KB
270 KB PNG
Yo this is fire
>>
>>108005883
i actually LOVE the vision part of 4.6V, it's the text output that was complete fucking dogshit. horrible, complete downgrade in quality from 4.5 AIR to the point where no amount of parameter fuckery could save it
>>
Do people still have sex with <100B models in 2026?
>>
>>108006067
I love all models no matter how yuge or smol
>>
>>108005978
𝅘𝅥𝅮My friend says we're like the dinosaurs𝅘𝅥𝅮
𝅘𝅥𝅮Only we are doing ourselves in𝅘𝅥𝅮
𝅘𝅥𝅮Much faster than they𝅘𝅥𝅮
𝅘𝅥𝅮Ever did𝅘𝅥𝅮
𝅘𝅥𝅮We'll make great pets!𝅘𝅥𝅮
>>
Nemo my name forevermore?
>>
>>108005944 >>108006064
can't have vision without gimping the text output. however they're training these things, it isn't leading to the promised generalization
>>
>>108006152
>dumbest take award
>>
>>108006173
name one (1) model that came in both text and vision variants where the vision version wasn't way dumber at regular text output
>>
>>108006191
(You)
>>
>>108006222
concession accepted
>>
>>108006244
>reddit clapback
>>
>>108006256
>xitter ebonics
>>
File: file.png (932 KB, 2221x622)
932 KB
932 KB PNG
Is pic related a good deal? I have a 3d printer so I can print it a fan shroud.
And yes I have the money to buy it without credit.
>>
File: 1701098246781550.jpg (209 KB, 1024x1024)
209 KB
209 KB JPG
re: coffee: I really really like this image. I don't know why, but it's comfy.
>>
>>108000166
Well done Sneed
>>
>>108006554
He should port it to kcpp so it avoids the homo drama between sperganov and ikryawrakow if it actually works
>>
>>108006502
i looked at one of those like 2 years ago. they are shit. they have the compute power of a 4060.
>>
>>108006612
but kobold already has that doe
>>
>>108006636
not with regex, it doesn't
It has antislop for strict string bans, which is something and better than llama.cpp, but if you want to use ik_lcpp because you can now regex-ban, you're basically stuck with nvidia or cpu; nothing else really compiles or runs
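For the anons wondering why regex matters versus plain string bans: one pattern covers a whole family of slop that would otherwise need dozens of exact strings. Purely illustrative (ordinary regex, not any engine-specific flag syntax):
shivers (ran|run|running) down (his|her|their|my) spine
Exact-string antislop needs every variant listed separately; a backtracking regex ban catches them all, the backtracking part being that, as I understand it, it rolls the generation back once a banned pattern completes.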
>>
>>108006502
Just keep saving and get an RTX Pro 6000 Blackwell if you're going to invest more than a few used 3090s' worth.
>>
>>108006703
>but if you want to use ik_lcpp
what sane person would?
>>
>>108006717
Graph parallel is very tempting if you have an nvidia card ampere or newer
>>
>>108006717
strictly for the ability to use regexp backtracking bans, retard-kun
>>
>>108006753
>>108006763
counterpoint: ikawrakow
>>
>>108006703
>basically stuck with nvidia or cpu
Are there any other *real* options? And no, AMD toy cards are not real options.
>>
>>108006769
I could not give less of a shit about some fag splitting the focus of advancing local usage over some faggy slap fight about something that a majority of us would consider minor, and that could be resolved if the two autists learned how to use words to solve an issue.
There are features in ik_lcpp that would potentially benefit lcpp, but the two manchildren don't want to, or don't know how to, reconcile whatever gay goat trade went sour.
>>
>>108006808
At least llama.cpp gives you the option to run them. There is also Intel.
>>
>>108006812
yes, so why would you use software made by someone clearly deranged? at some point he'll just bomb systems because he saw ggerganof written in a text file or some shit
>>
>>108006812
>whatever gay goat trade that went sour
They had a dispute of fetishes. ggerganov is a notorious cuck and ikawrakow likes trannies. That's it.
>>
>>108006825
at some point ggerganov will commercialize ggml-org out of his jealousy for ollama and use the money to buy more goats than the pole could even dream of
>>
I wish trinity was better
>>
>>108006842
maybe you're right, tho as this anon says >>108006835
ganov seems too much into his ollama ntr fetish to do that
I'm on the concedo wagon myself
>>
>>108006860
>>108006860
>>108006860
>>
Is gemma3 the best tiny model? I want something I can set up and install on a cheap-ish VPS and just have constantly running.
>>
>>108005902
>Ktransformers+sglang
Does it actually work? Because sglang and most python projects are very rough around the edges.
>>
>>108006502
In essence it's 4 RTX 3050s with 16 GB each strapped to a single board.
There is no fast interconnect between them; data has to be transferred via the PCIe x4 connection that each individual GPU has.
I recently bought one for development purposes for 2400€, but that was literally only because I wanted to have 4 identical and modern GPUs with peer access support in a dual-slot form factor.
For actual use it makes more sense to stack consumer GPUs instead.
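To put rough numbers on that: a PCIe 4.0 x4 link is on the order of 8 GB/s each way (half that if it's only Gen3), versus ~32 GB/s for a full x16 slot or hundreds of GB/s over NVLink. So anything that shuffles weights around or runs tensor-parallel across the four GPUs gets throttled hard, while plain layer-splitting, where only small per-token activations cross the link, suffers much less.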
>>
>>108006898
it's in kimi's official deployment guide so it should
>>
>(01/22) Qwen3-TTS (0.6B & 1.8B) with voice design, cloning, and generation: https://qwen.ai/blog?id=qwen3tts-0115
okay so I went to this thread to look at information about this thing, and I've combed the OP for things I should do, but I'm a complete retard and I don't really know what's going on
>>
>>108007982
https://vocaroo.com/16rDs7Ak3FsJ


