/g/ - Technology


File: 1760328570709480.jpg (440 KB, 2048x1536)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107013301 & >>107003557

►News
>(10/27) MiniMax-M2 230B-A10B released: https://hf.co/MiniMaxAI/MiniMax-M2
>(10/21) Qwen3-VL 2B and 32B released: https://hf.co/Qwen/Qwen3-VL-32B-Instruct
>(10/20) DeepSeek-OCR 3B released with optical context compression: https://hf.co/deepseek-ai/DeepSeek-OCR
>(10/20) merged model : add BailingMoeV2 support #16063: https://github.com/ggml-org/llama.cpp/pull/16063
>(10/17) LlamaBarn released for Mac: https://github.com/ggml-org/LlamaBarn
>(10/17) REAP: Router-weighted expert pruning: https://github.com/CerebrasResearch/reap

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: threadrecap.png (1.48 MB, 1536x1536)
►Recent Highlights from the Previous Thread: >>107013301

--Ling 1T model performance and quirks in trivia and code generation:
>107020806 >107020827 >107021066 >107021155 >107021365 >107021454 >107021546 >107021605 >107021993 >107022036 >107022040 >107022084 >107022111 >107022357
--Skepticism about MiniMax-M2 model's benchmarks and compatibility issues:
>107019653 >107019674 >107019763 >107019697 >107019757 >107019773 >107020213 >107020483 >107019782 >107019813 >107019991
--Strategies for maintaining LLM coherence in long-form storytelling:
>107018853 >107018892 >107018909 >107018933 >107018965 >107019013 >107019083
--RTX Pro 6000 Max-Q performance and efficiency vs full 600W model:
>107019227 >107019264 >107019501 >107022133 >107019522 >107019618 >107023698 >107019452
--Evaluating Blackwell Pro vs RTX 5090 tradeoffs for AI/LLM use:
>107019130 >107019200 >107019209 >107019217 >107019239 >107019602
--Model safety issues with destructive command execution:
>107013537 >107013577 >107014153 >107014937 >107015249 >107015326 >107015750 >107016016 >107016264
--Techniques for enhancing LLM creativity through randomization and prompt engineering:
>107014306 >107014328 >107014387 >107014746
--GLM 4.5 Air model instability and quantization challenges:
>107013367 >107013388 >107015210 >107016881 >107016900
--DeepSeek AI's potential military/propaganda use and US regulatory response:
>107022232 >107022241 >107022304
--Halloween audio generation and AI's impact on publishing:
>107015928 >107016475 >107017726 >107017769 >107018045
--Silicon Valley's shift to open-source LLMs for cost efficiency:
>107022381 >107022444
--LPDDR accelerator trend for cost-effective inference in AI hardware:
>107023774 >107023803
--Ryzen 9950X3D memory configuration and bandwidth considerations:
>107016171
--Miku (free space):
>107013361 >107019491 >107019944 >107019963

►Recent Highlight Posts from the Previous Thread: >>107013303

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
File: 1750329660617701.jpg (2.18 MB, 2242x1541)
>>107025297
It's not a big boy datacenter gpu, it's just one of their recent mid-range offerings, an AGX Thor dev kit. Specs are similar to the recently released DGX Spark, but it targets some more hardware-related use cases which I work on, so I opted for one of these over the DGX.

This thing is like 75% heatsink. Feels very solid.
>>
Interesting podcast to have in the background while working
https://www.youtube.com/watch?v=01ybLOH1fnU
>>
>>107012869
>>107012883
I'm probably too retarded but I tried to do this.
>loading GLM at 6bpw with exps=cpu at practically 0 context loads just about 16GB into VRAM and 235GB into RAM
>GLM has 160 routed experts so one expert should be just around 235/160~=1.5gb in size
>7 of them are routed at a time to accompany the shared expert that should be in VRAM => ~10.5GB are used from system RAM per inference step
>"exps=cpu" loads ~16gb into VRAM ignoring kv-cache so all of that should be the dense/shared part of the model that's always used => ~16GB are used from VRAM per inference step
So the rough formula for the speed you can expect from GLM @ 6bpw should be 1/((16/bandwidth_vram)+(10.5/bandwidth_ram)) in this case, shouldn't it? If I plug in my numbers the result of this is quite a bit higher than the actual speed I'm seeing.
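Quick sanity check, plugging in the numbers above (the bandwidth figures are placeholders, swap in your own; treat the result as a best-case ceiling):
[code]
# rough upper bound on decode speed for GLM @ 6bpw with exps=cpu
vram_bytes_per_step = 16e9     # dense/shared weights read from VRAM each step (measured above)
ram_bytes_per_step = 10.5e9    # active routed experts read from system RAM each step (measured above)
bw_vram = 1000e9               # placeholder: your GPU memory bandwidth in bytes/s
bw_ram = 80e9                  # placeholder: your system RAM bandwidth in bytes/s

t_step = vram_bytes_per_step / bw_vram + ram_bytes_per_step / bw_ram
print(f"upper bound: {1 / t_step:.1f} t/s")  # real speed is lower: kernel overhead, kv cache, context depth
[/code]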
>>
>>107025510
>llms are "AI"
when the real AI goes on its genocidal spree, faggots like you will be turned into biofuel first.
>>
>>107025394
hi betifel I love you bby come to gujarat
>>
>>107025556
The "AI" culture wars are the weirdest identity politics meme I've ever seen.
It's a semantic argument, but yes, in my book anything that processes information in a semi-autonomous way and is artificial is AI. I'm reluctant to call language or writing AI because it doesn't really do anything by itself even for a little while, but I'd be willing to call, say, Pascal's adding machine a primitive form of "AI".
Since autonomy and generality of an information processing entity are an inherent part of "intelligence", I'd say a programmable calculator is more intelligent (and thus more "artificial intelligence") than a plain calculator. A calculator that does all 4 operations is more AI than a calculator that only does addition. A programmable calculator is more AI than a simple calculator. A computer even more so. An LLM even more so. And so on.
Yes, ASI if it ever comes about will be more "AI" than an LLM, but that doesn't mean an LLM is not a (comparatively) primitive form of AI.
>>
wait i though this shit was difficult to get running
you can literally just do
>pacman -S ollama
>ollama pull model
>ollama run model
and there it is. am i missing something here?
>>
>>107025666
satan trips wasted
>>
>>107025679
no they are perfect
>>
>>107025666
If you do this you'll get a retard quant with shit performance+your data uploaded to ollama servers.
>>
Mikubutt
>>
>>107025679
Seems appropriate.
>>
File: ComfyUI_02547_.jpg (1.91 MB, 1280x1856)
gyatt damn that OP image reminds me of the Dipsy images I generated a while back and turned the entire deepseek general into a konichiwa, dude! thread.

Dipsy is BWC only
>>
>>107025694
>+your data uploaded to ollama servers.
really? a search suggests its local
>>
Trying to make an LLM-as-a-judge pipeline with opencode. It seems to somewhat be working.


I have a chatlog with a code assistant in the log.jsonl file.
I want you to clean it up to make an improved version of the chatlog, with the mistakes of the assistant (with the role "gpt") seamlessly edited out to make it look like the assistant was more accurate than it actually was, based on the feedback of the user (with the role "human"). The purpose of this is to train the assistant to perform better.
Be careful though, the messages with the role "human" are not always human generated. Tool call results are also returned with the role "human" but were not written by the user, they were written by the agentic framework which executes the tool use calls.
So make sure the tool use requests stay matched with the correct agentic framework responses.
Make sure the behavior of the system is not misrepresented. Do not make simuluated tool use requests, stick to the tool calls that the agent actually made and got a response to.
Make sure to keep the system message intact at the bottom.
Make sure the conversation begins with a human message and ends with an assistant message.
Make sure the conversation alternatives between "human" and "gpt" without any two successive messages with the same role.
Basically make sure the json structure remains correct.

General guidelines

- Do NOT edit or further truncate the tool result calls.
For example, if the original was this:
{
"from": "human",
"value": "Tool executed: <tool>\n<tool_name>read_file</tool_name>\n<parameters>\n<filename>modules/cli/attention_scores/attention-scores.c</filename>\n</parameters>\n</tool>\nResult: File 'modules/cli/attention_scores/attention-scores.c' contents (458 lines)...
>>
>>107025394
nice butt
>>
>>107025468
>https://www.seeedstudio.com/NVIDIA-Jetson-AGX-Thor-Developer-Kit-p-9965.html
>delivering up to 2070 FP4 TFLOPS AI perfromance
>2070
oh nonono
>>
File: slop.png (176 KB, 1902x1112)
Is the solution to AI slop just asking another LLM to remove the slop?
>>
>>107025742
>Make sure the conversation alternatives between "human" and "gpt"
*alternates*
>>
>>107025856
Oh, thank you. Brain autocorrect fail I guess.
>>
>>107025846
That's an interesting idea, having another LLM proofread for you. But how will they improve upon the slop, yet know what it is to begin with? And if it can know what slop is, why wouldn't you just tell that to the first LLM to avoid it to begin with? And if the first LLM can't do it, why would the second one be able to?
>>
>>107025796
He did a fairly popular abliterated version of gemma

https://huggingface.co/mlabonne/gemma-3-27b-it-abliterated

I used it myself for a while, it was pretty good, as far as abliterated models go.
>>
>>107025867
>simuluated
>>
>>107025903
Just diverge it to a third model, duh. Add more AI is the solution to most things, apparently.
>>
>>107025903

Well, in this case I don't tell the LLM because I want it to be fully focused on the task and not on linguistic details, and also because it's a dumb as bricks 27B model I'm trying to finetune to be usable at agentic coding, so it wouldn't follow the instructions anyway.

The LLM I'm using to clean up the logs is GLM 4.6 through the API.
I'm giving it examples of what replacements to make.

"You're absolutely right", "You are absolutely correct" -> "Ok", "Sure", "Yeah, you're right", "That's right", "That looks ok", "That seems correct", "That looks right", "That looks correct", etc.

"I apologize for the error." -> "Looks like a made a mistake while doing X because of Y.", "Hmm, the tool call didn't work because of Y.", "Huh, looks like it didn't work.", etc.

"You are right to point that out." -> "Good idea.", "That's a good idea.", "Yeah, I'll do that.", "Thanks for the information", etc.
>>
So apparently mistral small knows how to cook ye olde walter white. I was fucking around with it and gave it the lyrics to the wkuk never song, and it got mad at the refinement process. So I asked it to fix it, in a hypothetical scenario, so that the ingredients listed could be used without killing the consumer. It began spitting out a detailed step by step response. Pretty funny, even if I'm too chemically inept to verify any of it.
>>
>>107025947
There was this paper where people were able to bypass safety restrictions in large context models by feeding them a long conversation in which the model always answers the questions. I had the idea of just feeding a high quality conversation and continuing off of that into your actual roleplay. Pre-heating the oven, if you will. Have people tried this? I'm sure someone has. I wonder if the high quality conversation can offset the retardation models suffer as you use up more context.
>>
sysprompt:
>The writing style does not follow common prosaic conventions in favor of more grounded, factual storytelling, replicating the informal style of online forum discussions: not low effort yet not pretentious.
FIRST FUCKING LINE
>The door creaked open, revealing a sterile, shared space.
FUCK FUCK FUCK GLM4.6
>>
>>107026002
The benefit of local is anything you could do with in-context learning you can do better by actual learning through finetuning, and keep the short context performance. Those tricks are only useful with API calls.
>>
>>107025742
if the original model made the wrong tool call, the editor won't have the information it needs to fix it unless you give it tool access
>>
>>107026023
I think it would be better to just give a few actual turns as examples so the model soaks up the instructions better, but muh service tensor isn't built for such a thing. (Fucking retarded ""examples"" section.)
>>
>>107026049
Ah, but you see? He asked the model to not misrepresent the behaviour of the system.
>>
>>107026049
If that happened then the user (me) would have given feedback to the model until it made the right tool call. So the cleaning up step for training consists mostly of removing the wrong responses that were corrected by the human through natural language feedback and only keeping the one that actually worked.
If llama-factory supported per-message masking I would just keep the original log and mask out the turns where the assistant underperformed, but alas, that is not an option without modifying the llama-factory source code which I don't want to do.
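For what it's worth, the masking itself is trivial if you ever ditch llama-factory for your own collator; with the usual HF convention, label positions set to -100 are ignored by the loss. A minimal sketch, with the chat template details handwaved and the turn format assumed:
[code]
import torch

IGNORE_INDEX = -100  # positions with this label are skipped by the cross-entropy loss

def build_labels(turns, tokenizer):
    # turns: list of (role, text, keep) tuples; keep=False marks the assistant turns to mask out
    input_ids, labels = [], []
    for role, text, keep in turns:
        ids = tokenizer.encode(text, add_special_tokens=False)
        input_ids += ids
        if role == "gpt" and keep:
            labels += ids                          # train on the good assistant turns
        else:
            labels += [IGNORE_INDEX] * len(ids)    # mask user turns and the bad assistant turns
    return torch.tensor(input_ids), torch.tensor(labels)
[/code]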
>>
>>107026087
it's the fact that it has no understanding of shit that bothers me. because of that the whole thing feels fundamentally flawed
>>
>>107025468
>5 gbps Ethernet
Doesn't the AGX Orin Dev Kit have 10 gbps? What are the hardware specs for that thing?
>>
>>107025551
As I said, it's a lower bound on the runtime so an upper bound on the speed.
The first problem is that the code is not 100% efficient.
The second problem is that you can only achieve that speed on a completely empty context, the runtime increases linearly with context depth.
>>
>>107026244
>>107025468
Never mind, I didn't notice the big 4x25gbps port on there. How much was it?
>>
>>107025468
Regardless of specs and performance that's a cool looking object.
>>
>>107026366
cool looking paperweight
>>
File: IMG_8482.png (3.36 MB, 1024x1536)
>>107025717
Witnessed
>>
>>107026666
im trying, satan, but it's so expensive
>>
>>107026666
check'd quad satan dipsy
>>
>>107026666
God damn super satan over here.
Might as well ask, how are the latest DS releases?
I only ever tried the original V3 and original R1.
>>
>>107025556
LLMs are more real than most cunts nowadays. frfr
>>
srs question for all the coomers/ERP people. Do you actually prefer to coom to ERP rather than just watch porn? I don't get it.
I've coomed to porn my fair share for more than 15 years now and I still would watch it over cooming to an ERP session.
Unless I'm missing something from ERP and I haven't experienced it properly
>>
>>107026802
it's less about stimuli and more about interactivity and reactivity
a video once made is always the same, but I can always come up with a novel scenario and act it out in a rp setting
>>
>>107026802
I use RP for scenarios not commonly found in porn. It's not a substitute for regular porn, but a supplement. If your only interests are things commonly found in porn, then your enjoyment with RP will be limited.
>>
>>107026802
What >>107026853 said.
The combination of the scenario you want and the interactivity, and even storytelling, is like nothing else.
I really should work on setting up an image gen model to add some visualizations to the whole thing.
>>
>>107026747
V3.2 was optimized for agentic use and coding. It requires some additional wrangling for role play. But it has an Anthropic compatible endpoint and can be interfaced with Claude Code. At 1/10 the price lol. Oh, and think and nothink are the same base model now.
V3-0324 was better for role play. But we all move along.
>>
>>107026802
the interactivity and being able to perfectly tailor it to what I want are big draws for me. porn is great for pure high-grade zero-effort neuron activation, but if you're a high-concept, cerebral, trope-subverting, bone-chilling A24 kino coomer like myself ERP is a much more interesting form
>>
>>107026882
A continuation of this RP after we installed hidden cameras in the bedroom and called the girl's parents asking for her to come over for another tutoring session. You're not going to find this on pornhub or anything like that.
>>
>>107026023
holy skill issue
>>
>>107026023
just write your sysprompt like a shitty esl forum post and itll work
>>
File: 1754417936209402.png (7 KB, 1024x1536)
Hmm.
So uh, I was trying to use GLM Air (q5) to make a slightly special-case download/scraping script. It kind of failed. Then I tried GPT OSS 120B just to see what would happen, and it just werked.
Damn. We gave it a lot of shit but it's not so bad sometimes? I mean it'd still be nice if it was less censored and slopped of course.
>>
>>107026802
it's nice to have full control over the erp scenarios you want and do stuff that usually isn't shown elsewhere
>>
>>107027108
>full control
I wish.
>We must refuse.
>>
>>107026802
My favorite coom artist takes like 4 months to make one 6 minute video
With LLMs I can take those same characters and scenarios and generate infinite variants on the fly
>>
>>107026945
>User
>asterisk formatting
>first person narration
>random chinese
>doing actions for user
enough, my fucking sides xD
>>
>>107025904
ah thanks anon
>>
>>107026882
>>107027136
ignore the rude anon, keep posting logs
>>
>>107026802
No, you're not missing anything. The people who get off to text are the same retards who before LLMs were in Discord rooms all day grooming underage boys and ERPing with each other.
>>
>>107027112
Part of the control is being able to try and bypass that or just use another model.
>>
>>107026023
Have you tried increasing the temp to 1.2? It really shines then and feels better for rp compared to the recommended temp.
>>
File: IMG_5946.jpg (48 KB, 500x500)
>>107025394
This image is allowed on a blue board where I’ve gotten global vacations for posting male nipples.
>>
>>107027181
That's why I only post using the proxy website.
>>
https://huggingface.co/inclusionAI/Ming-flash-omni-Preview
CHAMELEON BROS WE ARE FUCKING BACK
TEXT IN AUDIO IN IMAGE IN VIDEO IN
TEXT OUT AUDIO OUT IMAGE OUT
>>
>>107027181
no one likes faggots
>>
File: 1758920312862923.png (1.08 MB, 1024x1024)
>>107027227
>>
>>107027227
Holy fuck finally.
>>
Gemma sirs....
>>
>>107027227
105B I can actually run in FP8, but I have no idea if llm-compressor supports this model or how to create the recipe
>>
>>107027227
Please let it be good and please let the goof gods gift us precious goofs
>>
>>107027318
cant u download the FP16 model then run it with bitsandbytes int8=true or whatever the thing is nowadays
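Something like this is the usual incantation, assuming transformers even has the Ming arch wired up (big if for a fresh omni model), and int8 will be slower than a real FP8 checkpoint:
[code]
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "inclusionAI/Ming-flash-omni-Preview"    # may need a different Auto class / remote code
bnb_config = BitsAndBytesConfig(load_in_8bit=True)  # quantize weights to int8 on load

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",          # spill whatever doesn't fit onto CPU
    trust_remote_code=True,
)
[/code]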
>>
File: migu.png (35 KB, 814x999)
DeepSeek-chan?
>>
File: iu[1].jpg (64 KB, 474x266)
>>107027341
>>
>>107027328
it's BF16 (like fuck do I know what even is the difference between BF16 and Fp16 anyways lol)

I'll try then, but afaik int8 is slow and not very good quality?
>>
>>107027227
>actual omni
>it's based on ling
monke paw
>>
>>107027227
there's no chance this is going to be good, right?
>>
damn minimax-m2 is a piece of shit
>>
>>107027172
I said it when Alpaca happened, and again when Llama 2 came out, and I'll keep saying it: having to jailbreak your own model, on weights you run on your own hardware, is fucking retarded.
>>
>>107027480
I agree, but it's something you CAN do.
Or just change models.
>>
File: NIG.png (39 KB, 659x300)
>>107027341
oh my... kimi-chan...
>>
Kimi k2 vs glm 4.6 for creative writing? Anything better?
>>
>>107027227
Well, that's never coming to llama.cpp.
Maybe the text gen part of it, if that.
>>
new sloppa jus droppa
https://huggingface.co/Ilya626/Cydonia_Vistral
>>
>>107027438
how much shit?
>>
File: file.png (6 KB, 375x72)
>>107027542
die motherfucker die bye bye
>>
>>107027514
depends
>>
>>107027509
>>107027593
New benchmark discovered.
>>
>>107027843
Meant for >>107027341 and >>107027509
>>
Mobile users of ST, how do you manage long gen times? For example, I'm using a 2-bit GLM quant, so prompt processing takes 30-60 seconds, and then the reply comes in at about 4tk/s. It's usable, but if I navigate away from my phone's browser with the ST frontend open, the generation seems to hang.

I'm running ST on the same machine as GLM, so i just have the local IP and port for ST in my phone's address bar.

Is there a different way to set it up or to have it retrieve the response asynchronously?
>>
>>107028092
Also to add - even if my phone screen turns off (I have it set to either 30 or 60s of inactivity to save battery), the streaming response hangs. And no, I don't want to do something hacky like using an app that keeps my phone awake. That's gay.
>>
>>107028104
just disable the power saving option in your web browser so the tabs aren't suspended when the screen turns off
>>
Somebody is going to notice that using a deep fried jpg as context is the same as keeping a fixed size embedding as input. And then that person who rediscovered the encoder-decoder transformer architecture is going to present it as a great breakthrough to have "infinite context".
>>
>>107027227
>HIGH QUALITY IMAGE GENERATION
>look inside
>sloppy sloppa
Why do they always do this?
>>
>>107028092
I have the same issue with paid api running through a server. I can go to a new tab, but if I go to another app or the phone sleeps, I lose the gen. It's an ST and mobile thing, not just local.
>>
>>107028568
You can't automate creativity
The majority of untalented dipshits will use even the most powerful of tools to flood spaces with sloppa
Talented artists have been using AI since SD 1.X to make some pretty cool shit
>>
>>107028092
I wonder why they did it like that. Long polling is superior in every way and allows multiple clients simultaneously
>why do so many people have their own custom frontends
>>
File: file.png (430 KB, 2254x1452)
I'm trying to convince my new SWE job to sponsor my skunkworks project of running their entire codebase through a local language model. Maybe I can get them to buy me this :)
>>
>>107028863
It's not going to be yours, dumbass.
>>
>>107028863
For what purpose?
>>
>>107028887
I work remotely. By the time I leave, they'll have either forgotten about it or it'll be a paper weight. I get a $2000 office stipend annually so I can always use it towards this too.
>>
>>107028898
Cooming (on company time)
>>
>>107025394
I installed EasyDiffusion. My gens suxs. How do I into animu? Do I need more models? More LoRAs? Better prompt-fu?
>>
>>107028863
>1,000 AI TOPS
kek. AI is a unit now.
>https://www.youtube.com/watch?v=9ntPxdWAWq8
>How much capacity would HP's cloud uses have access to?
>1000
>>
>>107028914
>>>/g/ldg/
>>
>>107028914
You need /ldg/
>>
>>107028914
>How do I into animu?
https://novelai.net/register
>>
File: 1761389401659502.png (3.22 MB, 1264x2216)
>>107028929
Shoo shoo
>>
>>107028900
You are very pampered.
>>
>>107028963
>>107028928
>>
>>107029021
Feels good to be working at a fortune 100 company for a change :)
>>
>>107029069
Ahh yeah, you are that weird guy.
>>
>>107028863
My last company always supported shit like that because it was cheaper than raising wages. Somehow, people were more excited about random things like that than getting $100 added to their monthly paycheck
>>
>>107028236
Hmm, brave for Android doesn't have anything like that. Any other ideas? I've got another server in the house I could use as well - just looking for a way I can go and do something else while the LLM is making a response.
>>
File: 1745236808520206.jpg (963 KB, 1336x2008)
>>
kind of crazy to think about how ai is a solved science and with the next gen of nvidia chips and about a year of datacenter and power infra expanding we will be able to just use the current algorithms to create agi
>>
>>107029260
Why post this again?
>>
MiniMax-M2 gguf status?
>>
>>107029530
Worse than Nemo, don't bother.
>>
>>107029585
But it says "agentic" on the model card. So how's that possible?
>>
File: 1734548536727531.jpg (1.06 MB, 1336x2008)
>>
File: 1745908814203796.jpg (1.12 MB, 1336x2008)
>>
File: 1758146830843767.jpg (864 KB, 1336x2008)
>>
>>107025394
flatta
>>
Lazy question, but any "recommended models" for very basic coding, specifically lua, js/html? Screwed around a bit with GPT5 to test something and now I'm fixed and I'd like to continue.
Got a 4090, so I'd rather use that over CPU inferencing, if possible quality wise.
>>
>>107029161
>>107029649
>>107029655
>>107029660
Cyberpunk Miku is best Miku
>>
File: firefox_7Rxgqhs366.png (614 KB, 758x1124)
Presented without comment.
>>
>>107029762
A 4090 alone isn't nearly enough to run worthwhile coding models. If you don't absolutely need privacy, then pay for an API key, you'll save yourself a shitload of time and effort over trying to tard wrangle a small model.
If you do need privacy, then probably Qwen 2.5 coder 32b, or the newer 30b variant.
>>
>>107027136
Show the class what the right way to RP is, anon.
>>
What role do LLMs currently play in robotics? All these new robots like Optimus and the chink ones are coming out, and I had an idea how to get into robotics myself and wanted to learn more. I think Elon Musk and big tech's obsession with scifi is blinding everyone from what's right in front of us and always has been.
>>
>>107029800
The Qwen 3 Coder 30B is held back a lot by only having 3B active. It's good for simple autocomplete, but not much more. If he just needs chat, 235B would be worth a try if he has enough RAM.
>>
>>107029800
Again, this is for VERY basic stuff, so I don't need anything super complex. Think script kiddie if you will, beginner level stuff. I just need that to be functional though obviously. If 24GB isn't enough for that fair enough. At least now I know what to expect.
>>107029818
>235B
That bad, huh? Had a hunch, but better safe than sorry and ask before I do anything.

Thanks, to the both of you!
>>
>>107029825
Don't listen to them, get 30B at 4 bit quant (IQ4_XS) and check it for yourself.
>>
File: noble meeku.png (2.16 MB, 768x1344)
>>107029771
For me, it's noble Miku
>>
>>107029844
That's the idea, don't worry. Just not right now cuz bed is calling.
I've tried random LLMs and Image models for goon garbage, might as well give this a shot for the hell of it.
>>
>>107029844
He can run 30B Q8 easily with partial offload, my 3090 runs it at well over 30t/s.
>>
>>107029854
And pp of 50 tokens per sec, right? No thank you. Not a byte on CPU for me.
>>
>>107029863
No? It's been a little while since I used it, but pp was definitely at least 500t/s.
>>
>>107029869
Your pp seems quite slow.
>>
>>107029863
You should really try both and compare. Running a retarded model fast is counterproductive.
>>
>>107029877
I did and generally at 4 and larger it was largely the same.
>>
>>107029825
try the qwen3-32b dense model at q4, it should just barely fit. it's slower than the 30b-a3b model but it's noticeably smarter
>>
File: 1736056913898746.jpg (949 KB, 1336x2008)
>>107029849
royal miku
>>
my pp so hard rn
>>
my pp is so neutral rn
>>
regardless of warning your pp doesnt scare me at all
>>
>>107029882
It's definitely smarter. But not having a Coder variant makes it not ideal. Don't know why they didn't do one.
>>
>>107029882
>30b model smarter than 3b model
no shit
>>
>>107029916
This looks like eldritch miku wearing the skin of noble miku
>>
>>107029933
just tell it to write code lol
>but the filename doesn't have "coder" in it
>>
>>107029996
Do you know why some of the files have "Coder" in it? It's not because those are the only ones allowed to write code.
>>
>>107025468
That’s a decent overall package for the price, 60% the FP4 of a 5090 for only 1k more, with 4x the memory to boot. Does being shitty lpddr5x hurt it, though? Obviously it would be ass for gaming but does a low wattage GPU and slow ass memory matter for AI?
>>
File: 1759528463502848.jpg (915 KB, 1336x2008)
>>107029951
>>
>>107029951
>wearing the skin of
shame theres no tag for this. maybe you could get close with a sort of non surgical face mask. like shes wearing someone elses face. youd probably have to inpaint it
>>
holy shit k2 (not local version) absolutely fucking mogs on gpt5 and gemini 2.5pro for making experimental electronic prompts for suno
https://suno.com/s/tr4JGFPxO7Mo8wHm
"local" won
I also fed it the prompt for this 4o song (including the lyrics one not displayed)
https://suno.com/s/jUk50aWDBdHb76wU

and then asked it to mate the two song concepts together into something new
https://suno.com/s/lKIpynPKOZ3goIFz
interesting emergent capabilities
>>
File: 1761495895836594.jpg (89 KB, 700x700)
Can someone teach me
How do i use local deepseek v3.1 on windows 11
Sorry if i have 0 knowledge on LLM
>>
>>107030105
surprised there's not even a loose skin tag
>>
https://huggingface.co/inclusionAI/Ling-1T/discussions/15
>>
>>107030355
gm sir I am microsoft tech support technician engineer.. kindly tell me how much ram and vram you got sir
>>
>>107030377
Ram 16gb
Gpu rtx 3070
Very old yeah
I will upgrade later in 2026
>>
>>107030387
thank you for the info sir.. . unfortunately your computer is two week for deepseek 3.1. . unles you got datacenter ssds. I can suggest you alternative models if would like to
>>
>>107030362
>Llama 3.3 instruct gave "You'll Cowards Don't Even Smoke Crack", which AFAICT is the correct answer.
Correct answer to what question???
>>
>>107030416
Which song is rapper viper best known for?
>>
>>107030355
I am summoned.
https://rentry.co/DipsyWAIT
>>
>>107030084
I like this Miku
>>
What's going on in LDG?
>>
>>107030459
sir I have bad news for yuo. .. you need moare ram if you wish to follow /lmg/ meta, kindly max out ram with 192gb fastest mt/s and get better cpu. .
the best model you can currently run is mistral nemo, best modal for very poor. do you wish to get instructions on how too install nemo sir?
>>
>>107030084
>why, miku, what big... tentacles you have...
>>
>>107030506
petra discovered the sharty fae prompt and set his bot on them. give it a week and the same thing will happen here.
>>
>>107029792
nice
>>
>>107029792
That's a lot of em dashes and markdown—just like Ling 1T
>>
>>107030506
OOF, that's one hell of a meltie!
>>
File: sd1.2.png (589 KB, 962x647)
>>107027227
most generic image ever BUT it's benchmaxxed on text!!!
>>
>>107031089
they all got mind broken when dalle3 released and never recovered
>>
>>107031089
>Prompt: Sunlit
>Image is at night
And they cherrypicked this image? Oh no no no...
>>
>>107025394
So the fattest MoE model available right now is Kimi K2 with 1T parameters.
Surely if I now put together a machine with 1.5 TiB RAM there won't be some cunt releasing a 4T MoE model just 2 weeks later.
>>
How can I estimate how much VRAM each layer takes in a quantized model? Is it basically just size of the weights / number of hidden layers?
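i.e. I'd naively do it like this (made-up numbers, ignores kv cache and runtime buffers):
[code]
# naive per-layer VRAM guess for a quantized model
file_size_gb = 20.0     # size of the gguf on disk (example value)
n_layers = 48           # from the model metadata (example value)
embed_share_gb = 1.0    # embeddings + output head, not repeated per layer (rough guess)

per_layer_gb = (file_size_gb - embed_share_gb) / n_layers
print(f"~{per_layer_gb:.2f} GB per layer")
[/code]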
>>
what is the proper way to use the -ot parameter in llama.cpp? i think it works differently now, do i need to use the --cpu-moe flag too? before or after the -ot flags?
>>
do you think that if you combined predictions from different models the slop would go away, given that you handle tokenizer mismatch? For the same text, different models have vastly different distributions; many valid tokens are considered almost impossible. Maybe something like setting each logit to the minimum value across all models' outputs would work?
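Something in this spirit, assuming you've already mapped everything onto a shared vocab (which is the actual hard part):
[code]
import torch

def ensemble_min(logit_list):
    # logit_list: per-model logits already aligned to a shared vocab, each of shape [vocab]
    stacked = torch.stack(logit_list)         # [n_models, vocab]
    combined = stacked.min(dim=0).values      # a token survives only if every model allows it
    return torch.softmax(combined, dim=-1)

# toy example with a 5-token "vocab"
probs = ensemble_min([torch.randn(5), torch.randn(5), torch.randn(5)])
next_token = torch.multinomial(probs, 1)
[/code]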
>>
>>107031459
real men use -ot, fags use -ncmoe
>>
File: Zaiglyph.jpg (34 KB, 640x326)
uhh ohh local bros
ze future is vision
guess we'll have to buy 4 BBCwell GPUs at last...
>>
>>107031459
-cmoe and -ncmoe are just shortcuts for -ot.
>>
>>107031552
>text
>images of text
>books with realistic page physics
>simulated school environment
lecunny was right all along
>>
>>107031552
I fear for the future of lllmao.cpp, vision is barely supported there :(
>>
>>107031596
If LLMs fully take the vision route, it's going to be too slow to offload, so we'll all switch to exllama/vllm anyway.
>>
>>107031596
Making vision necessary is going to motivate development.
>>
>>107031552
That seems extremely retarded, there's no way images of text work better than text.
>>
>>107031622
>he didnt read the deepseek-ocr paper
cringe
>>
>>107031639
>changes the font
>>
>>107031620
lol
>>
>>107031639
oh damn, a paper said so?! revolutionary new Pareto paradigm unlocked!
>>
File: BEEPDING.png (2.05 MB, 768x1344)
>>107031622
Read the DeepSeek OCR paper again. I also don't see why it's not intuitive; input bloat is huge in text. Learning from letter shapes seems more intuitive than figuring out different tokenizations of the same word. Imagine how many neurons were spent learning that word, WORD, Word, worD, and W-O-R-D are the same word. Misspelled words also work better with visuals
>>
>>107031680
>Learning from letter shapes
yeah let's maybe hopefully try a whole new way of training just to ace strawberry, at the cost of everything else, sounds lovely
>>
>>107031620
Qwen-Next is well over a month old and still isn't supported, and that's a regular text model.
>>
File: G3z4SdjXgAAJic4.jpg (97 KB, 1200x480)
>>107031659
retard
>>107031697
At the cost of what? LLMs don't understand raw text, the bloat from figuring out tokenization on the LLM's side is huge, whereas vision is pretty straightforward and well optimized. I wouldn't be surprised if it takes fewer resources in the end
>>
>>107031731
Yeah. New attention mechanism taken on by a vibe coder. Whoddathunkit.
>>
>the deepseek ocr paper
that anon shilling colpali/colqwen and open source RAG frameworks like morphik half a year ago might have been onto something...(yes that anon might have been me)
>>
>>107031731
Just use exllamav3, what are you, a vramlet?
>>
>>107031797
>just rammax for the huge moes bro

>few weeks later

>are you vramlet lamo?
>>
>>107031797
If I wasn't then I wouldn't care about an 80b moe
>>
If I make enough predictions, I'll eventually be right about something, even if I have to reinterpret my original prediction. Or what actually happened. Or both.
>>
>>107031810
You're absolutely right!
>>
>>107031820
You're absolutely right!
>>
File: got you.png (2.01 MB, 768x1344)
>>107031804
>>
>>107031748
The problem is that every single token is rich in semantics and gets encoded by the LLM as a high-dimensional vector. How does one train an OCR model without involving semantically-rich text tokens, e.g. making the model understand what the written word "dog" actually means, regardless of style/font/etc? I'm genuinely curious, because I've been looking into it and I don't think it's easily solvable.
>>
>>107031882
Fuck off and read the paper already.
>>
>>107031613
cpucopemaxxers were always fucked, but exllama/vllm don't support anything but the most recent nvidia cards
>>
>>107031882
Current models don't understand what a dog is either, they just know which patterns of words should be used when asked to describe one.
>>
>>107031896
I'm talking about a hypothetical model that doesn't use text tokenization at all. OCR models are trained with image-text pairs.
>>
>>107031697
>let's maybe hopefully try a whole new way of training
That's how you make progress: try new things, see if they work. You can’t just scale the same shit with more slop and then call war rooms when it doesn't work anymore
>>
Are there any AI as good as Gemini at editing images that can make naked anime characters?
>>
>>107027133
being into vore is a curse, especially if you aren't a furry at the same time. there's so little content, most of it is pretty bad, and there are so many subcategories that there's only a very small amount I actually like. it's very difficult to search for, and I often have to just ignore/alter aspects I don't like to enjoy the parts I do. LLMs aren't great at it but I've made some stuff I like with one
>>
File: 00274-563513342.png (844 KB, 896x1152)
>>107031919
I'm convinced that diffusion models do. They can learn pretty wild concepts like transformation into inanimate objects or equirectangular projection
>>
>>107031927
In the llama team's defence, at least their incremental improvements kept landing. They didn't have the skills to make real progress, and the war rooms forcing them to do "deepseek, but better, and bigger, and with omni too, but super super safe" is when they imploded.
>>
>>107031987
What improvements? MHA to GQA and GELU to SwiGLU? Huge fucking difference
>>
>>107031459
you are in luck lol: https://www.reddit.com/r/LocalLLaMA/comments/1oi7k25/hf_space_to_help_create_the_ot_flags_in_llamacpp/
>>
>>107032037
actually yeah, it was, compared to the 100th paper that never gets implemented, muh BLT titan ocr with 9billion retnet contexts
>>
>>107031978
They are trained on text-image pairs, using a text encoder. What they won't do is learn semantics, abstract concepts and long-range word relationships purely from images containing text, but I'm looking forward to being proven wrong (imagine if a language model could be trained just on raw scanned books and nothing else, without labels / human supervision).
>>
>>107032070
Trying something revolutionary from their own papers would have been a lesser waste of resources than Llama4
>>
>>107032100
>What they won't do is learning semantics, abstract concepts and long-range word relationships
that's bloat anyway just use rag for that
>>
>>107032110
>but llama4
papercels so predictable
>>
>picrel

Fucking finally, people will stop complaining about ST settings. Chat completion doesn't use any of the text completion stuff
>>
>>107032156
2023 called, it wants its retardation back
>>
>>107032165
Wow you're only 2 years old and you're posting on 4chan?
>>
>>107032171
my guy really doesn't understand how instruction formatting works, how utterly embarrassing
>>
>>107032156
Where did that idea come from?
Chat completion is the least error prone way (in theory) to use the models, since you don't get the chance to fuck around with the chat template at all. Although you can still do things like insert messages with the system role where they shouldn't be.
>>
>>107032204
>Where did that idea come from?
early bad jinja templates
>>
>>107032156
I'm not sure, but at least it's a sure way to get all the chat template stuff correct.
>>
>>107032156
>this is the guy telling you local models are shit btw
>>
>>107032221
>a sure way
A sure-r way at least. The GGUF can still have a fucked template.
See, unsloth.
>>
>>107031552
>>107031622
>>107031748
See >>107028332
It's retardation by people who don't understand ML
>>
>>107032156
NO SIR CHAT COMPLETE PERFECT FOR GOOD LOOKS
>>
>>107026802
Real porn is boring and nasty. I grew up with shitting dicknipples and only AI will reliably provide me an infinite "free" supply of first person multiple character hyper macro herm taur goredeath hyperscat anal vore shitjerks.

>verify you are human
Haha yeah... yeah...
>>
File: G4UBM8BasAIvwQm.png (359 KB, 1290x1406)
Alright guys hear me out what if instead of shape-rotating models vs wordcel models we did both?
>>
How do you guys deal with some models/chars in ST having inconsistent use of quotes/asterisks?
Do you just put something in the character card to force the use of markdown * and " for dialogue and narration?
>>
>>107032282
>still using asterisks in almost '26
jesus christ
>>
>>107032282
Generally you just include that markdown in the opening message and the model should adhere to it.
>>
>>107025468
This one gets my attention for a few reasons:
A) Proper nvidia support
B) Very low power usage relatively speaking
C) Good memory size (even if the bandwidth isn't great.)

I think it'd make for a decent off-grid smart-home LLM server. That's my goal for it anyway if I ever get around to buying one.
>>
File: dipsyByzantine3.png (3.44 MB, 1024x1536)
>>107031680
I really like this gen.
Agree, mostly b/c we don't know exactly how human memory is stored, but we do know it's not strings of text. Machine memory for these LLMs should eventually take a shape that wouldn't make any sense to anyone but a machine. More of a black box, like the one that makes up the LLM's matrix.
We'll see if anyone is able to scale it.
tmw forever.
>>
>>107031680
>learn 10 ways of spelling a word using groups of letters
>vs
>learn 10000 ways of writing a word using pixels
>>
File: 17370157794947.jpg (408 KB, 593x781)
>>107031591
>lecunny
rope yourself.
>>
>>107032286
what is wrong with that? even ST in the tts settings suggests that usage.
Do you just use plain text for narration?
>>
>>107032375
>Do you just use plain text for narration?
obviously?
the novel vs markdown formatting debate was had years ago
>>
>>107029812
They don't. LLM is for text conversations, autocomplete, "chain of thought" and that kind of thing. It's good for understanding human interaction, i.e. you tell the robot "Shake hands" and give the LLM a list of commands it can issue to the robot, and it'll figure out if any of those commands can do that task, but that's it, and it's pretty much optional. The rest of puppeteering the robot is handled by any number of other AI systems that aren't LLMs.
>>
>>107032286
Most voiced Japanese visual novels can largely do away with narration, or make it a minimal part of the story. I would like LLMs to be capable of roleplaying in that style.
>>
>https://arxiv.org/abs/2510.13928
>LLMs Can Get "Brain Rot"!
When will labs learn that if they just feed slop they are hurting the models? They are probably leaving out tons of useful content just because they find it dangerous.
Just 1M tokens of slop out of 15T causes severe brain damage
>>
>>107027227
So, has anybody managed to toy with this thing yet?
>>
>>107032474
we know, discussed already, next paper please! also this >When Bad Data Leads to Good Models https://arxiv.org/abs/2505.04741
>>
>>107032495
It's 100B with full gguf support likely never coming. cpumaxxers and vramlets can't touch it and that rules out nearly everyone except maybe the guy stacking RTX 6000s.
>>
>>107032474
Junk paper with very flawed methodology.
>>
>>107032556
stfu, paper never lied to anything
>>
>4090
>16gb ddr4
I'm pretty much limited to dense models that fit on 24gb vram, right?

Let's say in the new year I upgrade to an am5 CPU/mobo and 256gb ddr5, what should I run to maxx out my system, i.e. some sort of MoE that has a dense portion big enough to use 24gb vram and a bunch of experts that can fill up all those dedicated wams?

Following that, say I hypothetically replace the 4090 with a 6000 pro, is there anything that can utilise both 96gb vram and 256gb ram?
>>
>>107032566
>more vram than ram
holy brainrot
>>
>>107032384
Models like to interject emphasis on *words*, but I bypass this by using asterisks for sound effects and onomatopoeia.
*plaplaplaplap* really does it for me in terms of immersion.
>>
>>107032572
It's a ship of Theseus: when I got the GPU a new gen of CPUs was right around the corner so I wait chadded; now in the new year I go full big dick on 9k series Ryzen and ddr5

Any input on my questions?
>>
>>107032519
I'm aware, but I also know some dudes in here rent cloud hardware to play with new models, which I'm considering doing this time round too.

>>107032566
>is there anything that can utilise both 96gb vram and 256 ram?
The answer is always yes.
Granted, the dense part of most MoE models is pretty small relative to the full thing, but putting more experts in VRAM will always yield some speed gains.
With those specs, I guess you'd be running the big qwen MoE and GLM 4.6 mostly.
>>
>>107032577
I antislop ban asterisks altogether.
>>
>>107032606
at 96+256 he could even run deepseek at like some magic ik llama iq3 quant.
>>
>>107032638
True. That would be worth trying, actually.
>>
>>107032606
Any models that do better at coding, and any that do better for general knowledge? Anything truly uncucked?

From my experience the cuckold models are actually pretty good for coding but shit for anything else and vice versa, but I haven't tried any of these new fancy MoEs
>>
>>107032566
>Let's say in the new year I upgrade to an am5 CPU/mobo and 256gb ddr5, what should I run to maxx out my system,
You'd be better off, but consumer stuff being stuck with a couple of channels and mostly limited to 128GB makes it an expensive dead end. Either go full retard with a threadripper/epyc/xeon build with 512gb+ and all ram slots populated or keep waiting for a dedicated, highmem inference card to come out.
>>
>>107032565
Too much of one thing always hurts models.
Continued pretraining of instruct models on non-instruct data, especially adversarial data (short-form and "toxic"), will damage most post-training work done by the labs that trained the models, including long-context performance, "safety", RLHF.
Alpaca is an ancient shitty SFT dataset for attempting to restore any sort of performance or the so-called safety lost to continual pretraining.

The paper is just propaganda by would-be AI ethicists with an axe to grind against Musk-owned X/Twitter.
>>
>>107032729
>a dedicated, highmem inference card to come out.
Surely, if we wait long enough China will mass produce one.
>>
>>107032745
>Surely, if we wait long enough China will mass produce one.
It might not be China, but its too big a market opportunity for everyone to ignore. I'm saying that as a cpumaxxer.
>>
Shameless Jensen is trying to spam the DGX Spark on /pol/
>>>/pol/520010813
>>
>>107032729
Would I be better off just getting a 9900x3D with 128gb ram and saving the money to vramaxx on a Blackwell pro card then? Going the threadripper route and spending that much money on ddr5 feels like I would be getting less bang for my buck at that point considering my main use case is AI
>>
>>107032813
> but its too big a market opportunity for everyone to ignore
Is it?
I understand why Nvidia and AMD still make gaming GPUs. Gaming is the biggest entertainment medium currently, and it's a way to diversify even if making the shovels for the AI companies is making a ton of money, but I don't see how the same applies to home AI usage.
Anybody getting into the business of making AI ASICs has no incentive to make stuff for us instead of for Amazon, Meta, etc, I think.
>>
So where do I try Ming? I want to see what happens when chameleon gets scaled
>>
>>107032813
>>107032896
It will be Apple. They already use AI inference workload as a marketing point for their M5 chip. They'll put out some $10k M5 Ultra device and it will prove the market exists.
>>
>>107032896
yeah I don't think it is. home use is a speculative market that companies are feeling out, but there is absolutely no incentive for any of them to go all-in on it at this point in time
>>
>>107032931
They won't make an ASIC, but their architecture is the one consumer oriented hardware that works for larger models, so that makes sense actually.
The compute of their SoC's will continue to grow.
>>
>>107032866
>I have voice to text based LLM that transforms my commands into text for deepseek to create a comment on /pol/. For example I'll say "btfo this dumb schizo" and deepseek abliterated version will write a good comment to post here.
>
>I run additional userscript that takes LLM outputs and puts into text boxes. I just have to solve captcha myself.
all that effort to shitpost on the israeli bot board and the retard doesn't even know how to install a captcha solver
>>
>>107032980
Sirs please be kind Jensen bought the very best shills sir.
>>
ok I'm going all in on ERP. I can self host glm-4.6 at q5-6 and I would say to complete the experience maybe image gen and/or tts would be cool. I have the tts part solved: I created a whisper compatible voice server for vibevoice, cloning the voice of the character, and the latency is quite good on a 5090 (I generate each paragraph individually to speed it up, and only the quoted parts).
But for the image gen, I have no idea. Any pointers?
>>
>>107033001
Get a suit and a haircut. Zoomers nowadays are open-minded; they're down for anything in bed: BDSM, cosplay, anal, cucking. No need to fap to generated content.
>>
>>107032962
Yeah Apple is the only company I can think of that
1) makes hardware that can do inference
2) has a consumer electronics focus
3) is not in the data center business
Everyone that has data center biz *loses* when people have AI at home. There is no reason for nvidia or amd to sell you a box to have AI at home when they can get fat data center sales margins. There is no incentive for anyone with data center sales to bother with consumer facing stuff until the consumer stuff starts eating into their profit margins. Then you'll see nvidia suddenly put out an AITX 1080ti or whatever.

Also good luck to people in the US who think they will ever be allowed to purchase a Chinese AI device. Your only hope is apple.
>>
>>107033057
>Also good luck to people in the US who think they will ever be allowed to purchase a Chinese AI device. Your only hope is apple.
The entire Chinese economic philosophy is to overproduce and then produce some more and then flood foreign markets to drive their competitors out of business. They have zero reason to block sales to the US. Probably the US will eventually impose tariffs or block imports, but that usually takes them a few years. Should be enough time to load up.
>>
>>107033057
>There is no incentive for anyone with data center sales to bother with consumer facing stuff
And crucially, there's very little reason for a new entrant in the market to aim at consumers instead of big business/data center, which reinforces point 2.
So yeah, who'd thunk that AI would be the thing that could get me to buy an Apple product.
Here's hoping they deliver.
Or that somebody figures out a way to use an external GPU with the M series Macs. That would be a decent second option.
>>
>>107033001
ok, I connected ST to comfyUI and changed some variables to use a custom prompt, but it seems like comfy just gets fed each individual response? Any way to feed comfy a proper prompt that makes more sense for the scene occurring?
>>
Is anyone keeping track of prompt processing and t/s on all these unified memory architecture machines?
>>
>>107033079
I sure hope that's the case. Although it's a tricky situation since the entire US economy is now all-in on the idea of 4 billion people paying scama $200 per month. The existence of an AI in a box device, even if legally banned from import in the US, would probably cause the US economy to implode.

It's basically like if, in the 70s, some guy was trying to make a gigantic mainframe that everyone in the world would pay to use. Then the "personal computer" comes along. I think that "Personal AI" that you can own is as big of a deal as the PC. We're not quite there yet, but we're only a few years away from the equivalent of the 90s PC revolution.

It was monumentally stupid to go all in on the megascale data center buildouts. People are so greedy they didn't even try a different paradigm because the existence of it threatens their revenue model.
>>
>>107032871
The problem with vrammaxxing is that to do sota models you'd need to spend $50,000+ for even an "acceptable" level quant and context.
VRAM is faster than RAM, but the price/performance doesn't scale anywhere near linearly. You can easily pay an order of magnitude more for VRAM vs RAM, whereas a proper cpumaxxing build is maybe 1/2 the speed for 1/10th the price.
So the question to ask if you're looking to run 1T-class SOTA models is: speed costs money, how fast do you want to go?
>>
>>107033281
>a proper cpumaxxing build is maybe 1/2 the speed
mother of all cope, try 1/20
>>
Seems like
>the character can't speak English
Is pretty hard for many models to handle properly.
>>
>>107033211
seems like the best approach is to put this somewhere in the prompt:

OOC: generate a prompt to describe the scene for a text to image model

And then doing /sd with whatever prompt the model gave you.
Unless there is a better way
>>
>>107032866
> /pol/
You should not invoke that place.
>>107032980
> retard
> that board
So really, no surprises.
>>
>>107033340
Are you writing this as is here? Or do you actually specify which language they should speak? That's a pretty big difference.
>>
>>107033563
>> /pol/
>You should not invoke that place.
>I don't recommend kobold
we know blacked-dev we know
>>
>>107033689
The issue was making the model understand what "barely knows English" meant. Much better after I specified the region of origin.
>>
>>107033751
>everyone that disagrees with me is the same person
>>
>>107033890
>everyone that calls me out for samefagging is the same person
>>
File: xwing.jpg (200 KB, 1480x736)
>>107034065
Everyone is anon and therefore the same person because they are the other and there is only me and I only know myself and my mind not the other anon which are all the same anon
>>
We need a coherency detector LLM to filter out the nonsensical spam at /sdg/
>>
>>107034095
>Everyone is anon
>llama.cpp CUDA dev !!yhbFjk57TDr *exists*
>>
>>107034114
CUDA dev is the only valid name/trip fag. Drummer and scablicker or whatever are the problem
>>
>>107033281
My thought here was, upgrading to a Blackwell 6k would be a straight and easy upgrade route that I probably won't replace for years; on top of that I could sell my 4090 and break even after owning it since launch.

If I go and get a threadripper and a gorillian ddr5 wams, then ddr6 comes out some point next year or early 2027 at the latest, and it's a whole new CPU/mobo/ram buy, and selling all of those old parts will be a real ballache as well as taking a loss on them.

As opposed to just spending a reasonable amount on a new CPU/mobo/ram combo for under a grand: upgrading in a year or two won't sting as much, and triple digit value parts I'll happily donate to friends and family
>>
>>107032871
Forget Threadripper, for some reason /g/ never notices ktransformers runs best on Xeon Scalable.

The choice is between a 1TB Xeon Scalable with AMX and an x090 and a 128 GB Blackwell Pro. If you want to run 1T LLMs get the Xeon Scalable, if you want to rip through image gen or smaller models get the Blackwell Pro. A refurb Xeon Scalable will be almost half the price.
>>
>>107034239
Does AMX help at all if you are doing PP on a GPU?
>>
>>107034263
i do not recommend putting your pp on the gpu, you might get burns or electric shocks
>>
>>107034272
I might have misunderstood what it means to "fuck the AI" then.
That explains why I never see anybody complaining about cum on their GPU fans.
>>
>>107034263
>Does AMX help at all if you are doing PP on a GPU?
Yes, inference still needs some processing power. Threadripper needs lots of cores, whereas the cheapest Scalable with AMX will do.
>>
>>107026802
It's completely different, if you're doing a good ERP then it's just insane, like you're living there
>>
can we get the spambot running here? dead general is no fun
>>
>>107034526
the designated shitting street is down that way /aicg/
>>
>>107034526
better dead than slopped
>>
>>107034526
Sorry, I'm at work
>>
>>107034554
we're both doebeit
>>
>>107034515
teach me the ways anon
>>
>>107034587
Does your boss know you're here?
>>
>>107034667
I haven't been fired yet so probably not
>>
Is there a <50B without spiral sense errors yet, or is this always going to be a >100B luxury even into 2026?
>Spiral sense errors?
Knowing and remembering where things and people are, and positioning, correctly.
>>
>>107034750
Qwen 30B thinking is decent at that if you prefill the thinking with explicit instructions to consider that kind of thing.
>>
Does the active expert size in MoE models correlate with generation speed?
I tried running GLM-Air with RAM offloading some time ago and got like 5 t/s; wondering if the new Qwen3-Next-80B will be better when it gets llama.cpp support.
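Back-of-envelope it should, since RAM-offloaded decode is mostly bandwidth-bound: speed tracks how many bytes of active (routed) weights get streamed per token, not the total parameter count. A rough sketch, with every number an assumption to swap for your own hardware and quant:
```python
# Rough ceiling for RAM-offloaded MoE decode: bandwidth / bytes streamed per token.
# All figures below are illustrative assumptions, not measurements.

def est_tps(active_params_b, bytes_per_weight, bandwidth_gb_s):
    """Upper bound if every active weight is read from system RAM once per generated token."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return bandwidth_gb_s * 1e9 / bytes_per_token

BW = 80     # dual-channel DDR5, roughly, in GB/s
BPW = 0.56  # ~4.5 bits/weight, Q4-ish quant

print(est_tps(12, BPW, BW))  # GLM-4.5-Air, ~12B active params -> ~12 t/s ceiling
print(est_tps(3, BPW, BW))   # Qwen3-Next-80B-A3B, ~3B active  -> ~48 t/s ceiling
```
Real numbers land below the ceiling (attention, KV cache, whatever stays on GPU), but the scaling with active parameters is the point.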
>>
>>107034750
It's tragic that you essentially need 700b to get the Cumulative Space values right. Hope we see a model that can do it sub-200b soon
>>
>>107034792
Fuck it, I'll try it. Give me your prompt for that if you can spare it.
>>
>>107034833
Qwen is the driest mofo around for RP if that's your use case, just letting you know before you waste your time.
>>
>>107034852
That was my use-case, damn.
>>
>>107034833
Fuck, it's been a couple months since I played with that.
What I did was essentially write a system prompt with instructions to think about the space, then the position of each entity (people, objects, etc) in that space, plus some other stuff.
Then I took the thinking process it generated naturally, tweaked the shit out of it, incorporated the instructions of the system prompt in first person (I'll consider, x, y, etc) and used it as a prefill.
I might have some old silly chat with that still in the history, I'll see if I can find it.
Also, anon is right, it's dry as a bone for ERP.
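Roughly, the plumbing looks like this. A minimal sketch only, not my actual prompt (the prefill wording here is invented), assuming a llama.cpp server on localhost:8080 serving a Qwen ChatML-template model with special-token parsing enabled:
```python
import requests

# Hypothetical spatial-tracking prefill: we open the assistant turn ourselves and start
# the <think> block, so the model continues "its own" reasoning from our scaffold.
SYSTEM = ("Before every reply, silently map the scene: the space itself, then where each "
          "character and object is, then who can see, reach, or hear what.")
PREFILL = ("<think>\nI'll start by listing the space, then each entity's position and posture, "
           "then what changed since last turn.\n1. Space: ")

history = "<|im_start|>user\nAnna ducks behind the couch while Marcus checks the window.<|im_end|>\n"
prompt = (f"<|im_start|>system\n{SYSTEM}<|im_end|>\n{history}"
          f"<|im_start|>assistant\n{PREFILL}")

# llama.cpp's /completion endpoint does raw continuation, so the prefill isn't re-opened.
r = requests.post("http://127.0.0.1:8080/completion",
                  json={"prompt": prompt, "n_predict": 512, "temperature": 0.7})
print(PREFILL + r.json()["content"])
```
In Silly the same thing is just the system prompt plus a "Start Reply With" prefill; the script is only to show where the pieces go.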
>>
>>107034869
nta but imo it's not that bad, the 2507 versions are... passable. not a lot of RP chops but they're competent enough now. ymmv.
>>
>>107034892
One thing I didn't try but thought of was trying to steer its prose by forcing it to think in a specific "voice" or style.
>>
>>107027136
It says "user" because I don't want to post my name on 4chan. I literally changed it just for the screenshot. I don't see what's wrong with using asterisks. I want to talk in first person because I want to feel like I'm there, it's me doing it. And Ling Ling doesn't speak English, that's why I had my Chinese servant tell her that we're going to play in my room rather than me tell the girl herself. And I don't see where the AI is doing actions on my behalf.
>>
File: 1754596167579622.png (113 KB, 614x189)
>>107033211
Go to the image generation tab in extensions and tick picrel, then you can change the prompts in the image prompt templates. Interactive mode sends the free mode prompt to your LLM when you type 'generate an image of x' or something similar, with "{0}" being your raw prompt.
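For example (not SillyTavern's exact default wording, just to show where {0} lands), a free-mode template along these lines works:
```
Describe {0} as a single comma-separated line of booru-style tags for an image generator. Output tags only, no prose.
```
Typing 'generate an image of a rainy cyberpunk street' then sends that line to the LLM with {0} replaced by 'a rainy cyberpunk street', and the LLM's tag list goes to your /sd backend.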
>>
>>107033563
/pol/ is alright, flags and IDs help identify samefaggotry and proxyhopper niggers. We cant do it here.
>>
>>107031748
>the bloat from figuring out tokenization on the LLM’s side is huge
I have no idea what this means, you can code a BPE tokenizer in a weekend. Is this just people parroting Karpathy and his anti-tokenizer autism?
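It really is weekend-sized; the core merge loop of classic character-level (Sennrich-style) BPE fits in a screenful. A toy sketch, nothing like a production byte-level tokenizer:
```python
from collections import Counter

def train_bpe(text, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent symbol pair.
    Character-level and word-bounded for clarity; real tokenizers work on raw bytes."""
    words = [list(w) + ["</w>"] for w in text.split()]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))          # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)         # most frequent pair this round
        merges.append(best)
        merged = "".join(best)
        for i, w in enumerate(words):            # apply the merge everywhere
            out, j = [], 0
            while j < len(w):
                if j + 1 < len(w) and (w[j], w[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(w[j])
                    j += 1
            words[i] = out
    return merges

print(train_bpe("low lower lowest low low", 6))
```
Encoding is just replaying the learned merges in order; the vocab is the set of symbols left plus the base characters.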
>>
>>107035178
The few days we had on 8ch with post IDs were the best this general has ever been.
>>
>>107035288
>Karpathy
that's *AI God Daddy* to you, peasant
>>
>>107035178
They probably make money from stealth advertising on /g/; no way those shitty banner ads make enough to run this site. IDs would ruin it, unless you post from several IPs, since the pattern would make it obvious. Probably also a reason why the IP count was removed.
>>
>>107035288
I don't know how you could remove the tokenizer or something equivalent to that.
Let's take a typical mid-size model's token embedding table, with shape [262208, 5376]. Every token embedding has 5376 dimensions, or 10.5 kB of purely semantic information in FP16 per token.
A long word rendered at standard 96 dpi might be 70 x 12 pixels x 3 channels = 2.46 kB, and it would contain no semantic information on its own, just pixels. Encoded into a smaller latent space it would be even less information than that.
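Sanity-checking those figures (shapes as quoted above; the last line is what the raw text input per token actually costs):
```python
import math

# Embedding table [262208, 5376] in FP16: bytes of learned, semantic info per token row.
print(5376 * 2)                           # 10752 B ~= 10.5 kB per token embedding

# A long word rendered at ~96 dpi: 70 x 12 px, 3 channels, 1 byte each -> raw pixels.
print(70 * 12 * 3)                        # 2520 B ~= 2.46 kB

# The raw input a text model receives per token is just an id into the vocab.
print(math.ceil(math.log2(262208) / 8))   # 3 bytes are enough to index 262208 entries
```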
>>
>>107035353
It's a good bet that glowies have access to a special system that lets them use whatever ID they want.
>>
>>107035415
Glowies you spot by their words, everything else can look legit.
>>
>>107035412
You are looking at raw input values for images but at semantic content for tokens. If you want to compare raw inputs, look at the 262208 (the vocab a token id indexes into), not at 262208 * 5376.
>>
>>107035412
Read the paper and believe.
>>
>>107035440
>>107035412
And that would be one int value per token, or 4-8 bytes per word. Encoding words as images isn't even in the same realm as tokenization when it comes to compressing the raw input.
>>
>>107034750
What do you mean, "remember" where things are?

LLMs don't remember things, they generate tokens based on what is in their context, and which context matters is determined by attention.

We need tweaks to attention, or better training data for the purpose.
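To make "which context matters is determined by attention" concrete, a minimal single-head sketch (toy shapes, no causal mask, plain numpy):
```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V: each output row is a weighted mix of V,
    with the weights saying how much each position 'matters' to the current one."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))   # 4 positions, d = 8
out, w = attention(Q, K, V)
print(w.round(2))   # row i: how strongly position i attends to every position
```
Whether "the couch Anna hid behind three turns ago" gets any weight at generation time is entirely down to those learned weights, which is why it looks like "forgetting".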
>>
Anons, >>107035532 has autism, so please go easy on him.
>>
>>107034750
llama 405b at q8 handles character positioning just fine. probably because it's a dense model

reminder https://outsidetext.substack.com/p/how-does-a-blind-model-see-the-earth
>>
>>107035570
meanwhile the "best" moesissy model...
>>
>>107035570
>japan
>>
>>107035581
>Maverick is the 405b equivalent, in case you forgot. I imagine that the single expert routing isn't helping it develop a unified picture.
>>
>>107035542
you're the retard.
>>
>>107035628
>Maverick
>interleaved attention
>moe
an abortion of a model
>>
>>107035628
better, it vaguely has japan look somewhat normal if you squint
>>
>>107035664
and it's missing half of antarctica which should be much easier to generate than japan. what's your point?
>>
>>107035570
this is mostly a test of memorization, not spatial reasoning, and has basically no correlation with keeping track of character state in a roleplay
>>
>>107035682
only japan matter
>>
File: teto_00009_.mp4 (1.25 MB, 1920x1184)
>>107035690
If it were just memorization, you would expect the model with more total parameters to do better. It doesn't, because the task requires integrating the knowledge it has, which depends on attention and on having more active parameters.
>>
>>107035841
>>107035841
>>107035841
>>
>>107035841
Where's her ass?
>>
>>107034239
Where are you seeing this? The cheapest refurbished server RAM I could find came to 8k dollarydoos for 1 TB of DDR5, and that's the RAM alone, before any processors, boards, PSUs, etc.

Seems like a lot to spend just to run bigger models; a prosumer GPU feels like a better buy when it's faster, can run fairly big dense models along with MoEs quickly, handles img/vid gen workflows, etc.

I feel like it's enough for any use case, alongside the odd API call for things that really, truly need a fuckhuge model. My current coding workflow uses a local model first and then has one or two API models look over it (rough sketch below), the goal being fewer iterations and less tweaking at the first step thanks to a bigger locally run model, so I rely on APIs less. Then I can use SD and WAN for creative projects, and slap my cock to smarter coombots.

Thanks for making me look into it anyway; I've made up my mind. I'll go for a cheaper AM5/DDR5 upgrade first and look for a pro-line GPU a little later.
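The two-stage flow mentioned above, as a rough sketch: URLs, model names, and the env var are placeholders rather than what I actually run, and both ends are assumed to speak the OpenAI-compatible chat API (llama.cpp and tabbyAPI both do):
```python
import os, requests

def chat(base_url, model, messages, api_key=""):
    """Minimal call against an OpenAI-compatible /v1/chat/completions endpoint."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    r = requests.post(f"{base_url}/v1/chat/completions", headers=headers,
                      json={"model": model, "messages": messages})
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

task = "Write a function that merges overlapping intervals."

# Stage 1: local model on localhost produces the first draft.
draft = chat("http://127.0.0.1:8080", "local-coder",
             [{"role": "user", "content": task}])

# Stage 2: a paid API model only reviews the draft, so each iteration burns far fewer tokens.
review = chat("https://api.example.com", "big-api-model",
              [{"role": "user",
                "content": f"Review this solution to '{task}' and list concrete fixes:\n\n{draft}"}],
              api_key=os.environ.get("API_KEY", ""))
print(review)
```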
>>
>>107035934
She laughed it off.
>>
What's the best general purpose SLM (under 4B in my case) that can rival something like ChatGPT or Grok? Obviously it wouldn't be as powerful, but something that can do the low-level assistant tasks most people use those two for.
>>
>>107034892
the problem is that even a small model like Gemma 3n E4B has more world knowledge than ANY qwen, including the API only Qwen Max. I'm not being hyperbolic.


