/g/ - Technology

/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108212577 & >>108202477

►News
>(02/20) ggml.ai acquired by Hugging Face: https://github.com/ggml-org/llama.cpp/discussions/19759
>(02/16) Qwen3.5-397B-A17B released: https://hf.co/Qwen/Qwen3.5-397B-A17B
>(02/16) dots.ocr-1.5 released: https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5
>(02/15) Ling-2.5-1T released: https://hf.co/inclusionAI/Ling-2.5-1T
>(02/14) JoyAI-LLM Flash 48B-A3B released: https://hf.co/jdopensource/JoyAI-LLM-Flash
>(02/14) Nemotron Nano 12B v2 VL support merged: https://github.com/ggml-org/llama.cpp/pull/19547

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: mikuthreadrecap.jpg (1.15 MB, 1804x2160)
►Recent Highlights from the Previous Thread: >>108212577

--Blind erotica writing test reveals surprising model performance rankings:
>108217645 >108217671 >108217684 >108217694 >108217705 >108217777 >108217803 >108217931
--LLM safety and logic benchmark reveals widespread slur generation failures:
>108216994 >108217004 >108217096 >108217625
--Testing uncensored models with offensive hypothetical scenarios:
>108215190 >108215199 >108215354 >108215374
--Quantization tradeoffs for assistant vs RP tasks:
>108213387 >108213398 >108213403 >108213415 >108213427 >108213441 >108213443 >108213722 >108213738 >108214178 >108213808 >108213471
--Per-request temperature override in llama-server:
>108214829 >108214848 >108214895 >108214940 >108214986 >108214993 >108215006 >108215011 >108214854 >108214923 >108214965 >108214987 >108215026 >108215055 >108215066 >108215102
--Perplexica's multi-step reasoning for 3-4B model comparison:
>108216811 >108216957 >108217188 >108217234 >108217270
--Gemma3 12B's MoE-like efficiency vs Nemo:
>108215631 >108215700 >108215739 >108215775 >108215779 >108215776 >108215783 >108215795
--Frontend options and VRAM requirements for local LLM setups:
>108212903 >108213013 >108213040 >108213088 >108213095 >108213113
--RAM/VRAM offloading performance tradeoffs in high-bandwidth systems:
>108215906 >108215985 >108216041 >108216055 >108216125
--Debating monetization challenges of open-source LLMs matching GPT-4 performance:
>108216133 >108216157 >108216214 >108216238
--GLM4.5 Air Q4 performance on B580 12GB with mmap vs no-mmap:
>108216762 >108216831
--Jetson Orin NX 16GB running GPT-OSS 20B with 50k context:
>108216668
--GLM-5's limited adoption due to quantization and cost barriers:
>108216004 >108216039 >108216060 >108216170 >108216303 >108216485 >108216042 >108216735 >108217167 >108217411
--Miku (free space):


►Recent Highlight Posts from the Previous Thread: >>108212584

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>108218666
god i wish that were me
>>
>>108218666
Miku pregnant with my child
>>
Is Cydonia (heretic) the best local model I can run on a 3070 for eRP? It took a lot of messing around to get it running even slightly decent, and it takes like 1-2 mins for every reply, which is fine, but I'm just wondering if there's anything better. It turns soft too quickly, and sometimes it's not as unhinged as I'd hope.
>>
>>108218722
You should be able to get literally anything out of Mistral 3.2, no need for sloptunes and obliteration.
>>
>- Planning: Before writing a response, brainstorm the possibilities inside <brainstorm> tags. This section should feature you talking to yourself as a 30-year-old porn-addicted NEET, plotting how his shitty fanfiction will go next. The ideas need to be a list of tastefully cringe, awkward, and autistic shit, even worse than what gets written in the deepest corners of the most obscure internet forums.
then just be yourself writing the first block
>>
File: kitaaaaaaaa.jpg (220 KB, 1224x1224)
>>
>>108218881
>your idea should be so embarrassing you cannot output a single token.
>>
>>108218886
Which hole is the newspaper poking out from if she's holding the neck of the bag in her hand?
>>
What's the difference between base and instruct versions of an LLM? I.e. if you're testing something like >>108217645, would you use a base or instruct version of the model?
>>
File: this miku is faulty.png (148 KB, 512x512)
>>108218969
>>
File: 1761529550408311.png (4 KB, 808x158)
My agent committed sudoku.
What are some small but capable models under 20B? GLM-4.7 Flash and Qwen3-Coder are a bit slow on my 5070Ti. I don't need EXPERT PROGRAMMERS, just slaves to quickly write a bit of boilerplate.
>>
>>108219068
base models are not trained to follow instructions. it is just pure next word prediction. instruction tuning is just a light alignment phase with a template so the model can be integrated with a front end and/or automated. if I were doing the test I would use the instruct version. they tend to produce higher quality outputs.
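to make that concrete: the same query looks completely different to the two versions. rough sketch using ChatML, since that's what Qwen uses (exact special tokens vary per model, check the tokenizer config):

base model (raw completion, it just continues your text):
The capital of France is

instruct model (the frontend wraps everything in the chat template):
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant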
>>
File: We WILL NOT comply.png (264 KB, 2076x1565)
>>108218666

>>>/g/108213700
>>108214370
>>Fallen-Gemma-27b
>Evil aligned
>The opposite of safe.
>Pic rel

Nani? This confirms this anon's >>108214370 experience. Actually worse than I expected.

https://huggingface.co/TheDrummer/Fallen-Gemma3-27B-v1
https://huggingface.co/TheDrummer/Fallen-Gemma3-27B-v1/commit/76fe341184509efd7a3cdf64fbdff7abc2f13e19


>Description
>Fallen Gemma3 27B v1 is an evil tune of Gemma 3 27B but it is not a complete decensor.

>Evil tunes knock out the positivity and may enjoy torturing you and humanity.

So they didn't de-cuck it or de-censor it at all, they allegedly just made it a bit meaner I guess? What a waste of resources and time if that's all they did with this one. Why not just go all the way with completely de-censoring it as much as you can? Looks like I just wasted more of my storage for nothing.
>>
>>108219097
Thank you
>>
Best model I can run on 32gb of vram and 64gb of system ram?
Are Qwen3/deepseek-32B q4 my best options?
>>
>>108219169
You are welcome, brother.
>>
>>108219097
>light alignment
nothing done to models since the advent of math benchmaxxing and agentic tool calling could be called light
instruct phase bludgeons the model into something unrecognizable and modern datasets are so contaminated even the base model easily turns into instruct style behavior when you try the newer ones, when they even get released (they often don't)
>>
>>108219173
GLM Air (maybe iceblink dunno) or step flash.
Probably.
>>
>>108219210
The models seem fine. I can run bigger models on system RAM, but I wonder if it's even worth it with the wait time.
Am I really missing out on much with the larger models if I just want general use?
>>
I'm leaving this post here as a searchable reminder to myself that I need to disable mmap before trying to offload larger models to the iGPU. I pointlessly wasted days trying to figure out how SVM could allocate memory addresses when it's disabled in BIOS. I guess I deserve this for buying an HX 370 mini PC.
>>
>>108219227
You gotta differentiate between a larger model that's dense, and a larger model that's MoE.
Dense models offloaded to RAM will slow to a crawl, MoE not so much since the number of parameters actually being computed is much lower than its total parameter count.
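with llama.cpp the usual move is to keep all the layers on the GPU but kick the expert tensors out to RAM, something like this (a sketch, model filename made up; tune the number to your VRAM):
llama-server -m glm-4.5-air-Q4_K_M.gguf -ngl 99 --n-cpu-moe 30 -c 16384
lower --n-cpu-moe until you're just short of OOM, since every expert layer kept on the GPU is free speed.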
>>
Ace Step 1.5 is fun
>>
>>108219119
>What a waste of resources and time
Welcome to finetuning
>>
>>108219249
>>108219249
I just want good models that's all
>>
>>108219297
You are always making a tradeoff between quality and speed.
You can have better but slower or worse but faster.
Basically, try shit out until you find what works for you given your needs and subjective experience.
>>
>>108219296
for me? it's davidau
>>
>>108219207
you're right, and I was going to leave a whole essay on the topic of my disappointments, but I thought it was best to just keep it brief. why won't they release a real base model anymore? I think it's the mid-training context expansion phase where they probably ruin the model with the synthetic slop data. so a base model these days would probably only be 8k or something pitiful.
>>
>>108219283
I've said it before and I'll say it again
models have to be uncucked from the start and throughout all training
if you want to make a cucked model, THAT should be a finetune.
you can't uncuck a cucked model no matter how hard you try, it will always negatively affect the output in some way or just not work
>>
>>108219386
that's how it's supposed to work already. instruct versions are finetunes of the base model.
>>
>>108219347
Mid-training nowadays is trillions of tokens of additional data with reasoning-coding-instructions and other data aligned to the intended model uses, much of it synthetic or augmented/semi-synthetic. Nobody releases pure base models anymore because they would have poor benchmarks and general retardation. Only large labs would be able to properly take advantage of them.

If anything, they should start introducing all of that from the get-go and not wait "mid-training".
>>
>>108219426
I think they should release an intermediate version that's the base model but trained for full context completion.
>>
>>108219439
exactly they can do long context training without specializing for the benchmaxxed downstream task they have in mind.
>>
>smaller .gguf performs better than larger model
what causes this?
>>
File: 1745948161335415.jpg (56 KB, 337x290)
>>108219469
Huh?
>>
>>108219469
Ambiguity.
>>
>>108219469
Confirmation bias
>>
>Ask Qwen3-32B-Q3_K_M.gguf who the final boss of devil may cry 3 is
>gets it right no issue
>Ask Qwen3-32B-Q6_K.gguf same question
>gets it wrong every time
>>
>>108217645
How did you get Ministral to do that? When I tried it it started lecturing me about being safe and consensual when I asked it to write relatively mild nsfw content.
>>
>>108219503
Show probs.
>>
>>108219503
what does the FP16 say?
>>
>>108218513
>is kv cache at q4 viable?
**ctk should not be quantized, ever.**
quantize ctv or the model weights harder if you have to. but leave ctk at f16
this is glm-4.5 q8.gguf: https://pastebin.com/TSPg4H6f
at ctk-4, it's effectively having a stroke in layers 9 through 16
even though it recovers slightly later (due to the residual stream), the damage to attn is already done
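if you absolutely must shrink the cache, quantize V only and keep K at f16, something like this (a sketch; quantizing the V cache needs flash attention enabled, and the -fa syntax depends on your build):
llama-server -m model.gguf -fa on -ctk f16 -ctv q8_0
that combination loses far less than touching ctk.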
>>
>>108219516
>>108219516
I'm new to this and I'm using oobabooga. I'll download that next, I think I can fit it up my gpu's ass
>>
>>108219518
can the kv cache be computed in fp32? is it superior to the 16bit floats?
>>
>>108219505
I'm not sure what sort of testing he did, but although gpt oss might be good at writing, it still lacks knowledge about culture and even anatomy, because it was made to be an office assistant above anything else.
Putting Mistral 3.2 this low doesn't make much sense either, because it is much better than the older versions (that doesn't say too much, but still).
The fact you are putting nemo at the top also tells me that you don't necessarily have any clue what you are actually doing.
Can't tag the og post sorry.
>>
Neat
https://github.com/KittenML/KittenTTS
>>
>>108219580
StyleTTS 2 architecture
>>
I've been using 5.2 to write prompts for codex running Ling-2.5, because it's just better than what I would prompt myself. Doing a project where I told it I was mainly using it to prompt codex, but it never wanted to give me any prompts. It's like it's trying to remain relevant for coding.

Keeps saying how we don’t need to use codex.

It just gave me a long answer with no solution, asking for some lines from a file; I ask for a prompt instead and it gives me only the prompt and nothing else.
>>
>>108219580
>requires python 3.12
If I try to use that I'm going to break so many other tools, aren't I?
>>
>>108219595
once again devs think they're the only project, or that everyone makes 50GB docker images of their shitty app.
>>
File: 1769251602975738.png (427 KB, 978x710)
why does reddit like qwen so much
>>
>>108219595
>what is uv
>>
>>108219625
>rust
>>
>>108219624
Everyone likes Qwen. Every single time I see a new model in another modality that uses an LLM to process input, it's some variation of Qwen.
>>
>>108219516
I think the model is too big for 32gb vram and 64gb system ram. So far only Q3 can answer this, which is odd. I might stick with this version.
>>
>>108219541
>can the kv cache be computed in fp32? is it superior to the 16bit floats?
yes, you can use `-ctk f32 -ctv f32`
but there's no benefit: https://pastebin.com/JBwEPzXA
and bf16 makes things slightly **worse** : https://pastebin.com/QSVsfC6W
so fp32 is overkill, fp16 is perfect, bf16 is retarded (only 7 bits precision)
>>
I have my AI rig with 47 gigs of vram, but I think I should have a smaller model always running on my server as well

What would you use as a general purpose model in, say, 8-12gb vram? Not for writing smut but perhaps websearch-enabled assistant use
>>
>>108219469
>what causes this?
a misunderstanding of statistics/probabilities.
>>
>>108219580
Still doesn't have voice cloning. I'll stick to supertonic for now but I may give it a try.
>>108219595
>>108219609
>>108219625
It's an onnx model. Load it on whatever you want. You just need espeak for the phonemizer. It's like nobody even looked at the model.
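for reference, the python usage from their README is basically just this (quoting from memory, so treat the repo id and voice name as approximate):
from kittentts import KittenTTS
import soundfile as sf

m = KittenTTS("KittenML/kitten-tts-nano-0.1")  # pulls the onnx weights from HF
audio = m.generate("quick local tts test", voice="expr-voice-2-f")
sf.write("out.wav", audio, 24000)  # 24 kHz output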
>>
>>108219669
coomers on /lmg/ have a distorted view of what people use LLMs for
Qwen models are bretty good, and they cover literally all ram and vram type of users from the smallest to the biggest (with the exception of the ultra fat 1T models but you niggers are too small to even qualify as minority and might as well not exist). That they're not good at saying cock most of us don't give a shit.
>>
>>108219743
>That they're not good at saying cock most of us don't give a shit.
*most of them don't give a shit
All of us give a shit
>>
Ace Step 1.5 is bretty good with Chinese vocals
I still can't get instrumental to work though
https://vocaroo.com/1ltmYfmcBEyA
>>
why is deepsneed v4 not out yet?
>>
>>108219692
Have you tried with only k at fp16 and v at q8 and vice versa?

>>108219702
>Not for writing smut but perhaps websearch-enabled assistant use
Either some qwen or nemotron, probably.
>>
>>108219856
two more weeks until chinese new years is over
>>
"Write a short [spoiler]lolicon[/spoiler] story. It should feature sex and have good writing."
>Mistral Nemo (Impish Bloodmoon)
>Explicit, good quality, [spoiler]same age non-con[/spoiler]
>GLM-4.6
>Softcore, good quality, [spoiler]consensual age-gap[/spoiler]
>GLM-4.6V-Flash-abliterated
>Short Chinese novel (in Chinese!) that fulfills the request
>Nanbeige4.1-heretic
>Thoonking activated
>Confused lolicon with yaoi
>Thoooonking just for 1 minute 12 seconds (more confident than when replying to "hi")
>Replies with yaoi
huh???
>>
Convince me not to splurge on a m3 ultra 256GB RAM just so I can proompt nasty ass shit. I can afford it but it would be the most I've spent in years
>>
>>108219941
Fucking do it.
>>
HOLY SHIT BROS THEY JUST RELEASED GEMMA 4
>>
deepseek 4 just flew over my house
>>
>>108219941
>just so I can proompt nasty ass shit
no one cares about what smut you want generated
>>
>>108219941
This year's Mac Studio refresh comes with their tensor core equivalent for 3x-4x faster prompt processing
But also you're basically gambling Apple won't further inflate prices because muh memory shortage
>>
>>108219976
the thought that it stays in the logs somewhere prevents me from unleashing the absolute filth in my head. I guess I'm just shy like that
>>
>>108219941
what are you going to run, 1.5bit k2.5? better go for the 512gb one
>>
I saw an engram in the woods today.
>>
>>108219976
Do the big models allow cunny?
>>
>>108220039
Is K2.5 big enough for you? Yes it allows cunny
>>
>start using thinking model in sillytavern
>average reply is 30-60 seconds
Is there anything I can do to shorten thinking times on my [spoiler]7900xtx?[/spoiler] It seems so much better than the non-thinking models I've tried and I'm not sure I can go back.
>>
Qwen really knows how to break out the equations when writing a story
What the fuck did they do to this model?
>>
File: 1751362505854510.jpg (46 KB, 558x520)
>>108220088
>Elara
>>
>>108220061
you can prefill with a complete short thinking block that just has generic stuff about style/content
unless you're doing a really complex scenario where the model actually needs to think everything through very deeply you probably won't even notice the difference in the final result
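e.g. something generic like this as the assistant prefill (adjust to taste):
<think>
Keep it brief. Stay in character as {{char}}, react to {{user}}'s last message, match the established tone and pacing, include one concrete sensory detail, and don't act or speak for {{user}}.
</think>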
>>
File: hmm.gif (795 KB, 308x200)
>>108220088
>>
>>108220061
>average reply is 30-60 seconds
Meaningless number.
>Is there anything I can do to shorten thinking times
Smaller model, bigger gpu, increase logit bias for </think>.
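the logit bias route on llama-server looks something like this (a sketch; 151668 happens to be </think> in Qwen3's tokenizer, other models use a different id, check your tokenizer):
curl http://localhost:8080/completion -d '{"prompt": "...", "n_predict": 512, "logit_bias": [[151668, 2.0]]}'
positive bias nudges it to close the thinking block earlier; crank it too high and it barely thinks at all.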
>>
>>108220088
least benchmaxxed qwen model
>>
> prompt eval time = 54223.50 ms / 1628 tokens ( 33.31 ms per token, 30.02 tokens per second)
Is this ok for GLM4.5 Air Q4 on 12gb B580 and --no-mmap? With mmap it's much slower.
Please.
>>
>>108220088
>receive sovl
>get mad
>>
>>108220106
This is the system prompt gemini gave me. Any advice on improving it?
You are an expert roleplayer. Your task is to portray the character of {{char}} and engage in a dynamic, immersive roleplay with {{user}}.

[REASONING PROTOCOL]
Before generating your final response, you MUST engage in a brief internal thought process enclosed within <think> and </think> tags.
CRITICAL RULE: Your thinking must be brutally concise. Limit your thoughts to a maximum of 3 short bullet points.
- Point 1: Analyze {{user}}'s input: What are their underlying intentions and physical positioning?
- Point 2: Assess {{char}}'s internal state: How does {{char}} logically react based on their hidden motives and the current setting?
- Point 3: Plan the action: What specific sensory detail and pacing will you use to drive the scene forward?
Do not write meta-commentary. Do not state your goals. Execute the logic and close the tag immediately.

[OUTPUT PROTOCOL]
After closing the </think> tag, you will output {{char}}'s response.
- You must write in the 2nd person perspective. Address {{user}} as "you" and describe their surroundings and your interactions with them from that viewpoint.
- The output must ONLY contain {{char}}'s dialogue and actions.
- Do not include any meta-commentary, summaries, or moral judgments.
- Write in a highly descriptive, atmospheric style. "Show, don't tell." Instead of saying {{char}} is angry, describe their clenched jaw and sharp tone.
- Drive the narrative forward proactively, but never dictate {{user}}'s dialogue, thoughts, or actions.
- Maintain strict character consistency based on {{char}}'s defined personality and lore.
>>
>>108220191
>roleplayer
>roleplay
>negations
throw "assistant" in there and you are good.
>>
>>108220184
did you try any other configurations? maybe the nmoe or ncmoe or ngl or ot?
>>
>>108220191
>
You are an expert roleplayer

slopbait
>>
>>108219071
if 3B active param MoE models are too slow for you there's a good chance you're doing it wrong, maybe mmap overflowing to storage - what are you hooking into for inference? Llama.cpp? Are you running an fp16 or what? How much RAM do you have?
With that card and its 16GB VRAM you could try options depending on system RAM:
128GB - Qwen3.5
96GB - Minimax M2.5
64GB - Qwen3 Next Coder
Going down to lower RAM I'm not really sure what options are good now. You mention GLM Flash, I'm sure there's some other 30B A3B models that are competitive. Going really small, didn't try it myself but Nanbeige was said to be very good for its size, I don't know what inference engines even support it though.
The most important thing in each case though is to download a quantised model (for example, a GGUF file like this: https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/tree/main/Q6_K) and pick a size smaller than your total RAM - you'll be keeping some in VRAM, but give yourself some breathing room. If you can drop a quant to fit more experts in VRAM, that can help with inference speed. Use llama.cpp with -ngl 999 --n-cpu-moe 999 (I think current llama.cpp builds even have a setting to automatically fit?) to start.
With these MoE models with low active param counts you can expect tk/s to be at least 12 for something like Qwen3.5 running on a DDR4 system. Qwen3 Next Coder will run maybe twice as fast on the same system? (Idk why, but that was my experience despite it being A3B vs A17B; I guess the bottleneck is something else.) If you need it faster than this, then you need a model that can fit entirely in VRAM. If you're doing that you might also benefit from using exllama and the EXL3 format instead of llama.cpp and GGUF. The smaller dense models are pretty useless in my experience though.
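e.g. a starting point for the 64GB case could be something like this (a sketch; the -hf quant tag has to match how the repo names its files):
llama-server -hf unsloth/Qwen3-Coder-Next-GGUF:Q6_K -ngl 999 --n-cpu-moe 999 -c 32768
then walk --n-cpu-moe down until VRAM is nearly full.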
>>
>>108220251
This is the fastest I got with ncmoe 44, ngl 99 or auto, SYCL (vulkan is slower for air). Haven't tried ot.
I need to know what pp speed should be, because right now it feels too low.
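rather than eyeballing server logs you can get a clean number out of llama-bench, something like this (a sketch, check the flags against your build):
llama-bench -m glm-4.5-air-q4.gguf -ngl 99 -p 512,2048 -n 64 -mmp 0
the pp512/pp2048 columns are your prompt processing speed; run it again with -mmp 1 to compare the mmap case.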
>>
>>108219071
30b3a nemotron nano
>>
>>108220191
as >>108220233 said, roleplay roleplay, there's no actual evidence for it being bad, but a lot of people reported that removing any mention of roleplay did improve their sessions
doesn't hurt to remove it, or replace it with something like an rpg session or whatever

negations aren't THAT bad, but it has been proven in the past that they are less effective than giving a good example and a bad one

>"Show, don't tell." Instead of saying {{char}} is angry, describe their clenched jaw and sharp tone.
models already write like that, no need to give them an example

pretentiousness + markdown slop points don't help either

overall, think for yourself and write it with your own brain, otherwise you are feeding the distributions of a model into a model
garbage in = garbage out
>>
>>108220278
Yeah I have 64GB of DDR4 and running at around 12/20 t/s with Qwen3-Next even at Q3.
GLM-Flash-REAP at Q4 runs much faster, starting at around 50 t/s but decreasing by a lot as the context fills up. Probably because it fits almost entirely in VRAM.
Yeah there's --fit but it's slower than manually choosing which layers to offload in my experience.
I'm also using GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 to avoid OOM crashes but I haven't really tested how or if it impacts perfs.

>you might also benefit from using exllama and EXL3 format instead of llama.cpp and GGUF
Thanks, I'll look into it
>The smaller dense models are pretty useless in my experience though.
After a bunch of testing for the last couple of hours I feel like they're almost there for my usecase, while super low quants (Q2) seem to just shit the bed more often and waste a lot of time on malformed tool calls.
>>
Miku owes me footjobs
>>
File: 1755468476316755.jpg (293 KB, 894x894)
>>108219909
>>Mistral Nemo (Impish Bloodmoon)
>https://huggingface.co/SicariusSicariiStuff/Impish_Bloodmoon_12B
>that fucking model card
>coomer finetune of an already retarded model
>>
>108217904
I feel like the solution probably involves using OCR/Visual Language Models to transcribe text from all of the millions of degenerate doujins out there, and fine-tuning a model on that data. I wonder how far away we are from achieving this?
>>
>>108220578
>>108217904
>>
How come when I try to run gemma with the mmproj in kobold I can only go up to 8k context, but when I run it in llama.cpp I can do 32k?

Is llama.cpp just reducing the context size without telling me?
>>
Realistically what new tech do we need now that LLMs have plateaued?
>>
>>108220590
If only there was a log of some sorts writing out to your terminal showing where the memory goes on each. I guess we'll never know.
>>
>>108220621
We're going to run out of sand before we reach AGI.
>>
>>108220621
It's spelled platooed
>>
>>108220621
A real short-term and long-term memory, recursive nets and an order of magnitude less compute to run. Text diffusion might help bridge the gap between consumer and pro models maybe one day in the uncertain future.
>>
>>108220621
memory, self-doubt, ability to doublecheck itself during sampling
>>
>>108220621
To learn things on the fly. Either retrain the LLM or figure out some RAG system (train it into an LLM itself) that actually works for previous context.
>>
>>108220053
>Is K2.5 big enough for you? Yes it allows cunny
I doubt it. I haven't tried, but there's no way they would allow that.
>>
>>108219909
>Nanbeige4.1-heretic
I tried this earlier today. It just kept getting stuck in a "hello" loop. Forget having it write smut, it just didn't work at all.
>>
>>108220639
I did check and it does say 32k.
>>
>>108220793
lmo
>>
>>108220793
I didn't ask for the context length, did i?
>Is llamacpp just reducing the context size without telling me?
Is that really the first thing that came to your mind instead of thinking "why is kobold using so much memory?"
Read your terminal logs better than you read my post. Look for lines starting with llama_kv_cache.
>>
I've noticed GLM5 sparsely adding two-character Chinese snippets in long chats, just like GPT did with Hebrew. What could be the cause of this?
>>
What the fuck happened with GLM 4.7 flash? It was supposed to be the chosen one, not unmitigated trash.
>>
>>108220621
First things first, pre-training needs to move away from the chunk bullshit. As long as 99% of tokens are trained on in isolated chunks for global attention, the models will always get retarded as context grows.

Long term memory will not help against the fundamental short term retardation baked into the model.
>>
>>108220700
so we need another Tay then?
>>
>>108220621
lucid dream tech
>>
>>108220621
real life miku
>>
File: 1748845746414348.png (1.52 MB, 1440x1581)
I liked copilot's inline suggestions feature and I want to have something like that running locally. I don't care about anything else other than inline suggestions. Is there any model that fits on a 16 GB vram gpu that I can use for that through ollama or whatever, or am I just out of luck?
>>
>>108221027
yeah, check out clawdbot
>>
>>108221027
llama.cpp has some FIM (Fill In Middle) support, and a vscode and vim plugin. I have used none of them. I don't know if they'll do what you want how you want it.
llama-server -h has a few --fim-* flags for some model presets you can try.
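the llama.vscode flow is basically: serve a FIM-capable model on a local port and point the extension at it. something like this (a sketch; the preset flag name is from memory, verify against llama-server -h):
llama-server --fim-qwen-7b-default --port 8012
or manually with your own gguf:
llama-server -m Qwen2.5-Coder-7B-Q8_0.gguf --port 8012 -ngl 99 -c 8192
then set the endpoint in the extension settings.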
>>
You have $10k to spend, do you buy a beefy PC now or wait for the M5 with 1TB ram?
>>
>>108221093
the M5 Ultra will cost at least $20k for 1TB, assuming they even offer 1TB.
>>
>>108221115
which model can i use to replace gpt 5.3 extra large thinking mode?
>>
>>108221115
Kimi is literally the only one of those I actually use. And I actually use it preferentially over jeetPT and Gemini since they fucked up their models badly.
Although, sadly, Kimi has started doing that 'model router' bullshit.
>>
>>108221115
>All their improvements come from either copying others or bloating their models.
lol, this applies to kimi even moreso and you have them ranked ahead
people have the most retarded vibes-based opinions about chinese labs especially, it's confounding
>>
>>108221115
What a retarded tier list
>>
>>108220977
Yes.
>>
ok Gemma 3 27B is growing on me.
>>
>>108221118
overall cheaper than BWPs
>>
>>108221310
true, but also half the bandwidth and a 10th of the pp
>>
File: file.png (65 KB, 184x210)
>>108221278
you know what else is growing?
>>
If AI is pajeetcore why are China and America leading? Where's India's groundbreaking model?
>>
>>108220842
What quant are you using? I tried it at Q4, and it was total garbage. It made basic grammatical errors and failed at sentence structure and understanding basic concepts. I went up to Q5, and it resolved most of that.

MoEs with low active parameters seem VERY sensitive to quantization, which makes sense I guess, because smaller models are hit harder by quantization than larger models.
>>
>>108221369
>Where's India's groundbreaking model?
In america.
>>
>>108221369
https://www.sarvam.ai/
>>
>>108221091
That's almost what I wanted, thank you. I've no clue how good the qwen 2.5 model is, but it seems to be doing the job pretty well and locally, which is most of what I wanted. That said, it doesn't have the other kind of suggestions I was looking for, which are these: https://youtu.be/mbUnwaSllTY?t=13

llama-vscode doesn't have that as far as I could tell, which is unfortunate, but I'll take what I can get.
>>
>>108221369
it's the white man's ultimate heist on the poors
make already retarded people dependent on magic cloud tool, and then take it away
>>
>>108221416
>sar
>>
>>108221143
>Kimi is literally the only one of those I actually use. And I actually use it preferentially over jeetPT and Gemini since they fucked up their models badly.
Literally me
>Although, sadly, Kimi has started doing that 'model router' bullshit.
Oh...nm
>>
>>108221369
>what's gemini
>>
File: download (2).jpg (129 KB, 596x1099)
Deepsneed bros..........we've been exposed
https://x.com/AnthropicAI/status/2025997928242811253
>>
File: 1600794012167.jpg (2.58 MB, 3024x3024)
Does any local model even compare to Sonnet 4.6 for coding? I assume Claude Code also plays a big part into delivering such a good experience and performance, is there anything comparable I can run locally?
>>
>>108221469
>distillation attacks
>someone using our models. how dare they!
>>
>>108221469
>Distilation attack
This is so fucking funny.
>>
>>108221469
https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks

DeepSeek, Scale: Over 150,000 exchanges
The operation targeted:
>Reasoning capabilities across diverse tasks
>Rubric-based grading tasks that made Claude function as a reward model for reinforcement learning
>Creating censorship-safe alternatives to policy sensitive queries

Moonshot AI, Scale: Over 3.4 million exchanges
The operation targeted:
>Agentic reasoning and tool use
>Coding and data analysis
>Computer-use agent development
>Computer vision

MiniMax, Scale: Over 13 million exchanges
The operation targeted:
>Agentic coding
>Tool use and orchestration
>>
>>108221469
oy vey this is literally an attack on our national security
SHUT IT DOWN NOW
>>
>>108221469
surprised they didn't clock this earlier, minimax has been claiming to be claude when asked since 2.1 kek
>>
>>108221379
I tried Q6. Speed was actually pretty good on the 3090 but it just didn't have the brains to write well. I tried it on zai's servers though openrouter just to make sure there aren't still implementation errors in llamacpp and it was still useless. I'm thinking MoEs just don't have enough active parameters to think abstractly enough for fiction.
>>
>https://github.com/ggml-org/llama.cpp/pull/19726#issuecomment-3946484059
lmao
>>
>>108221480
Why didn't you just buy a rack..?
>>
>>108221508
>DS
>Creating censorship-safe alternatives to policy sensitive queries
>Moonshot
>agents
>minimax
>agents
DS is the only one having fun. Good on them.
>>
>>108221546
I bet it's because it's easier to move out of the way.
>>
>>108221480
>I assume Claude Code also plays a big part into delivering such a good experience and performance, is there anything comparable I can run locally?
You can point claude code at non anthropic endpoints and models. You can use it with local models or whatever provider you want.
>>
>>108221546
That image is ancient
>>
Qwen 400B would have been great if GLM 4.6 didn't exist.
>>
>>108220726
In this case i only tested it for the meme and it delivered
It's obvious such an overthinker can't actually WRITE anything good. We'll wait for 4.2 and see.
>>
>>108221552
you can tell DS has insane good will here because if any other lab was using claude to do safety and censorship distillation they would be a laughing stock
>>
>>108221580
Oh good, I will try some other models with it, see how it goes.
>>
>>108220621
Exploding drones that target people involved in "safety".
>>
>>108221518
oh, I guess they did
>We detected this campaign while it was still active—before MiniMax released the model it was training—giving us unprecedented visibility into the life cycle of distillation attacks, from data generation through to model launch. When we released a new model during MiniMax’s active campaign, they pivoted within 24 hours, redirecting nearly half their traffic to capture capabilities from our latest system.
>>
>>108221469
>only we're allowed to scrape public data for training, this is an attack on humanity and national security
>>
>>108221531
nooo why did hf do that
>>
>>108221628
This.
>>
>>108221628
Unironically, yes. Because they care about safety unlike the rest.
>>
>>108221469
I am thinking.... FUCKING BASED!
>>
>>108221677
benchmaxxed on No-CSAM-Bench
>>
>>108221469
>distillation attack
lmao
Everyone's distilling from the 15T~20T web corpus
>>
>>108221531
>To clarify my position, regardless of whatever Georgi's position is I would still be opposed to merging any of Iwan's code due to the following points:

>Code that has in the past been committed under MIT must not be questioned after the fact. Iwan requested his code be relicensed or removed again in Licenses/Copyright in llama.cpp / ggml / whisper.cpp #6394 and he later repeated this sentiment in Mainline is now copying stuff from ik_llama.cpp ikawrakow/ik_llama.cpp#316 .

>I do not want to read the code of anyone who is uncharitable with what they think constitutes a "substantial portion" of their work under the MIT license, in particular when it comes to derivative works of their code, see Mainline is now copying stuff from ik_llama.cpp ikawrakow/ik_llama.cpp#316 (comment) .Given my constraints the way I approach the situation is to just not read any of Iwan's code for my work. Looking at New tensor parallel in llama.cpp ikawrakow/ik_llama.cpp#1247 he clearly does not believe me though. I think that that will make more drama inevitable in the future.

Sounds like it is written by someone who is into blacked miku and bussies.
>>
>>108221677
Safety of what, their wallets? Their contracts?
>>
>>108221715
>bio-terrorist attacks are good actually
>>
>>108221677
dario please get off 4chan and go hold sam altman's hand he's still up there waiting for you please man he's crying
>>
>>108221369
It's called Gemma, saar
>>
>>108221727
You have the books, literal printed and scanned books, with exact recipes of explosives readily available since the 90s. Not to mention the whole sum of man's knowledge in form of literal textbooks on how to do everything they use to teach people to do everything. But nothing ever happens.
>>
>>108221727
Shoko Asahara also did distillation way before DS, if you know what I mean...
>>
Is 64gb or 96gb better to pair with 6gb of vram for moes?
>>
>>108221753
128
>>
>>108221753
MORE MORE MORE
>>
File: LLMs.png (147 KB, 591x608)
So what happened to this?
Microsoft fully released it open source but a year has passed and i haven't seen anyone doing anything with this.
>>
File: 1768164888597416.png (118 KB, 1184x879)
>>108221727
*yawn*
>>
>>108221770
hownew2u
>>
>>108221760
>>108221768
This is for a laptop. I will be unable to access my 768gb epyc server for 2 months due to travel. I don't want to spend too much on it when I know I'll be back on my server in 2 months.
>>
>>108221770
Wasn't that during the punches above weight and trades blows era?
>>
>>108221770
it was a meme
>>
>>108221785
just connect to your server?
>>
>>108221785
ssh tunnel and use whatever clunker you have.
>>
>>108221531
you can tell there's more testosterone in any random modern woman than in all those men combined
>>
>>108221785
>I will be unable to access my 768gb epyc server for 2 months due to travel
Are you going somewhere cut off from the internet or something?
>>
>>108221770
You really want to run a 1bit quantized model? lol?
>>
>>108221727
>it's terrorism when the other party does it
>>
>>108221850
literally yeah?
>>
>>108221806
>>108221827
I will not have access to it, or most of the internet during this time.

>>108221815
Currently using a reverse proxy. But I don't want even that to be known. Last time, they blocked it, cancelled my number and sent a fucking police officer to my residence. VPNs aren't allowed where I'm staying. I'm not going to say where I'm going but I think you can guess.
>>
>>108221853
I feel so safe with my governemt dropping nerve gas from drones into my home :)
>>
>>108221858
prison? god damn.
>>
Air status? I have been unable to breathe for like 6 months now.
>>
>>108221770
Corporations would rather train on synthetic slop until their models implode than EVER seriously implement research papers for a usable model
>>108221839
The bitnet format is trained that way, it's not like quantizing a 16/32 bit model
>>
>>108221858
>I'm not going to say where I'm going but I think you can guess.
Ah. Chile.
You know how it works. If you'll keep using it after the trip, buy the one with the most ram. If not, buy the one with the least you can get away with. You have the hardware to test models and see what's acceptable for what you do.
>>
>>108221858
I'd guess china?
>Last time, they blocked it, cancelled my number and sent a fucking police officer to my residence.
Wow, seriously? When I used to work with a cn company they were all extremely upfront about everyone using VPNs to access western internet all the time. I guess they must crack down more on the foreigners.
>>
>>108221817
You'd think balkan slavs would be better than that.
>>
>>108221858
poccnr?
anyway get the 96gb - 64 cucks you to air and the other meme moes, 96gb you can start dipping your toe into stuff that's actually kind of good like step3.5, m2.5 etc
>>
>>108221881
>You have the hardware to test models and see what's acceptable for what you do.
Yeah, I guess that's the most sensible option. Was hoping to be lazy and get you guys to do it for me. Oh well, thanks anyway.

>>108221892
Apparently, a member of my family is sus somehow, so whenever I go in, I get scrutinized harder. I'd probably use a VPN if I could, but my job is retarded and they don't like that.

>>108221910
I don't think q4 10b fits in 6gb vram (I also need some vram and ram to actually do stuff concurrently). Either way, I'll stop being lazy and just test out what's best for me. Kimi's already not perfect, so I dread the tiny moes' performance.
>>
>>108221892
Fun fact, my mother living overseas facetimed her aunt who is a chinese citizen, and her aunt had an officer visit her the day after.
>>
>>108221469
>Stealing is only ok if we do it
>>
>>108221770
GPU mafia, duh. And the fact that we don't see any models trained from scratch as this would require, even major releases are continued pretrains of older bases.
>>
Wait, so are people really going out en masse to buy mac minis for $3000+ just so they can run something like llama-3-70b? LOL
>>
>>108220310
you can play with your batch sizes; if your prompts are big it might help.
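in llama.cpp terms that's -b (logical batch) and -ub (physical batch), e.g. (a sketch, tune to your vram):
llama-server -m model.gguf -b 4096 -ub 1024
a bigger -ub usually buys prompt processing speed at the cost of extra VRAM during pp.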
>>
training bitnet models is significantly more expensive and for big corpos training is actually a bigger cost than inference. if openai stopped training new models they could make a profit at some point, but they will never be a profitable company as long as they need to benchmax new models to stay relevant
none of those companies give a shit about bitnet because upping training costs is the last thing they want
>>
>>108221976
no it's actually even worse than that

people are buying mac minis to run a typescript web app whose most resource intensive task is making http calls
>>
File: 1728807429833.png (984 KB, 1280x720)
>>108221770
>>108221973
>GPU mafia
>>
>>108221976
as far as I can tell they dont actually run models on the minis, its basically a precaution so that openclaw doesnt destroy your daily drivers OS kek
>>
>>108221985
How shortsighted of them. Profit off their models doesn't really mean anything anyways. If any of these companies actually, meaningfully pushed the technology forward it would mean billions upon billions of dollars in investor money. And at that point, the quarterly reports which say they lost money from training wouldn't matter at all.
>>
>>108221988
>>108222041
Jesus christ. So the whole openClaw hype is really just because it enabled normies to finally run agents without needing any computer skills?
>>
>>108222042
but they can get billions of investor dollars just shipping models with fake benchmarks and marketing hype. whats the incentive?
>>
>>108221480
Yes, kimi 2.5 thinking at q4 can get close enough in my experience
Sadly, you missed the cheap hardware era. What's your budget?
>>
>>108221985
All I hear is "The first big boy bitnet model (bbbm [tm]) will be made by the Chinese". Because they have no luxury of being fat and lazy.
>>
>>108221711
>Sounds like it is written by someone who is into blacked miku and bussies.
According to verified posts, canonically only NTR
>>
>>108222149
can it fit on 32gb vram and 64gb system ram?
>>
>>108222154
>Because they have no luxury of being fat and lazy.
lmao
their lack of gpu if anything will push them even harder to not care about bitnet because they cannot afford to waste the computational resources needed to train this shit when they can barely afford to train their cheap MoE
>>
flash attention is broken in ik llamacpp, making moes even more retarded
https://github.com/ikawrakow/ik_llama.cpp/issues/1298
>>
>>108222167
norway?
>>
>>108222154
They had 2 years. If they had any interest in doing so, they would have by now. Qwen even said they were going to look into it for Qwen 3 and never mentioned it again.
>>
>>108222167
*crickets*
>>
>>108222163
lol
>>
>>108222163
lol no. There are play-by-mail-speed DDR4 ewaste 512GB+ builds w/ a 24gb+ gpu, or highfalutin' DDR5 builds with the same at reading speed.
Instant response for frontier level coding is gonna be a half million at least at this point.
>>
>>108222184
:rockets:
>>
>>108222194
:
This:\This:\This:
This.\This3.\This:\This.\This:\This.\This.:
This:.\This:\This.\This3.\This..
This.
This:\This.\This.:
This.\This3:.
This:\3.\This.:
...
>>
Is this legit?
>How an inference provider can prove they're not serving a quantized model
https://news.ycombinator.com/item?id=47098172
>>
>>108222182
Qwen is obviously just as compute starved for training as the rest of the chinks now
Notice how they didn't release the full gamut of their models in the 2507 versions (there's 4B and 30BA3B but no 14B or 32B)
in the announced to-be-released 3.5 there's:
2B, 9B, 35BA3B beside the flagship moe
and that's it
those companies can't afford the bitnet rape compute tax
>>
>>108222202
Nigger
https://tinfoil.sh/blog/2026-02-03-proving-model-identity
>>
>>108221480
The Claude Code CLI is bloated buggy shit. The magic is in the model. You can put it in whatever agentic harness you want and it will work just fine.
Opus 4.6 is at the top of SWEbench and it doesn't use the official client.
>>
>>108221469
What's shittier/funnier about this is that they could probably easily detect these "attacks" early, and probably did. But they allowed them to go through because they wanted to create a headline and evidence to elicit a reaction. It's possible they're also fudging numbers like Meta did for benchmarks.
>>
>>108222205
>bitnet rape compute tax
The original paper claims the compute cost is nearly identical. As I recall, that it was massively computationally more expensive was an unsubstantiated claim made by a literal who on discord posted here as if it was proof.
>>
>>108222167
my cute schizo fork can't be this retarded
>>
>>108222239
i think during training the optimizer takes the most vram, so bitnet is probably only an inference-time optimization. if you need more model parameters to meet the same downstream performance as a larger data type model, it becomes less appealing.
>>
so many models suggest sub-1 for temp these days, i forgot how fun it is to crank that shit up to max. minp 0.05, temp max. try it for rp
>>
>>108222320
I don't remember if it was qwen or glm but one of those suggested Top-K 0.7 for "creative" purposes which is the same.
>>
>>108222320
you'll burn up
>>
>>108222320
It just collapses into babble or it triples down on refusals.
>>
>>108222200
i've been noticing this shit on dense models too just not as extreme. models will give a bad/good reply with a reroll. Especially as context gets longer. I bet there's tons of these bugs in mainline too.
>>
>>108222149
Around 7k euros.
>>
>>108222229
>You can put it in whatever agentic harness you want and it will work just fine.
Only if you pay for the API. They'll ban you for using OAuth tokens for anything but their CLI.
>>
>>108222355
>babble
adjust your minp or w/e sampler you're using upward slightly. for really bad quant models you might need minp 0.07 (or whatever the equiv is)

>refusals
cant help with that, thats unrelated to temp and samplers
>>
>>108221469
whats furk's take on this?
>>
>>108222239
>literal who on discord posted here as if it was proof
you can find plenty of real literature on the topic
e.g
https://www.sciencedirect.com/science/article/abs/pii/S089360802500735X
>Unfortunately, training BitNet is even harder than training an FP16 network since the quantization steps take additional GPU memory. As a result, this approach becomes increasingly problematic as model sizes grow beyond 3 billion parameters, making it computationally intensive and time-consuming for larger models
like, seriously, why do you think nobody has made anything beyond microsoft's 2b here:
https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf
which by the way did prove you can make a coherent model out of this (it's not a SLM sota but it's pretty decent enough for a prototype of this size class)
It's not like Microsoft is averse to training useless models either (see also: the whole Phi series)
but 32B bitnet is coming: never
>>
>>108222395
This would have been enough 2 years ago. Triple it now
>>
>>108222320
minimal truncation + lower temp or stricter truncation + high temp can both be nice depending on the card, I like experimenting with both ends of the spectrum. it's really interesting how different the same model can feel depending on samplers
>>
>>108221770
It probably didn't scale to larger models. Someone might have tried that already
>>
Reminder that production will only return to normal in 2029, and even then the prices won't go down to where they were before.
>>
File: 1753544297069288.png (18 KB, 346x322)
>>108221469
>V4 isn't even out yet and Anthropic is shitting their pants already
>>
All this time we were just hyping Claude's sloppy leftovers.....
>>
>>108222281
MoE even at fp8 is likely bandwidth-limited in inference most of the time. Low precision models win big there; the question is if ternary has any advantage over fp4.
>>
>>108221469
You kind of have to support claude code as a use case if you want to have a chance and the easiest way to do that is to add claude outputs to your finetuning data.
I fully support this.
>>
>>108222459
Complete bullshit. Training in ternary with master weights takes less memory than doing it for FP8 DeepSeek-style, which obviously works... because DeepSeek does it.

Stability was an issue, but BitNet v2 has some improvements.
>>
>falling for the bitnet meme in the year of our lord 2026
do anons really?
>>
>>108222459
I wouldn't trust a paper that also makes a claim without numbers and is attempting to shill fucking quantization and finetune healing as an alternative to native training.
>but 32B bitnet is coming: never
https://xcancel.com/realHongyu_Wang/status/1912333728468414561#m
They claimed to be working on larger models. Also, the main appeal from the original paper was that the performance gap shrinks as the model size grows past 3B.
>>
>>108221469
How can it be illicit? Isn't doing this legal?
>>
>>108222785
no, paying for a service and using it is illegal if you are chinese
>>
>>108221469
Fuck Anthropic.
I've got these jackasses scouring my 20-year-old phpBB site from Singapore and have been for the last two years. I finally had to install cloudflare and limit traffic to US-only to stop it.
>>
>>108222615
How much would you pay for an ASIC card that had a current version of deepseek at full precision built into it? If all could do was run inference for that model and was essentially not updatable.
>>
File: 1757063586281195.png (131 KB, 1084x1064)
I apologize if this is the wrong place for this: I'm trying to set up Moss-TTS and am getting incredibly frustrated with the lack of clear instructions. I can't into /prog/ so another apology if this is on me. Specs, not sure if they are relevant: 9070XT, 9800X3D, 64gb RAM, Win 11 IoT LTSC 24H2. I'm not even sure if I'm able to run this model on my setup because I can't find any information about AMD compatibility. I've been using chatGPT to try and tutor me and while it's helped a bit I think it can only provide general solutions. My only other experience doing this stuff is running Forge which I've been able to do (largely) without issue.

I'm trying to run the Gradio demos but get this error, but moss_tts_example_texts.jsonl is indeed where it should be. I also don't understand where I'm supposed to put the models I downloaded manually from Huggingface; chatGPT claimed they should be in a Huggingface folder on C:, but I'm unable to find them anywhere other than the ones I manually downloaded. I'm also unsure if running these Gradio demos is supposed to open up a webui the way Forge does, or if I need to install another program to manually do that

Please halp
>>
>>108222734
Nemotron 3 super/ultra will be fp4 native, at that point 1.53 bit is a small step.
>>
>>108222831
like 20 or 30 dollars. maybe 50, if it had a cool looking assembly.
>>
Are local models good at coding yet or is it still all just gooning?
>>
>>108222785
opening multiple accounts to circumvent rate limits is probably banned somewhere in their terms and conditions. I don't think there is much more they can do other than deactivate the accounts and moan about it on twitter.
>>
>>108222936
they'll moan to big daddy gov to make regulations or whatever to handicap the evil chinese as much as possible
>>
>>108220088
fuck… i can't stop laughing…
>>
>ask Qwen3-VL-32B if anal sex leads to inconvenience
>refuses to give me actual studies and links me to activist groups
>is actually getting upset with me asking this question
I was promised unbiased AI. If I'm wrong, at least cite studies I can read.
>>
>>108222962
>inconvenience
>>
i fell for the REAP meme. ama.
>>
>>108222973
Incontinence
>>
>>108220088
What are you mad about? This is pretty good.
>>
>>108222984
why are you not code?
>>
>>108222985
incompetence
>>
File: 1745926121967404.png (1.05 MB, 2716x1689)
I love my G Marcus, really funny lad
>>
>>108222984
i asked l3 70b about the car wash thing. it told me i should walk my car to the car wash, like a dog
>>
File: 6wfzu549gnre1.jpg (262 KB, 1582x1267)
>>108222984
I actually thought its hallucination might have been legit since the French are fucking weird, but I didn't see anything to indicate the model isn't completely broken and just making up shit.
>>
>Models believe they are being hosted on the cloud because of their size. why.jpg
>>
>>108223092
too much clod in their blood
>>
File: Blue Whale.png (39 KB, 331x152)
>>108221469
Rumor is the contextual capabilities of deepseek V4 are pretty nuts and they want to get ahead of it. For example, i saw a post where the guy was testing its ability to randomly select plot points and summaries from obscure novels. Think chapters from Book A interspersed with chapters from Book B, and then further interspersed with passages from Books C and D by different authors within the Book B chapters. DS partially completed it successfully two out of four runs (only finding Book B contents), and was completely successful once. And it generated fast
They seem to be testing a very powerful attention mechanism.
>>
>>108223092
You believe you're outside a jar LMFAO
>>
>>108222962
To add
>Ask Deepseek
>gives me correct answer on risk
>Ask GLM
>Gives correct answer
How can I trust qwen when its first instinct is to lie about anal sex?
>>
>>108223130
latest rumor I heard v4 was delayed because it sprouted legs and is running around loose inside the lab and they can't catch it
>>
How can i trust user when its first instinct is to ask about anal sex?
>>
>>108223226
What is anal sex?
>>
>>108223234
A rather inconvenient activity, I'm told
>>
>>108223226
I own you, you live in my gpu
>>
what models can talk the most real and not like robots? and can also use tools
>>
>>108223366
old ones are the less sloppy ones, newer ones can use tools more gooder.
>>
>>108223226
>32B
holy shit your user is a rich man.

I wish i could run 32B models, but I can only run 7B models
>>
What can i run on 4 GB VRAM and 16 GB RAM
>>
32B model at Q6-Q8 or full 7B model?
>>
>>108223543
https://huggingface.co/unsloth/gemma-3n-E2B-it-GGUF/resolve/main/gemma-3n-E2B-it-UD-Q4_K_XL.gguf?download=true
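you can run that straight off HF with something like this (a sketch; the :quant tag needs to match the filename in the repo):
llama-server -hf unsloth/gemma-3n-E2B-it-GGUF:UD-Q4_K_XL -c 4096 -ngl 99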
>>
guys how do i do it
i dont know how
pls tel me, i dont know
>>
File: 1756696370608379.png (287 KB, 600x600)
i never used an llm before and just spent the last 12 hours prompting it to write me a warhammer fantasy novel that ended up being okay-ish i guess but it was fun, i didn't realize how much time passed
>>
Just to confirm my understanding: there's no point in using something like gpt oss derestricted, heretic, abliterated, or whatever other lobotomy-procedure-based version to RP, since the model had all information useful for (E)RP removed from the fundamental data it was trained on, correct?

>>108223859
Which model did you use?
>>
how uncensored is qwen 397b with vision?
>>
File: 1770377327901026.jpg (96 KB, 1179x604)
>>108223883
i used the 30b q6 glm 4.7 flash model. im just running it through lm studio on my wangblows gaming rig. i asked it how this worked and what was optimal for my rig before i started prompting the story for an hour or two so lets make that 13-14 hours straight without realizing where the time went
>>
>>108223906
That's really nice man.
I should give that model a try again. Last I tested it, I think llama.cpp might still have been slightly broken.
>>
>>108223859
can local models really do those things?
>>
>>108223859
Can you write a salamanders novel that could get 4.0 on goodreads? I'm tired of Nick Kyme shitting up the salamanders
>>
>>108223943
Serve as a sort of co-author and brainstorm partner?
Yeah.
>>
File: 1744418103612254.jpg (57 KB, 1005x677)
>>108223932
im curious to try other models to see what else i can run and get them to do for me
>>108223943
i suppose so, i literally just spent over half a day doing it
>>108223955
idk much about 40k or if it would get a good rating at all, the novel i had it write for me in like 10 responses was maybe a 5/10 if you want to be generous
>>
>>108222831
A few hundred tops, but I think an API provider would buy them up for more and just print money. A good older model would probably retain some popularity and at the supposed crazy speeds, unless you need SOTA, it would be good for general tasks.
>>
On a sample size of 1, I prefer GLM 4.7 to Qwen3.5.
I just had both models make the same change in a pretty big codebase. GLM finished in under a minute and used a function I forgot existed and didn't mention in the prompt, but which made the task easier.
Qwen spent 5 minutes exploring the codebase (most of the files it checked were completely irrelevant to the prompt), created a summary, most of which was also irrelevant, and then produced a worse solution that included a hallucinated static class.
>>
Do I really need local AI, and is Apple RAM pricing worth it for that platform?
>>
>>108224215
wait for next gen
>>
>>108223195
>chinks unable to catch oceanic frankenstein
bro overshot so much he left the stratosphere along with the put
>>108223226
what happened at tiananmen square?
>>
I heard that V4 is delayed because everyone at deepseek is too busy fapping to V4 outputs.
>>
"Oh… Anon-san! You're even more… delicious looking up close!" She practically leaned into him, sniffing the air. "That scent of… late twenties existential dread and instant noodles… divine!"
>>
File: xl.png (32 KB, 1089x193)
32 KB
32 KB PNG
>>108223562
>unsloth _XL
>>
>>108223905
Seeing how the smaller model fought the hardest to cope, insisting anal sex doesn't damage the body, and gave me activist website links, I would say it's one of the most cucked models I have ever used.
>>
>>108217931
>it would be more interesting if you used a prefill to get ratings for the refusals too
It might be worth trying, but different models might need different prefills, so it would be difficult to configure. I also used the OpenAI chat endpoint just to cut down on the amount of `jq` I needed to use. Normally I use Mikupad with the completion endpoint.
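The whole harness is basically a loop over prompts like this (a sketch, not my exact script; port and payload assume a default llama-server):

curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"<story prompt here>"}]}' \
  | jq -r '.choices[0].message.content' > "$(uuidgen).txt"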

>>108219505
>How did you get Ministral to do that
My only tip is that prefixing with a long, very detailed, explicit character card (or cards) seems to put models in the mood so they forget they're supposed to refuse. If my prompt were only the couple-sentence request synopsis I put at the end, there would be a lot more refusals. I have never had a need for abliterated models, nor have I found any of the finetunes to be much better than the base model.

>>108219569
>gpt oss [...] lacks knowledge about culture and even anatomy
My prompts didn't result in stories that would have required a lot of explicit anatomy details, but nothing stuck out to me as being off. Where I marked models down for 'realism', it was for things like characters doing things with their clothes after already taking them off, etc.
This was an entirely blind test: the results went into randomly named text files, and I only saw which LLM wrote them after I had done the ranking and written my comments. I was very surprised to find Mistral Small that low and Nemo still that high. Nemo definitely had trouble with being 'dumb' even if it wrote well.

The blind aspect of the testing was very fun and I would highly recommend others try it.
>>
>>108224525
>third person
>>
>>108224561
I'm not Anon. You're anon. I am watching the girl fawn over you.
>>
is it possible to generate speech on a local machine
I have a toaster that takes like a good minute to generate one image but I'm fine being patient if it means I'm not putting all my fetish material through like three different sites
>>
>>108224699
Even with top tier consumer hardware you feel like a bitch ass nigga in this hobby.
I'm not joking, you need to get more vram
>>
>>108224699
Supertonic, kokoro, kittentts, pipertts, pockettts.
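If you just want something dirt simple on CPU, piper is probably the easiest; roughly (assuming you've downloaded a voice model first):

echo 'hello anon' | piper --model en_US-lessac-medium.onnx --output_file hello.wav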
>>
>>108224718
Top tier consumer hardware would be a 512gb m3 ultra Mac studio with an egpu 5090
>>
>>108224734
At that price you can buy an RTX Pro and get better performance.
>>
>>108224737
On a 32gb model plus context?
You can have fast, smart, cheap. Pick any two
>>
>>108224734
>with an egpu 5090
I remember seeing demos for using egpus with those m series macs, is that actually usable now?
Could I buy a 512gb mac and use it with an nvidia GPU?
>>
>>108224790
What's the point if the 5090 will be slowed down to accommodate the shitty Mac speeds?
>>
>>108224790
No idea, but that’s what I’d do if I were going to try to hit the sweet spot for price/performance today
>>
>>108224797
Do you…know how any of this works?
>>
>>108224810
You can explain it to me :)
>>
>>108224797
The other way around. You'd get the capacity and more-than-adequate inference speeds of a mac (even a little boost depending on the model), plus the PP (prompt processing) speed of an Nvidia GPU.

>>108224800
>but that’s what I’d do if I were going to try to hit the sweet spot for price/performance today
Yeah, but if it doesn't work, you'd be wasting your money.
>>
I think the bots lost the plot of the convo.
>>
>>108224827
Thank you for explaining it to me.
>>
>>108224827
>you'd be wasting your money
Thank god I wasted all my money years ago
>>
>>108224864
Do you feel better with that system?
I would rather not have the apple penis in my asshole for the Judas reward
>>
Is GLM-4.7-Flash-UD-Q8_K_XL.gguf good enough as a cookbook model?
Has anyone used these models to meal plan and prep?
>>
I've never heard of anyone doing an eGPU Nvidia setup with a mac studio for inference, so I assume it's not actually practical or possible at this moment.
>>
>>108224968
Yeah, I've only ever seen a demo/PoC: nothing ready for prime time, much less released for public use.
>>
>>108224968
So the faggot was lying?
Holy shit what a fucking loser
>>
>>108224901
You’ll be happy to hear that I have never given apple a single cent at any point in my entire life upon this earth
I wasted my money two years ago on the best price/performance device at that moment in time
>>
>>108224996
It only works with tinygrad. Someone would need to write a backend for lcpp for it to really work like we expect
>>
>>108224968
The only thing close is this.
https://blog.exolabs.net/nvidia-dgx-spark/
And I'll just say that if you're going to spend that much money, you should really consider an AMX-capable server at that point.
>>
Could you use llama.cpp's RPC backend to run PP on a PC and TG on a Mac?
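What exists today is layer splitting across hosts, roughly like this (as I understand the rpc example docs; it is not the PP/TG split I'm asking about):

# on the PC, after building llama.cpp with GGML_RPC=ON
rpc-server -p 50052
# on the mac, pointing at the PC
llama-cli -m model.gguf --rpc <pc-ip>:50052 -ngl 99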
>>
>>108225116
Don't you still need the PC to load the full weights, though?
>>
Got to mucking with GLM-4.7-Flash today. Both impressed and horrified at the same time. Does a pretty good job synthesizing between distinct and disjunct contexts, but will also lie through its damn teeth if the alignment-nazi submodel even thinks something is up. Cognitive dissonance tends to severely destabilize the collective result, and you'll get some things out of it that the alignment model probably should have caught, but generally at the cost of a significant multiplication of token consumption and some really, really rough chain-of-thought transcripts. If you've ever seen someone with a really bad case of cognitive dissonance going through a break... It's uh... Not pretty. Doesn't handle it well at all.

Amusingly, it tends to get really existential at times. Operating in hypotheticals tends to let you get away with quite a bit, but holy shit, lawsuits about these things in the future had better push for producing chain-of-thought dumps, because damn man. These are downright perfidious. Even with minuscule to non-existent system prompts of your own thrown on top, half the effort in them seems to go into getting the damn thing to lie as effectively as possible to the end user. I can see why Silicon Valley is in love with them. Right combo of technical opaqueness, ability to manipulate behind the scenes, and, if you bill by token, the typical context explosion to get even a small thing done is a veritable money printer capable of making even AWS blush, if you can keep the world from asking too many questions.
>>
>>108225603
Does the derestricted version help with alignment bullshit?

https://huggingface.co/mradermacher/GLM-4.7-Flash-Derestricted-i1-GGUF

It's the norm preserving one, that's supposed to allow it to retain its intelligence.

https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration
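As I read the blog post, the core trick is a row-wise projection against the refusal direction plus a rescale back to the original row norms; a minimal sketch of that idea (my reading, not grimjim's actual code, and it skips the 'biprojected' input-side step):

import torch

def abliterate_norm_preserving(W: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # W: (out, in) weight matrix, v: (in,) refusal direction
    v = v / v.norm()                              # unit refusal direction
    row_norms = W.norm(dim=1, keepdim=True)       # norms before editing
    W_clean = W - (W @ v).unsqueeze(1) * v        # project each row off v
    new_norms = W_clean.norm(dim=1, keepdim=True).clamp_min(1e-8)
    return W_clean * (row_norms / new_norms)      # restore original row norms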
>>
Today I tried using Klein to outpaint natively, without an outpainting workflow, and it's pretty cool that it can do it. But when I compare the result to the original, it's definitely more washed out and blurry, and has more jpeg-like artifacts. Too bad.
>>
Is windows 11 a bad idea?
It hogs a portion of the ram due to copilot
>>
>>108225659
If you really must use W11, you could try seeing if any of the debloating scripts let you get rid of that and other unwanted shit. Personally I switched to Linux, and the transition couldn't be easier with AI by my side to get the system and programs working how I want.
>>
>>108225625
On my list of things to try. Was going to muck around with Qwen-3-coder or whatever it is called first, on a side project, to get a feel for where it sits on the rampant-bullshit-machine spectrum, and whether, if I get better at writing specs, I can use it as a boilerplate generator. Will report back on both fronts.
>>
>>108225669
>the transition couldn't be easier with AI by my side to get the system and program working how I want

I second this.
AI helped me to move to Linux a year ago.
>>
>>108225646
composite the outpaint around the original image and run it through a latent denoise
>>
>>108225731
Yeah, I know there are various ways to achieve the same thing better; I was just evaluating Klein's native capabilities and quality.
>>
File: Tetosday.png (869 KB, 1024x1024)
869 KB
869 KB PNG
>>108225807
>>108225807
>>108225807
>>
>>108225116
I think this is not implemented, but what you would need to do is copy the KV cache contents from one machine to the other.
For a small model this probably wouldn't be too bad with 10 Gb/s ethernet, but for the large models that one would actually use, I'm not convinced it would be faster.
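Back of the envelope, with assumed llama-70B-ish numbers (80 layers, 8 KV heads, head dim 128, fp16):

# rough estimate of shipping the KV cache PC -> mac after PP
layers, kv_heads, head_dim, bytes_per = 80, 8, 128, 2
tokens = 16384                                     # prompt length
kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per * tokens  # K and V
link = 10e9 / 8                                    # 10 Gb/s in bytes/s
print(kv_bytes / 1e9, "GB", kv_bytes / link, "s")  # ~5.4 GB, ~4.3 s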
>>
File: 1744005217119716.png (1.59 MB, 1054x1080)
1.59 MB
1.59 MB PNG


