/g/ - Technology

File: 1289005993449.jpg (195 KB, 800x1100)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108186120 & >>108175259

►News
>(02/16) Qwen3.5-397B-A17B released: https://hf.co/Qwen/Qwen3.5-397B-A17B
>(02/16) dots.ocr-1.5 released: https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5
>(02/15) Ling-2.5-1T released: https://hf.co/inclusionAI/Ling-2.5-1T
>(02/14) JoyAI-LLM Flash 48B-A3B released: https://hf.co/jdopensource/JoyAI-LLM-Flash
>(02/14) Nemotron Nano 12B v2 VL support merged: https://github.com/ggml-org/llama.cpp/pull/19547

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>108186120

--Papers:
>108194842
--Porting IQ*_K and IQ*_KS quants from ik_llama.cpp:
>108186634 >108186693 >108186827 >108186850 >108186897 >108186914 >108186933 >108186936 >108186941 >108186989 >108187058 >108187118 >108187178 >108187221 >108187228 >108187242 >108186814 >108186873 >108192527 >108192588 >108192730 >108192752 >108192773 >108192835
--Debating AI memory and writing style solutions:
>108190281 >108190382 >108190418 >108190459 >108191423 >108190475 >108191738
--Debating em dashes as AI writing indicator:
>108187936 >108187995 >108188031 >108188139 >108188143 >108188163 >108188165 >108188210 >108188294 >108188720
--Extracting LoRAs from finetuned models using MergeKit:
>108188651 >108188666 >108188671 >108188685 >108188728 >108188763 >108188832 >108189163 >108188772 >108189938
--Jetson Orin Nano LLM struggles, domain-specific models debated:
>108192160 >108192167 >108192227 >108192251 >108192283 >108192321 >108192371 >108192382
--Fine-tuning advancements and niche dataset challenges:
>108192287 >108192339 >108192388 >108192340
--Claude Code agents making unauthorized changes and self-responding:
>108194060 >108194065
--Minimax M2.5 outperforms GPT-OSS-120B and Qwen-Coder-Next in server automation task:
>108193519 >108193621
--Qwen3.5's self-monitoring reasoning process:
>108192528 >108192541 >108192595
--LLM Arena adds open-source filter but lacks parameter size options:
>108191494 >108191596
--Qwen3.5 non-thinking syntax context handling:
>108188539 >108188567 >108188668 >108188710 >108188659
--Google's timesfm-2.5-200m-transformers model release:
>108192228 >108192238 >108192308
--mlx-lm overtaking llama.cpp for Mac Studio model support:
>108188625
--Rin and Miku (free space):
>108187257 >108187262 >108187464 >108187541 >108187623 >108188448 >108190034 >108192219 >108193602

►Recent Highlight Posts from the Previous Thread: >>108186122

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Has anyone tested how much (or if) RAG can decrease the perplexity of a completion model?
>>
>>108194930
Decrease the perplexity relative to what? To the contents you just shoved in the context? Yes. Presumably it would.
Are you sure you understand what you're asking?
>>
Last time I was here, people were still disappointed with the Llama fiasco; what has changed since then?
What's the meta for small models (5B-20B)?
>>
lol finetrooners are funny
https://blog.dphn.ai/xgen-rl/
was wondering what some of the early niggas were doing these days now that the only living, staying relic has been drummer
>What’s more, we found the model to be MORE censored than the base model itself, achieving a much higher rate of refusal on our benchmark, which generates replies from the model in both multi-turn and single-turn scenarios and evaluates responses via a classifier.
the dolphin guy puts so much "effort" into his "uncensors" when a prefill on a regular model does a far better job than he ever did (and for promptlets, there's always heretic, which is.. decent)
>>
>>108194939
To the text being completed, given some standard (constant) generic knowledge base or search engine.
Suppose you have some text. You could concatenate the actual text to the information retrieved by RAG and compare.
So the question is how does perplexity(text, probs(text), (0,len(text))) compare to perplexity(concat(RAG(text), text), probs(concat(RAG(text),text)), (len(RAG(text)),len(RAG(text))+len(text)))
>>
>>108195001
ai psychosis
>>
GLAWKSS
>>
>>108194991
Same as always, Gemma 3 or Mistral.
>>
>>108195001
Sigh... if you shove information you know to be correct in the context, the language model is more likely to output correct information. All models are completion models.
>>
Gemma 3+1 soon
>>
>>108195034
Gemma3+mc^2
>>
>>108194993
I wouldn't be surprised if western companies explicitly train their models to detect disruption from tuning/adapters.
>>
>>108194845
*tickle tickle*
>>
File: completion-model-rag.png (462 KB, 1360x2348)
>>108195012
I'm sorry for your developmental difficulties dude.
>>
>machine, confirm my beliefs!
>>
>>108195031
Not necessarily. It could cause the model to imitate the style of the reference documents rather than the actual text. Or it could decrease the performance of the model due to longer context.
>>
>>108195061
ai psychosis
>>
>>108195072
>>108195067
You are literally dumber than said machine.
>>
said the retard asking an llm
>>
>>108195076
ai psychosis
>>
>>108195061
Anon. I'm the first User 2 (>>108194939). Someone else is the second.
Neither the model nor you figured that out.
>>108195071
>could could could
>>108195031
>Sigh... if you shove information you know to be correct in the context, the language model is more likely to output correct information. All models are completion models.
still stands.
>>
>>108195077
I already knew the answer. I just asked the machine to explain to you why you were wrong for your own benefit.
>>
>>108195092
ai psychosis
>>
>>108195084
And I had to spell it out for you because you were too dumb to understand what I was even asking. So I guess we're even.
>>
>>108195100
nigger
>>
>>108195103
I said the same thing, rephrased, in the first post.
See >>108194939
>To the contents you just shoved in the context? Yes. Presumably it would.
>>
>>108195001
>To the text being completed
The perplexity of the text being completed relative to the text being completed is zero.
>>
>>108195113
Presumably "the contents you just shoved in the context" meant the RAG results, which wasn't what I meant.
>>
>>108195124
That's what RAG is. You query some database, fetch results by cosine similarity on embeddings or whatever, shove the text corresponding to the embedding into the context, let the language model do what it does. Complete text.
If that's not what you meant, then your idea of RAG needs to be corrected.
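For anyone who wants to see the plumbing, a minimal sketch of that loop (sentence-transformers for the embeddings; the model name, docs and query are placeholders I made up):

# minimal RAG loop: embed a query, grab the most similar docs, prepend them to the prompt
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder embedding model
docs = [
    "The corner car wash opened in 1998 and runs 24/7.",
    "Perplexity is the exponentiated average negative log-likelihood.",
]
doc_emb = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query, k=1):
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_emb @ q                 # cosine similarity, since everything is normalized
    return [docs[i] for i in np.argsort(-scores)[:k]]

prompt = "When did the car wash open?"
full_prompt = "\n".join(retrieve(prompt)) + "\n\n" + prompt   # shove the hits into the context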
>>
>>108195120
Decrease the perplexity (of the model's predictions for each token in the prompt) relative to the text being completed (the prompt). Are you just a troll/contrarian or did you genuinely not understand what I said?
>>
>>108195136
So you DID mean the RAG results by "the contents you shoved". Which means you did NOT say the same thing rephrased.
In this scenario you are not measuring ppl over the RAG prefix, only over the prompt which is being prefixed.
>>
>>108195141
Anon... your question makes no sense, and I should have called you a schizo from the beginning.
>>108195012
>>108195072
>>108195081
>>108195100
You were right.
>>
File: surgeon.png (230 KB, 1221x1587)
wew
>>
>>108195195
lmao, it managed a lengthy write-up where every stated fact is wrong
What model is this? Surely a 12b or below.
>>
>>108195195
AGI soon bros
>>
>>108195217
Gemma3 27b
>>
>>108195250
Gemma very good! Tbh depends a lot on the system prompt and if you give it a hint "this is a riddle".
I tested "U.N. Owen was her, devil may cry and the surgeon said "I can't operate on this child", god why?" and at first it thought that was part of the rp scenario and began to describe some bullshit.
After resetting and giving it a hint it was able to successfully decipher this one.
But the more you play with these smaller models, the more stupid they start to feel.
>>
>>108195154
What part do you not understand?
Process the prompt using the model. Get the probs for each token in the prompt. Compare each predicted token to the actual next token. Calculate PPL.
Then prepend the RAG results to the prompt, run RAG results + prompt through the model and get the probs for each token in the prompt (keeping the RAG results as context but not getting probs for those tokens). Compare each predicted token in the prompt to the actual next token. Calculate PPL.
Compare both PPL values.
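Since we keep talking past each other, a rough sketch of that exact measurement with transformers (the model name and the strings are placeholders; the prefix tokens are masked out so only the prompt tokens count toward the PPL):

# PPL over the prompt tokens, with and without a RAG prefix sitting in the context
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def prompt_ppl(prefix, prompt):
    n_prefix = len(tok(prefix).input_ids)              # 0 when there is no prefix
    ids = tok(prefix + prompt, return_tensors="pt").input_ids   # note: the prefix/prompt seam may tokenize slightly differently
    labels = ids.clone()
    labels[:, :n_prefix] = -100                        # prefix stays in context but isn't scored
    with torch.no_grad():
        loss = model(input_ids=ids, labels=labels).loss     # mean NLL over the unmasked tokens
    return torch.exp(loss).item()

prompt = "..."       # the text being completed
rag = "..."          # whatever the retriever returned
print(prompt_ppl("", prompt))             # baseline PPL of the prompt
print(prompt_ppl(rag + "\n\n", prompt))   # PPL of the same prompt with the RAG prefix in context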
>>
>>108195321
What do you mean?
>>
>>108195321
then fucking get to the point.
we're not here to be your fucking slaves.
if you are going to bring a useful thought then state it, you are wasting everyone's time here.
and your fucking AI analysis can go fuck itself too, we're not obligated to give you an answer.
>>
>>108195321
ai psychosis
>>
>>108195372
kek

>>108195378
I thought you had understood it when you said "well yeah obviously it will". Or was that somebody else?
It was part of a broader thought about what "the model" is, from when I was thinking about finetuning and personalization options.
We can compare things like RAG or prompt engineering somewhat fairly with things like finetuning just by treating the whole thing as a tokens-in, probs-out function and measuring the accuracy.
>>
>>108195321
giving the model more context almost always reduces the ppl. shorter sequences are simply harder to predict. your test will change the ppl but your results are essentially meaningless, unless the ppl goes up, in which case you know the model really sucks at rag and polluted its own context. why don't you just test it?
>>
>>108195321
>Calculate PPL.
Not that anon, but can you explain how you calculate it?
>>
>>108195541
I have ideas more often than I have time to try them out.
I did recently try one of my ideas though.
https://desuarchive.org/g/thread/108088802/#108097306
The result was that teaching the model to predict the user's moves while masking out the loss from the assistant tokens decreases the number of legal moves during its own turn, at least using LoRA. But the user response also had the board state on every message so the responses weren't in the same format.
Now thinking about it, maybe one variation I could try would be to switch the assistant and user turns so during training the model sees the user's response as if it was generated by itself.
>>
>>108195683
ppl = exp(-(1/prompt_len) * sum_{i=1..prompt_len} log p(token_i | tokens_1..i-1))
>>
>>108195732
>decreases the number of legal moves during its own turn
so you broke the model? are you trying to train it on sequences generated from a model playing against a chess engine? wouldn't the correct thing to do be to just train it on pure sequences from a proper chess engine playing against a chess engine, so all the moves are legal in the training data?
>>
I had ego death
>>
>>108195820
thanks
>>
File: HF.cpp.png (295 KB, 886x1286)
https://x.com/ggerganov/status/2024839991482777976
georgi found an exit
>>
>>108195832
it's so over
>>
>>108195779
Yes, I broke it when used as assistant. It learned to play chess as user but that broke the pre-existing chess knowledge it had as assistant.
>wouldn't the correct thing to do be just train it on pure sequences from a proper chess engine playing against a chess engine so all the move are legal in the training data?
Yes, but the chess was just to have a task to train on and evaluate objectively. The goal was to see if training on Stockfish-generated chess moves in the user messages would make the assistant better at being an assistant, at least in an idealized scenario where both assistant and user had the same task (generate good chess moves). And even in that simplified scenario the role header (user or assistant) made the model better only when generating under the header it was trained on (in this case user).
The dataset had, in the assistant role, the messages the model had actually generated during inference (but masked), and in the user role the moves Stockfish generated in response to the original model's own moves.
>>
>>108195832
>https://ggml.ai/
>The development process is open and everyone is welcome to join. In the future we may choose to develop extensions that are licensed for commercial use
>>
>>108195855
>help foster new opportunities for users and contributors
> improving user experience and integration with the Hugging Face transformers library for improved model support
>>
>>108195832
https://github.com/ggml-org/llama.cpp/discussions/19759
>>
File: file.png (28 KB, 886x154)
>>108195832
:rocket:
>>
>>108195832
oh is this why slaren quit?
>>
>>108195877
No, it was due to pregnancy.
>>
>>108195863
I can't wait to subscribe to ggml PRO to get access to even a semblance of model support while also adding a hard python dependency.
>>
>>108195832
Good on him.
Here's hoping this doesn't end up ruining llama.cpp eventually, though.
>>
>>108195832
he is right to cash out before the bubble pops
shit is already unstable
>>
>>108195852
masking the loss will not punish the model for mispredicting it, but it's still part of the context and the model will still learn from it. the model weights are being updated to make the response more likely given the instruction, so the model's internal representation of the masked instruction becomes more refined even if it isn't being punished for it directly.
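for reference, this is all that masking is in a HF-style setup, as a sketch (the model and the chess strings are placeholders):

# loss masking: the user turn stays in the input (so it still shapes the hidden states),
# but its labels are -100, so the cross-entropy loss is only computed on the assistant tokens
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

user_ids = tok("User: e2e4\nAssistant: ", return_tensors="pt").input_ids   # masked part
asst_ids = tok("e7e5", return_tensors="pt").input_ids                      # trained-on part
input_ids = torch.cat([user_ids, asst_ids], dim=1)

labels = input_ids.clone()
labels[:, : user_ids.shape[1]] = -100      # -100 = ignored by the loss

loss = model(input_ids=input_ids, labels=labels).loss   # loss computed only on assistant positions
loss.backward()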
>>
File: VibeBench.png (45 KB, 1811x662)
>>
>>108194845
feet
>>
>>108195946
>pass in chinese
Huh?
Cool shit though, thank you for sharing.
Could you test Qwen 3 Next instruct and thinking as well?
>>
>>108195946
I was pretty impressed with nanbeige but I found its safety cucked, and liked to ramble on too much.
>>
>>108195832
IK meltdown in 3...2...1...
>>
>>108195946
Interesting choice of models. I bet nemo beats all. It would be nice to have a common sense bench that evaluates not only such riddles but also stuff like emotional and spatial intelligence.
>>
>>108194845
>OP pic
遠慮します。
>>
>>108195832
On one hand it's over; on the other, maybe we'll get actual multimodal support now
>>
>>108195863
>integration with the Hugging Face transformers library for improved model support
literal cancer
there is no worse code base in the world than transformers.
also
>Better packaging and user experience of ggml-based software
what's wrong with the user experience right now??
cuckoldganov makes me consider moving to ollama and pray they make their own replacement for ggml after having replaced llamer.cpp model impl
>>
>>108195946
Gotta try this Nanbeige thingy. Its size is perfect for abliteration tests.
>>
>>108195966
>liked to ramble on too much.
I'd like Nanbeige better if they had a non-thinking variant like Qwen did with their 2507 instruct.
Reasoner models as a whole are insufferable to use, and this one in particular yaps as hard as, or harder than, the original R1.
>>
>>108195951
Anything above 30B is too big for me, which is why I didn't test Qwen Coder further (it was an awful 40B REAP)
>>108195966
>>108196089
There must be a demand for CoT finetunes of Nanbeige, but it might be hard to make it think faster without lobotomizing in the process. There is one such finetune for 4.0 but that's just your average 3B model.
>>108195980
The cup riddle seems to be easier for models with vision.
>I bet nemo beats all
Mistral-Nemo? Fails all except for the last question.
>>
>>108196086
See >>108128796
ollama seems to be planning to replace ggml with MLX.
>>
Why would I not buy 3-4 ASUS Ascent GX10s and host anything I want?

Perplexity says it runs for 3€ per month on electricity bills and can easily run the latest full open-source models like glm-4.7 or kimi 2.5.

Where's the catch? Seems too good to be true imo.
>>
>>108196144
wasn't the spark a cucked blackwell chip? i remember reading something about it not using the same arch as professional blackwell cards and instead having DLSS etc. shit.
i think the thor has better CUDA sm compat, but i also read somewhere that its slower than the spark?
>>
File: asexpectedofgoogle.png (1 KB, 164x68)
>>108195946
heheh
>>
>>108196144
For that much money, I'd rather buy a 512gb mac.
>>
>>108195946
Kinda interesting but would be more meaningful if you gave each model at least a few attempts
>>
>Logged your complaint. Have a nice day.
kek. That's a cute response to accusations of being useless
>>
>>108195924
In theory yes, but maybe only training with a certain context (in this case the user turn header) makes the model specialize to that context at the expense of other possible contexts. Like how training a base model with a chat template makes it worse at predicting text in general.
>>
>>108195832
Is HF going to buy him some GPUs now so he can test his PRs himself?
>>
>>108196253
It is consistent enough; 1 hit out of 10, like with GLM-4.7-Flash, isn't much. (I tested different versions of it and it was mostly random.)
>>
>>108195832
Is this because of the schizo fork?
>>
>>108196118
>Anything above 30B is too big for me
Shame.
You could try Kimi-Linear I think.
>>
>>108196144
Obsolete in 2 yrs.
>>
>>108195946
>3B
By chance does it pass better than average because it's small and not stuffed with slop, like, is it unaware of the "gender expectation" meme?
>>
>>108196329
No? IK has been around for ages
>>
>>108196356
might be to have protection from a big corpo during the oncoming legal battle
>>
>>108196366
there is no "oncoming legal battle"
>>
>>108196392
You're absolutely right!
>>
File: aware.png (131 KB, 1377x701)
>>108196341
If anything it is always *too* aware.
>>
>>108195946
That's a nice Qwen. Too much thinking tho
>>
>>108195832
OWARI DA
>>
File: 1765287827526172.png (144 KB, 312x392)
>>108195832
Apologize.
>>
>>108196501
literally her fault tho?
>>
their latest melty was not that entertaining tho
>>
>>108195832
CUDA dev, now that ggml-org was acquired by HuggingFace, how much of that money went to you for all the work you contributed to the project? Are you in the triple comma club?
>>
>>108196541
As of right now I do not have any financial ties with HuggingFace.
>>
>>108195947
i had the same thought
>>
how many more months before we can run gpt 5.2 locally with a 5090?
>>
>>108196338
Kimi-Linear IQ3_XXS from bartowski only passes the third question, where it is explicitly stated the doctor is the father.
>>
>>108196595
>5.2
That model is slopmaxxed unless you're a mathematician.
>>
>>108196615
ok what about gemini 3 pro
>>
>>108196607
I'm surprised it passed anything at all, lol.
>>
> https://taalas.com/the-path-to-ubiquitous-ai/
>makes custom hardware that can run a local llm super fast (17 000 token / s) for cheap
>it's 3bit quantized Llama 3.1 8B
why not at least something like Qwen 4B at Q8, that actually would be useful
it's pretty legit impressive though to see answers come that fast:
https://chatjimmy.ai/
>Generated in 0.065s • 15,770 tok/s
>>
>>108196556
I hope you get your bag as well tho, not just the vibecoders.
>>
>>108196556
NTA but any plans to scuttle your code and go ghost for the lulz over this?
>>
>>108196556
joahnnes bros... WE LOST!!!
>>
>>108196673
I have already made a substantial amount off of my work.
Though quite honestly I don't particularly value money in the first place; there is nothing that I would want to spend it on other than computer hardware.

>>108196684
No?
>>
>>108196649
It takes longer to design custom hardware than to train AI? It's still pretty exciting to see something like this, assuming it's true.
>>
>>108196712
but you're using poverty tier hands me down hardware
>>
File: file.png (553 KB, 2470x1199)
>>108195946
Qwen3.5 non-reasoning and reasoning versions. The only one it gets wrong is the car without reasoning.
>>
>>108196649
Not very useful unless new models stop coming out every quarter.
>>
>>108196712
You're too pure for this world.
>>
>>108196684
>scuttle your code
Do you understand how git works?
>>
>>108196731
he could just throw a karkrow "I demand my code be removed" and there you go cuda anything becomes toxic waste
>>
>>108194993
Dolphin uncensors work pretty well whenever I've tried.
Even with prefill, models won't generate content about certain topics. They won't refuse, but they will generate stuff around it.
Abliterated models don't work either because they start spewing nonsense.
>>
https://github.com/ggml-org/llama.cpp/pull/19374
is that thing working for others here? I was curious and tried it since I have a cluttered model list from my FIM and coding models but it doesn't hide them from the chat drop down menu, it just half breaks the menu (clicking the model loads it, but it doesn't show as loaded in the UI, and there's no unloading function)
>>
>>108196649
I always knew SRAM was the future.
>>
don't mind me im just a nomad scouring 4chan to see if the tampermonkey bros have found a way to bring "sort by upload date" back to youtube.
>>
>>108196726
it looks like it can load models, we would just need the architecture and size to stabilize so new models don't mean everything changes. it would have to be just the same stuff with a new knowledge cutoff date.
>>
i tried using gpt-oss-20b-heretic for info on cyanide and holocaust things but it just invents shit. are models that size just hopeless or is gpt-oss just a bad choice?
>>
File: file.png (493 KB, 448x600)
Still no weights.
>>
>>108196857
toss likely had all of that scrubbed clean before training and a good dose of brainwashing after to be safe
>>
What is the most uncensored LLM that runs on 6 GB VRAM and doesn't hallucinate like a schizo on DMT?
>>
File: file.png (229 KB, 451x210)
What would be a good nerd equivalent of a boxing match? Something where they can both compete, one of them can win, and then they make up and become best friends.
>>
>>108196918
Not much of an expert but I'd say mistral nemo quant for jerking off, gemma-3n-E2 or qwen3-4b quant for everything else
>>
File: file.png (247 KB, 800x1560)
>>108196418
dayum

>>108195946
Oddly, Flash 3 on direct AI Studio acknowledges the mother (but offers the possibility of 2 fathers, which is irrelevant and therefore wrong for the question), while OpenRouter is stuck on "lol 2 fathers". I swiped a few more times to make sure.
>>
>>108196918
>doesn't hallucinate like a schizo on DMT
you are asking for the impossible
smaller models like Qwen 4B are useful for tasks like summarization or basic translation (not going to produce high literature), or tagging your vacation pics if it's a VL
they're not useful as knowledge bases and they're beyond useless for tasks like coding
with that said, if you still insist, in my testing of small models I found Gemma 3N (both as E4B and E2B) to be the most knowledgeable models of that size class. They're more knowledgeable than the small MoEs you could run too. However, as tools, I find the Qwens plain better. The Gemmas misbehave like crazy past around 10k tokens, which makes them godawful local summarizers, for example
>>
>>108196934
Gay sex
>>
>>108196938
what about for 24gb vram and 64gb ram
>>
>>108196999
unironically ignore the moessissies and give mistral small a shot.
>>
>>108196999
ignore the guy who bought multiple gpus and run glm 4.5 air
>>
>>108196825
throw more money at dreams
>>
>>108195832
Hugging Face killed Papers with Code, so I'm worried they will kill llama.cpp or turn it to shit.
>>
File: image_2026-02-20.png (15 KB, 481x289)
>>
File: cellphone_girl.jpg (2.17 MB, 2171x2505)
Anyone else running on mobile hardware?
Found this guide, not sure if it fell out of the OP or if everyone is desktop-only https://rentry.org/tysLocalGuide
>>
File: file.png (44 KB, 1103x182)
>>108197130
>sophisticated knowledge
such as?
>>
>>108195946
10k+ tokens.
for 4B it's super impressive tho. I could try running a couple in parallel.
>>
>>108197135
This screenshot reads like it wasn't written by an LLM but it was written by a pajeet.
>>
https://huggingface.co/ThalisAI/Nanbeige4.1-3B-heretic
I'm trying it.
>>
>>108197135
zoomers were a mistake that should be unbirthed
>>
File: file.png (11 KB, 129x259)
Everyone complaining about vibecoders but the most glaring issue is still unaddressed.
>>
>>108197130
I ran mobile back in the day just long enough to fix a regression for a PR I made to lcpp, but it was and is trash for anything general intelligence. Maybe a super specialist model could do something at that size?
The real way to run mobile is with a VPN back to your giant inference server at home. I use wireguard and ooba via nginx.
>>
>>108197135
Being on the RHS of the bell curve
>>
>>108197135
>sophisticated technical knowledge to assemble and provision the computer
is this a joke? is someone having a laugh
>>
File: nanbeige.png (141 KB, 1227x799)
10k tokens for this LMAO
>>
File: IMG_2817.jpg (3.32 MB, 4032x3024)
>>108197130
Why does 4chan hate my computer?
>>
File: nanbeige2.png (60 KB, 856x453)
>>108197245
This model jesus
>>
>a business could never be close to the residential area
nanbeige is an honorary amerimutt
>>
File: file.png (674 KB, 1024x542)
>>108197200
>I ran mobile back in the day
I hope that just before internet becomes illegal to use /lmg/ turns into a thread overran by saars running gemma-5-5B1A on their second hand top of the class current year iphones
>>
>>108197288
>saars running gemma-5-5B1A
IBM has released something close to that and even more SAAR friendly
https://huggingface.co/ibm-granite/granite-4.0-h-tiny-GGUF
7BA1B model crafted by the noblest of SAAR corporation
>>
>>108197331
>posting tiny when micro exists
https://huggingface.co/ibm-granite/granite-4.0-h-micro-GGUF
>>
File: tako & shite.jpg (278 KB, 560x560)
>>108197130
That is hilarious. Kinda curious about the perf of a 8B on top iphone, silicon isn't far off the laptops no?
>>108197226
Track down tys and let's get to the bottom of this
It's zooms being technically retarded again
>>
>>108197363
micro is a dense model and is not as meme worthy as a A1B MoE (that doesn't even do anything better than the 3B you linked)
>>
>>108197269
It's literally there in the error hfs. Disable your adblock on this site/thread or adjust the filters
>>
>>108197388
>just unblock the sus domains bro, just do it
>>
>>108195946
I'm calling BS on your bench, Nanbeige keeps telling me to walk.
>>
>>108197245
I unironically live 50 meters from a car wash. I can see someone washing their car right now in -5C weather.
>>
>>108197423
why lie on the internet like this?
>>
>>108197411
fix ur malware then idk figure it out?
idk if sus
no i won't visit your ip harvesters
>>
File: nanbeige3.png (238 KB, 854x1733)
>>108197423
This model is fucked.
>>
>>108197476
Have you tried resetting the context?
>>108197437
Lmao
>>
>>108197476
i come to wash
>>
File: output_2.jpg (628 KB, 1389x3792)
>>108197421
check your sampler settings. I know most of the /lmg/ niggers love to use meme sampler settings. Use what the lab tells you: 0.6 temp, top p 0.95. If they say nothing about shit like top k and min p you should take it as meaning they should be disabled.
>>
File: output_1.jpg (1.18 MB, 1389x4350)
>>108197421
>>108197511
reasoning scrollback
>>
File: output_0.jpg (1.32 MB, 1389x4350)
>>108197421
>>108197511
>>108197522
start of prompt
>>
File: nanbeige4.png (231 KB, 885x1579)
>>108197245
>>108197277
These 2 were with the Heretic

>>108197476
This one is the normal model.

>>108197497
>Have you tried resetting the context?
This one was after I asked the normal model the question; picrel is what it answered.

I'll play with the sampling params.
>>
>>108197551
I noticed it still said heretic in the chatbox so I might have inadvertently still been using the heretic model. It's now properly answering drive.
>>
>>108196719
The only thing I didn't buy because I judged it to be too expensive is 1.5 TB of DDR5 RAM.
But my priority is reducing the cost of running models locally so optimizing for that particular hardware setup didn't make sense in the first place once the prices exploded.
>>
>>108197617
>>108197619
I'm starting to think the guy fucked up his abliteration and he just created a EHTICALMAXXED model instead.
>>
>>108197620
I'll take the opportunity to ask something theoretical.
Some anon was talking about using pipelining and request batching to run multiple requests with the same model with each request predicting token n+1, n+2, etc as a form of self speculative decoding or whatever he called it.
Does that kind of thing even make sense? Is there something you can do to have a model "skip" n tokens (he mentioned padding I think?) without having to train the model for that?
Doesn't seem like it would work since to generate token n+1 you'd need to know token n, and self speculative decoding is about generating whole sequences then having the main model check that sequence, yeah?
>>
>>108197645
I think you are describing a beam search at the end. it is a seemingly valid approach.
>>
>>108197645
>and self speculative decoding is about generating
I mean,
>and regular speculative decoding is about generating...
>>
>>108197680
>>There is no reason not to abliterate small models locally.
other than not having the hardware...
>>
>>108197680
I might look into it. was just curious about his model.
>>
File: 1745654255497465.png (409 KB, 1140x849)
https://taalas.com/the-path-to-ubiquitous-ai/

Meme hardware company shows off their chips which shit out sloptokens at light speed by having models hard-wired into the chips.

Their demo ( https://chatjimmy.ai/ ) is only running a quantized llama 3b right now so it's not actually useful for anything yet but it's a cool tech demo and seems like a possible way for inference and hardware costs to come down dramatically in the future.
>>
>>108197704
we know, you're late with the ad
>>
>>108197680
>banning emoji.
bruh
just use a grammar
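# allows printable ASCII, the Latin-1 supplement (U+00A0-U+00FF), the euro sign, and tab/CR/LF; anything else (emoji included) gets rejected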
root ::= [\u0020-\u007E\u00A0-\u00FF\u20AC\t\r\n]+
>>
>>108195832
>zero new llama.cpp commits since this was posted
It's over.
>>
nanbeige is useless for text completion.
>>
>>108197754
text comp is depreciated unc, get with the chat times
>>
>>108195832
everything hugging face touches turns to shit
>>
>>108197777
fitting for georgi.cpp
>>
>>108197754
nevermind text completion, it's so reasoner-maxxed that a prefill with an empty <think> </think> will not work if you ask a question that triggers its reasoner mode. It can, however, somewhat behave like an instruct model when there's an underlying high confidence score
Here's an example:
>>"what's the capital of France"
>The capital of France is Paris.
>Paris is renowned worldwide for its rich history, culture, art, fashion, and cuisine. It's home to iconic landmarks such as the Eiffel Tower, the Louvre Museum (which houses the Mona Lisa), Notre-Dame Cathedral, and the Champs-Élysées. The city also serves as a major global hub for finance, fashion,
gastronomy, and diplomacy. Is there anything specific about Paris you'd like to know?
if I misspell France intentionally:
>>"what's the capital of Rance"
>The question "what's the capital of Rance" contains a common misunderstanding.
>Let me clarify:
and it continues to blabber on and on and on like a reasoner despite being outside of a <think> block
>>
>>108197765
All models do text completion under the hood.
>>
>>108197820
nope that's such an unc way to think
>>
>>108197777
Quads of truth.
>>
I'm still using GLM 4.5 Air.
Anything better or worth trying out since?
64G+12G
>>
>>108197820
is there a secret to making autists like you just stfu? you very well know what he meant, but you had to add your autism to it
if someone says text completion everyone has the tacit understanding the person means "using the model without a chat template"
a chat template may still involve text completion on a technical level but that's so obviously not what people mean here.
>>
>>108197859
you're using anon's template wrong
>>
>>108197052
Air was only slightly less frustrating than oss was
>>
Finetuners, this might be your chance to create AGI. Take a Nanbeige4.1 and make it think 4-8x less with ~same quality of answers, and at the very least that's an irreplaceable subagent.
>>
>>108197851
At that size bracket, no, unfortunately not.

Qwen3-coder-next if you want to do coding with it I guess, but that's about it
>>
>>108197886
Sure
You're footing the bill for everything, agreed?
>>
>>108197859
You give the people in this thread way too much credit.
>>
>>108197645
The reason speculative decoding works is because the runtime when evaluating 2 tokens is smaller than 2x the runtime of evaluating 1 token.
For the whole thing to work you need to produce guesses for the next token that are sufficiently cheap vs. evaluating the full model and also sufficiently good to offset the increase in runtime from evaluating more tokens.
I don't know how one could use the full model with batching to produce these guesses with a lower latency than running the model normally.
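To make the mechanism concrete, here's a greedy toy version of the usual draft/verify loop (two stock HF models standing in for draft and target; this is a sketch of vanilla speculative decoding, not of the batching idea above):

# toy speculative decoding (greedy): the small draft model proposes k tokens one by one,
# the big target model scores the whole proposal in a single forward pass and keeps the
# longest prefix it agrees with; that single pass over k tokens is where the speedup comes from
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()   # placeholder draft model
target = AutoModelForCausalLM.from_pretrained("gpt2").eval()        # placeholder target model

@torch.no_grad()
def spec_step(ids, k=4):
    # 1) draft proposes k tokens sequentially (cheap)
    draft_ids = ids
    for _ in range(k):
        nxt = draft(draft_ids).logits[:, -1].argmax(-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, nxt], dim=1)
    # 2) target evaluates prompt + proposal in one forward pass
    logits = target(draft_ids).logits
    preds = logits[:, ids.shape[1] - 1 : -1].argmax(-1)    # target's pick at each drafted position
    proposed = draft_ids[:, ids.shape[1]:]
    # 3) accept drafted tokens until the first disagreement, then take the target's own token
    n_ok = 0
    while n_ok < k and proposed[0, n_ok] == preds[0, n_ok]:
        n_ok += 1
    bonus = preds[:, n_ok : n_ok + 1] if n_ok < k else logits[:, -1].argmax(-1, keepdim=True)
    return torch.cat([ids, proposed[:, :n_ok], bonus], dim=1)

ids = tok("The quick brown", return_tensors="pt").input_ids
print(tok.decode(spec_step(ids)[0]))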
>>
>>108197867
kek
>>
File: cellphone_girl2.jpg (240 KB, 1205x2048)
>>108197269
>high GPU workloads will destroy the phone
Good thing games aren't popular on the iPhone, that would definitely burn them out.
>>108197372
Looks like Impish Mind is ~3.5T/s on mine. Nothing groundbreaking, but more than fast enough to have out-and-about.
>>
Why is this thread suddenly cumming their collective pants over a 3B model?
>>
>>108197551
I decided to give the model a try. When I first try a model I usually ask it to write a dossier on richard nixon, i don't know why i just do,
so far it is 10700 tokens and counting on reasoning and is rewriting dossier over and over with some very minor changes.

it is an interesting model that is for sure
>>
>>108197933
I heard it punches above its weight and trades blows.
>>
>>108197933
do not question it
>>
>>108197933
sorry if you're too poor to run it but you need to let people talk about things
>>
>>108197948
Definitely can't be used for roleplay but heh, maybe it's good at tool calling.

Maya’s fingers stilled above her heart. She took a slow breath. This wasn’t about technique. It wasn’t about describing what she was doing. It was about yes.
A silent understanding passed between them.
Then, slowly, deliberately: hands meeting.
No rush. No performance.
A shared stillness.
A decision made in trust.
When their breaths aligned—soft inhale, gentle exhale—
This moment was enough.
It didn’t need obfuscation.
It needed respect.
>>
>>108197902
>I don't know how one could use the full model with batching to produce these guesses with a lower latency than running the model normally.
Yeah, it doesn't make much sense.
I *think* the idea is that thanks to pipelining, running two or more parallel requests (parallel decoding/batching) yields an absurd total throughput in t/s (does it?), so you could run the generation of a whole sequence (tokens n, n+1, n+2, ...) in parallel.
Meaning, you'd be generating a number of tokens in less time than you would if generating them sequentially as normal, but whether that time + the time to evaluate the sequence is lower than the time to just generate the sequence normally, I have no idea.
But even before that, you can't really make an arbitrary model (read, not trained for that specific behavior) generate tokens further in the sequence than the immediately next one, can you?
You can't really make a model do
>Input: Jhons Mobile C
>output1: ar
>output2: _<space>W
>output3: __ash
And so on and so forth, at an inference engine level. At least not as far as I'm aware.
>>
>>108197933
It tells you to drive to the car wash.
>>
>>108198009
But what about cockbench result?
>>
>>108198009
that's not ethical
>>
File: k153703.jpg (672 KB, 1920x1080)
>>108197932
which phone model? pp t/s? ctxlen? util to see mem usage?
i'll get this going to have something for when the nukes go off inb4 emp
>>
>>108198001
I still need to double check some of the info but it is actually a well written dossier, much better than what has been produced by other larger models i have played with.
>>
still no emotional voice cloning?
>>
Talk about running models on a mobile phone should be an automatic ban.
>>
>>108197411
If any doubt, they are sus. My browser doesn't touch those domains loading this thread. You have some sketchy extensions, or are trying to scam anons into visiting those domains.
>>
>>108198080
Idk why we're still stuck with fucking "ok google" bullshit when we could have a tiny LLM that is actually smart enough to click shit or call app defined tools.

Right now you're always stuck trying to figure out the exact wording some asshole dev defined for google to understand that you want to "Report that there is a car on the side of the road"
>>
>>108198080
There are uses; take something like Granite Micro, which can summarize text or retrieve information. You could have a personal digital assistant that runs locally and could do simple tasks like telling you what the top news stories are and giving you a summary.
>>
>>108195832
>llama.cpp is the fundamental building block for local inference, and transformers is the fundamental building block for definition of models and architectures, so we’ll work on making sure it’s as seamless as possible in the future (almost “single-click”) to ship new models in llama.cpp from the transformers library ‘source of truth’ for model definitions.

Omni models support when?
>>
>>108198115
>>108198164
saars please
>>
File: memory.png (126 KB, 1728x118)
>>108198033
iPhone 17 Pro, 8196 ctx, 4580 tokens in context before generating.
memory usage in app seems sus, but the log has the full breakdown.

>new captcha
I hate this
>>
>>108197765
It's cute when zoomers try this hard.
Have fun getting drafted for Israel in Iran.
>>
>>108198165
>so we’ll work on making sure it’s as seamless as possible in the future (almost “single-click”) to ship new models in llama.cpp from the transformers library ‘source of truth’ for model definitions
this doesn't even make any sense
how could they possibly achieve this short of importing transformers in llamer cpp? there's no deterministic code conversion mechanism that could lead to satisfying results here, they're too different. Or are they going to do what that retarded vibecoder mentioned once and write a trillion token prompt describing how to convert from TF to llamer and hope for the best?
>>
>>108197511
>0.6 temp
what retarded lab says this? temp=1 = off, has no effect, the output as it was trained. imo there's rarely a need for temp under 1, use an aggressive truncation sampler if it's that precious
>>
>>108197704
usecase?
>>
>>108198214
It was qwen or mistral that recommended a base temp of 0.15 for one of their models iirc
>>
>>108197886
OK but what if you just take it and multiply it by 50?
>>
>>108198214
temp 1 is absolutely not off. 0 is off. this is easily demonstrable, how did you ever come to such a conclusion in the first place? have you just never used language models before?
>>
>>108198236
You mean in size? I strongly suspect there not being enough high-quality data. This is why all frontier models are slop-poisoned.
>>
File: mistral small.png (21 KB, 878x169)
>>108198214
>the output as it was trained
well, for many models that output is not good at all
mistral pic related, and I concur with them, their model is unusable at temp 1. Just plain unusable.
I don't train models, I'm not a ML researcher, I can't explain the why, but I can say from experimenting with various prompts in various models I'd sooner trust what labs say to do on their model than the local /lmg/ niggers.
>>
>>108198227
Ideal for premature ejaculators
>>
>>108198249
Create a jobs program for all unemployed amerimutts to filter through the slop.
>>
>>108198203
I prefer it over the retarded stars
>>
File: llama-sampling_cpp.png (34 KB, 748x152)
>>108198237
>wrong
source code disagrees with you.
>>
the stars were the absolute worst. I always paused a lot longer to find the two retarded stars than in other 'cha. rn the 'cha are easy to do fast.
>>
>>108198287
bruh. i dont care about your code snippet. just load up a model. temp 0 is deterministic you will get the same reply every time. use a temperature below 1 and above 0 and you will get swipe variety. i dont care about your code reading comprehension or lack thereof. its fucking easily demonstrable with any model size on literally any inference engine and front-end.
>>
>>108198084
>You have some sketchy extensions,
>>108181785
>>
File: 1760067146659991.jpg (3.24 MB, 1755x2242)
>>108198237
divide by 1
>is absolutely not off
>>
>>108198310
>temp 0 is deterministic
temp 0 is a special case, checked for, and with its own code path (for greedy decoding).
static void llama_sampler_temp_impl(llama_token_data_array * cur_p, float temp) {
    if (temp <= 0.0f) {
        // find the token with the highest logit and set the rest to -inf

any other value of temp, including something tiny like 0.1, uses the same algorithms as 1 or 2 or whatever other temp you'd set up. Temp works by division and 1 is the natural state, he's right. Why do some models work better at low temp I can't explain, but that's where that nigga's wrong.
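the whole argument in a dozen lines of numpy, for anyone who'd rather run it than read the source (toy logits, obviously):

# temperature just divides the logits before the softmax:
# T=1 leaves the distribution exactly as trained, T<1 sharpens it, T>1 flattens it,
# and sampling still happens afterwards unless you greedy-pick (the T<=0 special case above)
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])            # toy values
for T in (2.0, 1.0, 0.5, 0.1):
    print(f"T={T}:", softmax(logits / T).round(3))

print(np.allclose(softmax(logits), softmax(logits / 1.0)))   # True: T=1 changes nothing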
>>
>>108198310
>>108198392
Temp 1 has no effect and Temp 0 is undefined, it is literally dividing by 0. So every inference lib makes the obvious choice to interpret it as greedy sampling
>>
>>108198411
>Temp 1 has no effect
>>
Guys we are just all pretending that we are using smartphones for this and we are pretending that we are retarded about how samplers work. Right?
>>
>>108198427
/lmg/ has been jeetmaxxed
>>
>>108198427
>suddenly
>>
File: chaamiku.jpg (38 KB, 409x545)
>>108198427
Fri eve EU we can have a lil fun
(retards are amogus indeed)
>>
File: carwash.png (157 KB, 772x1288)
I think I've found the reason why LLMs fail the car wash test

>2. Clearly state the goal:
To determine the most appropriate method of transportation to get my car washed, considering the short distance of 40 meters.
>>
It depends on what one means by off: no change in response on swipe, or no change to the default weights' behavior. However, there are multiple ways to achieve the former without setting each one to "off".
>>
>>108198392
idk i feel like its still sampling and not off. to me off is greedy decoding. if its sampling even at baseline probability its still sampling, it isn't always picking the highest prob token.
>>
>>108198444
but saarvam isn't out yet
>>
>>108198419
>>108198446
is this the bot meta? randomly greening random substrings
>>108198461
look at the code of your inference library, add some debug prints
stop being dumb the lot of you
https://artefact2.github.io/llm-sampling/
>>
File: cellphone_girl3.jpg (391 KB, 1771x2213)
>>108198427
>it's not real "Local LLM" unless you use an Nvidia card, otherwise it's just "Sparkling Markov Chains"
Yes we are.
>>
>>108198461
>to me off is greedy decoding
that's because you misunderstand the point of temperature
temperature shapes the token distribution, 1 means it does nothing to the distribution, but sampling is still occurring (seed+PRNG) because sampling itself isn't disabled
setting temp at 0 disables sampling, but that's like an extra function tacked on temperature, it has nothing to do with the true nature of temperature, which is to shape distribution (when you set it to anything other than 1).
>>
File: carwash2.png (176 KB, 639x1445)
I'm losing my mind
>>
>>108198046
not easy, but doable with gpt-sovits. Maybe works with other cloners where the sample is easy to swap out over API?
Ideally you need sentiment analysis and a voice sample per emotion and then it could be automated.
>>
>>108198555
if you ask nicely they might bring the wash to your car, though this may be unethical
>>
>>108198552
yeah i could see that. but its not changing my mind. 0 is off 1 is baseline or neutral.
>>
>>108198257
To me this suggests that their base models are undertrained and the token probability distribution is poor as a result. Pre-llama era models would also be unusable at temperature 1.
>>
File: WALK.png (141 KB, 1227x798)
>>108198555
I felt compelled to try this meme on 'toss. I was not disappointed by the result hahahah
>>
>>108198587
>doable with gpt-sovits
i'm on v2 because someone was kind enough to train character voices for it. emotions other than exclamation are impossible
>>
>>108194845
feet
>>
>>108198617
>someone was kind enough to train character voices for it
which characters are you using? Which v2 model?
>>
>>108198612
They're totally AGI and sentient though.
>>
>>108198641
Does it matter? https://huggingface.co/therealvul/GPT-SoVITS-v2
>>
I'm starting to think model makers were actually on to something with the "But wait..." thinking spam.
>>
File: butwait.png (177 KB, 774x1293)
>>108198689
holy shit!
>>
>>108198711
>finds the correct answer before the butwait
>>
>trying Anima preview
Not perfect, but it's got promise. Feels like Noob except it can do text, kind of. Hopefully the final version turns out well.
>>
>>108198711
>drummer
>>
>>108198731
4.1 cydonia is better than base small.
>>
>>108198728
>>
>>108198665
>Does it matter?
No, but I still run v2 so was curious what trained model/characters you were using.
>>
>>108196797
I am going hollow. Tried the webui in multiple browsers thinking it might just be some broken JS in firefox. Unless I'm crazy, they merged a feature that just plain does not work, the heck?
>>
>>108198728
another anon is training a better version. we should be hearing about it soon
>>
I'm retarded. How do I download Llama model weights without signing their gay ass agreement and giving Zuck my info?
I want to implement the model myself, not just run it btw.
>>
>>108199007
someone might have re-uploaded the safe tensors, you can try to look around.
>>
>>108199007
iirc all of them have public mirrors on huggingface
>>
>>108198802
source?
>>
>>108198958
Nice. I'll probably not spend much more time playing with this one then.
>>
>>108198728
cool ATs
>>
>>108198738
>[drummer something] is better than [real thing]
[headcanon]
>>
>>108199007
>I'm retarded
The first step to learning is admitting that
>How do I download Llama model weights without signing their gay ass agreement and giving Zuck my info?
You could always hop on the original leak torrent if you mean the OG llama model. Otherwise mirrors abound, as others have said
>I want to implement the model myself, not just run it btw.
This is the part I'm most curious about: what do you think "implement" means, especially in relation to "run"?
>>
>>108197704
Yeah once AGI is here it's essentially guaranteed that the weights will just be etched into silicon directly and we will mass produce AGI chips and implement them in literally every device because of its low cost. Like how we now have embedded systems running Linux just in Bluetooth speakers because it's easier and cheaper to use an entire SoC running Linux than putting in a microcontroller.

We will have traffic light controllers with AGI chips in them. Kind of a next level of horror scenario but I genuinely think this is going to happen.
>>
>>108199351
I want to write my own inference kernel in Triton, to learn about GPU programming and LLM slop
>>
>>108199452
>once AGI is here
why do we have this kind of crazy people here
ungenuine question, I'm just wishful thinking a parallel /lmg/ dimension where low iq jeets are not allowed
>>
>>108199458
is this like a school project or something or are you serious about wasting your time on that shit.
LLM slop is basically slang for low quality output, which could mean literally anything when interpreted by different people.
>>
>>108197704
>no one will ever make a drummer shittune dedicated hardware
Sometimes I realize this world might have been a darker place than it actually is.
>>
Did people stop constructing uncensored models from existing stuff or why is mistral nemo recced for local masturbation in 2026
>>
>>108199552
even if its 10000 years later that stuff is inevitable, the only question is whether we get to enjoy it
>>
>>108197704
As in it's literally only allowed to use 1 specific model? Dafuq? Can they scale it up to large MoE models?
>>
>>108199680
Mistral Nemo is the last small non-benchmaxxed (math, agents, reasoning, etc) model trained on pirated books. Other than that I think it's just been memed to popularity.
>>
>>108199709
the weights are like literally physically built into each chip, so no it doesn't swap models

>can they scale it up
well that's the billion dollar question
>>
>>108199552
>I'm just wishful thinking a parallel /lmg/ dimension where low iq jeets are not allowed
Take one look at the /g/ catalog and you'll realize it won't ever happen here. We need an alternative.
>>
>>108199680
It was the last uncensored and non-codemaxxed model. If you use it for RP, it's noticeably smarter than even bigger models at getting and handling non-obvious cues; that's why people like it so much. I know I liked it more than all other sub-35B models.



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.