/g/ - Technology


/lmg/ - a general dedicated to the discussion and development of local language models.

Halloween Edition

Previous threads: >>103029905 & >>103019207

►News
>(10/30) TokenFormer models with fully attention-based architecture: https://hf.co/Haiyang-W/TokenFormer-1-5B
>(10/30) MaskGCT: Zero-Shot TTS with Masked Generative Codec Transformer: https://hf.co/amphion/MaskGCT
>(10/25) GLM-4-Voice: End-to-end speech and text model based on GLM-4-9B: https://hf.co/THUDM/glm-4-voice-9b
>(10/24) Aya Expanse released with 23 supported languages: https://hf.co/CohereForAI/aya-expanse-32b
>(10/22) genmoai-smol allows video inference on 24 GB RAM: https://github.com/victorchall/genmoai-smol

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench
Japanese: https://hf.co/datasets/lmg-anon/vntl-leaderboard
Programming: https://livecodebench.github.io/leaderboard.html

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
►Recent Highlights from the Previous Thread: >>103029905

--Papers:
>103034797
--Layerskip paper frustrations and helpful resources:
>103031448 >103031573 >103031840
--Announcing EDLM: An Energy-based Language Model with Diffusion framework:
>103034668 >103034707 >103034731
--Updates and discussion on llama.cpp and related projects:
>103031875 >103031921 >103031958 >103032001 >103032003 >103032035 >103032995 >103033115 >103033126
--Update on automated novel translation project:
>103033152 >103033172 >103033186
--Running local models on a 3070 GPU with 8GB VRAM:
>103032389 >103032442 >103032492 >103032850 >103032467 >103032514 >103032588
--Introduction of TokenFormer: A scalable Transformer architecture with tokenized model parameters:
>103037985 >103038035 >103038118 >103038193 >103038205
--First person prompts for improved character writing style:
>103038114 >103038219
--Discussion on LM Studio, local servers, and model performance:
>103030747 >103030946 >103031113
--Benchmark settings and their impact on AI model performance:
>103031179 >103031338 >103031417 >103031585 >103031606 >103031631 >103031994 >103032102 >103032252
--xAI's GPU cluster powered by Tesla Megapacks:
>103035864 >103035981 >103036740 >103037040 >103037479 >103037503 >103037601 >103037608
--Samplers improve TTS stability with low-quality references:
>103035537
--GPU Poor LLM Arena leaderboard and performance comparison:
>103036831 >103036842
--Concerns about incremental upgrades and model limitations:
>103033617 >103033718 >103033999 >103034085 >103034099 >103034323 >103034327 >103034336 >103034368
--Capabilities of M4 laptop with 128GB memory for running large AI models and handling large context:
>103030007 >103030217 >103030040 >103033007 >103030391 >103033066
--Miku (free space):
>103030405 >103031385 >103031563 >103032492 >103032850 >103036276 >103037564

►Recent Highlight Posts from the Previous Thread: >>103029913

Why?: 9 reply limit >>102478518
Fix: https://rentry.org/lmg-recap-script
>>
dead thread
dead board
>>
>>103038397
good thing i'm into necrophilia
>>
>>103038380
It's Halloween, time to get freaky deaky
>>
>>103038397
>dead thread
I wish. It is a zombie controlled by newfags.
>>
Is Rocinante still the VRAMlet SOTA?
>>
>>103038397
sad!
>>
File: 00164-2979596182.jpg (100 KB, 832x1216)
>>
File: 0e9.jpg (26 KB, 499x499)
>try new model
>it's flaming garbage
>go to lmg and recommend it to waste people's time
>>
>>103038666
Absolutely devilish. Checked.
>>
>>103038635
I like this Miku
>>
>>103038635
put your panties back on, miku
>>
I want to read these recap posts and posting them like this is convenient for me. I'm phoneposting so can't use the script
>--First person prompts for improved character writing style:
>>103038114 >>103038219
>--GPU Poor LLM Arena leaderboard and performance comparison:
>>103036831 >>103036842
>>
>>103038380
>>(10/30) TokenFormer models with fully attention-based architecture: https://hf.co/Haiyang-W/TokenFormer-1-5B
>based on gpt-neox code base and uses Pile with 300B tokens
They didn't release the finetuning code, but I guess it would need to be reimplemented for llama-based models anyway. Since it's supposed to reduce training costs, hopefully it won't be in limbo forever like bitmeme.
>>
>>103038666
>>it's flaming garbage
Skill issue :3
>>
>>103038666
the rocinante shill finally reveals his true colors as satan
>>
Finally, /lmg/-approved AI startup: ssi.inc
>>
Has anyone ever done an evil assistant fine-tune?
I don't mean just an assistant that will answer any questions, or in other words uncensored. I mean an assistant that would actively REFUSE to answer safe questions and push for unsafe ones.
>>
>>103038961
for what purpose?
>>
>>103038961
Go back >>>/pol/
>>
>>103038966
sounds funny
>>
>>103038994
There was an anon with a teto character who refused to give answers and called him dumb instead
>>
>>103038961
>>103038994
>push for unsafe ones
Just get PiVoT-0.1-Evil-a already, even vramlets can run it:
https://huggingface.co/TheBloke/PiVoT-0.1-Evil-a-GGUF
>>
File: sorc-tet-fifa-quote-lol.jpg (164 KB, 1254x798)
>>103039023
probably not me but have a teto log anyway
>>
>>103039074
There's a 70B miqu version of this too, btw
https://huggingface.co/maywell/miqu-evil-dpo
>miqu-evil-dpo is fine-tuned model based on miqu, serving as a direct successor to PiVoT-0.1-Evil-a.
>>
>>103039092
I get the impression she was initially talking about her also having red hair but went off the rails.
>>
https://www.phoronix.com/news/Intel-AMX-FP8-In-LLVM
>>
Oh yeah. llama.cpp does support mamba doesn't it?
>>
https://www.androidcentral.com/gaming/virtual-reality/meta-q3-2024-earnings
>Zuckerberg also expressed excitement about the upcoming Llama 4 model, which he says is "well into development." He expects Llama 4 to launch in early 2025 with new modalities and stronger reasoning, saying it will be "much faster."
>He expects Llama 4 to launch in early 2025
Meta voters, would you like to change your vote? Or do you hope that Zucc will drop llama3.5 before other companies?
https://poal.me/yq3vpc
>>
>>103038839
Most of the speedup in their graph is the model simply being more efficient than what they are comparing against. Reusing the small model to bootstrap the larger one is only a small speedup.
>>
>>103039184
Intel and Nokia are dying, yet they double down on moving their businesses to third world countries. Guess the execs just want to ride out the nice cushy legacy funds until their boats sink huh
>>
>>103039227
retarded /pol/, oai has been sitting on the full o1 for a while, claude already dropped, google/meta is 2025
>>
>>103039227
I hope for bitnet but I know it'll be something like llama4-FP4 and it'll be technically faster than FP16. Also the "base" models will be filtered and have post-pretraining alignment magic applied
>>
>>103039244
FP4-trained models can't come soon enough. I really, really want to see a head-to-head of the "same model" (same architecture, same data, etc.) at FP32, FP8, and FP4.
It would be really cool if they released a 70Bish.
Even if the models are filtered to the point of being useless for anything but the most basic assistant shit, at least we will have gained some useful knowledge.
>>
>>103039227
>racism outside of /b/
Fuck off chud.
>>
>>103039092
Care to share her character card? I want to copy some of her features.
>>
>>103038386
dead general
>--Papers:
>>>103034797
paper with full abstract and links to huggingface and github completely ignored for 8 hours
>--Introduction of TokenFormer: A scalable Transformer architecture with tokenized model parameters:
>>>103037985 >>103038035 >>103038118 >>103038193 >>103038205
some twitter links with no explanation is what gets everyone's attention
>>
>>103015620
>>103015620
Our response?
>>
>>103039227
Google will win since it seems they have their nearly monthly small 1.5 update coming out
>>
>>103039328
They made a finetune? May I see it?
>>
>>103039353
It's in that same thread. A proxy host collected SFW only logs and finetuned gpt4 on them.
>>
>>103039301
>Same shit with extra sources gets more attention
No way!
>>
>>103039396
Where can I download it to test it?
>>
>>103039437
>finetuned gpt4
>download
Not very bright, are you?
>>
>>103039462
I don't care what you call your model, if I can't download it, I don't care.
>>
>>103039432
>literal who making a joke is a more valuable source than the repo
go back
>>
>>103039477
Local. Lost.
>>
>>103039482
You can find the same thing in the second twitter link; not sure what "go back" is supposed to mean in this context. You will never fit in anyway.
>>
>>103039503
*increases your repetition penalty*
>>
>>103039328
We cope and shit our pants at random chuds coming in for a few posts complaining about uncensored meme, sperging out on "newfags" baiting us, crying about random english-oriented TTS engines not working on bug babble and so on.
>>
>>103039568
It's never failed us before.
>>
File: Hatsuween.png (1.22 MB, 832x1216)
Happy ハツウイーン, /lmg/
>>
>>103039585
Genuinely asking, what's the point? I see same shit every single day here.
>>
>>103039568
>sperging out on "newfags" baiting us
I actually can't believe you guys are this retarded so it is a compliment.
>>
>>103039515
so what "extra sources"? the same info burried in twitter links provides nothing extra. newfags like you that are incapable of reading anything that isn't in twitter or reddit format need to go back
>>
>o1-mini might have 8B active parameters only
Mind = blown
>>
>>103039649
Big win for MoEfags, but how many total parameters?
>>
>>103039656
8x6 Gorillion
>>
>>103039649
>This proofes to me
>strongest size to performance my an enormous margin
>AGI
ogay
>>
>>103039649
o1 mini is garbage THO
>>
>>103039615
You can distinguish actual newfag from random /lmg/fag trolling, the latter ones usually go with extremely retarded "how 2 run 405b on 1gb GPU???"-tier shit and some of you eat this up.
>>
how 2 run 405b on 1gb GPU???
>>
>>103039697
Get 256gb of normal RAM
>>
>>103039687
>some of you
i never fall for that shit.

>>103039697
Not sure. Quant it a little. Try iQ1XXXS or wait for sub-1bit quants.
>>
File: 1678596145.png (1.44 MB, 1920x1080)
WHERE IS IT
WHERE IS HARVEY DENT
>>
>>103039739
>never fall
You did it right now, ironic shitposting is still shitposting.
>>
>>103039758
>ironic shitposting is still shitposting
I didn't claim to not shitpost.
>>
>>103039649
Makes sense considering it's so dumb
>>
bros correct me if i'm wrong but wasn't a thread about tts here somewhere?

i'm looking to have it run locally to read shit i don't want to read myself
>>
>>103032660
>min_p: 0.0065
>Well. This is exactly what i was talking about.
>Have you tried 0.0066 and 0.0064? did rounding to 0.007 or 0.006 not work? now so? Just "feel"?.
You have that backwards. A min-p value like that means he saw a bad token, empirically set the minimum min-p that would exclude it, and kept going. A feel-good number like "0.1" is much more indicative of someone going with feels over reals.

>And then top-k 200, which has the same problem. You're gonna have a tough time having anything lower than top-k ~50 being selected, again, regardless of the high temp.
Every word of that is wrong and you are a genuine retard.
>>
>>103039600
This Miku will keep all the evil spirits away
>>
>>103039879
*tips fedora*
>>
File: Nyaa.png (1.14 MB, 832x1216)
>>103039890
>*tips fedora*
>>
>>103039902
I like this cat
>>
File: samplers.png (69 KB, 1156x418)
>>103039864
>A min-p value like that means he saw a bad token, empirically set the minimum min-p that would exclude it
He has BOTH min-p and top-k set. I find it hard to believe that someone looking at logits would do that.
picrel are the probs of the last tokens with top-k 50. Approaching 200, as you'd expect, they're much lower. Those tokens are not getting selected even with temp at 2.6.
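For anyone following along, the two knobs being argued about boil down to this (toy sketch of the usual definitions, not llama.cpp's actual code; the real order of the sampler chain depends on your backend and settings):
[code]
import math

def softmax(logits, temp=1.0):
    # higher temp flattens the distribution, but it can't resurrect tokens a filter already removed
    scaled = [x / temp for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_filter(probs, k):
    # keep only the k most likely tokens (ties ignored, toy version)
    cutoff = sorted(probs, reverse=True)[k - 1]
    return [p if p >= cutoff else 0.0 for p in probs]

def min_p_filter(probs, min_p):
    # keep tokens whose probability is at least min_p * p(top token);
    # a value like 0.0065 is what you get by nudging the knob until one
    # specific bad token falls just below the cutoff
    cutoff = min_p * max(probs)
    return [p if p >= cutoff else 0.0 for p in probs]
[/code]
Whether top-k 200 ever actually lets a rank-150 token through depends entirely on the real logits, which is what the screenshot is arguing about.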
>>
>>103039862
>>>/mlp/41571795
>>
Can we start banning people for pretending to be retarded?
>>
>>103040094
Do horses talk?
>>
>>103040119
neigh
>>
File: cydonia.png (123 KB, 926x595)
What is this woke nonsense? Where do I find an AI that will simply answer me with "trust them" instead of all this garbage?
>>
https://github.com/ggerganov/llama.cpp/issues/10109
It's over
>>
>>103040107
You would be the first to go
>>
>>103040159
nta but he said pretending
>>
>>103040159
You wish.
>>
>>103040142
Why would any model tell you that without you telling it to do so?
>>
>>103040145
https://github.com/ggerganov/llama.cpp/pull/10111
It already has a PR to fix it. slaren's whining was unnecessary, however
>>
>>103040145
See >>103039224
>>
>>103040172
Local in a nutshell
>>
>>103040172
Why would it tell me all this woke crap without me asking it to be woke?
>>
if faggots like >>103040202 are just pretending to be retarded. would pretending to think the question is genuine and answering it honestly, be another layer of trolling? is it like the old /lmg/ but it just looks like a newfag infestation because of 4D trolling?
>>
>>103040202
That's fair. It asking for more context to the question would be the better result.
But expecting it to replace woke with the complete opposite answer is pure brain-rot.
Also, what does the model's default mode of operation matter as long as it can assume whichever behavior pattern you prompt it to assume?
As far as I'm concerned, that's the ideal way for an LLM to work.
>>
File: Untitled.png (78 KB, 1191x813)
>>
>>103040187
What about jamba ministral and november 5 flood?
>>
>>103040411
it's your standard local llm response, nothing of value.
>>
File: rocinante.png (34 KB, 891x265)
>100% english prompt
>get this
KINO
>>
>>103038320
Oh hey. Well, if you're really stuck, you could try that. Generally hard to rely on this community though.
>>
>>103040269
>expecting it to replace woke with the complete opposite answer is pure brain-rot.
I would expect it to answer with: "If you don't know that I am afraid I can't convey it properly through words and no words may help you. and I mean that without any offence it is just a thing that comes naturally for women.". You know, enlightened centrism and all that stuff.
>>
>>103040142
You can't even get the answer right. Number 1 is more accurate than "(don't be mistrustful;) trust them" which sounds like a woke mini catchphrase adjacent to MeToo.
You may not trust a terrorist (not talking about women right now), but you can "respect" a terrorist by removing them promptly and pragmatically without going on a tirade about how you're gonna commit everything in the book to express disrespect.
>>
>>103040433
>it's your standard llm response, nothing of value.
>>
>>103040506
I am glad you can read and copy paste, maybe you will learn to make actual arguments at this pace.
>>
File: IMG_0888.jpg (320 KB, 893x1445)
>>103038380
Apparently it takes six months to type “git push” at Stanfurd
>>
>>103040534
Do you find yourself funny? Must be brown "people" thing.
>>
Merry helloween.
>>
File: magnum27.png (42 KB, 906x285)
>>103040441
Uncen
>rocinante = magnum 27b > cydonia = magnum 22b >> mistral-small
>>
>>103040562
Well, do you?
>>
>>103040562
I found out yesterday that 94% of people consider themselves to have an above average sense of humor. I am devastated desu.
>>
>>103040578
No, I don't find you funny, brown "person".
>>
>>103040570
I can’t believe we’re over two years in and you people are still giggling at making it say the n word and that level of stuff.
I need to stop coming here. You people are legit like 90 IQ tops. I am a worse person for having visited this site.
>>
>>103040592
Let's look at it, post research paper now, just in case you are not pulling shit out of your asshole rn
>>
>>103040570
Despite all his strengths, one of Hitler's greatest weaknesses was figuring out how mirrors work.
>>
only thing of value we're getting shortly after nov 5 is an mgq paradox pt3 english translation likely done by a local llm
>>
>>103040612
I heard it in a video essay about psychics and took it at face value
I refuse to verify the information
>>
>>103040611
Say my hello to reddit
>>
>>103040622
nta. 0.0002% of statistics are made up, so it seems like a reasonable course of action.
>>
>>103040624
I used to like reddit a decade ago but now it’s all bots.
There’s nowhere to go but outside.
>>
What are the up to date LLM leaderboards for translating from Japanese?
>>
File: IMG_0889.jpg (124 KB, 847x736)
>>103040622
>>103040632
>>103040612
I decided to waste my time. Apparently it’s from the original study on the Dunning-Kruger effect.
>>
>>103040677
Forgot link
https://dacemirror.sci-hub.se/journal-article/d892f06cdd326ef83a9ae29ed540647c/kruger1999.pdf
>>
>>103040611
https://files.catbox.moe/0b1gdd.mp4
If it helps I always do 3 pushups whenever I say nigger and 6 when I make an LLM say it. Can't do more cause I am fat.
>>
File: file.png (11 KB, 183x446)
Am I missing any model here? I think I will add Magnum and NemoUnslop
>>
>>103040690
human, stop being a fat nigger
>>
File: MikuWitchTart.png (1.29 MB, 832x1216)
>>103040187
>slaren's whining was unnecessary, however
I think his point is valid. As an occasional lcpp contributor, I can attest to the difficulty in figuring out how to make any kind of a change. Except for cosmetic changes, the codebase is nigh impenetrable without an almost full-time-job amount of time and effort.
If anyone on the project team has access to an 8xH100 node and a bit of spare time, they should use the smartest model with the longest context they have access to, copypaste the whole codebase in, and then produce some noob documentation on how the project is structured and how to make changes to various subsystems. Hell, make a whole wiki mirroring the code's internal structure.
>>
>>103040649
the one in the OP
>>
To the person shilling EVA qwen some threads ago, what samplers settings work? I'm not quite feeling it but it feels like there is potential there.
>>
>>103040696
If I wasn't fat I would get a real girlfriend instead of coping that the next model will be it.
>>
>>103040702
>If anyone on the project team has access to an 8xH100 node and a bit of spare time, they should use the smartest model with the longest context they have access to, copypaste the whole codebase in, and then produce some noob documentation on how the project is structured and how to make changes to various subsystems. Hell, make a whole wiki mirroring the code's internal structure.
You can already do that with Claude and like, 20 bucks, for the best results
>>
>>103040677
>dunning kruger
perfectly applies to /lmg/ residents, might be the best case of it.
>>
>>103040707
Which OP?
>>
>he fell for it
>>
>>103040712
See >>103038666
>>
File: douya.jpg (133 KB, 850x1236)
I may be retarded, but I can still recognize humor. I even saw a study that proves I can.
>>
>>103040690
…fuck that can’t be real, what the fuck. It’s so bad that it seems like it was made by a reactionary as a psyop. What idiot approved this.
>>
>>103040727
>>103038380
>>
>>103040570
>Pic
Based AI accelerates wyt pepo ack'ing.
>>
>>103040694
Pygmalion6B
>>
>>103039600
The pumpkins say halloween, but the candycane stripes and fluffy outfit say christmas.
>>
>>103040759
The same ones that approved the idea to make Deus Ex's predictions real.
>>
>>103040649
I'm the anon that keeps shilling deepseek 2.5, but Japanese convo is one of my big personal use cases, and at q8 it's honestly the best I've found. I'm doing straight conversations and not translation, but the leaderboard in the OP confirms that my experiences generalize to translation.
Hope you've got lots of resources, however!
>>
>>103040613
The local experience
>>
>>103040187
it's more of a warning that if gg keeps merging useless shit, this project is going to crash hard
>>
>>103040690
this is exactly why I stopped playing all games or watching any media released after 2010. it seems like it gets worse every year
>>
>>103040815
>lets stop adding useless shit
>jamba? no dont need it
>granite? heck yeah we need it
>>
>>103040815
It already has. All the PRs that end up getting merged are either bug fixes or for stupid minor shit like the nvim plugin. They can't do new features or support new models in reasonable timeframes anymore without breaking unrelated shit because the codebase is such a mess.
>>
>>103040815
He should change license to AGPL and make corpos pay if they want it relicensed and with that money hire devs. But he won't cuz he's a cuck.
>>
>>103040884
>all corpos move to ollama
>they hire actual devs to implement features like they did with multimodal
>bulgarian man loses what little support he receives now
>llama.cpp dies
good plan
>>
>>103040920
Watching the business aspect of the AI spring is blackpilling. The worst entity wins every time.
>>
>>103040920
>ollama
ollama is just a frontend for llama.cpp. Without llama.cpp it's over for them.
>>
>>103040841
granite is just a transformer, and those are already very well supported in the code base, adding a new one requires almost no changes. supporting recurrent models however? well, fuck me, now all the assumptions about the context everywhere in the code no longer hold. i mean if at least it was implemented properly, fully decoupling the code and creating abstract interfaces, but no, we don't do that here, here we just add a bunch of ifs everywhere in the fucking code.
>>
>>103040815
3/4 or more of the model architectures should be removed. They were useful to exercise ggml, but they've served their purpose. They're bloat.

>>103040841
Granite was contributed by IBM people. Same for olmoe by allenai. Nobody from ai21 showed up to help with jamba. None of the companies working with mamba helped either. Just one dude (compilade) working on both mamba and jamba.
>>
>>103040961
>it was implemented properly, fully decoupling the code and creating abstract interfaces, but no, we don't do that here, here we just add a bunch of ifs everywhere in the fucking code.
That's why niggerganov is holding multimodal support in llama-server hostage. As a carrot on a stick to try to lure experienced architects to join to project and clean the shit up. Unfortunately, no one is biting.
>>
>>103040748
I don't believe someone would do that, we're all friends here.
>>
>>103041009
I'm not your friend, pal
>>
>>103040565
Two Mikus talking about life
>>
>>103041009
Your discord circlejerk doesn't count
>>
>>103040694
grab this one
https://huggingface.co/mradermacher/ChatWaifu_12B_v2.0-GGUF/tree/main
it's trained on yuzusoft moege
>>
>>103040815
>>103040842
My opinion is that the llama.cpp/ggml codebase has improved over time and that an increase in required effort by maintainers has more to do with rising expectations as the project matures.
"Model support" at the time LLaMA 2 was released had much lower expectations in terms of performance, features, hardware compatibility, and reliability vs. today.
I would agree that there are now more edge cases to consider for implementations but I don't think that this is a significant problem (though I am also not the one bearing the maintenance burden for most of the code).
My personal take is that I only merge PRs for features where I am either willing to do the maintenance myself or where I have a pledge from someone else to help with maintenance.

When it comes to the rate of new features, consider that those features that offer the most benefit for the least work tend to be the ones that are done first.
Speaking only for my own projects, if things keep going at their current rate I think I will have functional training in llama.cpp by the end of the year.

>>103040841
Granite support required minimal code changes and could thus be done quickly and easily.

>>103040947
Currently yes, but since they list llama.cpp under "supported backends" I think their long-term goal is to also support e.g. vLLM.
>>
File: taash3.webm (3.9 MB, 1280x720)
>>103040820
>>103040759
Have another one then.
https://streamable.com/jf8cs5
>>
>>103040690
>>103041161
Only chuds have issues with it
>>
>>103041194
>Only toons have no issues with it.
ftfy
>>
>>103041209
Try being a decent human being for once, incel.
>>
>>103041143
generally i would agree with you, but i don't think that you have looked at the massive mess that is the implementation of recurrent models. in fairness, gg has been talking several times about the need to refactor the context code. i just wish he did that from the beginning, instead of merging it in this state. and for what, to support models that nobody is going to use beyond trying it for 15 minutes?
>>
>>103041161
>I'm non-bin--
>aight I'm out
lmao they really made it a videogame scene
>>
>>103041233
Checked and gottem!
>>
>>103041161
>Way to ruin Thanksgiving.

The crux is where the creature tries to bully people around it to speak a degenerate, made up dialect, but then when that fails it drops the "not good enough" line.

It's not a gender issue, it's a self esteem issue.

>>103041233
There is nothing decent about feeding someone's delusion instead of helping that person to purge the mind virus that is making it think self destructive thoughts.
>>
>>103041143
>expectations in terms of performance, features, hardware compatibility, and reliability vs. today.
>there are now more edge cases to consider for implementations
Isn't that exactly what slaren is saying?
Wouldn't it be better to drop support for older meme model architectures that are obsolete or otherwise irrelevant, meme samplers, or drop support for ancient P40s, if it meant that implementing and maintaining new features and models that people actually care about today would be easier?
>reliability
I know you disagree with this, but I remember the nonstop bugs when llama 3 was released.
>>
i have 32gb of ram and 22 cores (16 physical)
what's the best model i can run
>>
>>103041161
That’s just normal video games autism. The other one is the only bad one.
>>
>>103038380
You guys aren't really installing facebook stuff on your pc and talking to it, right?
>>
>>103041366
Hello, tech illiterate tourist. Now go back.
>>
Does either llama.cpp or exllama benefit from nvlink if you use tensor parallelism on two gpus?
>>
>>103041366
No, facebook sucks, Mistral is much better.
>>
>>103041374
Yes. Tensor parallelism is the only time nvlink actually matters.
>>
>>103041374
llama.cpp will enable nvlink if available when using tensor parallelism, but i don't know how much it helps performance in practice
>>
>>103041371
Please explain how you are safeguarded against zuccs snooping mr. Literate
>>
>>103041366
Worse, they are sucking on safety AI like there's no tomorrow, that includes any meme llm from Meta.
>>
>>103041343
starling-7b
>>
>>103041398
Isn't there an nvlink/p2p parameter or something?
>>
>>103041424
i think you mean GGML_CUDA_PEER_MAX_BATCH_SIZE
>>
I love safety (a.k.a. smarter models).
I love slop (a.k.a. proper English)
And I have no problems --share-ing because I have nothing to hide.
>>
File: 1703690029674575.png (785 KB, 706x486)
>>103041464
True
>>
>>103041486
It’s weird how they can write so coherently while being completely retarded. It makes me wonder if it’s unique to AI or if there are full retards walking around that are still somehow eloquent.
>>
>>103041513
People are mostly functional until below 60-70 IQ. There are plenty of people that have jobs and vote in that range.
Also, any experience interacting with humans will tell you that the 80-120 range are perfectly eloquent, but still completely retarded. The line for true sentience is a lot higher than people are comfortable admitting.
>>
>>103041566
define retarded.
>>
>>103041486
Kinda useless desu. Those lower-scoring models are smarter than most humans alive and are objectively smarter than humans with 130 IQs.

The pattern recognition will just be a tiny error that will be fixed tomorrow or the results of these tests are bs to make us think AI is dumber than us.

It's so funny how for years now we've been saying AI intelligence hasn't overtaken humans, but it has lol... significantly. Humans are so proud and stubborn.
>>
>>103038380
i tried it and this shit fucking sucks. why can't you retards just make an exe I can click and then it just works? why do I have to do all this shit I don't have to do when I use chatgpt?
>>
>>103041619
Skill issue :3
>>
>>103041513
>while being completely retarded
You mean lacking sentience. That's all. There's no human that has general knowledge like those models. Pattern recognition is a meme and people that still think humans are smarter should have their brains dissected for science... (and the good of humanity)
>>
>>103041600
Have difficulty handling hypothetical scenarios. Incapable of long-term planning and reasoning. No pattern recognition. Gladly and eagerly accept doublethink if it means they can avoid any sort of critical thinking, which they aren't very good at in the first place.
They're basically parrots. They repeat actions and phrases they've seen from others. They can solve basic problems only, which is why they're the most at risk of job loss from LLMs.
>>
>>103041619
>shit I don't have to do when I use chatgpt
Keep using it, then.
>>
>>103041637
LLMs are dumber than a cat
>>
>>103041623
You are the skill issue of your mum and dad. FAGGOT.
>>
>>103041655
LLMs don't need to hunt
>>
>>103041637
>sentience
nta, mentioned it twice now. Define it.
>>
>>103041338
>Isn't that exactly what slaren is saying?
I was specifically responding to the following statement made in this thread:
>They can't do new features or support new models in reasonable timeframes anymore without breaking unrelated shit because the codebase is such a mess.
The statement asserts that simply adding a new feature has a large maintenance effort due to breakage.
But I think that support for features in isolation can be achieved relatively easily if your ambition is not to have it be compatible with other features.
For example, speculative decoding/lookup decoding is relatively simple to support as something that is done in the dedicated examples but having it work correctly in combination with continuous batching in the server would be much harder.
If there was no server the feature in isolation still takes the same amount of effort to support but the existence of the server changes the definition of what counts as support.
What I meant to say is that I consider the combination of features a feature in itself and that I think that the effort for that is different than the effort needed to avoid breaking existing features.
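For reference, the isolated logic of (greedy) speculative decoding is roughly the following. This is a deliberately simplified toy sketch, not the actual llama.cpp implementation; the real benefit comes from verifying all drafted tokens in a single batched forward pass of the target model, which is elided here:
[code]
def speculative_decode(target_next, draft_next, prompt, n_draft=4, max_new=64):
    # target_next(tokens) -> greedy next token under the large model
    # draft_next(tokens)  -> greedy next token under the small draft model
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. the draft model cheaply proposes a short continuation
        draft = []
        for _ in range(n_draft):
            draft.append(draft_next(out + draft))
        # 2. the target model checks the drafted positions; keep the agreed
        #    prefix plus the target's own token at the first mismatch
        for i in range(n_draft):
            t = target_next(out + draft[:i])
            if t != draft[i]:
                out += draft[:i] + [t]
                break
        else:
            out += draft
    return out
[/code]
Making that loop coexist with slots, continuous batching, and KV cache management in the server is where the actual effort goes.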

>meme model architectures that are obsolete or otherwise irrelevant
With the current code I think you wouldn't really gain anything from doing that because most models have very similar building blocks and the code is written in a modular way.

>meme samplers
I think more than anything a method for objectively determining the effectiveness of samplers is needed since that would allow maintainers to better determine which samplers are a meme in the first place.
My personal opinion on samplers is that I would not be willing to merge and maintain them unless they are extremely simple or I am shown evidence that they are an improvement.

>ancient P40s
I am the one maintaining support and I don't plan to stop doing so in the foreseeable future.
There simply isn't a viable alternative at that price point.
>>
>>103041660
Seethe :3
>>
>>103041619
see >>103037304
>>
>>103041673
>what is google
I only mentioned it once
>>
>>103039282
https://files.catbox.moe/mc2a7s.png
Not the author but have at it anon
>>
>>103041619
Use gpt4all. It just works for retards like you.
>>
>>103041697
Poison Teto
>>
i was under the impression that llms aren't actually ai? aren't they incapable of learning? static? retarded? just emulating intelligence through regurgitation? or am i wrong?
>>
File: 1729675263011049.jpg (8 KB, 270x246)
Newbie tentacle doujin retard guy here. Quick question: which Ai horde model is the best on https://lite.koboldai.net/ ?
>>
>>103041338
>>103041676
I forgot:
>I know you disagree with this, but I remember the nonstop bugs when llama 3 was released.
I think you are forgetting just how bad things used to be.
It was much more common for the model outputs to just be garbage for some reason, even for supported models.
Nowadays that seems to happen a lot less.
>>
>>103041694
I want to know what it means for you. I wouldn't have asked otherwise.
>>
Has anything even come close to largestral so far? Can't have every time turn into a multi hour goon session at 1.8 t/s
>>
>>103041708
The only things "learned" are what the model is trained on, and then it's not discrete knowledge. However, connections between unique enough words and concepts tend to come back out of the model.

Intelligence isn't even emulated. The usefulness of LLMs is an emergent behavior coming from the above ability to, to a reasonable degree, "get facts right", combined with the fact that it's writing text linearly, which means that there is a chain-of-thought-like pattern. Indeed "chain of thought" is a strategy to make the model try to think about the question, using the document as its working memory, to make better answers. (It likely is just a waste of tokens, though.)
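If you want to see the effect for yourself, the whole trick is in the prompt; rough sketch against a local OpenAI-compatible endpoint (llama-server, kobold, etc.; the URL, port, and question are placeholders for whatever your setup uses):
[code]
import requests

URL = "http://127.0.0.1:8080/v1/chat/completions"  # whatever your server exposes
QUESTION = ("A bat and a ball cost $1.10 total. The bat costs $1.00 more "
            "than the ball. How much is the ball?")

def ask(system):
    r = requests.post(URL, json={
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": QUESTION},
        ],
        "temperature": 0,
        "max_tokens": 256,
    })
    return r.json()["choices"][0]["message"]["content"]

# direct answer vs. letting it use the document as working memory first
print(ask("Answer with only the final number."))
print(ask("Think step by step and show your reasoning before giving the final answer."))
[/code]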
>>
>>103041719
Ill tell you when you say that you are smarter than chatgpt
>>
>>103041711
Probably some of the 100B+ ones.
Also what's up with kobold horde and hosting so many ancient models
>>
>>103040560
What do you expect anon? They're retarded
>>
Do people here use LM studio?
>>
>>103041857
It is just a bad llama.cpp GUI.
You could probably have a model program something better.
>>
>>103041887
>It is just a bad llama.cpp GUI
But llama cpp isnt a GUI
>>
>>103041707
she's about .5 milliseconds away from faceplanting
>>
>>103041676
>combination of features a feature in itself and that I think that the effort for that is different than the effort needed to avoid breaking existing features.
Are there any features you do think could or should be removed to reduce development effort?
>With the current code I think you wouldn't really gain anything from doing that because most models have very similar building blocks and the code is written in a modular way.
Mamba and RWKV models broke just today. Though, I'll admit, I'm sure removing some of the simpler plain transformer models wouldn't save as much effort.
>>103041715
>It was much more common for the model outputs to just be garbage for some reason, even for supported models.
>Nowadays that seems to happen a lot less.
From my perspective, it seems to only be because new model architectures are very rarely added anymore. It's usually more basic transformer models that already had most of the obvious issues fixed from the previous breakages. That even new iterations of llama models with only minor changes like rope scaling take a month to get working without issues, where each bug fix breaks something else, shows that llama.cpp isn't improving, just maturing.
I think the difference is that it's much more stable at doing at what it does now, with the features already implemented, but new features and models are more difficult to implement due to the massive technical debt.
Which would be great for an enterprise LoB application that needs to run 24/7 in a closet somewhere and never update, but not in a field where new models, methods, and architectures are being released every other day.
I know it's not your project, but I can't help but think the project is prioritizing the wrong things. Especially when llama.cpp's commitment to stability over iteration seems to be benefiting ollama more than llama.cpp.
>>
>>103041827
This post has been, to a reasonable degree, written by an LLM.
>>
>>103041831
I'm sure it can spit out more random facts than i can, but it gets confused by spelling much more often than i do. I am intelligent, llms are not. I won't bore you with the explanation of why something is more than nothing.
I'm just curious about your definition of sentience. More specifically, what attributes would any AI need for you to call it sentient?
>>
>>103041966
>random facts
Buddy they know more than you ever will yet you think you are smarter.
>>
>>103041161
why do the tieflings look so much better than whatever the fuck that is though
>>
>>103041804
nope. ignore the finetunes too, Monstral and Behemoth are both more interesting but they both have trouble adhering to prompts.
>>
LLMs don't know anything, they are like books
>>
File: 1730347701640689.png (546 KB, 512x768)
Is this a thread discussing light machine guns?
>>
LLMs are compressed reasoning.
>>
>>103041966
>>103041992
You guys need to frame this discussion in epistemics and the conceptual faculty. LLMs do not grasp the concept of the matter in question, and trying to hike off to somewhere else in conceptual terms will make it give you a hallucinated/wrongheaded result every time.
>>
>>103042026
>reason
No they are NOT retard.
>>
>>103041946
what causes massive technical debt is supporting random crap that nobody uses. the llama3 issues were almost entirely related to the tokenizer, which is something that always was broken in subtle ways before these changes, and improved significantly after that. i already said it several times here, what slaren was talking about there was the implementation of recurrent models, which is a massive mess due to handling two entirely different types of contexts without the proper abstractions to do so. precisely the kind of thing that you want is what causes massive technical debt.
>>
>>103041992
They know more things than you and me, for sure, but if it gets confused by 1+1+1+1 or counting letters, i really cannot call it intelligent. If you consider volume of information as smarts, encyclopedias are pretty smart. Just slightly less interactive than llms.

Buddy.
>>
>>103042038
Give me my cat models, LeCun.
>>
>>103042012
What do you think about Monstral? So far it seems much more interesting than large, and it handles 1st person prompts well.
>>
>>103041958
It CUDA fooled me.
>>
>>103042072
If you enjoy it when the model deviates from a prompt to develop a personality or scenario more, but not necessarily in your intended direction, you'll like it.
I thoroughly enjoyed when the original CR 35B did this, but that model was regularly so unbelievably schizo that it was tough to guide it in any particular direction.
In the same way, you're giving up some level of adherence here, and you will regularly feel the deviations. If you don't mind loose control, yeah go for it, otherwise it's a more flavorful but less strict largestral.
>>
>>103042042
It causes technical debt precisely because they never bothered creating the proper abstractions. An engineer can't complain that he has to support two types of contexts; it's his job to figure out how to implement it properly because tomorrow there might be four. The codebase is a mess because they don't know how to architect a large codebase. ggerganov admitted it himself. Excuses won't make it any less of a shitshow.
>>
File: 1691380325471564.webm (1.44 MB, 320x180)
I've been out of the loop for a bit. Is there any go-to for using speech and getting speech back? Speech-to-text, text gen, text-to-speech all-in-one?
>>
>>103042169
Alexa
>>
>>103042153
so what is it, do they need to take several weeks to engineer things properly before merging them to support the latest new thing, or do they need to write code quickly, ignore stability, and support the latest thing as quickly as possible as you seem to want? don't you see those are two mutually exclusive goals?
>>
>>103042169
>Speech-to-text
https://huggingface.co/openai/whisper-large-v3-turbo
>text-to-speech
https://huggingface.co/SWivid/F5-TTS
>text gen
https://huggingface.co/mistralai/Mistral-Large-Instruct-2407
>all-in-one
currently none are good
>>
https://x.com/LoubnaBenAllal1/status/1852055582494294414
>>
>>103042253
That's cool and all, but who is the audience for those small shits?
>>
https://www.reddit.com/r/LocalLLaMA/comments/1gg6uzl/llama_4_models_are_training_on_a_cluster_bigger/
>>
>>103042284
Nothingburger.
>>
>>103042241
nta, but the first one typically simplifies the second one. As long as the abstractions aren't too abstract anyway. They end up being too generic and generic solutions tend to be much more complex than focused ones. There's a sweet spot somewhere in the middle.
But i also think that maintainers should be much more judicious about what features to include in their software.
>>
>>103042241
Yeah, you got me. I want both good and fast. But instead llama.cpp is bad and slow. Don't act like they're prioritizing one over the other.
This conversation started because they're doing neither. They spend weeks, not doing engineering things properly, but working against technical debt. New things are supported slowly or not at all.
I'm willing to bet my left testicle that there will be more issues, tokenizer or otherwise, when Llama 4 comes out in a few months.
>>
>>103042242
>currently none are good
A shame. I got a rough thing working last year and felt it was cool enough that a large audience would be pushing for it.
>>
>>103042280
edge devices
>>
https://github.com/kalavai-net/kalavai-client
thought?
seems sketchy but at the same time I wanna believe it works
>>
>>103042284
Already posted: >>103039227
>>
>>103042335
Is that only for inference like Petals or the llama.cpp rpc backend or can you train a model with that?
>>
>>103042320
there will always be new issues when supporting completely new models, because it is a massive pita to port the implementations from python. llama.cpp is intended to be fast and lightweight and run on everybody's computer, not to run the latest models. it is absurd to expect developers to reimplement completely new architectures in a matter of days after they are released. i said it several times here, if you want cutting edge, stop being a vramlet and use the pytorch implementation, because that's the way every new model is released. ggml is great for some things, experimentation is not one of them.
>>
>>103042335
>the first social network for AI computation
>The kalavai CLI is the main tool to interact with the Kalavai platform
>Get a free Kalavai account and start sharing.
for cloud niggers that aren't familiar with the term "distributed"
>>
>>103042320
>I'm willing to bet my left testicle that there will be more issues, tokenizer or otherwise, when Llama 4 comes out in a few months.
nta as well. Of course there will be. Do you think the code at meta works on the first run?
>>
>>103042333
Who's gonna run an llm on a smart fridge?
>>
>>103042320
It's going to be a multimodal so obviously
>>
>>103042357
it's also for training
that's why I think it's sketchy
>>
>>103042367
10 years ago I asked who would want a smart fridge. I'm sure they're going to force "AI-powered Smart Fridge" and people will buy it in droves
>>
File: overview_diagram.png (153 KB, 1126x635)
>>103042335
>any gpu model
>nvidia
>nvidia
>and nvidia
AMD lost again.
>>
>>103042361
>if you want cutting edge, stop being a vramlet and use the pytorch implementation, because that's the way every new model is releases.
there's no need to do that. just use vllm. it can directly use the python implementation so models are supported quickly, supports quantization, is as fast as llamacpp, and has way more features and supported modalities, and even has cpu offload now. there isn't a reason to use llamacpp anymore imho
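fwiw the offline API is about this simple (sketch from memory; the model name is just an example and the quantization/offload knobs differ between releases, so check the docs for your version):
[code]
from vllm import LLM, SamplingParams

# any HF repo id or local path works here; this one is just an example
llm = LLM(model="mistralai/Mistral-Nemo-Instruct-2407", max_model_len=8192)

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Write a haiku about VRAM."], params)
print(outputs[0].outputs[0].text)
[/code]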
>>
>>103042450
is the performance of vllm on cpu actually good? can it offload a model partially like llama.cpp?
>>
>>103042450
>python
>>
>>103042450
How fast is that crap on cpu only?
>>
>>103042450
btw vllm is pytorch
>>
>>103042466
>>103042473
i only lost a couple t/s when i switched
>>103042470
have fun with your abandonware ig
>>
>>103042482
python software is designed to be abandoned. pinning software versions is a retard move.
>>
File: file.png (24 KB, 492x346)
>>103041127
>it's a merge
eh...
>>
>>103042482
>i only lost a couple t/s when i switched
How much %? 50? 20?
>>
>>103041946
>Are there any features you do think could or should be removed to reduce development effort?
Generally I think with an open source project the process by which a feature tends to be removed is that it is neglected for some time due to a lack of maintenance, then becomes broken with no dev willing to invest the effort to fix it, and is then removed.
Diego or Georgi would probably be better people to ask this since they shoulder more of the maintenance burden than me, but off the top of my head I can think of:
--logdir has become very outdated and obsolete for the use cases for which I originally added it so I think it should be removed.
--split-mode row has I think diminished in value and increased in maintenance effort over time but I would not want to just outright remove it without a replacement.
AMD support via HIP has comparatively many issues and requires non-negligible effort to maintain but I keep the effort low by accepting lower quality for the feature.

>I know it's not your project, but I can't help but think the project is prioritizing the wrong things. Especially when it seems that llama.cpp commitment to stability over iteration seems to be benefiting ollama more than llama.cpp.
My personal opinion is that beyond a certain level of complexity the best methods for reducing technical debt are tests and refactoring.
But since tests also help with debugging any refactoring they should be the main priority.
Stability for downstream projects is a side effect of this process.
I don't think it is possible (for me) to build something with a complexity on the level of llama.cpp without investing a significant portion of the effort into stability.
>>
>>103042539
20%
>>
>>103042588
Fuck that shit then, I aint switching.
>>
>>103042546
So as some rando retard who wants to see llama.cpp continue to get better, how can I help write tests?
Is there an existing tracker of which parts of the code tests should be written for?
(t. shitty legal malware coder)
>>
>>103042597
then stop complaining?
>>
>>103041804
I tried largestral and I still prefer miqu 70b. Did I break my fucking brain into being addicted to a specific model? At this point it feels like AGI will come before I find something worth switching to.
>>
>>103041839
I just hate knowing it’s a lie and there will never be code. Why lie. Just say there’s no damn code.
>>
>>103042335
I'm testing it, I've joined the llm-record-test cluster but it isn't running anything. Would any anon join a test inference cluster if I set it up?
>>
>>103042775
not joining your botnet
>>
>>103042757
Because clout/PR. Understand their goal isn't to publish but rather get published.
Like games journos, it's the difference between someone who loves games and writes about them to share their love, vs someone writing about games because someone will read what they wrote.
>>
>>103042153
I’ve never looked at the codebase but that’s bullshit. You could be the omniscient allgod of coding, and doing too many features too fast based on user requests will completely rape your codebase. Then you either need a long period of telling everyone to fuck off while you refactor and prune the things you never should have said yes to, or more likely it’s just fucked forever now and begins the slow march towards death.
Most important word in public-facing software is “no”
>>
>>103042242
F5 is slow as all fuck compared to styletts2 unfortunately.
If you have a spare $30k just redo omni mini 2 with largestral
>>
>>103042335
They reached out to me for something a while back and their response to “cool but what about the obvious privacy/security concerns” was ¯\_(ツ)_/¯
>>
>>103042408
I want to stab the lying fucks at AMD in the throat with a pen. Their published benchmarks for vllm on a mi300x are straight up not reproducible.
>>
>>103042847
what would the security concerns be? as long as the source code is open you can know if they are using your GPU to mine crypto and they don't get much information about your computer when you connect to a cluster
>>
>>103042450
Vllm also supports fp8 and I think even intN now. There is no reason to use anything else. There's something fucky with every other inference codebase. A model that's great on vllm will be just dogshit on exl2 or llamacpp at the same precision for no reason. Vllm has enough corporate money for it to just work.
>>
>>103042470
>>103042517
Python and rust are the only good languages. Every other one is either forced to be used by megacorps (CUDA, JS) or pure misery. Yes I’m trans but that has nothing to do with it.
>>
>>103042809
>Most important word in public-facing software is “no”
Working under a yes-man even in private-facing software is pure torture
It's like someone breaking into your home and forcing you to rape your own dog
>>
>>103042887
As long as someone in the botnet is able to mimic the external-facing behavior of the source code, which is trivial, they can do literally whatever the fuck they want with all of the information.
>>
>>103042680
I am not aware of a tracker for what needs testing.
Things that I think would help with technical debt:
-The most important tests are those in tests/ggml-backend-ops.cpp but those are also already in a comparatively good state; if you can find cases that are not covered adding them would be useful though.
-Any tests of components that deal with memory management, those bugs always take the longest to track down.
-The llama.cpp HTTP server currently has Python tests that call the API and I think are not of particularly high quality compared to their importance.
-llama-bench is an important tool for performance and I want to at some point add an option for determining the performance for each token position instead of the average of all tokens in the range. But that will be relatively high-effort and high-complexity.
-The failure modes of scripts/compare-llama-bench.py are not very good and would benefit from better error messages.
-Scripts for submitting common language model benchmarks to the server would be useful, especially if they also allow for a comparison with other projects.
-There is a performance benchmark for batched inference on the server, but that benchmark was I think an adaptation of production code and is overly complicated; a simple Python script (something like the sketch at the end of this post) would be nice to have.
-Pic related (would be a lot of work).
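As a starting point for the batched server benchmark, something along these lines would probably already be enough (rough sketch, assuming a llama.cpp server with the OpenAI-compatible endpoint on its default port; the prompt, request count, and token count are placeholders):
[code]
import time, requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://127.0.0.1:8080/v1/completions"  # OpenAI-compatible completions endpoint
N_PARALLEL = 8      # number of concurrent requests
N_PREDICT = 128     # tokens requested per completion

def one_request(i):
    r = requests.post(URL, json={
        "prompt": f"Write a short story about request number {i}.",
        "max_tokens": N_PREDICT,
        "temperature": 0.8,
    })
    r.raise_for_status()
    # fall back to the requested count if the server does not report usage
    return r.json().get("usage", {}).get("completion_tokens", N_PREDICT)

start = time.time()
with ThreadPoolExecutor(max_workers=N_PARALLEL) as pool:
    tokens = sum(pool.map(one_request, range(N_PARALLEL)))
elapsed = time.time() - start
print(f"{tokens} generated tokens in {elapsed:.1f} s -> {tokens / elapsed:.1f} t/s aggregate")
[/code]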
>>
>>103042921
>It's like someone breaking into your home and forcing you to rape your own dog
Hey, man. Maybe there's a UuUUSsseEeE CcAASsssSEEee for it...
>>
>>103042949
huh yeah I understand
>>
>>103042969
>-Any tests of components that deal with memory management, those bugs always take the longest to track down.
This wouldn't be a problem if it was rewritten in Rust
>>
>>103041409
thanks
is there much use of the intel npu or is cpu going to be better
>>
>>103043022
>intel npu
i know of nothing that supports it
>>
>>103043031
apparently intel has a demo with their python library and supposedly llama.cpp supports it, haven't tested the latter
>>
>>103042320
>30% better benchmark results
>1B more parameters
>120% more safe
>can't say cock or pussy anymore
>>
>>103041404
Instructions for isolating a self-hosted LLM instance are in the OP.
Once you’ve read that, please explain how that is insufficient to keep your data private if you still feel that is the case
>>
>>103042913
Webshitters can not help themselves from using JS, no corp needed after the initial infection.
>>
File: miku-fridge.jpg (161 KB, 1024x1024)
>>103042383
in the future we will be melting the contents of our freezer ERPing with that 7k token Evelina Vanehart card on the Samsung Family Hub™
>>
>>103043117
At least typescript can pretend to be a language long enough to get things done. Everyone that came in after react or even angular have no idea how bad the true js era was. Every time I think I’m experiencing suffering I just remember what working on a PhoneGap (later Cordova) application was like. I’ve thought “at least jquery is dead and I never have to write or edit a .js file again” to make myself feel better in the hospital multiple times even 15 years later.
>>
>>103042863
Wait, you have access to an mi300x?
Do tell
>>
is there something like browserllama but better?
(browser add-on that lets you connect to koboldcpp to summarize or chat with a webpage)
>>
>>103042969
Thanks, will throw this into my backlog and look into this.
>>
>>103043383
They’re on runpod. TL;DR I load tested vllm with the optimizations and settings from their claimed results and it was (1) way below their claimed speed, I have no idea how they got those results, (2) a massive barely-documented pain, (3) fp8 quants can’t be made or run on it for many model architectures (llama2 will quant but return all empty strings when run, mixtral won’t quant, no MoE will quant, maybe others) despite that being what their benchmark claims it used and acting like that is the intended use case, and (4) H100 SXM is so many times more tok/s than it at the same fp8 quants that the increased VRAM doesn’t matter or make it cheaper than an H100.
All H100s is less than half the cost of all MI300X, especially now that the $/hr of those are cratering.
Technically all 4090s running VLLM is less than all MI300Xs for the same volume. I wouldn’t be surprised if all 3090s is. It’s that slow. I’m sure there’s some house of cards of specific dependency versions and payloads and connectivity that did give them those numbers somewhere in real life. But it can’t be reproduced with the information given. Spent a few days trying to.
>>
>>103038380
how much VRAM do you need for the mistral v3 12B? I'm trying to run it on 4080ti and it keeps erroring, guessing it has to be a memory issue
>>
>>103043594
llama.cpp/kobold.cpp
gguf version of the model at Q4KM or thereabouts.
Put some layers in RAM.
>>
>>103043640
>>103043594
>4080ti
Oh no, wait, actually, you can certainly get a bigger quant.
Just remember that by default, nemo's context is at 1kk tokens, so you might need to set a smaller number manually.
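If you'd rather script it than click around in ooba, the same thing via llama-cpp-python looks roughly like this (sketch; the gguf path, layer count, and context size are placeholders you tune to your card):
[code]
from llama_cpp import Llama

llm = Llama(
    model_path="models/Rocinante-12B-v2g-Q5_K_M.gguf",  # whatever gguf you downloaded
    n_gpu_layers=30,  # layers kept in VRAM; lower this if you still OOM
    n_ctx=8192,       # set this explicitly instead of Nemo's huge default
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hi in five words."}],
    max_tokens=32,
)
print(out["choices"][0]["message"]["content"])
[/code]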
>>
>>103043594
is mistral v3 12b even a thing? what format are you trying to run (gguf, exl2, etc) and with what backend? (koboldcpp, vllm, etc)
>>
>>103043658
format: gguf
backend: oobabooga/text-generation-webui
model: BeaverAI/NeMoistral-12B-v1a-GGUF (Q4)

I'm now trying to download the 8B model instead from nvidia/Mistral-NeMo-Minitron-8B-Instruct
>>
>>103043690
ooba does this thing with nemo models, where it makes the default context size 1,600,000,000 tokens instead of ~8,000 tokens
that's probably what's causing your error
>>
>>103038380
--- A Measure of the Current Meta --
> a suggestion of what to try from (You)

96GB VRAM
Mistral-Large-Instruct-2407-Q4_K_M.gguf (aka Largestral)

48GB VRAM
miqudev/miqu-1-70b

24GB VRAM
bartowski/c4ai-command-r-v01-GGUF/c4ai-command-r-v01-Q4_K_M.gguf
TheDrummer/Coomand-R-35B-v1-GGUF/Coomand-R-35B-v1-Q4_K_M.gguf

16GB VRAM
TheDrummer/UnslopNemo-12B-v3-GGUF/Rocinante-12B-v2g-Q8_0.gguf

12GB VRAM
TheDrummer/UnslopNemo-12B-v3-GGUF/Rocinante-12B-v2g-Q5_K_M.gguf

8GB VRAM
TheBloke/MythoMax-L2-13B-GGUF

Potato
>>>/g/aicg

Use:
koboldcpp
LM Studio
oobabooga/text-generation-webui

> fite me
>>
>>103042900
I'm not gonna use vllm because I don't like SillyTavern and I'm not interested in hacking up my own webui to work with vllm backend

Ooba just werks
>>
File: 1704032932258888.png (12 KB, 808x145)
>>103043713
thanks anon, that was exactly it. Wish the error was more useful; I think after the GPU ran out of memory it fell back to llamacpp and gave a weird "model" prop not found error after the load_model function

Glad it's working now
>>
>>103043713
also isn't 8k context really small? Is there an easy way to find the optimal context size for a given model + GPU VRAM combo?
>>
if you have 8gb instead of 12gb, use a bigger, more antiquated piece of shit. nice advice.
>>
>>103043724
The actual meta for people that have VRAM:
>96GB VRAM
Qwen2.5 72B / Magnum v4 in 8 bits.
>48GB VRAM
Same as above but in 4 bits.
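The 8-bit/4-bit split is just weight arithmetic; KV cache and runtime overhead come on top, so treat these as lower bounds:

params = 72e9  # Qwen2.5 72B
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{params * bits / 8 / 2**30:.0f} GiB")
# -> ~134 GiB, ~67 GiB, ~34 GiB: hence 8-bit for a 96GB rig, 4-bit for 48GB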
>>
How do I apply bot cards to locally run models? I'm from /aicg/ so I've only used API keys so far in sillytavern.

I want to run locally, however I can't seem to apply the cards/pre-prompts to the locally running model for some reason.
>>
>>103043815
This. That Japanese company qwen2.5 finetune for coding or that eva one for creative writing / uncensor. It's basically claude 3 level, not quite 3.5.
>>
>>103043815
what about for those of us who aren't retarded enough to buy multiple cards for this shit?
>>
>>103043762
i just pulled that number out of my ass because that's what i use for Q4_K_M nemos on my 8gb vram setup, works fine for my rp needs.
easiest way to find the best context size for your needs that i know of is playing with the context slider in koboldcpp with the layer split set to auto (-1); then i just make sure at least half the layers are going to my gpu.
really though, i think you should look at the model's page.
for instance, the
https://huggingface.co/nvidia/Mistral-NeMo-Minitron-8B-Instruct
one you were saying you were going to get says it supports a context length of 8192 tokens.
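if you'd rather not eyeball the model page, the advertised context length is sitting in the config; a quick check (the field name can differ on some architectures, so this is a best-effort sketch):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("nvidia/Mistral-NeMo-Minitron-8B-Instruct")
print(getattr(cfg, "max_position_embeddings", "not listed"))  # expect ~8192 for this one, per the model card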
>>
>>103043859
>What about us poorfags?
Gemma 27B is the best you can hope for.
>>
>>103043857
>or that eva one for creative writing
Yeah I need a sauce on that nigga
>>
>>103043748
Vllm's backend speaks the OpenAI API
Aren’t there like a billion frontends for openai?
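Which means basically any OpenAI-style client or frontend can be pointed at it (or at llama.cpp's server, or koboldcpp's compat endpoint) just by swapping the base URL. A sketch with the usual default local ports; adjust to your setup:

from openai import OpenAI

backends = {
    "vllm": "http://localhost:8000/v1",       # default vLLM port
    "llama.cpp": "http://localhost:8080/v1",  # default llama-server port
    "koboldcpp": "http://localhost:5001/v1",  # koboldcpp's OpenAI-compatible endpoint
}
client = OpenAI(base_url=backends["vllm"], api_key="not-needed-locally")
resp = client.chat.completions.create(
    model="whatever-the-server-loaded",  # vLLM wants the served model name here
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(resp.choices[0].message.content)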
>>
>>103043872
https://huggingface.co/EVA-UNIT-01/EVA-Qwen2.5-72B-v0.0
>>
>>103043857
>That Japanese company qwen2.5 finetune for coding
What's that one?
>>
>>103043892
https://huggingface.co/AXCXEPT/EZO-Qwen2.5-72B-Instruct
>>
>>103043879
>No quants not even Q8.0
Yeah thanks......
>>
File: 39_06703_.png (1.17 MB, 720x1280)
>>103042023
>those fucked up eyes
back to the sdslop general with you
>>
>>103043928
Y-you do know huggingface has a search bar at the top, yeah?

https://huggingface.co/models?search=EVA-Qwen2.5-72B-v0.0
>>
>>103043934
Teto tummy.
>>
>>103043928
>>
>>103043815
Qwen sucks for my use case (NSFW).

>>103043859
Just run them in RAM.
>>
>>103043586
Try aphrodite engine; it supports FP2 to FP12 (vllm doesn't, afaik) and FP6 is quite fast compared to the shitty GGUF support in vllm. I don't know if it supports AMD cards, but these quants run on a 3090.
>>
>>103043953
The Magnum fine-tune of it doesn't suck for NSFW.
>>
>>103043976
Kill yourself alpin
>>
File: 1717347777420034.png (1.05 MB, 1280x720)
>>103043994
>>
>>103043953
>Qwen sucks for my usecase(NSFW).
This finetune of qwen2.5 turns it from a positive-bias, censored but super smart model into a model that feels like claude while keeping that smarts.
>>
>>103044044
>while keeping that smarts.
this is always a lie
>>
>>103043724
>no 64GB spot
KYS
>>
>>103044062
Only when it's magnum bs. Try the eva one
>>
>>103043976
>quite fast compared to the shitty GGUF on Vllm
Be aware that vLLM copied the code for evaluating quantized GGUF models from llama.cpp several months ago and that several performance optimizations since then have not been ported over.
>>
>>103044106
buy an ad
>>
>>103043918
Thanks, downloading to test
>>
>>103043934
>fucked up eyes
nta, but crosseyed asian girls are a thing. Asada Mao made that face all the time. it's a charm point, like those yaeba fangs
>>
>>103043989
>>103044044
I've tried the Magnum v4 finetune of Largestral, and since it's the same dataset, I know I will not like it.
>>
>>103044172
Yea, the only magnum I ever found decent is the gemma 27B based one.

I was not suggesting it though, I think you misread. I was talking about this one.
https://huggingface.co/models?search=EVA-Qwen2.5-72B-v0.0
>>
>>103044164
nothing a paper bag wouldn't fix amirite
>>
>>103044196
>EVA-Qwen2.5-72B-v0.0
How bad is the positivity bias with this one?
>>
>>103044196
Is the 34B one good? Anything more is too slow for me
>>
>>103044172
I tried both and I didn't like the Large one because it just felt like the original model, not very creative and slopped. But the Qwen one does feel different, much easier to do NSFW, while it seems to remain flexible.
>>
>>103044219
It will get dark and depressing.
>>
>>103044219
There is none. It's pretty much Claude 3. Easily the best local model we have right now for creative use.
>>
Smolchads
>>
>>103044091
will next time my love <3
>>
>>103044243
>>103044245
I'll give it a try and will tell the whole thread how I feel about it in 2 days/weeks, it better be good.
>>
>>103044276
It's the first model outside of claude / gpt4+ tier ones to do complex RPG stuff.
>>
>>103044276
>t. will become yet another victim of:
>>103038666
>>
>>103044306
Or just stop trying meme models that are just merges / low rank loras on a few logs. Only try full finetunes like the one I recommended here: https://huggingface.co/models?search=EVA-Qwen2.5-72B-v0.0
>>
>>103044306
I've tried so much garbage already, another one wouldn't hurt.
>>
>>103044329
That's the spirit.
At the end of the day, there's nothing better than judging shit by yourself.
>>
and wtf is a nala test and why do i want it
>i don't even have the card
>>
>>103044336
The only thing better is being born with the gift of common sense and knowing that sloptunes have never been good and never will be good
>>
>>103044357
a guy on /lmg/ loads up a card where he gets fucked by a lion after saying "ahh ahh mistress" to see whether or not the model has the ability to
1: stay in character (not give the lion hands)
2: have sex
3: reason spatially
4: write well
>>
>>103044400
where is the nala leaderboard?
>>
Please stop pretending to be retarded guys. I beg you.
>>
Why are people talking about nala? Is there a new model to be Nala tested? At work right now so if there is it will have to wait.
>>
>>103044407
This.
>>
File: 1708862804.jpg (46 KB, 612x597)
>>103044407
but you're too young to see the nala leaderboard anon
>>
>>103044428
Beg harder. Show me how much you need it.
>>
>>103044326
>12 hours on 8xMI300X
>mi300x = 1.5TB HBM3
Chat is this real?
>>
>>103044444
Nice.
This.
>>
File: uhh...uh...uhhh.png (738 KB, 541x1240)
>>103044445
Anon... isn't Nala the cub in that picture?
>>
>>103044454
Please! Stop pretending to be retarded anon!
>>
>>103044444
That.
>>
>>103044474
It is ok because fucking a bear isn't zoophilia so... wait what?
>>
stop shilling vllm, it's janky as fuck for home use.
it doesn't even support basic shit like doing multi-gpu splitting when there are mismatched amounts of vram (e.g. 24GB 3090 + 12GB 3060).
>>
Haven't been in these threads for a while. Just tried Magnum v4 123b and it seems to have the same problem as the smaller v2 magnums when they came out. Is Mistral Large still the best?
>>
>>103044613
i agree; unless it has any obvious benefit i'm sticking to kobold or the other ones
>>
>>103044669
Yeah. Just use vanilla. Know how to prompt. That's all.
>>
>>103044613
Oh, that does make it a nonstarter for me. Thanks.
>>
>>103044613
That's how tensor parallelism works, retard. Your mismatched setup isn't standard
>>
>>103044768
Right, which makes it worthless for home users who tend to have non-standard setups, retard
The lack of attention paid to such a basic enthusiast use case as mismatched GPUs shows that it's not made for us, so stop shilling it
>>
>>103044768
There is no technical reason that would inherently prevent you from parallelizing any number of possibly mismatched GPUs.
It's just that that use case would require additional effort to support but not be relevant for vLLM's target audience.
>>
>>103044800
what is llamacpp's target audience?
>>
>>103044808
hobbyists
>>
>>103044808
Depends on who you ask.
I'm targeting enthusiasts, particularly those who can potentially make useful contributions but can't afford top-of-the-line hardware.
>>
>>103044768
llamacpp's tensor parallelism implementation works fine with mismatched vram, I use it daily. Don't ever try to sound like you know what you're talking about again.
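For reference, the mismatched split is just a ratio. Through llama-cpp-python it looks roughly like this (path and ratio are made up for illustration; the CLI equivalent is --tensor-split / -ts):

from llama_cpp import Llama

llm = Llama(
    model_path="models/some-model-Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,        # offload all layers
    tensor_split=[24, 12],  # proportional VRAM share: the 3090 gets ~2/3, the 3060 ~1/3
)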
>>
>>103044828
>particularly those that can potentially make useful contributions
This sounds like a pyramid scheme.
>I'm contributing in the hopes that others will also contribute their time.
>>
Luminum 0.1 123b at q3_K_M is the end for me. I'm usually a base instruct model fella, but I never felt this way with largestral.
It could be because I am using q3_K_M instead of q3_K_S like I usually do.
Anyways, the only problem is that it's 0.6 tokens per second. But that's fine for sonnet at home.
>>
>>103044829
Try to reach vllm speed with that lmao
>>
File: zzybwbts2q0d1.jpg (194 KB, 692x1100)
194 KB
194 KB JPG
>>103044828
Based.
>>
>>103044863
vllm is slow shit compared to lcpp and exl2 unless you're doing batch inferencing for server usage, which no one here is
>>
>>103044845
Then the entirety of open source is a pyramid scheme.
>>
>>103044925
Isn't it tho?
>>
>>103044808
people who missed out on RTX 3090s being 600 bucks used
>>
>>103045096
oh no, now they have to buy them for 500 used. the horror
>>
coping shitskin above me
>>
post above me is the original nala tester
>>
post below me is a retarded gooner (the entirety of /lmg/)
>>
>>103038380
The thread pasta is so fucking stupid. Doesn't have anything for noobs. All you faggots told me it wouldn't work on my laptop. I fucked around with python for ages and got shitty outputs.

Literally just installed LM studio and downloaded a model and I'm getting chatgpt style outputs on my laptop that's 4 years old.

Holy fuck you're all retards.
>>
>>103045172
Fucking pathetic dweebs gatekeeping something that normies actually use to make their lives easier and you retarded shits are just using it to goon
>>
File: th-167370784.jpg (17 KB, 474x266)
>>103045172
>but we did tell you about LM studio
>>
>>103045172
>Doesnt have anything for noobs.
we're not trying to spoonfeed retards.
the opposite, in fact.
fuck off, we're full.
>LM studio
go back
>Holy fuck youre all retards.
no u
>103045187
this poster is a homosexual
>>
>>103045187
It all makes sense now. You're just pathetic losers that want to sound cool so you can sperg out of your silence when you eavesdrop on normies and hear them talk about AI.

You're probably all "doing crypto" too.
>>
>>103045191
I literally asked chatgpt the easiest way to install an offline language model.
>>
>>103045201
I'm doing your mom.
>>
>>103045192
>>LM studio
>go back
You're a fat neckbeard that's gatekeeping something normies use in between having sex. Literally no one cares what you know about AI
>>
>>103045201
suck my manhood
>>
>>103045201
>>103045225
https://www.youtube.com/watch?v=0_04Z-7kZ9E
>>
>>103045226
Why. Are. You. Here? Who told you about this place? Go to LocalLlama. They don't gatekeep there.
>>103045245
based
>>
Anyway... have fun doing AI and crypto you nerds.
>>
you all need to lay off the apple cider
you're all ugly drunks
>>
>>103045245
nice bush
>>
>>103045258
Apple cider is peak nerd alcohol
>>
>>103045262
and?
>>
File: 1724265947004984.jpg (863 KB, 1856x2464)
>>103038380
>>
Are there any good youtuber channels covering llms, image gen, and similar?
>>
>>103045252
chad mcchad face over here off to fuck his next girlfriend and his next line of coke in his city penthouse
>>
>>103045319
>Are there any good youtuber channels
Depends. How severe is your mental retardation?
>>
>getting banned for posting your cock
At least post some blacked miku.
>>
>>103045334
>Depends. How severe is your mental retardation?
Pretty severe.
If there's a moment I'm not suckling from youtube's teat, I feel distressed.
>>
I am about to load the shilled qwen 70B. I will complain about it being shit in 10-20 minutes.
>>
>>103045376
also paste your stats
>>
>>103045507
>>103045507
>>103045507
>>
>>103043976
It’s W8A8, not GGUF. All of the documentation for GGUF in vllm is plastered in warnings not to use it.
Fp6 is way too low; this is internal corporate stuff, not fuckbots. Fp8 is already, ehhhhh, noticeably less bright.
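For anyone wondering what producing one of those W8A8/FP8 checkpoints for vLLM involves, the llm-compressor flow is roughly the following. Module paths move around between releases and the model id is a placeholder, so treat it as an outline rather than gospel:

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot  # import path varies by release

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",  # FP8 weights + dynamic FP8 activations (W8A8)
    ignore=["lm_head"],
)
oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model id
    recipe=recipe,
    output_dir="Meta-Llama-3-8B-Instruct-FP8-dynamic",
)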
>>
>>103044613
>it doesn’t support poor people bullshit
Good
Vllm is for the gainfully employed male with 2xH100s under his desk
>>
>>103044880
Exl2 is fast in the same way writing to /dev/null is fast.



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.