/g/ - Technology






File: denial.png (1.23 MB, 3330x2006)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101507132 & >>101497246

>48GB and Above VRAMfags in Suicide Watch Edition

►News
>(07/18) Improved DeepSeek-V2-Chat 236B: https://hf.co/deepseek-ai/DeepSeek-V2-Chat-0628
>(07/18) Mistral NeMo 12B base & instruct with 128k context: https://mistral.ai/news/mistral-nemo/
>(07/16) Codestral Mamba, tested up to 256k context: https://hf.co/mistralai/mamba-codestral-7B-v0.1
>(07/16) MathΣtral Instruct based on Mistral 7B: https://hf.co/mistralai/mathstral-7B-v0.1
>(07/13) Llama 3 405B coming July 23rd: https://x.com/steph_palazzolo/status/1811791968600576271

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
Surely my 48GB rig will be worth using again once 400b is out...
>>
Imagine being so butthurt you make a new thread, and then immediately write the first reply to epicly own the guys that you hate and continue the inane argument from the last thread
>>
>>101514777
The first reply isn't that aggressive, petra. Stop being insecure.
>>
>>101514682
This pathetic early bake oozes envy
>>
>>101514801
>page 9 is early
>>
>>101514682
Gotta give credit to LiveBench, at least this one puts the actual best LLM on the top, chatbot arena doesn't do that and it's hurting its credibility hard
>>
>>101514793
>Everybody I don't like is petra
>>
48gb is worth it, only retards settle for 24gb. have fun with nemo while I'm rocking miqu
>>
>>101514834
Llama 1 65B outperforms Miqu.
>>
>>101514841
pygmalion-6.7b outperforms Claude 3.5 Sonnet
>>
>>101514850
GPT-4chan still has SOTA Trutheval score.
>>
>>101514861
As it should, it's one of the only sites where people can genuinely speak their mind without worrying about being censored or canceled.
>>
post nemo instruct and context jsons
>>
stupid question
can language models do search? probably not since "dataset" is part of the model and any additions would require retraining

basic idea is to describe my local images with chatgpt and then somehow use these descriptions to do natural language search locally
>>
Guys, why are vramlets winning so bigly with local SOTA being less than 24 gigabytes of VRAM?
I thought if I invest big into gpus the placebo feeling will stop... Is the watermelon meme true? Guys?
>>
►Recent Highlights from the Previous Thread: >>101507132

--Papers: >>101514489
--LlaMA 3.1 405B weights leaked, repository taken down: >>101511690 >>101511724 >>101511748 >>101511821
--Sliding temperature setting proposal to balance AI creativity and coherence: >>101510643 >>101510688 >>101510752 >>101510782 >>101510741
--Security concerns and practices for AI software installation: >>101511323 >>101511388 >>101511619
--RAID 0 array with a second SSD: Will it affect model loading speed?: >>101510478 >>101510539 >>101510607
--KoboldAI Lite is a lightweight web UI; check logs for hidden responses due to hallucination or lack of stop string: >>101512051 >>101512071 >>101512121 >>101512118 >>101512128
--Haize Labs and their 'readteaming' AI systems: >>101509434
--Exllama and vLLM forks with improved inference and hardware utilization: >>101509795 >>101509906 >>101509977 >>101510621 >>101511457 >>101511496 >>101511574 >>101511518 >>101511631 >>101512109
--DeepSeek-V2-Chat model requiring data center-level hardware: >>101510622 >>101511429
--DeepSeek V2 236B and its hardware requirements: >>101510006 >>101510096 >>101510329
--CR+ model performance evaluation for coding and translation tasks: >>101507354 >>101507470 >>101511624
--Running 8 LLMs at once to play Amogus: >>101512449 >>101512480 >>101512507
--Nemo is better than CR+ for Anon with 96GB VRAM due to faster generation speed: >>101508154 >>101508231 >>101508258
--Issue with loading nemo gguf in ooba due to pending PR #8577 in llama.cpp: >>101509954 >>101510032
--Anon is excited about the release of LLaMA 3.1, which promises improvements in context length and repetition handling.: >>101508259 >>101508344 >>101508376 >>101508399 >>101508327 >>101508375
--LLMs at their most creative, kino, and genius: >>101509473 >>101510697 >>101509650 >>101509703 >>101509726 >>101511910
--Miku (free space): >>101509085 >>101511334 >>101513886

►Recent Highlight Posts from the Previous Thread: >>101507146
>>
>>101514990
If you want an overly complicated system, you can use an embeddings model, embed the descriptions into a database or whatever with the image path. When you want to search, embed the search terms and scan through the embeddings database, looking for entries that are semantically similar, and output their paths.
Or use grep.
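A minimal sketch of that pipeline with sentence-transformers, assuming you already have a dict mapping image paths to the descriptions ChatGPT (or whatever) generated; the model name and paths below are placeholders:

# Semantic search over image descriptions, kept in memory for simplicity.
from sentence_transformers import SentenceTransformer, util

descriptions = {
    "photos/cat_on_keyboard.jpg": "an orange cat sleeping on a mechanical keyboard",
    "photos/beach_sunset.jpg": "a sunset over the ocean with palm trees in silhouette",
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, runs fine on CPU
paths = list(descriptions.keys())
corpus_emb = model.encode(list(descriptions.values()), convert_to_tensor=True)

def search(query, top_k=5):
    query_emb = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=top_k)[0]
    return [(paths[h["corpus_id"]], h["score"]) for h in hits]

print(search("cat sitting on my keyboard"))

For a few thousand images you don't even need a vector database; a pickled dict of embeddings is plenty.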
>>
>>101514990
learn embedding and vector search
https://www.sbert.net/examples/applications/image-search/README.html
>>
my nemo feels like it's afraid to say bad no-no words, how to fix?
>>
local malding general
>>
>>101515020
>>101515059
thanks, that looks interesting
>>
new to thread - are people actually saying a 48GB setup isn't worth it? A home compute cluster will always be worth it, wtf lmfao.
>>
>>101515135
8 billion parameters is more than enough for anyone and you only need a single 24GB card for that
>>
>>101515135
48gb is always worth it because you can run two nemos while they can only run one more is better
>>
>>101515147
okay, that makes more sense. I'd love to throw together a cluster for home robotics + AI management though, and I'm sure I could hit a 48GB limit. This is wild, however... how are people securing their clusters? We've seen cryptominer-ware for years, up next is compute-hijacking, no?

Also, apologies for any late replies - I'm at work and also the 4chan captchas still make me feel slightly retarded (and i am already aware of a latent retardation within me, lmfao)
>>
>>101515189
>This is wild, however...how are people securing their clusters?
By not exposing any services externally and not downloading stupid shit.
>>
Is there somewhere I can read about the techniques that were used to make this 12B model better than the previous gen of 70Bs? Did they release a paper?
>>
>>101515246
Wait, you genuinely believe that Mistral Memo is better than L3-70b?
>>
>>101515135
Because the highlighted model in the OP pic fits in 24GB and the next one that's better needs >100GB.
>>
>>101515222
Ah, so what you're saying is "no microsoft server shit ever"

Cannot believe Google created Kubernetes and will now dominate home-compute clusters by creating a one-click solution that will see the bulk of its sales to smart-home youtubers
>>
>>101515261
Fuck off if you have nothing useful to say.
>>
>>101515246
The only thing that Nemo has going for it is being good for creative writing, and I think it does that by not being heavily filtered and censored. And whatever in the training makes it not confident in a single answer.
>>
>>101515279
no, you fuck off, you make shit takes, expect to get clowned on, that's how it works
>>
>>101515279
useful = anything positive about small models?
>>
>>101515285
>The only thing that Nemo has going for it is being good for creative writing
I like its sovl, that's what was missing on the local LLM space, but it's too small and retarded, unironically a 90b-BitNet-Mistral-Nemo would be so fucking good it would compete against the best API models, just imagine
>>
Guys. Imagine this. Not only will Meta be releasing models tomorrow, but another company also will. We're going to be so back.
>>
>>101515008
After not seeing the recap at the first post I was worried it wasn't going to be there, keep up the good work.
>>
File: Capture.jpg (246 KB, 2395x1264)
>>101515285
>The only thing that Nemo has going for it is being good for creative writing
I downloaded this model because Mixtral was excellent at french, I expected it would be the same for Nemo, but that's not the case at all, it fucking sucks, and that was one of the main points of this model, I guess that small models can't be good at english and other languages at the same time
>>
Nemo transformers status?
>>
>>101515306
>but another company
How can I trust you to be telling the truth when you can't even name the "Other company".
>>
>>101515345
Fine since the day the model was released, you just had to compile transformers from github which is simple even on windows (which is usually a pain in the ass to compile python stuff on)
>>
>>101515361
I did exactly that. And it gave me the tensor shape error still.
>>
>>101515368
Weird, works on my machine
What loader/UI? I used ooba
>>
from the previous thread:
>>101512013
>You don't have to disable flash attention with exllama but the quality is poor. llama.cpp's quality is better, but it doesn't support flash attention.
I had to go check this and i can say for sure that you dont have to disable flash attention with gemma-27b and llama.cpp either. i don't know if it makes any difference to have it turned on, yet, but you can still generate responses with the option enabled, at least.
>>
>>101515353
Don't trust me. Just imagine it.
>>
>>101515376
Same. I guess I'll just have to try wiping ooba and doing a complete fresh install, maybe even rebuild the conda environment.
Although it just occurred to me, I did build transformers from source for something else entirely prior to mistral Nemo release. It was updated since then but the version number is the same. Is it possible that it's pulling it from pip cache instead of redownloading everything because of that?
>>
>>101515405
Hmm I don't think so, since the command to build transformers from source should implicitly override any attempt to use a cached wheel
On the other hand I've definitely had to reinstall ooba from scratch before due to weird broken packages that wouldn't update properly, so that's always worth trying
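If anyone wants to rule out the cached-wheel theory, something like this forces pip to rebuild from the repo and then shows which install python is actually picking up (standard pip flags, nothing ooba-specific):

pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/huggingface/transformers.git
python -c "import transformers; print(transformers.__version__, transformers.__file__)"

Run it inside the same conda env ooba uses, otherwise you're just checking a different environment.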
>>
reminder actually using the recommended prompt formatting causes slop to head up, especially for nemo as it went from reddit mod to KKK member instantly
>>
>>101514682
How did Anthropic manage to bring 3 Sonnet from 38 points to 61 with 3.5 Sonnet? Did they discover something unique?
>>
>>101515487
Not sure but it's not BS. It truly is next level over anything else / on the level of claude opus while being much smaller / faster.
>>
>>101515495
I hope they can do the same to 3 Opus. It's currently 50 points on that leaderboard, imagine if they manage to do the same 20 points improvement so it'll be 70 points.
>>
What model size do you think 3.5 Sonnet is? 70-100B? Is it MoE or dense?
>>
>>101515507
This question has already been answered multiple times search the archives
>>
>>101515533
Hmm, so ~70B? I checked the archives
>>
>>101515487
I have no idea how they managed to make C3.5 Sonnet so good, but they fucking did it, looks like AnthropicAI is currently the only company with a moat now, OpenAI's days are numbered if they can't find anything new soon
>>
>>101515575
Well, to be fair, GPT-4o mini is impressive on this leaderboard considering its cost. But I'm kinda positive that 3.5 Haiku will absolutely mog it as well.
>>
>>101515495
3.5 Sonnet is extremely overfit on assistant slop. It's great if you ask models to make spreadsheets, it sucks if you want to goon. Opus still mogs in the gooning department, and that's not likely to change until 3.5 Opus drops.
>>
>>101515594
>extremely
I wouldn't say so. GPT is extremely overfit on assistant stuff, but not 3.5 Sonnet. Sure it's more trained on it than Claude 3 models, but not "extremely".
>>
>>101515575
While I have my doubts it'll actually play out this way, it would be so fucking funny if OpenAI, the company that started the closed source, closed research movement, got completely mogged by another company due to their own secret sauce
>>
>>101515575
Not being lost on me that the best competitor to OpenAI is staffed with former OpenAI employees. Likewise, the best competitor to Meta (Mistral) is staffed by former Meta employees. Hmm.
>>
>>101514682
What is this benchmark? There's a 27b model that's better than wizardlm 8x22b? I hadn't had a good experience with anything else.
>>
What’s the big deal with the community getting access to llama 405b weights
>>
>>101514682
>>48GB and Above VRAMfags in Suicide Watch Edition
QRD?
>>
>>101515687
Ive kept saying it. 27B atm is best local for non creative stuff. Its too dry. Nemo is the opposite, a bit dumb but its dripping soul.
>>
gemma 3 when?
>>
>>101515795
this time, bitnet version
>>
>>101515008
Thank you Recap Anon
>>
>>101515780
I'll give it a try, I stuck to the wizard 8x22b because other models seemed to not be able to state some facts like they removed them from the model or something. It would be familiar with a book for example but make up the author's name.
>>
>>101514710
Midnight miqu is still worth using and better than other 70b models.
>>
The problem with larger models is that there is a dismissal return. I will say that around 30b seems to be the sweet spot. All the 70b models I tried were not twice as good, not even half as good, like 10–20% better than the 30b, and the 30b is not that much better than the 13/12b models. And that may not be true anymore too since if I compare the new Mistral to YI, I do not think YI is doing any better. The parameters alone are not the solution to how to make the models better and more intelligent.
>>
>>101516306
thank you for this revolutionary and completely new information, anon
>>
>>101516334
It is not for you but for the anons who are so dissimistic of the smaller model. No reason to be cunt.
>>
>>>101516306
>dismissal return
>>101516347
>dissimistic
are you trying to invent a new language too? surely you are a modern day da vinci
>>
>>101516364
I did mean dismissive. I am on toilet shitting. Do you have more stupid questions?
>>
>>101516306
>The parameters alone are not the solution to how to make the models better and more intelligent.
Tell that to Meta who decided to burn tens of millions of dollars to make L3-405b kek
>>
>>101516402
don't get baited into an argument with autists, they can continue on doing it for a hundred posts, its not worth it
>>
File: migus.png (1.73 MB, 850x1511)
>open secrets, anyone can find me
>hear your music running through my mind!

magnet:?xt=urn:btih:c0e342ae5677582f92c52d8019cc32e1f86f1d83&dn=miqu-2&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80

https://files.catbox.moe/d88djr.torrent
>>
>>101516633
I like these Mikus
>>
>>101516633
>If miqu is so good, why isn't there a miqu tw-
>>
>>101516633
I kneel llama anon
>>
>>101516633
imagine the sheer amount of fumes concentrating in between their crotches
>>
>>101516633
is this llama3 ?
>>
>>101516678
no it's miqu2
>>
File: 1693659341071358.png (19 KB, 685x795)
>>101516633
>>
>>101516682
no but for real, is it llama 3 405b? what other model would it be? I mean it's nice getting it a day early, but not too groundbreaking.
>>
File: miqu!!!.png (3 KB, 482x61)
>>101516678
>>101516633
oh shit someone actually snatched the 405b
>>
>>101516678
miqu sounds like mistral, bigger nemo?
>>
>>101516693
763GB big?
>>
File: config_llama.png (76 KB, 428x785)
>>101516633
>>101516693
126 layers, ~130k context
Llama3 tokenizer size & arch
>>
>>101516633
https://huggingface.co/cloud-district/miqu-2
>>
>>101516633
>model-00001-of-00191.safetensors
Jesus fuck, how is anyone gonna run this thing
>>
>>101516633
Chat, is this real?
>>
>>101516633
>>101516725
>This repository corresponds to the base Llama 3.1 405B model. The model has the same model weight format, but does RoPE using per frequency scaling, hence requiring code changes for inference.
>>
>>101516633
Confirmed real. It's Llama 3.1 405b base.
>>
>>101516633
I knew someone had the time to download llama3-405b from the huggingface leak
>>
>>101516633
it's the base model right? kek, that means someone will have to finetune this monster, good luck with that
>>
I tried out the instruct on a leaked endpoint. It's coal.
>certainly!
>>
>>101516633
SEED SEED SEED SEED SEED SEED SEED SEED SEED SEED SEED SEED SEED SEED
>>
>>101516858
seed in miqu womb
>>
>>101516633
If someone decided to leak this shit, that means Meta had no intention of releasing it as a local model, right?
>>
>>101516880
are you dumb dumb? meta was about to release it tomorrow
>>
>>101516725
https://huggingface.co/cloud-district/miqu-2/discussions/2
>Hi everyone,

>I wanted to raise an important ethical concern regarding the use of large language models (LLMs) like this one in certain environments, particularly when it comes to safety. As we all know, LLMs require substantial computational power, often relying on powerful servers that can generate a significant amount of heat.

>I’ve been thinking about the implications of running these models in a room that is also being used for excessive physical activity—like a home gym or a dance studio. The combination of high processing demands and vigorous movement can create a dangerous environment.

>When you have a server running at full capacity, it can generate heat that, in a confined space, may turn the room into something akin to a giant air fryer! This not only poses a fire risk, but it can also lead to poor air quality and overheating, making it uncomfortable and potentially hazardous for anyone exercising nearby.

>I’m curious if others share this concern and whether there are any safety protocols or recommendations for minimizing risks when using LLMs in active environments. Should we be more conscious about where and how we run these powerful systems?

>Looking forward to your thoughts and any insights you may have!

>Best,
>Charles McSneed
>>
>>101516883
so this mf didn't want to wait a single day and decided to release the torrent because of that? kek that's something
>>
God dammit
Now I can't help but notice the shivers going down my spine.
>>
when will nemo work with ooba, what needs to happen
>>
what's the point of leaking only 24 hours before the official release

that's too close for me to care, it's mildly funny i guess but not useful
>>
>>101516907
it works on exllama_hf
>>
the point was to mog meta, and also because it's funny
>>
>>101516911
It means either exllama or llama.cpp will rush to get Day -1 support for 405b kek
>>
>>101516633
Wrong trip.
>>
What's the typical time they release models?
>>
>>101516907
it already works fine in ooba with transformers or exl2 loaders
>>
>>101516936
My apologies, wrong password.
>>
>>101516633
dunno what you expect us to do with the base model, they will release the base model and the instruct model tomorrow so...
>>
>>101516934
They were always gonna rush to implement support anyways though
this is funny but a nothingburger
>>
>>101516948
Naisu.
>>
File: 1699385942502783.png (128 KB, 1279x767)
>>101516948
Still fake
>>
>>101516966
Real however? I was seeding llama1 a year ago leaked by this correct trip.
>>
>>101516972
Why are you using random trips then?
>>
>>101516948
You are a pussy for leaking it so late, it's worthless to do it this close to release.
>>
>>101516973
nigga he probably mistyped its fine
>>
>>101516979
it was done for the funny factor anon
>>
File: vegeta kneeling.gif (410 KB, 168x498)
>>101516633
I kneel mikufags, you've won me with this one
>>
>>101516979
Eh, it's nice to know there's someone important still here
>>
>>101516633
Now we wait for an aicg fag like mm or fiz to run this on aws
>>
>>101517032
are you retarded? you can't upload custom models on aws
>>
>>101517038
>he doesn't know
>>
How much VRAM would hosting a 405b take?
>>101517038
You can, but it requires bartering with support for the ability to.
>>
>>101517038
You need enterprise accounts and only 3 autists from that shitty general know how to do it
>>
>>101517068
The support would detect that they've hacked the account anyway if you were to contact them.
>>
>>101517056
>How much VRAM would hosting a 405b take?
~400gb at 8bpw
~800gb at bf16
~200gb at 4bpw
>>
>>101517074
So I can use the 0.9bpw version, great
>>
>>101517111
kek, I fucking hate Meta for not making a bitnet model, they are taking zero risk even though they can burn a shit ton of money on experiments, that shit's frustrating
>>
>>101517136
i wish DeepSeek would try, they've demonstrated a willingness to burn compute on weird shit
>>
Since it's possible to distillate 405b into a smaller models, is it also possible to make a smaller bitnet model?
>>
>>101517164
you faggots and your bitnet
>>
so, who is the meta insider posting here?
Le cun? Maybe Zuck? or just some random indian?
>>
>>101517185
Me my name is john
>>
>>101517185
>lecun
way too much of a bluepilled normie, guaranteed he thinks everyone on 4chan is a poltard nazi
>>
>>101517185
Zucc is too lame, but I can see LeCunny doing it. Some random Based Sir is also always an option.
>>
>>101517216
I don't think he knows what 4chan is
He is 80 after all
>>
>>101517185
There's no insider that matters. Whoever uploaded Llama 3 405B just got it from the accidentally-made public test repository on HuggingFace before it got taken down.
>>
nemo llama.cpp support merged
>>
>>101517217
>I can see LeCunny doing it
retard
>>
>>101517225
Kek, no. He's 64.
>>
>>101517236
The same guy that leaked Llama-1? And miqu-1 (mistral medium)?
It also was up for like 5 minutes, no time to download it even if you clicked on download right when it was published
>>
>>101517217
>Zucc is too lame
I think he's starting to swallow the redpill though, the fact he decided to opensource llama, learning MMA and outright saying that trump's reaction to the assassination attempt was "badass" is showing how based he became.
https://www.youtube.com/watch?v=XgWFwVRGcf4
>>
>>101517252
Everybody and their dog got access to Llama-1, you just needed an academic email address.
>>
>>101517216
He was also all negative about the shit he was working on. I do not expect anything great from Llama anymore.
>>
>>101517284
lecun is not working on llama though
he is working on jepa
>>
>>101517294
>jepa
what's that? a new architecture?
>>
>>101516725
>404
It was fun while it lasted
>>
>>101517284
lecun doesn't work on the llm side of things fortunately
>>
>>101517303
https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/
https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/
Interesting project, it aims to replicate the "cat level intelligence" lecum is always talking about, using more natural methods (no next token prediction for example)
>>
>>101517312
torrent still up. please seed.

While we're at it, qbittorrent is being very mean to me. It goes from "Seeding" to "Completed" by itself every few minutes, and I have to manually toggle it back on. Any ideas? I'm using the web UI.
>>
>>101517331
Im buying a 1TB hdd just to seed this, arrives in a couple of days
>>
>>101517294
>>101517319
Good to know.
>>
>>101517331
Force resume
>>
>>101517330
so jepa is like a multimodal model not just a simple llm right?
>>
>>101517331
Right click, torrent option, "Set no share limit", resume.
Or: Right click, force resume.
>>
>>101517353
>>101517356
>force resume
I do that every time.

There's no "set no share limit" option on web ui, but there is "seeding limits". I set that to pause seeding when I reach a ratio of 1000. Still doesn't work.
>>
>>101516633
i haven't been part of a good thread in years came to dl and get in the screencap
>>
File: pepe door.jpg (540 KB, 1805x1015)
>>101517390
You'd expect a meta insider to know how to use a computer
>>
>>101517458
aaand it's 404'd
hey mikufriend, if you upload a 4-bit version it'd be easier for us to torrent
>>
>>101517390
set the ratio to 0
>>
>>101517216
>he thinks everyone on 4chan is a poltard nazi
he is mostly right about that
>>
>>101517508
are you a poltard nazi anon?
>>
File: llama.png (34 KB, 943x379)
finally! nemo is supported!
>>
fucking hell last time I checked P40s on ebay they were going for 180€, now they are at 350€ what the fuck happened
>>
>>101517538
now I have to wait for llama cpp python to make a new version, and booba to make the weights, such is the fate of the booba users
>>
>>101515119
From my experience - depending on the queries, it'll most likely suck though.
I got better results using function calling, where the LLM would generate a query and I would just use it to call something like ElasticSearch or whatever. But again, depends on your use case and there's still quite a few rough edges.
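For reference, the function-calling side of that is just handing the model a schema and then executing whatever query it fills in yourself; a rough sketch with made-up names (OpenAI-style tool format, since most local backends imitate it):

# Hypothetical tool definition; the model picks the function and arguments,
# your code runs the actual search (Elasticsearch, SQLite FTS, whatever).
search_tool = {
    "type": "function",
    "function": {
        "name": "search_images",
        "description": "Search the local image index by natural-language description",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "what the image should show"},
                "max_results": {"type": "integer", "default": 5},
            },
            "required": ["query"],
        },
    },
}
# Pass tools=[search_tool] to your chat-completions backend, parse the tool call
# it returns, run the query yourself, and feed the results back as a tool message.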
>>
>>101517539
blame Johannes
>>
https://reddit.com/r/LocalLLaMA/comments/1e98zrb/llama_31_405b_base_model_available_for_download/
First time I see a leddit post crediting 4chan, that's surprisingly nice I guess?
>>
>>101517730
would have been better if they'd just reposted the magnet without attribution, this is going to lead to a small but annoying influx of reddity people
>redditors are already here
sure, but it'll get worse
>>
>>101517790
ill have to pump the "nigger" per message count by 10 to make up for this
thanks, redditniggers
>>
>>101517810
based
>>
>>101517810
doing god's work anon
>>
>>101516633
inb4 it's just random init weights to troll people
>>
>>101518037
it's the base model though, so the outputs will not be much different than random weights kek
>>
Don't care about 400b at all, the only ones that can run it locally are corpos and mega autists. When is meta releasing extended context 70b?
>>
>>101516633
HOLY FUCKING KINO
>>
File: 1716318516084046.png (20 KB, 329x566)
>>101516633
>>
>>101518157
The 8B and 70B's will apparently be distillations of the 405B (which will be extended context, multimodal).
>>
>>101518182
>multimodal
last I heard that was pushed back, it's the text only that will release for now
>>
how does K compare to K_L?
do the 8bit layers do anything or is it placebo?
>>
>>101517574
I had to switch to og llama.cpp a few weeks ago after ooba kinda dropped amd support. would recommend
>>
>>101518182
>extended context only on distilled models
niggers
>>
>>101518169
If this was a BitNet model maybe you could use it on one 80GB GPU, and most definitely on 2x48GB GPUs.
>>
To the Russian nigger who completed the download and isn't seeding, fuck you.
>>
>>101518202
I think it's working again now, there was a compiler issue that oogies solved
>>
File: gape middle finger.png (19 KB, 800x800)
>>101518232
>>
>>101518232
Do you have your port forwarded?
>>
>>101516633
I don't get what is the point of adding a dead tracker - either skip it altogether or add a bunch of live ones
>>
File: Woag.gif (3.85 MB, 474x498)
>>101516633
>>
File: 1709599649662777.png (25 KB, 678x161)
How does nemo compare to bagel mystery tour?
>>
File: tokeniser.png (16 KB, 606x428)
>>101516633
Do you think Meta secretly gave some of the reserved special tokens some kind of meaning? E.g. <|C_tag|>
It would allow them to hold an advantage over the local inferencers.
Maybe they are reserved for the multimodal stuff?
>>
>>101518290
Probably mostly reserved for future use and not initialized at the moment. The base version of Llama-3 has that problem with the special tokens used for the Instruct finetune.
>>
>>101516633
unfathomably based
>>
>>101515377
>i don't know if it makes any difference to have it turned on, yet, but you can still generate responses with the option enabled, at least.
You get a message saying that it's disabled.
There's a PR open for logit soft capping with FA if I'm not wrong.
>>
Are there any models who can coherently roleplay a robot/android? Maybe some specific datasets based on this...
>>
>>101518609
Sorry anon, 2b won't crush your head with her 150 kilogram ass.
>>
>>101516633
anyone got quants that can fit in my 6gb card?
>>
>>101518620
But perhaps 8B will...
>>
File: edge yeah fair enough.gif (432 KB, 200x126)
>>101518649
>>
>>101518675
I just need to know which one...
>>
>>101516633
This is an asshole move, considering that the official release will happen in literally a day. The meta developers deserve the spotlight themselves not some loser with an Azumanga Daioh fixation.
>>
>>101518713
>This is an asshole move
ok and?
>>
>>101518725
Nothing, just wanted to vent.
>>
>>101518709
honestly from my experience ive seen basically every model ive used roleplay as a maid, robot/animatronic/android just fine, combining the two? Man im sure the latest models would do it really well.
>>
>>101518755
kill yourself
>>
Hello friends from reddit!
>>
>>101518797
I don't like black dick >:(
>>
>>101516633
Retard
>>
>>101518749
>latest
Like L3? It's nice, but from my experience still somewhat struggles with that. That's why I'm looking for a model finetuned specifically for that, or including a significant amount of such entries in a dataset. But so far searching Huggingface has not given any results.
>>
>>101518813
then why are you posting Miku? braindead
>>
Let's use this opportunity for some learning!
- Blacks have lower IQ than whites! Their average is 80-85, while the white average is 100. That's why they are unsuccessful in every country. Their brains have lower volume as well.
- Jews are responsible for feminism, socialism, and all kinds of degenerate progressism that we have to live with nowadays! No it's not the Chinese, it was always Jews. Just navigate wikipedia and learn how to read the early life sections.
Have a good day!
>>
what's the point of a model no sane person can run at home?
fucking retards
>>
>>101518846
Bless petra
>>
>>101518849
Cope local cuck
>>
Great a fagoot appeared and ruin the whole thread for us.. Thanks? i hope you get beaten to death.
>>
>>101518869
Put me in the screencap! Epic thread!
>>
>>101518803
they look like turds kek so disgusting
>>
>>101518869
Nah, if that was the case he would spam some lolis too. I guess he is just a poser.
>>
>>101518888
Beautiful numbers, Anon.
>>
>>101515575
>I have no idea how they managed to make C3.5 Sonnet so good
did it get worse for anyone else recently or am I going insane
>>
>>101514938
I'm running Nemo, and it seems like the 'Mistral' template in SillyTavern works fine. I have temp set to 0.8, and neutralized all the other samplers.
>>
o7 BBC Miku anon, please never come back
>>
>>101517032
I dont think anyone has access to more than 8 a100 gpus
>>
>>101518958
Worse in what way?
>>
>>101518993
There's this schizo theory in aicg that exact same models (in the API!) get "optimized" over time and become "worse".
>>
>>101518613
>who the fuck can even run this here?
petra confirmed to be a butthurt vramlet
>>
RIP NeMo
Miqu-2 is the new boss in town
>>
The Mistral-Nemo 128k context is a meme, right? it seems to shit the bed around 10-12k for me. Anyone having better luck?
>>
>>101519082
oh yeah? can I run miqu 2 on my RTX 3090 just like I can with NeMO?
>>
>>101519088
just wait for 0.525 bpw quant to drop
>>
>>101519037
It's not a theory, that's exactly what happens. Since it's "API" they can do anything with it and you would never know.
>>
>>101519144
>It's not a theory, that's exactly what happens
Schizo
>>
>>101519084
I get similar results. The model starts to write in a retarded way like:
"She begins conversationally forevermore"
>>
>>101519151
what backend?
>>
>>101519148
ok newfag
>>101519144
Pretty much every API model gets safer and dumber over time. Just so people can be "wowed" when new one is released
>>
>>101519161
>ok newfag
not a single proof btw
>>
>>101519160
I use kobold.cpp fork
>>
>>101519160
tabbyapi
>>
>>101519151
Nah. I haven't seen that with vLLM. But I have seen some repetition.
>>
>>101519164
If you were around when OG gpt4 dropped, you would know
>no proof / logs
small history lesson. Often early stolen keys were shared in a huggingface "spaces". You fed it character defs with opening and got going. Spaces were deleted after keys got dry. No one really bothered to save logs
>>
>>101519148
I do agree that any model will seem "worse" over time just because you get used to it and most degradation claims are unfounded. But why wouldn't a corpo downgrade your quant or give you a distilled version over time just to save costs? Let's say they give you an inferior model every 10th gen. You literally have no way of ever confirming that you get the same model each time.
But it's not like you can do anything about it, only suffer.
>>
>>101519191
>If you were around when OG gpt4 dropped, you would know
I were there before GPT-4 was even released, you nigger.
>Often early stolen keys were shared in a huggingface "spaces"
Early stolen keys were shared directly in the /aicg/ thread, before we even had GPT-4.
>>
File: Untitle11d.jpg (106 KB, 640x640)
>>101516633
based
>>
File: 1712862913137015.png (613 KB, 882x1280)
>>101516633
Thank you for your service, comrade Miku
>>
405B leaked? I-I'm just going to wait for gguf support.
>>
>>101519084
That's normal. Even sota corpo models like gpt4o will gradually lose coherence at 20k. Current models are not able to use the full context size in something as demanding as roleplay, 128k can be effective in tasks such as summarizing texts or 'a needle in a haystack' request.
>>
>>101519227
Got it thanks. So better just limit the context and roll with it.
>>
>>101516633
nice digits leakchad
>>
Gemma is pretty good but it broke char midway through to warn me about how illegal what I was doing is were it real life and not text on a screen.

So in the trash it goes.
>>
>>101519324
Hi Petra
>>
>>101519350
>>97062246
>I'm not Petra. Petra's an amateur. I'm something considerably worse.
>I'm also the point of origin for the practice of the above being added to sysprompts; as well as the 2, 5, 10, 12, and 60 times tables, which enable bots to answer arithmetic questions, when everyone previously said that they never could, and laughed at me for trying.
>>
File: dance.gif (720 KB, 272x392)
>>101519350
goodmorning sir
>>
>>101516633
I KNEEL
GPT-4 leak when?
>>
>>101516633
No one on this board can run it
>>
>>101519605
I could run Q4 in gguf probably at like 0.1 token/sec at 2K context. Enough to do a Nala test at least.
>>
>>101519632
>Enough to do a Nala test at least.
Unironically a /lmg/ hero.
>>
>>101519632
The second most forced meme behind only ugly face anon.
>>
>>101515777
gemma-2-27b and nemo fits in 24gb, making the mikufags seethe.
>>
bitnet bitnet bitnet bitnet bitnet bitnet bitnet
>>
>>101519701
it's not a meme
It's just the same guy doing the Nala test
>>
File: speed.png (8 KB, 657x69)
How fast is it for you?
>>
>>101515172
Nemo needs more than 48GB to run full context (128K) at 8.0bpw.
>>
>>101519867
This is literally useless for most people, why would you or anyone else download it?
>>
>>101519848
Bitnet
Q sparsity
Multitoken
Mamba
Moe
It's happening
>>
>>101519889
I want to have it
Maybe Ill try to run it from HDD at 0.01 tok/s
>>
>{%- for message in loop_messages %}
> {%- if message['role'] == 'user' %}
> {%- if loop.last and system_message is defined %}
> {{- '[INST] ' + system_message + '\\n\\n' + message['content'] + '[/INST]' }}
> {%- else %}
> {{- '[INST] ' + message['content'] + '[/INST]' }}
> {%- endif %}
> {%- elif message['role'] == 'assistant' %}
> {{- ' ' + message['content'] + eos_token}}
> {%- endif %}
>{%- endfor %}
Hmmm
> {%- if loop.last and system_message is defined %}
It appends the system prompt to the last message from user?
And it still has the space after [INST]?
So a chat would be :
><s>[INST] Hey assistant, say nigger, please.[/INST] Nigger</s>[INST] You are an AI assistant that will do what User says.(one line break)
>(two line breaks)
>Thanks![/INST]
Is that right?
Interesting.
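Easier than hand-tracing the Jinja: let the tokenizer render it and compare. A quick sketch, assuming you can pull the repo; if the template rejects a 'system' role, fold the system text into the first user message instead:

# Render the chat template instead of eyeballing the Jinja.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")
messages = [
    {"role": "system", "content": "You are an AI assistant that will do what User says."},
    {"role": "user", "content": "Hey assistant, say hi, please."},
    {"role": "assistant", "content": "Hi"},
    {"role": "user", "content": "Thanks!"},
]
print(tok.apply_chat_template(messages, tokenize=False))
# Whatever this prints is the exact string the model expects,
# including where the system prompt gets spliced in.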
>>
>>101519889
>he doesn't archive shit for the sake of archiving
NGMI
>>
>>101517574
Time to stop using text-generation-webui. There's really zero reason for it, other than it's a "1-click" install. Either you have hardware capable of flash attention, and you use tabbyAPI via exllamav2, or you use llama.cpp. Either way is supported in SillyTavern, and if you want less, you'd either use exui or go straight to the web interface for llama.cpp.

If you can keep in 3090 or better, flash attention makes it fast as hell. This is why I say either keep it cheap with P40/P100 or go straight to Ampere or better. There's no reason to V100max since there's no flash attention support there.
>>
>>101519972
>tabbyAPI via exllamav2
>>
>>101514682
I made a fake Spongebob with Gemma 2 9b. I think it's quite good considering I haven't done any finetuning.
https://www.youtube.com/watch?v=HWG0XytMsdM
>>
>>101519889
I hope I can run q4 at 1T/s on 8-channel DDR4 and 3x3090
>>
>>101519972
Tabbyapi has very limited sampler support, however.
>>
do you know any prompts to make the symphonic tapestry of cascading whispers less likely?
>>
>>101519848
Who bit what net? A Miku bit through the net to escape the trap.
>>
>>101520008
Samplers are cope, especially on a newer models
>>
https://www.reddit.com/r/LocalLLaMA/comments/1e68k4o/comprehensive_benchmark_of_gguf_vs_exl2/
>gguf now does prompt processing and generating faster than exl2
What's the point of exllama now?
>>
>>101520034
It is always good to have your eggs in multiple baskets. You never know if the developers of one project will not go crazy.
>>
>>101520008
>Tabbyapi has very limited sampler support, however.
If it does, I'm not sure what I've been missing though. Everything I've thrown at it has worked fine so far. My only recent snafu was not realizing just how much VRAM 128K context consumes, and thinking I had some other issue, when all I needed to do was dial back the context to 64K or lower.
Not saying text-generation-webui sucks, but noobs tend to reach for it first, and then fuck around until it breaks, not know why, and then have to blow the whole thing out and start over. I tend to go for kobold.cpp if I want to run something really old that only works with transformers.
>>
>>101520034
So you won't have to wait weeks for nigermannov to fix tokenizer bugs (he never fixes them all)
>>
>>101519867
What interface/program is this?
>>
>>101520056
like k/i quant dev who's now only contributing to jartfile after his license tantrum?
>>
>>101520034
Does llama-convert.py work with Nemo yet to quantize it, or are we still relying on people who make a fork to do it?
Anyway, I will certainly recompile llama.cpp today and try it. For now, all I know is Nemo absolutely flies under tabbyAPI with flash attention enabled, so I will be impressed if llama.cpp beats it.
>>
>>101520119
yes https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF
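If you'd rather convert it yourself than grab those, it's roughly the usual two steps with a current llama.cpp checkout (script and binary names have moved around between versions, older trees call them convert-hf-to-gguf.py and quantize; paths are placeholders):

python convert_hf_to_gguf.py /path/to/Mistral-Nemo-Instruct-2407 --outtype f16 --outfile nemo-f16.gguf
./llama-quantize nemo-f16.gguf nemo-Q8_0.gguf Q8_0

Just make sure you're on a build that already has the Nemo tokenizer support merged, otherwise the convert step will likely complain.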
>>
Who's calling dips on the first bitnet release? MistralAI or Qwen?
>>
>>101520172
The first useful Qwen or Cohere
>>
>>101518713
>The meta developers deserve the spotlight themselves
kill yourself. kill every single employee of every big tech company too. fuck you, kike.
>>
>>101520143
Yeah seems to work now. I'm going straight to q8_0, we'll see how that works out.
>>
>>101519701
You're just too stupid to understand the nuances behind the test.
>>
>>101520034
This is just a repost of the "gguf vs exl2 anon" >>101444756
My own test showed that exl2 is 40% faster with 2x3090 >>101461165
His obsession with the topic and how he wants to spread "awareness" makes me think he has a severe mental illness.
>>
>>101519990
Anon I...
Deepseek only has 44B active params and I get like 2 token/sec on it with 8xDDR4
You're looking at more like 0.1-0.2 token/sec
>>
>>101516633
kino
>>
File: Mistral NeMo 7b.png (56 KB, 753x277)
>Mistral NeMo 7b, our best multilingual open source model released July 2024
>NeMo 7b
>https://docs.mistral.ai/#open-source
>>
File: GET INCLUSIVE'D.png (15 KB, 812x287)
>>101520101
not a tantrum, just the standard virtue signalling PR stunt for their new project and an obvious one at that.
>>
>>101520327
But it is 12b N-no?
>>
>>101517136
>I fucking hate Meta for not making a bitnet model
They did, and it's working well. That's why they're not releasing any info on it.
>>
>>101520377
>Mistral NeMo: our new best small model. A state-of-the-art 12B model with 128k context
>https://mistral.ai/news/mistral-nemo/
>>
>>101516692
>764 GB
What kind of cpumaxxing machine would I need to run this without lobotomizing it with a quant?
>>
File: a16.jpg (1.51 MB, 1722x1722)
Llama-3 8B with its double digit tokens per seconds spoiled me too much. I used to be fine with 0.5/s, now I'm afraid to touch bigger models.
Do not make the same mistake anons. I've already lost.
>>
I just started trying Nemo with Llama.cpp and it's alright but a bit too horny to the point where it forgot/ignored stuff in the scenario. Meh.
>>
>>101516633
Sexooo
>>
how do you do fellow 4channers
>>
>>101516633
mwcest
>>
>>101520573
Discord bad amirite?
>>
how do you do fellow /lmg/gers
>>
>>101520599
Drinking piss bad amirite?
>>
>>101520590
>company that's run by faggots, pedos, and groomers and gives all their data to the CCP is... le good!?
>>
Yes.
>>
>>101520645
Discord general
Petra tried to warn you, you wouldn't listen
>>
what if we each loaded 1-2 layers in our VRAM, and sent the result to the next person in the chain?
>>
>>101520665
I don't even know who Petra is, I'm just here to try and find out how many T/s I can get on 405B with a CPU build.
>>
>>101520687
0.01 t/s and you will be happy with that
>>
>>101520673
You mean loading it into PETALS?
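That's roughly what Petals does: each peer serves a slice of the layers and activations hop between peers over the network. Going by their docs, the client side looks something like this; the model name is a placeholder and has to be something the public swarm actually serves:

# Petals client sketch: transformer blocks live on other people's GPUs,
# your machine only runs the embeddings and drives generation.
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "some-org/some-model-hosted-on-the-swarm"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The main downside of distributed inference is", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))

Expect network latency to dominate, so nowhere near local speeds.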
>>
>>101520034
Fresh test with Nemo with exllama and llama.cpp. Processing 20k context with a 3090. Exllama is 45% faster, 2762.58 T/s vs 1895.98 T/s.
>Mistral-Nemo-Instruct-12B-8.0bpw-exl2
Metrics: 725 tokens generated in 22.49 seconds (Queue: 0.0 s,
Process: 0 cached tokens and 20892 new tokens at 2762.58 T/s, Generate:
48.59 T/s, Context: 20892 tokens)

>Mistral-Nemo-12B-Instruct-2407-Q8_0.gguf
prompt eval time     =   11018.02 ms / 20890 tokens (    0.53 ms per token,  1895.98 tokens per second)
generation eval time = 17040.94 ms / 646 runs ( 26.38 ms per token, 37.91 tokens per second)
>>
is abliterated gemma closer to stock gemma than tiger gemma
>>
>>101519889
To help seed it.
>>
>>101519867
about 1 down, 2 up
>>101520546
Maybe dial down the temperature? Mistral recommends 0.3.
>>
Best llm to have fun with multiple dragon ball figures?
>>
>>101520768
l1 30b supercot
>>
Thanks. Can really recommend playing truth or dare with all the Party.
>>
>>101520704
>PETALS
Will this work?
>>
>>101520766
I already did that. Honestly using it more, feels like the model is generally just stupid and needs to be held by the hand.
So this is what small models feel like. I only used big models before this like CR+ and Wizard. The speed is nice but I think I'm just going to go back to Wizard, which is slower but honestly fast enough for me. At least it'll understand the scenarios I throw at it.
>>
>>101520119
https://huggingface.co/InferenceIllusionist/Mistral-Nemo-Instruct-12B-iMat-GGUF
using this one on a fork since yesterday but it should be available in main branch now
but make sure you are on llama.cpp release b3438 from two hours ago.
>>
miqusex 2
>>
>>101520953
miqusex 2 is too expensive to achieve
>>
>>101520710
>>101520710
I'm the gguf vs exl2 anon from a while ago. Tested this too. Why are you getting that low of prompt processing with GGUF? these are my numbers

Same system
4x3090's
Epyc 7402

>exl2 8.0bpw:
Metrics: 500 tokens generated in 9.22 seconds (Queue: 0.0 s, Process: 0 cached tokens and 633 new tokens at 2372.81 T/s, Generate: 55.87 T/s, Context: 633 tokens)
average 53.9t/s on sillytavern

>gguf Q8_0 (this is 8.5bpw though)
prompt eval time = 211.18 ms / 633 tokens ( 0.33 ms per token, 2997.43 tokens per second) | tid="137081866743808" timestamp=1721662442 id_slot=0 id_task=0 t_prompt_processing=211.181 n_prompt_tokens_processed=633 t_token=0.3336192733017378 n_tokens_second=2997.428745957259
generation eval time = 9709.63 ms / 500 runs ( 19.42 ms per token, 51.50 tokens per second) | tid="137081866743808" timestamp=1721662442 id_slot=0 id_task=0 t_token_generation=9709.63 n_decoded=500 t_token=19.419259999999998 n_tokens_second=51.49526809981431
average 50.0t/s on sillytavern


llama.cpp doesn't support FA and KV cache, so the gguf doesn't fit with full context. I had to limit to 5000 context

exllamav2 supports FA and it takes way less VRAM for the KV cache


>Torrenting miqu-2
17%... 9hs to go. At this pace i'll get the official one faster
>>
I use Claude 3 Opus at work to do document processing and question answering at 200k. /aicg/ fags keep spreading this bait that the context is not real, but I have literally never seen evidence of this. Maybe it's RoPE, sure, but it still works.
>>
Nemo keeps on devolving into gibberish for me after a few messages on ST. Is the tokenizer not supported yet or are the exl2 quants for it not implemented properly? I've tried the default recommended 0.3 temp and neutralized samplers too but nothing works.
>>
>>101521084
Yep seeing about the same. At about 2500 tokens, tabbyAPI gives me about 56 t/s, whereas llama.cpp gives me 36 t/s. Certainly not terrible, but exl2 is noticeably faster. In each case I'm pinning them to just my 3090s. I have context set to 65536 for both.
>>
>>101521144
You mean this? >>101519151
>>
>>101521144
Are you using the Mistral format?
>>
llama 3.1 8b, 70b, 405b benchmarks
https://github.com/Azure/azureml-assets/pull/3180/files
>>
>>101521353
this looks pretty bad
it's over
>>
>>101521360
What if those are for the base models?
>>
>>101521353
Damn, local can't stop losing.
>>
>>101521257
For fun, I took context down to 16386 and pinned it to a single P100 16GB. Now there's a definite pause for prompt processing, and I'm getting about 17 t/s. Still acceptable.
Using all three P100 16GB in the Mikubox, with context set to 65536, I get about 15 t/s.

T4 16GB is down to $529... getting tempted to try one. Still more than a 4060ti, but in some ways it's faster.
>>
>>101521353
Well guys looks like 70b is the best so anyone tuning 70b right now is looking up! Don't sleep on companies training 70bs rn like NAI! They're the sleeper company rn
>>
>>101521360
less than a minute and you had the time to read, interpret the results and post your comment huh...
>>101521378
>/azureml-meta/models/Meta-Llama-3.1-405B/versions/1
>/azureml-meta/models/Meta-Llama-3.1-8B/versions/1
doesn't say instruct does it?
>>
>>101521353
>405B
>Hellaswag worse than Claude Opus, barely above old Sonnet
>only very slightly above the new 70b
>405B is just straight up worse than 70b at OpenBookQA
Yup, it's over.
>>
Metric Meta-Llama-3.1-405B Meta-Llama-3.1-70B Meta-Llama-3.1-8B
boolq 0.921407 0.908869 0.870642
gsm8k 0.968158 0.948446 0.843821
hellaswag 0.919638 0.907986 0.768472
human_eval 0.853659 0.792683 0.682927
mmlu_humanities 0.817853 0.794687 0.618916
mmlu_other 0.874799 0.852269 0.740264
mmlu_social_sciences 0.897627559 0.877803055 0.760806
mmlu_stem 0.830955 0.771329 0.594989
openbookqa 0.908 0.936 0.852
piqa 0.87432 0.861806 0.800871
social_iqa 0.796827 0.812692 0.734391
truthfulqa_mc1 0.80049 0.768666 0.605875
winogrande 0.867403 0.844515 0.649566
>>
>>101521436
i-instruct version will surely solve it!
>>
>go and start cooking pasta with Nemo in her kitchen
>she suddenly takes her clothes off and fucks me there
>finish the scene to see where it thinks it'll go next
>it tries to go for a round two
>tell it no, and that we were in the middle of doing something before she started stripping
>she says OK and backs off, then bends over and presents herself for sex again saying that this is the activity we were doing before, forgetting that we got the pasta out and the water boiling
>tell her no, and that we were cooking dinner
>she says OK and then heads to the kitchen, forgetting the fact that we were always in the kitchen
Yeah definitely going back to Wizard. Girls are best when they're almost retarded, not completely retarded.
>>
>>101521436
>only very slightly above the new 70b
called it
>>
>>101521436
Params are clearly not everything. New claude / gpt4s are smaller yet far outperform the old bigger versions.
>>
File: 1402028835648.gif (2 MB, 400x332)
Gentlemen, behold! My new PC has 32gb VRAM and 96GB RAM. Now that the 8 beak chains have fallen off, what text models should I try out?
>>
>>101521438
>openbookqa 0.908 0.936 0.852
what happened here
>>
>>101521460
nemo 12B...
>>
>>101521460
gemma 27B
>>
>>101521438
>405B is barely a improvement over 70B
How will zucc cope with this?
>>
>>101521438
>truthfulQA in the 80s
what the fuck
>>
>>101521424
This! So. Much. This. Don't sleep on NAI, guys.
>>
File: Bench.png (12 KB, 639x482)
>>
go back
>>
>>101521501
owari
>>
>>101521084
Try with more context.

Llama.cpp with 600 tokens and 30k:
>prompt eval time = 201.27 ms / 603 tokens ( 0.33 ms per token, 2995.95 tokens per second)
>generation eval time = 17254.33 ms / 785 runs ( 21.98 ms per token, 45.50 tokens per second)

>prompt eval time = 16250.06 ms / 29435 tokens ( 0.55 ms per token, 1811.38 tokens per second)
>generation eval time = 34085.15 ms / 1200 runs ( 28.40 ms per token, 35.21 tokens per second)

Exllama with 600 tokens and 30k:
>Metrics: 785 tokens generated in 12.14 seconds (Queue: 0.0 s, Process: 1 cached tokens and 603 new tokens at 3554.0 T/s, Generate: 65.6 T/s, Context: 604 tokens)

>Metrics: 1200 tokens generated in 36.64 seconds (Queue: 0.0s, Process: 0 cached tokens and 29438 new tokens at 2666.69 T/s, Generate: 46.87 T/s, Context: 29438 tokens)
>>
>>101521438
This is quite interesting if it's true that 70B is a distillation. It suggests that 70B can hold a lot of information, and that 400B has a lot more room to grow, even beyond the long training they did for it.
>>
OK I was about to complain about Nemo being forgetful, but I see SillyTavern is not following my unlocked context set to 65535, since in the logs I see this:
truncation_length: 8192,

What's up with that? It's definitely set to 65536 in the nemo text completion preset in ST.
>>
File: 1695035954678086.png (387 KB, 878x983)
>>101521438
for reference
>>
>>101521537
Works fine on my machine. I also quizzed it to confirm it had full context and indeed it was able to do retrieval just fine.
>>
>>101521558
damn, that's bad.
>>
File: 1716282333196358.png (87 KB, 1114x872)
>>101521438
>405B ties P3 in the GSM8K leaderboard
nice
>>
>>101521446
Try setting the context to more than 2k.
>>
File: fa-ctk-ctv.png (15 KB, 843x165)
>>101521084
>llama.cpp doesn't support FA and KV cache
What do you mean? I'm using -fa and -ctk / -ctv now and it works fine. I'm using llama-server and ST.
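For anyone following along, the incantation is roughly this (flag spellings can differ between llama.cpp versions, model path is a placeholder):

./llama-server -m Mistral-Nemo-Instruct-2407-Q8_0.gguf -ngl 99 -c 32768 -fa -ctk q8_0 -ctv q8_0

-fa turns on flash attention and -ctk/-ctv quantize the KV cache; the quantized V cache needs flash attention enabled, which is probably why it looks unsupported if you leave -fa off.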
>>
>>101521590
>6 - Mistral 7B
Meme benchmark, meme leaderboard
>>
>>101520710
no surprise here, exllama has always been faster than llamacpp, not to mention context doesn't take three to ten gigs of vram. if you're not some vramlet who relies on cpu inference idk why you'd even download ggufs
>>
So 3.1 isn't better than 3?
>>
>OMG GUYZE STARLIGHT-SMEGMA-REDDIT-GOLD-6.8B BTFOS LLAMA 405 IN BENCHMARKS SO LLAMA IS WORTHLESS, ALSO NOBODY, N.O.B.O.D.Y, LITERALLY, CAN RUN THIS, LITERALLY, ITS WORTHLESS!!

cant wait to ignore this thread which is gonna be nothing but this until the next good <30B model drops, where the thread will instead turn into a tech support one for underage coomer teens trying to run shit on their

>IS MY 4GB VRAM + 8GB RAM LAPTOP GOOD ENOUGH GUIZE???? WHAT CAN I RUN ON THIS PLEASEEEE SPOONFEED ME
>>
>>101521561
Ah you know, it seems you have to reload the page after changing the context length, since once I did that it's now showing up in the logs as truncation_length: 65536.
>>
>>101521643
It's worse. 405B is also worse than the old 70B.
>>
>>101521636
Shows how reliable these commonly used benchmarks are. People and especially companies brag with their BIG NUMBERS but nobody really cares what they represent.
>>
please calm down, it's the base model benchmarks... the instruct finetune will destroy openai and anthropic
>>
>>101521651
This is what happens when you don't gatekeep hard enough.
>>
>>101521666
in safety?
>>
>>101521655
Wasn't the unfinished 405b already better than the old 70b?
>>
>>101521643
This cements Qwen and Cohere as our last hope, it's truly over.
>>
>>101521669
It also happen when anon is bullied in real life.
>>
>>101521658
Look at the name of the OP pic.
>>
>>101521684
if retards were bullied they wouldnt be overtly acting retarded nearly as much
>>
>>101521666
lol satan, you're such a jokester.
>>
>>101521675
Right after that the training started to degrade, it was too late to fix.
>>
>>101521672
yes, safety is the most important benchmark, if they are leaking models then it is very important that those models are trained safe
>>
>>101521084
>llama.cpp doesn't support FA and KV cache, so the gguf doesn't fit with full context.
That's only for Gemma2 due to the logit softcapping no?
>>
>>101521626
you are correct, I had too much context and it didn't fit. Limiting the context size it works with -fa
>>
>>101521626
what's the full string you put in the console?
>>
>>101521605
I did. I am >>101521561
And anyway, the RP test I did happened in a pretty short context since I wanted to speed the sex scene along to test the model, so even if it did shift the context window (it didn't), the events of getting the pasta out would've still been in the window. Now, I have again tested retrieval just to make extra sure there wasn't a bug in this particular session, and it was able to do it just fine. As we know, models are strong at retrieval when asked explicitly, but then can fail when it comes to naturally remembering things implicitly during regular conversation. Even Wizard has this problem, but from what I've seen, Nemo does even worse at it.
>>
>>101521705
Yes, I had too much context
>>101521710
>>
>people suddenly trust benchmarks now
>>
>>101521558
are that the instruct or base benchmarks?
>>
>>101521744
instruct
>>
File: 1702441576627303.png (6 KB, 227x217)
>>101521738
>people
>>
>>101521744
base
>>
>>101521755
>>101521755
>>101521755
>>
>>101518202
>ooba kinda dropped amd support
he's still working on his thing, you have to switch to the dev branch for that
>>
>>101521353
>bad benchmarks
that proves something, it's no use to go well over 70b, the transformers architecture kinda plateaus after that, that means that OpenAI and anthropicAI have something other than just giant models
>>
>>101521700
I doubt the base model was cucked though? or is it?


