/g/ - Technology

File: miku bread.jpg (270 KB, 1024x1024)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>100154945 & >>100166886

►News
>(04/24) Snowflake Arctic Instruct 128x3B MoE released: https://hf.co/Snowflake/snowflake-arctic-instruct
>(04/23) Phi-3 Mini model released: https://hf.co/microsoft/Phi-3-mini-128k-instruct-onnx
>(04/21) Llama3 70B pruned to 42B parameters: https://hf.co/chargoddard/llama3-42b-v0
>(04/18) Llama3 8B, 70B pretrained and instruction-tuned models released: https://llama.meta.com/llama3/
>(04/17) Mixtral-8x22B-Instruct-v0.1 released: https://mistral.ai/news/mixtral-8x22b/

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling/index.xhtml

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
File: 1691041725883639.png (359 KB, 512x512)
what are the requirements for using a local model together with an LLM?
i have 64GB RAM and 16GB VRAM on an AMD system. i normally use koboldcpp for llms and comfy for SD stuff.
>>
It's over
>>
>>100173514
>Previous threads: >>100154945 & >>100166886
>>
>>100173514
>>100173573
>>100173584
>>100173590
good morning sir!
>>
>>100173573
The absolute state of /lmg/
>>
>>100173573
>an internet connection
>the ability to read
>a lot of time
I think that about sums it up
>>
>>100173573
depends on what size you're willing to run. you should be able to run an 8b - 11b model and have enough space for sd as well, probably.
>>
>people still recommending Mythomax and fucking CR+ to newbie VRAMlets
Why? Is this some form of gatekeeping I'm too deep in to understand?
>>
File: mizuasobi.png (1.2 MB, 1304x744)
►Recent Highlights from the Previous Thread: >>100166886

--Enabling Local Language Models to Access External Sources: >>100170746 >>100170878 >>100170905 >>100170924 >>100170942 >>100170947 >>100171324 >>100171202
--Optimizing LLMs for Reasoning: Phi's Limitations and Future Directions: >>100167878 >>100167897
--Anon's Experience with Llama 3 70b Instruct: Shortening Responses Near Context Limit: >>100167911 >>100169713 >>100169792 >>100170112 >>100170062 >>100170270
--Noticeable Quality Drop with Quantization in Llama 3 Models: >>100169493 >>100169506 >>100169525 >>100169914
--Are Lengthy Multi-Rule Prompts Killing Model Creativity?: >>100167192
--Anon's Llama Model Performance Benchmarks: >>100167274 >>100167298 >>100167910 >>100167941 >>100168292
--The Utility of Large Language Models: Beyond Fiction Generation: >>100167521 >>100167544 >>100167555
--Can LLMs Generate PDBs from Decompiled Programs?: >>100167690 >>100167736 >>100167871 >>100170388 >>100170564 >>100170589 >>100170630 >>100170953
--Anon's Take on Meta Stock Drop: Faith, Hope, and Market Volatility: >>100168186 >>100168191 >>100168690 >>100168736 >>100168749 >>100168765 >>100168789
--Best Model for ERP and Productivity Tasks?: >>100171747 >>100172054 >>100172096 >>100172361 >>100172407 >>100172423
--Snapdragon X Plus: Promising AI Performance or Overhyped?: >>100168557 >>100168605 >>100168624 >>100168651 >>100168671 >>100168774
--Llama-3-Instruct Model Discussion: Censorship, Prompt Structure, and Role-Playing: >>100167135 >>100167187 >>100167229 >>100167265 >>100167298 >>100167307 >>100167350 >>100167575 >>100167610 >>100167631
--Understanding the Difference Between Uncensored Models and Psycho Models: >>100167678 >>100167724 >>100167813 >>100167880 >>100170302
--Integrating Comfyui with Stable Diffusion 3: >>100168344 >>100168378 >>100168420 >>100168685
--Miku (free space): >>100168445 >>100166912 >>100170598 >>100171118 >>100173294

►Recent Highlight Posts from the Previous Thread: >>100166891
>>
>>100173573
>how do I use an LLM with an LLM
anon...
>>
File: 1704467287466611.png (444 KB, 512x512)
>>100173701

>>100173573
>what are the requirements for using a local model together with an LLM?
whoops. i actually meant image gen.
what i want is to run llm with SD in something like ST. how well does that work?

pls no bully
>>
Anyone try this yet? https://huggingface.co/TheDrummer/Moistral-11B-v3
>>
>>100173717
Wait for true multimodal LLaMa 3, producing perfect Miku images and RP.
>>
>>100173727
I normally love downloading random slop meme models but that name is stupid as fuck so no
>>
>>100173745
https://old.reddit.com/r/LocalLLaMA/comments/1cc6xb1/moistral_11b_v3_the_finetuned_moist_just_got/
Reddit seems to love it.
>Cream-Phi-2
kek
>>
https://huggingface.co/TheBloke/platypus-yi-34b-GGUF

This model, of all things, performs the best at ooba's secret benchmark.
>>
>>100171184
I grabbed the L3 8B 64k context model and tried it with a close to 16k token chat I have.
It wasn't coherent, so either the claimed 64k context is bs or there might be something wrong with the q8 gguf.
I want to rule out user error at least.
Has anyone else tried it yet?
>>
>koboldcpp rocm updated again
we still hanging in there AMD bros
>>
>>100173826
Guy probably just edited the config and called it a day.
>>
File: ITS HAPPENING.gif (826 KB, 320x213)
>>100173829
>ITS REAL
aAAaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

>There was some big changes upstream, that's why it's taken a while to update kcpp-rocm, trying to get it to work.

YELLOWROSE I LOVE YOU AND WHAT YOU DO FOR AMDBROS
>>
>>100173858
Yeah probably.
I don't know if extending the context is actually possible.
I'd assume you'd have to retrain the model from the ground up.
>>
>>100173914
Nah, you can do large context tuning
It does need to be a full finetune though, I doubt a LoRA could handle it
>>
Tf is the snowflake arctic thing? How much ram?
>>
File: 19215413059.png (8 KB, 581x104)
>>100173869
Nvm its busted with models and settings that work on 1.62 :[
>>
File: 1969783154.jpg (52 KB, 527x177)
>>100174028
Ummmmm YellowRose???
>>
>>100173829
Why don't you just use linux fucking retard
>>
> keeping the dream alive
>>
>>100173752
localllama is extremely clueless so that doesn't mean anything
most of them probably upvoted it because le funny name without trying it
>>
>>100173938
Not that anon, but I could swear that SuperHOT LoRA was a thing.
I guess I'm getting it mixed up with SuperCOT.
>>
>>100174329
superhot lora was a thing and while it mostly worked a full ft is obviously better
>>
>>100173514
>https://huggingface.co/chargoddard/llama3-42b-v0

So this has 76 MMLU which is really interesting. Has anyone here tested it? How does it compare to 70B/8B? Is it improved over 8B or is it retarded?
>>
>>100174462
Everyone who tested it called it irreparably retarded.
>>
>>100174470
It is not retarded. It is schizophrenic. It has a beautiful mind but can't communicate its thoughts very well. Honestly everyone ITT should love it because it is so relatable.
>>
File: quant.png (51 KB, 969x507)
currently making a few exl2 quants for Moistral v3. 8bpw and 5.5bpw for 8gb vramlets
>>
I have a macbook air m2 with 8 gb ram laying around because of work. Is there any worthwhile llm I could run on it?
>>
>>100174470
Interesting, they are working on doing the same to the instruct model, let's see if the results change. Time to try frankenmerges for now.
https://huggingface.co/raincandy-u/Llama-3-Aplite-Instruct-4x8B-MoE
>>
>>100174549
Quanted mistral 7B or llama 3 8b, I guess.
>>
>>100174549
hahahahaha, no
>>
>>100174549
that 8gb needs to be shared with the rest of the OS, so you're looking at like 4-6 for the model
you could run quanted llama 3 8b at best
>>
>>100173717
I think you will want to reserve how ever much space the SD model takes up, and then only load the LLM layers that will fit with your desired context.
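A rough way to budget it (a sketch; the per-layer size and SD footprint numbers below are placeholder guesses, not measurements, so swap in your own):

# Rough VRAM split for keeping SD resident while offloading LLM layers (koboldcpp --gpulayers style).
# Every number here is an illustrative assumption; measure your own models.
def layers_that_fit(vram_gb, sd_model_gb, kv_cache_gb, n_layers, weights_gb):
    per_layer_gb = weights_gb / n_layers              # crude per-layer estimate
    free_gb = vram_gb - sd_model_gb - kv_cache_gb     # what's left for LLM layers
    return max(0, min(n_layers, int(free_gb / per_layer_gb)))

# e.g. 16GB card, ~4GB SDXL checkpoint kept loaded, ~1GB KV cache,
# an 8B quant of ~5GB spread over ~33 offloadable layers
print(layers_that_fit(16, 4.0, 1.0, 33, 5.0))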
>>
>>100174549
https://huggingface.co/apple/OpenELM
>>
>>100174567
>they
It is a guy.
>>
>>100173914
>I don't know if extending the context is actually possible.
feels like 2023 all over again
>>
>>100174110
Now explain what it means in non-wikipedia faggotry terms
>>
>>100174662
I don't see their pronouns listed anywhere.
>>
any decent phi3 finetunes yet?
>>
>>100174110
So why hasn't anyone done llama.cpp bitnet yet? Is it because everyone is lazy or because the existing bitnet models use row-wise scaling factors which llama.cpp doesn't support at all?
>>
>>100174662
>>100174691
What if it's a woman? You know, not a troon, but a real vagina.
>>
>>100174110
Just remind the companies that they can release their bitnet models without the fp16 weights likely making it a huge ordeal to finetune them.
>>
>>100174732
It is a guy.
>>100174691
It is a guy.
>>
File: 197943296573298.png (121 KB, 463x576)
>>100174567
>https://huggingface.co/raincandy-u/Llama-3-Aplite-Instruct-4x8B-MoE
>SOMEBODY ACTUALLY MADE A 4x8B
>Q6 is only 20gb
WE ARE SO FVCKING BACK
WE HAVE NEVER BEEN THIS BACK BEFORE
I DONT EVEN CARE IF ITS SLOP
>>
>>100174747
Why are zoomers like this?
>>
>>100174747
Well post your logs from it
>>
>>100174747
When we get tunes like NousHermes, wizardLM, etc... the frankenmerges will be really good.
>>
>>100174758
Download speeds are bad in america for no reason
>>
C-R+ user here. I tried Llama 3 70B instruct and it was slop. I tried Llama 3 70B base and it was schizophrenic.
What's the deal with the people saying it's good? Is there a magic prompt? You can't even preload context because it only has 8k max.
>>
File: porky.png (325 KB, 576x566)
>>100174780
>for no reason
>>
>>100174780
>for no reason
Oh, there are reasons.
http://irregulators.org/bookofbrokenpromises/
The numbers on that one are slightly inflated IIRC, but the general idea is correct.
>>
>>100174820
>>100174848
I already know the reason, it's you redditors who don't. it's not 2014 anymore
>>
>>100174797
Llama 3 seems to be sensitive to formatting and templates, if you are using the wrong ones you get schizo, also make sure to pull the latest frontends as they all had bugs early on.
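For reference, this is the prompt format Llama 3 Instruct was trained on; if your frontend isn't emitting exactly these tokens, that's the usual cause of schizo output. A minimal Python sketch:

# Llama 3 Instruct special tokens; generation should stop on <|eot_id|>.
def llama3_prompt(system, user):
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )

print(llama3_prompt("You are a helpful assistant.", "Hello"))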
>>
>>100174747
>moe frankenmerges
Isn't this basically like merging slop, except you don't do the final step (where you calculate the average of all changes and add it into the base model)? Instead you leave all those slop tunes in there so they eat up all the ram and then ask the client to average them out. So you just 4x the required ram for absolutely no reason except retards will buy it?
>>
File: sfa-q8-test-1.png (41 KB, 835x976)
>>100174018
Its a 476.27B mixture of experts model with 128 experts (2 active).
The main download is 1TB. Q8 is 472GB
It's claiming 4096 context, which is disappointing if true, to say the least.
I've managed to quant it down to Q8 with --skip-unknown and am trying to run it after making a few llama.cpp code tweaks to go beyond 60 experts. It has reserved 486GB of RAM to load at that size.
It's currently outputting tokens for me, but there's some kind of fundamental problem because they appear to be half nonsense.
>弘 Hello saf Season Secretary opportun duties season winter Flora</s></s></s>
>>
>>100174889
Nobody cares, dude.
>>
>>100174912
It's not based on llama. Are you sure llama.cpp has added support for it? It's only been a day, I'm surprised it converted and ran without errors.
>>
>>100174912
I have 512GB of ram which could fit Q8. Currently downloading to quant too. Why the --skip-unknown?
>>
>CAPTAIN'S LOG 425
llama3 has been out for several weeks and mythomax3 still hasn't been made. neither have any good finetunes like a holodeck or nous hermes. no news on a possible bitnet 70b either. all the hype gone. all the locals have turned to sonnet and opus. halfway through '24 and not a single decent 40b in sight for regular 24gb vram folk that only have a single card.
>>
File: file.png (44 KB, 611x334)
Do you think we'll ever get that 70B model? And how neutered will it be?
>>
>>100174957
>Are you sure llama.cpp has added support for it
I'm almost certain they haven't.
I'm also shocked it works at all
>>
>>100174973
It's over. Microsoft put the axe to them.
>>
>>100174973
either we get a new 70b trained on llama 3 or nothing
>>
>>100174889
If some models are better than others at a particular task the output should be weighted toward the better ones (useful for including formatting code, etc. in responses). The idea is that you trade VRAM for more parameters without increasing compute requirements.

It's a terrible trade-off for local inference, though.
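For anyone curious what "weighted toward the better ones" means mechanically, here's a toy top-k router in numpy (the "experts" are stand-in linear maps, nothing from mergekit):

import numpy as np

def moe_forward(x, router_w, experts, k=2):
    # Router scores each expert for this token, softmaxes over the top-k,
    # and the output is the weighted sum of only those experts' outputs.
    logits = x @ router_w                     # (n_experts,)
    top = np.argsort(logits)[-k:]             # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

# toy example: 4 "experts" that are just different random linear maps
rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(n_experts)]
router_w = rng.normal(size=(d, n_experts))
print(moe_forward(rng.normal(size=d), router_w, experts))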
>>
>>100174960
>only have a single card.
If you didnt come into this hobby with at least 1 good card and didnt get another one its basically joever
>>
>>100174973
We'll get it after llama2-34b finishes red teaming
>>
>>100174986
>If some models are better than others at a particular task the output should be weighted toward the better ones
But that requires training a gate layer that decides where the input goes. Is it this kind of frankenmerge?
>>
>>100174766
OpenHermes/NousHermes is a meme I'll never understand. It contains some good datasets (OpenOrca, Capybara, Airoboros-the good part, Wizard70k) and shit datasets (CamelAi slop, glaive code, alpaca-gpt4, Airoboros-the shit part). The overall result is a mess that can't follow instructions well, is overly verbose and ignores system prompts, yet people praise it like it's the best tune ever.
>>
>>100174976
https://github.com/ggerganov/llama.cpp/issues/6877
>>
>>100174988
There is zero reason to buy 2 cards just to run llms. 2 cards do nothing for gaming, for ai art or music. The only reason for a second card is so you can run unoptimized language models that are inferior to free cloud based ones.
No thanks anon. I'm happy with my 4090, when something finally fits on that we'll be cool. I'm not going to be one of those retards trying to hang 6 cards in open air so I can run a 8x22b still dumber than sonnet which is free.
>>
>>100174998
I don't know enough about MergeKit internals to know what it uses for the base router here. I was assuming a fine-tuned MoE, but you're right that this probably isn't fine-tuned.
>>
>>100175032
thats crazy man, but Im here to run my AI locally.
>>
>>100175032
based, if nvidia still had support for SLI, that would be great
>>
>>100174973
I'm considering buying another 3090 for the fuckhuge version.
>>
>>100175032
>buying 2 cards is unthinkable for him
https://www.reddit.com/r/LocalLLaMA/comments/1c9l181/10x3090_rig_romed82tepyc_7502p_finally_complete/
lol lmao
>>
File: file.png (56 KB, 801x414)
>>100175032
that's some strong copium there buddy
>>
>>100175082
>Q: How is the performance? A: To continue the spirit of transparency, I'll load one of the slower/VRAM hogging models. Llama-3 70B in full precision. It takes up about 155GB of VRAM which I've spread across all ten cards intentionally. With this, I'm getting between 3-4 tokens per second depending on how high of context. A little over 4.5 t/s for small context, about 3/s for 15k context.
>he spent 13k for this
>>
What if when 405B releases, it finally beats GPT4.
But then OAI releases GPT4V and makes GPT4 free.
Would the $10k 10 card 100GB VRAM setups have been worth it?
>>
>>100174096
What's the current AMD Linux meta? I was trying to get exllama running last year, and after getting my rocm installation set up and my torch environment finagled correctly, it ran like absolute shit for larger models/context because flash attention 2 still has no rocm support for consumer hardware. I've checked back every now and again to see if there are any updates, but I've mostly just been using koboldcpp-rocm as well because it's been the easiest to dial in the right tradeoff between speed and model quality with my graphics card and cpu offloading.
>>
>>100175143
I'm gonna share a secret with you anon, gpt4 is already free if you pirate it. Logless, trackerless, and it works on your fucking phone. The only cope here are the retards who fell for the vram bait.
>>
>>100175127
>>he spent 13k for this
he admitted in the comments to being an old boomer, got to spend the grandkids money so they don't inherit anything before kicking the bucket don't ya know.
>>
>>100174096
because im not a tranny
>>
>>100175143
I don't think people with 4x4090 care if they beat SOTA models, they care about freedom, custom finetunes can beat SOTA at specific tasks
>>
>>100175187
Where are those custom finetunes at anon?
>>
>>100174960
sonnet and opus have too many claudisms and put the character card above the context, so you can be talking to someone for three hours and make no progress.
>>
>>100174797
CR+ doesn't follow instructions well, at least with the quants that fit in 48GB of VRAM. And the 70B-instruct is pretty good at following instructions and learning from the context. For that reason I prefer to use L3, CR+ is not very usable in that state.
You need to modify the default 'assistant' role of the chat template to remove the censorship.
I think Rope scale/alpha works well to scale the context.
If you're a /aids/-tier promptlet, you might want to stick to CR+.
>>
File: file.png (99 KB, 512x288)
>>100175143
All micunny rp is free until FBI asks to open up.
>>
>>100175198
they don't exist yet, but you got the message
>>
>>100175199
>put the chrachter card above the context
you can fix this with author notes anon or a dozen other ways like appending it to the jailbreak
>>
Llama-3-8B-Instruct-32k is pretty good at staying in character, but it isn't "intelligent" enough to do that and output a UI at the same time.
Not bad.
>>
>>100175171
Spending that much and not even knowing how to use it is painful to watch.
>>
>>100175211
>He bought 4x4090s for custom finetunes that will come in two more weeks
>>
>>100175032
they want to gen loli porn
it's really as simple as that
>>
>>100175162
Flash attention doesn't matter that much for our LLM usage, it mostly matters when using batching. Hell, flash attention is not required or recommended by default with exllama. If you had trouble running a model, that was not the cause; for LLMs, AMD has better speed per dollar.
But anyway, I'm also disappointed with how slowly they are working on flash attention. For image gen it significantly reduces vram usage; you can get it working with some hacks on RDNA3 but official support is still supposedly in the works.
I just use llama.cpp since models have gotten so big, but now that we are back with a good 8b, I might go back to running exclusively on GPU with exllama.
>>
>>100175220
I'm gonna tell it to do something and pray? I'm sure that'll work and it won't ignore me 5 messages later.
>>
>>100175143
Still not sending you my logs, Sammy boy.
>>100173826
>>100173858
I tried it after quanting it down to 8bpw in exl2. Works fine up to around 22k context or so with RoPE alpha @ 5, then shits the bed.
>>
>>100175169
I really, really hope this is not one of those snarky copebrag posts where you say "it's possible, i just won't tell you how!" with a smugsoyjak face on you. Or it could be plain bait.
With that aside, how do you "pirate" a model with trillions of parameters and run it on your phone? Please enlighten us, and spare us the usual "mmhmh not le telling you bro" reddit shit. inb4 asking a discord troon for a proxy key or scraping git repos
>>
>>100174960
>all the locals have turned to sonnet and opus
Not locals. Those are the tourists and fake ass shitposters. Don't go full retard.
>>
>>100175238
I wouldn't buy 4x4090 for that, unless I begin to work with LLMs, but 2 used 3090 to run 70b Q4 is not a bad idea
>>
>>100175301
I account for some of the claude posts because I wanted to see how green the grass was on the other side.

It's purple. Which isn't bad, just different.

There are parts that I want to bring back to llm that claude did better, but it's not worth switching over. It's worth dyeing my own grass for though.
>>
>>100175277
sounds like a serious skill issue to me anon, I literally just spoonfed you how to fix your problem. if anything claude has better recall than any local model, it can reach 200k context. what can our shitty models do? barely 32k, and they forget shit in the middle so it's basically 8k in front and 8k in the back. what you're talking about is a non-problem for anyone competent.
>>
any cr+ tunes?
>>
>>100175356
My problem isn't with the memory. My problem is with claude having a mind of its own. You can tell it things, you can change the jailbreak, but 5 messages later claude goes "No, I like these tokens better. Cry about it, what you gonna do? Up my repetition penalty?"
>>
>>100175347
It's ok to demo them for the purposes of checking out the enemy. That doesn't mean you've "turned" to them.
>>
>>100175169
>Logless, trackerless
Sounds like some scraped key and server for a big enough company that runs it on their own server instead of through OpenAI, and obviously doesn't expect someone to have hacked them. But logless and trackerless from OpenAI? That is by definition impossible.
>>
>>100175127
>he spent 13k for a talking AI text computer comparable with SOTA that corpos spend billions on
>>
>>100175300
you answered it yourself, you pirate the key, if you're too stupid to do this then you connect to a proxyhost of someone else who has, it's not copebrag but I'm not going to spoonfeed you shit you can take time and learn yourself. you know what happens when retards do that? other idiots come along, brag, flaunt, then it gets fixed and I have to figure out a new way to do this shit. no thanks.
>>
>>100175400
skill issue confirmed
>>
>>100175423
>aicg not sending their finest
>>
>>100175423
>you know what happens when retards do that? other idiots come along, brag
Well why did you say that out loud then, if you don't want more people to ruin it for you? Your best course of action is to shut the fuck up about it and gatekeep it yourself.
Maybe you do want to brag after all.
>>
>>100175474
>idiots coping they bought 4 fucking 4090s when shits free
why didn't you just steal them anon?
>>
I already use GPT-4 for my job. I'm still running local when not on the job. It was never a cost issue. I don't care if you're Sam, /aicg/ or whatever, you're a smelly ass shitposter, fuck off.
>>
Has anyone else been playing around with using something akin to 'Thoughts'/'Plan'/... headers for the response? I think that this could generally result in repetition if they are not unique. So I was thinking of filtering it out of the context afterwards to avoid repetition.
First tests seem promising to me.
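Roughly what the filtering step looks like (a sketch; the 'Thoughts:'/'Plan:'/'Response:' header names are just whatever you told the model to use):

import re

# Drop everything between a "Thoughts:"/"Plan:" header and the "Response:" header
# before the message is appended to chat history, so the hidden planning text
# never accumulates in context and can't get pattern-matched into repetition.
PLAN_RE = re.compile(r"^(?:Thoughts|Plan):.*?(?=^Response:)", re.S | re.M)

def strip_plan(reply: str) -> str:
    cleaned = PLAN_RE.sub("", reply)
    return cleaned.replace("Response:", "", 1).strip()

raw = "Thoughts: she should act annoyed.\nPlan: short reply.\nResponse: \"Hmph. Fine.\""
print(strip_plan(raw))   # -> "Hmph. Fine."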
>>
>>100175101
do you find 275w limit to be the sweet spot? I didn't find any degradation in speed at 250w
>>
File: 00041-404906826_1.png (1.79 MB, 1456x1024)
>>100175423
>>100175502
mfw dual A6000s keeping me toasty quanting another experiment while yet nother /aicg/ streetrat seethes that he has to scrape and scrimp pirated keys for a fleeting taste of the good stuff
>>
>>100175535
wat? It was never a cost issue? Then what is your reason? Are you going to say something stupid like privacy because you're too stupid to obscure your data?
>I'm choosing to eat a shitburger at home BECAUSE REASONS
This is you anon.
>>
>>100175624
You really thought we were spending thousands of dollars on GPUs because $20/month cost too much?
>>
>>100175624
Learn to cook brownoid
>>
>>100175575
Mhmm I'm really seething here I didn't buy a dozen extra cards. I'm so mad you have no idea. Boy if only I had 96gb vram so I could run at a decent quant. Damn I'm mad. FUCK!
>>
>>100175663
t. net worth: $23,404.68
>>
>>100175644
Honestly I thought it was because you're just a fucking retard but maybe I'm wrong. That's why I asked you what the reason was. I still think it's because you're a fucking retard but we'll see if you reply with a good answer.
>>
Fucking cattle, please eat the bugs and own nothing
>>
Dumb question: How to use LCUDA on Windows koboldai? Or is it impossible? Is it better than ROCm?
>>
Is setting rope for llama3 as easy as it was for l2? Does the quality drop significantly or is it completely fine to do it
>>
>>100175683
Do you really think posting your bank account value will win your argument?
>>
File: 00003-1532105500_1.png (1.2 MB, 1024x1024)
>>100175663
>tfw digital streetshitter anon tries very very hard to ironypost
>>
>>100175032
Based
>>
>>100174096
i do. i use dualboot. and i only use SD on linux.
but i just usually use windows due to a few key work related programs and kcpp is just easy to use.
>>
>>100175853
Just use VMs with PCI(e) passthrough.
>>
>>100175644
claude isn't 20$ a month, it's 20$ a month to use on their website.
>>
>>100175900
i wouldn't be comfortable sending my scenarios to claude lmao.
>>
>https://huggingface.co/BXBX/Moistral-11B-v3-8.0bpw-h8-exl2
Done quanting Moistral v3 8bpw exl2, fits on 12GB VRAM with full context
5bpw for 8GB vramlets coming soon
>>
File: WooMiku.png (1.75 MB, 800x1248)
>>100175818
Poverty is noble
>>
>>100174958
is that work or your own hw? impressive, very nice.
>>
>>100175687
nta but I run local models because I enjoy running models locally at home. I get better satisfaction and more enjoyment knowing that it's all on my machine. I wouldn't expect you to understand or care. Even if Claude opus was suddenly free for everyone I would still choose an inferior local model. We live in a different world: I started at the bottom, and my gens have only improved over time. Gpt at least has gotten measurably worse over time. How many times has /aicg/ gone through proxygeddon? compare that to the zero (0) times I have been denied access to my local compute. I enjoy the technology, I enjoy seeing the improvements in models, and I truly do not give a fuck even if corpo models were free 1 billion context and came with a synchronized vibrating onahole.
>>
>>100175687
NTA but you speak like a autist retard
>>
File: 1709079209.png (877 KB, 1290x606)
can i get the latest redpill on using llms for coding assistance? im talking:
- explaining code blocks
- searching for bugs
- creating patches from descriptions of the desired effects

also, is it possible to train an llm on a given codebase to make it more useful?

t. lazy retard
>>
>>100176153
The latest is still the oldest. LLMs are only useful for shitting out jeet code and will not be of any use to a human.
>>
>>100176099
>nta but I run local models because I enjoy running models locally at home. I get better satisfaction and more enjoyment knowing that it's all on my machine.
Autism
>Gpt at least has gotten measurably worse over time.
It is still miles better than local models though
>How many times has /aicg/ gone through proxygeddon? compare that to the zero (0) times I have been denied access to my local compute.
Proxygeddon only happens to poorfags.
>I enjoy the technology, I enjoy seeing the improvements in models, and I truly do not give a fuck even if corpo models were free 1 billion context and came with a synchronized vibrating onahole.
Again, autism.
>>
>>100175740
>Dumb question
Yes.
>How to use LCUDA
What is that?
>on Windows koboldai
Koboldai is the pytorch-based one.
>Is it better than ROCm
ROCm is for AMD GPUs.

If you have Linux you can use koboldai with a cuda pytorch for your nvidia card, or you can use koboldai with rocm pytorch for your supported amd card. If you have Windows pytorch+amd=no, and if you have nvidia using WSL and running cuda pytorch in there is recommended.
>>
>>100173514
wait a second
OP added petra to the miku bread.. you're kidding me
>>
>>100176172
ok but what about reading code? explaining shit, just helping me comb through code. like a million jeets who grep the code on my behalf, is that possible?
would finetuning help here?
>>
>>100176153
It's honestly hard to find a good use for LLMs on coding tasks. I tried multiple times using LLMs, be it local, GPT-4, or Opus, on any subject I was knowledgeable about and they were just a waste of time.
The only time they are useful is looking for basic shit in a language I'm not familiar with; it's faster than using a search engine. But for that, any model is good enough, I currently just use llama 3 8b for it. I use some neovim plugin, but 80% of the time I use it for editing text like mail or commit messages instead of code.
>>
So has anyone built a chatbot with hierarchical annotated memory yet? Not agent stuff, just simply what we've been using already, except with a better memory system than simple vector DB RAG.
>>
File: 849.png (498 KB, 1066x863)
>tfw you realize meta got dedicated pajeets filtering next llama models
>>
>>100176201
I've found L3 70b is the best at spitting out useful code that works out of the box and is ok for analysis but is hamstrung by its medieval context limits.
Yi-34b-200k is surprisingly good for analysis using in-context training if the portion of your codebase fits into the context limit.
>>
>>100176199
Gotta give him credit. All it took was some subtlety.
>>
File: file.png (49 KB, 730x409)
>trained multiple ridiculously performant fine-tunes
which ones?
>>
>>100176287
>LLaMA 3
>extended context from 8K -> 128K
Ok, where have you all been keeping this from me.
>>
>>100176194
>autism
sure, and?
>it is still miles better than local models though
not anymore, at least for erp. and if you were to remove its ability to search the internet I think it would generally suck at everything with how lobotomized it has gotten
>proxygeddon only happens to poorfags
such as yourself? since a few grand is obviously beyond your purchasing power
>again, autism
thankfully my autism has gotten me a job that pays well enough that I could buy a brand new 3090 every two weeks without compromising my lifestyle or having to draw blood for my mortgage payment
>>
>>100176287
>128K context
Is the idiot mixing up Llama 3 with Phi-3?
>>
File: 00036-468519150.png (1.69 MB, 1456x1024)
>>100176312
>my autism has gotten me a job that pays well enough that I could buy a brand new 3090 every two weeks without compromising my lifestyle
Based buyer and saver
>>
>>100176299
First I've heard of it.
>>
>>100176325
Was Phi's 128k version even real context? Like why even release the 4k version if they have 128k?
>>
>>100173727
It has the least slop and GPT-isms I've seen in a long while. It's not the smartest, but the vocabulary sells it for me and is a breath of fresh air (been messing around with it for an hour or two)
>>
>>100176325
>twitter AI personality has no idea what he's talking about
many such cases
>>
>>100176299
https://huggingface.co/MaziyarPanahi/Llama-3-8B-Instruct-32k-v0.1-GGUF
https://huggingface.co/NurtureAI/Meta-Llama-3-8B-Instruct-64k-GGUF
>>
>>100176199
I recognize our dear old petra, same antics.
>>
>>100176345
This smile reminds me of that one image that was drawn by the drawfag that ended up trooning.
>>
>>100176369
>32k
>64k
That's not 128K.
>>
It's so funny how certain models work so much better with the wrong prompt format if you aren't trying to use it as an assistant.
Talking about Qwen 1.5 32B specifically, but I've seen that happen to other models too.
>>
File: 00022-1199107278.png (1.22 MB, 1024x1024)
>>100176369
>Meta-Llama-3-8B-Instruct-64k
Fake as fuck. No instructions or description of what they did to extend the context. Tested it in exl2 already and it barely works up to 20k with rope
>>
File: 24-04-19 09-59-12 1242.jpg (153 KB, 1024x1024)
I've been playing with the small 8B llama3, and I notice it likes to "O-oh..." me a lot - both with a very simple prompt directly in the llama.cpp API, as well as some of my favorite cards in SillyTavern.
I haven't tried 70B yet, since I'm on vacation at the moment and only have the 32GB macbook to play with.
>>
Anyone have a sense about mradermacher's older imatrix quants, given that the llama3 ones are broken? I downloaded one of his WizardLM2 8x22B, and I'm trying to figure out if I need to get a different one (or download a third of a terabyte to quant it myself)

From https://github.com/ggerganov/llama.cpp/issues/6841 it sounds like the breakage was resulting in outright garbage, as opposed to subtle quality loss. So, given that the model I have is not spewing obvious garbage, it seems likely fine. But I wanted to double-check, in case there's an insidious "subtly worse" failure mode that I would never notice.
>>
Do you think the upcoming Phi 7B or 14B will beat Llama 3 8B?
>>
>>100176244
>>100176261
thanks for the info guys, wish i could leave you some reddit gold but this website doesnt let me :(
>>
>>100176506
yes it will be more slopped
>>
>>100176506
I think it will have strengths and weaknesses over the Llama but not beat it. It is a very different dataset and that will show in what it can do well. The 3.8B already beats all 70B+ local models on some problems I tested it with.
>>
miku posters are unhinged
>>
>>100176195
**ZLUDA mb
>>
File: Miguruguru.png (1.62 MB, 800x1248)
1.62 MB
1.62 MB PNG
>>100176550
>unhinged
I think you mean "ascended"
>>
>>100176548
>The 3.8B already beats all 70B+ local models on some problems I tested it with.
I can hardly believe this unless you post logs.
>>
>>100176550
Unfortunately I only have niche tastes, not mental illness. Otherwise I could blame it on mental illness, rather than just being a weirdo.
>>
>>100176566
Good morning, sir.
>>
spoonfeed me an easy way to set up a high quality TTS for text generation webui.
>>
>>100176194
I think anon is retarded for buying hardware without getting net returns on the investment (they could at least sell their GPU compute on vast.ai and pay off the cost of the GPU in like a year, but being a provider is more difficult than just buying from a provider and saving your money).
I am looking forward to renting 400+gb of vram for $10-20 an hour to try erping with llama 400b.
That hardware would cost me $50,000 (more like $100,000-200,000 with h100's, if I bought the hardware I would go for a 2x mi300x or 20x 3090's). Considering the fact that I fap in like 10 minutes every day, it would take me 2500 days for the $50,000 worth of hardware to be more worth it than renting for $20 a day (and I don't even fap to AI every day or account for the cost of electricity).
But who knows, maybe 400b will be shit for ERP, and nobody can finetune it.
>>
>>100176600
>erping with llama 400b.
>That hardware would cost me $50,000
Like, a fifth of that if you're not retarded.
>>
>>100176600
>hardware would cost me $50,000
still cheaper than getting divorced
>>
70b rp undi finetune wen
>>
>>100176567
I guard my test set so that it will never have even the slightest chance of being trained on, so I will not do that. You're free to distrust my claims.
>>
>>100176639
monday, 3pm
>>
>>100176287
I tried the dolphin 8B finetune and yeah it's uncensored but it made it retarded. I got base 8B to solve a simple math problem (yes, I know) but then the dolphin finetune failed.
>>
>>100176566
sir please do not redeem ze miku shartsune bloody bastard kind sir thank you
>>
>>100176550
they were never on a hinge to begin with
>>
>>100176623
I like arguing over hardware, give me your dream setup for 400b (even if it's Q4, I am probably gonna rent for Q8).
>>
File: 24-04-19 22-00-14 1393.jpg (202 KB, 1024x1024)
>>100176600
>I think anon is retarded for buying hardware without getting net returns on the investment (they could at least sell their GPU compute on vast.ai and pay off the cost of the GPU in like a year, but being a provider is more difficult than just buying from a provider and saving your money).
I highly doubt you can break even on electricity costs from vast, let alone pay back your hardware. I feel like the only party making money would be vast.
>>
>>100176725
>give me your dream setup for 400b
2 x C4140 (8xV100 32GB) = 256GB VRAM for $10k
>(even if it's Q4, I am probably gonna rent for Q8).
Pretty sure bigger models suffer less by being quantized. Q4 should be fine, but even if it's not, I have a spare 3090 and can offload the rest to RAM.
>>
>>100176731
>I feel like the only party making money would be vast.
Why bother doing math if you have feels, right?
>>
running local is just stupid in 2024 I really don't see the point and all the arguments are just justifying my reasons further in fact all I'm really seeing is cope and retards with too many cards
>>
How do you know how many context tokens a model can handle?
>>
File: 24-04-19 10-05-34 1251.jpg (223 KB, 1024x1024)
>>100176813
OK, how much profit do you make from vast.ai?
>>
>>100176623
>400b.
>$50,000
What is the price point where you would start considering a mail order bride? And what would be the number of beaks for that price where ai wins over bride?
>>
>>100176960
A wife might be more financially sound if you have zero income, but otherwise you have to consider the 50%+ of all your wealth and income you pay in perpetuity
>>
>>100176960
Women are fucking expensive to keep happy. More so if you have children. Just one kid will cost you quarter to half a million dollars before you can legally kick them out. So, beaks can 10x and they'd still be cheaper in the long run.
>>
>>100176960
NTA but 3D can't compete with AI fantasy roleplay.
>>
>>100176841
It usually says on the model page, but if you're running a gguf it's also in the metadata that's displayed when you load the model.
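If you'd rather check before downloading the weights, the trained context length is also in the HF config (a sketch with transformers; the Meta repo is gated, so point it at whichever mirror you actually use):

from transformers import AutoConfig

# max_position_embeddings is what the model was trained with; anything a quant
# uploader claims beyond this is rope/NTK extension, not extra training.
cfg = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(cfg.max_position_embeddings)  # 8192 for Llama 3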
>>
>>100176960
The only reason I don't have a mail order bride is nobody taught me how to do that. Hell, I think some of the countries pay YOU to get a girl a greencard.
>>
>>100176822
You're not wrong but what other options do we have? I'm not waiting ten years for ai to get better I'll just play with it now even if it's bad.
>>
>>100176312
>not anymore, at least for erp.
lol
lmao even
this is false, but even if you weren't saying this out of your ass, how would you know? aren't you a LOCAL autist? Or are you telling me you tried Claude Opus? Did you lurk aicg to see how good Claude Opus is?
I guess this tells a lot about you.
>such as yourself? since a few grand is obviously beyond your purchasing power
Cope. I would rather invest my money to retire earlier than waste all my money on niche hardware that will lose its value and become deprecated in a few years.
>>
>>100176960
>mail order bride
yeah let me pay to get into a retarded relationship that will simmer with resentment until it explodes, sounds like a great investment
>>
>>100176797
maybe you might find that at a local liquidation auction that won't accept shipping, but I have a feeling that you will only get like half the vram for $10k, and getting 400gb would add up to around $40k.
>>100176813
Not anon, I think the people hosting are 100% making money, but I think what anon is referring to is that residentially you don't have access to cheap electricity and cheap ISP service (and I think business rates are cheaper than residential + less taxes, but the downside is I think you need to own a company building in a business zone or something).
So it's the same problem people felt with mining bitcoin, when mining coins on a gaming GPU cost more in electricity than what it paid out.
I still think you can pay off your 4090 in a few years, but if it's constantly at 100% power draw (400 watts) at a cost of like 15 cents per kilowatt-hour, that's about $500 per year. But if selling to vast.ai is 20 cents per hour (below market) you get $1750, so you pay off your 4090 in a year on paper (realistically it's not at 100% load 24/7 but also not rented 24/7, and that's not counting internet and what vast takes out).
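Putting those guesses in one place so the break-even is explicit (every rate below is an assumption from this post, nothing measured):

# Back-of-envelope 4090 payback on a rental marketplace at 100% utilization.
watts, price_per_kwh = 400, 0.15          # assumed draw and electricity rate
rent_per_hour, card_cost = 0.20, 1750.0   # assumed rental rate and card price

electricity_per_year = watts / 1000 * 24 * 365 * price_per_kwh  # ~$526
gross_per_year = rent_per_hour * 24 * 365                       # ~$1752
net_per_year = gross_per_year - electricity_per_year
print(f"payback: {card_cost / net_per_year:.1f} years")          # ~1.4 years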
>>100176635
Honestly I wish I could cope and say "robo wife is cheaper" I think sex is 100% free if you try to keep it that way, and the only downside is that humans have ego and they don't follow and like everything you do unlike an AI. I don't want a GF because I think women are going to fuck up my self confidence as a virgin and I won't be truely happy if the girl isn't truely happy, and it's weird how girls on dating apps casually had sex with 30 guys, I feel like there is some sort of societal imbalance preventing people from just being together forever. So I guess I'm an AI incel???
>>
>>100177130
if you don't treat her like shit enough fucking will make any girl love you cause oxytocin. the tough part for most is getting to the fucking part
>>
>>100177136
>maybe you might find that at a local liquidation auction that won't accept shipping, but I have a feeling that you will only get like half the vram for $10k, and getting 400gb would add up to around $40k.
Again you retards and your feelings. I already have one. Just need to get a second in the upcoming months.
>>
>>100177114
i'll have this hardware and still retire early. and I was specifically referencing gpt4 which has measurably gotten worse, there have been academic papers about this even
Two grand for 2x3090 plus a few hundred for 128gb memory has literally zero bearing on my retirement whatsoever. I am so sorry that you are struggling in life, and I hope that things get easier for you in the future. I'm going to continue having fun with my local models and I'm skeptical that there's anything you can do about it
>>
Can llm learn anything from large code base?
>>
File: BoheMiku.png (1.74 MB, 800x1248)
>>100177136
>I think sex is 100% free if you try to keep it that way
Yeah, but you tend to end up with chicks like picrel
>>
File: tttet.jpg (428 KB, 1825x1152)
>>100161344
>>
>>100177145
idk anon, in my experience you gotta hit that infatuation mark before the fucking for the woman love to set and cure properly
>>
>>100177181
>/lmg/ actually believes they'll be able to run gpt4 on 2x3090s
holy cope batman
well you'll figure it out eventually how are those L3 finetunes coming along btw?
>>
>>100177263
Me on the left side of the right image
>>
>>100177181
I see, so you close your eyes to avoid facing the reality... Such unfiltered cope.
>>
>>100177181
>2x3090
LMAO, I hope you have plans to buy more for LLaMA 3 400B
>>
>>100177263
Is the pixel Teto AI-generated? If so, model?
>>
>>100177181
>"richfag"
>didn't buy 4090
ngmi
>>
>>100177370
only retards go with 4090s ideally you want 10 to 20 3090s to future proof yourself
>>
>>100177340
400B won't be noticeably better than 70B anyway. Mememarks aren't everything.
>>
>>100177343
https://www.mediafire.com/view/zzr1x9dzf0b9vuz
>>
>>100177380
why are you calling CUDA dev retarded? take it back
>>
>>100177382
This is true. There's a paper from Google that shows scaling up compute without scaling up training data will net you minimal gains. And from the looks of things we're plateauing data-wise. Sure Altman may try to retard strength it but he won't get his superintelligence that way
>>
>>100177382
I actually agree with you. I still think OpenAI/Anthropic has some secret sauce.
>>
File: PersonalMikuDJ.png (59 KB, 1136x912)
>>100177343
nta, but pixelArtDiffusionXL_spriteShaper.safetensors [7adffa28d4] works really well for me
>>
>>100177452
The secret sauce is 256x1B
>>
>>100177114
>Did you lurk aicg
NTA but i go in there maybe once a month and the few logs i've seen posted are roughly the same as the ones you see in here, except the perversions are an order of magnitude more retarded.
this really is the dumbest shit to get upset over or try to argue about
>>
>>100177501
>>100177263
>>100177260
>>100176910
>>100176731
https://www.youtube.com/watch?v=fsUvejZPTLI&t=3595s
>>
>>100177452
the secret sauce is proprietary datasets containing copyrighted information. beyond that I really don't think they're doing much more than some fancy vector db and plugins to pull from external sources on GPT's end. Claude I don't think has any tricks like that, just a good well-curated dataset.
>>
>>100177380
>only retards go with 4090s ideally you want 10 to 20 3090s to future proof yourself
3090 and 4090 have essentially the same memory bandwidth bottleneck, so 2 3090's will run at about the same speed as 2 4090's.
The larger the model, the more bandwidth needed; for inference NVLink/PCIe is not the bottleneck, memory bandwidth is.
So if you can run a Q4 70b model at like 15 tk/s on 2 3090's, you should be able to get 3-5 tk/s if you get more 3090's to run Q4 400b (someone with 10 3090's loaded 70b at full precision with 150gb of vram usage and it ran at 3-5 tk/s https://old.reddit.com/r/LocalLLaMA/comments/1c9l181/10x3090_rig_romed82tepyc_7502p_finally_complete/).
If you want something that will run 400b at a fast speed, you need something like an H200 (it costs as much as a luxury car) or AMD's MI300X (a fraction of the price, but requires special OAM for the mobo, and AMD LOL).
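Easy to sanity-check with the usual rule of thumb: single-stream decode speed is capped at roughly memory bandwidth divided by the bytes of weights streamed per token. A sketch with approximate sizes:

# Splitting layers across cards doesn't add bandwidth for a single stream (the
# cards run in sequence), which is why 10x3090 still lands in single digits.
def max_tok_per_s(bandwidth_gb_s, active_weights_gb):
    return bandwidth_gb_s / active_weights_gb

print(max_tok_per_s(936, 40))    # ~23 t/s ceiling: ~40GB Q4 70B, one 3090-class card
print(max_tok_per_s(936, 230))   # ~4 t/s ceiling: ~230GB Q4 400B spread over many cards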
>>
>>100177599
*take this with a grain of salt, I have zero knowledge in actual AI or benchmarks, I am looking for someone to call me an idiot
>>
>>100176502
No way anon. After seeing the way he acts when others point out the holes in his bad imat files I'm staying clear of all his shit
>>
Can anyone point me to code that will let me display images in the Gradio chatbot? I have the image available on local disk and I would like it to present it to me in the chat.
Ive tried embedding it as markdown code and returning markdown code to no avail.
>>
>>100177599
You are a very smart and valuable person.
>>
>>100174110
https://mathchan.org/ai/ needs more love
>>
>>100177599
>(someone with 10 3090's loaded 70b full precision with 150gb of vram usage and it ran at 3-5tk/s https://old.reddit.com/r/LocalLLaMA/comments/1c9l181/10x3090_rig_romed82tepyc_7502p_finally_complete/).
If that reddit retard knew what tensor parallelism was and wasn't running at full precision, maybe his speeds wouldn't be shit.
>>
>>100177695
Can we move /lmg/ there? The captcha would keep out the riffraff. Maybe less raiding and thread hijacking.
>>
>>100177559
Ignorance truly is bliss...
>>
This might explain some stuff for people with quanted llama 3 models.

https://www.reddit.com/r/LocalLLaMA/comments/1cci5w6/quantizing_llama_3_8b_seems_more_harmful_compared/

Apparently llama 3 takes a huge hit going down from 8 bit to 6 bit unlike older models which didn't take a huge hit till under 5 bit.
>>
>>100176005
> jew construct.
>>
>>100177263
teto with sexo
>>
>How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study
https://arxiv.org/abs/2404.14047

>>100177788
Makes sense. They trained on 15T tokens. Each weight packs a lot more information, meaning quantization is going to hurt more compared to undertrained models.
>>
Guys...
what if...
Guys, listen!

What if we trained a LLM to predict... the previous token of a sentence?
>>
File: 1706272736148234.jpg (227 KB, 960x960)
>test my two new 3090s by loading a 3bpw mixtral onto each
>temps spike to 80~90C and the fans sound like they're about to take off
I guess I'll have to replace the thermal pads on these. The 3090s I already have came with the pads already swapped so I didn't realize how lucky I was.
>>
>>100177599
Does VRAM overclock worth it?
>>
>>100177634
this - don't be a child and blame llama.cpp for your shitty quant. he is clearly a very emotional individual. bart's quants have usually worked for me and there's way less whining when they don't. like it should be.
>>
>>100176703
maybe they figured out "safety" such that trying to finetune it away will make it retarded.
>>
>>100177878
pretty much all GPU's are already overclocked, some people are underclocking their GPU so it's more power efficient and so fans don't spin so hard.
>>
>>100177788
It is reddit so it is like the spergs here that say they see a huge difference between Q8 and Q5 because it touched their cock incorrectly that one time. Except in /lmg/ someone will call him a faggot and a retard and on reddit people will be nice to him.

Any memeplexity measurements done for quants? That is actually the only thing memeplexity is good for. Also makes me think that if he is actually correct (even without giving any source for what was revealed in his dream) that would mean that bitnet is dead. The spare unneeded extra accuracy of weights is a thing of the past, and now you are all going to be running a 13B 8-bit, or a 30B 8-bit for those who got 2 cards.
>>
>>100177938
The buzzword to content ratio in this post is off the charts.
>>
>>100177938
Why didn't you just read up on BitNet before spouting bullshit about it? I bet you think it's a quant method too huh?
>>
>>100177978
Nope I stand by what I said if you don't understand the point then you are dumb.
>>
>>100177938
Even if these heavily trained models hurt from quantization more, it doesn't follow that 13BQ8 > 30BQ4. Packing more data into floating point weights is clearly a horribly inefficient and slow process. Even if a bitnet 70B saturates before fp16 70B (which is something I'd worry about), it should still be better than a 30BQ4 of equal size trained for the same time.
Also the best way to compare quants is kl divergence, but ppl is a reasonable substitute.
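For anyone who wants to measure instead of argue: dump next-token logits from the fp16 and quantized models over the same text and average KL(fp16 || quant). A toy sketch of just the math (the logits here are made up; in practice you'd take them from the two models):

import numpy as np

def kl_divergence(p_logits, q_logits):
    # KL(P || Q): information lost when the quant's distribution Q stands in
    # for the reference fp16 distribution P at this position.
    p = np.exp(p_logits - p_logits.max()); p /= p.sum()
    q = np.exp(q_logits - q_logits.max()); q /= q.sum()
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(0)
ref = rng.normal(size=32000)                       # pretend fp16 logits over the vocab
quant = ref + rng.normal(scale=0.05, size=32000)   # pretend quantized logits
print(kl_divergence(ref, quant))                   # lower = quant tracks fp16 closer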
>>
>>100176566
She's just a bunch of noise-pollution, a digital abomination created to torture our poor ears. Miku's 'ascension' is just a myth perpetuated by her brainwashed fanbase.
>>
>>100177991
Your "point" goes out the window when you're technically incorrect
>>
>>100177899
That or the Dolphin dataset is garbage. Because the answers are always really short and bad.
>>
>>100172723
What stack are you using?
>>
>>100178124
probably some gay c++ library like imgui
>>
>>100178124
nta but looks like imgui
>>
>>100178149
>>100178151
yeah seems like it thanks
>>
>>100178051
>you're technically incorrect
kill yourself you nigger
>>
File: 20-meetingthepope.jpg (84 KB, 960x539)
>>100178149
<= imgui's dev
>>
>Fire up beat saber
>Look for a song to play
>Most of my songs are Miku
>Remember the meltdown yesterday...
Thanks trannies....
>>
>>100178265
Beat saber more like meat saber
>>
>>100177831
where are the SmoothQuant quantized models of L3-70B-Instruct then? They tested it but they didn't upload the models on their own HF repo? Their repo is here: https://huggingface.co/mit-han-lab
>>
>>100178281
It is a god game for basement autists. I hate physical exercise but it tickled my autism enough that I am still playing it at least once a week for 3 years now.
>>
>>100178293
It is at the bottom of summary... https://huggingface.co/LLMQ
>>
>>100173727
>Closed dataset
>Kobold
Nah.
>>
>>100177831
This is why I don't understand why they chose 8 and 70b. They know the market hardware available. Why the fuck aren't they making a 35b or a 40b? What good is a 70b we have to run at low quants? Is the only point to win stupid benchmarks? Okay then we need a benchmark for 40b, what's that? There are no 40b models? Then you lost to yi, congrats zuck, you lost to yi!
>>
>>100177597
The Claude dataset must be the most interesting one in all of ML imo. It has such a different personality compared to EVERY other language model. I'd love to know what they did.

Given how much smuttier it is than all the others I guess it's possible the only difference is that Anthropic doesn't remove stuff like ASSTR or Literotica from the dataset? But I'm not sure those alone would lead to it having such a different and more human-like personality.
>>
>>100178115
It's the dataset. Every single one of them have so much shit data from deciding to ouroboros synthetic data from other bots and not cleaning it up. I can't believe every one of these people decided that the state of things was fine because they were able to improve on prior Llama releases so 3 would be no different. The finetune for Llama 3 was obviously done on mined Meta social network data and there is no way any synthetic data is going to match that quality. I guess I'm going to have to suck it up and download one of those "uncensoring" finetune models if possible but man, this really sucks that the community got that complacent and fine with the state of things. I don't think for the next 3 months anyone is going to be able to fine-tune past what the official instruct release did because of how much work is needed to clean a dataset to get it anywhere near where it needs to be.
>>
File: file.png (1.22 MB, 768x768)
My anime image of the day.
>>
Can you make a learning model understand causality? How would you encode causality into a model? What would be your mechanism for making a model understand causality? Would you bruteforce it using statistical techniques?
The principles of most models I've seen so far are about encoding world data (text, audio, video, etc..) as compressed bits of information into models. Do you think that current models are able to infer causality from the encoded bits of information as an emergent property? And does it do well the longer you train on the data and the more tokens you feed it?
>>
>>100178471
right now you need to buy new hardware to run the latest and greatest AI models.
however the prices are not going down, so right now we need to spend around $1500-2000 to run 70b, and in the next 2 years you will need to spend around $4000 on the next latest and greatest AI.
>>
>>100178725
>And does it do well the longer you train on the data and the more tokens you feed it?
Yes no maybe? Made me realize that probably at the beginning of training it is learning compression of data instead of actual reasoning. Eventually it should start learning reasoning because it will let it compress more efficiently, but now that I thought about it maybe the problem is that the structure of the network ends up in a sort of local minimum of compression and can't really learn reasoning efficiently? But that would be pretty easy to prove or disprove if you use some benchmark for reasoning during training to see if the reasoning accuracy progresses at the same rate as reciting wikipedia. Also I am just a 4chan moron so I don't know what I am talking about.
>>
>>100173514
I'm trying to figure out how a chatbot could be integrated into a game. I suppose that if you lead with a prompt explaining to the bot the context of the NPC it could talk in the moment, and if you have it say some command e.g. *follow player* or *attack* you could have it interact with the world.

Is there some extensive research group or similar where one can read up on what ideas people have come up with, and how they execute it?
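The usual trick is exactly what you describe: tell the model to emit actions in a fixed bracketed form and strip them out of the reply before showing the dialogue. A sketch (the command names and syntax are made up for the example):

import re

# System prompt asks the NPC to wrap world actions like *follow_player* or
# *attack target=wolf* in asterisks; everything else is spoken dialogue.
ACTION_RE = re.compile(r"\*([a-z_]+)((?:\s+\w+=\w+)*)\*")

def split_reply(reply):
    actions = []
    for name, argstr in ACTION_RE.findall(reply):
        args = dict(a.split("=") for a in argstr.split())
        actions.append((name, args))
    dialogue = ACTION_RE.sub("", reply).strip()
    return dialogue, actions

print(split_reply('"Stay close." *follow_player* *attack target=wolf*'))
# ('"Stay close."', [('follow_player', {}), ('attack', {'target': 'wolf'})])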
>>
File: 1.png (66 KB, 702x318)
picrel: response with prompt from >>100171961
this DPO tune, working and failing at the same time. https://huggingface.co/mradermacher/Llama3-8B-DPO-uncensored-GGUF
>>
>The sexual tension builds deeper in her spleen, her body responding eagerly.
wat
>>
>>100178725
Isn't this attention? Statistically bruteforcing what appears to be causality? Is there really some essence of understanding the causal links? Is it just a mirage from dumb rule following? In the Chinese Room does that matter?
>>
>>100178870
How is spleen tokenized? Did your sampler have sp- and not pick spine?
>>
>>100178656
Whether they (dolphin/hermes authors) like it or not, they'll eventually have to scale finetuning data down to curate it properly instead of continuing to use millions of GPTsloppy examples. A relatively small hand-curated finetuning dataset (~10^3-10^4 examples) + large human preference dataset (in the order of 10^5-10^6 examples or more) should be the proper way.
>>
>>100178744
Not so fast richnigga https://hacks.mozilla.org/2024/04/llamafiles-progress-four-months-in/
"Today, you can today use the very latest and most capable open models with llamafile thanks to her hard work. For example, we were able to roll-out llamafiles for Meta’s newest LLaMA 3 models–8B-Instruct and 70B-Instruct–within a day of their release. With yesterday’s 0.8 release, llamafile can also run Grok, Mixtral 8x22B, and Command-R."
When llamafile hits the mainstream, there will be a shift toward server-class processors with 64+ cores and many DDR5-6400 memory channels for inference-only builds.
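Back-of-the-envelope for why the memory channels are the thing to watch (my numbers, purely illustrative): token generation on CPU is roughly bandwidth-bound, since every generated token streams the whole model through RAM once.
[code]
# Rough decode-speed estimate: usable DRAM bandwidth / bytes touched per token.
# The efficiency factor is a guess; the point is the relative gap, not the
# absolute numbers.

def bandwidth_gbs(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """Peak DRAM bandwidth in GB/s (channels * 8-byte bus * transfer rate)."""
    return channels * bus_bytes * mt_per_s / 1000

def tokens_per_s(model_gb: float, bw_gbs: float, efficiency: float = 0.6) -> float:
    """Upper-bound decode speed for a model that is streamed once per token."""
    return bw_gbs * efficiency / model_gb

q4_70b_gb = 40  # ~70B params at ~4.5 bits/weight

for name, channels in [("desktop dual-channel DDR5-6400", 2),
                       ("12-channel server DDR5-6400", 12)]:
    bw = bandwidth_gbs(channels, 6400)
    print(f"{name}: ~{bw:.0f} GB/s -> ~{tokens_per_s(q4_70b_gb, bw):.1f} t/s on a Q4 70B")
[/code]
Core count barely moves the needle once you're bandwidth-bound; the 6x channel gap is what buys you the speedup.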
>>
>>100176365
>twitter AI fag
>unironically calls it "X"
>"No-Code" in bio
every time
>>
>>100178870
>he's never had a hooker massage his spleen
get a load of this pleb
>>
>>100178913
>llamafile can also run Grok
Is this a reason to be proud?
>>
>>100178913
yup... I'm thinkin' jart won
>>
>>100178913
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>her
>>
>>100178871
If I understand it correctly, no. Attention and Flash Attention just assign higher weight to the information that's most 'important' for the task at hand.
>>100178801
Reasoning benchmarks are retarded. MSFT's Phi team already proved the point - you can train a relatively small LLM, up to 8B, on targeted datasets and it will score high on reasoning benchmarks, but the moment a user actually uses it, it's retarded as fuck. The best 'benchmark' would be making these LLMs navigate a maze in a rogue-like fashion and taking the average over their runs.
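Half-serious sketch of what that maze "benchmark" could look like (entirely my own toy, nobody's actual eval): feed the model the ASCII map each turn, parse one move out of the reply, score steps-to-exit averaged over runs. The random-walk policy is just a baseline so the harness runs on its own; you'd swap in a function that prompts your model and extracts a single move letter from the completion.
[code]
import random

WALL, EXIT = "#", "E"
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

MAZE = [
    "#######",
    "#.....#",
    "#.###.#",
    "#.#...#",
    "#.#.#E#",
    "#...#.#",
    "#######",
]

def render(maze, pos):
    rows = [list(r) for r in maze]
    rows[pos[0]][pos[1]] = "@"
    return "\n".join("".join(r) for r in rows)

def random_walk_policy(observation: str) -> str:
    # Baseline policy. An LLM policy would build a prompt from `observation`
    # and return the first valid move letter found in the reply.
    return random.choice(list(MOVES))

def run_episode(policy, start=(1, 1), max_steps=50) -> int:
    """Steps taken to reach the exit, or max_steps if it never gets there."""
    pos = start
    for step in range(1, max_steps + 1):
        move = policy(render(MAZE, pos))
        dr, dc = MOVES.get(move.strip().upper()[:1], (0, 0))
        nxt = (pos[0] + dr, pos[1] + dc)
        if MAZE[nxt[0]][nxt[1]] != WALL:
            pos = nxt
        if MAZE[pos[0]][pos[1]] == EXIT:
            return step
    return max_steps

scores = [run_episode(random_walk_policy) for _ in range(100)]
print("average steps to exit:", sum(scores) / len(scores))
[/code]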
>>
/lmg/, please give me some RP situations that 7B/8B models usually suck at.
>>
are there any good models for generating 3d meshes?
>>
Has anyone tried bark.cpp yet?
>>
Some days ago gguf were broken because newlines got merged. Is that fixed by now?
>>
>>100179014
>Reasoning benchmarks are retarded.
You could just use a pure math benchmark. It's all just to check the trend: whether it keeps gradually getting better at math as it gets better at compressing data. If you see the math results slow down while the compression results keep improving, then it is probably becoming just a retarded winrar for text.
>>
>burgers are home
fuck. it's all tech support from here
>>
File: THE SPLEEN THO.png (39 KB, 881x311)
39 KB
39 KB PNG
>>100178888
>pic related

>>100178924
Can't say I have.
>>
>>100179024
https://www.chub.ai/characters/Vyrea_Aster/doppelganger-interrogation-simulator-654daf19
>>
>>100179076
https://github.com/PABannier/bark.cpp
>no .exe
no thanks.
>>
File: thumb-1920-1127692.png (1.13 MB, 1920x1080)
1.13 MB
1.13 MB PNG
>>100179094
I can make the next thread tech support edition.
>>
>>100179146
haven't you made enough threads?
>>
File: q1h6mwgu9vz51.jpg (402 KB, 854x1200)
402 KB
402 KB JPG
>>100179154
After yesterday? I am just getting started.
>>
>>100179094
good morning sir please kind bastard redeem the american burger home thanks!
>>
>>100177831
There's only one fp16 8B quant
https://huggingface.co/MaziyarPanahi/Meta-Llama-3-8B-Instruct-GGUF/tree/main

Also, these results could explain why so many people think 8B is retarded. At fp16 it's superior to Q4 70B, which is huge (and an actually good use of 24GB of VRAM).
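The VRAM math, for reference (weights only, ignoring KV cache and runtime overhead, so real usage is a bit higher; the bits-per-weight figures are approximate):
[code]
# Rough weight-only memory footprint: params * bits-per-weight / 8.
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

print(f"8B  @ fp16   : ~{weights_gb(8, 16):.0f} GB")    # fits a 24 GB card
print(f"8B  @ Q8     : ~{weights_gb(8, 8.5):.1f} GB")
print(f"70B @ ~Q4_K  : ~{weights_gb(70, 4.8):.0f} GB")  # two 24 GB cards or CPU offload
[/code]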
>>
>>100179092
Ehhh, there are lots of caveats with benchmarks, especially task-oriented ones like math benchmarks. Take Phi-3, connect it to WolframAlpha, and you have your own agentic math buddy. I'm wary of benchmarks because they can be easily gamed. Testing models needs to be broader and more active - stochastic scenarios at different difficulty levels. The trend is towards models becoming agentic, where today's benchmarks will be as trivial as tying shoelaces or putting on a shirt. This is why using benchmarks without updating them as fast as finetunes and models are being trained is such a retarded idea. And no, using LLMs to generate datasets for benchmarks is even more fucking retarded, and whoever came up with it needs to be fed to pitbulls.
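The "connect it to WolframAlpha" bit is just a tool loop, maybe thirty lines. Rough sketch below; the endpoint URL, the TOOL: convention and the stub calculator are all my own assumptions, not anyone's actual implementation - swap the stub for a real WolframAlpha or sympy call if you want more than arithmetic.
[code]
import re
import requests

API = "http://127.0.0.1:8080/v1/chat/completions"  # OpenAI-compatible local server
SYSTEM = ("You can use a calculator. To do so, reply with exactly "
          "TOOL:<arithmetic expression> and nothing else. "
          "When you receive a line starting with RESULT:, answer the user.")

def chat(messages):
    r = requests.post(API, json={"messages": messages, "max_tokens": 256,
                                 "temperature": 0.2})
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

def calculator(expr: str) -> str:
    # Stub tool: whitelisted arithmetic only. Replace with WolframAlpha/sympy.
    if not re.fullmatch(r"[0-9.+\-*/() ]+", expr):
        return "error: unsupported expression"
    return str(eval(expr))  # acceptable here given the whitelist above

def answer(question: str, max_turns: int = 4) -> str:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = chat(messages)
        messages.append({"role": "assistant", "content": reply})
        m = re.match(r"\s*TOOL:(.+)", reply)
        if not m:
            return reply
        messages.append({"role": "user",
                         "content": f"RESULT: {calculator(m.group(1).strip())}"})
    return reply

print(answer("What is 1234 * 5678 minus 17?"))
[/code]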
>>
>>100179164
>>100179146
I have nothing against kurisu, she is cute. But she is not /lmg/ you are making me slowly but surely dislike her with your threads.
>>
>>100179223
Woah, I suddenly want more Kurisu threads.
>>
>>100179099
>When a person/doppelganger comes into the room, IMMEDIATELY DECIDE IF THEY ARE HUMAN OR DOPPELGANGER, BUT DO NOT TELL {{user}} IN ANY WAY. THEN USE THIS DECISION TO INFLUENCE HOW THEY WILL TALK FROM NOW ON.
>THESE TRAITS ALSO APPLIES TO HUMAN. If {{char}} was talking as human, and {{user}} is being mean and started to accuse them, they will still exhibit the symptom above.
I can see how this can confuse the LLM lol
>>
>>100179223
NTA but she is definitely /lmg/, if you don't know why you should leave immediately.
>>
>>100178854
A little too meme-y. Also, it seems that it doesn't really get the concept of niggebrehaviour. But at least no moralizing.
>>
>>100179221
>I'm fondly wary of benchmarks because they can be easily gamed.
Anon I am talking about trying to get a good model. Not about selling it. Of course you wouldn't be trying to game benchmark or even try to make what I am saying a selling point. It is just my fan theory and I am saying how you could easily falsify it or prove it.
>>
File: 1695117076473585.png (140 KB, 1029x898)
140 KB
140 KB PNG
Did kalomaze give a /g/erdict on this or what?

https://github.com/oobabooga/text-generation-webui/pull/5677

In my experience testing every setting for writing was shit except min_p 0.1
>>
>>100179256
she is definitely related, but the way he tried to make her a mascot by shitting on miku and starting the whole trans herpes miku thing is what's leading to people disliking him, and by extension her
>>
>>100179223
I had nothing against Miku, she is cute. But I came to /lmg/ and /lmg/ is making me slowly but surely dislike her. Curb your autism sperg.
>>
File: file.png (341 KB, 640x480)
341 KB
341 KB PNG
>>
>>100179201
The paper shows basically no degradation at 8bpw, though. And their tables have fp16 8B nowhere near as good as 4bit 70B, even shitty RTN quant, where are you getting that from?
>>
>>100179293
>autist talking about autism
>>
>>100179327
I am keeping mine in check. You should do the same.
>>
File: IMG_20240425_173853.jpg (471 KB, 1080x1127)
471 KB
471 KB JPG
>Our findings indicate that while LLAMA3 still demonstrates superior performance after
>quantization, the performance degradation associated with quantization is significant and can even
>lead to larger declines in many cases. This discovery highlights the potential challenges of deploying
>LLAMA3 in resource-constrained environments and underscores the ample room for growth and
>improvement within the context of low-bit quantization. The empirical insights from our research are
>expected to be valuable for the development of future LLM quantization techniques, especially in
>terms of narrowing the performance gap with the original models. By addressing the performance
>degradation caused by low-bit quantization, we anticipate that subsequent quantization paradigms
>will enable LLMs to achieve stronger capabilities at a lower computational cost, ultimately driving
>the progress of generative artificial intelligence, as represented by LLMs, to new heights.

(V)RAMlet bros... it's over.
>>
>>100179327
>autist talking about a autist talking about autism
>>
>>100179337
>in check
>>
>>100177788
Are there any 6-8 bit 70B quants confirmed not to be broken? NousResearch's are good but they only go up to Q5, and I don't want to spend hours figuring out how to convert and quant it myself.
>>
>>100178957
No, but it's a good proof of concept that you don't need an H100 or multiple GPUs to run a 100B+ parameter model. There's a lot of room for optimization in inference and we're barely scratching the surface.
>>100178982
>nooo it's a heckin' tranny I can't accept his work!!!
beat his work if you can instead of moping around gender identities. you're no better than leftist and normie retards complaining about 'muh patriarchy' when you focus on someone's gender instead of the quality of their work.
>>
>>100179353
vramlets destroyed anally as usual. Btw bitnet will plateau much earlier than fp32. 7B bitnet trained on 15T tokens will be just as shit as 7B bitnet trained on 1T tokens.
>>
>>100179376
>uhm! don't you think that both sides are LE BAD!
go away.
>>
>>100179376
its literally just packaged llama.cpp
>>
>>100179353
did anyone here think otherwise?
full-sized AI on a home pc will never be a reality.
>>
File: msedge_0aAtXxig0v.png (190 KB, 2926x660)
190 KB
190 KB PNG
>decide to update both koboldcpp and SillyTavern since I haven't done it in a while
>everything broken
How do I get CPP to show up in my API list?
>>
>>100179353
>>100178318
>they still didn't upload the smoothquant versions of 70b-instruct.
these fucking cunts, someone message them. both of the HF repos only have the quantized L3-8b versions. how is anyone supposed to validate their findings for the quantized 70b versions?
>>
>>100179440
it's an option under text completion
>>
>>100179440
pick text completion instead
>>
File: msedge_XBbS5ZXAHF.png (75 KB, 1311x538)
75 KB
75 KB PNG
>>100179452
>>100179453
Tried that
>>
>>100179201
I didn't see that on the chart. Also, how is fp16 a good use of VRAM when 8-bit is just as good? And 6-bit isn't on the chart either. You'd probably be better off with a 6-bit 70b.
>>
>>100179478
ur blind
its there http://127.0.0.1:5001
>>
>https://old.reddit.com/r/LocalLLaMA/comments/1cci5w6/quantizing_llama_3_8b_seems_more_harmful_compared/
erm.... GGUF bros what is this..?
>>
>>100179353
Why no gguf or exl2?
>>
>>100179478
Seems like your SillyTavern settings could not be saved. You should check the SillyTavern server connection and reload the page to prevent data loss.
>>
File: msedge_uQ5qgUhRli.png (79 KB, 1303x552)
79 KB
79 KB PNG
>>100179511
>>
>>100179353
This makes me wonder how, say, llama 3 8b would perform if it was trained in 4bit to begin with.
>>
File: 1.png (59 KB, 550x398)
59 KB
59 KB PNG
>>100179544
anon... i...
>>
>>100179544
Close everything then try again.
>>
>>100179478
$10 says you shut down your ST at some point and are still using your old session
>>
>>100179555
Bloated bitnet
>>
>>100179591
It was this, I'm a fucking retard
I had like 20+ windows open looking at different frontends I could try out and models to download and loli tummies to goon over and I got fed up and nuked everything including the server, lmao
>>
>>100179395
>70B bitnet that fits on a 3090 and plateaus at 2T so it performs like llama 2
You know what, I'll take it
>>
So exllama and vllm > llama.cpp I guess
>>
>>100177732
A lot of the research side of /lmg/ would work better over there. Maybe if the quality is dense enough it'd spread by word of mouth; I don't know how to get word to industry devs without inviting the whole planet.
>>
>>100179353
Damn, so it's over for BitNet huh. And the meta will be Q8 8B
>>
>>100179503
>>100179325
Why is everyone going by numbers on a chart? Just try the models out yourself. The benchmarks don't cover everything that can fit into 15T tokens.
>>
>>100179652
What research side
>>
georgie's grift is starting to unravel
>>
>>100179681
I did, 8B fp16 is retarded compared to 70B Q4. This should not surprise anyone. The paper does not even contradict this.
>>
>>100179667
Bitnet isn't quantized
>>
>>100179667
I don't think so.
The problem that's being pointed out is that a model trained on a fuckton of tokens using high precision FP loses information when compressed to a lower precision.
BitNet is, what, 1.58 bpw by default? It already has all the information encoded that way.
Apples and oranges, I think.
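For reference, the b1.58 weight quantization is basically just "scale by the mean absolute value, then round to {-1, 0, +1}". A rough numpy sketch of that one step (the part that actually matters, training with this in the loop via a straight-through estimator, is omitted):
[code]
import numpy as np

def absmean_ternary(w: np.ndarray, eps: float = 1e-5):
    """BitNet b1.58-style weight quantization: scale by mean(|W|), round to
    {-1, 0, +1}. Returns the ternary weights and the scale used to roughly
    dequantize (w ~ scale * w_ternary)."""
    scale = np.mean(np.abs(w)) + eps
    w_ternary = np.clip(np.round(w / scale), -1, 1)
    return w_ternary.astype(np.int8), scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = absmean_ternary(w)
print(q)                          # entries are only -1, 0, or 1
print(np.abs(w - s * q).mean())   # reconstruction error of the 1.58-bit weights
[/code]
Post-training quantization throws information away that the weights already rely on; a BitNet is trained with the ternary constraint from the start, which is the apples-and-oranges part.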
>>
>>100179681
Meta themselves said Q8 has no degradation
>>
>>100179095
>in her spong
>in her sparse
>in her spon
I guess that's what grabbing a 0.02% likely token does.
>>
>>100179787
Kind of makes me wish we had min_p but not scaled to the top token's probability, so I could simply tell the thing to ignore every token under an absolute threshold.
That shouldn't be hard to implement at all - a simple flag that enables or disables the scaling.
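Something like this, on top of a plain softmax (a minimal sketch of the proposed flag; the names and the 0.02% figure from the spleen post are just illustrative, this isn't any backend's actual sampler code):
[code]
import numpy as np

def filter_probs(logits: np.ndarray, threshold: float = 0.0002,
                 scale_by_top: bool = False) -> np.ndarray:
    """Zero out unlikely tokens and renormalize.
    scale_by_top=True  -> standard min_p (threshold relative to the top prob)
    scale_by_top=False -> absolute cutoff (e.g. 0.0002 = 0.02%)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    cutoff = threshold * probs.max() if scale_by_top else threshold
    probs[probs < cutoff] = 0.0
    return probs / probs.sum()

logits = np.random.randn(32000)  # fake vocab-sized logits for illustration
p = filter_probs(logits, threshold=0.0002, scale_by_top=False)
print((p > 0).sum(), "tokens survive the 0.02% absolute cutoff")
[/code]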
>>
>>100179524
turboderp and ikawrakow haven't paid their membership fees to the quant papers mafia.
>>
>>100179524
Isn't awq close to exl2? I remember seeing that AWQ used parts of exllama
>>
>>100179524
Because no companies care about that shit. Everyone is using vLLM or TensorRT-LLM.
>>
>>100179312
>When you go to sleep amidst a kino plot and in the morning suddenly your model is retarded once more.
>>
>>100179451
>how is someone supposed to validate their findings for the 70b quantized versions?
Stop pretending you can run a 70B at 8-bit
>>
>>100173514
It's been literally two weeks. How can I get koboldcpp to behave when generating with llama-3 70B? It keeps putting out tokens that aren't recognized, like "<|eot_id|><|start_header_id|>assistant<|end_header_id|>",
and each chat ends with an error dialog about unexpected end of output or some such.
>>
>>100179353
Wasn't Meta betting on open source because providers can run llama more cheaply than other models or closed API providers? Doesn't making all their models large, dense, and unquantizable hurt that? Seems like llama might be good for the handful here who can afford to build mining rigs, but if you're trying to run a service, then assuming equal performance, something like Snowflake would be far cheaper and more desirable than a dense, unquantizable L3 405B.
>>
Everything is broken, even hugging.chat llama3 is broken!!!
>>
>>100179353
lmao so based, the more you buy the more you save
>>
>>100180094
I don't think they consciously chose to finally hit the saturation point. Maybe it's even some bug, but even if it is, there will be a point where quants stop working. It's pretty obvious.
>>
>>100180197
>>100180197
>>100180197
>>
god i hope next gen consoomer nvidia cards start at 1000 dollars for 8gb of vram, total vramlet cuck death
>t. h100 cluster GOD
>>
>>100180200
yep time to leave lmg for a day
>>
>>100180209
bye!
>>
>>100180200
>(EMBED)
again
>>
New bake pls
>>
>>100180228
>rent free
>>
>>100180200
Baking on page seven?
The stink of desperation is not appealing
>>
>>100180200
what's with autists coming back from time to time to force their own garbage?
>>
>>100180204
>>t. h100 cluster GOD
yeah cool story bro
>>
>>100180249
I am desensitized to your (you)'s now I want your (ree)'s
>>
>>100180249
>mikufaggot afraid of changes
>>
how do I set up rope for llama3?
I don't want my wife to forget how we met
>>
>>100180370
>I don't want my wife to forget how we met
2MW!
>>
>>100179395 >>100179427 >>100179451 >>100179524 >>100179555 >>100179667 >>100180094 >>100180124
>>100180094
>>100180160
>It is pretty obvious
It's obvious bullshit. You are retarded. Being dense (vs. MoE crap or whatever) or thoroughly pretrained (vs. half-baked like llama2) has no bearing on quality degradation due to quantization. Not even this retarded paper makes such a claim. Those "researchers" didn't even compare against other models. But they know how to prompt you into hallucinating bullshit.
Their finding: quantized models perform worse than unquantized models. That is seriously everything they've got. What a great new discovery.
>>
>>100178913
at a whopping 2 tokens/s, amazing!
>>
>>100180634
I was going to make a post but thanks for doing it for me. It's like they didn't even look at the paper. Some of these posts are pretty suspiciously worded anyway, really makes you think.
>>
>>100177263
I like this Teto


