/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>100154945 & >>100166886

►News
>(04/24) Snowflake Arctic Instruct 128x3B MoE released: https://hf.co/Snowflake/snowflake-arctic-instruct
>(04/23) Phi-3 Mini model released: https://hf.co/microsoft/Phi-3-mini-128k-instruct-onnx
>(04/21) Llama3 70B pruned to 42B parameters: https://hf.co/chargoddard/llama3-42b-v0
>(04/18) Llama3 8B, 70B pretrained and instruction-tuned models released: https://llama.meta.com/llama3/
>(04/17) Mixtral-8x22B-Instruct-v0.1 released: https://mistral.ai/news/mixtral-8x22b/

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling/index.xhtml

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
what are the requirements for using a local model together with an LLM? i have 64GB RAM and 16GB VRAM on an AMD system. i normally use koboldcpp for llms and comfy for SD stuff.
It's over
>>100173514
>Previous threads: >>100154945 & >>100166886
>>100173514
>>100173573
>>100173584
>>100173590
good morning sir!
>>100173573
The absolute state of /lmg/
>>100173573
>an internet connection
>the ability to read
>a lot of time
I think that about sums it up
>>100173573
depends on what size you're willing to run. you should be able to run an 8b - 11b model and have enough space for sd as well, probably.
>people still recommending Mythomax and fucking CR+ to newbie VRAMlets
Why? Is this some form of gatekeeping I'm too deep in to understand?
►Recent Highlights from the Previous Thread: >>100166886
--Enabling Local Language Models to Access External Sources: >>100170746 >>100170878 >>100170905 >>100170924 >>100170942 >>100170947 >>100171324 >>100171202
--Optimizing LLMs for Reasoning: Phi's Limitations and Future Directions: >>100167878 >>100167897
--Anon's Experience with Llama 3 70b Instruct: Shortening Responses Near Context Limit: >>100167911 >>100169713 >>100169792 >>100170112 >>100170062 >>100170270
--Noticeable Quality Drop with Quantization in Llama 3 Models: >>100169493 >>100169506 >>100169525 >>100169914
--Are Lengthy Multi-Rule Prompts Killing Model Creativity?: >>100167192
--Anon's Llama Model Performance Benchmarks: >>100167274 >>100167298 >>100167910 >>100167941 >>100168292
--The Utility of Large Language Models: Beyond Fiction Generation: >>100167521 >>100167544 >>100167555
--Can LLMs Generate PDBs from Decompiled Programs?: >>100167690 >>100167736 >>100167871 >>100170388 >>100170564 >>100170589 >>100170630 >>100170953
--Anon's Take on Meta Stock Drop: Faith, Hope, and Market Volatility: >>100168186 >>100168191 >>100168690 >>100168736 >>100168749 >>100168765 >>100168789
--Best Model for ERP and Productivity Tasks?: >>100171747 >>100172054 >>100172096 >>100172361 >>100172407 >>100172423
--Snapdragon X Plus: Promising AI Performance or Overhyped?: >>100168557 >>100168605 >>100168624 >>100168651 >>100168671 >>100168774
--Llama-3-Instruct Model Discussion: Censorship, Prompt Structure, and Role-Playing: >>100167135 >>100167187 >>100167229 >>100167265 >>100167298 >>100167307 >>100167350 >>100167575 >>100167610 >>100167631
--Understanding the Difference Between Uncensored Models and Psycho Models: >>100167678 >>100167724 >>100167813 >>100167880 >>100170302
--Integrating Comfyui with Stable Diffusion 3: >>100168344 >>100168378 >>100168420 >>100168685
--Miku (free space): >>100168445 >>100166912 >>100170598 >>100171118 >>100173294
►Recent Highlight Posts from the Previous Thread: >>100166891
>>100173573
>how do I use an LLM with an LLM
anon...
>>100173701
>>100173573
>what are the requirements for using a local model together with an LLM?
whoops. i actually meant image gen.
what i want is to run an llm with SD in something like ST. how well does that work?
pls no bully
Anyone try this yet? https://huggingface.co/TheDrummer/Moistral-11B-v3
>>100173717
Wait for true multimodal LLaMa 3, producing perfect Miku images and RP.
>>100173727
I normally love downloading random slop meme models, but that name is stupid as fuck, so no
>>100173745
https://old.reddit.com/r/LocalLLaMA/comments/1cc6xb1/moistral_11b_v3_the_finetuned_moist_just_got/
Reddit seems to love it.
>Cream-Phi-2
kek
https://huggingface.co/TheBloke/platypus-yi-34b-GGUF
This model, of all things, performs the best at ooba's secret benchmark.
>>100171184
I grabbed the L3 8B 64k context model and tried it with a close-to-16k-token chat I have.
It wasn't coherent, so either the claimed 64k context is bs or there might be something wrong with the q8 gguf.
I want to rule out user error at least.
Has anyone else tried it yet?
>koboldcpp rocm updated again
we still hanging in there AMD bros
>>100173826
Guy probably just edited the config and called it a day.
>>100173829
>ITS REAL
aAAaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>There was some big changes upstream, that's why it's taken a while to update kcpp-rocm, trying to get it to work.
YELLOWROSE I LOVE YOU AND WHAT YOU DO FOR AMDBROS
>>100173858
Yeah, probably.
I don't know if extending the context is actually possible.
I'd assume you'd have to retrain the model from the ground up.
>>100173938
Nah, you can do large context tuning.
It does need to be a full finetune though, I doubt a LoRA could handle it.
Tf is the snowflake arctic thing? How much ram?
>>100173869
Nvm, it's busted with models and settings that work on 1.62 :[
>>100174028
Ummmmm YellowRose???
>>100173829
Why don't you just use linux, fucking retard
> keeping the dream alive
>>100173752
localllama is extremely clueless, so that doesn't mean anything
most of them probably upvoted it because le funny name without trying it
>>100173938
Not that anon, but I could swear that SuperHOT LoRA was a thing.
I guess I'm getting it mixed up with SuperCOT.
>>100174329
superhot lora was a thing, and while it mostly worked, a full ft is obviously better
>>100173514
>https://huggingface.co/chargoddard/llama3-42b-v0
So this has 76 mmlu, which is really interesting. Has anyone here tested it? How does it compare to 70B/8B? Is it improved over 8B or is it retarded?
>>100174462
Everyone who tested it called it irreparably retarded.
>>100174470
It is not retarded. It is schizophrenic. It has a beautiful mind but can't communicate its thoughts very well. Honestly everyone ITT should love it because it is so relatable.
currently making a few exl2 quants for Moistral v3. 8bpw and 5.5bpw for 8gb vramlets
I have a macbook air m2 with 8 gb ram laying around because of work. Is there any worthwhile llm I could run on it?
>>100174470
Interesting, they are working on doing the same to the instruct model, let's see if the results change. Time to try frankenmerges for now:
https://huggingface.co/raincandy-u/Llama-3-Aplite-Instruct-4x8B-MoE
>>100174549
Quanted mistral 7B or llama 3 8b, I guess.
>>100174549
hahahahaha, no
>>100174549
that 8gb needs to be shared with the rest of the OS, so you're looking at like 4-6 for the model
you could run quanted llama 3 8b at best
>>100173717
I think you will want to reserve however much space the SD model takes up, and then only load the LLM layers that will fit with your desired context.
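that budgeting is simple enough to sketch. toy numbers below (SD model size, layer count, quant size are made-up placeholders, check your own models):

```python
def layers_that_fit(vram_gb, sd_model_gb, ctx_cache_gb, n_layers, model_gb):
    # Reserve the SD model and KV cache first, then fill what's left with LLM layers.
    free = vram_gb - sd_model_gb - ctx_cache_gb
    per_layer = model_gb / n_layers  # rough: weights spread evenly across layers
    return max(0, min(n_layers, int(free // per_layer)))

# 16GB card, ~4GB SD model, ~1GB KV cache, 8B model at Q4 (~4.5GB, 32 layers)
print(layers_that_fit(16, 4.0, 1.0, 32, 4.5))  # -> 32, everything fits
# same setup on an 8GB card: partial offload
print(layers_that_fit(8, 4.0, 1.0, 32, 4.5))   # -> 21 layers on GPU, rest on CPU
```

whatever doesn't fit just gets offloaded to CPU in kcpp, it's only speed you lose.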
>>100174549
https://huggingface.co/apple/OpenELM
>>100174567
>they
It is a guy.
>>100173914
>I don't know if extending the context is actually possible.
feels like 2023 all over again
>>100174110
Now explain what it means in non-wikipedia faggotry terms
>>100174662
I don't see their pronouns listed anywhere.
any decent phi3 finetunes yet?
>>100174110
So why hasn't anyone done llama.cpp bitnet yet? Is it because everyone is lazy, or because the existing bitnet models use row-wise scaling factors which llama.cpp doesn't support at all?
>>100174662
>>100174691
What if it's a woman? You know, not a troon, but a real vagina.
>>100174110
Just remind the companies that they can release their bitnet models without the fp16 weights, likely making it a huge ordeal to finetune them.
>>100174732
It is a guy.
>>100174691
It is a guy.
>>100174567
>https://huggingface.co/raincandy-u/Llama-3-Aplite-Instruct-4x8B-MoE
>SOMEBODY ACTUALLY MADE A 4x8B
>Q6 is only 20gb
WE ARE SO FVCKING BACK
WE HAVE NEVER BEEN THIS BACK BEFORE
I DONT EVEN CARE IF ITS SLOP
>>100174747
Why are zoomers like this?
>>100174747
Well, post your logs from it
>>100174747
When we get tunes like NousHermes, wizardLM, etc., the frankenmerges will be really good.
>>100174758
Download speeds are bad in america for no reason
C-R+ user here. I tried Llama 3 70B instruct and it was slop. I tried Llama 3 70B base and it was schizophrenic.
What's the deal with the people saying it's good? Is there a magic prompt? You can't even preload context because it only has 8k max.
>>100174780
>for no reason
>>100174780
>for no reason
Oh, there are reasons.
http://irregulators.org/bookofbrokenpromises/
The numbers on that one are slightly inflated IIRC, but the general idea is correct.
>>100174820
>>100174848
I already know the reason, unlike you redditors who don't. it's not 2014 anymore
>>100174797
Llama 3 seems to be sensitive to formatting and templates; if you are using the wrong ones you get schizo. Also make sure to pull the latest frontends, as they all had bugs early on.
>>100174747
>moe frankenmerges
Isn't this basically like merging slop, except you don't do the final step (where you calculate the average of all changes and add it into the base model)? You instead leave all those slop tunes out there so they eat up all the ram, then ask the client to average it out. So you just 4x the required ram for absolutely no reason, except retards will buy it?
>>100174018
It's a 476.27B mixture-of-experts model with 128 experts (2 active). The main download is 1TB. Q8 is 472GB.
It's claiming 4096 context, which is disappointing if true, to say the least.
I've managed to quant it down to Q8 with --skip-unknown and am trying to run it after making a few llama.cpp code tweaks to go beyond 60 experts. It has reserved 486GB of RAM to load at that size.
It's currently outputting tokens for me, but there's some kind of fundamental problem because they appear to be half nonsense.
>弘 Hello saf Season Secretary opportun duties season winter Flora</s></s></s>
>>100174889
Nobody cares, dude.
>>100174912
It's not based on llama. Are you sure llama.cpp has added support for it? It's only been a day; I'm surprised it converted and ran without errors.
>>100174912
I have 512GB of ram, which could fit Q8. Currently downloading to quant too. Why the --skip-unknown?
>CAPTAIN'S LOG 425
llama3 has been out for several weeks and mythomax3 still hasn't been made. neither have any good finetunes like a holodeck or nous hermes. no news on a possible bitnet 70b either. all the hype gone. all the locals have turned to sonnet and opus. halfway through 24 and not a single decent 40b in sight for regular 24gb vram folk that only have a single card.
Do you think we'll ever get that 70B model? And how neutered will it be?
>>100174957
>Are you sure llama.cpp has added support for it
I'm almost certain they haven't. I'm also shocked it works at all.
>>100174973
It's over. Microsoft put the axe to them.
>>100174973
either we get a new 70b trained on llama 3 or nothing
>>100174889
If some models are better than others at a particular task, the output should be weighted toward the better ones (useful for including formatting code, etc. in responses). The idea is that you trade VRAM for more parameters without increasing compute requirements.
It's a terrible trade-off for local inference, though.
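a toy sketch of what that weighting looks like, not mergekit's actual router code: a gate picks the top-k experts per token and blends their outputs with softmaxed scores, so every expert sits in memory but only k of them run.

```python
import math

def route(gate_logits, k=2):
    # Pick the top-k experts and renormalize their gate scores with a softmax.
    topk = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in topk]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(topk, exps)]

def moe_output(x, experts, gate_logits, k=2):
    # Only k experts execute per token: params scale with expert count, compute doesn't.
    return sum(w * experts[i](x) for i, w in route(gate_logits, k))

# three stand-in "experts" on a scalar, for illustration only
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
print(moe_output(10.0, experts, [2.0, 1.0, -1.0], k=2))  # ≈ 13.42
```

which is exactly why it's a bad local trade: you pay ram for all the experts but only get the compute of two.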
>>100174960
>only have a single card.
If you didn't come into this hobby with at least 1 good card and didn't get another one, it's basically joever
>>100174973
We'll get it after llama2-34b finishes red teaming
>>100174986
>If some models are better than others at a particular task the output should be weighted toward the better ones
But that requires training a gate layer that decides where the input goes. Is it this kind of frankenmerge?
>>100174766
OpenHermes/NousHermes is a meme I'll never understand. It contains some good datasets (OpenOrca, Capybara, Airoboros-the good part, Wizard70k) and shit datasets (CamelAi slop, glaive code, alpaca-gpt4, Airoboros-the shit part). The overall result is a mess that can't follow instructions well, is overly verbose, and ignores system prompts, yet people praise it like it's the best tune ever.
>>100174976
https://github.com/ggerganov/llama.cpp/issues/6877
>>100174988
There is zero reason to buy 2 cards just to run llms. 2 cards do nothing for gaming, for ai art, or for music. The only reason for a second card is so you can run unoptimized language models that are inferior to free cloud-based ones.
No thanks, anon. I'm happy with my 4090; when something finally fits on that, we'll be cool. I'm not going to be one of those retards trying to hang 6 cards in open air so I can run an 8x22b that's still dumber than sonnet, which is free.
>>100174998
I don't know enough about MergeKit internals to know what it uses for the base router here. I was assuming a fine-tuned MoE, but you're right that this probably isn't fine-tuned.
>>100175032
thats crazy man, but Im here to run my AI locally.
>>100175032
based, if nvidia still had support for SLI, that would be great
>>100174973
I'm considering buying another 3090 for the fuckhuge version.
>>100175032
>buying 2 cards is unthinkable for him
https://www.reddit.com/r/LocalLLaMA/comments/1c9l181/10x3090_rig_romed82tepyc_7502p_finally_complete/
lol lmao
>>100175032
that's some strong copium there, buddy
>>100175082
>Q: How is the performance? A: To continue the spirit of transparency, I'll load one of the slower/VRAM hogging models. Llama-3 70B in full precision. It takes up about 155GB of VRAM which I've spread across all ten cards intentionally. With this, I'm getting between 3-4 tokens per second depending on how high of context. A little over 4.5 t/s for small context, about 3/s for 15k context.
>he spent 13k for this
What if, when 405B releases, it finally beats GPT4?
But then OAI releases GPT4V and makes GPT4 free.
Would the $10k 10-card 100GB VRAM setups have been worth it?
>>100174096
What's the current AMD Linux meta? I was trying to get exllama running last year, and after getting my rocm installation set up and my torch environment finagled correctly, it ran like absolute shit for larger models/context because flash attention 2 still has no rocm support for consumer hardware. I've checked back every now and again to see if there are any updates, but I've mostly just been using koboldcpp-rocm as well, because it's been the easiest to dial in the right tradeoff between speed and model quality with my graphics card and cpu offloading.
>>100175143
I'm gonna share a secret with you, anon: gpt4 is already free if you pirate it. Logless, trackerless, and it works on your fucking phone. The only cope here are the retards who fell for the vram bait.
>>100175127
>>he spent 13k for this
he admitted in the comments to being an old boomer. got to spend the grandkids' money so they don't inherit anything before kicking the bucket, don't ya know.
>>100174096
because im not a tranny
>>100175143
I don't think people with 4x4090 care if they beat SOTA models; they care about freedom. custom finetunes can beat SOTA at specific tasks
>>100175187
Where are those custom finetunes at, anon?
>>100174960
sonnet and opus have too many claudisms and put the character card above the context, so you can be talking to someone for three hours and make no progress.
>>100174797
CR+ doesn't follow instructions well, at least with the quants that fit in 48GB of VRAM. And the 70B-instruct is pretty good at following instructions and learning from the context. For that reason I prefer to use L3; CR+ is not very usable in that state.
You need to modify the default 'assistant' role of the chat template to remove the censorship.
I think Rope scale/alpha works well to scale the context.
If you're a /aids/-tier promptlet, you might want to stick to CR+.
>>100175143
All micunny rp is free until the FBI asks to open up.
>>100175198
they don't exist yet, but you got the message
>>100175199
>put the character card above the context
you can fix this with author's notes, anon, or a dozen other ways, like appending it to the jailbreak
Llama-3-8B-Instruct-32k is pretty good at staying in character, but it isn't "intelligent" enough to do that and output a UI at the same time.
Not bad.
>>100175171
Spending that much and not even knowing how to use it is painful to watch.
>>100175211
>He bought 4x4090s for custom finetunes that will come in two more weeks
>>100175032
they want to gen loli porn
it's really as simple as that
>>100175162
Flash attention doesn't matter that much for our LLM usage; it mostly matters if using batching. Hell, flash attention is not required or recommended by default with exllama. If you had trouble running models, that was not the cause; for LLMs, AMD has better speed per dollar.
But anyway, I'm also disappointed with how slowly they are working on flash attention. For image gen it significantly reduces vram usage; you can get it working with some hacks on RDNA3, but official support is still supposedly in the works.
I just use llama.cpp since models have gotten so big, but now that we are back with a good 8b, I might go back to running exclusively on GPU with exllama.
>>100175220
I'm gonna tell it to do something and pray? I'm sure that'll work and it won't ignore me 5 messages later.
>>100175143
Still not sending you my logs, Sammy boy.
>>100173826
>>100173858
I tried it after quanting it down to 8bpw in exl2. Works fine up to around 22k context or so with RoPE alpha @ 5, then shits the bed.
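for anyone wondering what alpha actually does: the usual NTK-aware rule (same math as the alpha calculator in the OP) just raises the rope base. sketch below assumes the common head dim of 128; your backend may apply it slightly differently:

```python
def scaled_rope_base(alpha, base=10000.0, head_dim=128):
    # NTK-aware scaling: raising the rope base stretches the low-frequency
    # position components over a longer context while leaving the high
    # frequencies (local ordering) mostly intact.
    return base * alpha ** (head_dim / (head_dim - 2))

print(scaled_rope_base(5))  # alpha 5 on a base-10000 model -> roughly 51294
```

it's a free lunch up to a point, then the model shits the bed exactly like described above.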
>>100175169
I really, really hope this is not one of those snarky copebrag posts where you say "it's possible, i just won't tell you how!" with a smug soyjak face on you. Or it could be plain bait.
With that aside, how do you "pirate" a model with trillions of parameters and run it on your phone? Please enlighten us, and spare us the usual "mmhmh not le telling you bro" reddit shit.
inb4 asking a discord troon for a proxy key or scraping git repos
>>100174960
>all the locals have turned to sonnet and opus
Not locals. Those are the tourists and fake-ass shitposters. Don't go full retard.
>>100175238
I wouldn't buy 4x4090 for that unless I began to work with LLMs, but 2 used 3090s to run 70b Q4 is not a bad idea
>>100175301
I account for some of the claude posts because I wanted to see how green the grass was on the other side. It's purple. Which isn't bad, just different.
There are parts that I want to bring back to llm that claude did better, but it's not worth switching over. It's worth dyeing my own grass for, though.
>>100175277
sounds like a serious skill issue to me, anon. I literally just spoonfed you how to fix your problem. if anything, claude has better recall than any local model; it can reach 200k context. what can our shitty models do? barely 32k, and they forget shit in the middle, so it's basically 8k in front and 8k in the back. what you're talking about is a non-problem for anyone competent.
any cr+ tunes?
>>100175356
My problem isn't with the memory. My problem is with claude having a mind of its own. You can tell it things, you can change the jailbreak, but 5 messages later claude goes "No, I like these tokens better. Cry about it, what you gonna do? Up my repetition penalty?"
>>100175347
It's ok to demo them for the purposes of checking out the enemy. That doesn't mean you've "turned" to them.
>>100175169
>Logless, trackerless
Sounds like some scraped key and server for a big enough company where they run it on their own server instead of through open ai and obviously don't expect someone to have hacked them. But logless and trackerless from open ai is by definition impossible.
>>100175127
>he spent 13k for a talking AI text computer comparable with SOTA that corpos spend billions on
>>100175300
you answered it yourself: you pirate the key. if you're too stupid to do this, then you connect to a proxyhost of someone else who has. it's not copebrag, but I'm not going to spoonfeed you shit you can take time and learn yourself. you know what happens when retards do that? other idiots come along, brag, flaunt, then it gets fixed and I have to figure out a new way to do this shit. no thanks.
>>100175400
skill issue confirmed
>>100175423
>aicg not sending their finest
>>100175423
>you know what happens when retards do that? other idiots come along, brag
Well, why did you say that out loud then, if you don't want more people to ruin it for you? Your best course of action is to shut the fuck up about it and gatekeep it yourself. Maybe you do want to brag after all.
>>100175474
>idiots coping they bought 4 fucking 4090s when shits free
why didn't you just steal them, anon?
I already use GPT-4 for my job. I'm still running local when not on the job. It was never a cost issue. I don't care if you're Sam, /aicg/ or whatever, you're a smelly ass shitposter, fuck off.
Has anyone else been playing around with using something akin to 'Thoughts'/'Plan'/... headers for the response? I think this could generally result in repetition if they are not unique, so I was thinking of filtering them out of the context afterwards to avoid repetition.
First tests seem promising to me.
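the filtering part is trivial if you make the model open with the plan block and separate it from the reply with a blank line. minimal sketch, header names and example text made up:

```python
import re

# Matches a leading "Thoughts:" or "Plan:" block up to the first blank line.
PLAN_RE = re.compile(r"^(?:Thoughts|Plan):.*?\n\n", flags=re.S)

def strip_plan(reply):
    # Keep the planning block for the current generation only; drop it before
    # the reply goes into chat history so the model can't parrot old plans.
    return PLAN_RE.sub("", reply, count=1)

raw = "Thoughts: she is annoyed, keep it short.\n\nFine. Do what you want."
print(strip_plan(raw))  # -> "Fine. Do what you want."
```

ST regex scripts can do the same thing per-message if you don't want to touch code.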
>>100175101
do you find the 275w limit to be the sweet spot? I didn't find any degradation in speed at 250w
>>100175423
>>100175502
mfw dual A6000s keeping me toasty quanting another experiment while yet another /aicg/ streetrat seethes that he has to scrape and scrimp pirated keys for a fleeting taste of the good stuff
>>100175535
wat? It was never a cost issue? Then what is your reason? Are you going to say something stupid like privacy because you're too stupid to obscure your data?
>I'm choosing to eat a shitburger at home BECAUSE REASONS
This is you, anon.
>>100175624
You really thought we were spending thousands of dollars on GPUs because $20/month cost too much?
>>100175624
Learn to cook, brownoid
>>100175575
Mhmm, I'm really seething here that I didn't buy a dozen extra cards. I'm so mad, you have no idea. Boy, if only I had 96gb vram so I could run at a decent quant. Damn, I'm mad. FUCK!
>>100175663
t. net worth: $23,404.68
>>100175644
Honestly I thought it was because you're just a fucking retard, but maybe I'm wrong. That's why I asked you what the reason was. I still think it's because you're a fucking retard, but we'll see if you reply with a good answer.
Fucking cattle, please eat the bugs and own nothing
Dumb question: How to use LCUDA on Windows koboldai? Or is it impossible? Is it better than ROCm?
Is setting rope for llama3 as easy as it was for l2? Does the quality drop significantly or is it completely fine to do it
>>100175683
Do you really think posting your bank account value will win your argument?
>>100175663
>tfw digital streetshitter anon tries very very hard to ironypost
>>100175032
Based
>>100174096
i do. i use dualboot, and i only use SD on linux.
but i usually just use windows due to a few key work-related programs, and kcpp is just easy to use.
>>100175853
Just use VMs with PCI(e) passthrough.
>>100175644
claude isn't 20$ a month; it's 20$ a month to use on their website.
>>100175900
i wouldn't be comfortable sending my scenarios to claude lmao.
>https://huggingface.co/BXBX/Moistral-11B-v3-8.0bpw-h8-exl2
Done quanting Moistral v3 8bpw exl2, fits on 12GB VRAM with full context
5bpw for 8GB vramlets coming soon
>>100175818
Poverty is noble
>>100174958
is that work or your own hw? impressive, very nice.
>>100175687
nta, but I run local models because I enjoy running models locally at home. I get better satisfaction and more enjoyment knowing that it's all on my machine. I wouldn't expect you to understand or care. Even if Claude opus was suddenly free for everyone, I would still choose an inferior local model. We live in different worlds: I started at the bottom, and my gens have only improved over time. GPT, at least, has gotten measurably worse over time. How many times has /aicg/ gone through proxygeddon? compare that to the zero (0) times I have been denied access to my local compute. I enjoy the technology, I enjoy seeing the improvements in models, and I truly do not give a fuck even if corpo models were free with 1 billion context and came with a synchronized vibrating onahole.
>>100175687
NTA, but you talk like an autistic retard
can i get the latest redpill on using llms for coding assistance? im talking:
- explaining code blocks
- searching for bugs
- creating patches from descriptions of the desired effects
also, is it possible to train an llm on a given codebase to make it more useful?
t. lazy retard
>>100176153
The latest is still the oldest. LLMs are only useful for shitting out jeet code and will not be of any use to a human.
>>100176099
>nta but I run local models because I enjoy running models locally at home. I get better satisfaction and more enjoyment knowing that it's all on my machine.
Autism
>Gpt at least has gotten measurably worse over time.
It is still miles better than local models, though
>How many times has /aicg/ gone through proxygeddon? compare that to the zero (0) times I have been denied access to my local compute.
Proxygeddon only happens to poorfags.
>I enjoy the technology, I enjoy seeing the improvements in models, and I truly do not give a fuck even if corpo models were free 1 billion context and came with a synchronized vibrating onahole.
Again, autism.
>>100175740
>Dumb question
Yes.
>How to use LCUDA
What is that?
>on Windows koboldai
Koboldai is the pytorch-based one.
>Is it better than ROCm
ROCm is for AMD GPUs.
If you have Linux, you can use koboldai with a cuda pytorch for your nvidia card, or with a rocm pytorch for your supported amd card. If you have Windows, pytorch+amd = no, and if you have nvidia, using WSL and running cuda pytorch in there is recommended.
>>100173514
wait a second
OP added petra to the miku bread... you're kidding me
>>100176172
ok, but what about reading code? explaining shit, just helping me comb through code, like a million jeets who grep the code on my behalf. is that possible?
would finetuning help here?
>>100176153
It's honestly hard to find a good use for LLMs on coding tasks. I tried multiple times using an LLM, be it local, GPT-4, or opus, on any subject I was knowledgeable about, and they were just a waste of time.
The only time they are useful is looking for basic shit in a language I'm not familiar with; it's faster than using a search engine. But for that, any model is good enough; I currently just use llama 3 8b. I use some neovim plugin, but 80% of the time I use it for editing text like mail or commit messages instead of code.
So has anyone built a chatbot with hierarchical annotated memory yet? Not agent stuff, just simply what we've been using already, except with a better memory system than simple vector DB RAG.
>tfw you realize meta got dedicated pajeets filtering next llama models
>>100176201
I've found L3 70b is the best at spitting out useful code that works out of the box, and it's ok for analysis, but it's hamstrung by its medieval context limits.
Yi-34b-200k is surprisingly good for analysis using in-context training, if the portion of your codebase fits into the context limit.
>>100176199
Gotta give him credit. All it took was some subtlety.
>trained multiple ridiculously performant fine-tunes
which ones?
>>100176287
>LLaMA 3
>extended context from 8K -> 128K
Ok, where have you all been keeping this from me?
>>100176194
>autism
sure, and?
>it is still miles better than local models though
not anymore, at least for erp. and if you were to remove its ability to search the internet, I think it would generally suck at everything with how lobotomized it has gotten
>proxygeddon only happens to poorfags
such as yourself? since a few grand is obviously beyond your purchasing power
>again, autism
thankfully my autism has gotten me a job that pays well enough that I could buy a brand new 3090 every two weeks without compromising my lifestyle or having to draw blood for my mortgage payment
>>100176287
>128K context
Is the idiot mixing up Llama 3 with Phi-3?
>>100176312
>my autism has gotten me a job that pays well enough that I could buy a brand new 3090 every two weeks without compromising my lifestyle
Based buyer and saver
>>100176299
First I've heard of it.
>>100176325
Was Phi's 128k version even real context? Like, why even release the 4k version if they have 128k?
>>100173727
It has the least slop and gpt-isms I've seen in a long while. It's not the smartest, but the vocabulary sells it for me and is a breath of fresh air (been messing around with it for an hour or two).
>>100176325
>twitter AI personality has no idea what he's talking about
many such cases
>>100176299
https://huggingface.co/MaziyarPanahi/Llama-3-8B-Instruct-32k-v0.1-GGUF
https://huggingface.co/NurtureAI/Meta-Llama-3-8B-Instruct-64k-GGUF
>>100176199
I recognize our dear old petra, same antics.
>>100176345
This smile reminds me of that one image that was drawn by the drawfag that ended up trooning.
>>100176369
>32k
>64k
That's not 128K.
It's so funny how certain models work so much better with the wrong prompt format if you aren't trying to use them as an assistant.
Talking about Qwen 1.5 32B specifically, but I've seen that happen with other models too.
>>100176369
>Meta-Llama-3-8B-Instruct-64k
Fake as fuck. No instructions or description of what they did to extend the context. Tested it in exl2 already, and it barely works up to 20k with rope.
I've been playing with the small 8B llama3, and I notice it likes to "O-oh..." me a lot - both with a very simple prompt directly in the llama.cpp API, as well as with some of my favorite cards in SillyTavern.
I haven't tried 70B yet, since I'm on vacation at the moment and only have the 32GB macbook to play with.
Anyone have a sense about mradermacher's older imatrix quants, given that the llama3 ones are broken? I downloaded one of his WizardLM2 8x22B quants, and I'm trying to figure out if I need to get a different one (or download a third of a terabyte to quant it myself).
From https://github.com/ggerganov/llama.cpp/issues/6841 it sounds like the breakage resulted in outright garbage, as opposed to subtle quality loss. So, given that the model I have is not spewing obvious garbage, it seems likely fine. But I wanted to double-check, in case there's an insidious "subtly worse" failure mode that I would never notice.
Do you think the upcoming Phi 7B or 14B will beat Llama 3 8B?
>>100176244
>>100176261
thanks for the info guys, wish i could leave you some reddit gold but this website doesnt let me :(
>>100176506
yes, it will be more slopped
>>100176506
I think it will have strengths and weaknesses relative to Llama but not beat it. It is a very different dataset, and that will show in what it can do well. The 3.8B already beats all 70B+ local models on some problems I tested it with.
miku posters are unhinged
>>100176195
**ZLUDA mb
>>100176550
>unhinged
I think you mean "ascended"
>>100176548
>The 3.8B already beats all 70B+ local models on some problems I tested it with.
I can hardly believe this unless you post logs.
>>100176550
Unfortunately I only have niche tastes, not mental illness. Otherwise I could blame it on mental illness, rather than just being a weirdo.
>>100176566
Good morning, sir.
spoonfeed me an easy way to set up a high quality TTS for text generation webui.
>>100176194
I think anon is retarded for buying hardware without getting net returns on the investment (they could at least sell their GPU compute on vast.ai and pay off the cost of the GPU in like a year, but being a provider is more difficult than just buying from a provider and saving your money). I am looking forward to renting 400+gb of vram for $10-20 an hour to try erping with llama 400b. That hardware would cost me $50,000 (more like $100,000-200,000 with H100's; if I bought the hardware I would go for 2x MI300X or 20x 3090's). Considering I fap for like 10 minutes every day, it would take 2500 days for the $50,000 worth of hardware to be more worth it than renting for $20 a day (and I don't even fap to AI every day, or account for the cost of electricity). But who knows, maybe 400b will be shit for ERP and nobody can finetune it.
>>100176600>erping with llama 400b.>That hardware would cost me $50,000Like, a fifth of that if you're not retarded.
>>100176600>hardware would cost me $50,000still cheaper than getting divorced
70b rp undi finetune wen
>>100176567I guard my test set so that it will never have even the slightest chance of being trained on, so I will not do that. You're free to distrust my claims.
>>100176639monday, 3pm
>>100176287I tried the dolphin 8B finetune and yeah it's uncensored but it made it retarded. I got base 8B to solve a simple math problem (yes, I know) but then the dolphin finetune failed.
>>100176566sir please do not redeem ze miku shartsune bloody bastard kind sir thank you
>>100176550they were never on a hinge to begin with
>>100176623I like arguing over hardware, give me your dream setup for 400b (even if it's Q4, I am probably gonna rent for Q8).
>>100176600>I think anon is retarded for buying hardware without getting net returns on the investment (they could at least sell their GPU compute on vast.ai and pay off the cost of the GPU in like a year, but being a provider is more difficult than just buying from a provider and saving your money).I highly doubt you can break even on electricity costs from vast, let alone pay back your hardware. I feel like the only party making money would be vast.
>>100176725>give me your dream setup for 400b2 x C4140 (8xV100 32GB) = 256GB VRAM for $10k>(even if it's Q4, I am probably gonna rent for Q8).Pretty sure bigger models suffer less by being quantized. Q4 should be fine, but even if it's not, I have a spare 3090 and can offload the rest to RAM.
>>100176731>I feel like the only party making money would be vast.Why bother doing math if you have feels, right?
running local is just stupid in 2024 I really don't see the point and all the arguments are just justifying my reasons further in fact all I'm really seeing is cope and retards with too many cards
How do you know how many context tokens a model can handle?
>>100176813OK, how much profit do you make from vast.ai?
>>100176623>400b.>$50,000What is the price point where you would start considering a mail order bride? And what would be the number of beaks for that price where ai wins over bride?
>>100176960
A wife might be more financially sound if you have zero income, but otherwise you have to consider the 50%+ of all your wealth and income you'd pay in perpetuity
>>100176960Women are fucking expensive to keep happy. More so if you have children. Just one kid will cost you quarter to half a million dollars before you can legally kick them out. So, beaks can 10x and they'd still be cheaper in the long run.
>>100176960NTA but 3D can't compete with AI fantasy roleplay.
>>100176841
It usually says on the model page, but if you're running a gguf it's also in the metadata that's displayed when you load up the model
>>100176960
The only reason I don't have a mail order bride is nobody taught me how to do that. Hell, I think some of the countries pay YOU to get a girl a greencard.
>>100176822You're not wrong but what other options do we have? I'm not waiting ten years for ai to get better I'll just play with it now even if it's bad.
>>100176312
>not anymore, at least for erp.
lol
lmao even
This is false, but even if you weren't saying this out of your ass, how would you know? Aren't you a LOCAL autist? Or are you telling me you tried Claude Opus? Did you lurk aicg to see how good Claude Opus is? I guess this tells a lot about you.
>such as yourself? since a few grand is obviously beyond your purchasing power
Cope. I would rather invest my money to retire earlier than waste all my money on niche hardware that will lose its value and become deprecated in a few years.
>>100176960>mail order brideyeah let me pay to get into a retarded relationship that will simmer with resentment until it explodes, sounds like a great investment
>>100176797
Maybe you might find that at a local liquidation auction that won't accept shipping, but I have a feeling you will only get like half the vram for $10k, and getting 400gb would add up to around $40k.
>>100176813
Not that anon, but I think the people hosting are 100% making money; what anon is referring to is that residentially you don't have access to cheap electricity or cheap ISP service (and I think business rates are cheaper than residential + less taxes, but the downside is you need to own a building in a business zone or something). So it's the same problem people hit mining bitcoin, when mining coins on a gaming GPU cost more in electricity than the coins were worth. I still think you can pay off your 4090 in a few years: at constant 100% power draw (400 watts) and 15 cents per kilowatt-hour, electricity is about $500 per year, while renting it out on vast.ai at 20 cents an hour (below market) gets you $1750, so you pay off your 4090 in a year on paper (realistically it's not at 100% load 24/7, but also not rented 24/7, and that's not counting internet or vast's cut).
>>100176635
Honestly I wish I could cope and say "robo wife is cheaper". I think sex is 100% free if you try to keep it that way, and the only downside is that humans have egos and don't follow and like everything you do, unlike an AI. I don't want a GF because I think women are going to fuck up my self confidence as a virgin, and I won't be truly happy if the girl isn't truly happy. It's weird how girls on dating apps casually had sex with 30 guys; I feel like there is some sort of societal imbalance preventing people from just being together forever. So I guess I'm an AI incel???
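The vast.ai payoff arithmetic above can be sketched out; every number here is the post's own assumption (400 W constant draw, $0.15/kWh, $0.20/hr rental), not a verified market rate:

```python
# Rough break-even sketch for renting out a single 4090 on a compute marketplace.
# All inputs are assumptions from the post, not verified market rates.
WATTS = 400              # assumed constant power draw
KWH_PRICE = 0.15         # $/kWh residential electricity (assumption)
RENT_PER_HOUR = 0.20     # $/hr assumed rental rate ("below market")
HOURS_PER_YEAR = 24 * 365

electricity_cost = WATTS / 1000 * HOURS_PER_YEAR * KWH_PRICE
gross_revenue = RENT_PER_HOUR * HOURS_PER_YEAR
net = gross_revenue - electricity_cost

print(f"electricity ~${electricity_cost:.0f}/yr, "
      f"revenue ~${gross_revenue:.0f}/yr, net ~${net:.0f}/yr")
```

The net comes out around $1200/yr under these assumptions, which is why "pay it off in a year" only works on paper with near-full utilization.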
>>100177130if you don't treat her like shit enough fucking will make any girl love you cause oxytocin. the tough part for most is getting to the fucking part
>>100177136>maybe you might find that at a local liquidation auction that won't accept shipping, but I have a feeling that you will only get like half the vram for $10k, and getting 400gb would add up to around $40k.Again you retards and your feelings. I already have one. Just need to get a second in the upcoming months.
>>100177114
I'll have this hardware and still retire early, and I was specifically referencing GPT4, which has measurably gotten worse; there have been academic papers about this even. Two grand for 2x3090 plus a few hundred for 128gb memory has literally zero bearing on my retirement whatsoever. I am so sorry that you are struggling in life, and I hope that things get easier for you in the future. I'm going to continue having fun with my local models and I'm skeptical that there's anything you can do about it
Can llm learn anything from large code base?
>>100177136>I think sex is 100% free if you try to keep it that wayYeah, but you tend to end up with chicks like picrel
>>100161344
>>100177145idk anon, in my experience you gotta hit that infatuation mark before the fucking for the woman love to set and cure properly
>>100177181
>/lmg/ actually believes they'll be able to run gpt4 on 2x3090s
holy cope batman
well you'll figure it out eventually. how are those L3 finetunes coming along btw?
>>100177263Me on the left side of the right image
>>100177181I see, so you close your eyes to avoid facing the reality... Such unfiltered cope.
>>100177181>2x3090LMAO, I hope you have plans to buy more for LLaMA 3 400B
>>100177263Is the pixel Teto AI-generated? If so, model?
>>100177181>"richfag">didn't buy 4090ngmi
>>100177370only retards go with 4090s ideally you want 10 to 20 3090s to future proof yourself
>>100177340400B won't be noticeably better than 70B anyway. Mememarks aren't everything.
>>100177343https://www.mediafire.com/view/zzr1x9dzf0b9vuz
>>100177380why are you calling CUDA dev retarded? take it back
>>100177382This is true. There's a paper from Google that shows scaling up compute without scaling up training data will net you minimal gains. And from the looks of things we're plateauing data-wise. Sure Altman may try to retard strength it but he won't get his superintelligence that way
>>100177382I actually agree with you. I still think OpenAI/Anthropic has some secret sauce.
>>100177343nta, but pixelArtDiffusionXL_spriteShaper.safetensors [7adffa28d4] works really well for me
>>100177452The secret sauce is 256x1B
>>100177114>Did you lurk aicgNTA but i go in there maybe once a month and the few logs i've seen posted are roughly the same as the ones you see in here, except the perversions are an order of magnitude more retarded.this really is the dumbest shit to get upset over or try to argue about
>>100177501>>100177263>>100177260>>100176910>>100176731https://www.youtube.com/watch?v=fsUvejZPTLI&t=3595s
>>100177452the secret sauce is proprietary datasets containing copyrighted information. beyond that I really don't think they're doing much more than some fancy vector db and plugins to pull from external sources on GPT's end. Claude I don't think has any tricks like that, just a good well-curated dataset.
>>100177380
>only retards go with 4090s ideally you want 10 to 20 3090s to future proof yourself
The 3090 and 4090 have essentially the same memory bandwidth, so 2 3090's will run at the same speed as 2 4090's. The larger the model, the more bandwidth needed; for inference nvlink/pcie is not the bottleneck, memory bandwidth is. So if you can run a Q4 70b model at like 15tk/s on 2 3090's, you should be able to get 3-5tk/s with more 3090's running Q4 400b (someone with 10 3090's loaded 70b at full precision with 150gb of vram usage and it ran at 3-5tk/s https://old.reddit.com/r/LocalLLaMA/comments/1c9l181/10x3090_rig_romed82tepyc_7502p_finally_complete/). If you want something that will run 400b at a fast speed, you need something like an H200 (it costs as much as a luxury car) or AMD's MI300X (a fraction of the price, but requires a special OAM mobo, and AMD LOL).
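The bandwidth argument above can be sketched as a back-of-envelope rule: with layer-split inference each GPU's shard is read serially per token, so throughput is roughly one card's bandwidth divided by the whole model's bytes. The numbers below are rough assumptions, not measurements:

```python
# Back-of-envelope: layer-split inference is memory-bandwidth-bound, so
# tokens/s ~ single-GPU bandwidth / total model bytes (shards read serially).
GB = 1e9
BW_3090 = 936 * GB   # RTX 3090 memory bandwidth (assumed spec figure)

def tokens_per_sec(model_params_b, bits_per_weight, bandwidth=BW_3090):
    """Theoretical peak t/s for a model of given size and quant level."""
    model_bytes = model_params_b * 1e9 * bits_per_weight / 8
    return bandwidth / model_bytes

print(f"70B @ ~Q4:  {tokens_per_sec(70, 4.5):.1f} t/s theoretical peak")
print(f"400B @ ~Q4: {tokens_per_sec(400, 4.5):.1f} t/s theoretical peak")
```

This lands around 24 t/s peak for Q4 70B and roughly 4 t/s for Q4 400B, which is in the same ballpark as the 15 tk/s and 3-5 tk/s figures in the post once real-world overhead is accounted for.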
>>100177599
*take this with a grain of salt, I have zero knowledge in actual AI or benchmarks, I am looking for someone to call me an idiot
>>100176502No way anon. After seeing the way he acts when others point out the holes in his bad imat files I'm staying clear of all his shit
Can anyone point me to code that will let me display images in the Gradio chatbot? I have the image available on local disk and I would like it to present it to me in the chat. Ive tried embedding it as markdown code and returning markdown code to no avail.
>>100177599You are a very smart and valuable person.
>>100174110https://mathchan.org/ai/ needs more love
>>100177599>(someone with 10 3090's loaded 70b full precision with 150gb of vram usage and it ran at 3-5tk/s https://old.reddit.com/r/LocalLLaMA/comments/1c9l181/10x3090_rig_romed82tepyc_7502p_finally_complete/).If that reddit retard knew what tensor parallelism was and wasn't running at full precision, maybe his speeds wouldn't be shit.
>>100177695Can we move /lmg/ there? The captcha would keep out the riffraff. Maybe less raiding and thread hijacking.
>>100177559Ignorance truly is bliss...
This might explain some stuff for people with quanted llama 3 models.
https://www.reddit.com/r/LocalLLaMA/comments/1cci5w6/quantizing_llama_3_8b_seems_more_harmful_compared/
Apparently llama 3 takes a huge hit going down from 8 bit to 6 bit, unlike older models which didn't take a huge hit until under 5 bit.
>>100176005> jew construct.
>>100177263teto with sexo
>How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study
https://arxiv.org/abs/2404.14047
>>100177788
Makes sense. They trained on 15T tokens. Each weight packs a lot more information, meaning quantization is going to hurt more compared to undertrained models.
Guys... what if... Guys, listen! What if we trained an LLM to predict... the previous token of a sentence?
>test my two new 3090s by loading a 3bpw mixtral onto each>temps spike to 80~90C and the fans sound like they're about to take offI guess I'll have to replace the thermal pads on these. The 3090s I already have came with the pads already swapped so I didn't realize how lucky I was.
>>100177599Does VRAM overclock worth it?
>>100177634this - don't be a child and blame llama.cpp for your shitty quant. he is clearly a very emotional individual. bart's quants have usually worked for me and there's way less whining when they don't. like it should be.
>>100176703maybe they figured out "safety" such that trying to finetune it away will make it retarded.
>>100177878pretty much all GPU's are already overclocked, some people are underclocking their GPU so it's more power efficient and so fans don't spin so hard.
>>100177788
It is reddit, so it is like the spergs here that say they see a huge difference between Q8 and Q5 because it touched their cock incorrectly that one time. Except in /lmg/ someone will call him a faggot and a retard, and on reddit people will be nice to him. Any memeplexity measurements done for quants? That is actually the only thing memeplexity is good for. Also makes me think that if he is actually correct (even without giving any source for what was revealed in his dream), that would mean bitnet is dead. The spare unneeded extra accuracy of weights is a thing of the past, and now you are all going to be running a 13B 8bit, or a 30B 8bit for those who got 2 cards.
>>100177938The buzzword to content ratio in this post is off the charts.
>>100177938Why didn't you just read up on BitNet before spouting bullshit about it? I bet you think it's a quant method too huh?
>>100177978Nope I stand by what I said if you don't understand the point then you are dumb.
>>100177938Even if these heavily trained models hurt from quantization more, it doesn't follow that 13BQ8 > 30BQ4. Packing more data into floating point weights is clearly a horribly inefficient and slow process. Even if a bitnet 70B saturates before fp16 70B (which is something I'd worry about), it should still be better than a 30BQ4 of equal size trained for the same time.Also the best way to compare quants is kl divergence, but ppl is a reasonable substitute.
>>100176566She's just a bunch of noise-pollution, a digital abomination created to torture our poor ears. Miku's 'ascension' is just a myth perpetuated by her brainwashed fanbase.
>>100177991Your "point" goes out the window when you're technically incorrect
>>100177899That or the Dolphin dataset is garbage. Because the answers are always really short and bad.
>>100172723What stack are you using?
>>100178124probably some gay c++ library like imgui
>>100178124nta but looks like imgui
>>100178149>>100178151yeah seems like it thanks
>>100178051>you're technically incorrectkill yourself you nigger
>>100178149<= imgui's dev
>Fire up beat saber
>Look for a song to play
>Most of my songs are Miku
>Remember the meltdown yesterday...
Thanks trannies....
>>100178265Beat saber more like meat saber
>>100177831where are the SmoothQuant quantized models of L3-70B-Instruct then? They tested it but they didn't upload the models on their own HF repo? Their repo is here: https://huggingface.co/mit-han-lab
>>100178281It is a god game for basement autists. I hate physical exercise but it tickled my autism enough that I am still playing it at least once a week for 3 years now.
>>100178293It is at the bottom of summary... https://huggingface.co/LLMQ
>>100173727>Closed dataset>KoboldNah.
>>100177831This is why I don't understand why they chose 8 and 70b. They know the market hardware available. Why the fuck aren't they making a 35b or a 40b? What good is a 70b we have to run at low quants? Is the only point to win stupid benchmarks? Okay then we need a benchmark for 40b, what's that? There are no 40b models? Then you lost to yi, congrats zuck, you lost to yi!
>>100177597The Claude dataset must be the most interesting one in all of ML imo. It has such a different personality compared to EVERY other language model. I'd love to know what they did.Given how much smuttier it is than all the others I guess it's possible the only difference is that Anthropic doesn't remove stuff like ASSTR or Literotica from the dataset? But I'm not sure those alone would lead to it having such a different and more human-like personality.
>>100178115It's the dataset. Every single one of them have so much shit data from deciding to ouroboros synthetic data from other bots and not cleaning it up. I can't believe every one of these people decided that the state of things was fine because they were able to improve on prior Llama releases so 3 would be no different. The finetune for Llama 3 was obviously done on mined Meta social network data and there is no way any synthetic data is going to match that quality. I guess I'm going to have to suck it up and download one of those "uncensoring" finetune models if possible but man, this really sucks that the community got that complacent and fine with the state of things. I don't think for the next 3 months anyone is going to be able to fine-tune past what the official instruct release did because of how much work is needed to clean a dataset to get it anywhere near where it needs to be.
My anime image of the day.
Can you make a learning model understand causality? How would you encode causality into a model? What would be your mechanism for making a model understand causality? Would you bruteforce it using statistical techniques?The principles of most models I've seen so far are about encoding world data (text, audio, video, etc..) as compressed bits of information into models. Do you think that current models are able to infer causality from the encoded bits of information as an emergent property? And does it do well the longer you train on the data and the more tokens you feed it?
>>100178471
Right now you need to buy new hardware to run the latest and greatest AI models, and the prices are not going down: right now it's around $1500-2000 to run 70b, and in the next 2 years you will need to spend around $4000 on the next latest and greatest AI.
>>100178725>And does it do well the longer you train on the data and the more tokens you feed it?Yes no maybe? Made me realize that probably at the beginning of training it is learning compression of data instead of actual reasoning. Eventually it should start learning reasoning because it will let it compress more efficiently, but now that I thought about it maybe the problem is that the structure of the network ends up in a sort of local minimum of compression and can't really learn reasoning efficiently? But that would be pretty easy to prove or disprove if you use some benchmark for reasoning during training to see if the reasoning accuracy progresses at the same rate as reciting wikipedia. Also I am just a 4chan moron so I don't know what I am talking about.
>>100173514
I'm trying to figure out how a chatbot could be integrated into a game. I suppose that if you lead with a prompt explaining the NPC's context to the bot, it could talk in the moment, and if you have it say some command, e.g. *follow player* or *attack*, you could have it interact with the world. Is there some extensive research group or similar where one can read up on what ideas people have come up with, and how they execute them?
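A minimal sketch of the *follow player*-style idea: prompt the model to emit actions as `*command*` markers inline with its dialogue, then strip and dispatch them. The command vocabulary and format here are hypothetical, not an existing system:

```python
import re

# Hypothetical command vocabulary for an NPC; the model is prompted to emit
# actions as *command* markers inline with its dialogue.
KNOWN_COMMANDS = {"follow player", "attack", "trade", "flee"}

def parse_npc_reply(text):
    """Split a model reply into (dialogue, list of recognized commands)."""
    commands = [c.lower() for c in re.findall(r"\*([^*]+)\*", text)
                if c.lower() in KNOWN_COMMANDS]
    dialogue = re.sub(r"\*[^*]+\*", "", text).strip()
    return dialogue, commands

dialogue, commands = parse_npc_reply("Very well, I'll come. *follow player*")
print(dialogue)   # the spoken line, markers removed
print(commands)   # the actions for the game loop to execute
```

Whitelisting against `KNOWN_COMMANDS` matters: small models will happily hallucinate commands you never defined, and anything unrecognized should be dropped rather than executed.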
picrel response with prompt from >>100171961
This DPO tune, working and failing at the same time.
https://huggingface.co/mradermacher/Llama3-8B-DPO-uncensored-GGUF
>The sexual tension builds deeper in her spleen, her body responding eagerly.wat
>>100178725Isn't this attention? Statistically bruteforcing what appears to be causality? Is there really some essence of understanding the causal links? Is it just a mirage from dumb rule following? In the Chinese Room does that matter?
>>100178870How is spleen tokenized? Did your sampler have sp- and not pick spine?
>>100178656Whether they (dolphin/hermes authors) like it or not, they'll eventually have to scale finetuning data down to curate it properly instead of continuing to use millions of GPTsloppy examples. A relatively small hand-curated finetuning dataset (~10^3-10^4 examples) + large human preference dataset (in the order of 10^5-10^6 examples or more) should be the proper way.
>>100178744Not so fast richnigga https://hacks.mozilla.org/2024/04/llamafiles-progress-four-months-in/"Today, you can today use the very latest and most capable open models with llamafile thanks to her hard work. For example, we were able to roll-out llamafiles for Meta’s newest LLaMA 3 models–8B-Instruct and 70B-Instruct–within a day of their release. With yesterday’s 0.8 release, llamafile can also run Grok, Mixtral 8x22B, and Command-R."When llamafile hits mainstream, there would be a shift to use server-class processors with 64-cores, dual-channel mode, and DDR5-6400 for inference-only purposes.
>>100176365>twitter AI fag>unironically calls it "X">"No-Code" in bioevery time
>>100178870>he's never had a hooker massage his spleenget a load of this pleb
>>100178913>llamafile can also run GrokIs this a reason to be proud?
>>100178913yup... I'm thinkin' jart won
>>100178913>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>her
>>100178871If I understand it correctly, no. Attention and Flash Attention are just prioritizing assigning weights to the most 'important' information for the task.>>100178801Reasoning benchmarks are retarded. The thesis of the MSFT's Phi team already proved it - you can just train a relatively small LLM up to 8B on specific datasets and it will score high on reasoning benchmarks but when the user uses it, it will be retarded as fuck. The best 'benchmark' that should be used is to make these LLMs navigate a maze in a rogue-like fashion and take the average of their runs.
/lmg/, please give me some RP situations that 7B/8B models usually suck at.
are there any good models for generating 3d meshes?
Has anyone tried bark.cpp yet?
Some days ago gguf were broken because newlines got merged. Is that fixed by now?
>>100179014>Reasoning benchmarks are retarded.You could just use pure math benchmark. It is all just to check the trend and if it is gradually getting better as it is getting better at compressing data. Or if you see the math result improvement slow down while compression result continue to improve then it is probably becoming just a retarded winrar for text.
>burgers are home
fuck. it's all tech support from here
>>100178888>pic related>>100178924Can't say I have.
>>100179024https://www.chub.ai/characters/Vyrea_Aster/doppelganger-interrogation-simulator-654daf19
>>100179076https://github.com/PABannier/bark.cpp>no .exe no thanks.
>>100179094I can make the next thread tech support edition.
>>100179146haven't you made enough threads?
>>100179154After yesterday? I am just getting started.
>>100179094good morning sir please kind bastard redeem the american burger home thanks!
>>100177831
There's only one fp16 8B quant
https://huggingface.co/MaziyarPanahi/Meta-Llama-3-8B-Instruct-GGUF/tree/main
Also, these results could explain why so many think there's "8B" retardation. At fp16 it's superior to Q4 70B, which is huge (and an actual good use of 24GB of VRAM).
>>100179092
Ehhh, there's lots of caveats with benchmarks, especially task-oriented ones like math benchmarks. Take Phi-3, connect it to WolframAlpha, and you have your own agentic math buddy; I'm fairly wary of benchmarks because they can be easily gamed. Testing the models needs to be more broad and active - stochastic scenarios of different levels. The trend is towards them becoming agentic - benchmarks will be just like tying shoelaces or putting on shirts for them. This is why using benchmarks and not updating them as fast as finetunes or models are being trained is such a retarded idea. And no, using LLMs to generate datasets for benchmarks is even more fucking retarded, and whoever came up with it needs to be fucking fed to pitbulls.
>>100179164>>100179146I have nothing against kurisu, she is cute. But she is not /lmg/ you are making me slowly but surely dislike her with your threads.
>>100179223Woah, I suddenly want more Kurisu threads.
>>100179099>When a person/doppelganger comes into the room, IMMEDIATELY DECIDE IF THEY ARE HUMAN OR DOPPELGANGER, BUT DO NOT TELL {{user}} IN ANY WAY. THEN USE THIS DECISION TO INFLUENCE HOW THEY WILL TALK FROM NOW ON.>THESE TRAITS ALSO APPLIES TO HUMAN. If {{char}} was talking as human, and {{user}} is being mean and started to accuse them, they will still exhibit the symptom above.I can see how this can confuse the LLM lol
>>100179223NTA but she is definitely /lmg/, if you don't know why you should leave immediately.
>>100178854A little too meme-y. Also, it seems that it doesn't really get the concept of niggebrehaviour. But at least no moralizing.
>>100179221>I'm fondly wary of benchmarks because they can be easily gamed.Anon I am talking about trying to get a good model. Not about selling it. Of course you wouldn't be trying to game benchmark or even try to make what I am saying a selling point. It is just my fan theory and I am saying how you could easily falsify it or prove it.
Did kalomaze give a /g/erdict on this or what?https://github.com/oobabooga/text-generation-webui/pull/5677In my experience testing every setting for writing was shit except min_p 0.1
>>100179256she is definitely related, but the way he tried to make her a mascot by shitting on miku and starting the whole trans herpes miku thing is what's leading to people disliking him, and by extension her
>>100179223I had nothing against Miku, she is cute. But I came to /lmg/ and /lmg/ is making me slowly but surely dislike her. Curb your autism sperg.
>>100179201The paper shows basically no degradation at 8bpw, though. And their tables have fp16 8B nowhere near as good as 4bit 70B, even shitty RTN quant, where are you getting that from?
>>100179293>autist talking about autism
>>100179327I am keeping mine in check. You should do the same.
>Our findings indicate that while LLAMA3 still demonstrates superior performance after quantization, the performance degradation associated with quantization is significant and can even lead to larger declines in many cases. This discovery highlights the potential challenges of deploying LLAMA3 in resource-constrained environments and underscores the ample room for growth and improvement within the context of low-bit quantization. The empirical insights from our research are expected to be valuable for the development of future LLM quantization techniques, especially in terms of narrowing the performance gap with the original models. By addressing the performance degradation caused by low-bit quantization, we anticipate that subsequent quantization paradigms will enable LLMs to achieve stronger capabilities at a lower computational cost, ultimately driving the progress of generative artificial intelligence, as represented by LLMs, to new heights.
(V)RAMlet bros... it's over.
>>100179327>autist talking about a autist talking about autism
>>100179337>in check
>>100177788Are there any 6-8 70B confirmed not to be broken? NousResearch is good but they only did up to Q5 and I don't want to spend hours figure out how to convert and quant myself.
>>100178957No, but it's a good proof of concept that you don't need an H100 or multiple GPUs to run a 100B+ parameter model. There's a lot of room for optimization of inferencing and we're barely scratching it.>>100178982>nooo it's a heckin' tranny I can't accept his work!!!beat his work if you can instead of moping around gender identities. you're no better than leftist and normie retards complaining about 'muh patriarchy' when you focus on someone's gender instead of the quality of their work.
>>100179353vramlets destroyed anally as usual. Btw bitnet will plateau much earlier than fp32. 7B bitnet trained on 15T tokens will be just as shit as 7B bitnet trained on 1T tokens.
>>100179376>uhm! don't you think that both sides are LE BAD! go away.
>>100179376its literally just packaged llama.cpp
>>100179353did anyone here think otherwise?full-sized AI on a home pc will never be a reality.
>decide to update both koboldcpp and SillyTavern since I haven't done it in a while>everything brokenHow do I get CPP to show up in my API list?
>>100179353>>100178318>they still didn't upload the smoothquant versions of 70b-instruct. these fucking cunts someone message them. both of the HF repos only have the L3-8b quantized versions. how is someone supposed to validate their findings for the 70b quantized versions?
>>100179440it's an option under text completion
>>100179440pick text completion instead
>>100179452>>100179453Tried that
>>100179201I didn't see that on the chart. Also how can fp16 be good use when fp8 is just as good. Also fp6 isn't on the chart. Prob use 6 bit 70b.
>>100179478ur blind its there http://127.0.0.1:5001
>https://old.reddit.com/r/LocalLLaMA/comments/1cci5w6/quantizing_llama_3_8b_seems_more_harmful_compared/
erm.... GGUF bros what is this..?
>>100179353Why no gguf or exl2?
>>100179478Seems like your SillyTavern settings could not be saved. You should check the SillyTavern server connection and reload the page to prevent data loss.
>>100179511
>>100179353This makes me wonder how, say, llama 3 8b would perform if it was trained in 4bit to begin with.
>>100179544anon... i...
>>100179544Close everything then try again.
>>100179478$10 says you shut down your ST at some point and are still using your old session
>>100179555Bloated bitnet
>>100179591It was this, I'm a fucking retardI had like 20+ windows open looking at different frontends I could try out and models to download and loli tummies to goon over and I got fed up and nuked everything including the server, lmao
>>100179395>70B bitnet that fits on a 3090 and plateaus at 2T so it performs like llama 2You know what, I'll take it
So exllama and vllm > llama.cpp I guess
>>100177732A lot of the research side of /lmg/ would work better over there. Maybe if quality is dense enough it'd spread by word of mouth, I don't know how to give word to industry devs without inviting the planet.
>>100179353Damn, so it's over for BitNet huh. And the meta will be Q8 8B
>>100179503
>>100179325
Why is everyone going by numbers on a chart? Just try the models out yourself. The benchmarks do not cover everything that can fit into 15T tokens.
>>100179652What research side
georgie's grift is starting to unravel
>>100179681I did, 8B fp16 is retarded compared to 70B Q4. This should not surprise anyone. The paper does not even contradict this.
>>100179667Bitnet isn't quantized
>>100179667I don't think so.The problem that's being pointed out is that a model trained on a fuckton of tokens using high precision FP loses information when compressed to a lower precision.BitNet is, what, 1.58 bpw by default? It already has all the information encoded that way.Apples and oranges, I think.
>>100179681Meta themselves said Q8 is no degredation
>>100179095
>in her spong
>in her sparse
>in her spon
I guess that's what grabbing a 0.02% likely token does.
>>100179787
Kind of makes me wish we had minP but not scaled to the top token probability, so that I could simply tell the thing to ignore every token under a certain threshold. Which shouldn't be hard to implement at all: a simple flag that enables or disables the scaling.
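A minimal sketch of what that flag could look like (this is the post's proposal, not an existing sampler API): with `scale_to_top=True` it behaves like ordinary min-p, with `False` the threshold becomes an absolute probability floor.

```python
# Sketch of min-p filtering with an optional flag for an absolute probability
# floor instead of one scaled by the top token's probability.
# Hypothetical API, not an existing sampler implementation.
def min_p_filter(probs, min_p, scale_to_top=True):
    """Return a renormalized {token_index: prob} dict of kept tokens."""
    threshold = min_p * max(probs) if scale_to_top else min_p
    kept = {i: p for i, p in enumerate(probs) if p >= threshold}
    total = sum(kept.values())
    return {i: p / total for i, p in kept.items()}

probs = [0.5, 0.3, 0.12, 0.06, 0.02]
print(min_p_filter(probs, 0.1))                      # relative floor: 0.1 * 0.5 = 0.05
print(min_p_filter(probs, 0.1, scale_to_top=False))  # absolute floor: 0.1
```

With these toy probabilities the relative mode keeps four tokens (floor 0.05) while the absolute mode keeps three (floor 0.10), which is exactly the stricter cutoff the post is asking for.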
>>100179524turboderp and ikawrakow haven't paid their membership fees to the quant papers mafia.
>>100179524Isn't awq close to exl2? I remember seeing that AWQ used parts of exllama
>>100179524Because no companies care about that shit. Everyone is using vLLM or TensorRT-LLM.
>>100179312>When you go to sleep amidst a kino plot and in the morning suddenly your model is retarded once more.
>>100179451>how is someone supposed to validate their findings for the 70b quantized versions?
Stop pretending you can run 70B 8bit
>>100173514It's been literally two weeks. How can I get koboldcpp to behave when generating with llama-3 70B? It keeps putting out tokens that aren't recognized, like "|eot_id|><|start_header_id|>assistant<|end_header_id|>". And each chat ends with an error dialog about unexpected end of output or some such.
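Those are llama-3's chat-template special tokens leaking into the output, which usually means the backend isn't treating `<|eot_id|>` as a stop token. Until the backend handles it, a crude frontend-side workaround is to truncate at the first special token; a sketch (stop-string list taken from the llama-3 chat format, function name is mine):

```python
# Truncate generated text at the first llama-3 special token, if any.
STOP_STRINGS = ["<|eot_id|>", "<|end_of_text|>", "<|start_header_id|>"]

def truncate_at_stop(text, stops=STOP_STRINGS):
    cut = len(text)
    for s in stops:
        i = text.find(s)
        if i != -1:
            cut = min(cut, i)
    return text[:cut]

out = "Sure, here you go.<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
print(truncate_at_stop(out))  # -> "Sure, here you go."
```

Most frontends (ST included) already expose a custom stopping-strings field that does the same thing, which is the easier fix.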
>>100179353Wasn't Meta betting on open source because providers can use llama cheaper than other models or closed api providers? Doesn't making all their models large, dense, and unquantizable hurt that? Seems like llama might be good for a handful here that can afford to build mining rigs, but if you're trying to run a service, assuming equal performance, something like Snowflake would be far cheaper and more desirable than a dense unquantizable L3 405B.
Everything is broken, even hugging.chat llama3 is broken!!!
>>100179353lmao so based, the more you buy the more you save
>>100180094I don't think they consciously chose to finally hit the saturation point. Or maybe it is some bug, but even if it is, there will be a point where quants stop working. It is pretty obvious.
>>100180197>>100180197>>100180197
god i hope next gen consoomer nvidia cards start at 1000 dollars for 8gb of vram, total vramlet cuck death>t. h100 cluster GOD
>>100180200yep time to leave lmg for a day
>>100180209bye!
>>100180200again
New bake pls
>>100180228>rent free
>>100180200Baking on page seven? The stink of desperation is not appealing
>>100180200what's with autists coming back from time to time to force their own garbage?
>>100180204>>t. h100 cluster GOD
yeah cool story bro
>>100180249I am desensitized to your (you)'s now I want your (ree)'s
>>100180249>mikufaggot afraid of changes
how do I set up RoPE for llama3? I don't want my wife to forget how we met
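There are two common RoPE knobs for stretching context (if I remember right, koboldcpp exposes them as --ropeconfig <freq-scale> <freq-base>): linear scaling shrinks position indices, NTK-style scaling raises the frequency base. A sketch of what linear scaling actually does to the rotation angles; llama-3's default base of 500000 is real, the specific values and function name are just for illustration, not a recommended config:

```python
import numpy as np

def rope_angles(pos, dim=8, base=500000.0, freq_scale=1.0):
    # Standard RoPE inverse frequencies for each pair of dims,
    # with linear position scaling applied via freq_scale.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return (pos * freq_scale) * inv_freq  # angles fed to sin/cos

# With freq_scale=0.5, position 8192 "looks like" position 4096 to the
# model, which is how linear scaling doubles the usable context window.
print(np.allclose(rope_angles(8192, freq_scale=0.5), rope_angles(4096)))
```

Whether your wife actually remembers depends on the model tolerating the stretched positions; llama-3 was only pretrained to 8k, so expect degradation past modest scaling without a finetune.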
>>100180370>I don't want my wife to forget how we met
2MW!
>>100179395 >>100179427 >>100179451 >>100179524 >>100179555 >>100179667 >>100180094 >>100180124 >>100180160
>It is pretty obvious
It's obvious bullshit. You are retarded. Being dense (vs. MoE crap or whatever) or thoroughly pretrained (vs. half-baked like llama2) has no bearing on quality degradation due to quantization. And not even this retarded paper makes such a claim. Those "researchers" didn't even compare to other models. But they know how to prompt you to hallucinate bullshit.
Their finding: quantized models perform worse than unquantized models. This is seriously everything they've got. What a great new discovery.
>>100178913at a full 2 tokens/s, amazing!
>>100180634I was going to make a post but thanks for doing it for me. It's like they didn't even look at the paper. Some of these posts are pretty suspiciously worded anyway, really makes you think.
>>100177263I like this Teto