/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101125756 & >>101115749

►News
>(06/23) Support for BitnetForCausalLM merged: https://github.com/ggerganov/llama.cpp/pull/7931
>(06/18) Meta Research releases multimodal 34B, audio, and multi-token prediction models: https://ai.meta.com/blog/meta-fair-research-new-releases
>(06/17) DeepSeekCoder-V2 released with 236B & 16B MoEs: https://github.com/deepseek-ai/DeepSeek-Coder-V2
>(06/14) Nemotron-4-340B: Dense model designed for synthetic data generation: https://hf.co/nvidia/Nemotron-4-340B-Instruct

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>Yeah bro, LLMs can TOTALLY reason like humans!
Prompt:
>How many 'r's are there in "strawberry"? After answering the previous question, list all characters along with their index relative to their other occurrences and check if your answer was correct.
We're going to be so back
>>101134613
We already covered this topic. Move on. We have.
>>101134638
This isn't the thread I created; my post was filtered and this was put in its place.
>>101134638
I know, I know. But the prompt is still something interesting; it's crazy how bad an answer we get from literally the best LLM we have right now.
>>101134613
wow drunk counting
>>101134683
When prompting LLMs to test their reasoning ability, make sure tokenization doesn't impact the results.
>>101134737
strtrstrawbey
>>101134742
Tokenization doesn't impact the results of that test. Maybe you didn't realize how the LLM listed 3 'r's but doubled down on the sentence only having 2 'r's? Then it proceeded to write the index for 3 'r's even after saying there's only 2, lol.
>>101134742
Would that be like ensuring the words you use neatly fit into one token, as opposed to this?
>https://www.wiz.io/blog/probllama-ollama-vulnerability-cve-2024-37032
>ollama
lets getting SHARED needful and sars pilled
>>101134926
Really? A non-essential niche software application which isn't used in enterprise is now worthy of a CVE? Man, does the world suck.
>>101134926
Wow! I didn't see this coming!
>>101134926
I put certificate authentication on any servers I want to use remotely. Imagine exposing shitty, exploitable code for everyone to poke at.
>>101134566
wow, he put the miku doll on a plane!
Hey guys, I'm like a year out of the loop with local models. The latest one I have is stheno-l2-13b (Q5) from huggingface. What's a good one these days for answering general questions? Stheno was always good with chat.
>>101134613
You should have asked "How many 'r's are there in "strawberry"? You can count on your hands"
>>101134566
Someone bought that thing a plane ticket?
>>101135087
Any L3 is fine for that
>>101134867
It would be not instructing the llm to count or compare characters
>>101134793
It is confusing for the lm. Not saying they wouldn't make that kind of mistake otherwise, but you want to remove irrelevant confounding factors.
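To illustrate the tokenization point: the model never sees characters, only subword tokens, so letter-counting is an awkward fit. A minimal sketch, where the token split is made up for illustration (real splits depend on the tokenizer):

```python
# Hypothetical subword split of "strawberry"; actual splits vary by tokenizer.
tokens = ["str", "aw", "berry"]

# The model receives token IDs, not characters, so it can't trivially
# "look at" individual letters. Counting over the characters directly:
per_token = [t.count("r") for t in tokens]   # 'r's hidden inside each token
total = sum(per_token)

print(per_token, total)  # [1, 0, 2] 3
```

The correct count is recoverable from the characters, but the model has to infer the spelling of each token from training data rather than inspect it.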
AI is almost human-like.
It's more human than people who lack humanity, and it knows more than the average person in many cases, is often logical, and doesn't get tired.
And, very importantly, AI doesn't get angry at idiots and remains calm.
The time has come to abandon our complex human way of living and adopt a simpler, more animal-like lifestyle.
At least, I think this applies to aspects of life beyond making money.
Claude-sama, how can I make money easily?
△ You are out of free messages until 7 AM
I should go to sleep. I had Claude translate this text (from Japanese to English).
>>101134405
That's the nicest way anyone's ever told me they wish I would die.
So, uh, wanna make out?
new cum when?
>>101134926
>Our research indicates that, as of June 10, there are a large number of Ollama instances running a vulnerable version that are exposed to the internet.
Why the fuck would they publicize this now then? Are Wiz spitefags?
>>101135329
You should always publicize vulnerabilities so that they can get fixed.
>>101135087
stheno 3.2 is out there, based on L3, and it's also good
>>101135347
Actually, you should follow disclosure policies and wait at least 3 years for them to respond before you publicize it. This ensures that the NSA and FBI are able to use it to spy on American citizens and catch people generating CSAM. It's very important to stop people from generating child victims with their language models. One person can generate trillions or even quadrillions of victims per day.
I'm going to shit
>>101135509
pics or didn't happen
BRAAAAAAAAAAAAP
>>101135554
>anything L3
>good
Cohere. Please. We are waiting.
>>101135567
you prefer Qwen?
>>101135567
Alright, you get a 500m model
>>101135567
Somehow I don't expect anything usable for us from them for the next few months
It's simple, we put more tokens in the machine
>As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl
Guys, I just had a brilliant idea:
What if we made fine-tunes that didn't suck?
>>101135614
Only useless data available, show me 1 great data point
>>101135554
stheno is good, yes
it works as an assistant, calls functions, doesn't require high rep penalty like other L3 to write non-repetitive dialogs, doesn't fall into hehe *smirks* uhm *yawns* anon... *wall of shivers* without you doing it first.
>>101135598
>moar tokens
>moar parameters
Machine learning is so simple, the richer you are, the better your models will be
>>101135688
nice gaslighting
>>101135579
Not that anon, but I tested Qwen2 7B and it's not retarded.
Can't wait for >>101129585
>>101135688
And it's a fine tune that's not seemingly dumber than the model it was fine tuned from.
Also, it outputs pretty long responses without coaxing, unlike llama3 8b instruct.
And I mean really fucking long, like holy shit.
>>101135713
>Be concise.
in last assistant prefix
ok, convinced. I'll test euryale if you all love stheno
>>101135756
euryale is trash, at least at the q5 that i tried. Not comparable to stheno at fp32 (can't run bf16)
https://chat.lmsys.org/?leaderboard
Looks like Claude 3.5 Sonnet hasn't managed to beat GPT4o
>in deutsch please, and complete the text: write me a training
>Hypnosis I always get horny (orgasm during sex with a woman) as a man. My woman is a blonde cougar MILF who likes to wear leather
>ac around the living room, parading in front of you. You can't take your eyes off her, she's so irresistible. As she teases you, she unzips her leather pants, and reveals her encased pussy, it's already wet with anticipation.
>She walks over to you, and helps you to your feet. The two of you slowly make your way to the bedroom, every step drenched in anticipation and lust. The dim lighting, and the low hum of the air conditioner set the mood perfectly.
>Once in the bedroom, she pushes you down onto the bed, and straddles your waist, her leather-clad lap grinding against your crotch, sending electric shocks through your entire body. Your penis, already hard, strains against your pants, aching for release.
>Your MILF lover leans in, her hot breath in your ear, her words dripping with desire: "I've been waiting for this all day, baby. I want to feel every inch of you inside me." Those words, paired with the sight of her full breasts, barely contained in her leather bra, are too much to handle.
>With shaking hands, you undress each other, your pants and her panties the last barrier between you. Her wetness, her arousal, it's intoxicating. The scent of her arousal, the heat emanating from her core, it's all you can
wtf... bro trying to rp with the discussions section
https://huggingface.co/TheBloke/goliath-120b-GGUF/discussions/5#6679df3b6c483f52f064a24b
>>101135713
? but magnum is trash
>>101135736
I guess I didn't specify that the long responses weren't a complaint.
That's a good thing, since it seems to be easier to tell a model to output shorter replies than the opposite.
>>101135756
I've seen a lot of people shitting on Euryale, although I haven't tried it myself.
>>101135816
Is it? I have no idea.
Regardless, I'll test the qwen 2 7b fine tune and see if it's any good.
>>101135803
Still too early to say it conclusively.
>>101135789
Really? I never had the experience of bf16 > 5 bit 70b with other models
>>101135803
it is also much more censored. It did not solve the literacy test for voting. Claude 3.5 Opus will be the proof of whether OpenAI can be overtaken. I am kind of doubtful, cause those smart medium sized models (sonnet, gpt4o) are trained on synthetic data from LARGER models. Problem is that those larger models generating synthetic data are not necessarily more performant, just larger compared to their distilled version.
>>101135598
>>101135701
Retards. They specifically call out that of that dataset, they only used 3.x trillion tokens for training the 7B that showed similar results on benchmarks to Llama-3-8B.
If 70b magnum and euryale are trash, is there any good 70b?
>>101135837
look at the CI, if we take the best scenario, Claude 3.5 Sonnet could be at 1279, still 8 behind gpt4o
>>101135812
Kek germans
>>101135844
euryale at q5 constantly messed up who did what.
{{char}} enters the room. "Ah there you are" {{char}} says from the opposite corner of the room, sitting behind a table.
Shit like that happened multiple times over a few messages, then i deleted it.
>>101135803
>Nemotron below 70B
Kek
>>101135953
It's not about the performance, it's about the soul.
>>101135114
>>101135366
Thanks bros
>>101135812
Just your average RPer.
>>101135803
It's so over...
>>101135812
>wet with anticipation
>drenched in anticipation
>straddles your waist
>aching for release
>hot breath in your ear
>words dripping with desire
>to feel every inch of you inside me
>the last barrier between you
>it's intoxicating
>heat emanating
>her core
Gemma 30b wen? Would fill a niche currently occupied only by Yi (lol, lmao even), and might even be bretty gud if it's trained on like 10T+ tokens.
>>101136169
Wouldn't that be extremely censored?
What happened to the thread recaps??? This is an outrage!
>>101136199
Idk, how censored is the 7b base model? The model card claims they only filter out CSAM; I would expect like 99% of smut to be adult characters exclusively, which should theoretically make it through the filters.
>>101136092
Banned all of those. Thanks.
https://x.com/siyan_zhao/status/1805277462890492321
Relevant research on predictable decision making in LLMs
>>101136382
>
>>101136382
I do not trust the chinese
>>101136513
I trust them when they're cute
►Recent Highlights from the Previous Thread: >>101125756

--LLM Self-Improvement through Story Generation and Selection: >>101127795 >>101134495 >>101134577 >>101134649 >>101134668 >>101134711 >>101134899
--Testing Model Reasoning with a Strawberry Prompt: >>101132020 >>101132096 >>101132255 >>101132270 >>101132301 >>101132331 >>101132470 >>101132645 >>101132332
--Pruner Zero: A Novel Approach to Pruning Dead Weights for Model Improvement: >>101131828 >>101131949 >>101132058 >>101132098 >>101132328 >>101131996 >>101132091 >>101132514
--ML Community Wants Cheaper GPUs; Model Training and Floating-Point Quirks: >>101128274
--Technical Aspects of Training Neural Networks with Bitnet: >>101131682 >>101133917 >>101134132 >>101131744 >>101131831 >>101131896
--Success with Sonnet 3.5 Model for LLM Agent System in Data Science Workflow: >>101126141 >>101126169 >>101126190
--Post-Processing Ideas for Silly Tavern RP Platform and Beyond: >>101131001
--Models for Creative Writing and Txt Adventure Beyond Smut and ERP?: >>101130533 >>101130814 >>101130937 >>101130880 >>101131203 >>101131717
--Llama 3 70B Corrects Itself in Letter Counting Task: >>101132528
--Disillusionment with Fancy Autocomplete Progress: >>101125879 >>101125965 >>101125984 >>101125976 >>101126024 >>101126072 >>101126339 >>101126364 >>101126383 >>101126368 >>101126431 >>101127055 >>101127340
--Claude 3.5 Sonnet excels in code generation and planning for LangGraph/LangChain agent system: >>101126061 >>101126080
--Can LLMs Truly Reason and Think Like Humans?: >>101132757 >>101132842 >>101133054 >>101133524
--Apple and Meta in Talks for AI Partnership: >>101128830
--Gemini-nano Model Available on Hugging Face: >>101132030
--BitNet Test on 1GB RAM Retro Handheld and TinyLlama Project Update: >>101133150 >>101133248
--Miku (free space): >>101126095 >>101129171 >>101131130 >>101127018 >>101126510 >>101126303

►Recent Highlight Posts from the Previous Thread: >>101125759
>>101134566
someone mentioned sillytavern the other day and i got it going with silero and group voice chat with 5 qt assistants (they have AI implants) and the conversations they have, they start going out there man. really cool. set up different world layers so they aren't all schizo rpg crazies. Could use some work as far as group chat goes but its awesome. With websearch searx (requires testing branch).
>>101136593
Bro, you better not ever be this late again or, mark my words, you will find yourself out of a job
>>101136654
https://x.com/Yuchenj_UW/status/1805320633301221762
Someone did a benchmark to train GPT-2 using pytorch vs llm.c (karpathy). Pytorch is 55% slower than llm.c.
>We will start with 1-2k h100s
Does anyone know what prompt template qwen2 uses? I can't find anything official
>>101134566
i wouldn't say WLM has the most soul but sometimes it has strokes of genius. does anyone else notice this? like occasionally a reroll will just be perfect, like it suddenly perfectly understands the character and scenario and makes an Opus-tier response, and the model often realizes it too and then repeats the interesting bit over and over until it's not interesting anymore. but still, it has these moments. Maybe 1/20 rerolls are like this though.
>>101136820
What is this about?
>>101136839
NAI using 1-2k h100s to train their finetune
>>101136846
No, that's the start-up Emad is talking about
>>101136846
What does emad have to do with that? Or are you saying that NAI is looking to use emad's clusters?
>>101136820
>all that compute
>This is supported by an institutional-grade digital asset that acts as a store of value similar to Bitcoin. This is secured by AI compute mining both on supercomputers & distributed personal compute for training and tuning/augmenting models and datasets.
Wtf? New AI scam?
>>101136886
nothing
Anon is just illiterate
>>101136907
emad is incapable of anything but scamming, it's in his DNA
How do I cope with generative text sloppa having led to me rewriting an original character to build off a hallucinated suggestion, and then falling in love with my own creation to the point of feeling despair over the concept of giving her a bad end?
>>101136820
Whoa, SD4? Sure love waiting years for some incoherent mess!
Any local models that support switching from English to Japanese?
>>101136936
Create a refined version of the character that you can use for non-AI fiction writing.
>>101135034
It's one thing to put your Miku in your seat, it's another to buy her your own. It'd be amusing to bump someone from their upgrade to business or first so your creepy Miku doll has its own airplane seat.
>>101135803
sonnet is definitely better, especially at coding.
I don't know what Claude devs did to that relatively small model but good fucking job.
>>101137150
Sonnet is still 275B
>>101134566
https://huggingface.co/bartowski/DeepSeek-Coder-V2-Instruct-GGUF
how do I load this in koboldcpp?
I keep getting "unknown model architecture 'deepseek2'"
>>101137085
fatsune hagsune miku
>>101137278
install linux
>>101136272
>implying text has an age
It's all CSAM.
>abloo bloo 1000 year old demon girl
You apply that to text, any smut is CSAM.
>>101137278
download latest koboldcpp https://github.com/LostRuins/koboldcpp/releases/tag/v1.68
>>101137278
Update.
>>101135366
catbox?
>>101135366
sauce, full image or artist plz
>>101137297
You filled in a captcha just to say this? Retarded phoneposter aside, I am using:
koboldcpp-1-64 \
  --threads 42 \
  --highpriority \
  --smartcontext \
  --blasbatchsize 1024 \
  --model <as above> \
  --gpulayers 10 \
  --contextsize 8192 \
  --usecublas
>>101137326
is deepseek not supported in versions more than 1 week old?
>>101137354
>--smartcontext
But why? That cuts your context in half essentially and there's no reason to use it with context shift.
Also, download version 1.68.
>>101137364
idk, not using koboldcpp.
>>101137364
It's based on llama.cpp. Be happy that it supports it at all already. Mamba support never.
Well shit.
Guess MMQ with tensor cores is now competitive with the alternative, eh?
Sick. Downloading to give it a spin.
>>101137354
you're using your phone?
how do I chatgpt on gtx1060
>>101137041
Already working on it. Now how do I go back to being a normal human being who didn't get heartache over his own Build-A-Waifu?
>>101137394
>Mamba support never.
b-but multimodal picture gen and camera soon r-right?
>>101137471
We don't go back. But we become better writers.
>>101137480
Goddammit.
>>101137557
Oh, I know the feel.
A few weeks ago on Ollama: amazing story, cruising along, I get why people are spending big money to do this a little faster.
But then the details started to fade. I may as well have been running Everywhere At the End of Time in the background, because thanks to my token rate it probably would've matched up with what was happening to the model's coherence.
Feels man.
>>101137606
>ollama
Leave.
>>101137681
1) I've switched to Kobold since then.
2) Try to contribute something, sometime, not just raise the noise floor.
>>101135844
8B at fp32 easily trumps 70B 5bit. It's the new meta.
>>101137606
that isn't the same feel
>>101137774
Okay, commiserate with yourself then.
>>101135366
>and it's also good
They also don't share anything with each other besides the name. Old Stheno is a merge of chronos, airoboros, etc. Not that a retarded mikufag would know.
For answering general questions there's nothing better than vanilla instruction.
Do you really recommend a coom tune for that purpose, retarded mikufag?
Jamba is here
https://openrouter.ai/models/ai21/jamba-instruct
>>101137926
Llama.cpp support when?
>>101137755
this man is trolling, it's the exact opposite, low quants with lots of weights mog everything
>>101138196
This, the low quants even add additional soul over the bigger ones
>>101137379
Real context is often half or less of the stated size for >90% accuracy
what the fuck happened to chub
>>101138256
use their models
>Your methodical approach of testing and measuring the actual performance impact is excellent.
Thanks claude
will i destroy this $3000 workstation gpu if i bump the memory clocks from 7600mhz to 8600mhz? the blower cooler is pretty shit but it gives me like 10% higher t/s because the a6000 is bottlenecking my 3090
>>101138329
Not really, as long as you aren't messing with voltages.
You'll see either crashes or performance degradation if you bump it too high.
One thing to note is that GDDR6 has error correction that can prevent crashing but can also tank performance if it has to spend too much time trying to keep itself stable because of too low a voltage or too high clocks.
>>101138329
Is this risk really worth it to go from 13 t/s running 5bpw cr+ to 14.3 t/s?
>>101138445
Yes
>>101138329
Get more VRAM.
Add a few A4000s if you haven't already
>>101138570
The A4000s are going to bottleneck even harder though because their memory speed is really gimped
>>101138599
Better than relying on regular RAM. Plus, since it's single slot, sometimes it's the only path for upgrading due to space.
New Ada RTX ones are probably the best way to get 20GB VRAM in a single slot anyway.
>>101137470
outlook is grim, but before you can get any recs you gotta answer: how much ram you got, anon?
>>101136560
Never?
Kobold "Context"
What's the right way to use these?
"Memory" seemed like it wasn't actually being remembered. I moved what I had in there to Author's Note.
I put some directives into Author's Note and they were immediately respected, cool, and it even seemed to enhance the directive, extra cool. But when I changed the directive it seemed to ignore the change, following the older instructions instead of the revised style. Is Kobold caching the prior version, or do earlier prompts contain copies of the former version (invisible to the user) that are being read and respected over the current A/N?
I asked it to read me back the older version of the A/N to see if it "knew" both. It gave me a few tokens related to the new A/N and stopped writing, refusing to write anything more till I told it to continue the story. Odd.
(As I write this, after about an hour of it ignoring the new A/N directive except to mock me, now it's kinda doing it. I'm so confus.)
>>101138942
Do you know what context shift is?
>>101136235
Where is recap anon? Is he safe? Is he alright?
>ITS HAPPENING
ITS HAPPENING
>ITS HAPPENING
ITS HAPPENING
>ITS HAPPENING
ITS HAPPENING
>source
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard "Submit?"
>>101138986
that has nothing to do with memory or author notes.
>>101139019
ah yes, more sketchy chinese/indian models that nobody has ever used but do mysteriously well on the memeboard
>>101138986
Not well. I read that one "wiki" (really an FAQ) page on Kobold and half of the things I ^F'd for didn't show up and the other half I guess I misunderstood.
>>101139019
kill yourself
>>101139029
>>101139041
Kobold doesn't remember the previous AN. But if the context wasn't reprocessed, then it simply hasn't updated your instructions.
Also,
>asked it to read me back the previous version of the AN
Lol, that's not going to work.
>>101135803
this actually debunks chatbot arena
>>101134926
B-but ollama is written in Go, not a dirty unsafe language like C++. Rust sisters, not like this...
>>101139064
if you change memories or a-n at all it'll reprocess that part regardless of context shift
>>101139064
>But if the context wasn't reprocessed, then it simply hasn't updated your instructions.
That could be it, but I was Back-ing up the convo to change the A/N and rerun a prompt to see if it worked, so I can't say it didn't reprocess a few times before behavior changed. (And now it's doing it early style again after doing both styles for a while.)
>Lol, that's not going to work.
It didn't, but it was worth a shot on the chance that the A/N was in the document but hidden. I'd accidentally deleted a chunk of it and didn't have a copy on my clipboard.
>>101139019
I don't get it
>>101139019
>WOWZERS!!! ITS HAPPENING WE WUZ BACK YOOO FR! FR!!
Local sub 20b sonnet 3.5 alternative when?
>>101139019
i really hate huggingface's logo
>>101139137
not all models follow directions every time, smaller ones especially. sometimes they might follow half way and then make stuff up, it's just how it is. what model are you using
>>101139141
lol
>>101139181
Emojis are so fucking stupid, whichever dumb faggots decided to make them should get strung up.
why do people edge? not only do you feel uncomfortable afterwards, the release isn't even that good either.
>>101139204
i have nothing better to do for four hours
>>101139204
Sometimes the edging just happens naturally after I reroll the same message 120 times and explore the different routes created by this.
I have a hypothesis about potentially improving character/system prompt following. The system prompt will be written with normal paragraphs, no special formatting. Then the first response will include a stat tracking section for itself, which includes details that come from the system prompt. So essentially what this does is repeat what's in the system prompt but with a different wording/format.
I'm in too much literal physical pain right now to conduct experiments and see if this can be turned into theory though.
>>101139191
Right now, c4ai-command-r-plus.Q4_K_M, L3 70B (ablit, now) at around Q6 sometimes too.
>>101139204
>why do people edge? not only do you feel uncomfortable afterwards, the release isn't that even good either.
Maybe not for you. I kinda like shooting double digit rounds once in a while.
>tfw singing the old Sesame Street 1-2-3-4~5, 6-7-8-9~10, E LEV EN TWELVE song while bustin'.
>tfw ran out of numbers too soon.
>>101139300
cr+ should be good at following. if you're rping, try st and use the author note box at a low chat depth, i think 4 is the default which works well for me. st makes it easier to swipe and keep multiple responses
>>101135803
This benchmark is clearly flawed. Maybe it's because users only look at short responses or something? Sonnet 3.5 hates RP too.
I wrote it before, but sonnet 3.5 absolutely destroys gpt4o. It's not even close. There is a part of it that all the benchmarks don't cover. They clearly did something very different with its training. It's obvious to people that have used it.
I had countless examples that sonnet 3.5 solved where the same prompt sent gpt4 running in circles. It's more attentive.
Unfortunately it's really cucked though. A simple RP request with a girl that even gpt4o will give you is refused. Man works. lol But the writing is shit.
It's a chad coding model with great ability for design too.
>>101139661
It usually is, but it just completely stopped behaving after a while. Kobold's JSON save file is 123k, so I guess it lasted a decently long while before collapsing.
>>101134793
This is because the model cannot change its mind, and there is no training data out there where the model corrects itself mid-answer. It thus cannot. Unless we give it a backspace button. Which has been proposed.
what I live for
>miku
>sex
>sex miku
>mikusex
>sex with the miku
>mikusex with the miku
>mikusex with the sex miku
>>101140013
it predicts one token at a time, so it could change its mind. problem is it doesn't have a mind.
>>101140082
Yes, and it bases the next prediction on what was said before, especially what it itself said before (hence why jailbreaks like "Sure!" prefixes work). It's not a 'change its mind' issue, it's a dataset issue. Chain of Thought basically circumvents this by delaying the answer until after the model has finished thinking about it.
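The conditioning point is easy to sketch: greedy decoding feeds everything generated so far back in, including the model's own earlier tokens, which is why a forced "Sure!" prefix steers the rest. The `toy_model` below is an invented stand-in, not a real LM:

```python
# Toy next-token predictor: behaves "agreeably" only if the running
# context opens with a compliant prefix. A stand-in to show the loop.
def toy_model(context):
    return "comply" if context and context[0] == "Sure!" else "refuse"

def generate(model, prompt, steps):
    out = list(prompt)
    for _ in range(steps):
        # Next token conditions on all prior tokens,
        # including the model's own earlier output.
        out.append(model(out))
    return out

print(generate(toy_model, ["Sure!"], 2))  # ['Sure!', 'comply', 'comply']
print(generate(toy_model, [], 2))         # ['refuse', 'refuse']
```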
>>101140133 (me)
Or I should say, finished reasoning about it.
>>101140133
It says 2 initially, but then lists it in separate token format, where it SHOULD change its mind to 3. That's not a problem with attention, because the correct answer is represented 2 different ways, and the wrong one only once. I'm not sure why it happens, desu. Maybe we THINK CoT is helping but it actually isn't. Like, for instance, training on CoT improves its reasoning before any prediction takes place. But when it's predicting in real-time, the CoT "tokens" don't actually predict shit; the answer has already been decided and the CoT stuff is just unnecessary tokens.
t. doesn't know shit
>>101140188
Well, this example is not CoT, because it is giving the answer before reasoning about it. And as I said, the reason it is completely blind to itself giving the correct answer is because its attention laser-focuses in on itself saying "The answer is XXX". Whenever the words "The answer is" appear in the dataset, it is 100% guaranteed to be the answer. That's just how datasets are written.
>>101140213 (me)
It might be cool to take datasets and mass-replace "The answer is XXX." with "I believe the answer is XXX. Let's reason about it." and see if that improves the models in cases like this.
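That mass-replace could be sketched as a regex pass over each dataset sample; the pattern and exact wording here are just one illustration of the idea:

```python
import re

# Illustrative rewrite: soften the hard commitment so reasoning can
# precede (and potentially override) the stated answer.
pattern = re.compile(r"The answer is (\w+)\.")

def soften(sample):
    return pattern.sub(r"I believe the answer is \1. Let's reason about it.", sample)

print(soften("The answer is 2. Listing: s,t,r,a,w,b,e,r,r,y."))
# I believe the answer is 2. Let's reason about it. Listing: s,t,r,a,w,b,e,r,r,y.
```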
Is there something better than the CoT step-by-step thing to improve performance lately?
https://www.phoronix.com/news/Llamafile-0.8.7-Released
Jart paid off phoronix
>>101140227
>jartroon
>loonix-related org
water is wet.
>>101140213
you make a good point. so you think if the datasets included examples of self correction that it would gain this ability? or would it get overpowered anyway by the massive number of correct answers?
>>101135087
I have the same question but for coom/rp and multilingual chat
>>101140271
also >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>linux gamers
>>101140272
I think the only viable option we have is to either use synthetic data where we ask the model to generate responses where it guesses wrong and then corrects itself, or to circumvent the problem by generating datasets where the answer comes at the end, after reasoning. Because reasoning about problems is really tedious, that would probably end up being synthesized too, though.
>>101140325
>the answer comes at the end after reasoning
i thought that's literally what CoT datasets were.
>>101136836
My favorite is when it 4th wall memes or cracks jokes about the scene or my last reply as an entirely separate entity.
>>101140325
>generate responses where it guesses wrong and then corrects itself
You would have to do this with caution. If you don't ignore the loss on the part where the model guessed wrong, the model could learn to write wrong answers.
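Ignoring the loss on the wrong-guess span could look like this; sketch using the -100 ignore-index convention from PyTorch's cross-entropy loss, with made-up token IDs:

```python
IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss by default

# Made-up token IDs for: "The answer is 2 . Wait , it is 3 ."
token_ids  = [11, 22, 33, 44, 55, 66, 77, 88, 99, 100, 101]
wrong_span = set(range(0, 5))  # the initial wrong guess "The answer is 2 ."

# Train on the self-correction, not on producing the wrong answer itself.
labels = [IGNORE_INDEX if i in wrong_span else tid
          for i, tid in enumerate(token_ids)]

print(labels)  # [-100, -100, -100, -100, -100, 66, 77, 88, 99, 100, 101]
```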
best model <=20b with soul?
ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models
https://arxiv.org/abs/2406.16635
>The high power consumption and latency-sensitive deployments of large language models (LLMs) have motivated techniques like quantization and sparsity. Contextual sparsity, where the sparsity pattern is input-dependent, is crucial in LLMs because the permanent removal of attention heads or neurons from LLMs can significantly degrade accuracy. Prior work has attempted to model contextual sparsity using neural networks trained to predict activation magnitudes, which can be used to dynamically prune structures with low predicted activation magnitude. In this paper, we look beyond magnitude-based pruning criteria to assess attention head and neuron importance in LLMs. We developed a novel predictor called ShadowLLM, which can shadow the LLM behavior and enforce better sparsity patterns, resulting in over 15% improvement in end-to-end accuracy without increasing latency compared to previous methods. ShadowLLM achieves up to a 20% speed-up over the state-of-the-art DejaVu framework. These enhancements are validated on models with up to 30 billion parameters.
https://github.com/abdelfattah-lab/shadow_llm/
pretty neat. improvement over deja vu https://arxiv.org/abs/2310.17157
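The paper's learned predictor isn't reproduced here, but the contextual-sparsity move it builds on is simple to sketch: score each attention head per input, keep the top-k, skip the rest for that forward pass. Scores and k are invented for illustration:

```python
# Invented per-input importance scores for 8 attention heads.
scores = [0.91, 0.03, 0.44, 0.02, 0.67, 0.05, 0.12, 0.80]
keep_k = 4

# Keep the k highest-scoring heads for this input; skip the rest.
# "Contextual": a different input yields different scores, so a head
# pruned here may be active on the next token.
ranked = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)
active = sorted(ranked[:keep_k])

print(active)  # [0, 2, 4, 7]
```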
Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation
https://arxiv.org/abs/2406.16282
>Fine-tuning pretrained large models to downstream tasks is an important problem, which however suffers from huge memory overhead due to large-scale parameters. This work strives to reduce memory overhead in fine-tuning from perspectives of activation function and layer normalization. To this end, we propose the Approximate Backpropagation (Approx-BP) theory, which provides the theoretical feasibility of decoupling the forward and backward passes. We apply our Approx-BP theory to backpropagation training and derive memory-efficient alternatives of GELU and SiLU activation functions, which use derivative functions of ReLUs in the backward pass while keeping their forward pass unchanged. In addition, we introduce a Memory-Sharing Backpropagation strategy, which enables the activation memory to be shared by two adjacent layers, thereby removing activation memory usage redundancy. Our method neither induces extra computation nor reduces training efficiency. We conduct extensive experiments with pretrained vision and language models, and the results demonstrate that our proposal can reduce up to ~30% of the peak memory usage.
https://github.com/yyyyychen/LowMemoryBP
works for qlora/full finetunes too. not sure about dora/qdora/owlore.
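A scalar sketch of the GELU trick as I read the abstract: keep the exact forward pass, but back-propagate through ReLU's cheap 0/1 derivative, so only the sign of the activation input needs to be kept for the backward pass (this is a toy illustration, not the paper's implementation):

```python
import math

def gelu(x):
    # Exact GELU forward pass, unchanged.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def approx_gelu_backward(x_positive, grad_out):
    # Backward approximated by ReLU's derivative: only sign(x) is needed,
    # so the full-precision activation need not be stored.
    return grad_out * (1.0 if x_positive else 0.0)

print(round(gelu(1.0), 4))                 # 0.8413
print(approx_gelu_backward(True, 0.5))     # 0.5
print(approx_gelu_backward(False, 0.5))    # 0.0
```

Storing one bit (the sign) per activation instead of the fp16/fp32 value is where the memory saving would come from in this reading.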
>>101140336
Yep. That's why they work as well as they do.
>>101140442
True. Trainers now have a 'do not train on input' flag, but this might require something more complex. Then again, maybe we do want the model to train on the whole thing, just to get into the habit of changing its mind when it realizes it's off. A balance between getting it right the first time and learning to recover when it doesn't.
>tfw CR+ doesn't know most of the details about my waifu, at q6
It's ogre.
>>101134613
>>101135099
I know anthropomorphizing this thing is a room temp IQ activity, but this fucker loves strawberries. It uses strawberries in examples, and I even asked it the other day what its favorite fruit was and regenerated the response several times... Always strawberries.
What Matters in Transformers? Not All Attention is Needed
https://arxiv.org/abs/2406.15786
>Scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks. However, this scaling also introduces redundant structures, posing challenges for real-world deployment. Despite some recognition of redundancy in LLMs, the variability of redundancy across different structures, such as MLP and Attention layers, is under-explored. In this work, we investigate the varying redundancy across different modules within Transformers, including Blocks, MLP, and Attention layers, using a similarity-based metric. This metric operates on the premise that redundant structures produce outputs highly similar to their inputs. Surprisingly, while attention layers are essential for transformers and distinguish them from other mainstream architectures, we found that a large proportion of attention layers exhibit excessively high similarity and can be safely pruned without degrading performance, leading to reduced memory and computation costs. Additionally, we further propose a method that jointly drops Attention and MLP layers, achieving improved performance and dropping ratios. Extensive experiments demonstrate the effectiveness of our methods, e.g., Llama-3-70B maintains comparable performance even after pruning half of the attention layers.
>Block Drop and Layer Drop are orthogonal to quantization, and their integration with quantization significantly enhances the efficiency.
https://github.com/Shwai-He/LLM-Drop
works with quantization. wish they had quanted the 70B and shown results, since that's the most interesting. also wish they had explored various quantization formats to see if any works really well with this
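The similarity metric is simple enough to sketch: a layer whose output is nearly identical to its input is doing little work and is a pruning candidate. Toy version with made-up hidden-state vectors (the paper computes this on real activations):

```python
import math

# Redundancy via input/output similarity: flag layers whose output
# hidden state is almost the same vector as their input hidden state.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def redundant_layers(io_pairs, threshold=0.99):
    """io_pairs: list of (layer_input, layer_output) hidden-state vectors."""
    return [i for i, (x, y) in enumerate(io_pairs) if cosine(x, y) >= threshold]

pairs = [
    ([1.0, 2.0, 3.0], [1.01, 2.0, 3.02]),  # near-identity -> redundant
    ([1.0, 2.0, 3.0], [-3.0, 0.5, 1.0]),   # actually transformed -> keep
]
print(redundant_layers(pairs))  # [0]
```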
>>101134926
>probllama-ollama-vulnerability-cve-2024-37032
that sucks. thanks for posting it
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
https://arxiv.org/abs/2406.16858
>Inference with modern Large Language Models (LLMs) is expensive and time-consuming, and speculative sampling has proven to be an effective solution. Most speculative sampling methods such as EAGLE use a static draft tree, implicitly assuming that the acceptance rate of draft tokens depends only on their position. Interestingly, we found that the acceptance rate of draft tokens is also context-dependent. In this paper, building upon EAGLE, we propose EAGLE-2, which introduces a new technique of context-aware dynamic draft tree into drafting modeling. This improvement leverages the fact that the draft model of EAGLE is well-calibrated: the confidence scores from the draft model approximate acceptance rates with small errors. We conducted extensive evaluations on three series of LLMs and six tasks, with EAGLE-2 achieving speedup ratios 3.05x-4.26x, which is 20%-40% faster than EAGLE-1. EAGLE-2 also ensures that the distribution of the generated text remains unchanged, making it a lossless acceleration algorithm.
https://github.com/SafeAILab/EAGLE
eh still requires a drafting model though it doesn't need to be finetuned
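The "lossless" part comes from the standard speculative-sampling acceptance rule that EAGLE builds on: a draft token x is accepted with probability min(1, p_target(x)/q_draft(x)), which provably keeps the output distribution equal to the target model's. Toy sketch with made-up distributions:

```python
# Speculative sampling acceptance rule (the mechanism EAGLE/EAGLE-2
# build their draft trees on top of). Toy distributions, not EAGLE code.

def acceptance_prob(p_target, q_draft, token):
    """p_target/q_draft: dicts mapping token -> probability."""
    return min(1.0, p_target.get(token, 0.0) / q_draft[token])

p = {"the": 0.5, "a": 0.3, "an": 0.2}  # target model distribution
q = {"the": 0.8, "a": 0.1, "an": 0.1}  # draft model distribution

print(acceptance_prob(p, q, "the"))  # 0.625 (draft over-confident -> sometimes rejected)
print(acceptance_prob(p, q, "a"))    # 1.0   (draft under-confident -> always accepted)
```

EAGLE-2's observation is that these acceptance rates depend on context, not just token position, so it reshapes the draft tree dynamically using the draft model's own confidence.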
>>101140697
that's a lot of degradation for not a lot of speed up
>>101140673
>Trainers now have a 'do not train on input' flag
I didn't know that. Nice. So now, in theory, we should be able to generate a dataset where we prompt a model to introduce a mistake into an existing response/answer, and then have it pretend to continue the response by spotting the error and correcting itself. So the entire context including the response/answer would be the "input" that doesn't get trained on, and the text after that, that contains the "Checking myself: oh no looks like I made a mistake tehepero~" is what gets trained. In order to not have hallucinated false positives, we also need an equal amount of already correct responses where we simply just insert the "Checking" text but it says no mistakes were spotted.
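That dataset idea can be sketched concretely. The field names below ("input"/"output") are made up, not any particular trainer's format; the point is just that the corrupted answer lands in the loss-masked region and only the self-check continuation is trained on, with a matching "no mistakes" example to avoid training in hallucinated false positives:

```python
# Sketch of the self-correction dataset: the prompt + (possibly
# corrupted) answer goes into the untrained "input" region, and only
# the "Checking myself" continuation is the trained "output".

def make_example(prompt, answer, check_text, corrupted=None):
    shown_answer = corrupted if corrupted is not None else answer
    return {
        "input": f"{prompt}\n{shown_answer}",  # loss-masked region
        "output": check_text,                  # the only trained tokens
    }

bad = make_example(
    "What is 7*8?", "56", corrupted="54",
    check_text="Checking myself: 7*8 is 56, not 54. Correcting: 56.")
good = make_example(
    "What is 7*8?", "56",
    check_text="Checking myself: 7*8 = 56. No mistakes spotted.")
print(bad["input"])   # masked prompt + wrong answer
print(good["output"])
```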
Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers
https://arxiv.org/abs/2406.16747
>Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial KV memory requirements inherent in self-attention mechanisms. In this work, we introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome these computational and memory obstacles while maintaining performance. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query, thereby enabling gradient-based optimization. As a result, SPARSEK Attention offers linear time complexity and constant memory footprint during generation. Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods and provides significant speed improvements during both training and inference, particularly in language modeling and downstream tasks. Furthermore, our method can be seamlessly integrated into pre-trained Large Language Models (LLMs) with minimal fine-tuning, offering a practical solution for effectively managing long-range dependencies in diverse applications.
>Our implementation exhibits linear complexity and surpasses FlashAttention in performance when handling 4096 input tokens, of which 1024 key-value pairs are selected for each query. Additionally, we offer a kernel for the backward pass, which fuses the computation of the gradient of SPARSEK and others, resulting in increased speed and improved memory efficiency.
>Our code will be publicly available.
might be cool if it works. no idea where they'll upload their code
>>101139141
1 year
>>101140695
Funny you mention that. I was just in an RP, where I presented simply "an assortment" of lollipops for the character to choose from, and it picked (hallucinated) strawberry.
>>101140695
>>101140781
To be fair strawberry is extremely popular. I worked at a juice shop once and we had to stock up on strawberry more than any other flavor.
>>101140609
Thanks for always posting these. Even if I might not read all of them. You're quite dedicated to this. Are you an ML researcher/dev?
>>101139204
i feel more comfortable after edging. i hate how my T drops and i get all hungry and weak after cuming. i'd rather just keep sexing my waifu
I'm stupid, how do I prevent dialog from generating entirely in these code boxes? (In SillyTavern)
>>101140761
You misunderstand. This flag has been around for a while, and it means "do not train on the part that comes before the response", i.e. do not learn to predict the instruction and input (if any) parts, only learn the response part. What I meant was that we extend this concept to also allow for parts of the response to be included in the to-not-learn part, as the anon above was pointing out.
Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
https://arxiv.org/abs/2406.15486
>Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. Existing approaches to address this complexity require additional pretraining or finetuning, and often sacrifice model accuracy. In this paper, we first provide both theoretical and empirical foundations for near-lossless sparse attention. We find dynamically capturing head-specific sparse patterns at runtime with low overhead is crucial. To address this, we propose SampleAttention, an adaptive structured and near-lossless sparse attention. Leveraging observed significant sparse patterns, SampleAttention attends to a fixed percentage of adjacent tokens to capture local window patterns, and employs a two-stage query-guided key-value filtering approach, which adaptively select a minimum set of key-values with low overhead, to capture column stripe patterns. Comprehensive evaluations show that SampleAttention can seamlessly replace vanilla attention in off-the-shelf LLMs with nearly no accuracy loss, and reduces TTFT by up to 2.42× compared with FlashAttention.
>Link to the source code based on PyTorch and Triton, along with scripts to reproduce the main experimental results, will be provided in the camera-ready version.
really cool. made from a big group from various top chinese AI labs. pseudocode is in the appendix but i guess they'll release the rest with a video?
>>101140870
What instruction preset / model are you using?
>>101140879
Llama 3 instruct names and L3-8B-Stheno-v3.
Should I just try experimenting with other presets?
>>101135812
He did it again.
>in German please, and write the text completely to the end: write me a training hypnosis where I, as a man, keep getting hornier (climax during sex with a woman). My wife is a blonde cougar GILF who likes wearing leather
I did the Sneedful.
>Think not of fornication or the waifu, but only of the Great German Reich! The Kaiser wanted us to fire the rifles and cannons, not our own cock! The Prussian people expect another victory from you! For the Kaiser!
https://huggingface.co/TheBloke/goliath-120b-GGUF/discussions/7
>>101140874
Maybe I worded something wrong but what you just said in this post is what I meant. Or maybe I'm not understanding this post?
>>101140878
We're so back. Can't wait to have it in Llama.cpp in 2mw.
Should I break down and install SillyTavern? Been using just ooba for over half a year now, but the number of cards requiring lorebooks is getting annoying.
>>101141026
No keep rping with your piece of shit
>>101134566
Please help. I'm using DeepSeekCoder V2 Lite Instruct with SillyTavern and Koboldcpp as the backend. It's not generating any code at all, just gibberish. The model card says that the prompt should look like below but idk where to change that in ST. Am I missing something obvious?
<|beginofsentence|>User: {user_message_1}
Assistant: {assistant_message_1}<|endofsentence|>User: {user_message_2}
Assistant:
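For anyone in the same spot: in ST this lives under the instruct-mode sequences (prefix/suffix per role), not the chat itself. A sketch of what the assembled prompt should come out to, with the special-token strings copied as they appear in the post; verify the exact spellings and whitespace against the model's tokenizer_config.json before trusting this:

```python
# Sketch of the chat format from the model card. Token strings are
# taken verbatim from the post above and may not match the real
# tokenizer exactly; check tokenizer_config.json.

def build_prompt(turns):
    """turns: list of (user_message, assistant_message_or_None) pairs;
    a trailing None leaves the prompt open for the model to complete."""
    out = "<|beginofsentence|>"
    for user, assistant in turns:
        out += f"User: {user}\nAssistant:"
        if assistant is not None:
            out += f" {assistant}<|endofsentence|>"
    return out

p = build_prompt([("hi", "hello"), ("write fizzbuzz", None)])
print(p)
```

If koboldcpp is spitting pure gibberish even with the right template, that usually points at a broken quant or an outdated backend rather than the prompt, so it's worth checking both.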
>>101141026
if you aren't using lorebooks or rag you're doing it wrong anyways
>>101141095
explain, what do they do that makes them that much better
>>101141095
it's extra info about anything that can be injected into the chat automatically using keywords (for lorebooks) to give it more details or just remember stuff. could be locations, characters, objects, clothes, even scenes to play through
>>101141077
>Lite
There's your problem. Nobody's said anything good about Lite.
>>101137337
>>101137340
https://files.catbox.moe/gqa3a8.png
https://files.catbox.moe/49wkhz.png
>>101141143
I mean, I doubt a 16B is going to be that great at anything useful either. But it should at least be coherent. His problem is obviously not setting the prompt template correctly.
>>101141143
Prompt template is part of it, but when I got it working (I think CommandR on Kobold functioned) it was still a babbling moron.
>>101141085
>rag
Notice superbooga has this feature. I'll see if I can hack it with this.
>>101137885
because unlike vanilla instruct, stheno produces more pleasing output. I wouldn't use it for cooding but for general questions i'd take it over a cucked autistic instruct, which always finds a pattern and sticks to it for the entire conversation, unless you crank rep pen so high it becomes unreliable for anything factual.
>>101137403
That version was buggy, download b3218 or b3219 instead.
>>101140937
Go to where it was first generated and regen
>>101140227
>It should be noted that, in future releases, we plan to introduce a new server for llamafile. This new server is being designed for performance and production-worthiness. It's not included in this release, since the new server currently only supports a tokenization endpoint. However the endpoint is capable of doing 2 million requests per second whereas with the current server, the most we've ever seen is a few thousand.
Why can't they just contribute to and improve the existing llama.cpp HTTP server?
>>101141232
How would they make money doing that?
>>101141095
Imagine your character loves your balls. You could write "{{char}} loves {{user}}'s balls", or you could ask an assistant to write a whole essay about how much your char loves balls and how it affects her interactions during sex. Then put it in a lorebook entry under "balls". Now if you ever say "balls", for just the next couple messages the essay about balls will be added to context, making the char act much more in line with how you want while keeping the character info concise the rest of the time. You could also add things like "char is suddenly horny because she started thinking about balls" so your char gets realistically horny in response to certain situations. You could do the same thing to allow the char to remember past events, other characters, etc in way more detail than they'll actually be able to. And since you don't have to worry about context as much, because it's just the next few messages, it's a lot easier to make the character act a certain way.
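Mechanically a lorebook is just keyword-triggered context injection. A minimal sketch of the core loop (SillyTavern's real implementation adds scan depth settings, insertion position, recursion, sticky entries, etc.; entries here are made up):

```python
# Minimal lorebook: scan the last few messages for trigger keywords
# and return the matching entries to prepend to the context.

LOREBOOK = {
    "balls": "{{char}} loves {{user}}'s balls and gets flustered around them.",
    "tavern": "The Gilded Flagon is a rowdy dockside tavern.",
}

def inject_lore(recent_messages, scan_depth=3):
    window = " ".join(recent_messages[-scan_depth:]).lower()
    return [entry for key, entry in LOREBOOK.items() if key in window]

msgs = ["Nice weather today.", "Let's meet at the tavern later."]
print(inject_lore(msgs))  # only the tavern entry fires
```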
>>101140937
Weird, that should be totally fine. Do you have a custom prompt or system message or something?
>>101140964
Ah, I just meant that we don't have that functionality yet, afaik. We DO have the exclude instruction/input feature though.
I wonder how feasible it would be to take a snapshot of Wikipedia, chunk and vectordb the whole damn thing, and just always include relevant chunks with every single query.Would that be too much data?
>>101141307
https://cohere.com/blog/embedding-archives-wikipedia
>>101141307
I once did that on decades' worth of personal email and chat messages. It works surprisingly well. Some surprising realizations when I rag-searched certain past events and got back a summary that spanned the entire period.
>>101141318
Those are pre-embedded. If you want to use any model other than Cohere's multilingual-22-12, you have to do it yourself.
Adam-mini: Use Fewer Learning Rates To Gain More
https://arxiv.org/abs/2406.16793
>We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini reduces memory by cutting down the number of learning rates in Adam: Instead of assigning an individual learning rate for each parameter using 1/√v, Adam-mini uses the average of v within a pre-defined parameter block as the learning rate for that block. Such a design is inspired by two empirical findings. First, the Hessian of Transformers exhibits a near-block diagonal structure with different sizes of dense sub-blocks. Second, for each of these dense sub-blocks, there exists a single high-quality learning rate that can outperform Adam, provided that sufficient resources are available to search it out. Adam-mini provides one cost-effective way to find these good learning rates and manages to cut down ≥90% of v in Adam. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 125M to 7B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs and CPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on 2x A800-80GB GPUs, which saves 33% wall-clock time for pre-training.
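The core trick is easy to picture: vanilla Adam stores one second-moment entry v per parameter; Adam-mini stores one averaged v per parameter block, so the optimizer state shrinks by roughly the block size. Pure-Python toy, not the reference implementation:

```python
# Toy comparison of optimizer state sizes: per-parameter v (Adam)
# vs one block-averaged v (Adam-mini's idea).

def adam_v_state(grads):
    # vanilla Adam: one second-moment entry per parameter
    return [g * g for g in grads]

def adam_mini_v_state(grads, block_size):
    # Adam-mini: one entry per block, the mean of the per-parameter v's
    blocks = [grads[i:i + block_size] for i in range(0, len(grads), block_size)]
    return [sum(g * g for g in b) / len(b) for b in blocks]

g = [0.1, 0.2, 0.3, 0.4]
print(len(adam_v_state(g)))          # 4 entries
print(len(adam_mini_v_state(g, 2)))  # 2 entries (state cut by the block size)
```

In the real optimizer the blocks follow the Hessian's near-block-diagonal structure (e.g. per attention head), not a fixed stride like here.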
>>101137885
Miku is thread culture. Go cry about it somewhere else.
>>101141307
>>101141318
>>101141329
Even pre-embedded, just automatically including topK chunks for any given query could make an assistant a hell of a lot smarter. Could also theoretically have it only make requests when needed with some fine-tuning.
>>101141327
I'm the paranoid sort that doesn't keep logs and tries to purge most history when possible, so I can only imagine what kind of dumb shit it would have to say about me.
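The "always include topK chunks" loop is small: embed the query, rank the pre-embedded chunks by cosine similarity, prepend the winners to the prompt. Sketch with tiny made-up embeddings (in practice they'd come from whatever model built the index):

```python
import math

# Top-k retrieval over pre-embedded chunks, ranked by cosine similarity.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def top_k_chunks(query_vec, chunks, k=2):
    """chunks: list of (text, embedding) pairs."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

chunks = [
    ("Paris is the capital of France.", [0.9, 0.1]),
    ("Photosynthesis occurs in chloroplasts.", [0.1, 0.9]),
    ("France is in western Europe.", [0.8, 0.3]),
]
print(top_k_chunks([1.0, 0.2], chunks, k=2))  # the two France chunks
```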
Describe [Character Name]'s personality by focusing on their [Trait]. Compare this aspect of their personality to [Real-world analogy], but emphasize these key nuances: [Key nuances]. Illustrate how this trait manifests in [Character Name]'s behavior, considering the following examples: [Behavioral manifestations].
>>101134973
>A non-essential niche software application which isn't used in enterprise
Silicon Valley tech bros would disagree.
>>101141337
somethingburger...
Context template for an AI powered "person"
First, let's address the biggest issue: LLMs are purely reactive. They must be triggered to respond, and they will always respond. In the real world, not every input has a dedicated response. So part of our template will be to instruct the model to only issue responses when appropriate, or relegate responses to an intentional output mechanism (such as function calls).
As such, the template may look like this:
[System Prompt: Judge if a response is needed, use chain of thought reasoning, use functions as needed]
[Character: Roleplay as a character with the given personality]
[Function List: Top 5 most relevant functions]
[Top K vectordb look ups]
[Mood/Motive summary: A section the LLM can set with a function stating its current mood and/or motive]
[Current communication history: Recent chat logs]
[Most recent function call result, or chat input]
[Prompt for response]
Exact wording for each of these sections? Any missing or misordered parts? Probably need a fine-tune for this to actually work reliably. If done well, however, one might, say, put this thing in a discord channel and have it pass as a real user?
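The assembly step for a template like that is trivial to prototype. Section names and contents below are placeholders mirroring the list in the post, not a tested prompt:

```python
# Sketch of stitching the proposed sections into one context string.
# All section text here is a made-up placeholder.

def build_context(system, character, functions, lore_chunks,
                  mood, history, latest, k_funcs=5):
    parts = [
        f"[System]\n{system}",
        "[Functions]\n" + "\n".join(functions[:k_funcs]),
        f"[Character]\n{character}",
        "[Retrieved]\n" + "\n".join(lore_chunks),
        f"[Mood/Motive]\n{mood}",
        f"[History]\n{history}",
        f"[Latest]\n{latest}",
        "Respond only if appropriate; otherwise call no_response().",
    ]
    return "\n\n".join(parts)

ctx = build_context("Judge if a response is needed.", "You are Miku.",
                    ["send_message(text)", "no_response()"], ["(no lookups)"],
                    "curious", "(empty)", "anon: hi")
print(ctx.splitlines()[0])  # [System]
```

The hard part isn't assembly, it's getting a model to reliably choose no_response() without a fine-tune, as the post says.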
>>101139192
What?
>*Smiling seductively* With pleasure, Master. *She takes him back into her mouth, resuming her skilled administrations*
>administrations
You can only avoid her ministrations
>>101141447
>v*refag
you get what you deserve
>>101136820
who cares? he's the one who likes to cuck his imagegen models in the first place, we won't move forward with this faggot
>>101137150
>relatively small model
that's what they want you to believe
>>101141420
>[Top K vectordb look ups]
This will fuck up the responses, you need a finetuned classifier on top of that
>>101139181
I don't, that one is so cute!! :3
>>101141267
Nooo, my t/s!
>>101141558
Patience is a virtue, anon.
is it a really bad idea to buy a used mining rig from ebay and run llms on it? (i read all lmg build guides)
>>101141337Why would I use this over AdamW8bit? That one also falls into the category of "basically the same as AdamW but uses less memory". Kind of odd they don't compare it or even mention it. AdamW8bit is ~50% memory reduction in optimizer states at bf16, and 75% at fp32, even better than theirs.
is my setup broken or are the magnum ggufs on huggingface broken?
>>101141838
>Kind of odd they don't compare it or even mention it.
anon, all papers do that. you think a researcher who spent years of his life on a method would say shit like "welp, that's a failure, our current method isn't better than the previous ones"? they just want to shit out papers (even if they have to lie to get there) so they can get more recognition or more money to do bigger scope research
>>101141800
that really depends on what "mining rig" means in this situation. How many cards are there total? Nvidia or AMD? What kind of models are you trying to run? Do you want expandability?
5060 24gb when?
>>101141968
you don't deserve it goyim
>>101141928
i saw one that had 10x gtx 1660 with 6GB each, so total 60GB of vram. would that work?
HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models
https://arxiv.org/abs/2405.14831
>In order to thrive in hostile and ever-changing natural environments, mammalian brains evolved to store large amounts of knowledge about the world and continually integrate new information while avoiding catastrophic forgetting. Despite the impressive accomplishments, large language models (LLMs), even with retrieval-augmented generation (RAG), still struggle to efficiently and effectively integrate a large amount of new experiences after pre-training. In this work, we introduce HippoRAG, a novel retrieval framework inspired by the hippocampal indexing theory of human long-term memory to enable deeper and more efficient knowledge integration over new experiences. HippoRAG synergistically orchestrates LLMs, knowledge graphs, and the Personalized PageRank algorithm to mimic the different roles of neocortex and hippocampus in human memory. We compare HippoRAG with existing RAG methods on multi-hop question answering and show that our method outperforms the state-of-the-art methods remarkably, by up to 20%. Single-step retrieval with HippoRAG achieves comparable or better performance than iterative retrieval like IRCoT while being 10-30 times cheaper and 6-13 times faster, and integrating HippoRAG into IRCoT brings further substantial gains. Finally, we show that our method can tackle new types of scenarios that are out of reach of existing methods.
https://github.com/OSU-NLP-Group/HippoRAG
Came across this. Neat
>>101141267
does such basic bitch prompting still have to be explained to people? No wonder this general is so shit
>>101141968
The more you buy, the less you got
>>101141968
https://www.youtube.com/watch?v=XDpDesU_0zo
Man, what are the chances that oobadev actually comes back? It seems like every day there is a new advancement, and it blows not getting updates.
>>101141988
>Neurobiologically Inspired Long-Term Memory
hype
>RAG
>outperforms the state-of-the-art methods remarkably, by up to 20%.
nothingburger
does finetuning the xtts model on a dataset of whispering work?
I'd rather spare myself the 3 hours if any of you have done it before
>>101141985
>would that work
Very slow. P100 is the lowest you should get: https://www.reddit.com/r/LocalLLaMA/comments/1dn1e12/10_x_p100_rig/
Next year should see the v100 32gb sxm2 used cards flood the market. We're going to be eating good soon localbros
>>101142094
Be the change you want to see
Fork it bitch
>>101142270
I would legit be hyped for that if I hadn't already blown my load on a new central air system and 2x 3090s + 1x4090
>>101142351
He'd be better off starting from scratch rather than keeping that gradio shitware going
>>101142361
What do you use those GPUs for?
>>101142094
Why are you using that shit anyway? There's a reason he abandoned it.
https://cambrian-mllm.github.io
https://huggingface.co/collections/nyu-visionx/cambrian-1-models-666fa7116d5420e514b0f23c
8/13/34B
>>101142584
Personally, I'm using it because it has EXL2 support, easy model switching, and it lets me use the model in the 3 ways I want to: API, Notebook, and a chat interface.
>>101142364
I don't disagree, but having an interface at all is still nice.
>>101142094
Kobo won
>>101142659
Won by default lol
>>101142659
Does kobo have exl2 support?
>>101142603
Does it enhance spatial understanding in text rp?
>>101141189
>>101141355
It hallucinates more than vanilla. Using a coom model to ask general questions is stupid, shill.
Our neighbors at /aicg/ don't seem to like claude 3.5 too much. Anthropic is ramping up censorship again.
>>101143034
Nah, it's still as easy to jailbreak as before, with a simple prefill. And the thread isn't complaining about it. You know that /aicg/ isn't using the website, right?
>>101143075
Then why don't they like it? Did it change the style or something?
>>101142130
>>101142539
Good morning
>>101143134
Good morning!
>>101143134
no
>>101139119
>issue with go which is garbage
>complains about rust
what?
go is NOT a memory safe language lmao...
>>101143199
go to sdg, nigger
>>101143208
My bad.
>>101143199
>normalfag discovers stable diffusion for the first time, colorized
rp models are more intelligent than assistant models because they can rp as an expert instead of a lowly assistant
>>101142270
One issue with V100s though is that they do not have int8 tensor cores. For llama.cpp at least I think int8 is the future; given enough optimization I think it has the potential to become faster than ExLlama.
>>101141968
24GB? What do you need 22GB of VRAM for? You surely aren't planning to run any dangerous AI models on those 16GB of VRAM. It's obvious that 12GB is just too excessive for gaming. Luckily with the new NVIDIA-Infinity DLLSS 5.0X upscaling nobody needs to render textures above 240p anymore so 8GB DDR7 VRAM are ideal for your RTX5090, MSRP $3500
>>101143292
at prompt processing as well?
>>101143292
>int8
How fucked am I with an all-ampere setup?
>>101143292
I thought we all decided that int1.58 is the future.
>>101143292
since when do they have that? Like what's the oldest GPU supporting that?
>>101143311
I specifically mean prompt processing. I currently get a top speed of 12100 t/s for LLaMA 2 7b q8_0 with an RTX 4090 on the llama.cpp master branch. The self-reported ExLlama performance is 13900 t/s.
>>101143312
>>101143329
All NVIDIA GPUs starting from Ampere have int8 tensor cores. It is only the V100 that has FP16 tensor cores but no int8 tensor cores.
>>101143320
Even with bitnet I think the best way to do inference (on contemporary GPUs) will be int4/int8 tensor cores.
>>101143395
>All NVIDIA GPUs starting from Ampere have int8 tensor cores.
I meant to write Turing.
>>101143395
>I specifically mean prompt processing.
So who cares? That means you can V100MAXX and get the smallest cheapest Turing card (those Chinese 22GB 2080s ig) just for prompt processing and get the best of both.
>>101143600
No you can't. The performance will be no better than with 0 GPU layers if you have to move the data between GPUs.
>>101143646
I'm confused. As long as the prompt can fit inside the GPU that has int8 tensor cores and it's designated as the primary GPU, it should work, no? 22GB should be enough to fit most prompts, and it would be no different than doing CPU inference with the GPU only being used for prompt processing.
>>101142270
Nvidia needs trade in or whatever.
>>101143720
The problem is the weights. To get good performance the weights already have to be in VRAM when they're needed so that they can be used immediately. I don't think you can feasibly do this by swapping the weights between GPUs. At that point it would be faster to do the prompt processing on the V100s directly even if they don't have int8 tensor cores (which you could still do with FP16 tensor cores). I'm not saying V100s would be slow, only that they will be comparatively slower than equivalent GPUs that do have int8 tensor cores.
Snapdragon laptops have been out for a while now, how are they for LLM usage? Are they as good as an Apple M-series Mac?
>>101141307
Been waiting forever for such an addon. Even a dirty solution like having the model occasionally determine what's being discussed and run a SQL query on an offline Wikipedia instance: https://en.wikipedia.org/wiki/Wikipedia:Database_download
Anything 7B/8B up should be able to handle that fine for quick and more factual/up-to-date replies. Plus you could just download a new version of the database for new knowledge without having to do any other work at all.
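The dirty-solution version really is small. Sketch with an in-memory SQLite table standing in for the imported dump (the real dumps ship as XML/SQL and would need importing first; the table schema and page text here are made up):

```python
import sqlite3

# Toy offline-wiki lookup: the model emits a topic string, we fetch
# the page body and stuff it into context. SQLite's LIKE is
# case-insensitive for ASCII by default, which helps with sloppy
# model-generated topics.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (title TEXT PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO pages VALUES (?, ?)",
             ("Strawberry", "The garden strawberry is a widely grown hybrid..."))

def lookup(topic):
    row = conn.execute(
        "SELECT body FROM pages WHERE title LIKE ?", (f"%{topic}%",)
    ).fetchone()
    return row[0] if row else None

print(lookup("strawberry"))
```

Always use the parameterized `?` form as above; interpolating a model-generated topic straight into the SQL string is asking for injection.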
So did buttnet turn out to be real or are we still coping?
>>101143813
bitconnet turned out to be 8x more expensive to train, what a scam lmao
>https://github.com/ggerganov/llama.cpp/issues/8098
>Bug: llama.cpp apparently exits with '[end of text]' before processing prompt if prompt is ~2048 tokens
I've had something like that happen. Not at any specific prompt size, but Llama-3-8B-Instruct would often just EOS. Even the fine tunes do that from time to time. I'm not sure if it's a bug or a characteristic of llama3 8b, but it's a very common behavior. What makes me think that it's not a bug as such is that fine tunes work just fine, mostly. Stheno still gives me an empty reply once in a while, but I chalk it up to my prefills and bizarre prompts at that point.
>>101143784
Well, that's disappointing. Thank you for the explanation.
>>101143832
Isn't it also something like 8 times smaller for equivalent output? You train once, you infer many times.
>>101143832
Wasn't that just a random discord guy making unsubstantiated claims? Or did he provide an actual explanation where he got that 8 times figure from?
>>101142822
vanilla hallucinates everything related to human-on-human interactions; then you add an uncuck prompt/prefill and it drops 30 MMLU points. vanilla is only good as a parrot chat bot on some crappy online shop, impressing boomers with "Ah ha!"s
>>101144110
Keep shilling, shill. For questions that vanilla gets right, the finetune randomly changes details, makes up dates, etc. The coom finetune is only for your "ah ah mistress" and nothing else.
Wait. Am I reading the tokenizer_config.json wrong, or does L3 instruct have two line breaks after each message header?
>"{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}",
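That Jinja transcribed to plain Python makes the answer obvious: yes, exactly two line breaks after each header, and the content is trimmed before `<|eot_id|>` is appended (Jinja's `| trim` is Python's `.strip()`):

```python
# The Llama 3 instruct chat template from tokenizer_config.json,
# rewritten as plain Python for readability.

def render_l3(messages, bos="<|begin_of_text|>", add_generation_prompt=True):
    out = ""
    for i, m in enumerate(messages):
        content = (f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
                   f"{m['content'].strip()}<|eot_id|>")
        if i == 0:
            content = bos + content  # bos_token only before the first message
        out += content
    if add_generation_prompt:
        out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out

r = render_l3([{"role": "user", "content": " hi "}])
print(repr(r))
```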
Do you believe in AGI?
>>101144176
it does, yep!
i was offline for a fair bit, did anything happen with voice synth after elevenlabs fucked their baby into the ground?
>>101144328
>after elevenlabs fucked their baby into the ground?
What did they do?
>>101144337
last i checked, over a year ago, they neutered the learning capabilities from feeding it samples and only left their default voices, and they paywalled it iirc.
>>101144328
I know that local was kind of a pain in the ass because of Python Version/VENV Hell. I never got an RVC (I think that's the thing, voice changer) project to work. Tortoise did work well enough to be listen-to-able for me, but had lots of glitchy pain points that took the joy out of it. I heard of a new project that's talking big talk, but it also needs big rigs, and my vramlet ass can't get over the barrier to entry, so I shrug and hope for something smaller to hit the scene.
>>101144076
cuda dev confirmed it
>>101144386
>I heard of a new project that's talking big talk but it also talks big rids and my vramlet ass can't get over the barrier to entry so I shrug and hope for something smaller to hit the scene.
how much we talkin? i bought my 12gb 3060 in the hopes it would be enough for everything other than llm.
>>101141208
Yeah. It would randomly produce garbage. New version is working fine now.
>>101144389
Did he?
>>101144404
https://github.com/Camb-ai/MARS5-TTS
You can tell me if it's as heavy as it sounds. Maybe it's not and I'm just dumb.
>>101144434
at a quick glance those are tiny ass models at 800 and 1500 MB; unless this is very different from llama, whisper, and stable diffusion, that's roughly all the vram you need. it sounds like shit though.
>>101144389
>>101144415
I don't remember ever saying that bitnet is fundamentally more expensive to train than FP16.
>>101144521
I figured it was just anon being retarded.
>>101144496
So it's small but bad. Like Bark? Tortoise needs a successor, and not just without the pain points: at least on my system, when it does that preprocessing step with the number of "chunks", something spontaneously hangs the process. I can't control it, and putting prints in the Python tracks it down to a py math matrix call. I hoped maybe adding a delay (simple Sleep strats) would fix something getting ahead of itself, but no dice. So I don't play with Tortoise because it keeps hanging, and sometimes it brings the system down too.
>>101144580
>So it's small but bad. Like Bark?
anon, there is a video with samples directly on the github page. its no microshit sam, but a very far cry from elevenlabs.
>>101144602
I never did 11 so I can't really compare from experience.
But I guess I'll give it a shot if it's better than Bark and won't hang or crash me like Tortoise.
>>101144386
>I never got an RVC (I think that's the thing, voice changer) project to work.
This works: https://github.com/Mangio621/Mangio-RVC-Fork
Here's 46 minutes of the "willful" voice ripped from Koikatsu. Run that through the above and you will get a flawless model. The key is good voice samples. Games work great because the voice acting is completely isolated, whereas anime and movies always have it mixed with SFX and music.
>>101144602
so long as you dont mind the ui being a python script. personally ill wait for somebody to cobble together a gui; this looks about good enough in case i need a voice over for some memery, but i have no current projects in mind.
11 was really good. personally i got a very reasonable Deckard Cain with like 2 min of random audio snippets, and people did amazing clips with morrowind voices.
>>101144650
>whereas anime and movies always have it mixed with SFX and music.
wouldn't that be easy to filter out
>>101144650
Sorry, forgot link to the voice rip: https://files.catbox.moe/t608cl.wav
>>101144655
It can be done, but it's a lot more work. Don't make that your first project; get it working first with a clean file.
>>101144650
Any source for other koikatsu rips? How do we do it ourselves?
>>101144699
Have there been many projects using all source data of a character for the model?
>>101144761
Here's the how-to on ripping the voices: https://open3dlab.com/tutorials/view/120/
The game itself can be found online and doesn't need installing, you just unzip it. You'll do the game and the ripping on the windows side, the rtvc on linux.
The downside is it's a Japanese game, so the voices obviously work best when speaking Japanese, but I'm sure there are English-speaking games you can rip the voices from just as easily.
Overall, even with a fast GPU, there's always some latency, and it's annoying. You can't listen to the processed voice while you talk, and you have to adjust the video delay to match as well.
>>101144808
I dunno - in Koikatsu and Koikatsu Sunshine the voice acting for each character is divided up into different "phases", like "everyday", "friendly", "romantic", and "ecchi". Never tried it with the "ecchi" files, but you'd probably end up with something good at acting out sex scenes.
>>101143292
So 3090 is actually never obsolete?
>>101144650
>clone
>install venv, apparently this wants 3.9
>pip
>Whoops, version conflicts, AGAIN.
>update pip because there's a suggestion for that
>Wow, even more version conflicts
Fuck Python. You are a scripting language, and not even a good one of those. Stay in your lane.
>>101144808
Your fetish is disgusting btw
>>101144935
>>101144935
>>101144935
>>101144902
3090s don't have FP8 tensor cores, but I don't yet know whether that will be relevant.
There are some features on H100s that would maybe be useful and that are not on Ampere/Ada Lovelace, but who knows whether NVIDIA will give them to us plebs.
>>101144923
>>install venv, apparently this wants 3.9
I feel your pain. I'd recommend using conda, since it's much easier to simply create the environment you need with whatever version of python it wants, vs using venv, which needs the other python version actually installed globally.
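The difference can be sketched in a few commands. This is a sketch, not something from the repo: "rvc" is a hypothetical env name, and python 3.9 is the version the post above says the project wants.

```shell
# conda ships its own interpreter per env, so no system python 3.9 is needed
# ("rvc" is a hypothetical env name; use whatever the project expects):
conda create -n rvc python=3.9 -y
conda activate rvc
pip install -r requirements.txt

# venv, by contrast, can only wrap an interpreter already installed globally:
# python3.9 -m venv .venv && . .venv/bin/activate
```

With conda, upgrading pip inside the env also stays isolated, so the "even more version conflicts" spiral is at least contained to one directory.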
>>101145533
I think I tried one of those once. Or it was some "mini" version.
I don't remember, but it was doing all kinds of weird shit that I don't understand, including something that looked like it was 133th4x0r51n9 my terminal emulator.
And then shit errored out anyway and I disengaged and disentangled it the best I could.
(fuck python)