/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101250468 & >>101243128

►News
>(07/02) Japanese LLaMA-based model pre-trained on 2T tokens: https://hf.co/cyberagent/calm3-22b-chat
>(06/28) Inference support for Gemma 2 merged: https://github.com/ggerganov/llama.cpp/pull/8156
>(06/27) Meta announces LLM Compiler, based on Code Llama, for code optimization and disassembly: https://go.fb.me/tdd3dw
>(06/27) Gemma 2 released: https://hf.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315
>(06/25) Cambrian-1: Collection of vision-centric multimodal LLMs: https://cambrian-mllm.github.io

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
►Recent Highlights from the Previous Thread: >>101250468

--Troubleshooting Custom Limarp Adapter for Wizard Mixtral: >>101255564 >>101255603 >>101255680 >>101255900 >>101255999 >>101255658 >>101255670 >>101255749
--Pluggable RAM Sticks for GPUs: A Potential AI Powerhouse: >>101250618 >>101250663
--LLaMA.cpp Flash Attention 2 on Gemma 2 and Cache Quantization Possibilities: >>101256405 >>101256491 >>101256539
--Gemma2 Training Support Added to QLora-Pipe, Shows Promise Over LLaMA 3 8B: >>101251790 >>101252075 >>101252240 >>101252250 >>101252824
--Gemma2 21b: A Promising Alternative to LLaMA: >>101254733 >>101254757 >>101254781 >>101254860 >>101254922
--Gemma Responds Well to Specific Instructions, Like Mixtral: >>101252897 >>101252919 >>101252947 >>101252957 >>101253002
--BigTech's Hyperparameter Tuning Secrets for LLMs: >>101250835 >>101250871 >>101250881
--Anon's Japanese Travel Phrases Get Judged by Gemma 2 9b: >>101256710
--Twin Peaks AI Model Impresses with Fandom Knowledge and Sentiment Analysis: >>101254235 >>101254247 >>101254277 >>101254326
--Time-Based RNG Issues in LLMs Need a Solution: >>101254647 >>101254707 >>101254722
--Phi3-Mini Update Surpasses LLaMA 3 8B Performance: >>101255873 >>101255918
--New cpumaxxfag Server Specs Discussion: >>101250702 >>101252875 >>101252929 >>101252938
--Microsoft's MInference for Faster Long-Context LLM Inference: >>101254324
--LLaMA CPP Python Version Bump and Gemma Model Compatibility: >>101254881
--Gemma 21b Passes the Watermelon Test: >>101256472 >>101256514 >>101256945
--CXL-GPU Image: Nothingburger Without Industry Support: >>101254227 >>101254253
--Pytorch 2.2.2 Downgrade Ruins VRAM Allocation: >>101255284 >>101255775
--InternLM2.5 Collection Released on Hugging Face: >>101255862 >>101255885 >>101257520 >>101257548
--Fixing Gemma with <bos> Token: >>101253884 >>101254036 >>101254050 >>101254423
--Miku (free space): >>101250518

►Recent Highlight Posts from the Previous Thread: >>101250472
>>101258584
>Gemma 21B
It's 27B
>>101258641
>Gemma
It's Kumma
>>101258641
Our Gemma is lighter, having hollow bones for more efficient flight.
magnum is dumb as fuck
>>101258689
It's hard to move forward from Mythomax
this is it
this is the month of llama 3.5 turbo
Gemma-2 (even 9B) is Claude Sonnet.
>>101258792
Then Gemini must be God
Good morning lmg!
>>101258845
If miku is a robot, perhaps it could have weaponized T-doll modules installed in conjunction with civilian behaviors...
https://www.youtube.com/watch?v=ioRhanH4xmE
Here's a bit of morning hopium:
I've been running llama-bench and logging the results for a few months now looking for regressions in specific patches to report back to the devs, but it's been a pretty steady march forward.
These are the unfiltered t/s results for the leaked miqu 70b q5
>>101258845
Good morning Miku
oops! time to uninstall firefox! you can thank meta for this btw
>>101258990
What kinds of prompt processing speeds do you get with RAM?
>>101259034
>thank meta for firefox owners being massive faggots
I guess we can thank meta when they bundled pocket and other shitware into their shitty browser too?
>>101259077
>moving goalposts
your favorite jewcorp - meta - is building real time censorship tools, and you're indirectly supporting this by providing negative data for them to filter & classify, don't come and cry here later when you get banned for having wrong opinions on *random topic* because meta's llm hallucinated something.
>>101259034
what's the point? twitter is owned by Elon now, people can say whatever they want on their platform, and no libtard tears are gonna change that
Do any of you fine fellows have any experience with the Orin AGX? A business near me is liquidating a bunch of them for $600/piece.
Their GPUs are pretty shit (worse than a 3060) but they've got 64GB of memory and some special accelerators apparently, so I was thinking about picking one up
>>101259034
>anon has a knee-jerk reaction to the word "hate speech" without actually understanding anything about the situation
classic
What the fuck is this dude talking about kek.
>>101259163
You'd have to rely on this guy's stuff to use the accelerators right
>https://github.com/dusty-nv
>>101259163
Do they have an online shop for that?
4 tokens/second on 70B llama, it's not very good. The software is also a pain in the dick to work with if you're doing something that's even remotely memory-constrained. The shared memory is retarded, it's way easier to deal with dedicated VRAM.
Software support is also dicks, enjoy trying to compile shitty python packages for cuda that don't have ARM support.
>>101259063
>CPU prompt processing
They're abysmal, as expected (pp512 is about 22 t/s max for llama-bench of miqu 70b q5... about 10% of what I can do with CUDA). I use a GPU purely for context.
The RAM speed isn't the bottleneck with pp, it's a compute capability mismatch of CPU vs GPU.
>L3 tunes maturing on the upper and lower end of beak spectrum
>Gemma2 quirks slowly getting worked out in the middle-ground
>Beeg 400B L3 imminent (and multimodal?)
Local is about to finally start eating good on every level
>>101258576
https://x.com/BoyuanChen0/status/1808538170067407264
>Introducing Diffusion Forcing, which unifies next-token prediction (eg LLMs) and full-seq. diffusion (eg SORA)! It offers improved performance & new sampling strategies in vision and robotics, such as stable, infinite video generation, better diffusion planning, and more!
I smell a new potential
>>101259283
we've been eating good for a long time! :)
>>101259322
i smell a nothingburger
>>101259234
If it's that much of a pain in the ass, I'll skip it. Thanks for answering my question, though.
>Do they have an online shop for that?
Not as far as I know, but I can ask
>>101259361
i smell a sitcom
Recently got a 4090. What should I run other than 3.5 exl2 Command R?
>>101258990
Based regression inspector.
>>101259499
Gemma 2 27B with context extended to 16K at the highest quant you can, then report back how it compares to CR.
>>101259578
How do you extend the context?
>>101259620
RoPE/YaRN.
If you are using exl2 you want to use the alpha calculator in the OP.
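For the curious, the "alpha" exl2 exposes comes from NTK-aware RoPE scaling. A minimal sketch of the underlying relation (the linked Desmos calculator uses an empirical fit, so treat this as the textbook approximation, not its exact formula; head_dim=128 is an assumed llama-family value):

```python
def ntk_scaled_base(base: float, alpha: float, head_dim: int = 128) -> float:
    # NTK-aware scaling multiplies the rotary base by alpha^(d/(d-2)),
    # which stretches the longest rotary wavelength by roughly a factor
    # of alpha - so alpha ~ your desired context extension factor.
    return base * alpha ** (head_dim / (head_dim - 2))

print(ntk_scaled_base(10000.0, 2.0))  # base for roughly 2x context
```

In practice people nudge alpha a bit above the pure ratio (e.g. ~2.6 for 2x), which is what the calculator in the OP is for.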
>>101258689
Skill issue (lol)
Does Gemma 27B have a capped temperature? It doesn't go off the rails even at 5.0. It feels like it writes differently than at 1.0 (where it still basically never goes schizo), but that might just be placebo because I would expect that. But it remains entirely coherent.
>>101259737
I have wondered this about other models that seem to remain coherent at higher temp samplers as well. What is it that causes some models to be more tolerant of a wider range of temps while others seem to have a much narrower working range?
>>101259782
in my experience this mostly means the model is really overbaked
>>101259192
Because it's just enough to know the real use-case for this shit. You can keep sucking on that corporate AI-jew cock though, i am sure it will never ever backfire!
LONG CONTEXT ALERT
redditors say it works well past 200k
>>101259322
Gradually i began to hate them ai-jeet hypefaggots, a bunch of buzzwords and promises is just enough to catch the average lmgtard's attention!
>>101259737
>>101259782
Isn't it just a sign of more tokens being trained?
>>101259817
Yeah, but it's dumb. Failed music theory, failed a tricky programming question even after being told what the trick was.
Thousands of 7B tokens are still 7B tokens, alas.
>>101259782
It could be >>101259805, or maybe it could be related to the logit capping feature that the model needs to work correctly.
>https://github.com/ggerganov/llama.cpp/pull/8197
>>101259826
you're no better for seeing a bunch of technical terms you don't really understand and instinctually having the same reaction but in the other direction
>>101259737
I'm an idiot, with min_p 0.0 it does go schizo at higher temperatures. But it feels like even tiny min_p values are much better at keeping Gemma sane than with other models.
>>101259322
so it's a new diffusion architecture? Didn't know you could use the llm technique (next token prediction) to make pictures though
Bros I like Mikupad. It and ST are all I need.
>>101259924
Yeah, I was thinking more along the lines of it being used for audio generation with more accurate guidance for generating consistent quality.
How does gemma2 27b cope with context extension? I don't even want to bother if it goes schizo past 8k
>>101259932
Relatable. I only use ST and Mikupad too, but I haven't used ST for a long time now because I got bored of LLM slop, so I use Mikupad for random experiments.
>>101258576
Alright doc, give it to me straight.
What's the best model I can run right now that fits on 24GB?
>>101259880
Based minP 0.01-0.05 gang
>>101259782
it depends on the tail of the token distribution
when you use cross-entropy loss, it forces the model to have overconfidence on the next token at the expense of the other tokens in the distribution
when trained for a very long time, the variance of that tail can get quite high, and temperature scaling will amount to randomly sampling from the rest of the vocabulary
it's impossible to tell what they did during training to improve on that, you could do many things, even simple things like label smoothing might help, or they could have introduced an additional loss term that they didn't disclose in the paper
>>101260199
>when you use cross-entropy loss, it forces the model to have overconfidence on the next token at the expense of the other tokens in the distribution
Not sure what you mean by that. The next token is the only thing any of these models predict, and cross-entropy loss doesn't encourage "overconfidence", just "correct confidence". If you're overconfident then every missed prediction will be very expensive in terms of cross entropy. For a model to become deterministic, its training data must be predictable. That's not the fault of cross entropy but of overfitting and possibly reusing data for too many epochs.
>>101259880
>>101260195
MinP+Temp basically ended the retarded sampler debate once and for all.
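min-p is also simple enough to sketch in a few lines. This is a hedged sketch of the idea, not llama.cpp's exact implementation (real backends let you reorder samplers; here temperature is applied before the filter):

```python
import math
import random

def min_p_sample(logits, min_p=0.05, temperature=1.0):
    # min-p filtering: keep every token whose probability is at least
    # min_p times the top token's probability, then sample from the
    # renormalized survivors. Higher temperature flattens the
    # distribution, but the cutoff scales with the top token, so
    # garbage-tail tokens stay excluded.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    cutoff = min_p * max(probs)
    kept = [(i, p) for i, p in enumerate(probs) if p >= cutoff]
    r = random.uniform(0.0, sum(p for _, p in kept))
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```

With min_p around 0.01-0.05, even temperature 2+ only ever samples from plausible candidates, which matches the Gemma observations above.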
>>101259991
Mamba+forcing=back
>>101260241
>The next token is the only thing any of these models predict, and cross entropy loss doesn't encourage "overconfidence", just "correct confidence"
what exactly do you think cross-entropy is calculating and how do you think it is calculated?
in order for the model to achieve the lowest loss, it needs to have maximum confidence in the next token, that necessitates that it has low confidence in the rest of the vocabulary, as the optimal probability is 1.0 on the next token
in fact that is the very reason label smoothing exists in the first place
>>101260307
>forced
The more intellectual term would be contrived.
>>101260271
kanyemonk won
new nothingburger dropped!
>>101260535
https://x.com/AlphaSignalAI/status/1808534830683979792
Actual demo
>>101260546
pretty sick, i wonder why it speaks in 3-4 word chunks
>>101260546
artificial biden kek
>>101260307
>5%
I mean, it's not nothing. But slicing the Y axis like that is kind of a dick move.
best llm for tea recommendations?
>>101260535
>>101260546
https://moshi.chat/?queue_id=talktomoshi
>>101260571
Upstage-70b-Instruct
>>101260564
PPL doesn't make sense to compare linearly. A difference of 0.1 PPL might be the difference between highschool-level intelligence and PhD-level. (Or to be fair, it might also be a nothingburger.) Loss is similar, if that's what they're showing.
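Concretely: perplexity is exp(mean cross-entropy loss), so equal-looking PPL gaps hide very different loss gaps. A tiny sketch of the arithmetic:

```python
import math

def loss_delta(ppl_a, ppl_b):
    # PPL = exp(loss), so the loss difference between two perplexities
    # is log(ppl_a) - log(ppl_b) - i.e. it depends on the *ratio*,
    # not the absolute gap.
    return math.log(ppl_a) - math.log(ppl_b)

print(loss_delta(2.1, 2.0))    # ~0.0488 nats
print(loss_delta(10.1, 10.0))  # ~0.00995 nats: same 0.1 PPL, ~5x smaller
```

Which is why a 0.1 PPL drop near the bottom of a plot can matter far more than the same drop higher up.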
>>101260546
nowhere near gpt-4o lol
https://www.youtube.com/watch?v=vgYi3Wr7v_g
>>101260564
unless it's a flop-adjusted graph, it's meaningless
>>101260601
>>101260621
So it's more meaningless than I thought? Just "this line lower" and that's about it?
4 new commits!
AARGHH
I MUST POOOOL
Any good L3 tunes yet?
>>101260651
idk, but posting the loss graph by itself means nothing
it could easily be the case that if they adjust the transformer to match the training flops of their mamba model, the transformer wins in the end
there's no way to know since all they posted was the loss
>>101260651
Pretty much. It means "it works better," and that's it. Accuracy on a task would give a real performance indication.
Can I still into this with 16gb vram nowadays? If so what models should I even be looking at?
lmao https://x.com/ac_crypto/status/1807882764261417000
>>101260721
You won't get good speed, but you can run a model up to about 90% of your system RAM.
I'm 12 GB VRAM, 64 GB system, and I can reach up to 58 GB models as long as I don't let anything else hog too much. About 1 t/s. It's sufficient for playing around.
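That rule of thumb is easy to write down. A sketch of the post's own heuristic (an assumption-laden rule of thumb, not an exact formula - KV cache, OS, and other processes eat into the headroom):

```python
def fits_in_memory(model_gb, vram_gb, ram_gb, headroom=0.9):
    # Rough rule from the post above: a GGUF is runnable (if slowly)
    # when it fits in ~90% of combined VRAM + system RAM, leaving
    # headroom for context/KV cache and everything else.
    return model_gb <= headroom * (vram_gb + ram_gb)

print(fits_in_memory(58, 12, 64))  # the 12GB VRAM + 64GB RAM setup above
print(fits_in_memory(70, 12, 64))  # a 70GB model would not fit
```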
>>101260752
>you don't need more than 1 t/s
>3090 is $1500
nani
>>101260752
Wouldn't this take like 10 minutes per response?
>>101260322
>needs to have maximum confidence in the next token
Only if it knows with certainty what the next token is. Otherwise the expected value of a max-confidence guess will be worse, because it can be wrong.
In a sense you're correct, but only when overfitting. If the training data is truly novel on each batch, there is no issue. I realize that probably is never true in reality due to running multiple epochs and unknown data duplication, but those are the actual problem, not cross entropy loss.
>>101260721
Anything that fits at least 80% in your VRAM. You can offload the rest to your RAM.
>>101260786
Depends on the mood of the session. Right now I've got two tabs open, same model, same scenario, but I'm testing slightly different author's notes that I threw in along the way. One side, every turn is one quality paragraph. Most recent turnaround on that was 160 seconds. The other tab is in a six-paragraph mood, being a lot more detailed about the action. 606 seconds last turn.
I don't know if this is just LLM randomness or if the A/N is truly affecting how it interprets the instruction I've given it, but time-wise it's not much different from the days of AOL/AIM when you would say something, wait a bit, and then get the response. I know all y'all zoomers start to shiver barely above a whisper if feedback isn't instant, but if you have only one video card, that's what you've signed up for.
>>101260586
it's shit
exactly the same issue as with glados or gpt4o. Using fucking "sleep(2000)" after the last user input to determine whether the model can start talking is retarded and can only impress brainlets on videos. Until the model can "infer" that i'm done talking, this concept will remain nothing more than a funny tech demo.
>>101260805
i have no idea what you are saying
if you use cross entropy loss for next-token prediction, you are optimizing for maximum probability on the correct next token - that is how CEL works, the dynamics do not change regardless of what your data distribution looks like, so I have no idea what you mean by "Only if it knows with certainty what the next token is" - that is what you are optimizing for, and you'll know if the model learns how to do it by checking the loss
>If the training data is truly novel on each batch, there is no issue
what does this even mean? you do not need to train a model on infinitely new data to prevent overfitting, in fact in many cases multiple epochs on the same data is the best way to achieve generalization simply by training the model for long enough
>but those are the actual problem, not cross entropy loss
neither of those are problems, they are desired behaviors
it is expected that a model trained on CEL will assign maximum probability to the correct next token and 0 probability to all other tokens - if it truly understands the data
all of this just makes me think you don't fundamentally understand what's happening under the hood and you're extrapolating based on your intuition
I get this when using latest ooba:

---------------------------
python.exe - Entry Point Not Found
---------------------------
The procedure entry point ggml_backend_cuda_log_set_callback could not be located in the dynamic link library B:\src\text-generation-webui\installer_files\env\Lib\site-packages\llama_cpp_cuda\lib\llama.dll.
---------------------------
OK
---------------------------

Updated to latest to try gemma, and when loading the model I get this error as a message box, and the model loads in CPU. Nothing unusual in the console. Anyone?
>>101259322
AGIbros we are so back!
I always saw diffusion methods as a means of making a neural net "think" more and arrive at a more plausible output
the ability to model uncertainty at a token level through noise seems powerful - you can make tokens less noisy near tokens without noise and more noisy far from them - you don't need to arrive at the solution immediately and it's possible to go beyond token lengths it saw during training without the model shitting itself
the video and robot planning results are nice
>>101260874
>Until the model can "infer" that i'm done talking, this concept will remain nothing more than a funny tech demo
this. model should be big enough to understand it, and obviously this is impossible locally.
>>101260907
If you can predict the next token with absolute certainty then p=1.0 is not overconfidence but the correct level of confidence. In reality of course this won't happen but you can get close sometimes e.g. for tokens which are parts of words. Otherwise I don't know what you're on about here, you seem to keep assuming a perfect model which doesn't exist. There's nothing I can say to explain that I didn't already say. Unless a model is trained on the same token sequences repeatedly it should not converge to 1.0 probability.
>all of this just makes me think you don't fundamentally understand what's happening under the hood and you're extrapolating based on your intuition
I'm getting the same feeling here kek maybe we are just bad at talking
>>101260874
well how do you think people infer that their interlocutor is finished? exactly the same way, they wait for a moment of silence that's longer than the ones between words or sentences that they've heard so far. the content of what's being said can be used as a clue but i think you're overestimating the weight of those clues. it might be a naive sleep as you put it, but a conventional algorithm could easily find the break more cleverly without saddling the LLM with this problem.
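That "longer than the pauses heard so far" heuristic can be sketched in a few lines. This is a hypothetical illustration of the idea, not code from any real VAD library:

```python
def speaker_done(pause_ms, prior_pauses_ms, margin=1.5, fallback_ms=2000):
    # Adaptive endpointing sketch: declare the speaker finished when the
    # current silence is noticeably longer than their own typical
    # inter-sentence pauses observed so far.
    if not prior_pauses_ms:
        return pause_ms > fallback_ms  # nothing observed yet: fixed debounce
    typical = sorted(prior_pauses_ms)[len(prior_pauses_ms) // 2]  # median-ish
    return pause_ms > margin * typical
```

A slow, deliberate speaker with 500ms pauses would only be cut off after ~750ms of silence, while a fast talker gets snappier turn-taking - exactly the behavior a fixed sleep(2000) can't give you.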
>>101260307
I wonder how many B of transformers would be the equivalent to Mamba + forcing? looks like they managed to make their models even more efficient than transformers
>>101260958
>still using python for general inferences when there are many native supports
>>101260307
>SSMs are good for anything but text lol
>>101260989
>Unless a model is trained on the same token sequences repeatedly it should not converge to 1.0 probability
you don't seem to understand how neural networks work, but we're in /lmg/ so that's fine
>>101261096
This has almost nothing to do with neural networks, it's common sense. Suppose you have to guess the next word after
>you don't seem to understand how neural networks work, but
There are plenty of ways that sentence could continue. You won't be able to guess with 100% accuracy unless you saw it before. This is true for even the smartest models unless you suppose they have literal godlike intelligence.
>>101261006
you can't solve it with just code
humans don't just start talking when one person goes silent for 2000ms, humans understand when a person is not done yet, or that what they said isn't a full "prompt" yet. If another person goes silent for a few seconds during a dialog with me, i might just nod, or say "uhu", or just wait more. Imagine you are explaining some complex problem to it with a voice. You need to ensure you never make a pause longer than 2000ms or whatever the debounce time is hardcoded to, or you risk triggering the model too early. It feels awful.
>>101260571
Hi fellow tea chad, this test is just for you.
>>101261192
>no puerh
trash
There's a pull request for llama.cpp that I'm really excited about. Will it help if I message one of the devs directly and yell at him to work harder?
>>101261146
you have 0 clue as to what you are talking about
to you, it must sound absolutely mystical that it is even possible for a neural network to generalize to an entire distribution after seeing only a small sample set
when your intuition tells you that it's just a learned hashmap, many obvious aspects of machine learning probably seem like magic to you
>>101261192
AI dude talking about tea like an audiophile kek.
>>101261146
>This has almost nothing to do with neural networks, it's common sense.
>You won't be able to guess with 100% accuracy unless you saw it before.
and on that note, if you used common sense, you'd be able to see that your example holds no water in the context of math
but by all means, keep pretending like you know what you're talking about
>>101261243
Embedding (latent?) spaces are kind of magical, what with the meaning of proximity, direction, etc.
It's one of the coolest things in computing.
>>101261260
real people do this too, people are very serious about their tea
i wonder if it'll ever be possible to rip out the google assistant from android and swap it with some llm
>>101261210
Oh we had that at the place I worked a while ago. It was nice. I enjoyed chrys as well.
Is there a rentry for gemma sillytavern instruct settings?
>>101261210
To be frank it's just not top 10 worthy
Gemma-2-27b-it often fails asterisks. Does this happen to anyone else? Using gemma-2-27b-it-Q4_K_M.gguf by Bartowski from 1 day ago, and the llamacpp http server.
>>101261243
t. soifaced at word2vec and thinks latent spaces are magic
yes anon, I'm sure they stumbled on the deep structure of the universe, that's why their logits are carved so deep, it's not just that they got lazy and ran too many epochs over the same data
>>101261192
fun list, it at least mentions some more variety than a couple models I tested. which model is this?
just had some tie guan yin btw it was good
>>101261389
>Allow me, with all due deference, to present my case and beg for clemency from the Communist Party of China.
kek
decent enough reasoning if you ignore that similar stuff applies to many of the teas it did mention
>>101261441
happens with Q5_M too, not often though
>>101261451
not sure how you read me trashing that guy for probably thinking latent spaces are magic as meaning that I think latent spaces are magic
but you're an idiot, so anything's possible
>>101261441
Are you using rep pen?
Okay, well, either it's broken, or retarded.
>>101261441
>"personal space"
if you needed any more proof the RP training data is by w*men
>>101261549
No, just Universal-light.
>>101261484
Llama-3-TenyxChat-DaybreakStorywriter-70B
it's surprisingly solid for non-RP uses like this. Made a cute mechanic waifu card and I've been just asking shit about different jalopies, audiophile card would probably be fun too actually
>>101261569
>she tells me she's pregnant and has a nervous breakdown
>>101261441
Depends on your card too.
If the card mixes asterisks and non-asterisks for actions the model will mirror it
>>101261597
is it white?
>>101261607
I edited it myself, all conversations are asterisked properly. The card description itself obviously has none, but that works fine with other models...
>>101261441
It seems biased toward producing novel/book-style RP, in my tests (q6k). Even if it starts with asterisks and no quote marks, eventually it will begin using quote marks and narration without asterisks.
i haven't followed the whole thing for a while.
gemma 27b (works now?) for 24gb vram, which quant, which (simple) ui?
it's all so chaotic.
Does gemma 27B work with 12k ctx already?
Running the same thing on corpo servers (I posted the whole system prompt as part of a normal message), here's the outcome:
- it also loses asterisks
- it isn't retarded
So something seems to be broken still in the local implementation. This could be a prompt template issue...
>>101261655
I use Q4_K_M, and SillyTavern as client.
>>101261714
Forgot the screenshot.
>>101261638
I noticed this as well.
>>101261731
>edited
>>101261777
I edited the history to be a copy of the chat I had in sillytavern. The last message is where the clear retardation appeared in local, so that's what I was trying to reproduce.
funny how large parts of the local llm scene are carried by a small bunch of guys who just two years ago were sitting in a discord talking about hentais and pony porn
>>101261858
literally einsteins at the patent office
>>101261858
I think many of them are still doing that now
i hate discordfags so much it's unreal
>>101261858
That's just open source in general.
>>101261858
How is that surprising? Academia has become about risk aversion. So don't expect anything but "NEW DPOPPOOPPPOOPIE FINE TUNING METHOD BEATS GPT-4 ON THIS ONE CHERRY PICKED BENCHMARK. No we haven't actually tried doing anything abstract with it, CHUD"
>>101261898
don't know, some of the hentai discord guys from back then are now releasing base models or doing pioneering work - so a bit more than finetune stuff
>>101261858
kys discord groomer
>>101261575
Why such a high minP?
drop it to 0.01 and see if it helps maybe
>>101262170
I tried neutralizing samplers completely, model is still broken.
What do you use to create a good story and for storytelling?
>>101262217
You ask this every week and get the same response every time
>>101258576
Do we have exl2 Gemma models already?
>>101262217
Mythomax
>>101262282
NEVER EVER
I got a 2060 for basically nothing and have a spare 1X slot to hook it into, next to my 4080. What's the best model to run with this supposed 22GB of VRAM?
>>101262316
It's not a good model anyway.
>>101262318
Mythomax
>>101262342
>>101261883
Kobold discord general (trans friendly)
how is gemma2 doing?
>>101260958
Same issue here. I thought it was coz ooba fucked up the wheel but he did a build just now and it isn't fixed. I haven't analysed the dll yet but it looks like it's in there. If you fix it post it here but I'm probably gonna try to build it myself to see what's up.
>>101262849
I ended up using llamacpp server as another anon suggested.
Find it hard to make gemma 2 27B ERP...
It's a solid model though, follows prompts very tightly
>>101261441
it's retarded
i have no asterisks anywhere, and it randomly starts inserting them for one paragraph, then switches to non-asterisk prose for the second paragraph.
>>101262897
try adding "Do not use asterisks" in the system prompt
>>101262921
>Avoid using asterisks, instead use no asterisks.
t. certified prompt engineer
>>101262935
>Utilize avoidance of the necessary asterisks in order to provide the user with an asterisk-free experience
ok gemma
wtf, on the latest llama_cpp_python (0.2.81) and when using booba on dev, I can't load a model onto the gpu anymore, only the cpu, the fuck? :(
>>101260958
>>101263160
So, do we know if the problem is on llama_cpp_python side or on booba side?
cuda dev is back!
>>101263211
Johannes...
>>101262755
asterisks are messed up and 27b quants are still incoherent
>>101263199
>>101263160
I've been hacking at this all evening. Ooba's wheels are fucked but I was able to fix that up, but I still don't get GPU offloading with GGUF. I don't know how this all works under the hood but it's a pretty early departure from some old logs I have, where pretty early into the loading it starts spitting stuff out of ggml.dll finding the GPU.
I fixed the dll issue by trying to get a prebuilt cuda-appropriate dll from the llama_cpp_python repo directly but that hasn't fixed the GPU bits. Using their ggml.dll didn't help either. I started trying to set up to compile but Windows so...
Current plan is to see if I can figure out what gets sent to the ggml DLL and maybe see how it tries to identify devices. I bet this is some fundamental cuda/torch version incompatibility with the new llama.
So bitnet...
>shit on discord
>blackedposter comes out to play
interesting
>>101263345
Waiting for someone to fund the shit out of it, I think.
>>101258576
What's the best chat bot I can run on 8 GB VRAM now?
>>101263382
this but 4
>>101261883
Pygcord got colonized by r*ddit like 5 days into its existence
>>101263395
why does this happen every single time?
>>101263372
Who would do that?
>>101263382
llama3 8b.
gemma 9b will be better when everything is working properly.
Two weeks more until the llama2 anniversary.
>>101263150
Love...
>>101263435
apple maybe. they'd stand to benefit most
>>101263487
m4 max 500GB with native bitnet processors when
>>101263260
I want to use ooba (it's the only interface I can stand) but they've clearly given up and entered maintenance mode, it's taking them increasingly long stretches of time to implement popular new models even on the dev branch
>>101263599
I think part of it is just the amount of people involved and how they're a bit less reactive than they were when this was all new. First the llama cpp people have to get their stuff together, then Mr llama cpp python has to do the same, then ooba has to build his wheels which takes years, and that's outside of any actual code changes required to support new models.
I've managed to lose the ability to offload any model so I'm gonna try a fresh install. Some new wheels that seem to fix the old dll issue are dropping so that might fix something for someone.
>https://x.com/alignment_lab/status/1808634784136245446
>All content is safe for work, filtered using Reddit's moderation metadata.
Is this also how Alignment Lab make their "uncensored" finetunes? (i.e. if you remove NSFW from the training data, you don't have to add refusals)
Couldn't they have added NSFW quality markers? Nah...
>>101263448
>llama3 8b
Sadness.
Is there an 8B spin that isn't pants-on-head?
I don't know what one would do to fix it, but maybe there's some way to hybridize it with an Encarta CD or something to make it 700 MB larger and not so stupid.
>>101263794
I don't know what your benchmark for stupid is, but you can try qwen2 7b, yi 1.5 9b, and a couple of l3 fine tunes like iterative-dpo, sppo, and stheno v3.2.
>>101263679
>600M
Finally, the GPT-4chan killer.
>>101263831
>I don't know what your benchmark for stupid is
I'm the music-theory-question-to-test-models anon, so my benchmark is talking about some notes without fucking up. Few can.
L3-sppo failed at Q8_0 and f32.
Stheno I took the time to note as "X fail badly" so it must've been atrocious.
I don't think I've seen a DPO so I'll give that a try.
I also haven't tried the Qwen and Yi smalls, so I shall. Yi did get the music question right. Qwen only passes with the K_S phenomenon in effect, the _M's blow it.
>>101263658>>101263599>>101263160>>101260958So I finally fixed it. I will admit it involved a fresh install to get 3.11 (my old env was 3.10). The 3.11 windows wheel from Ooba still seems fucked so there's that. What I did:
Fresh ooba install on dev branch (3.11)
Manually install the appropriate wheel from llama_cpp_python
Manually install the fresh wheel from ooba's latest build
Copy the llama dll from llama_cpp_python over ooba's llama_cpp_cuda one
Then it all magically started working. I had to remember to set the --gpu-memory in CMD_FLAGS too.
Whenever I try to run a model that was quantized using imatrix, koboldcpp's memory usage goes through the roof, like it's duplicating the model into both RAM and VRAM at the same time, or something. Non-imatrix models work perfectly fine.
Pic rel is attempting to load bartowski's gemma-2-27b-it-Q6_K.gguf (20.8GB).
Exact same thing happens when I tried a Mixtral imatrix model earlier, but non-imatrix Mixtral works fine.
Played around with different launch settings, and it still happens.
Am I misunderstanding something about imatrix quants? Any reason why this might be happening?
I saw the chinese have their own evals (opencompass, cmmlu, c-eval etc) but where is the chinese ayumi? I want to know what models they're ERPing with
>>101264042I'm sure there are plenty of useless chinese benchmarks. Take your pick to see the ayumi equivalent.
>>101263936I fixed it as well by installing the wheels myself:
set CMAKE_ARGS=-DLLAMA_CUDA=on
pip install llama-cpp-python
>>101263345Multi token bitnet models are coming...
bitnet jamba retnet with multi token prediction.two weeks.
>trying to quantize output/embed layers in order to do a test>for some reason it's not quantizing to the proper data type>"Huh, is it because I used uppercase? No way they accidentally made this case sensitive and instead of using upper case like their documentation says, they used lower case.">try typing it in lower case>it works...
>>101264029Could be context size? I just tried that exact model on my 4080 and it blew up @ 8k context but runs nicely @2k. Same profile - @ 8k it started slurping up like 20gb of regular ram and topped out the vram, but 2k context everything fits correctly.Also I see you're running 5 layers on the CPU, that'll make it slow (or at least it does for me).
>>101264072The fact that this can work so well and so badly at the same time is impressive.
>>101261239No but it would help if you pull the pull request to local, compile it, test the code out, and give them feedback on your experiences.
>>101264065It's insane how much VRAM gemma-27b-Q5_K_M is asking for, I'm at 30gb of vram used, something's not normal at all
>>101264113>Could be context size?I just tried 2k context, same deal.I'm aware of the layers, but I get pretty good speeds with Mixtral (non-imatrix) .Q4_K_M with 25/33, so I figured 42/47 for gemma would be fine. I haven't used gemma at all yet so wanted to try a smarter model, I'm not worried about speeds, I just need to know why the memory usage is so high. There's nothing in koboldcpp's github about problems with imatrix quants.
>>101264029>>101264151One thing is that gemma doesn't support flash attention, which can cut context memory by half.
>>101264167I'm the first quote, flash attention on/off doesn't affect memory usage, even when lowering the context. Also as I said, I get the same problem with Mixtral imatrix when non-imatrix Mixtral works fine.
https://huggingface.co/bartowski/gemma-2-27b-it-GGUF
>Prompt format
<start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
Be careful about that, it's wrong and it should be:
<|START_OF_TURN_TOKEN|><|USER_TOKEN|>{prompt}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
>>101264232they're right, what you posted is wrong, that's the command r format...
>>101264232https://huggingface.co/unsloth/gemma-2-27b-it/blob/main/tokenizer_config.json#L1747>"chat_template": "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message['content'] | trim + '<end_of_turn>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}",
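For anyone wiring this up by hand, the Jinja template above boils down to a simple string format. A minimal Python sketch of the same logic (not HF's actual `apply_chat_template`, just a plain reimplementation for illustration):

```python
def gemma_prompt(messages, add_generation_prompt=True, bos_token="<bos>"):
    """Render a message list into Gemma 2's chat format (mirrors the Jinja template above)."""
    if messages and messages[0]["role"] == "system":
        raise ValueError("System role not supported")
    out = bos_token
    for msg in messages:
        # the template renames 'assistant' to 'model' in the turn header
        role = "model" if msg["role"] == "assistant" else msg["role"]
        out += f"<start_of_turn>{role}\n{msg['content'].strip()}<end_of_turn>\n"
    if add_generation_prompt:
        out += "<start_of_turn>model\n"
    return out

print(gemma_prompt([{"role": "user", "content": "Hello"}]))
```

Note the user/assistant alternation check from the real template is omitted here; the point is just that the format is `<start_of_turn>role`, newline, trimmed content, `<end_of_turn>`, with `<bos>` at the very front.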
>>101264270>>101264279Oh my fucking god I was running aya-23b all this time... I should get some sleep, my bad :(
>>101264290Heh.Is it any good at least?
>>101264294I was about to say it's weird that gemma-27b wasn't able to plot something as simple as a cube on matplotlib, and then I realized it was aya-23b, so I guess you get the answer right there kek
/lmg/bros, wtf are you even using gemma for. Is it coomable? Just messing with it?I understand it's an impressive model I just want to hear the usecases rn.
>>101264321It's the current cope for 24gb poorfags
>>101264279
>bos_token
on booba the "bos" thing is explicitly written, is that good?
>>101264338What I've gleaned is that <bos> is required and that Kobold does it automagically but other interfaces might be lacking it causing substandard performance with Gemma.
>>101264338I think it depends on the backend you are using.llama.cpp I know adds the bos token automatically, so that might be bad.Somebody please correct me if I'm wrong.
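The failure mode being discussed is the frontend's template text containing <bos> while the backend also auto-prepends it, giving you two. A hypothetical helper (not any backend's real API) showing the dedup a frontend would need if it can't trust the backend:

```python
def ensure_single_bos(prompt, bos="<bos>"):
    # strip however many <bos> copies are already at the front, then add exactly one,
    # so a frontend template and a backend that auto-prepends don't stack duplicates
    while prompt.startswith(bos):
        prompt = prompt[len(bos):]
    return bos + prompt

print(ensure_single_bos("<bos><bos><start_of_turn>user\nhi"))
```

In practice the cleaner fix is knowing what your backend does: if it adds <bos> itself (as llama.cpp reportedly does), leave it out of the template entirely.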
>>101264314All right, I did the test on the actual gemma and it worked, fucking finally lol
>>101264365But Gemma needed two shots at my programming test, which was the nightmarish challenge of correctly returning a string that describes the sign of a double.
>>101264396I don't know if you're being sarcastic or if the coding challenge is actually hard to do kek
>>101264411I haven't found a model that can get it right on the first request.
>>101264424Even the API models?
>>101264433This is /lmg/. Though I guess I could throw it at Copilot, that's freely available, right?
>>101264452yep, you can use bing freely, and I like that one because it searches the internet for up-to-date trivia
>>101264452technically everything is freely available on lmsys.
>>101264477you get some limited runs with it though, with bing it's unlimited
>Fire up Gemma2 for the first time>Try generating with usual ERP card>First generation has "a shiver runs down her spine"Why does every language model have this shit overtrained in so strongly? Does literally every erp author write this phrase?
Gemma-27b is quite good at french; besides Mixtral, it's probably the only one that isn't just good at english. But unlike mixtral, gemma has no problem playing bad characters, it's not as bland. I'm really impressed by Google, who the fuck expected something like that seriously? lmao
Are there decent models I can run on a CPU? I have a terabyte of DDR4 RAM in my server if that helps
>>101264497gpt4 makes up 50% of the erp that's currently on the internet
>>101264509How many channels? Or rather, what's your total memory bandwidth? Do you have an nvidia gpu to do prompt processing?
>>101264452Copilot completely blows it. The local models at least did the three classification steps I ask for. Copilot does one and returns. Lazy bastard.And when I asked it to correct the problem I think it hallucinated an implementation detail of Double.compare that isn't in the documentation to pretend that it would be a fix.I was hoping that Copilot would get it right and I would have that as a standard that describing a number's value and sign was doable by LLM, but I guess not.
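For reference, the classification itself is only a few lines once the floating-point edge cases are spelled out. A plausible Python version of such a test (the anon's exact spec is unknown; -0.0 and NaN are assumed to be part of it, since that's where Double.compare-style subtleties live):

```python
import math

def describe_sign(x: float) -> str:
    # NaN compares false with everything, so handle it before the comparisons
    if math.isnan(x):
        return "not a number"
    if x > 0.0:
        return "positive"
    if x < 0.0:
        return "negative"
    # x == 0.0 here, but it may be -0.0; copysign is how you tell the zeros apart
    # (the kind of detail Java's Double.compare also has to handle)
    return "negative zero" if math.copysign(1.0, x) < 0 else "zero"
```

The trap for LLMs is exactly that last branch: `-0.0 == 0.0` is True, so a naive three-way comparison silently merges the two zeros.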
The fuck is Q8_0_L? I thought Q8_0 was virtually lossless, does this quant work on the regular llama_cpp?
>>101264602It's a literal meme.Not by the "creator"'s design, but by his actions.
>>101264571what copilot are you using? there are 3 types of copilot if you're connected to bing
>>101264524Dual E5-2667v4 & 16x 64GB DDR4-2400 ECC RAM. CPU spec says 4 channels & max bandwidth of 76.8GB/s per CPU.
Does Gemma need repetition penalty or nah?
Okay, surely it's fixed now.
>>101264524And I don't have a GPU. Should I get one if I'm only interested in fine-tuning and inference?
>>101264643What? Like 0 GPU? You're not playing games in your free time anon?
>>101264613Internet / copilot microsoft com I think it was. I'm on Linux right now so I don't have the desktop button.
>>101264602Q8_0_L is 0ww's hybrid. I think what he's doing is expanding the old _M and _L technique beyond just adding one or two points of Q, to instead run f16 or f32 on some layers. The last I heard about it, Q8_0_L did test out as slightly better, but so slightly that there's no reason to care about the difference.
What's the best model for 8GB VRAM and 16GB RAM?
>>101264695Phi 3 mini
>>101264602It works on llama.cpp. It was shilled by a random guy who swore that it gave better results, but, to my knowledge, he didn't provide proof of it. PPL tests and all that revealed that it is barely better than Q8_0, but it didn't justify the file size increase.
>>101264708>>101264682I see, guess that I'll stick to Q8_0 then, 1gb of memory is huge at that point, I don't wanna waste it
>>101264695Gemma2 9b
>>101264695You can probably swing a medium quant of a 7-8B kind of model. I think Qwen2 has a super small 500M edition but who knows if it's worth anything. Looks like it's 2GB at f16, half a gig at Q8.>>101264738Sounds like the right call to me.
>>101264649Free time? There is work time and shitposting time. Should I just get a few P100s?
>>101264704I mean 8Gigs of vram and 16gigs ram>>101264746Isn't that still super broken and being figured out?
Hmm... I'll try SLERPing it with the parent model and see if that yields a better result.Please reply.
>>101264769>Isn't that still super broken and being figured out?There are still issues (the sliding window thing I think is still flaky) but it does function on updated Kobold.
Gemma seems pretty decent. My rude tsundere character is being a lot more mean to me than usual, I like it
>>101264509
>Are there decent models I can run on a CPU? I have a terabyte of DDR4 RAM in my server if that helps
The best CPU models are MoE, and the best MoE is mixtral 8x22b Wizard LM. It's one of the best overall models out there at any size (but prose is a bit dry).
>76.8GB/s per CPU
Ouch... ok, it's gonna be slow. Anything is. Even a middling GPU is 10x faster for inference. Check the OP build guide for the cpumaxx build for numa options to make it better on 2 sockets.
>>101264643
>I don't have a GPU
You should get one. Prompt processing is garbage without it, and forget fine-tuning.
>>101264770>Please replylmao
Just impregnated the slowburn maid today. I love AI bros...
gemma-27b-it is really impressive, can't wait to see how good gemma-27b-SPPO will be
I did some extensive KLD tests to answer a few questions:
Are Bart's quants consistent with locally made quants (expected: yes)?
Is there a difference between quanting from a bf16 GGUF vs fp32 (expected: no)?
Are the KLD results for L quants correlated with MMLU results (maybe)?
Why answer the first question? Because when I downloaded his quants, I noticed that they did not have the exact same MD5 hash as my own quants. As for the second question, the motivation is to check if the quantization script is really doing its job properly, since someone suggested it could be the case that quantizing from FP32 is important.
Results:
Yes to the first two questions. The KLD numbers are the exact same for locally made quants versus Bart's. And quants made from fp32 get the same numbers as quants made from bf16.
For MMLU correlation, well, he hasn't finished running his tests, but so far it seems that maybe the answer could potentially be a no. Bart's hypothesis is that for the output+embed layers, Q8_0 is better than FP16 when the original was in BF16. However, through these extensive KLD tests, it seems that FP16 does, overall, generate token probabilities that are closer to the unquantized model, compared to Q8_0, even if the difference is very small.
How it could be explained that KLD does not correlate with MMLU: We've actually seen multiple times that quants outperform their original models when tested against some particular benchmarks, so it's entirely normal, but how it happens could be due to bias in the benchmarks (and/or bias in the quants). Quants essentially add noise to weights, which means it could improve some knowledge while damaging other knowledge. This means that if the knowledge it improves happens to align with a benchmark's test formatting, subject areas, etc, then it would boost scores on those.
1/2
>>101265037Practical conclusions:
Bart's quants are entirely fine to DL, and if you want to make your own, you can do that as well, without caring about whether you do it from BF16 or FP32.
Regarding L quants, don't care about them. They're virtually the same as non-L. And if you ever spend more memory on a model, spend it on upgrading to a different quant instead, which will massively improve quality compared to spending it on an L. But you may get L quants for Q8_0 if you have a bit more memory and are an audio- I mean AIphile, as it is technically still an improvement, just very small.
Raw data: pastebin 0XHLeAKH
2/2
>>101265037>>101265051Oops, I forgot to link the MMLU results (so far) from Bart.https://www.reddit.com/r/LocalLLaMA/comments/1du0rka/small_model_mmlupro_comparisons_llama3_8b_mistral/lbdi2pi/
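For context, the KLD metric in these tests compares the quant's next-token probability distribution against the full-precision model's at each position, then averages. A minimal sketch of the math (not llama.cpp's actual implementation):

```python
import math

def kl_divergence(p, q, eps=1e-10):
    # KL(P || Q): information lost when Q (the quant's token probabilities)
    # is used to approximate P (the full-precision model's)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mean_kld(p_dists, q_dists):
    # average over token positions, which is roughly what gets reported
    return sum(kl_divergence(p, q) for p, q in zip(p_dists, q_dists)) / len(p_dists)

print(kl_divergence([0.7, 0.2, 0.1], [0.7, 0.2, 0.1]))  # identical dists -> ~0
```

Zero means the quant's token probabilities match the original exactly; higher means more divergence, which is why identical KLD numbers between two quant files is strong evidence they behave the same.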
>that title bar
>>101265037>Results:>Yes to the first two questions.Ok I screwed this sentence up. It's supposed to be "Yes, no, maybe not". I forgot that the second question had an inverted answer.
>>101264875Thanks, those guides seem pretty useful
>>101264770what model i wasn't following the discussion
Any new ERP models that don't fall into the typical pitfalls of previous models and chatgpt4 like using "tantalizing" and other typical shit?
>>101264029Update: The problem goes away if ALL layers are offloaded to GPU. If even a single layer runs on CPU, it seems to load the entire model into both RAM and VRAM. Only happens with imatrix quants.
>>101265194Command R+ and Gemma are the only ones that can consistently avoid that in my experience. But they have their own quirks. It's because that's just how female authors write sex scenes.
>>101264983What model and more importantly for slowburn what context length?
>>101265209I'll try those thank you
>>101265037>Quants essentially add noise to weights, which means it could improve some knowledge while damaging other knowledge.Thanks for saying this. This is a key fact that not all are aware of when comparing quants and citing sub-percent improvements on benchmarks.
Does gemma work with sillytavern? What context/instruct should I use?
>>101264365>Lmao I've tested that on llama 8B fp16 and it got it right the first time
>>101264513>>101264497Sounds to me like the solution is to literally not finetune on "erp" at all and rely on the model's basic behavioral awareness and prompting to get it to play the game.
OLLAMA OR KOBOLDCPP?!
>>101265375booba or boohboo or whatever the fuck it's called
>>101264497Not just erotic fiction or romance novels but low quality fiction in general is full of the cliche phrases that people complain about language models outputting. And bad novels dwarf the high quality material in terms of quantity. Most authors aren't Dostoevsky (though to be fair you probably wouldn't want your smut/ERP to be in Dostoevsky's writing style, either).
>>101265180Lora on Tenyx-Daybreak
>>101265439A lot of the so called good stuff is overrated Reddit garbage. Unironically some of the best prose for this stuff comes from my little pony fan fics. Unironically.
>>101265375llama.cpp server + mikupad or silly tavern
>>101265474I regret to inform you that you are retarded.
>>101265375Why would you use ollama when it's still broken? It hasn't pulled llama.cpp updates in 2 weeks.
>>101264497It is a very popular phrase with shitty fanfic writers
fuck gemma, too slow for me (7 T/s) on 8GB vram
C-R/+is still better than Gemma. What a shame. Canadians remain undefeated.
>>101265514Someone said it manages gemma just fine
>>101265672Go back to /r/localllama.
>>101265375koboldcpp for just werksoobabooba if you need models that aren't in gguf format
Looking for a way to run models on Linux headless, no X, no nothing. Llamafile needs a glibc that my LTS Ubuntu doesn't get, otherwise it runs CPU-only. What is the recommended tool, then?
>>101265797llama.cpp or koboldcpp
>>101265836is there any difference? i am only looking at gguf models, but different bases. some mixtral, some llama, some phi etc
>>101265504ponyfags made the best art model currently available, maybe he's onto something
>>101265797Every backend does that.........
>>101266179Gemmy!
Whats the opendevin alternative that can run in a docker but not require windows or shit. The whole reason for running in a docker is so that I can run on any OS/machine without worrying about dependencies.
>>101266179Not sure what your problem is anon.Maybe you are using 70b+ models?I only have enough vram for around 30b and lower.Its certainly the best in the range for me. Its a huge step up.What model do you think is better?
Well that was certainly an interesting result...
>>101266242If all you're saying is that it's the best in its size class, that's fine and we have no beef, that's a reasonable thing to think. My post was more for the people claiming it's better than huge models
>>101266179it's only a matter of time before small models perform at the same level as gigantic ones do, rendering your 10k$ gpu cluster useless
>>101266262well this has devolved into a rather interesting argument.
>>101266290>conveniently forgets all the months you/they complained and whined about no one giving any good models to 24 GB VRAMlets and only catering to the ultra low or ultra highNot that anon but there's always going to be a range of models for all hardware. At some points there will be a bit of lagging behind in any one range but it likely won't be for long.
>>101266281It is better than some huge models in the 70B range in the same way Llama 3 8B beats L2 70B in certain aspects but there are just some things you have to scale up the parameter count to accomplish.
>>101265866llama.cpp gives you the bare essentials. If that's not enough, koboldcpp adds another layer of stuff on top.
Well looks like Bartowski finished his MMLU Pro test and the results are that Q8 is as good if not better than FP16, or it's within margin of error (he hasn't and won't see our KLD tests lmao), so he's now deciding to still have L quants except have them be Q8 instead of FP16 for the embed and output layers. It's an OK decision, but we already had enough quants, and now he's going to make and upload more. HF will probably love that. But perhaps it's worth it for some people that want a bit more granularity to choose from to get a perfect fit in their memory. It is interesting that it seems like having the embedding layer be Q3 ("default" in this graph) significantly impacts economics knowledge, possibly beyond margin of error. Q3 is pretty damaging though, so it could be expected, that something has to suffer. It's just that in this case it happens to be economics.
>>101266919How much size are you sacrificing by keeping the output layers in Q8? It's several GB with the current strategy of using FP16 but cutting that in half would still mean things could easily fall outside of your VRAM limit and is it worth that little increase to do it? Hard to tell.
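The size at stake is easy to ballpark: the embed/output tensor is vocab × hidden, and Gemma 2's 256k vocabulary is what makes it balloon. A sketch, assuming Gemma 2 27B's dims (vocab 256000, hidden 4608; check the config.json if the exact numbers matter to you):

```python
def tensor_gb(rows, cols, bits):
    # bits per weight -> bytes -> GiB
    return rows * cols * bits / 8 / 1024**3

VOCAB, HIDDEN = 256_000, 4_608  # assumed Gemma 2 27B dims
fp16 = tensor_gb(VOCAB, HIDDEN, 16)
q8 = tensor_gb(VOCAB, HIDDEN, 8.5)  # Q8_0 costs ~8.5 bits/weight incl. block scales
print(f"fp16: {fp16:.2f} GB, Q8_0: {q8:.2f} GB, saved: {fp16 - q8:.2f} GB")
```

So FP16 output+embed costs roughly 2.2 GB on this model and Q8_0 about half that, i.e. halving the L-quant penalty to around 1 GB, which is exactly the "is it worth it" margin being argued about.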
what the fuck am i doing wrong
i am using kobold and have told it to offload to gpu
its applying SOME to the gpu, but its generating at less than 2 tokens a second
>>101267097Without more info no one can help you.
>>101267107good call. what is useful info?
>>101267110That differs. Start by giving SOME information and seeing if anyone spots anything out of order. E.g. what command are you executing? Or if you use a GUI, show a screenshot of the settings you start koboldcpp with (btw, you are using koboldcpp, not koboldai, yes? Also useful info), what OS are you on, which version of koboldcpp, which model/quant are you trying to load (how big is it in total), etc.
https://huggingface.co/grapevine-AI/CALM3-22B-Chat-GGUF
non-official gguf ver for vramlets
only three variations
>>101267127
>what command are you executing?
./koboldcpp --model dolphin-2.5-mixtral-8x7b.Q8_0.gguf --usecublas
>(btw, you are using koboldcpp, not koboldai, yes?
yes
>what OS are you on
Ubuntu 20.04.6 LTS
>which version of koboldcpp
Welcome to KoboldCpp - Version 1.69.1
>which model/quant are you trying to load
dolphin-2.5-mixtral-8x7b.Q8_0
>(how big is it in total)
49.6gb
>>101267164./koboldcpp --model dolphin-2.5-mixtral-8x7b.Q8_0.gguf --usecublas --gpulayers 99 --debug --contextsize 8192(tweak gpulayers and contextsize to whatever you can fit)
>>101267163what's this? will this make my mesugakis sex big?
>>101267029It could make a difference at Q2-4 levels. If you happen to have the VRAM and the next non-L step up is farther away than you can fit, then it might make sense to choose an L quant. So I guess it's just the same as before, you choose the biggest quant you can fit.
>>101267163Is this better than CR+? Actually what is the current best local model for Japanese anyway?
>>101266357So the model sucks for RP at any rate, and it's not as interesting with a bare chat template prompt. But going back to my ST assistant card, I prompted it asking for a nu-metal song about machines becoming conscious and taking over the world. Other than that, I prompted back and forth with it, asking for its stylistic guidance anywhere I thought it could be applied (genre tags, location of guitar solos, changes in vocals, title, image prompt for cover image), and this is what we came up with (after melting 400 suno credits just to make it all work):
https://suno.com/song/f4fbf0c2-04cd-4f9b-bb05-53d8c6c2b14f
I think it did a pretty good job.
>>101267240Which model is this again?
>>101267179nice one thank you, i was able to stack up 15 layers with this context size. in this instance what are layers and what is their relationship with context? kobold is telling me i think this wants 33 layers (which i cannae fit)
>>101267210I would say Gemma 2 27b if you are using it for machine translation
>>101267253qlora I ran on Tenyx-Daybreak with a private dataset.
>>101267255Context size lets you fit more stuff before the thing starts truncating you. The model will only ever be aware of stuff that is within context, so if you go beyond that threshold, the model will start to forget shit. That's not the end of the world though. More context size == more VRAM. Dropping this may mean you can put more gpu layers on, but as said, it means less "memory". GPU layers will simply speed it up. If you're happy with your current speed prioritize context size. If not, drop context size and see if you can put more layers on to make it faster. And if all else fails, go find a smaller quant.
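The context part of that tradeoff can be estimated: the KV cache stores a K and a V vector per layer per token, so its size scales linearly with context length. A rough sketch with assumed Gemma-2-27B-ish dims (46 layers, 16 KV heads, head_dim 128; treat these as illustrative, not gospel):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context, bytes_per_elem=2):
    # 2x for K and V; fp16 cache by default (bytes_per_elem=2)
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem / 1024**3

# assumed Gemma-2-27B-ish dims: 46 layers, 16 KV heads, head_dim 128
print(kv_cache_gb(46, 16, 128, 2048))  # ~0.72 GB
print(kv_cache_gb(46, 16, 128, 8192))  # 4x the context -> 4x the cache
```

This is why dropping context frees VRAM for more GPU layers: going from 8k to 2k context on dims like these gives back a couple of GB.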
Came here because someone said Gemma 9B was better than Llama3 7B, is that true?
>>101267285For RP, yeah.
>>101267285It's bigger and with a better dataset so yes at least for 4k context. It's still undecided yet whether it's good at 4k-8k since the major backends haven't updated to fully support the SWA feature of the model yet.
>>101267163
>caution!
>This GGUF is a "provisional version" that cannot fully deliver the model's original performance.
>This is because llama.cpp, as of July 3, 2024, does not support the CALM3 model's specific pre-tokenization (i.e. preprocessing).
>As a workaround, it has been modified to use llama.cpp's default pre-tokenization, but it is extremely likely that this degrades the model's performance.
Apparently gguf'ing it might have made it dumb because it uses some special 'pre-tokenization' llama.cpp doesn't support.
Interesting. Claude's injecting a prompt that asks itself to have an internal monologue in <thinking> before responding coherently. The output is invisible to the normal user because the response is sanitized of the tags, but with the tag prompt hack, you can see the inner thoughts.
>>101267285If your main concern is RP and you can fit 9B but nothing higher then you'll be better off with Stheno 8B
Why is Gemma2b so bad? It can barely make 2 coherent sentences before it goes to endless repetition loop.
>>101267435I hate this shit, even if it gives normies better results for their retarded prompts. This is why I always stick to direct APIs rather than consoomer interfaces.
>huggingface weight downloads getting slower and slower the last 2 weeks
>doesn't seem to be my internet, still getting max line speed from everywhere else
wonder if they're running out of cash finally, hope whatever they do to try to get profitable isn't too retarded
>>101267437How come
>>101267363Downloading it now. Will test.
>>101267449it won't be long before they try to limit the api to certain countries only while blocking others
>>101267442>https://github.com/ggerganov/llama.cpp/pull/8248It was unusable when i tested on release as well. Apparently, the tokenizer has been broken for it this whole time. That PR was created just yesterday. You may be able to test it by the end of the day.I doubt it's better than phi-mini in the tiny range, though.
>>101267272How much headroom do I need to leave on the gpu for it to be functional? I am pushing the layer count as high as I can, but about two/three messages in, I run out of vram. I don’t understand why that is occurring, I thought once the model is loaded into memory that is what was worked on? Do I need to cap it at halfway or something?
>>101267435Huh. I learned something from aicg for once. How can chain of thought be implemented in sillytavern for local RP gens? /gen [Think about stuff] | /sendas name="{{char}}" maybe... I am not sure how sillytavern handles chain of thought.
/gen [Think about stuff] | /sendas name="{{char}}"
>>101267546
>I don't understand why that is occurring
The model weights take up X amount of space, but the context (your prompt) takes up Y amount (much lower than X, but non-trivial) of space too, and that must be allowed for. As your chat gets longer, Y gets larger and larger because the prompt you are sending to the model is getting longer.
>>101267574>he doesn't know about hidden textoh no no noyour bots have been saying all sorts of things about you without you knowing, bro
>>101267584>>101267546also regarding your "halfway" question, no that is too much. just tinker, you'll get a feel for what your hardware can manage eventually
>>101266919>"content": "You are an knowledge expert, you are supposed to answer the multi-choice question to derive your final answer as `The answer is ...`.",mememarkers could learn a thing or two from expert rpers
>>101267648Kek. And this is supposed to be the most well-respected and used benchmark in the field.
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Modelshttps://arxiv.org/abs/2407.01906>Parameter-efficient fine-tuning (PEFT) is crucial for customizing Large Language Models (LLMs) with constrained resources. Although there have been various PEFT methods for dense-architecture LLMs, PEFT for sparse-architecture LLMs is still underexplored. In this work, we study the PEFT method for LLMs with the Mixture-of-Experts (MoE) architecture and the contents of this work are mainly threefold: (1) We investigate the dispersion degree of the activated experts in customized tasks, and found that the routing distribution for a specific task tends to be highly concentrated, while the distribution of activated experts varies significantly across different tasks. (2) We propose Expert-Specialized Fine-Tuning, or ESFT, which tunes the experts most relevant to downstream tasks while freezing the other experts and modules; experimental results demonstrate that our method not only improves the tuning efficiency, but also matches or even surpasses the performance of full-parameter fine-tuning. (3) We further analyze the impact of the MoE architecture on expert-specialized fine-tuning. We find that MoE models with finer-grained experts are more advantageous in selecting the combination of experts that are most relevant to downstream tasks, thereby enhancing both the training efficiency and effectiveness.more effective the higher number of experts the models has so those 16/32/64 models will benefit
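The selection step in ESFT reduces to: rank experts by how often the task's tokens route to them, keep the smallest set covering most of the routing mass, freeze the rest. A toy sketch of that selection (not the paper's actual code):

```python
def select_experts(routing_counts, coverage=0.9):
    """Pick the smallest set of experts whose activation share reaches `coverage`;
    in ESFT terms, everything outside this set would stay frozen during fine-tuning."""
    total = sum(routing_counts.values())
    chosen, acc = [], 0.0
    for expert, count in sorted(routing_counts.items(), key=lambda kv: -kv[1]):
        chosen.append(expert)
        acc += count / total
        if acc >= coverage:
            break
    return chosen

# a concentrated routing distribution: a handful of experts dominate the task
counts = {"e0": 500, "e1": 300, "e2": 120, "e3": 50, "e4": 30}
print(select_experts(counts))  # the top 3 already cover 92% of the routing mass
```

The paper's observation that routing is highly concentrated per task is what makes this viable: with finer-grained experts, the chosen set is a small fraction of total parameters.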
I just downloaded the new phi model and wtf, did they still not fix the token trimming issue? Literally it trims spaces and newlines after the special tokens, except the fucking instruct format literally uses newlines after the special tokens. Are they retarded?
I have 9GiB of RAM and "AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx". What model would be the best to run? Would any run at all? I want it to analyze some of my writings (gpt/claude does well, but it's a paper, so it's not good to leave the content with OpenAI or Anthropic), like summarizing its "understanding".
>>101267682>theywho?>Literally it trimsit? what?>they WHO? Microsoft, llama.cpp, ollama, transformers, kobold.cpp?
>>101267715
9GB? Weird number. You can run llama-3-8b and gemma2-9B with llama.cpp at Q5_K or probably even higher, though it's not going to be too fast. No GPU at all?
llama-3-8b seems a little more stable than gemma2-9b, at least for now.
>>101267768the globalists, duh
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attentionhttps://arxiv.org/abs/2407.02490>The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up prefilling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Milliontokens Inference), a sparse calculation method designed to accelerate pre-filling of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices-the A-shape, Vertical-Slash, and Block-Sparsethat can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy. https://github.com/microsoft/MInferencecode is up. looks like they added in support for vllm at least.
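Of the three patterns, A-shape is the easiest to picture: every query attends to a few global "sink" tokens at the start of the sequence plus a local causal window. A toy boolean-mask sketch of that pattern (nothing to do with MInference's actual GPU kernels):

```python
def a_shape_mask(seq_len, n_sink=4, window=8):
    # mask[i][j] is True where query i may attend to key j (causal)
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(i + 1):                 # causal: only keys up to position i
            if j < n_sink or i - j < window:   # global sink tokens + local window
                mask[i][j] = True
    return mask

m = a_shape_mask(16)
# kept entries grow linearly with seq_len instead of quadratically
print(sum(row.count(True) for row in m))
```

Each row keeps at most n_sink + window entries, so the attention cost becomes O(n) instead of O(n^2), which is where the claimed pre-fill speedup at million-token contexts comes from.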
>>101267682Ok so I investigated the issue more and it looks like the model literally just generates without newlines after a special token, despite their readme showing newlines in the prompt format. So this means that what they trained on isn't the format they're telling users to use. For fuck sake.Also I had a look in the config and it says a sliding window of 2k. What? Does this use SWA? And it's only 2k? >>101267768They as in Microsoft. It as in really any program, it does this in both llama.cpp and transformers. But I went digging and found that someone also brought this issue up and it looks like it's an option that can be set in tokenizer.config. So now it's fixed, but you have to do it manually.They really couldn't just spare a bit of their day to put a note into the readme about this.
what happened to that 1.5Bit thing?
>>101267823
*tokenizer_config.json
I need to sleep.
memory-holed
>>101267546
Also try turning flash attention on (--flashattention) and play with quantizing the KV cache (--quantkv=0/1/2, where 0 means 'keep as 16 bit', 1 means 'quant to 8 bit', and 2 means 'quant to 4 bit').
>>101258689
Yep, magnum and euryale are pretty fucking dumb compared to miqu or midnight miqu, but midnight miqu is so damn dry compared to them, so I've mostly been settling for l3 70b euryale.
>>101259283
Uhh what? What 70b l3 fine-tunes are good? The only ones I know about are euryale and story writer. Euryale is lewd as fuck and filthy, but it's dumber than miqu, which has been out a long time.
>>101267656
Just one guy's shoddy script. The reference MMLU-Pro eval code (a new thing, distinct from MMLU) has a sane prompt and uses CoT properly: https://github.com/TIGER-AI-Lab/MMLU-Pro
Is it safe to do picrel if I'm powerlimiting to 300 watts?
>>101268027
>is it safe to plug in all 3 pcie x8 cables to my gpu?
/g/ - Technology
>>101268027
You should check your power supply specs. So far, you're the only one who knows which one you're using.
>>101265375
>2024
>still using GEGGOOFS
>>101264029
>>101265195
That should not be happening. The importance matrix is used during quantization to better determine which model weights should be prioritized in terms of precision, but apart from the numerical values the resulting quantized model should be exactly the same.
>>101264185
Check the console log; with Gemma, FlashAttention gets turned off regardless of what the user specifies.
>>101265037
>Quants essentially add noise to weights, which means it could improve some knowledge while damaging other knowledge.
Fundamentally, adding noise to a signal always results in a worse signal; check the asymmetry of the token probability percentiles and the mean Delta p.
However, while on average the probability of a correct token prediction will decrease, for individual tokens the probabilities can randomly be better.
For >= 4 BPW the change in token probabilities is mostly symmetrical, so the effect of quantization is comparable to increasing the temperature.
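The "quantization is roughly symmetric noise" point is easy to see with a naive round-to-nearest quant in numpy. To be clear, this is just an illustration, not how llama.cpp's block-scaled quants or imatrix-weighted rounding actually work: per-weight errors go in both directions and average out near zero, but the RMS error, i.e. the damage to the signal, is strictly positive.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)  # stand-in "weights"

def quantize_rtn(w, bits=4):
    """Naive round-to-nearest quantization with a single scale.
    Real GGUF quants use per-block scales (and, with an imatrix,
    importance-weighted rounding); this just shows the noise behavior."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

err = quantize_rtn(w) - w
print(f"mean error {err.mean():+.5f}, RMS error {np.sqrt((err**2).mean()):.5f}")
```

The mean error lands near zero (individual weights are as likely to be nudged up as down), while the RMS error stays well above it, matching the post: some tokens randomly get better, but the average prediction gets worse.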
WHY when you offload some layers to GPU the WHOLE model is still in RAM?
i don't think that's what offloading means???
>>101268091
#justwindowsthings
>>101268091
>i don't think that's what offloading means???
When did people start adding question marks to statements?
Also, different words mean different things in different contexts.
>>101268091
Disable mmap.
>>101268125
you look like you're brown
>>101268154
Shit. I left my cam on again...
>>101268044
>a tech illiterate is making a fool of himself
There are only two cables in the picture, with the third one daisy-chained, dumbass
>>101268178
>>101268178
>>101268178
>>101268181
>he thinks his cam wasn't turned on remotely
bruh
>>101268183
I'm aware of this. Still, where's the issue?
>>101268238
But muh seven proxies!
>>101268091
You made the mistake of pulling. Use an older version of ooba from before this commit: >>101255284
>>101268078
>he thinks proxies matter when the feds are in his UEFI
b r u h
>>101268078
I'm the memory usage guy.
Playing around with the settings, I found that ticking 'Disable MMAP' fixed the issue. My understanding was that MMAP wouldn't be used unless RAM was full, and so the option would be safe to leave on, and that seemed to be how it worked with all non-imatrix models, but I guess koboldcpp might be bugged under certain conditions, unless this is a problem exclusive to my system for some reason.
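For anyone confused by the mmap back-and-forth: with mmap the loader maps the model file into its address space and the kernel pages it in on demand, so the file's pages show up in cache/RSS accounting even though they're reclaimable; with mmap disabled the weights are instead read into private allocations that only the process owns. A minimal sketch of the mapping side, using a throwaway stand-in file rather than a real GGUF:

```python
import mmap
import os
import tempfile

# Write a small stand-in "model file", then map it instead of read()ing it.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"GGUF" + b"\x00" * 1020)
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    magic = mm[:4]  # pages are faulted in on access, not at map time
    mm.close()
os.unlink(path)
print(magic)  # b'GGUF'
```

The bytes only become resident when touched, and the kernel can drop them again under memory pressure, which is why memory-usage tools can make a mapped model look like it's "still in RAM" alongside the offloaded layers.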
>Hi Emily, do you know Jamiroquai?
>Are you for real? You're still listening to that mainstream pop crap? Go and listen to some real sound, some white metal, some stuff that's got some guts?
>white metal? what's that?
>This is pure metal, without all those screaming black singers. Real metal, with lyrics about honor, country and strength. You should check out bands like Marduk or Burzum. They're really authentic.
Lmaooo, what did google do to give so much sovl to gemma-27b-it??