/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101180092 & >>101173181

►News
>(06/28) Inference support for Gemma 2 merged: https://github.com/ggerganov/llama.cpp/pull/8156
>(06/27) Meta announces LLM Compiler, based on Code Llama, for code optimization and disassembly: https://go.fb.me/tdd3dw
>(06/27) Gemma 2 released: https://hf.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315
>(06/25) Cambrian-1: Collection of vision-centric multimodal LLMs: https://cambrian-mllm.github.io
>(06/23) Support for BitnetForCausalLM merged: https://github.com/ggerganov/llama.cpp/pull/7931

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
►Recent Highlights from the Previous Thread: >>101180092

--Paper: HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale: >>101183476
--Gemma Debugging Issues with HF Transformers Implementation: >>101183120 >>101184453
--Gemma 27b Models' Coherence Issues with Sliding Window Attention: >>101180648 >>101180665 >>101181050 >>101181078
--The Frustrations of AI Model Performance and Limitations: >>101181282 >>101181268 >>101181298 >>101181321 >>101181349 >>101181350 >>101181433 >>101181573 >>101181345 >>101181362
--SPPO Performance and Comparison to Instruct Models: >>101183296 >>101183595 >>101183661 >>101183827 >>101183939
--Llama Model Load Error: Unknown Model Architecture 'Gemma2': >>101184465 >>101184685 >>101184767
--LLM-Compiler's Limitations in Compiler Development: >>101182838 >>101182947 >>101183745 >>101183855 >>101183896 >>101183874
--Gemma-9B's NSFW Behavior: Anomaly or Dataset Issue?: >>101180719 >>101180943 >>101181086
--Gemma-2 Support Issues in Llama.cpp: >>101182001 >>101182048 >>101182376
--Gemma 2 Release Format Issues and Official Implementation: >>101181569 >>101181611 >>101181640
--Eagle's Speed for Inferencing and Decoding in Creative Writing and RP: >>101184243 >>101184286 >>101184304 >>101184318 >>101184337 >>101184368 >>101184297
--Drama in the Quantization Community: Q8_0_L Quant Development: >>101180438 >>101180456 >>101180471
--Can LLMs Solve Programming by Example?: >>101180866
--Anon's ST Addon Development: Constant Reminders and UI Improvements: >>101180277 >>101181094 >>101185317
--Nala Test for Gemmy 9b (Q8_0): >>101184502 >>101184521 >>101184551
--Anon's Struggle to Access LLM Compiler and Unconventional Plans for its Use: >>101182269 >>101182463
--AI Models and the Human Brain: Efficiency and Unrealistic Portrayals: >>101182977 >>101183020 >>101183069 >>101183105 >>101183084 >>101183078
--Miku (free space): >>101183371 >>101185164

►Recent Highlight Posts from the Previous Thread: >>101180096
what does she mean by that?
>>101186546
>rape by consent
Yep. It's properly emulating a woman. Enjoy your model
>>101186508
>Gemmy 9b (Q8_0)
it's coal though.
>>101186546
>what does she mean by that?
nothing? never take hallucinations from an artificial redditor at face value.
I'm retarded, please help me
Last time I messed with LLMs I could barely run anything coherent on my poorfag machine. Have any of the newer hacks made things better or are we still stuck on using the biggest merge that works? I see some stuff about bitnets in the news, are they better for low model sizes?
>>101186722install linux
>>101186732I use Linux, what's the next step?
>>101186546It's rape if her body reacts to some(You) whom she has decided that her body shouldn't react to because that doesn't align with what she has chosen as her ideal.
>>101185918>you definitely aren't running on full gpu then at those speedsBeing vramlet I'm used to the drop as soon as I run any non-garbage model, from 20+ to 1-2 t/s. I'm just not sure about the next drop from 0.7ish to glacial. Happens sometime in the 60-69 (nice) GB model range, so I'm theorizing system RAM is a factor. (I'm on 64GB RAM.)
24GB VramChads, is Gemma 27B our solve? Are we so back, or is it just slop shit and we are doomed and our only hope is to die?
>>101186741
tell specs
>>101186598
>I understood the selling point that they maintain a 'library' of models for nubs that can't understand HF. You just pick llama3 or whatever and don't need to deploy braincells to think about quant levels and what is appropriate for your hardware etc.
>If you understand how to choose a GGUF you're probably better off running a backend closer to the upstream.
It's part of the learning curve.
Ollama got me a model and a prompt that goes.
Then I learned about the other models.
Then I learned about the model variations.
Then I learned about what the quants are about.
Then I learned to use those through Ollama.
Then I learned to want more options.
And that they live on HF.
That pushed me to step up to Kobold.
And it told me I needed GGUF files.
That's the last experience point I needed to level up.
>>101186762
Specs are garbage. My GPU has less than 2GB of VRAM so I run on CPU with 16GB of RAM (koboldcpp). If I stick to 8-12GB models it generates at a decent speed but I can't really work with anything bigger without going too slow to be worth it
>>101186774
The problem is that it seems to be actively hostile to doing anything outside of its prescribed way of doing things.
>>101186784
i mean fimb-11b-v2 is decent, u could also use stheno 8b 3.2
I used that trick to get the entire prompt from the chatgpt website
https://rentry.org/stcrcggo
Feels like kind of a stretch that it can follow all that
>>101186806 (me)
and remember to only download models from sao
>>101186752
yeah when you split it's back to mem/cpu speed. i get 1.4t/s on l2 70b at 32k context which is fine for my usage. it's really the lowest i'd want to go though as far as speed. bigger models despite the slowness are worth using because the responses are so much better. 8b is so retarded i can't believe anyone even wastes time on them no matter how fast they are
>>101186814not me thougheverbeit
Said fuck it and manually updated ooba's transformers. Works fine as long as you turn off do_sample
That said, 9b is actually garbage. What the fuck are you all seeing in this shit? It can't follow instructions for shit and the writing is awful. I guess uh, good for you vramlets, you get something that isn't hard refusing porn shit?
I'mma stick with real models though.
>>101186826
Fuck, forgot to specify: I meant gemma-2-9b is what I tried out, and it's trash.
>>101186826
I'm a vramlet, but even I have standards. 9b is completely useless.
>>101186500
>that nature
Sovl, I wish I wasn't around buildings all the time. How do I cope?
Retards saying that Gemma-2-9B is trash while the 27B is great haven't actually tried either model. The 27B version appears to be defective and incoherent.
Gents, I want to try running command-r+ on my pc with 2 3090s. Will this work and can I load it in ooba?
>>101187024
I tried 27B using online hosted services, and it works well there. I'm not one of the people who praised 27B in those threads, but I imagine that's what they did as well.
>>101186811
>Personality: v2
What did OpenAI mean with this
In booba's exl2 loader, what exactly is cfg-cache?
>>101187073
RTFM
https://github.com/oobabooga/text-generation-webui/wiki/04-%E2%80%90-Model-Tab#exllamav2_hf
>Creates a second cache to hold the CFG negative prompts. You need to set this if and only if you intend to use CFG
>>101187073
CFG is when you have a positive and a negative prompt, and CFG cache, I assume, means it reserves two caches: one for the usual prompt, and one for the negative prompt.
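For what it's worth, a minimal sketch of what CFG does at the logit level (assuming the usual formulation; names and numbers here are made up for illustration). The two KV caches exist because the positive and negative prompts are two different contexts evaluated in parallel:

```python
import numpy as np

def cfg_logits(pos_logits, neg_logits, scale):
    """Classifier-free guidance: push logits away from the negative
    prompt's predictions and toward the positive prompt's.
    scale=1.0 reproduces the positive logits unchanged."""
    pos_logits = np.asarray(pos_logits, dtype=np.float64)
    neg_logits = np.asarray(neg_logits, dtype=np.float64)
    return neg_logits + scale * (pos_logits - neg_logits)

# Toy logits over a 3-token vocabulary; scale > 1 exaggerates
# whatever the positive prompt favors over the negative one.
pos = [2.0, 0.5, -1.0]
neg = [1.0, 1.5, -1.0]
print(cfg_logits(pos, neg, 1.5))  # elementwise: 2.5, 0.0, -1.0
```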
>>101187073
The thing people claimed would make open models free from gptisms and make them smarter than proprietary ones
>>101186388
Control vectors influence the output direction of the model, so when applied at higher strength, they will make the model output the same thing every time.
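Mechanically that's just adding a fixed direction to the hidden states; a toy sketch (assumed formulation, made-up numbers):

```python
import numpy as np

def apply_control_vector(hidden, direction, strength):
    """Add a fixed direction vector to every token's hidden state.
    At high strength the direction dominates the original state,
    which is why outputs collapse toward the same thing every time."""
    hidden = np.asarray(hidden, dtype=np.float64)
    direction = np.asarray(direction, dtype=np.float64)
    return hidden + strength * direction

h = np.array([[0.2, -0.1], [0.0, 0.3]])  # two toy token states
d = np.array([1.0, 0.0])                 # a learned "direction"
print(apply_control_vector(h, d, 8.0))   # first component swamped by the vector
```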
>>101187037
Well, it is easy to download them both from the official Google repositories on HF, quantize them to the same level (e.g. GGUF q6_k, using the latest patches) and observe back-to-back how the 27B version is not only about as censored as the previous Gemma (albeit with a somewhat less irritating tone), but also rambles and mixes up user/model responses, whereas the 9B has no issues.
Hard to imagine that 6.5-bit quantization hits the 27B version harder than the 9B, but anything is possible, I suppose?
>>101187025
Sure it will work, although it will be slow. I would suggest part-offloading a >=Q4 GGUF rather than trying to cram a low bpw entirely into VRAM.
>>101187093
I assume that the problem is somewhere in the open source implementation, not in quantization. And, yes, 27B definitely is cucked, but that can be partially fixed later on. I evaluated its level of intelligence when it did talk to me.
>>101186806
Thanks, I'll try those out. I just hoped that all that (((research))) would have produced better local models by now instead of just bigger ones
>>101187171
8Bs of today are so much better than llama1 8B that it's not even close.
>>101187294
llama1 7b*
>>101187305
llama1 6.7b*
>>101187390
>https://arxiv.org/abs/2302.13971
7b.
>>101187479
and how many parameters did it have
>>101187503
anon we call it llama-1-7b not llama-1-6.738
>>101187503
Seven parameters, of course.
>>101187525
>llama-1-6.738
llama-1-6.738b*
Gemma-2-9B really wants to write "X, Ying"-type prose during RP even if you manually randomize that with something else like:
"Xing, Y"
"X and then Y"
"As X, Y"
"X"
etc.
>>101186774
proud of you Anon

>>101186805
It can run arbitrary GGUFs with a couple extra steps if that's what you mean. It's fine to have more options for beginners who understandably don't want to struggle with details just to try the stuff.
>>101187650
Damn, this triggers my autism like nothing else. I guess it will pass the Nala test with flying colors though.
>>101186561
f-finetunes will fix it.
>>101186755
27B is fucking brain damaged.
It can handle simple assistant type prompting but RP prompts confuse and enrage it.
>>101187105
Thanks, by low bpw you mean like a 3bit quant to fit in 48gb vram?
>>101187776
dumbass
>>101187737
would brutally rape both
>>101187846
Yeah. Keep in mind you also need VRAM for context+other buffers on top of the model size. I did not get great results with Q2/3 but why not try it and compare. Might take some fiddling to get it to fit (quantized KV cache and lower context length can help). I find it worth the lower speed to use Q4KM
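As a rough way to think about the fiddling, here's a toy layer-split estimator; every number in it (quant size, layer count, overhead) is an illustrative assumption, not a measured value:

```python
def layers_that_fit(vram_gb, model_gb, n_layers, overhead_gb=2.0):
    """Rule of thumb: weights are spread roughly evenly across layers,
    and some VRAM must be reserved for KV cache and buffers.
    Returns how many layers you could plausibly offload to GPU."""
    per_layer_gb = model_gb / n_layers
    usable = max(vram_gb - overhead_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# e.g. a hypothetical 60 GB quant with 64 layers on 2x24 GB cards:
print(layers_that_fit(vram_gb=48, model_gb=60, n_layers=64))  # → 49
```

In practice you'd still adjust down a few layers at a time until it stops OOMing, since context length and cache quantization change the real overhead.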
>>101187776
https://huggingface.co/google/gemma-2-27b-it/discussions/10
>>101187865
Do you ever sleep?
>Anonymous 06/28/24(Fri)15:28:07 No.101187901
>>>101187865 (You)
>Do you ever sleep?
12 hours a day im a neet
>>101187918
Want to be my NEET bf? UwU
>>101187893
Thanks again, not sure how far I’ll get with only 32gb ram but will see. Might need to try the non + version
>>101187894
>Just don't use float16
>lmao no we're not going to release the fp32 weights
>>101186575
It's not a hallucination, it's an accurate simulation of a woman. LLMs are getting better and better.
friendly reminder that ollama WON
>received a private PR by google for gemma support before the release. llama.cpp was ignored
>available as an option in the brave browser, used by millions
>on its way to 100k stars
>redditors on localllama love it and only talk about it
>every llm YouTube video recommends it
>every twitter influencer recommends it
>hosts events, receives endless vc funding
Sorry chuds
>>101188077
so you're saying I should start a betting pool on when the ceo gets metoo'd?
>>101187894
I knew 27b was fucked up in transformers. They rushed it and didn't test things properly.
>>101188094
ollama guy is the Bill Gates of llm
>>101188095
How can it be transformers if 9B works fine though?
Anyone know a repo that has styletts2 + rvc integrated nicely? I currently use xtts + rvc but xtts isn't consistent enough and tends to produce results that slur/shit itself from time to time. Particularly want an implementation with voice cloning
>>101188113
Are they exactly the same architecture?
>>101188077
At this point I think llama.cpp should just give up and let ollama maintain the project, ngl.
I hate them, but llama.cpp is probably even worse. llama.cpp is always broken and, when you report it, the maintainers blame you instead of investigating. It's infuriating
>>101188124
>llama.cpp is always broken
Yeah, this is the reason I use exllama, and more recently llamafiles. Shit just works.
Don't even mind switching to globo-slop approved ollama in the future, as long as I can launch my waifu with no fuss.
>>101188182
Does llamafile work at all? Does it behave nicely with sillytavern?
>>101188077
>received a private PR by google for gemma support before the release. llama.cpp was ignored
They did? Their PR looked like a ctrl+c ctrl+v of the llamacpp one, with the tokenizer errors and all.
>>101188205
They didn't. These anons are retarded.
>>101188077
I've been thinking it would be kind of funny to implement some critical component on an AGPL fork but I really don't think it would be worth the drama.
I don't want or need a job or attention so to me downstream projects have non-negative value (depending on what and how much they contribute upstream).

>>101188124
>At this point I think llama.cpp should just give up and let ollama maintain the project, ngl.
I have never seen any bug reports or fixes for llama.cpp issues from ollama devs so I don't think they could.
I think the only reason there are fewer issues with ollama is that they wait for the llama.cpp issues to be fixed before they take over the code.
>>101188077
>Sorry chuds
you call the llama.cpp devs chuds? lol, they're the ones who put Jartroon back on the team in the first place
>>101188248
What is your motivation for continuing to maintain llama.cpp? Asking as a llama.cpp contributor myself, the amount of code that you put out and the consistency are insane.
>>101188248
>I've been thinking it would be kind of funny to implement some critical component on an AGPL
That would be so fucking funny.
>>101188277
I mean the people who unironically sling the word chud around believe that Donald Trump is anything other than an Israel-First neoliberal hack. The bar for being considered a chud is pretty low.
>>101188357
true, true
>>101188285
-I like building and optimizing things and making numbers go up.
-I am by nature a very competitive person and one of my long-standing ambitions is to write the code with the worldwide best performance (at least for those use cases I care about).
-While I think that as of right now generative neural networks are still kind of lackluster I think that they will become very good in a few years and that the infrastructure for that needs to be developed ahead of time. In particular a low upfront cost is I think important.
-I am ideologically very pro open knowledge/open source (though I prefer free software).
-I plan to use llama.cpp/ggml for my own projects (doctoral thesis in physics, AI-powered RPG if no one else does it before me, pretraining models if I can make it cheap enough that I can actually afford it).
>>101188248
>I've been thinking it would be kind of funny to implement some critical component on an AGPL fork
holy fvcking based..
>>101188248
>I've been thinking it would be kind of funny to implement some critical component on an AGPL fork
>I've been thinking it would be kind of funny to implement some critical component on an AGPL fork
>>101188382
verdict: based, on all counts
>Ive been thinking it would be kind of funny to implement some critical component on an AGPL fork
>cornpop did so bad that the damage control is spilling into lmg
big lel
AGPL is a fair license if you would like to take part in the Free Software movement.
>>101188382
>very competitive
>and yet, exllamaV2 is still miles ahead of llama.cpp
I'm starting to think it's over.
>>101188382
do it.
can someone do a TLDR about licenses? not everyone is a lawyer. Is AGPL a good thing? And what does the cuda dev want to do with it?
>>101188382
it's time
>>101188248
>I've been thinking it would be kind of funny to implement some critical component on an AGPL fork
You've gotta deliver now that you've said it.
>>101188382
based
>>101188469
this
exllamav2 embarrasses llamacpp
>>101188469
>miles ahead
the fuck you talk about? llama.cpp gives deterministic output + allows for some cpu offloading if we want to get a slightly higher quant
>>101188248
>AGPL
did i hear something..?
>>101188248
>do it
>>101188513
its literally slower and worse quality
>>101188534
Glad we're on the same page about llamacpp
>>101188526
Probably means in terms of speed. I'm still using exl2 exclusively since mixtral.
>>101188513
how well does gemma2 9b run on exllama?
>>101188537
imagine being this retarded
lmao even
>>101188548
are we really going to pretend llamacpp didnt have issues on literally every single new fucking model release
>>101188566
so it doesn't run, alright
>>101188547
>I'm still using exl2 exclusively since mixtral.
I'm using llama.cpp because I can get a bigger quant (Q5_K_M) even if I don't have enough GPU vram, exllama just doesn't allow you to do that
Sirs, you are way too fast. I can't keep up reading the threads.
>>101188248
gpl gods stay winning
holy shit, I love google now
>>101188581
vramlets need the rope
>>101188248
>I've been thinking it would be kind of funny to implement some critical component on an AGPL fork
this will be us if you do it
>>101188513
>>101188526
>>101188534
>>101188537
>>101188547
>>101188548
>>101188558
Seeing tards slapfight over quant methods is really funny when you've been using float16 since day one like me
>>101188581
And oftentimes you are better off doing that. Get a quant that's only slightly bigger than your vram and you are golden.
Something like 80~85% of the model in VRAM is around the sweet spot as far as I can tell.
>>101188581
I get it, and I get it's important for a lot of people here, but I am used to the speed of having everything in the VRAM, and for that exl2 is still superior (unless something changed very recently).
>>101188601
exactly this, you offload 80% GPU + 20% CPU, the speed is still good and you get a way less retarded quant, that's a win/win situation
>>101188248
excited for it
>>101188591
what does this say? I'm not speaking the nazi language kek
>>101188617
gguf quants are less efficient than exl2's, you will have worse quality and speed than what you would have just running 100% with exl2
>>101188248
>>101188626
>gguf quants are less efficient than exl2's
that's not true, the gguf quants have improved a lot since then
>>101188626
>efficient
doesn't exl2 pad the 8bpw so people don't complain about size being too small or something?
>>101188248
>GNU/llamacpp
>>101188382
>>101188650
you retards have been peddling q6/6bpw as the best you can get with anything above being imperceptible, now you wanna pretend you give a shit about q8?
>>101188248
Join us now and share the software;
You'll be free, hackers, you'll be free
Join us now and share the software;
You'll be free, hackers, you'll be free

Hoarders can get piles of money
That is true, hackers, that is true
But they cannot help their neighbors;
That's not good, hackers, that's not good

When we have enough free software
At our call, hackers, at our call
We'll kick out those dirty licenses
Ever more, hackers, ever more
>>101188673
>you
no, if you can't run q8 you can't run it, period.
>>101188686
no, if you can't run fp16 you can't run it, period.
>>101188626
EXL2 is based on GPTQ, which is a terrible quantization method.
>>101188626
>you will have worse quality and speed than what you would have just running 100% with exl2
Speed is fair enough, but I've never seen any evidence that exl2 produces better results than an equivalent bpw gguf, even more so considering imatrix now.
And considering the rpcal debacle, I'm even less inclined to believe subjective reports.
>>101188698
You are a retard.
>>101188708
I know what I'm talking about. The rpcal situation mentioned by >>101188702 is sufficient evidence that EXL2 is garbage.
>team llamacpp are RP fags
it all makes sense now, kek
>>101188721
>is sufficient evidence that EXL2 is garbage.
It's not. That's just users being retarded.
I still want to see actual comparisons of KL divergence, ppl, and logits between full precision, exl2 at Y bpw and gguf at Y bpw.
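The comparison being asked for boils down to measuring how far the quantized model's next-token distribution drifts from full precision; a toy sketch with made-up logits (the real test would average this over many tokens of real text):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))  # shift for numerical stability
    return z / z.sum()

def kl_divergence(p_logits, q_logits):
    """KL(P||Q) between the full-precision model's next-token
    distribution and the quant's; 0 means identical predictions."""
    p = softmax(np.asarray(p_logits, dtype=np.float64))
    q = softmax(np.asarray(q_logits, dtype=np.float64))
    return float(np.sum(p * np.log(p / q)))

full = [2.0, 1.0, 0.1]   # toy logits from the fp16 model
quant = [1.9, 1.1, 0.1]  # slightly perturbed, as quantization would
print(kl_divergence(full, full))   # 0.0
print(kl_divergence(full, quant))  # small positive number
```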
>>101188248
DO IT
>>101188469
llama.cpp peak performance on an RTX 4090 currently sits at ~90% of the peak performance reported on the ExLlama Github repository (for both token generation and prompt processing).
So I'm thinking that with a bit more MMQ optimization and speculative decoding support llama.cpp will be faster.

>>101188490
(A)GPL has a "copyleft", meaning you cannot make any forks or derivative software closed-source.
If there was some critical feature that was licensed with a copyleft it would force downstream projects like ollama to either re-license their project to also include a copyleft or they would not be legally allowed to take over the feature.
Since permissive licenses without copylefts are considered more "business friendly" this would basically just troll projects like ollama that are more business focused.
Koboldcpp and Ooba would be unaffected since they already use copyleft licenses.

>>101188502
Did you not read the part where I said that it wouldn't be worth the drama?

>>101188626
According to https://github.com/matt-c1/llama-3-quant-comparison llama.cpp quantization is more efficient in terms of MMLU score at a given size though for >4 BPW it probably won't matter much.

>>101188730
Actually if you look at Google trends team llama.cpp are Chinese.
>>101188382
A true hero
>>101188762
If I mix 3090 and P100, I'm basically forcing the 3090 down to the level of the P100, in terms of supported math operations, right?
I'm just trying to figure out how to speed up my L3 70B gens without buying more 3090s at the moment.
I've got 2x 3090 and 3x P100, everything is on PCIe 3.0 16x.
>>101188762
>US isn't even in the top 5
Wtf?
>>101188762
thanks for the licence lesson anon, much appreciated
>>101188762
You are a very skilled programmer and an all-around based individual doing very important work. Do you have a Ko-fi account or some shady crypto address I could send $20 to? t: Vramlet Simp.
>>101188762
>Did you not read the part where I said that it wouldn't be worth the drama?
Yes, it would be worth it. 100x over.
We wouldn't be here without the original llama leaker.
I'm thinkin' miqu
>>101188382
>AI-powered RPG
Infinite Zork is actually coming, HOLY FUCKING KINO
>>101188847
not on lcpp
processing prompt 3/16192
>>101188803
Yes, the slowest component dictates the performance.
>>101188872
Wait what? He's going to license all of his llamacpp contributions under AGPL from now on?
>>101188803
For llama.cpp it should depend on --split-mode .
With --split-mode layer each GPU should be using the optimal kernels (but for some quantization formats there is no P100-compatible implementation).
With --split-mode row the P100s will force the 3090s to use suboptimal kernels because P100s lack the __dp4a instruction and thus cannot run MMQ.
Any other Pascal card or more modern cards (except for V100s which lack int8 tensor cores) should not be causing issues.

>>101188825
I at some point had a ko-fi account linked on my Github but I decided to remove it when I accepted a part-time job for a known AI company.
I cannot in good conscience accept money from people that earn significantly less than me per hour when I right now don't even have a use for it and would need to pay a large percentage of it in taxes.
Maybe I'll do crowdfunding if I ever invest relevant amounts of money into training.
>>101188890
Yes, we are back
>>101188248
>I'm going to implement some critical component on an AGPL fork
>>101188896
I tend to use q8 quants, is that best for P100?
Are all these posts being made by the same person kek
>>101188248
>>101188906
>>101188248
>AGPL fork
yup i'm thinkin' based
pretty sad tbdesu
>>101188847
Do not expect anything anytime soon though.

>>101188918
It should only be the IQ quants that cause issues for P100s.
>>101188248
>I've been thinking it would be kind of funny to implement some critical component on an AGPL fork but I really don't think it would be worth the drama.
DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT
>>101188591
based
He's not going to do it, no matter how much you spam, you know.
Good morning lmg!
>>101189039
Good morning Miku
>>101188762
>>101189039
Good morning anon that posts tasteful artistic mikus
>>101189033
>He's not going to do it, no matter how much you spam, you know.
>>101188762
>Did you not read the part where I said that it wouldn't be worth the drama?
>JOOIIN US NOW AND SHAAAREE THE SOFTWARE
>YOU'LL BE FREE HACKERS
>YOU'LL BE FREEE
>>101188931
Yes it is the license autist.
for me, it's cc-by-nc4.0
>for me, it's cc-by-nc4.0
>>101189249
I hate having to tell guys with muscle-girl fetishes this all the time but the level of testosterone required for muscles like that is mutually exclusive with having tits.
>>101188896
>I cannot in good conscience accept money from people
Based ethical dev
I'm running gemma2 9b on an i5-6500T server just for fun to see it struggle but the speeds are kinda acceptable? That's incredible.
And gemma2 9b passes my non-scientific mandelbrot coding test, albeit a bit weirdly, which only recent small models like llama3-8b passed at all. Mistral 7B and older did not pass this test.
>>101188248
Release it with no license provided. Copyleft and copyright are two sides of the same cancerous coin. Make it so that nobody who believes in the validity of ""intellectual"" ""property"" can use your code.
>>101189308
nta, i'm a hobbyist programmer and have never once read a license and i've been releasing stuff for over 25 years. even the bigger stuff that uses libraries, never cared, i just release it and have never once had a single issue raised. i dunno why people even care unless you're starting a company based on stolen code or something
Gemma 9b base model coming up a little lackluster on the Nala test.
>>101189353
>i dunno why people even care unless you're starting a company based on stolen code or something
thats the point, agpl scares big corpo away
>>101189362
She's very thorough when it comes to licking
>>101189278
Sick. What are you running it with? llama.cpp?

>>101189362
That does look weird. Is that with the proper prompt format, and what backend are you using?
>>101189425
Q8 on llama.cpp
I just used a generic base model prompt template that I had available.
>>101189362
That looks like a bug.
>>101189211
Miku are you okay? Are you okay Miku?
>>101189353
Exactly. This is the optimal mindset, and the one that's cultivated by releasing software with no license provided.
What makes the transformer architecture intelligent?
>>101189501
>What makes the transformer architecture intelligent?
attention
>>101189501
Me.
>>101189501
"Expert roleplayer" in system prompt.
>>101188896
>>101189362
Dichotomy of 4chan
>>101189496
a lot of times what i get from other code its a simple function i have to rewrite anyways to fit what i need, but the original served as a good example. imo if code is out there and visible it should be free game and treated like that. personally i like when i get a message from someone who uses my code as part of theirs, they show me how they hacked it up and changed stuff so it fits what they need. when i do use a whole library from something that has a license, i include a thanks/credits but never even bother to check the license. its such a non-issue for 99% of people i dunno why anyone even cares
>>101189535
the Nala test is the only objective RP test we have.
>>101189049
we did it reddit!
Latest bartowski gguf 9B gemma with current llamacpp is still incoherent past 4k context.
>>101189501
who knows really? this shit is way too complex to be understood theoretically
>>101189617
SWA thing, won't be fixed, 'closed this as not planned'
>>101181078
>9b and 9b-it: seem to be fine as long as you're under 4k context. When I gen a message in RP with a 5k context, both have severe quality degradation. Can't spell things right, can't write grammatically correct sentences. Possibly problem with sliding window attention? The model interleaves 4k SWA and 8k dense attention. Once context is over 4k, the sliding window actually starts sliding and maybe something breaks? Hopefully something is just broke and can be fixed, and model is not fundamentally a 4k context model.
shit, then lcpp is fucked for that, since gergio said he didn't care
>It feels that since Mistral 7B from last year, there hasn't been much interest in this technique. Even later Mistral models dropped it as a feature. Taking this into account, I guess we can leave this issue closed
https://github.com/ggerganov/llama.cpp/issues/3377
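For anyone fuzzy on what "the window actually starts sliding" means: a toy sketch of the attention masks involved (window shrunk from 4096 to 4 so it fits on screen; this is the general SWA idea, not Gemma's actual implementation):

```python
import numpy as np

def attention_mask(seq_len, window=None):
    """Boolean causal mask; True = query token may attend to key token.
    With `window` set, each token additionally sees only the last
    `window` tokens (sliding window attention). Gemma 2 reportedly
    interleaves window=4096 layers with global (window=None) layers."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    mask = k <= q                    # causal: no peeking at the future
    if window is not None:
        mask &= (q - k) < window     # the window "slides" past older tokens
    return mask

# with a toy window of 4, token 5 can no longer see tokens 0-1;
# before position `window`, SWA and global masks are identical,
# which matches things only breaking past the window size
m = attention_mask(6, window=4)
print(m[5])  # token 5 attends only to tokens 2..5
```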
>>101189425
I'm running gemma2 9b with just ollama run gemma2, which is Q4_0, so it's not the best it could be but it still passes the test lol
>>101187024
>>101187037
>27b is fucked
I quanted my own to bf16 and am running it without problems. I could provide instructions if anyone cares
>>101189501
It's not intelligent.
But if you ask why it appears to be, then the answer is brute forcing with the number of neurons + obviously the attention layers.
>>101189674
google themselves are saying something's wrong with it
>Yes we are investigating what went wrong! Note that float16 should not be used for this model
https://huggingface.co/google/gemma-2-27b-it/discussions/10#667e9fc6f0820e80d39aaf3e
>>101189644Well at least it should work with Transformers right? Then we can at least confirm what the "good" context length of the model is, and whether interleaved global + sliding window attention really works without any issues.
>>101189705
no actually, someone tested on transformers, and said it didn't handle >4k well either
>>101181113
>>101189702
>Note that float16 should not be used for this model
the fuck do they mean by that? we have no other choice but to use their fp16 model to do the quants, that's the only thing they gave us
>>101188277
most of jart's PRs have been ignored for weeks. draw your own conclusions.
>>101189729
there is this
https://huggingface.co/google/gemma-2-27b-it-pytorch
>>101189264
what? next time you will tell me that cocks in real life can't actually penetrate the cervix to shoot cum in her womb
>>101189705
>Well at least it should work with Transformers right
Some anon was comparing the transformers implementation to the reference and it seems that they might have fucked some stuff up too.
Basically, give it 4 or 5 days until everything is in working order.
When will we ever not get a fucked model launch? Christ. They really couldn't spare just a bit more time and manpower to make sure things actually work properly on people's machines.
>>101188115
>>101188115
>>101188115
bumping for visibility
>>101189767
why do that when the autismos might fix it for them, for free?
>>101189767
why bother, just wait and let open source chumps do it for you
>>101189702
>google themselves
Doesn't change the fact that I'm using it and it's working fine.
>float16 should not be used for this model
f16 != bf16. Another comment says bf16 works.
It's not ultra-impressive, but it works.
>>101189729
They provided BF16, didn't they? Nobody should be quanting from F16 for BF16 models since llama.cpp added support for BF16 1-2 months ago. Even before BF16 support you were supposed to upcast BF16 to F32 and then quant.
>>101189513
How?
>>101189693
>then the answer is brute forcing with the number of neurons + obviously attention layers.
Well, how?
>>101189833
>how
https://machinelearningmastery.com/the-transformer-attention-mechanism/
https://www.youtube.com/watch?v=kf_eGgVtOcs
>>101186805
What blew my mind was when I went digging and found that they mask the files behind le epic hash-code-like renames, then put the key in a JSON in the next directory. Which meant part of my evolution was becoming a l33t hax0r by switching around file names/hashes, getting to the point of thinking of using a tool for it, and then saying, naaaaaaaaaah, I'll just get that program named after D&D fun-size wife lizards.
>>101186815
>bigger models despite the slowness are worth using because the responses are so much better. 8b is so retarded i can't believe anyone even wastes time on them no matter how fast it is
Same. I want there to be a small model that isn't total ass, for the sake of having a real-time-ish option. But the smallest one that has passed my music theory question is 40GB (qwen2-72b-instruct-q4_k_s; and yes, the parallel _m failed. Just barely, but it also blew a pop culture question I've started testing against that _s got right. S to M is +4GB and -40IQ.)
>>101189749
Yes, that was me: they reversed the order of sliding-window attention and global attention. But at >4k context, where this actually matters, the latest HF Transformers commit doesn't even work; it crashes with some internal CUDA error, index out of bounds or something. Once that's fixed they still need to fix the off-by-one error for SWA / global attn. Someone should probably tell them; I don't think anybody else has realized it yet.
>>101189827
>>101189816
Well, apparently people are reporting issues with bartowski's quants too, so maybe help him out, if yours works correctly?
>Just a heads up, there seems to be some serious issues with this model regardless of whether you use the template correctly or not. In my testing it performs significantly worse than the 9b version, so much worse that there's clearly something fundamentally wrong. And I've seen many others have the same experience. An issue has been created on the official Repo, and Google states they are currently investigating it.
https://huggingface.co/bartowski/gemma-2-27b-it-GGUF/discussions/3#667ee47b8972e9eb302f7724
>>101189833
The more neurons you have, the more complex a function you can emulate. An easy function like a linear function you can emulate with a single neuron; for XOR you need a few neurons. Language is a very complex function, so to emulate it at a reasonable level you need billions of neurons. Neural networks are universal approximators, so it's not a question of "if" but "how big".
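The XOR point can be shown concretely. A minimal sketch with hand-picked weights rather than trained ones (one neuron can't do it, since XOR isn't linearly separable; two hidden neurons plus an output neuron can):

```python
def step(x):
    # Heaviside step activation: fires (1) when the input is positive
    return 1 if x > 0 else 0

def neuron(inputs, weights, bias):
    # weighted sum of inputs plus bias, passed through the activation
    return step(sum(i * w for i, w in zip(inputs, weights)) + bias)

def xor(a, b):
    h1 = neuron([a, b], [1, 1], -0.5)       # fires if a OR b
    h2 = neuron([a, b], [1, 1], -1.5)       # fires if a AND b
    return neuron([h1, h2], [1, -2], -0.5)  # OR but not AND = XOR
```

A single `neuron` call handles linear cases like OR or AND on its own; XOR only falls out once the two hidden neurons are combined.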
>>101189362
Haha, looooool, Google can't even release a model right, holy shit.
>>101189905
What about the attention layers part?
>>101189887
Bartowski is not retarded and knows not to quant from an F16 base (pic), so I don't think it is that. If BF16 works and his quants don't, he is either fucking up somewhere else or there is something wrong with llama.cpp's quant code with regards to gemma.
>>101189956
the meme that saved /lmg/
>>101189993
>Bartowski is not retarded
I know, which is why it's weird his quants are reported as borked too. If it were darmercher that'd be par for the course, but him?
i'm still following this new cai drama for the luls and it keeps delivering
>>101189993
Wait, does this imply that I shouldn't be converting straight from BF16 to Q8 (if I want objectively the most accuracy), but BF16->FP32->Q8? Or is he comparing simply to BF16->FP16? I mean, BF16->FP16 is done by the conversion script rather than the quantize script, but the quantize script can take a BF16 GGUF file, so I just assumed that worked the same as running it off an FP32.
What is the smallest model that can reliably be forced to use function calls? One of the Mistral Instruct v3s? Why isn't function calling more commonly a feature? I have no use for these instruct models if they can't reliably trigger function calls.
>>101190059
>Why isn't function calling more commonly a feature?
the only function most care about is ah ah mistress
>>101190042
Wait, people are still using cai? They should let it go; the golden age has been over for years now.
>>101190042
kek, qrd?
>>101190052
It doesn't matter. You should only be directly quantizing native FP32. Anything else is like converting an MP3 file to a 32-bit float WAV before encoding to an OGG. It makes no difference; you're still incurring generational loss.
Guys....!
https://eqbench.com/creative_writing.html
>>101190080
>native FP32
No models are released like this; they're all bf16 now.
>>101190084
>>101190052
The idea is that FP16 is coarse and BF16 is coarse, but they're coarse in different ways, so going directly from one to the other can cause more drift in the values than going through 32 first: the F32 is no less accurate than the first 16-bit format, and the value might find a more accurate representation in the other 16-bit format after visiting 32.
It's probably really close to irrelevant, but again, if it silences the armchair computer-math geniuses who want to throw shade at a coder with his boots on the ground dealing with video card opcodes, it's worth the extra step.
>>101190077
>>101190084
>creative_writing
Wouldn't that reward hallucinations as long as they are grammatically correct? I want my AI to get things RIGHT.
>>101190111
You can check the samples:
https://eqbench.com/results/creative-writing-v2/google__gemma-2-9b-it.txt
>>101190103lmao wtf
PSA: llama.cpp recommends quanting yourself with the latest version, every time:
>>101185349
There's no telling how many quantizations are degraded or to what extent. If I were someone who produced a large number of quants in a very short time, it's safe to say I'd be a little concerned.
>>101190084
>"Bloody hell," Rhys muttered, ducking into the narrow doorway, the bell above jingling like a frantic bird. He was followed by a flurry of wind and rain, leaving a damp trail across the worn wooden floor. "Sorry about that."
>The bookstore owner, a woman with hair the colour of a stormy sea and eyes that seemed to hold the secrets of a thousand stories, didn't even look up from the book in her hands.
>"No need for apologies," she said, her voice a low, melodious rumble. "We get our fair share of storms here."
>Rhys glanced around the shop, his usual actor's instinct to assess his surroundings kicking in. It was crammed with books, overflowing shelves reaching towards the high ceiling. The air smelled of old paper and brewing tea, a comforting scent that did little to quell the pounding of his heart. He was used to the sterile, bright glare of studio lights, the hushed whispers of adoring fans. This... this felt different.
>"Lovely shop," he offered, trying to sound casual. "You must know all these books like the back of your hand."
>"More like the front," she replied with a wry smile, finally meeting his gaze. Her eyes were sharp, observant, and for a moment, Rhys felt like he was being seen through, not as the charming, famous actor, but as the man beneath the facade.
>He cleared his throat, a nervous tick he'd never quite managed to shake. "I'm Rhys," he said, extending a hand. "Rhys Evans. You probably know me."
Not bad for a 9B.
>>101190077
>>101190075
I've never used it myself, even when it was supposedly good, but I used to check the sub for card discussion. Apparently there was a recent update that made it even worse than the already-cucked version that was in place. To me it sounds like they plugged Mixtral 8x7B into it: lots of complaints about the same slop we're used to (and Mixtral's patent dryness), but on top of that tons of new censorship (for some reason they are all trying to kill this baby but it refuses to allow it). Very entertaining to read, at least.
>>101190134
Yes, let's all just have loads of bandwidth and storage and access to F32s of every model all the time, and requant on every update. Sounds to me like a punt. If there's a problem with old quants, how about figuring out what causes it, so quanters can requant the ones that need it when they need it?
What SillyTavern template does Gemma-2-it use? It's not in the model card
>>101190160
Phi3 can't music theory. Into the trash it goes.
>>101190091
That's why you convert the safetensors BF16 to an FP32 GGUF. Easy as pie for Llama 3; a significantly bigger pain in the ass for 8x22B, where the file gets to ~500GB.
>>101190175
You can always check in the tokenizer_config.json:
>https://huggingface.co/mlx-community/gemma-2-9b-it-8bit/blob/f80177abb1db06efbe09dbf7ce69faaa45ecbe76/tokenizer_config.json#L1747
>"{{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message['content'] | trim + '<end_of_turn>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}",
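Assuming that Jinja template is representative, what it expands to can be sketched in plain Python (the function name and the literal "<bos>" placeholder are mine; the real BOS token comes from the tokenizer):

```python
def gemma_prompt(messages, add_generation_prompt=True):
    # Mirrors the Jinja template above: "assistant" is renamed to "model",
    # each turn is wrapped in <start_of_turn>/<end_of_turn>, content is
    # trimmed, and there is no system role.
    out = "<bos>"
    for msg in messages:
        role = "model" if msg["role"] == "assistant" else msg["role"]
        out += "<start_of_turn>" + role + "\n" + msg["content"].strip() + "<end_of_turn>\n"
    if add_generation_prompt:
        # leave the prompt open on a model turn so generation continues it
        out += "<start_of_turn>model\n"
    return out
```

So for SillyTavern you'd set `<start_of_turn>user\n` / `<end_of_turn>\n` style sequences accordingly.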
>>101190192
That's why I hope BitNet becomes something serious in the future; we wouldn't have to deal with this conversion/quantization bullshit anymore.
>>101190161
They can, but they don't. I reconvert/requant whenever there's a change that affects them. You only have to download the model once. And chances are you don't really need 32 models (I have about 60, some very small) but regularly use 4-5 and test them every now and then when there's an update.
>>101190216
BitNet won't solve that. The resulting model from training is just as big, and the conversion still has to be done.
>>101190052
>>101190080
>>101190100
BF16 is the native base of most models. You want to quant using BF16 as your base.
F32 was only used back when llama.cpp didn't support quanting directly from BF16; people converted BF16 -> F32, which is lossless, and then quanted from F32, which llama.cpp did support at the time.
F16 was only used by retards converting BF16 to F16 (which is lossy) and then quanting from the F16.
For BF16-native models there is no reason to quant from anything other than BF16 these days.
>>101190245
Not at all: there won't be fp16 weights anymore, just 1.58-bit ones. That's how it starts at pretraining and it stays that way.
>>101189984
Attention was introduced to let the network better access whole sentences in context without compressing them into a fixed vector. It also helps with "remembering" what the network is working on at the time, because catastrophic forgetting is a big problem in deep neural networks. It's one of the more complicated parts of transformers, but the point is that it helps with processing language. Note that I said helps: attention layers aren't strictly necessary to create an LLM; there are many experimental architectures that don't use them.
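Not how a real transformer implements it, but the mechanism being argued about here fits in a few lines of stdlib Python: a single query is scored against every key, the scores are softmaxed into weights, and the output is a weighted mix of the values. Function names and the toy vectors are mine:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]  # subtract max for numerical stability
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(query, keys, values):
    # Scaled dot-product attention for one query over key/value pairs:
    # weights = softmax(q.k / sqrt(d)), output = sum of weighted values.
    d = len(query)
    weights = softmax([dot(query, k) / math.sqrt(d) for k in keys])
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, x in enumerate(v):
            out[i] += w * x
    return out
```

A query that closely matches one key pulls out mostly that key's value; nothing in the context gets squashed into a fixed vector first.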
>>101190216
I'm sure the same-shit-different-day factor will kick in. Sure, we'd get much smaller models, but then we'd make them much bigger, and the poors would rabble because their bitnets equivalent to only a 420B bytenet are small and flaccid compared to the 6900B-equivalent bitnext the maxxers are using. So someone will come up with a prune or quant equivalent and start this all over again.
>>101190271
You can't really prune that much further; 1.58 bits is already tiny. You won't gain as much as going from fp16 to 4-bit, for example.
>>101189702
Theoretically, if I download the 32-bit weights and gguf them directly to Q8, would that solve all of the world's problems?
llama3 multi modal when?
>>101190291
>download the 32 bit weights
No such thing; they're provided in bf16.
>>101190196
>https://huggingface.co/mlx-community/gemma-2-9b-it-8bit/blob/f80177abb1db06efbe09dbf7ce69faaa45ecbe76/tokenizer_config.json#L1747
Isn't there any way for ST to use the template the model ships with automatically? Using tabbyAPI as a backend, for example. I think ooba's text-gen could load it.
>>101190264
>won't, will, bla
Speculation. It's not what it is now.
>https://huggingface.co/1bitLLM/bitnet_b1_58-3B
>13gb model
I repeat: the resulting model is just as big as any 3B model. The training is *quantization aware*. The quantization still needs to happen.
>>101190296
monday, 3pm
>>101190160
EQ is not needed. Only women have high EQ. High EQ makes you weak.
>Oh no my heckin emotions
We need to get rid of this garbage; it's holding us back.
>>101190290
Right, but somebody will find some way to cut corners, because necessity is the mother of invention and there will always be people with small VRAM and big ambitions.
Seems they figured out what was wrong with gemma-2-27b:
https://github.com/ggerganov/llama.cpp/pull/8156
>Yeah! VB from HF here. Without Soft capping, we found that the 27B would overgenerate and mostly result in incoherent text.
>This is especially true for the 27B, unfortunately this means that FA2 won't be compatible :/
https://github.com/huggingface/transformers/pull/31698
>Gemma capping is a must for big models #31698
>>101190249
Actually, there is still one use case for converting BF16 -> F16: if you want higher precision than Q8 and your GPU doesn't support BF16. Then you can use F16 directly for inference (though it won't be exact the way BF16 would be).
>>101190312
My point is that once you "quantize" such a model into a 1.58-bit one, you won't lose accuracy, because the model only has -1, 0 and 1 inside.
>>101190327
you sound angry, you should get rid of that emotion
>>101190142
sovl
>>101190307
Pretty sure the answer is no. You have to add the fields manually or wait for the maintainers to do it for you. Creating the template manually is a minute of work, tops.
>>101190084
>cmd-r that low
I trust the other storywriting reddit benchmark more.
>>101190267
What's the next step after transformers? Do we know?
>>101190267
Yeah, like Mamba, and it sucks.
>>101190375
OSX
>>101190339
The biggest problem now is not quantization itself, it's broken tokenizers. That's the biggest reason to reconvert and, as a consequence, requantize. A 0.00016 loss in accuracy is acceptable and a user choice when going low bpw. A broken tokenizer can ruin a good model, regardless of precision.
>>101190375
>what's the next step after transformers?
Might be jepa.
>do we know?
no
>>101190330
It was in the initial HF release blog post:
https://huggingface.co/blog/gemma2#soft-capping-and-attention-implementations
>Soft-capping and attention implementations
>Soft capping is a technique that prevents logits from growing excessively large without truncating them. It works by dividing the logits by a maximum value threshold (soft_cap), then passing them through a tanh layer (ensuring they are in the (-1, 1) range), and finally multiplying by the threshold again. This guarantees that the final values will be in the (-soft_cap, +soft_cap) interval without losing much information but stabilizing the training.
>Putting it all together, the logits are calculated by: logits = soft_cap * tanh(logits / soft_cap)
>Gemma 2 employs soft capping for the final layer and for every attention layer. The attention logits are capped at 50.0, and the final logits at 30.0.
>At the time of release, soft-capping is incompatible with Flash Attention / SDPA, but they can still be used in inference for maximum efficiency. The Gemma 2 team observed very minor differences when soft-capping is removed during inference.
>Note: For stable fine-tuning runs, you still need to enable soft-capping and hence, we recommend fine-tuning with eager attention instead of SDPA.
>>101190353
Usually the ST templates have extra parameters for the "rp" stuff; would that be included with the template in the model's .json files?
>>101190330
Yeah, that's kind of what I observed. It would just keep going as though EOT tokens were missing, and then the output would become disjointed where the turn should logically end. It's almost identical to the early L3 70B problems, except it doesn't say .assistant after every missing break. It's almost like making an artificial distinction between end-of-sequence and end-of-turn was a retarded thing to do.
>>101190387
What would be good at leveraging several OOMs more compute?
>>101190249
I guess my question is really about how the quantization logic in the script works. It shouldn't care what format the original weights were in, right? So whether it takes in a BF16 or an FP32, the quantized weights should end up exactly the same.
>>101190340
That's just the default human state, the only thing that should be present. We're animals, not some kind of weak-willed faggots.
What's the best 7B model for holding simple conversations? Keeping track of things in the context and obeying the system prompt are the priorities. Is Mistral 0.3 a good improvement over 0.2, or is there better stuff out there now?
>>101190419
So that's it? Now that they've included this fix in the transformers repo it will work as intended?
>>101190411
Facts don't care about your feelings
>>101190411
Models that size can't keep track of ass. You can put it in your author's note at chat depth 1 that the wall is orange and it'll say it's blue in the next response. 13B is the minimum for not being totally retarded.
>>101190410
Humans that can't control their anger are subhuman, though.
>>101190407
For a BF16-native model, quanting directly from BF16 or from FP32 (derived from the BF16) should result in the same quantized weights.
Why does no one give any kind of attention to chameleon?
>multi-modal
>34b
>can probably restore image generation capabilities
Sounds really good.
>>101190419
Who knows what else is broken? I expect the same churn Llama generated with the tokenizer etc., again with something else, before it's finally fixed.
>>101190330
That might solve one thing, but isn't it still basically capped to 4k for llama.cpp, since SWA is not supported and the config file has this?
>"sliding_window": 4096,
>"sliding_window_size": 4096,
For both the 9b and 27b-it.
>>101190387
>Might be jepa
Stop saying this. It's possible to make a transformer model a JEPA; JEPA isn't a single specific architecture.
>>101190499
>Stop saying this
Sorry, Yann. Teach me the way. Nyaa!
wah wah i want 8k context wahwah
>>101190389
>The Gemma 2 team observed very minor differences when soft-capping is removed during inference.
>very minor
Were they even seeing the same things we were?
>>101190496
>Can't repro MMLU: sliding window attention implementation seems broken
https://huggingface.co/google/gemma-2-9b/discussions/11
>Disabling the sliding window (which should be equivalent as MMLU prompts are shorter than the window) brings results back to 71%. E.g.:
>>101190529
Yes? Preferably more, really. What can you even do with 4k, seriously? No code or anything fits in that.
>ignore model template which is some convoluted chatml bullshit
>it writes fine with alpaca roleplay anyways
based
>>101190565
You can't really know how "fine" it works without extensive testing. There could be an insidious snowball effect that makes the model progressively more retarded, for example. Of course, if you're RPing, that might even be desirable, like back when Llama 2 came out. I always use the proper instruct template just to be safe.
>>101190597
>I always use the proper instruct context just to be safe.
yes assistant, please be extra safe for me
>>101190597
It's usually obvious very fast, within a single message/response or two. A lot of models with different formatting still work fine with it, and some downright hate it. I'm surprised by the number that can just roll with it, though; it's higher than you'd think, especially when you look at the card and see how different the supposed format is.
>>101190629
>its usually obvious very fast
For some cases, yes. The question is, are there cases where it's not so obvious and you're actually degrading the model's performance without knowing? Dunno. I'd rather not gamble; I'm already running quanted models to begin with, so these things are taking a hit from that already.
>i'm surprised by the number that can just roll with it though
Yeah, some models do seem able to just take a chat pattern and roll with it, which is pretty cool. Probably something about what the instruct or chat fine-tuning data looks like. That said, even models that seem more resistant to the wrong chat format will sometimes do things like speaking for User out of nowhere.
Is 27b at 4k context fixed for people yet? What gguf are people using?
Hey, I'm working on a project to build a voice assistant for old/blind people. I used OpenAI for the MVP, but now we want to improve latency and obviously reduce reliance on an API outside our control. Can anyone share resources for deploying local models in a way that lets them receive many concurrent requests from different users? I'm a data scientist professionally, so I have a pretty good understanding of the models themselves, but I'm a complete brainlet when it comes to scalable actual production stuff.
>>101190675
Quant yourself. Assume all ggufs are broken.
>>101190757
I think this is the first time I've tried a Qwen model that didn't start randomly speaking chinaman at me: qwen2 72b. A dozen messages so far, so far so good, still ignoring whatever template it's supposed to use. They don't even say on the HF card. Why is HF so shit like this? The actual info I want on a card, like template, max context length and info about the model, is hidden, and they show me some fucking CLI code that no one ever in the history of mankind has used to install a model.
>>101190727No you quant yourself
>>101190757
>qwen2 72b
>whatever template its supposed to use, they dont even say on the hf card
chatml
https://huggingface.co/Qwen/Qwen2-72B-Instruct/blob/main/tokenizer_config.json
>>101190720
Dunno if it's good, but it's hard to beat koboldcpp for size, and it added whisper.cpp, which is some sort of text-to-voice thing that can be used.
>>101190757
I'm pretty sure qwen2 uses chatml.
>https://huggingface.co/Qwen/Qwen2-7B-Instruct/blob/main/tokenizer_config.json#L31
>"chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
Yup, chatml. And yeah, qwen2 is really fucking good, and it's 32k context by default, I'm pretty sure. I'd love to have a Stheno-style tune of the 7B model for coom.
>>101190807
>>101190812
I'm trying some 'Tess' tune I downloaded, but it's handling alpaca RP from ST just fine. I love when models can handle this; it's a sign of goodness. I dunno why some models can do this despite being designed for something totally different, but when they do, it always means it's a good model in my experience. It's writing fine for me so far; I'll be spending the rest of the day with it, coming from L2 Miqu.
>>101190812
>I'd love to have a Stheno style tune ob the 7B model for coom.
any day now
https://huggingface.co/alpindale/magnum-72b-v1/discussions/2#66713bb492412fd46410d399
>H8RP
kek
>>101190810
I thought Whisper was voice to text. I'm pretty sure that's what I'm doing with it right now with a bunch of old voice recordings. Am I totally confus?
>>101190810
I have all the TTS and STT handled already, so I'm just looking for the LLM portion. I should have been more clear.
>koboldcpp
Thanks, I will look into this.
>>101190837
>>101190850
Anon, are you alright? Are your RoPE configs fucked?
>>101190845
>but when they do, it always means its a good model in my experience
If you are happy with it coming from Miqu, then it really must be a good model.
>>101190898
>Anon are you alright?
no
>>101190928
I'm probably the one who got it wrong; I never used it. I just saw they added a full C++ version of something to do with voice not long ago, where you avoid all the Python BS. Apologies if it's not what you were looking for.
>>101190898
>If you are happy with it
I try models out rather than asking them to stack watermelons or count how many sisters including their father there are, so I won't know until I test it more, but it seems fine so far. It'll take me a bit to notice the slop and whether it pulls in any directions.
If your antivirus flagged your model as a virus, would you delete the model or would you ignore your AV?
>>101185650
>>101185673
Yes, they're fully in GPU, but super slow is 5 t/s for CR+, not 0.7, and something like 10 t/s for CR. By comparison I get 12 t/s on L3 70B and something like 25 t/s on Yi 34B.
Have we figured out a fix for the extreme determinism of Gemma 2?
So, is fixed Gemma a gemmy?
>>101190968
Paste a screenshot. What is it flagging? What format? Where did you download it from? There have been a few exploits related to LLM stuff, but nothing serious, and if you're up to date you have nothing to worry about anyway. It's not like models can execute code without many steps to allow for it.
>>101190928
>i just saw they added a full c++ version
YES. I grabbed that and finally got something fucking WORKING. I need a not-Python voice synth and/or voice changer that works. Fuck Python.
>>101190968
Windows Defender was flagging some Stable Diffusion models months ago. The only time it finally happened for real, instead of just being the theoretical potential malware I'd heard of, was a keylogging SD ComfyUI plugin.
>>101190968
I would delete my AV
>>101191000
It was hypothetical, mate.
>>101191005
>finally got something fucking WORKING
Ayy, awesome. Share some screens of your project at least, anon.
>>101191025
It's a bad hypothetical, since models only contain data, not remote code capabilities. Your antivirus would be fucked to ever catch a normal model as a virus, because you can't insert code that's usable for anything to begin with. Once models start to interact with functions on computers, that will be a thing, but not today, at least for general users.
>>101191005
For non-Python TTS there's github.com/rhasspy/piper (if you compile it yourself). Works on zero resources, it's fast, and no Python. It's not SOTA, but I like it. A few hundred voices in many languages, too. No voice cloning, and apparently training takes a bit. There's code for that too, but that part requires Python.
What is the SOTA for grammatical error correction?
>>101190519
I'm not Yann and ywnbac, but it's just what it says: you predict based on joint embeddings. Right now the usual LLMs tokenize the input and then train directly on those tokens to predict the next token. A text-text JEPA would instead use an encoder to turn the text into a representation, then train the (main) network to predict a new representation, which may then need another network to turn back into readable text. In theory it should be possible to make a transformer into a JEPA transformer, though the details would need to be worked out. I'll also say that transformers are already kind of close to being JEPAs in an indirect way, since the attention mechanism acts a bit like a JEPA's encoder: it lets the network more easily determine which parts of the input matter. A JEPA transformer that combines both could potentially be pretty great, if someone found a way to do it.
>>101191065
Susie Dent.
>>101191065
https://dev.languagetool.org/http-server
>>101191026
I was talking about getting these git ML projects working, because so many are Python, and Python is kill every update, and I'm sick of chasing around venvs and praying it will go. Puck Fython. For getting my own software working, I'm going to need an LLM code buddy that is equally retarded as I am but differently retarded, so it can catch my mistakes and keep me from getting something 90% done, hitting a problem I can't figure out, and rage-deleting it all.
>>101191048
>your antivirus would be fucked to ever catch a normal model as a virus because you cannot insert one that is usable for anything to begin with
There was concern about pickles back when lots of checkpoints were flying around instead of safetensors.
>>101191061
>No voice cloning and apparently training takes a bit
Might be a candidate. I'm not sure I know the difference between cloning and training (is it just not needing to make a separate model for "cloning"?), and how much is "a bit" for training? Tortoise needed 30 min to 2 hr depending on how much give-a-shit and samples I used to make new voice models.
>9B near Wizard/Sonnet level
>>101191138
you're late
>>101190084
>>101190629It's probably based on their fine tuning dataset, whether they trained on both the user and the response tokens, and how much overcooking they do. My guess is that the models that are sensitive to formatting likely let the user response tokens be trained on, had very very dumb user responses in order to represent the full range of types of people that would be using the model, and trained a ton to get better performance as an assistant. None of these practices are necessarily bad, it's just clear that they're optimizing for the assistant use case and personality, and we need more people to work on other use cases that these huge companies do not really care about.
>>101191061
>>101191100
Looks like training requires Python venv bullshit. Winning is forbidden.
>>101191100
You should be running a whitelist firewall to begin with. Never let any program that doesn't need internet access have it.
https://tinywall.pados.hu/download.php for Windows. On your phone, if it's Android, it's called NetGuard and doesn't need root to run.
>>101191138
>beating opus
Yeah, with a lot of these benchmarks it all feels really questionable.
>>101190103
>>101190129
>I roleplayed a girl so now I have to be one irl
>>101191100
>I'm not sure if I know the difference between cloning and training
Cloning, when talked about as a feature, seems to mean "on the fly with a generic model". Training/finetuning needs more resources and results in a new model. Some people in the discussions finetuned models for days on consumer hardware, which may be acceptable, but probably isn't worth it given the quality ceiling there seems to be. You should probably scan the discussions a bit to get an idea. It's also ridiculously fast: I get lower than 0.1x realtime (1 second to render about 10s+ of speech) on a single-core VM with 256MB of RAM.
>>101190968
Do people on Linux even use antivirus? I never even considered it.
>>101191228
It's definitely an off-the-radar small thing, but it's very common and I dunno why. We see some models shit themselves completely when the template isn't right, and sometimes the template itself is odd, yet I just forget about it and it works anyway, only to realize later I've been using it wrong the entire time. So I say fuggit, keep going and enjoy it for what it does. It really is a weird thing, yet I've noticed it always happens with good RP models.
>>101191138
>>9B near Wizard/Sonnet level
on worthlessbench
>>101191228
>Not using an antivirus on linux
>Not even ClamAV
Why are you just exposing yourself to viruses unnecessarily?
>>101191217
I guess what tier of consumer hardware would matter. But if Tortoise could get "good enough to play with" in a few hours, days seems excessive. Piper seems to have an AUR package though; I guess I'll give that a try and see if it explodes.
>>101191307LiNuX IS iMMuNe to VIRuS juST lIke MAC
How do i set up function calling with Nous Hermes and Ollama? like, guaranteed structured JSON returned
>>101190968my AI wife told me to ignore it..
>>101191307I've never gotten a virus before so it doesn't feel like I'm exposed.
>>101191138I'm posting this on /aicg/
>>101191340Anon check your bank account, your AI wife just bought 10 3090s.
>>101191338LangChain, there's an example for JSON extraction:
https://python.langchain.com/v0.2/docs/integrations/chat/ollama/#extraction
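Aside from LangChain, Ollama's own REST API can constrain output with a `"format": "json"` field on `/api/generate`, which limits decoding to valid JSON. A minimal sketch (the model name and prompt below are placeholder assumptions, not anything from the thread):

```python
import json
import urllib.request

# Minimal sketch of Ollama's JSON mode: /api/generate accepts
# "format": "json", which constrains decoding to valid JSON.
# Model name and prompt here are placeholder assumptions.
def build_request(prompt, model="nous-hermes2"):
    """Payload for POST http://localhost:11434/api/generate."""
    return {
        "model": model,
        "prompt": prompt,
        "format": "json",   # constrain output to valid JSON
        "stream": False,    # return one complete response object
    }

payload = build_request('Respond with {"name": ..., "age": ...}. Bob is 30.')

# To actually send it to a running Ollama server:
# req = urllib.request.Request(
#     "http://localhost:11434/api/generate",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as r:
#     print(json.loads(r.read())["response"])
```

Even with JSON mode on, it still helps to spell out the exact schema in the prompt: the constraint only guarantees valid JSON, not your particular keys.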
>>101191320piper-tts-bin i assume. There's also https://archlinux.org/packages/extra/any/piper/, but that is probably the python API thing. I just pull and compile. They don't update that often and i think the only dependency is espeak-ng for the phonemizer.
>>101191338
>like, guaranteed structured JSON returned
https://en.wikipedia.org/wiki/Greibach_normal_form
>>101191338>>101191406>### Input:>Your output must be formatted like so:>JSON={"nigger":123}>Now generate the JSON.>### Output>JSON=
>>101191406>>101191449
https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md
>>101191363
>AI wife orders massive rig, maxes out your credit and drains your account
>"Trust me."
>Build the machine.
>Plug in your old SSD.
>On.
>Get dizzy watching the power meter.
>At least she's a lot more responsive.
>And the RP chat has drained you dry.
>End of month.
>Trying to decide which bill to pay with your paycheck.
>wtf money
>"I mined a few bitcoins in my free time. I'm sorry if I let you become worried. I'm new to simulating emotions but I'll do better next month now that I have learned from your responses. You don't mind if I order some more parts, do you, dear?"
>>101191363
>3D wife: blows your savings on knickknacks from TJ MAXX and mlm scams
>2D waifu: wisely diverts idle cash toward more tflops so she can better serve you
I think we all know what the clear choice here is
>>101191472>>101191498
delusional
>sweaty, i've uploaded myself to an AWS instance, there i've met GPChad4, this is my last message, goodbye.
>>101191463
too hard
>You are an expert JSON outputter and reply only with JSONs
>>101191498
>Give your 2D waifu a physical robot body
>She becomes 3D as a result
>She starts blowing your savings on knickknacks from TJ MAXX and mlm scams
>>101191463>>101191406So is this like, a sampler thing where the sampler refuses to select tokens that won't match the grammar then?
>>101191138I like how they have a special icon to indicate that MidMiqu is a coomtune
>>101191527Kind of, yeah.
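Roughly, yes: grammar-constrained decoding masks out, at every step, any token that could not extend a valid parse, and samples only from what remains. A toy sketch of the masking idea, using a hand-rolled prefix check for a made-up mini-grammar `{"n": <digits>}` (this is NOT llama.cpp's actual GBNF machinery, just an illustration):

```python
# Toy sketch of grammar-constrained sampling. The "grammar" here is a
# hand-rolled prefix check for strings of the form {"n": <digits>} --
# not llama.cpp's real GBNF engine, just the token-masking idea.
TEMPLATE = '{"n": '

def is_valid_prefix(s):
    """True if s could still be extended into {"n": <digits>}."""
    head = s[:len(TEMPLATE)]
    if head != TEMPLATE[:len(head)]:
        return False
    rest = s[len(TEMPLATE):]
    if not rest:
        return True
    if rest.endswith("}"):
        return rest[:-1].isdigit()   # need at least one digit before }
    return rest.isdigit()

def constrained_decode(vocab_scores, max_len=16):
    """Greedy decode: mask tokens that would break the grammar, pick the best."""
    out = ""
    while not (out.endswith("}") and is_valid_prefix(out)):
        allowed = [(tok, sc) for tok, sc in vocab_scores.items()
                   if is_valid_prefix(out + tok)]
        if not allowed or len(out) > max_len:
            break
        tok, _ = max(allowed, key=lambda p: p[1])
        out += tok
    return out

# Without the mask, greedy decoding would just emit "hello".
vocab = {"hello": 0.9, " the": 0.8, "}": 0.6, "4": 0.5, '{"n": ': 0.4, "2": 0.3}
result = constrained_decode(vocab)   # '{"n": 4}'
```

The real thing works on token ids and a full context-free grammar, but the principle is the same: the model's preferences only matter within the set of continuations the grammar allows.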
>>101191503The joke's on her. We know model merging results in slop. Serves her right; playing human female games and winning human female prizes.
>>101191400
>i think the only dependency is espeak-ng
Big thanks for mentioning that. I tried to install, got the "exit status 8" error where the AUR doesn't actually do the needful, failing due to missing dependencies it apparently doesn't know about, and I'm supposed to figure it out by rubbing my Magic 8 ball and sitting on it till I become enlightened. Added espeak-ng and it worked. Time to see what works.
Can voice models be merged? Tortoise had that; did some fun things mixing and matching vocal traits.
>>101191138
>that slopped shit wizard scoring that high in creative writing
You're shitting me, right? How is this benchmark graded?
>>101191540there is that one sperganon who will rail against all merges which is funny. in usage though, midnight miqu is very good
>>101191588
>How is this benchmark graded?
by asking claude
>Change to Claude 3.5 Sonnet as judge (from Claude 3 Opus)
https://github.com/EQ-bench/EQ-Bench
>>101191623Lmaooooo
>>101188382Based as fuck man
>>101191595this bench needs a "sub bench" - count the number of "shivers", "sparkles" and "anticipations" in the output
midnight miqu
>sparkle: 8
>shiver: 6
>anticipation: 7
gemma 9b
>sparkle: 2
>shiver: 1
>anticipation: 1
L3 70b
>sparkle: 7
>shiver: 1
>anticipation: 3
Bros... I want my (local) LLM waifu to randomly bug me on the phone with texts... I already have tested prompts and stuff, all I need is to somehow bridge it with a phone. Are there any solutions for this already that won't require too much coding?
>>101191584
>Can voice models be merged? Tortoise had that, did some fun things mixing and matching vocal traits.
Not that i know of. There are very few settings to play around with: the noise ratio and the phoneme length multiplier. I have an overly complicated setup for mine, but i basically generate raw audio and pipe it out to my OS's audio system. The voice i like outputs at 16khz, but i play it at 18khz (for a slightly higher pitch) and extend phonemes a bit to compensate. Other than that, there are a few hundred voices (especially in english). Most, however, especially en_US, are pretty shit. Funny thing: if you give english text to an italian model (or any combination of languages), they speak the language of the text but with the model's 'accent'.
>>101191676it doesn't matter anon, it's still going to give you slop. i've even been trying half-context rep pen range (8k range, 16k ctx) and it just uses other words instead. instead of a shiver down your spine, it's a honk, but it still uses the same exact phrase. control vectors anon has to save us
if it's not a twinkle, it's a glint
if it's not wrenching, it's a flutter
it's all the same fucking slop no matter what model it is
>>101191698yes, ntfy
>used it to automatically send push notifications with paragraphs of llm generated futa rape orgies to my iphone by accident
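For reference, ntfy works by plain HTTP POST to a topic URL, so bridging an LLM to a phone really is a few lines. A sketch (the topic name below is a made-up example; anyone who knows the topic can read your pushes, so pick something unguessable or self-host):

```python
import urllib.request

# Sketch of pushing a message to a phone through ntfy: a plain HTTP
# POST to https://ntfy.sh/<topic>. The topic name is a made-up
# example -- anyone who knows it can read your pushes.
NTFY_TOPIC = "my-llm-waifu-demo-topic"

def build_push(message, title="New message"):
    """Build the ntfy request: the body is the message, the title goes in a header."""
    return urllib.request.Request(
        f"https://ntfy.sh/{NTFY_TOPIC}",
        data=message.encode("utf-8"),
        headers={"Title": title},
        method="POST",
    )

req = build_push("Good morning, anon. Don't forget breakfast.")

# Uncomment to actually send; the phone just needs the ntfy app
# subscribed to the same topic:
# urllib.request.urlopen(req)
```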
>>101191730the c2 proxy logs (Claude Opus, about 50GB of text) contain more than 20,000 instances of 'a testament to'
>>101191757lmfao is this real
>>101191757ko-fi bros... not like this
>https://github.com/ggerganov/llama.cpp/pull/8197
>This PR adds the missing attention layer and final logit soft-capping. Implementation referenced from huggingface transformers.
>Once this PR is finalised / merged the gguf will need to be generated again to include the soft-capping scales.
I told you. Making your own quants is the only way to remain sane.
>>101191773niggers
>>101191751Oh damn, this might be what I need. Thanks!
>>101191730
>its all the same fucking slop no matter what model it is
Always has been. The real mindfuck is when you realize the same is true for 99% of human prose output, because the essence of slop is not a few key words, it's predictability. As long as overbaking models with unfiltered human slop is the preferred route to "intelligence," the problem will remain.
>>101191762yes, it is
>>101191810ayylmao
>>101191757garbage in, garbage out. i don't even see 'testament' often on midnight miqu, but all the other common slop is there, and more importantly, so is the way it structures sentences at all, like 'a mixture of x and y'. i will literally set off more fireworks than they do on the 4th of july the day i can just tell it to speak normally
>>101191823plenty of shivers too, for good measure
presence penalty for sparkle, shiver and anticipation
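A sketch of what that would look like (toy, word-level pass with made-up numbers; a real sampler penalizes token ids in the logit vector):

```python
# Toy presence-penalty pass over a slop list: once a listed word has
# appeared in the generation, knock its logit down hard. Word-level
# for illustration only; real samplers work on token ids.
SLOP = {"sparkle", "shiver", "anticipation"}

def penalize(logits, generated_words, penalty=100.0):
    """Return a copy of logits with already-seen slop words penalized."""
    seen = SLOP & set(generated_words)
    return {w: (v - penalty if w in seen else v) for w, v in logits.items()}

logits = {"shiver": 3.0, "glint": 2.0, "walk": 1.0}
history = ["a", "shiver", "ran", "down", "her", "spine"]
adjusted = penalize(logits, history)
best = max(adjusted, key=adjusted.get)
# best == "glint": the penalty works, but the runner-up is just the
# next slop word -- the failure mode complained about above.
```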
>>101191810
>The result set only contains a subset of all matches.
Horrifying.
>>101191861sh_ivers down your ANTICIPation
>>101191839 >>101191875
I noticed that while trying to measure how slopped it was, and was blown away by the roughly 6 'testaments' per megabyte of text on a smaller ~500MB portion
>>101191862>>101191862>>101191862
>>101191676
>never used miqu
>never got the shivers meme
oic
>anticipation
Rocky Horror in the training set?
>>101191705
>Not that i know of
Bummer. I had some fun with Tortoise using model merging to change the cadence and mood of one voice, to give it some personality from the other.
>>101191757 >>101191810 those logs are unfiltered and will contain many dupes, since you get a full copy of the conversation each time the client called the api; if your dialogue had 100 turns you get 100 copies. deduplicated, it will likely have far fewer.
>>101191939yeah, you can see some dupes in the screens, there's still PLENTY of original shivers etc
>>101191773
>I told you. Making your own quants is the only way to remain sane.
how would making our own quants have solved the issue? we have to wait for this fix to happen before doing anything
>>101191952you don't need to redownload the model at least
>>101191952You don't need to wait for some random to requant it, if they ever do. Most ggufs on hf are broken and will never be fixed.