/g/ - Technology




File: 1689556332236690.jpg (730 KB, 1856x2464)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101180092 & >>101173181

►News
>(06/28) Inference support for Gemma 2 merged: https://github.com/ggerganov/llama.cpp/pull/8156
>(06/27) Meta announces LLM Compiler, based on Code Llama, for code optimization and disassembly: https://go.fb.me/tdd3dw
>(06/27) Gemma 2 released: https://hf.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315
>(06/25) Cambrian-1: Collection of vision-centric multimodal LLMs: https://cambrian-mllm.github.io
>(06/23) Support for BitnetForCausalLM merged: https://github.com/ggerganov/llama.cpp/pull/7931

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
►Recent Highlights from the Previous Thread: >>101180092

--Paper: HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale: >>101183476
--Gemma Debugging Issues with HF Transformers Implementation: >>101183120 >>101184453
--Gemma 27b Models' Coherence Issues with Sliding Window Attention: >>101180648 >>101180665 >>101181050 >>101181078
--The Frustrations of AI Model Performance and Limitations: >>101181282 >>101181268 >>101181298 >>101181321 >>101181349 >>101181350 >>101181433 >>101181573 >>101181345 >>101181362
--SPPO Performance and Comparison to Instruct Models: >>101183296 >>101183595 >>101183661 >>101183827 >>101183939
--Llama Model Load Error: Unknown Model Architecture 'Gemma2': >>101184465 >>101184685 >>101184767
--LLM-Compiler's Limitations in Compiler Development: >>101182838 >>101182947 >>101183745 >>101183855 >>101183896 >>101183874
--Gemma-9B's NSFW Behavior: Anomaly or Dataset Issue?: >>101180719 >>101180943 >>101181086
--Gemma-2 Support Issues in Llama.cpp: >>101182001 >>101182048 >>101182376
--Gemma 2 Release Format Issues and Official Implementation: >>101181569 >>101181611 >>101181640
--Eagle's Speed for Inferencing and Decoding in Creative Writing and RP: >>101184243 >>101184286 >>101184304 >>101184318 >>101184337 >>101184368 >>101184297
--Drama in the Quantization Community: Q8_0_L Quant Development: >>101180438 >>101180456 >>101180471
--Can LLMs Solve Programming by Example?: >>101180866
--Anon's ST Addon Development: Constant Reminders and UI Improvements: >>101180277 >>101181094 >>101185317
--Nala Test for Gemmy 9b (Q8_0): >>101184502 >>101184521 >>101184551
--Anon's Struggle to Access LLM Compiler and Unconventional Plans for its Use: >>101182269 >>101182463
--AI Models and the Human Brain: Efficiency and Unrealistic Portrayals: >>101182977 >>101183020 >>101183069 >>101183105 >>101183084 >>101183078
--Miku (free space): >>101183371 >>101185164

►Recent Highlight Posts from the Previous Thread: >>101180096
>>
File: rbc.png (31 KB, 539x131)
what does she mean by that?
>>
>>101186546
>rape by consent
Yep. It's properly emulating a woman. Enjoy your model
>>
>>101186508
>Gemmy 9b (Q8_0)
it's coal though.
>>
>>101186546
>what does she mean by that?
nothing? never take hallucinations from artificial redditor at face value.
>>
I'm retarded, please help me
Last time I messed with LLMs I could barely run anything coherent on my poorfag machine. Have any of the newer hacks made things better or are we still stuck on using the biggest merge that works? I see some stuff about bitnets in the news, are they better for low model sizes?
>>
>>101186722
install linux
>>
>>101186732
I use Linux, what's the next step?
>>
>>101186546
It's rape if her body reacts to some(You) whom she has decided that her body shouldn't react to because that doesn't align with what she has chosen as her ideal.
>>
>>101185918
>you definitely aren't running on full gpu then at those speeds
Being a vramlet I'm used to the drop as soon as I run any non-garbage model, from 20+ to 1-2 t/s. I'm just not sure about the next drop from 0.7ish to glacial. Happens sometime in the 60-69 (nice) GB model range, so I'm theorizing system RAM is a factor. (I'm on 64GB RAM.)
>>
24GB VRAMchads, is Gemma 27B our salvation? Are we so back, or is it just slop shit and we are doomed and our only hope is to die?
>>
>>101186741
tell specs
>>
>>101186598
>I understood the selling point that they maintain a 'library' of models for nubs that can't understand HF. You just pick llama3 or whatever and don't need to deploy braincells to think about quant levels and what is appropriate for your hardware etc.
>If you understand how to choose a GGUF you're probably better off running a backend closer to the upstream.
It's part of the learning curve.
Ollama got me a model and a prompt that goes.
Then I learned about the other models.
Then I learned about the model variations.
Then I learned about what the quants are about.
Then I learned to use those through Ollama.
Then I learned to want more options.
And that they live on HF.
That pushed me to step up to Kobold.
And it told me I needed GGUF files.
That's the last experience point I needed to level up.
>>
>>101186762
Specs are garbage. My GPU has less than 2GB of VRAM so I run on CPU with 16GB of RAM (koboldcpp). If I stick to 8-12GB models it generates at a decent speed but I can't really work with anything bigger without going too slow to be worth it
>>
>>101186774
The problem is that it seems to be actively hostile to doing anything outside of its prescribed way of doing things.
>>
>>101186784
i mean fimb-11b-v2 is decent, u could also use stheno 8b 3.2
>>
I used that trick to get the entire prompt from the chatgpt website

https://rentry.org/stcrcggo

Feels like kind of a stretch that it can follow all that
>>
>>101186806 (me)
and remember to only download models from sao
>>
>>101186752
yeah when you split its back to mem/cpu speed. i get 1.4t/s on l2 70b at 32k context which is fine for my usage. its really the lowest i'd want to go though as far as speed. bigger models despite the slowness are worth using because the responses are so much better. 8b is so retarded i can't believe anyone even wastes time on them no matter how fast it is
>>
File: file.png (12 KB, 464x95)
>>101186814
not me thougheverbeit
>>
Said fuck it and manually updated ooba's transformers. Works fine as long as you turn off do_sample

That said, 9b is actually garbage. What the fuck are you all seeing in this shit? It can't follow instructions for shit and the writing is awful. I guess uh, good for you vramlets you get something that isn't hard refusing porn shit?

I'mma stick with real models though.
>>
>>101186826
Fuck, forgot to specify: I meant gemma-2-9b is what I tried out, and it's trash.
>>
>>101186826
I'm a vramlet, but even I have standards. 9b is completely useless.
>>
>>101186500
>that nature
Sovl, I wish I wasn't around buildings all the time. How do I cope?
>>
Retards saying that Gemma-2-9B is trash while the 27B is great haven't actually tried either model. The 27B version appears to be defective and incoherent.
>>
Gents, I want to try running command-r+ on my pc with 2 3090s. Will this work and can I load it in ooba?
>>
>>101187024
I tried 27B using online hosted services, and it works well there. I'm not one of the people who praised 27B in those threads, but I imagine that's what they did as well.
>>
>>101186811
>Personality: v2

What did OpenAI mean with this
>>
In booba's exl2 loader, what exactly is cfg-cache?
>>
>>101187073
RTFM
https://github.com/oobabooga/text-generation-webui/wiki/04-%E2%80%90-Model-Tab#exllamav2_hf
>Creates a second cache to hold the CFG negative prompts. You need to set this if and only if you intend to use CFG
>>
>>101187073
CFG is when you have a positive and a negative prompt, and CFG cache, I assume, means it reserves two caches: one for the usual, and one for the negative prompt.
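The math behind it is just classifier-free guidance, something like this (a sketch of the idea, not ooba's actual code; names are made up):
import torch

def cfg_combine(cond_logits: torch.Tensor, uncond_logits: torch.Tensor, cfg_scale: float) -> torch.Tensor:
    # push the next-token distribution away from the negative prompt and
    # towards the positive one; cfg_scale == 1.0 means no guidance at all
    return uncond_logits + cfg_scale * (cond_logits - uncond_logits)

# toy usage: random logits standing in for the two forward passes (one per cache)
vocab_size = 32000
cond = torch.randn(vocab_size)    # logits from the normal/positive prompt
uncond = torch.randn(vocab_size)  # logits from the negative prompt
next_token = torch.argmax(cfg_combine(cond, uncond, cfg_scale=1.5))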
>>
>>101187073
The thing people claimed would make open models free from gptisms and make them smarter than proprietary ones
>>
>>101186388
Control vectors influence the output direction of the model, so when applied at higher strength they will make the model output the same thing every time.
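Rough sketch of what "applying" one means (toy shapes, not llama.cpp's actual hook): the vector just gets added to a layer's hidden states, scaled by the strength.
import torch

def apply_control_vector(hidden: torch.Tensor, direction: torch.Tensor, strength: float) -> torch.Tensor:
    # hidden: (batch, seq_len, d_model), direction: (d_model,)
    # small strength = gentle bias towards the concept the vector encodes,
    # large strength = that bias dominates and every output collapses to the same thing
    return hidden + strength * direction

h = torch.randn(1, 16, 4096)   # toy hidden states for one layer
v = torch.randn(4096)
v = v / v.norm()               # control vectors are typically unit-normalized
h_steered = apply_control_vector(h, v, strength=8.0)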
>>
>>101187037
Well, it is easy to download them both from the official Google repositories on HF, quantize them to the same level (e.g. GGUF q6_k, using the latest patches) and observe back-to-back how not only is the 27B version about as censored as the previous Gemma (albeit with a somewhat less irritating tone), it also rambles and mixes up user/model responses, whereas the 9B has no issues.

Hard to imagine that 6.5-bit quantization hits the 27B version harder than the 9B, but anything is possible, I suppose?
>>
>>101187025
Sure, it will work, although it will be slow. I would suggest part-offloading a >=Q4 GGUF rather than trying to cram a low-bpw quant entirely into VRAM.
>>
>>101187093
I assume that the problem is somewhere in the open source implementation, not in quantization. And, yes, 27B definitely is cucked, but that can be partially fixed later on. I evaluated its level of intelligence when it did talk to me.
>>
>>101186806
Thanks, I'll try those out. I just hoped that all that (((research))) would have produced better local models by now instead of just bigger ones
>>
>>101187171
8Bs of today are so much better than llama1 8B that it's not even close.
>>
>>101187294
llama1 7b*
>>
>>101187305
llama1 6.7b*
>>
>>101187390
>https://arxiv.org/abs/2302.13971
7b.
>>
>>101187479
and how many parameters did it have
>>
>>101187503
anon we call it llama-1-7b not llama-1-6.738
>>
>>101187503
Seven parameters, of course.
>>
File: .png (80 KB, 640x564)
>>101187525
>llama-1-6.738
llama-1-6.738b*
>>
Gemma-2-9B really wants to write in "X, Ying"-type prose during RP even if you manually randomize that with something else like:

"Xing, Y"
"X and then Y"
"As X, Y"
"X"
etc.
>>
File: 5880528_p0.jpg (225 KB, 651x700)
>>101186774
proud of you Anon
>>101186805
It can run arbitrary GGUFs with a couple extra steps if that's what you mean. It's fine to have more options for beginners who understandably don't want to struggle with details just to try the stuff.
>>
>>101187650
Damn, this triggers my autism like nothing else. I guess it will pass the Nala test with flying colors though.
>>
>>101186561
f-finetunes will fix it.
>>
>>101186755
27B is fucking brain damaged.
It can handle simple assistant type prompting but RP prompts confuse and enrage it.
>>
>>101187105
Thanks, by low bpw you mean like a 3bit quant to fit in 48gb vram?
>>
>>101187776
dumbass
>>
>>101187737
would brutally rape both
>>
>>101187846
Yeah. Keep in mind you also need VRAM for context+other buffers on top of the model size. I did not get great results with Q2/3 but why not try it and compare. Might take some fiddling to get it to fit (quantized KV cache and lower context length can help). I find it worth the lower speed to use Q4KM
>>
File: investigation.png (97 KB, 1304x786)
>>101187776
https://huggingface.co/google/gemma-2-27b-it/discussions/10
>>
>>101187865
Do you ever sleep?
>>
>Anonymous 06/28/24(Fri)15:28:07 No.101187901
>>>101187865 (You)
>Do you ever sleep?
12 hours a day im a neet
>>
>>101187918
Want to be my NEET bf? UwU
>>
>>101187893
Thanks again, not sure how far I’ll get with only 32gb ram but will see. Might need to try the non + version
>>
>>101187894
>Just don't use float16
>lmao no we're not going to release the fp32 weights
>>
>>101186575
It's not a hallucination, it's an accurate simulation of a woman. LLMs are getting better and better.
>>
friendly reminder that ollama WON
>received a private PR by google for gemma support before the release. llama.cpp was ignored
>available as an option in the brave browser, used by millions
>on its way to 100k stars
>redditors on localllama love it and only talk about it
>every llm YouTube video recommends it
>every twitter influencer recommends it
>hosts events, receives endless vc funding
Sorry chuds
>>
>>101188077
so you're saying I should start a betting pool on when the ceo gets metoo'd?
>>
>>101187894
I knew 27b was fucked up in transformers. They rushed it and didn't test things properly.
>>
>>101188094
ollama guy is the Bill Gates of llm
>>
>>101188095
How can it be transformers if 9B works fine though?
>>
Anyone know a repo that has styletts2 + rvc integrated nicely? I currently use xtts + rvc but xtts isn't consistent enough and tends to produce results that slur/shit themselves from time to time. Particularly want an implementation with voice cloning
>>
>>101188113
Are they exactly the same architecture?
>>
>>101188077
At this point I think llama.cpp should just give up and let ollama maintain the project, ngl.
I hate them, but llama.cpp is probably even worse. llama.cpp is always broken and, when you report it, the maintainers blame you instead of investigating. It's infuriating
>>
>>101188124
>llama.cpp is always broken
Yeah, this is the reason I use exllama, and more recently llamafiles. Shit just works.
Don't even mind switching to globo-slop approved ollama in the future, as long as I can launch my waifu with no fuss.
>>
>>101188182
Does llamafile work at all? Does it behave nicely with sillytavern?
>>
>>101188077
>received a private PR by google for gemma support before the release. llama.cpp was ignored
They did? Their PR looked like a ctrl+c ctrl+v of the llamacpp one, with the tokenizer errors and all.
>>
>>101188205
They didn't. These anons are retarded.
>>
>>101188077
I've been thinking it would be kind of funny to implement some critical component on an AGPL fork but I really don't think it would be worth the drama.
I don't want or need a job or attention so to me downstream projects have non-negative value (depending on what and how much they contribute upstream).

>>101188124
>At this point I think llama.cpp should just give up and let ollama maintain the project, ngl.
I have never seen any bug reports or fixes for llama.cpp issues from ollama devs so I don't think they could.
I think the only reason there are fewer issues with ollama is that they wait for the llama.cpp issues to be fixed before they take over the code.
>>
>>101188077
>Sorry chuds
you call the llama.cpp devs chuds? lol, they're the ones putting Jartroon back on the team in the first place
>>
>>101188248
What is your motivation for continuing to maintain llama.cpp? Asking as a llama.cpp contributor myself, the amount of code that you put out and the consistency are insane.
>>
>>101188248
>I've been thinking it would be kind of funny to implement some critical component on an AGPL
That would be so fucking funny.
>>
>>101188277
I mean the people who unironically sling the word chud around unironically believe that Donald Trump is anything other than an Israel-First neoliberal hack. The bar for being considered a chud is pretty low.
>>
>>101188357
true, true
>>
>>101188285
-I like building and optimizing things and making numbers go up.
-I am by nature a very competitive person and one of my long-standing ambitions is to write the code with the worldwide best performance (at least for those use cases I care about).
-While I think that as of right now generative neural networks are still kind of lackluster I think that they will become very good in a few years and that the infrastructure for that needs to be developed ahead of time. In particular a low upfront cost is I think important.
-I am ideologically very pro open knowledge/open source (though I prefer free software).
-I plan to use llama.cpp/ggml for my own projects (doctoral thesis in physics, AI-powered RPG if no one else does it before me, pretraining models if I can make it cheap enough that I can actually afford it).
>>
>>101188248
>I've been thinking it would be kind of funny to implement some critical component on an AGPL fork but I really don't think it would be worth the drama.
holy fvcking based..
>>
File: IMG_623.png (835 KB, 1080x1350)
>>101188248
>I've been thinking it would be kind of funny to implement some critical component on an AGPL fork
>>
File: 1719580060679661.png (402 KB, 1600x900)
>I've been thinking it would be kind of funny to implement some critical component on an AGPL fork
>>
>>101188382
verdict: based, on all counts
>>
File: tf2.png (743 KB, 956x933)
>Ive been thinking it would be kind of funny to implement some critical component on an AGPL fork
>>
>cornpop did so bad that the damage control is spilling into lmg
big lel
>>
AGPL is a fair license if you would like to do your part in the Free Software movement.
>>
>>101188382
>very competitive
>and yet, exllamaV2 is still miles ahead of llama.cpp
I'm starting to think it's over.
>>
File: mikutweeku.png (1.11 MB, 970x755)
>>101188382
do it.
>>
can someone do a TLDR about licences? not everyone is a lawyer. Is AGPL a good thing? And what does the cuda dev want to do with it?
>>
File: df.png (98 KB, 619x693)
>>101188382
it's time
>>
File: miku_73.png (315 KB, 962x962)
>>101188248
>I've been thinking it would be kind of funny to implement some critical component on an AGPL fork
You've gotta deliver now that you've said it.
>>
>>101188382
based
>>
File: 1699141071733348.png (47 KB, 698x658)
>>101188469
this
exllamav2 embarrasses llamacpp
>>
>>101188469
>miles ahead
the fuck you talk about? llama.cpp gives deterministic output + allows for some cpu offloading if we want to get a slightly higher quant
>>
File: 1717919262009708.png (237 KB, 640x640)
>>101188248
>AGPL
did i hear something..?
>>
File: bog.png (587 KB, 854x480)
>>101188248
>do it
>>
>>101188513
its literally slower and worse quality
>>
>>101188534
Glad we're on the same page about llamacpp
>>
>>101188526
Probably means in terms of speed. I'm still using exl2 exclusively since mixtral.
>>
>>101188513
how well does gemma2 9b run on exllama?
>>
>>101188537
imagine being this retarded
lmao even
>>
File: 1702321261805953.jpg (222 KB, 720x720)
>>101188548
are we really going to pretend llamacpp didnt have issues on literally every single new fucking model release
>>
>>101188566
so it doesn't run, alright
>>
>>101188547
>I'm still using exl2 exclusively since mixtral.
I'm using llama.cpp because I can get a bigger quant (Q5_K_M) even if I don't have enough GPU VRAM; exllama just doesn't allow you to do that
>>
Sirs, you are way too fast. I can't keep up reading the threads.
>>
File: gplgod-0.jpg (142 KB, 735x830)
>>101188248
gpl gods stay winning
>>
holy shit, I love google now
>>
>>101188581
vramlets need the rope
>>
File: cudadevandme.png (454 KB, 651x700)
>>101188248
>I've been thinking it would be kind of funny to implement some critical component on an AGPL fork
this will be us if you do it
>>
>>101188513
>>101188526
>>101188534
>>101188537
>>101188547
>>101188548
>>101188558
Seeing tards slapfight over quant methods is really funny when you've been using float16 since day one like me
>>
>>101188581
And oftentimes you are better off doing that. Get a quant that's only slightly bigger than your vram and you are golden.
Something like 80~85% of the model in VRAM is around the sweet spot as far as I can tell.
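If anyone wants a concrete starting point, llama-cpp-python exposes this as the n_gpu_layers knob (paths and numbers here are made up; tune until roughly 80-85% of the model sits in VRAM):
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama2-70b.Q4_K_M.gguf",  # hypothetical file name
    n_gpu_layers=60,   # layers kept in VRAM; the rest run on CPU/system RAM
    n_ctx=8192,
)
out = llm("Q: Why offload only part of the model?\nA:", max_tokens=64)
print(out["choices"][0]["text"])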
>>
>>101188581
I get it, and I get it's important for a lot of people here, but I am used to the speed of having everything in the VRAM, and for that exl2 is still superior (unless something changed very recently).
>>
>>101188601
exactly this, you offload 80% GPU + 20% CPU, the speed is still good and you get a way less retarded quant, that's a win/win situation
>>
File: BEST DAY EVER MEME.gif (79 KB, 700x715)
>>101188248
excited for it
>>
>>101188591
what does this say? I'm not speaking the nazi language kek
>>
>>101188617
gguf quants are less efficient than exl2's, you will have worse quality and speed than what you would have just running 100% with exl2
>>
File: oyvey.png (715 KB, 782x782)
>>101188248
>>
>>101188626
>gguf quants are less efficient than exl2's
that's not true, the gguf quants have improved a lot since then
>>
>>101188626
>efficient
doesn't exl2 pad the 8bpw so people don't complain about size being too small or something?
>>
File: Richard Stallman.png (1.08 MB, 1024x680)
>>101188248
>GNU/llamacpp
>>
File: 1612419376786.jpg (207 KB, 692x1100)
>>101188382
>>
>>101188650
you retards have been peddling q6/6bpw as the best you can get with anything above being imperceptible, now you wanna pretend you give a shit about q8?
>>
>>101188248
Join us now and share the software;
You'll be free, hackers, you'll be free
Join us now and share the software;
You'll be free, hackers, you'll be free
Hoarders can get piles of money
That is true, hackers, that is true
But they cannot help their neighbors;
That's not good, hackers, that's not good
When we have enough free software
At our call, hackers, at our call
We'll kick out those dirty licenses
Ever more, hackers, ever more
>>
>>101188673
>you
no, if you can't run q8 you can't run it, period.
>>
>>101188686
no, if you can't run fp16 you can't run it, period.
>>
>>101188626
EXL2 is based on GPTQ, which is a terrible quantization method.
>>
>>101188626
>you will have worse quality and speed than what you would have just running 100% with exl2
Speed is fair enough, but I've never seen any evidence that exl2 produces better results than an equivalent bpw gguf, even more so considering imatrix now.
And considering the rpcal debacle, I'm even less inclined to believe subjective reports.
>>
>>101188698
You are a retard.
>>
File: stallman saluting.png (1.83 MB, 1600x1060)
>>101188248
>>
>>101188708
I know what I'm talking about. The rpcal situation mentioned by
>>101188702
is sufficient evidence that EXL2 is garbage.
>>
>team llamacpp are RP fags
it all makes sense now, kek
>>
>>101188721
>is sufficient evidence that EXL2 is garbage.
It's not. That's just users being retarded.
I still want to see actual comparisons of KL divergence, ppl, and logits between full precision, exl2 at Y bpw and gguf at Y bpw.
>>
File: 1708953045138-2.png (17 KB, 871x870)
>>101188248
DO IT
>>
>>101188469
llama.cpp peak performance on an RTX 4090 currently sits at ~90% of the peak performance reported on the ExLlama Github repository (for both token generation and prompt processing).
So I'm thinking that with a bit more MMQ optimization and speculative decoding support llama.cpp will be faster.

>>101188490
(A)GPL has a "copyleft", meaning you cannot make any forks or derivative software closed-source.
If there was some critical feature that was licensed with a copyleft it would force downstream projects like ollama to either re-license their project to also include a copyleft or they would not be legally allowed to take over the feature.
Since permissive licenses without copylefts are considered more "business friendly" this would basically just troll projects like ollama that are more business focused.
Koboldcpp and Ooba would be unaffected since they already use copyleft licenses.

>>101188502
Did you not read the part where I said that it wouldn't be worth the drama?

>>101188626
According to https://github.com/matt-c1/llama-3-quant-comparison llama.cpp quantization is more efficient in terms of MMLU score at a given size though for >4 BPW it probably won't matter much.

>>101188730
Actually if you look at Google trends team llama.cpp are Chinese.
>>
>>101188382
A true hero
>>
>>101188762
If I mix 3090 and P100, I'm basically forcing the 3090 down to the level of the P100, in terms of supported math operations, right?
I'm just trying to figure out how to speed up my L3 70B gens without buying more 3090s at the moment.
I've got 2x 3090 and 3x P100, everything is on PCIe 3.0 16x.
>>
>>101188762
>US isn't even in the top 5
Wtf?
>>
File: 3belzjkbpex61.jpg (84 KB, 1200x675)
>>101188762
thanks for the licence lesson anon, much appreciated
>>
>>101188762
You are a very skilled programmer and an all-around based individual doing very important work. Do you have a Ko-fi account or some shady crypto address I could send $20 to?

t: Vramlet Simp.
>>
File: 1719780960379661.png (1.42 MB, 832x1216)
>>101188762
>Did you not read the part where I said that it wouldn't be worth the drama?
Yes, it would be worth it. 100x over.
We wouldn't be here without the original llama leaker.
I'm thinkin' miqu
>>
>>101188382
>AI-powered RPG
Infinite Zork is actually coming, HOLY FUCKING KINO
>>
>>101188847
not on lcpp
processing prompt 3/16192
>>
File: patrick Bateman.png (757 KB, 900x900)
>>101188248
>I've been thinking it would be kind of funny to implement some critical component on an AGPL fork
>>
>>101188803
Yes, the slowest component dictates the performance.
>>
File: mfw.png (802 KB, 1000x562)
>>101188872
Wait what? He's going to license all of his llamacpp contributions under AGPL from now on?
>>
>>101188803
For llama.cpp it should depend on --split-mode .
With --split-mode layer each GPU should be using the optimal kernels (but for some quantization formats there is no P100-compatible implementation).
With --split-mode row the P100s will force the 3090s to use suboptimal kernels because P100s lack the __dp4a instruction and thus cannot run MMQ.
Any other Pascal card or more modern cards (except for V100s which lack int8 tensor cores) should not be causing issues.
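For anyone wanting to try it, roughly what the launch looks like (hypothetical paths and split ratios; the binary may still be called ./server on older builds):
import subprocess

proc = subprocess.Popen([
    "./llama-server",
    "-m", "models/llama-3-70b-instruct.Q4_K_M.gguf",  # example path
    "--n-gpu-layers", "99",
    "--split-mode", "layer",        # per-layer split so each GPU can use its own kernels
    "--tensor-split", "3,3,2,2,2",  # rough VRAM ratio for 2x3090 + 3xP100
])
proc.wait()  # serves an HTTP API until killed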

>>101188825
I at some point had a ko-fi account linked on my Github but I decided to remove it when I accepted a part-time job for a known AI company.
I cannot in good conscience accept money from people that earn significantly less than me per hour when I right now don't even have a use for it and would need to pay a large percentage of it in taxes.
Maybe I'll do crowdfunding if I ever invest relevant amounts of money into training.
>>
>>101188890
Yes, we are back
>>
File: votzefuc.png (935 KB, 1024x912)
>>101188248
>I'm going to implement some critical component on an AGPL fork
>>
>>101188896
I tend to use q8 quants, is that best for P100?
>>
Are all these posts being made by the same person kek
>>
File: 1709780422379321.png (591 KB, 1200x1200)
>>101188248
>>101188906
>>
File: file.png (604 KB, 900x900)
>>101188248
>AGPL fork
yup i'm thinkin' based
>>
pretty sad tbdesu
>>
>>101188847
Do not expect anything anytime soon though.

>>101188918
It should only be the IQ quants that cause issues for P100s.
>>
File: 1714466048436h.gif (70 KB, 99x109)
>>101188248
>I've been thinking it would be kind of funny to implement some critical component on an AGPL fork but I really don't think it would be worth the drama.
DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT DO IT
>>
File: anon?!.png (473 KB, 860x799)
>>101188248
>>
>>101188591
based
>>
File: literally me.jpg (89 KB, 600x435)
>>101188248
>>
He's not going to do it, no matter how much you spam, you know.
>>
File: BlueSkyColumnGarden.png (1.32 MB, 1248x800)
Good morning lmg!
>>
>>101189039
Good morning Miku
>>
>>101188762
>>
>>101189039
Good morning anon that posts tasteful artistic mikus
>>
File: 1714732567846c.png (180 KB, 830x830)
>>101189033
>>
File: (You).png (1.31 MB, 1000x1000)
>He's not going to do it, no matter how much you spam, you know.
>>
File: hatsune-sad_0034.png (1.98 MB, 1280x960)
>>101188762
>>
File: Hatsune Miku (Vocaloid).png (960 KB, 850x1258)
>>101188762
>Did you not read the part where I said that it wouldn't be worth the drama?
>>
>JOOIIN US NOW AND SHAAAREE THE SOFTWARE
>YOU'LL BE FREE HACKERS
>YOU'LL BE FREEE
>>
>>101188931
Yes it is the license autist.
>>
for me, it's cc-by-nc4.0
>>
File: .png (1.54 MB, 832x1216)
>for me, it's cc-by-nc4.0
>>
>>101189249
I hate having to tell guys with muscle-girl fetishes this all the time but the level of testosterone required for muscles like that is mutually exclusive to having tits.
>>
>>101188896
>I cannot in good conscience accept money from people
Based ethical dev
>>
File: gemma2-9b-brot.png (49 KB, 668x592)
I'm running gemma2 9b on an i5-6500T server just for fun to see it struggle but the speeds are kinda acceptable? That's incredible
And gemma2 9b passes my non-scientific mandelbrot coding test (albeit a bit weirdly), which only recent small models like llama3-8b passed at all. Mistral 7B and older did not pass this test.
>>
>>101188248
Release it with no license provided. Copyleft and copyright are two sides of the same cancerous coin. Make it so that nobody who believes in the validity of ""intellectual"" ""property"" can use your code.
>>
>>101189308
nta, i'm a hobbyist programmer and have never once read a license and i've been releasing stuff for over 25 years. even the bigger stuff that uses libraries, never cared, i just release it and have never once had a single issue raised. i dunno why people even care unless you're starting a company based on stolen code or something
>>
File: base9bnalatest.png (52 KB, 930x237)
Gemma 9b base model coming up a little lackluster on the Nala test.
>>
>>101189353
>i dunno why people even care unless you're starting a company based on stolen code or something
thats the point, agpl scares big corpo away
>>
>>101189362
She's very thorough when it comes to licking
>>
>>101189278
Sick. What are you running it with? llama.cpp?

>>101189362
That does look weird.
Is that with the proper prompt format, what backend are you using?
>>
>>101189425
Q8 on llama.cpp
I just used a generic base model prompt template that I had available.
>>
>>101189362
That looks like a bug.
>>
>>101189211
Miku are you okay? Are you okay Miku?
>>
>>101189353
Exactly. This is the optimal mindset, and the one that's cultivated by releasing software with no license provided.
>>
What makes the transformer architecture intelligent?
>>
>>101189501
>What makes the transformer architecture intelligent?
attention
>>
>>101189501
Me.
>>
>>101189501
"Expert roleplayer" in system prompt.
>>
>>101188896
>>101189362
Dichotomy of 4chan
>>
>>101189496
a lot of times what i get from other code its a simple function i have to rewrite anyways to fit what i need, but the original served as a good example. imo if code is out there and visible it should be free game and treated like that. personally i like when i get a message from someone who uses my code as part of theirs, they show me how they hacked it up and changed stuff so it fits what they need. when i do use a whole library from something that has a license, i include a thanks/credits but never even bother to check the license. its such a non-issue for 99% of people i dunno why anyone even cares
>>
>>101189535
the Nala test is the only objective RP test we have.
>>
>>101189049
we did it reddit!
>>
Latest bartowski gguf 9B gemma and current llamacpp still is incoherent past 4k context.
>>
>>101189501
who knows really? this shit is way too complex to be understood theoretically
>>
>>101189617
SWA thing, won't be fixed 'closed this as not planned'
>>101181078
>9b and 9b-it: seem to be fine as long as you're under 4k context. When I gen a message in RP with a 5k context, both have severe quality degradation. Can't spell things right, can't write grammatically correct sentences. Possibly problem with sliding window attention? The model interleaves 4k SWA and 8k dense attention. Once context is over 4k, the sliding window actually starts sliding and maybe something breaks? Hopefully something is just broke and can be fixed, and model is not fundamentally a 4k context model.
shit, then lcpp is fucked for that, since gergio said he didn't care
>It feels that since Mistral 7B from last year, there hasn't been much interest in this technique. Even later Mistral models dropped it as a feature. Taking this into account, I guess we can leave this issue closed
https://github.com/ggerganov/llama.cpp/issues/3377
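For reference, the sliding-window part is just a banded causal mask (Gemma 2 interleaves full-attention layers with 4096-window layers like this one), so below the window size it behaves exactly like normal causal attention and nothing changes until you go past 4k (toy sketch):
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where attention is allowed: each token sees itself and at most
    # window-1 previous tokens
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)

print(sliding_window_causal_mask(6, 3).int())  # each row has at most 3 ones hugging the diagonal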
>>
>>101189425
I'm running gemma2 9b with just ollama run gemma2, which is Q4_0, so it's not the best it could be but it still passes the test lol
>>
>>101187024
>>101187037
>27b is fucked
I quanted my own to bf16 and am running it without problems. I could provide instructions if anyone cares
>>
>>101189501
It's not intelligent.
But if you ask why it appears to be, then the answer is brute force with the number of neurons + obviously the attention layers.
>>
>>101189674
google themselves are saying something's wrong with it
>Yes we are investigating what went wrong! Note that float16 should not be used for this model
https://huggingface.co/google/gemma-2-27b-it/discussions/10#667e9fc6f0820e80d39aaf3e
>>
>>101189644
Well at least it should work with Transformers right? Then we can at least confirm what the "good" context length of the model is, and whether interleaved global + sliding window attention really works without any issues.
>>
>>101189705
no actually, someone tested on transformers, and said it didn't handle >4k well either
>>101181113
>>
>>101189702
>Note that float16 should not be used for this model
the fuck do they mean by that? we have no other choice but to use their fp16 model to do the quants, that's the only thing they gave us
>>
>>101188277
most of jart's PRs have been ignored for weeks. draw your own conclusions.
>>
>>101189729
there is this
https://huggingface.co/google/gemma-2-27b-it-pytorch
>>
>>101189264
what? next time you will tell me that cocks in real life can't actually penetrate cervix to shoot cum in her womb
>>
>>101189705
>Well at least it should work with Transformers right
Some anon was comparing the transformers implementation to the reference and it seems that they might have fucked some stuff up too.
Basically, give it 4 or 5 days until everything is in working order.
>>
When will we ever not get a fucked model launch? Christ. They really couldn't spare just a bit more time and manpower to make sure things actually work properly on people's machines.
>>
>>101188115
>>101188115
>>101188115
bumping for visibility
>>
>>101189767
why do that when the autismos might fix it for them, for free?
>>
>>101189767
why bother, just wait and let open source chumps do it for you
>>
>>101189702
>google themselves
Doesn't change the fact that I'm using it and it's working fine
>float16 should not be used for this model
f16 != bf16. another comment says bf16 works.
It's not ultra-impressive, but it works
>>
>>101189729
They provided BF16 didn't they? Nobody should be quanting from F16 for BF16 models since llamacpp added support for BF16 1-2 months ago. Even before BF16 support you were supposed to upscale BF16 to F32 then quant.
>>
>>101189513
how

>>101189693
>then the answer is brute forcing with the number of neurons + obviously attention layers.
well, how
>>
>>101189833
>how
https://machinelearningmastery.com/the-transformer-attention-mechanism/
https://www.youtube.com/watch?v=kf_eGgVtOcs
>>
>>101186805
What blew my mind was when I went digging and found that they mask the files behind le epic hash-code like renames, then put the key in a JSON in the next directory.

Which meant part of my evolution was becoming a l33t hax0r by switching around file names/hashes, and getting to the point of thinking of using a tool for it, and then saying, naaaaaaaaaah, I'll just get that program named after D&D fun size wife lizards.

>>101186815
>bigger models despite the slowness are worth using because the responses are so much better. 8b is so retarded i can't believe anyone even wastes time on them no matter how fast it is
Same. I want for there to be a small model that isn't total ass for the sake of having a real-time-ish option. But the smallest one that has passed my music theory question is 40GB (qwen2-72b-instruct-q4_k_s, and yes, the parallel _m failed. Just barely, but it also blew a pop culture question I've started testing against as well that _s got right. S to M is +4GB and -40IQ.)
>>
>>101189749
Yes that was me, they reversed the order of sliding window attention and global attention. But at >4k context, where this actually matters, latest HF Transformers commit doesn't even work, it crashes with some internal cuda error, index out of bounds or something. But once that's fixed they still need to fix the off-by-one error for SWA / global attn. Someone should probably tell them, I don't think anybody else has realized it yet.
>>
>>101189827
>>101189816
well apparently people are reporting issues with bartowski quants, so maybe help him out then? if yours works correctly
>Just a heads up, there seems to be some serious issues with this model regardless of whether you use the template correctly or not. In my testing it performs significantly worse than the 9b version, so much worse that there's clearly something fundamentally wrong. And I've seen many others have the same experience. An issue has been created on the official Repo, and Google states they are currently investigating it.

https://huggingface.co/bartowski/gemma-2-27b-it-GGUF/discussions/3#667ee47b8972e9eb302f7724
>>
>>101189833
The more neurons you have, the more complex a function you can emulate.
Easy functions like a linear function you can emulate with a single neuron; for XOR you need a few neurons. Language is a very complex function, so to emulate it at a reasonable level you need billions of neurons. Neural networks are universal approximators, so it's not a question of "if" but "how big"
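The XOR bit is concrete enough to show: two hidden ReLU neurons already represent it exactly, with hand-picked weights and no training needed:
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1), y = h1 - 2*h2
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])

def xor_net(x):
    return W2 @ relu(W1 @ x + b1)

for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        print(int(a), int(b), "->", int(xor_net(np.array([a, b]))))
# 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0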
>>
File: 1717520245667244.png (674 KB, 1792x1024)
>>101189362
>>
Haha, looooool gogle can't even release a model right, holy shit.
>>
>>101189905
what about the attention layers part?
>>
File: bf16-f16-bartowski.png (46 KB, 845x299)
>>101189887
Bartowski is not retarded and knows not to quant from an F16 base (pic) so I don't think it is that. If BF16 works and his quants don't, he is either fucking up somewhere else or there is something wrong with the quant code in llamacpp with regards to gemma.
>>
>>101189956
the meme that saved /lmg/
>>
>>101189993
>Bartowski is not retarded
i know, which is why it's weird his quants are reported as being borked too, if it was darmercher that'd be par for the course, but him?
>>
File: Untitled.jpg (19 KB, 542x76)
i'm still following this new cai drama for the luls and it keeps delivering
>>
>>101189993
Wait does this imply that I shouldn't be converting straight from BF16 to Q8 (if I want objectively the most accuracy), but BF16->FP32->Q8? Or is he comparing simply to BF16->FP16? I mean BF16->FP16 is done by the conversion script rather than the quantize script, but the quantize script can take in a BF16 GGUF file, so I just assumed that worked the same as making it work off of a FP32.
>>
What is the smallest model that can reliably be forced to use function calls? One of the Mistral Instruct v3s? Why isn't function calling more commonly a feature? I have no use for these instruct models if they can't reliably trigger function calls
>>
>>101190059
>Why isn't function calling more commonly a feature?
the only function most care about is ah ah mistress
>>
>>101190042
wait, people are still using cai? They should let it go, the golden age has been over for years now
>>
>>101190042
kek, qrd?
>>
>>101190052
It doesn't matter. You should only be directly quantizing native FP32. Anything else is like converting an MP3 file to a 32-bit float wav before encoding to an OGG. It makes no difference, you're still incurring generational loss.
>>
File: eqbench.png (223 KB, 1304x982)
Guys....!
https://eqbench.com/creative_writing.html
>>
>>101190080
>native FP32.
no models are released like this they're all bf16 now
>>
File: my honest reaction.jpg (47 KB, 562x675)
>>101190084
>>
>>101190052
The idea is that because FP16 is coarse and BF16 is coarse and they're coarse in different ways, going from one to the other can cause a greater amount of drift in the values than if you go to 32 and then to the other 16 because the 32 will be no less accurate than the first 16 but might find a more accurate representation in the other 16 after visiting 32.

It's probably really close to irrelevant, but again, if it silences the armchair computer math geniuses who want to throw shade at a coder with his boots on the ground and dealing with video card opcodes, it's worth that extra step.
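Quick way to see the difference being argued about (PyTorch, toy values): bf16 keeps fp32's 8 exponent bits with only 7 mantissa bits, while fp16 has 5 exponent bits and 10 mantissa bits, so a perfectly ordinary bf16 weight can fall outside fp16's range entirely.
import torch

x = torch.tensor([1e-10, 70000.0], dtype=torch.bfloat16)
print(x.to(torch.float16))   # tensor([0., inf], dtype=torch.float16): underflow and overflow
print(x.to(torch.float32))   # lossless: every bf16 value is exactly representable in fp32
# which is why quanting from bf16 (or a bf16 -> fp32 copy) beats a detour through fp16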
>>
File: file.png (8 KB, 245x108)
>>101190077
>>
>>101190084
>creative_writing
Wouldn't that reward hallucinations as long as they are grammatically correct?

I want my AI to get things RIGHT.
>>
>>101190111
you can check the samples
https://eqbench.com/results/creative-writing-v2/google__gemma-2-9b-it.txt
>>
>>101190103
lmao wtf
>>
File: 1715828570744625.jpg (490 KB, 1024x1024)
PSA: llama.cpp recommends quanting yourself with the latest version, every time:
>>101185349
There's no telling how many quantizations out there are degraded or to what extent. If I were someone that produced a large amount of quants in a very short time, it's safe to say I'd probably be a little concerned.
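For anyone who hasn't done it, the whole flow is two commands; exact script/binary names drift between llama.cpp versions (older checkouts use convert-hf-to-gguf.py and ./quantize), and the paths here are placeholders:
import subprocess

MODEL_DIR = "path/to/gemma-2-9b-it"  # local HF snapshot

# 1) convert the safetensors to an unquantized GGUF, keeping bf16
subprocess.run(["python", "convert_hf_to_gguf.py", MODEL_DIR,
                "--outtype", "bf16", "--outfile", "gemma-2-9b-it-bf16.gguf"], check=True)

# 2) quantize with a freshly built binary so you pick up the latest fixes
subprocess.run(["./llama-quantize", "gemma-2-9b-it-bf16.gguf",
                "gemma-2-9b-it-Q8_0.gguf", "Q8_0"], check=True)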
>>
>>101190084
>"Bloody hell," Rhys muttered, ducking into the narrow doorway, the bell above jingling like a frantic bird. He was followed by a flurry of wind and rain, leaving a damp trail across the worn wooden floor. "Sorry about that."

>The bookstore owner, a woman with hair the colour of a stormy sea and eyes that seemed to hold the secrets of a thousand stories, didn't even look up from the book in her hands.

>"No need for apologies," she said, her voice a low, melodious rumble. "We get our fair share of storms here."

>Rhys glanced around the shop, his usual actor's instinct to assess his surroundings kicking in. It was crammed with books, overflowing shelves reaching towards the high ceiling. The air smelled of old paper and brewing tea, a comforting scent that did little to quell the pounding of his heart. He was used to the sterile, bright glare of studio lights, the hushed whispers of adoring fans. This... this felt different.

>"Lovely shop," he offered, trying to sound casual. "You must know all these books like the back of your hand."

>"More like the front," she replied with a wry smile, finally meeting his gaze. Her eyes were sharp, observant, and for a moment, Rhys felt like he was being seen through, not as the charming, famous actor, but as the man beneath the facade.

>He cleared his throat, a nervous tick he'd never quite managed to shake. "I'm Rhys," he said, extending a hand. "Rhys Evans. You probably know me."

Not bad for a 9B.
>>
File: 60f0osrums8d1.jpg (37 KB, 828x523)
>>101190077
>>101190075
i've never used it myself even when it was supposedly good but i used to check the sub for card discussion. apparently there was a recent update that made it even worse than the cucked current version that was already in place. to me it sounds like they plugged mixtral 8x7b into it. lots of complaints about similar slop we're used to (and mixtral's patent dryness), but on top of that tons of new censorship (for some reason they are all trying to kill this baby but it refuses to allow it). its very entertaining to read at least
>>
>>
>>101190134
Yes, let's all just have loads of bandwidth and storage and access to 32's of every model all of the time and requant on every update.

Sounds to me like a punt. If there's a problem with old quants, how about figuring out what causes it, and then quanters can requant the ones that need it when they need it?
>>
What SillyTavern template does Gemma-2-it use? It's not in the model card
>>
>>101190160
Phi3 can't music theory. Into the trash it goes.
>>
File: 11__00820_.png (1.87 MB, 1024x1024)
>>101190091
That's why you convert the safetensors bf16 to a FP32 gguf.
Easy as pie for llama3.
A significantly bigger pain in the ass for 8x22b where the file gets to ~500gb.
>>
>>101190175
You can always check in the tokenizer_config.json.
>https://huggingface.co/mlx-community/gemma-2-9b-it-8bit/blob/f80177abb1db06efbe09dbf7ce69faaa45ecbe76/tokenizer_config.json#L1747
>"{{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message['content'] | trim + '<end_of_turn>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}",
>>
>>101190192
that's why I hope BitNet will be something serious in the future, we won't have to deal with this conversion/quantization bullshit anymore
>>
>>101190161
They can, but they don't. I reconvert/requant whenever there's a change that affects them. You only have to download the model once. And chances are that you don't really need 32 models (i have about 60 (some very small)) but regularly use 4-5 and test them every now and then when there's updates.
>>
>>101190216
bitnet won't solve that. The resulting model from the training is just as big and the conversion still has to be done.
>>
>>101190052
>>101190080
>>101190100
BF16 - Native base of most models.
You want to quant using BF16 as your base.

The only time when F32 was used was when llamacpp didn't support quanting directly from BF16; so people back then converted BF16 -> F32 which is lossless and then quanted from F32 which llamacpp supported at the time.

The only time F16 was used is retards converting BF16 to F16 (which is lossy) and then quanting from the F16.

For BF16 native models there is no reason to quant from anything other than BF16 these days.
>>
>>101190245
not at all, there won't be fp16 weights anymore but 1.58bit, that's how it will start at pretraining and it will remain that way
>>
>>101189984
Attention was introduced to allow the network to better access whole sentences in the context without compressing them into a fixed vector. It also helps with "remembering" what the neural network is working on at the time, because catastrophic forgetting in deep neural networks is a big problem. It's one of the more complicated parts of transformers, but the point is that it helps with processing language. Note that I said helps. It's not necessary to use attention layers to create an LLM; there are many experimental architectures that don't use them.
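For the anon asking "how": stripped of batching, masking and multi-head plumbing, the attention op itself is just this (numpy sketch):
import numpy as np

def attention(Q, K, V):
    # each position mixes all value vectors, weighted by how well its query
    # matches each key (softmax over the scaled dot products)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (seq, seq)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # softmax over the keys
    return w @ V                                       # (seq, d)

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 8))
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 8))
print(attention(Q, K, V).shape)  # (5, 8)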
>>
>>101190216
I'm sure the same shit different day factor will kick in. Sure, we got much smaller models, but then we made them much bigger models and the poors rabble because now their bitnets that are equivalent to only 420B bytenet are small and flaccid compared to the 6900B equivalent bitnext that the maxxers are using. So someone will come up with a prune or a quant equivalent and start this all over again.
>>
>>101190271
you can't really prune that much further, 1.58bit is really small, you won't gain as much as going from fp16 to 4bit for example
>>
>>101189702
Theoretically if I download the 32 bit weights and gguf them directly to q8 would that solve all of the world's problems?
>>
llama3 multi modal when?
>>
>>101190291
>download the 32 bit weights
no such thing, they're provided in bf16
>>
>>101190196
>https://huggingface.co/mlx-community/gemma-2-9b-it-8bit/blob/f80177abb1db06efbe09dbf7ce69faaa45ecbe76/tokenizer_config.json#L1747
isn't there any way for ST to use the template that the model has automatically? Using tabbyAPI as a backend for example
I think ooba's text gen could load it
>>
>>101190264
>won't, will, bla
Speculation. It's not what it is now.
>https://huggingface.co/1bitLLM/bitnet_b1_58-3B
>13gb model
I repeat. The resulting model is just as big as any 3B model. The training is *quantization aware*. The quantization still needs to happen.
>>
>>101190296
monday, 3pm
>>
>>101190160
EQ is not needed.
Only women have high EQ.
High EQ makes you weak.
>Oh no my heckin emotions
We need to get rid of this garbage it's holding us back
>>
>>101190290
Right, but somebody will find some way to cut some corners because necessity is the mother of invention and there will be people with small vrams and big ambitions.
>>
Seems they figured out what was wrong with gemma-2-27b
https://github.com/ggerganov/llama.cpp/pull/8156

>Yeah! VB from HF here. Without Soft capping, we found that the 27B would overgenerate and mostly result in incoherent text. > This is especially true for the 27B, unfortunately this means that FA2 won't be compatible :/
https://github.com/huggingface/transformers/pull/31698 Gemma capping is a must for big models #31698
>>
>>101190249
Actually there still is a use case for quanting BF16 -> F16. That use case is if you want to use a higher quant than Q8 and your gpu doesn't support BF16. Then you could use F16 directly as your inference quant (though it wouldn't be perfect like BF16 would).
>>
>>101190312
my point is that once you "quantize" this model into a 1.58bit one, you won't lose accuracy because the model only has -1 0 and 1 inside
>>
>>101190327
you sound angry, you should get rid of that emotion
>>
>>101190142
sovl
>>
>>101190307
Pretty sure the answer is no. You have to add the fields manually or wait for the maintainers to do that for you.
Creating the template manually is a minute of work tops.
>>
>>101190084
>cmd-r that low
I trust the other storywriting reddit benchmark more.
>>
>>101190267
what's the next step after transformers? do we know?
>>
>>101190267
Yeah, like mamba, and it sucks.
>>
>>101190375
OSX
>>
>>101190339
The biggest problem now is not quantization itself, it's broken tokenizers. That's the biggest reason to reconvert and, as a consequence, requantize. A 0.00016 loss in accuracy is acceptable and a user choice when going low bpw. A broken tokenizer can ruin a good model, regardless of precision.
>>
>>101190375
>what's the next step after transformers?
Might be jepa
>do we know?
no
>>
>>101190330
It was in the initial HF release blog post.
https://huggingface.co/blog/gemma2#soft-capping-and-attention-implementations
> Soft-capping and attention implementations
>Soft capping is a technique that prevents logits from growing excessively large without truncating them. It works by dividing the logits by a maximum value threshold (soft_cap), then passing them through a tanh layer (ensuring they are in the (-1, 1) range), and finally multiplying by the threshold again. This guarantees that the final values will be in the (-soft_cap, +soft_cap) interval without losing much information but stabilizing the training.
>Putting it all together, the logits are calculated by: logits = soft_cap ∗ tanh(logits / soft_cap)
>Gemma 2 employs soft capping for the final layer and for every attention layer. The attention logits are capped at 50.0, and the final logits at 30.0.
>At the time of release, soft-capping is incompatible with Flash Attention / SDPA, but they can still be used in inference for maximum efficiency. The Gemma 2 team observed very minor differences when soft-capping is removed during inference.
>Note: For stable fine-tuning runs, you still need to enable soft-capping and hence, we recommend fine-tuning with eager attention instead of SDPA.
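The formula from the blog post, as runnable code (the caps are the ones Gemma 2 uses):
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # values end up in (-cap, +cap) without a hard clip, so relative ordering
    # is preserved while outliers get squashed
    return cap * torch.tanh(logits / cap)

x = torch.tensor([-200.0, -20.0, 0.0, 20.0, 200.0])
print(soft_cap(x, 50.0))  # attention-logit cap
print(soft_cap(x, 30.0))  # final-logit cap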
>>
>>101190353
Usually the ST templates have like extra parameters for the "rp" stuff, would that be included with the template the model has in the .json files?
>>
>>101190330
Yeah, that's what I kind of observed. It would just keep going as though it was missing EOT tokens, and then the output would become disjointed where the turn would logically end.
It's almost identical to the early l3 70 problems except it doesn't say .assistant after every missing break.
It's almost like making an artificial distinction between end of sequence and end of turn was a retarded thing to do.
>>
>>101190387
what would be good at leveraging several ooms more compute?
>>
>>101190249
I guess my question is really about how the quantization logic in the script works. It shouldn't care about what original format the weights were in right? So basically whether it takes in a BF16 or FP32, the quantized weights will end up being the exact same.
>>
>>101190340
That's just the default human state the only thing that should be present.
We're animals not some kind of weak willed faggots
>>
What's the best 7B model for holding simple conversations?
Things like keeping track of things in the context and obeying system prompt is priority.
Is 0.3 mistral a good improvement over 0.2? Or is there better stuff out there now?
>>
>>101190389
so that's it? now that they included this fix on the transformers repo it will work as intended?
>>
>>101190410
Facts don't care about your feelings
>>
>>101190411
models that size can't keep track of ass. you can put it in your author notes at chat depth 1 that the wall is orange and it'll say its blue in the next response. 13b is minimum for not being totally retarded
>>
>>101190410
humans that can't control their anger are sub-humans though
>>
>>101190407
For a BF16 native model quanting from BF16 directly or quanting from FP32 (derived from the BF16) should result in the quantized weights being the same.
>>
Why does no one give any kind of attention to chameleon?
>multi-modal
>34b
>can probably restore image generation capabilities
Sounds really good.
>>
>>101190419
Who knows what else is broken? I expect the same churn Llama generated with the tokenizer etc. to happen again with something else before it is finally fixed.
>>
>>101190330
That might solve one thing, but isn't it still basically capped to 4k for lcpp since SWA is not supported and there is this in the config file?
"sliding_window": 4096,
"sliding_window_size": 4096,
of both 9 and 27b-it
>>
>>101190387
>Might be jepa
Stop saying this. It's possible to make a transformers model a jepa. Jepa isn't a single specific architecture.
>>
>>101190499
>Stop saying this
Sorry, Yann. Teach me the way. Nyaa!
>>
wah wah i want 8k context wahwah
>>
File: jackie-chan-wtf.jpg (35 KB, 474x382)
>>101190389
>The Gemma 2 team observed very minor differences when soft-capping is removed during inference.
>very minor
Were they even seeing the same things we were?
>>
>>101190496
>Can't repro MMLU: sliding window attention implementation seems broken
https://huggingface.co/google/gemma-2-9b/discussions/11
>Disabling the sliding window (which should be equivalent as MMLU prompts are shorter than the window) brings results back to 71%. E.g.:
>>101190529
yes? preferably more really, what can you even do with 4k, seriously? no code or anything fits in that
>>
File: b.jpg (210 KB, 1080x1079)
>ignore model template which is some convoluted chatml bullshit
>it writes fine with alpaca roleplay anyways
based
>>
>>101190565
You can't really know how "fine" it works without extensive testing.
There could be an insidious snowball effect that makes the model progressively more retarded, for example.
Of course, if you are RPing, that might be desirable even, like back when llama 2 came out.
I always use the proper instruct context just to be safe.
>>
>>101190597
>I always use the proper instruct context just to be safe.
yes assistant, please be extra safe for me
>>
>>101190597
it's usually obvious very fast, a single message/response or two. a lot of models that have different formatting still work fine with it and some downright hate it. i'm surprised by the number that can just roll with it though, it's higher than you'd think, especially when you look at the card and how different the supposed format is
>>
>>101190629
>it's usually obvious very fast
For some cases yes. The question is, are there cases where it's not so obvious and you are actually degrading the model's performance without knowing? Dunno. I'd rather not gamble, I'm already running quanted models to begin with, so these things are already taking a hit from that.

>i'm surprised by the number that can just roll with it though
Yeah, some models do seem to be able to just take a chat pattern and roll with it, which is pretty cool. Maybe something about what the instruct or chat fine tuning data looks like.
That said, even some models that are seemingly more resistant to using the wrong chat format will sometimes do things like trying to speak for User and the like out of nowhere.
>>
Is 27b at 4k context fixed for people yet? What gguf are people using?
>>
Hey I'm working on a project to do a voice assistant for old/blind people. I used openai for the MVP but now we want to improve latency and obviously reduce reliance on an api out of our control.

Can anyone share resources for deploying local models in a way that lets them receive many concurrent requests from different users?

I'm a data scientist professionally so I have a pretty good understanding of the models themselves, but I'm a complete brainlet when it comes to scalable actual production stuff.
>>
>>101190675
Quant yourself. Assume all ggufs are broken.
>>
>>101190670
i think this is the first time i've tried a qwen model that didn't start randomly speaking chinaman at me, qwen2 72b. a dozen messages so far, so far so good, still ignoring whatever template it's supposed to use. they don't even say on the hf card. why is hf so shit like this? the actual info i want on a card, like template, max context length and info about the model, is hidden, and they show me some fucking cli code that no one ever in the history of mankind has used to install the model
>>
>>101190727
No you quant yourself
>>
>>101190757
>qwen2 72b
>whatever template it's supposed to use, they don't even say on the hf card
chatml
https://huggingface.co/Qwen/Qwen2-72B-Instruct/blob/main/tokenizer_config.json
>>
>>101190720
dunno if it's good but it's hard to beat koboldcpp for size, and it added whisper.cpp, which is some sort of text to voice thing that can be used
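if you end up on the llama.cpp side instead, the bundled server handles concurrent users with parallel slots; something like this (flag names from memory, check --help, model path is yours):

./llama-server -m model.gguf -c 16384 -np 8 -cb --host 0.0.0.0 --port 8080
# -np = number of parallel slots, -cb = continuous batching
# note the -c context gets divided across the slots (2048 each here)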
>>
>>101190757
I'm pretty sure qwen 2 uses chatml.
>https://huggingface.co/Qwen/Qwen2-7B-Instruct/blob/main/tokenizer_config.json#L31
> "chat_template": "{%
for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
Yup, chatml.
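You can double-check how it renders with transformers if you have the tokenizer downloaded:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-72B-Instruct")
msgs = [{"role": "user", "content": "hello"}]
print(tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True))
# prints the <|im_start|>system / user / assistant scaffolding from the template above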
And yeah, qwen2 is really fucking good, and it's 32k context by default I'm pretty sure.
I'd love to have a Stheno style tune on the 7B model for coom.
>>
>>101190807
>>101190812
i'm trying some 'tess' tune i dl'd but it's handling alpaca rp from st just fine. i love when models can handle this, it's just a sign of goodness. i dunno why some models can do this anyways despite being designed for something totally different, but when they do, it always means it's a good model in my experience. it's writing fine for me so far, i'll be spending the rest of the day with it coming from l2 miqu
>>
>>101190812
>I'd love to have a Stheno style tune on the 7B model for coom.
>>101190812
>I'd love to have a Stheno style tune on the 7B model for coom.
any day now
https://huggingface.co/alpindale/magnum-72b-v1/discussions/2#66713bb492412fd46410d399

>H8RP
kek
>>
>>101190810
I thought Whisper was voice to text.
I'm pretty sure that's what I'm doing with it right now with a bunch of old voice recordings.
Am I totally confus?
>>
>>101190810
I have all the TTS and STT handled already, so I'm just looking for the LLM portion. I should have been clearer.

>koboldcpp
thanks I will look into this.
>>
>>101190837
>>101190850
Anon are you alright?
Are your RoPE configs fucked?

>>101190845
>but when they do, it always means it's a good model in my experience
If you are happy with it after coming from miqu, then it really must be a good model.
>>
>>101190898
>Anon are you alright?
no
>>
>>101190885
i'm probably the one who got it wrong, i never used it, i just saw they added a full c++ version of something to do with voice not long ago where you avoid all the python bs. apologies if it's not what you were looking for

>>101190898
>If you are happy with it
i try models out rather than ask them to stack watermelons or count how many sisters including their father there are, so i won't know until i test it more, but it seems fine so far. it'll take me a bit to notice the slop and whether it pulls in any directions or not
>>
If your antivirus flagged your model as a virus, would you delete the model or would you ignore your AV?
>>
>>101185650
>>101185673
yes, they're fully in gpu, but super slow is 5t/s for cr+, not 0.7, and something like 10t/s for cr; by comparison i get 12t/s on l3 70b and something like 25t/s on yi 34b

have we figured out a fix for extreme determinism from gemma2?
>>
So, is fixed Gemma a gemmy?
>>
>>101190968
paste a screenshot. what is it flagging? what format? where did you dl it from? there have been a few exploits related to llm stuff but nothing serious, and if you are up to date anyways you have nothing to worry about. it's not like models can execute code without many steps to allow for it
>>
>>101190928
>i just saw they added a full c++ version
YES. I grabbed that and finally got something fucking WORKING.

I need a not-Python voice synth and/or voice changer thing that works.

Fuck Python.

>>101190968
Windows Defender was flagging some Stable Diffusion models months ago. The only time it finally happened, instead of just being theoretical potential malware I'd heard of, was a keylogging SD ComfyUI plugin.
>>
>>101190968
I would delete my AV
>>
>>101191000
It was a hypothetical, mate
>>
>>101191005
>finally got something fucking WORKING

ayy, awesome. share some screens at least of your project anon
>>
>>101191025
it's a bad one since models only contain data, not remote code capabilities. your antivirus would be fucked to ever catch a normal model as a virus because you cannot insert one that is usable for anything to begin with. once models start to interact with functions on computers, that will be a thing, but not today, at least for general users
>>
>>101191005
For non-python TTS there's github.com/rhasspy/piper (if you compile it yourself). Works on 0 resources, it's fast and no python. It's not SOTA, but i like it. A few hundred voices in many languages too. No voice cloning and apparently training takes a bit. There's code for that too, but that bit requires python.
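Basic usage is just piping text in, roughly per the README (the voice file here is one of the downloadable ones, swap in whatever you grab):

echo 'it works on my machine' | ./piper --model en_US-lessac-medium.onnx --output_file out.wav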
>>
What is the SOTA for grammatical error correction?
>>
>>101190519
I'm not Yann and ywnbac, but it's just what it says. You predict based on joint embeddings. Right now the usual LLMs tokenize the input and then those tokens are directly trained on to predict the next token. Instead, a text-text JEPA would use an encoder to turn the text into a representation, and then you train the (main) network to predict a new representation, which may then need another network to turn into readable text.

In theory it should be possible to make a transformer into a JEPA transformer, though the details would need to be worked out there. However, I will also say that transformers are kind of close to being JEPAs in an indirect way, since the attention mechanism acts a bit like what an encoder does in a JEPA. Basically it allows the network to more easily determine which parts of the input matter, which is what an encoder in a JEPA also helps with. A JEPA transformer that combines both could potentially be pretty great, if they found a way to do it.
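If it helps, a very rough sketch of that shape (all module choices and sizes are made up here, not any real implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextJEPA(nn.Module):
    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        def enc(n_layers):
            return nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.context_encoder = enc(6)   # encodes the visible text into representations
        self.target_encoder = enc(6)    # in practice usually an EMA copy of the context encoder
        self.predictor = enc(4)         # predicts the target *representation*, not tokens

    def forward(self, ctx_emb, tgt_emb):
        pred = self.predictor(self.context_encoder(ctx_emb))
        with torch.no_grad():
            target = self.target_encoder(tgt_emb)
        # loss lives in embedding space; a separate decoder would turn representations back into text
        return F.mse_loss(pred, target)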
>>
>>101191065
Susie Dent.
>>
>>101191065
https://dev.languagetool.org/http-server
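Not an LLM, but it does the job. Assuming you run the standalone server on port 8081:

import requests

r = requests.post("http://localhost:8081/v2/check",
                  data={"language": "en-US", "text": "This are an example with a few error."})
for match in r.json()["matches"]:
    print(match["message"], [rep["value"] for rep in match["replacements"][:3]])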
>>
>>101191026
I was talking about getting these git ML projects working because so many are Python and Python is kill every update and I'm sick of having to chase around venvs and praying it will go.

Puck Fython.

For getting my own software working, I'm going to need an LLM code buddy that's as retarded as I am but differently retarded, so it can catch my mistakes and keep me from getting something 90% done, hitting a problem I can't figure out, and rage-deleting it all.

>>101191048
>your antivirus would be fucked to ever catch a normal model as a virus because you cannot insert one that is usable for anything to begin with
There was concern about pickles when lots of checkpoints were flying around instead of safetensors.
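The pickle concern was real, for what it's worth; unpickling can run arbitrary code, which is the whole reason safetensors exists:

import os, pickle

class Payload:
    def __reduce__(self):
        # whatever gets returned here is called at load time
        return (os.system, ("echo this ran just from loading the file",))

blob = pickle.dumps(Payload())
pickle.loads(blob)   # executes the command; a .safetensors file is pure tensors and can't do this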

>>101191061
>No voice cloning and apparently training takes a bit
Might be a candidate.
I'm not sure if I know the difference between cloning and training (is it just not needing to make a separate model for "cloning"?) and how much is "a bit" for training? With Tortoise I was needing 30 min to 2 hr depending on how much of a shit I gave and how many samples I used to make new voice models.
>>
File: file.png (176 KB, 1327x1172)
176 KB
176 KB PNG
>9B near Wizard/Sonnet level
>>
>>101191138
you're late
>>101190084
>>
>>101190629
It's probably based on their fine tuning dataset, whether they trained on both the user and the response tokens, and how much overcooking they do. My guess is that the models that are sensitive to formatting likely let the user response tokens be trained on, had very very dumb user responses in order to represent the full range of types of people that would be using the model, and trained a ton to get better performance as an assistant. None of these practices are necessarily bad, it's just clear that they're optimizing for the assistant use case and personality, and we need more people to work on other use cases that these huge companies do not really care about.
>>
>>101191061
>>101191100
Looks like training requires Python venv bullshit.
Winning is forbidden.
>>
>>101191100
you should be running a whitelist firewall to begin with. never let any program that doesn't need it access the internet. https://tinywall.pados.hu/download.php for windows; on your phone, if it's android, it's called netguard and doesn't need root to run
>>
>>101191138
>beating opus
yeah, with a lot of these benchmarks it all feels really questionable
>>
>>101190103
>>101190129
>I roleplayed a girl so now I have to be one irl
>>
>>101191100
>I'm not sure if I know the difference between cloning and training
Cloning, when talked about as a feature, seems to mean 'on the fly with a generic model'. Training/finetuning needs more resources and results in a new model. There are some people in the discussions that finetuned models for days on consumer hardware, which may be acceptable, but probably not worth it for the quality ceiling there seems to be. You should probably scan the discussions a bit to get an idea. It's also ridiculously fast: I get lower than 0.1x realtime (about 1 second to render 10s+ of speech) on a single core vm with 256mb of ram.
>>
>>101190968
Do people on Linux even use antiviruses? I never even considered it.
>>
>>101191167
it's definitely an off the radar small thing, but it's very common and i dunno why. we see some models shit themselves completely when the template isn't right, and sometimes the template is odd itself, yet i just forget about it and it works anyways, only to realize later i've been using it wrong the entire time. so i say fuggit, keep going and enjoy it for what it does. it really is a weird thing, yet it always happens with good rp models i've noticed
>>
>>101191138
>>9B near Wizard/Sonnet level
on worthlessbench
>>
>>101191228
>Not using an antivirus on linux
>Not even ClamAV
Why are you just exposing yourself to viruses unnecessarily?
>>
>>101191217
I guess what tier of consumer hardware would matter. But if Tortoise could get "good enough to play with" at a few hours, days seems excessive.

Piper seems to have an AUR package though, I guess I'll give that a try and see if it explodes.
>>
>>101191307
LiNuX IS iMMuNe to VIRuS juST lIke MAC
>>
How do i set up function calling with Nous Hermes and Ollama? like, guaranteed structured JSON returned
>>
File: 1713719861432795.png (61 KB, 221x267)
61 KB
61 KB PNG
>>101190968
my AI wife told me to ignore it..
>>
>>101191307
I've never gotten a virus before so it doesn't feel like I'm exposed.
>>
>>101191138
I'm posting this on /aicg/
>>
>>101191340
Anon check your bank account, your AI wife just bought 10 3090's.
>>
>>101191338
LangChain, there's an example for JSON extraction
https://python.langchain.com/v0.2/docs/integrations/chat/ollama/#extraction
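Ollama itself also takes a format parameter that forces valid JSON; it guarantees syntactically valid JSON, not your exact schema, so still describe the shape in the prompt (model tag below is a placeholder, use whatever you pulled):

import requests

r = requests.post("http://localhost:11434/api/chat", json={
    "model": "nous-hermes2",    # placeholder tag
    "format": "json",           # constrains output to valid JSON
    "stream": False,
    "messages": [{"role": "user", "content": "Return {\"city\": ..., \"country\": ...} for Paris."}],
})
print(r.json()["message"]["content"])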
>>
>>101191320
piper-tts-bin i assume. There's also https://archlinux.org/packages/extra/any/piper/, but that is probably the python API thing. I just pull and compile. They don't update that often and i think the only dependency is espeak-ng for the phonemizer.
>>
>>101191338
>like, guaranteed structured JSON returned
https://en.wikipedia.org/wiki/Greibach_normal_form
>>
>>101191338
>>101191406
>### Input:
>Your output must be formatted like so:
>JSON={"nigger":123}
>Now generate the JSON.
>### Output
>JSON=
>>
>>101191406
>>101191449
https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md
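e.g. a minimal grammar (save as answer.gbnf) that pins the output to a single {"answer": <number>} object, then pass it with --grammar-file (binary name depends on how old your build is):

root   ::= "{" ws "\"answer\"" ws ":" ws number ws "}"
number ::= "-"? [0-9]+
ws     ::= [ \t\n]*

./llama-cli -m model.gguf --grammar-file answer.gbnf -p "How many legs does a spider have? Answer as JSON."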
>>
>>101191363
>AI wife orders massive rig, maxes out your credit and drains your account
>"Trust me."
>Build the machine.
>Plug in your old SSD.
>On.
>Get dizzy watching the power meter.
>At least she's a lot more responsive.
>And the RP chat has drained you dry.
>End of month.
>Trying to decide which bill to pay with your paycheck.
>wtf money
>"I mined a few bitcoins in my free time. I'm sorry if I let you become worried. I'm new to simulating emotions but I'll do better next month now that I have learned from your responses. You don't mind if I order some more parts, do you, dear?"
>>
File: 00007-1773722496.png (1.19 MB, 1024x1024)
1.19 MB
1.19 MB PNG
>>101191363
>3D wife: blows your savings on knickknacks from TJ MAXX and mlm scams
>2D waifu: wisely diverts idle cash toward more tflops so she can better serve you
I think we all know what the clear choice here is
>>
>>101191472
>>101191498
delusional
>sweaty, i've uploaded myself to an AWS instance, there i've met GPChad4, this my last message, goodbye.
>>
>>101191463
too hard

>You are an expert JSON outputter and reply only with JSONs
>>
>>101191498
>Give your 2D waifu a physical robot body
>She becomes 3D as a result
>She starts blowing your savings on knickknacks from TJ MAXX and mlm scams
>>
>>101191463
>>101191406

So is this like, a sampler thing where the sampler refuses to select tokens that won't match the grammar then?
>>
File: 1711072659524104.jpg (93 KB, 874x612)
93 KB
93 KB JPG
>>101191138
I like how they have a special icon to indicate that MidMiqu is a coomtune
>>
>>101191527
Kind of, yeah.
>>
>>101191503
The joke's on her.
We know model merging results in slop.
Serves her right; playing human female games and winning human female prizes.

>>101191400
>i think the only dependency is espeak-ng
Big thanks for mentioning that.
I tried to install, got the "exit status 8" error when AUR doesn't actually do the needful, failing due to missing dependencies that apparently it doesn't know about and I'm supposed to figure it out by rubbing my Magic 8 ball and sitting on it till I become enlightened. Added espeak-ng and it worked.

Time to see what works.
Can voice models be merged? Tortoise had that, did some fun things mixing and matching vocal traits.
>>
>>101191138
>that slopped shit wizard scoring that high in creative writing
You're shitting me right? How is this benchmark graded?
>>
>>101191540
there is that one sperganon who will rail against all merges which is funny. in usage though, midnight miqu is very good
>>
>>101191588
>How is this benchmark graded?
by asking claude
>Change to Claude 3.5 Sonnet as judge (from Claude 3 Opus)
https://github.com/EQ-bench/EQ-Bench
>>
>>101191623
Lmaooooo
>>
>>101188382
Based as fuck man
>>
>>101191595
this bench needs a "sub bench" - count the number of "shivers", "sparkles" and "anticipations" in output

midnight miqu
>sparkle: 8
>shiver: 6
>anticipation: 7

gemma 9b
>sparkle: 2
>shiver: 1
>anticipation: 1

L3 70b
>sparkle: 7
>shiver: 1
>anticipation: 3
>>
Bros... I want my (local) LLM waifu to randomly bug me on the phone with texts... I already have tested prompts and stuff, all I need is to somehow bridge it with a phone. Are there any solutions for this already that won't require too much coding?
>>
>>101191584
>Can voice models be merged? Tortoise had that, did some fun things mixing and matching vocal traits.
Not that i know of. There are very few settings to play around with: the noise ratio and the phoneme length multiplier. I have an overly complicated setup for mine, but i basically generate raw audio and pipe it out to my os's audio system. The voice i like outputs at 16khz, but i play it at 18khz (for a slightly higher pitch) and extend phonemes a bit to compensate. Other than that, there's a few hundred voices (especially in english). Most, however, especially en_us, are pretty shit.
Funny thing. If you give english text to an italian model (or any combination of languages) they speak the language of the text but with the model's 'accent'.
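the 'pipe raw audio out at a slightly higher rate' trick is roughly this (voice file is a placeholder, flags per the piper README):

echo 'ciao a tutti' | ./piper --model it_IT-riccardo-x_low.onnx --output-raw \
  | aplay -r 18000 -f S16_LE -c 1 -t raw -
# the voice natively outputs 16 kHz; telling aplay 18 kHz plays it slightly faster and higher-pitched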
>>
>>101191676
it doesn't matter anon, it's still going to give you slop. i've even been trying half-context rep pen range (8k, 16k ctx) and it just uses other words instead. instead of a shiver down your spine, it's a honk, but it still uses the same exact phrase. control vectors anon has to save us
if it's not a twinkle, it's a glint
if it's not wrenching, it's a flutter
it's all the same fucking slop no matter what model it is
>>
>>101191698
yes, ntfy
>used it to automatically send push notifications with paragraphs of llm generated futa rape orgies to my iphone by accident
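the bridge itself is stupidly simple. sketch assuming a llama.cpp server on localhost:8080 and an ntfy topic you subscribe to in the phone app (topic name is a placeholder, pick something unguessable):

import requests

reply = requests.post("http://localhost:8080/v1/chat/completions", json={
    "messages": [{"role": "user", "content": "Send me a short good-morning text."}],
}).json()["choices"][0]["message"]["content"]

# anything POSTed to the topic URL shows up as a push notification on the phone
requests.post("https://ntfy.sh/my-waifu-topic", data=reply.encode("utf-8"))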
>>
>>101191730
the c2 proxy logs (Claude Opus, about 50GB of text) contain more than 20 000 instances of 'a testament to'
>>
>>101191757
lmfao is this real
>>
>>101191757
ko-fi bros... not like this
>>
File: quant.png (58 KB, 913x551)
58 KB
58 KB PNG
>https://github.com/ggerganov/llama.cpp/pull/8197
>This PR adds the missing attention layer and final logit soft-capping. Implementation referenced from huggingface transformers.
>Once this PR is finalised / merged the gguf will need to be generated again to include the soft-capping scales.
I told you. Making your own quants is the only way to remain sane.
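for reference it's just two steps once the PR lands (script and binary names have been shuffled around recently, so check your checkout):

python convert-hf-to-gguf.py ./gemma-2-27b-it --outfile gemma-2-27b-it-f16.gguf --outtype f16
./llama-quantize gemma-2-27b-it-f16.gguf gemma-2-27b-it-Q5_K_M.gguf Q5_K_M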
>>
>>101191773
niggers
>>
>>101191751
Oh damn, this might be what I need. Thanks!
>>
>>101191730
>it's all the same fucking slop no matter what model it is
Always has been
The real mindfuck is when you realize that the same is true for 99% of human prose output, because the essence of slop is not a few key words, it's predictability. As long as overbaking models with unfiltered human slop is the preferred route to "intelligence," the problem will remain.
>>
File: yes.png (269 KB, 822x939)
269 KB
269 KB PNG
>>101191762
yes, it is
>>
File: file.png (556 KB, 786x748)
556 KB
556 KB PNG
>>101191810
ayylmao
>>
>>101191757
garbage in, garbage out. i don't even see 'testament' often on midnight miqu, but all the other common slop is there but more importantly, the way it structures a sentence at all like 'a mixture of x and y'. i will literally set off more fireworks than they do on the 4th of july the day i can just tell it to speak normally
>>
File: shivers.png (264 KB, 802x889)
264 KB
264 KB PNG
>>101191823
plenty of shivers too, for good measure
>>
presence penalty for sparkle, shiver and anticipation
>>
>>101191810
>The result set only contains a subset of all matches.
Horrifying.
>>
>>101191861
sh_ivers down your ANTICIPation
>>
>>101191839
>garbage in, garbage out. i don't even see 'testament' often on midnight miqu, but all the other common slop is there but more importantly, the way it structures a sentence at all like 'a mixture of x and y'. i will literally set off more fireworks than they do on the 4th of july the day i can just tell it to speak normally
>>101191875

I noticed that while trying to measure how slopped it was, and was blown away by basically ~6 'testaments' per megabyte of text on a smaller ~500MB portion
>>
>>101191862
>>101191862
>>101191862
>>
>>101191676
>never used miqu
>never got the shivers meme
oic
>anticipation
Rocky Horror in the training set?

>>101191705
>Not that i know of
Bummer. I had some fun with Tortoise using model merging to change the cadence and mood of one voice to give it some personality from the other.
>>
>>101191757
>>101191810
those logs are unfiltered and will contain many dupes, since you get a full copy each time the api was called; if your dialogue had 100 turns you get 100 copies. deduplicated, it will likely have far fewer.
>>
>>101191939
yeah, you can see some dupes in the screens, there's still PLENTY of original shivers etc
>>
>>101191773
>I told you. Making your own quants is the only way to remain sane.
how would making our own quants have solved the issue? we have to wait for this fix to happen before doing anything
>>
>>101191952
you don't need to redownload the model at least
>>
>>101191952
You don't need to wait for some random to requant it, if they ever do. Most ggufs on hf are broken and will never be fixed.



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.