/g/ - Technology

File: 1692379851934720.jpg (647 KB, 1856x2464)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101250468 & >>101243128

►News
>(07/02) Japanese LLaMA-based model pre-trained on 2T tokens: https://hf.co/cyberagent/calm3-22b-chat
>(06/28) Inference support for Gemma 2 merged: https://github.com/ggerganov/llama.cpp/pull/8156
>(06/27) Meta announces LLM Compiler, based on Code Llama, for code optimization and disassembly: https://go.fb.me/tdd3dw
>(06/27) Gemma 2 released: https://hf.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315
>(06/25) Cambrian-1: Collection of vision-centric multimodal LLMs: https://cambrian-mllm.github.io

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
File: threadrecap.png (1.48 MB, 1536x1536)
►Recent Highlights from the Previous Thread: >>101250468

--Troubleshooting Custom Limarp Adapter for Wizard Mixtral: >>101255564 >>101255603 >>101255680 >>101255900 >>101255999 >>101255658 >>101255670 >>101255749
--Pluggable RAM Sticks for GPUs: A Potential AI Powerhouse: >>101250618 >>101250663
--LLaMA.cpp Flash Attention 2 on Gemma 2 and Cache Quantization Possibilities: >>101256405 >>101256491 >>101256539
--Gemma2 Training Support Added to QLora-Pipe, Shows Promise Over LLaMA 3 8B: >>101251790 >>101252075 >>101252240 >>101252250 >>101252824
--Gemma2 21b: A Promising Alternative to LLaMA: >>101254733 >>101254757 >>101254781 >>101254860 >>101254922
--Gemma Responds Well to Specific Instructions, Like Mixtral: >>101252897 >>101252919 >>101252947 >>101252957 >>101253002
--BigTech's Hyperparameter Tuning Secrets for LLMs: >>101250835 >>101250871 >>101250881
--Anon's Japanese Travel Phrases Get Judged by Gemma 2 9b: >>101256710
--Twin Peaks AI Model Impresses with Fandom Knowledge and Sentiment Analysis: >>101254235 >>101254247 >>101254277 >>101254326
--Time-Based RNG Issues in LLMs Need a Solution: >>101254647 >>101254707 >>101254722
--Phi3-Mini Update Surpasses LLaMA 3 8B Performance: >>101255873 >>101255918
--New cpumaxxfag Server Specs Discussion: >>101250702 >>101252875 >>101252929 >>101252938
--Microsoft's MInference for Faster Long-Context LLM Inference: >>101254324
--LLaMA CPP Python Version Bump and Gemma Model Compatibility: >>101254881
--Gemma 21b Passes the Watermelon Test: >>101256472 >>101256514 >>101256945
--CXL-GPU Image: Nothingburger Without Industry Support: >>101254227 >>101254253
--Pytorch 2.2.2 Downgrade Ruins VRAM Allocation: >>101255284 >>101255775
--InternLM2.5 Collection Released on Hugging Face: >>101255862 >>101255885 >>101257520 >>101257548
--Fixing Gemma with <bos> Token: >>101253884 >>101254036 >>101254050 >>101254423
--Miku (free space): >>101250518

►Recent Highlight Posts from the Previous Thread: >>101250472
>>
>>101258584
>Gemma 21B
It's 27B
>>
>>101258641
>Gemma
It's Kumma
>>
>>101258641
Our Gemma is lighter, having hollow bones for more efficient flight.
>>
magnum is dumb as fuck
>>
>>101258689
It's hard to move forward from Mythomax
>>
this is it
this is the month of llama 3.5 turbo
>>
Gemma-2 (even 9B) is Claude Sonnet.
>>
>>101258792
Then Gemini must be God
>>
File: BlushingMiku.png (1.82 MB, 1024x1024)
Good morning lmg!
>>
>>101258845
If miku is a robot, perhaps it could have weaponized T-doll modules installed in conjunction with civilian behaviors...
https://www.youtube.com/watch?v=ioRhanH4xmE
>>
Here's a bit of morning hopium:
I've been running llama-bench and logging the results for a few months now looking for regressions in specific patches to report back to the devs, but it's been a pretty steady march forward.
These are the unfiltered t/s results for the leaked miqu 70b q5
>>
>>101258845
Good morning Miku
>>
File: 1700027607826483.jpg (243 KB, 1781x1635)
oops! time to uninstall firefox! you can thank meta for this btw
>>
>>101258990
What kinds of prompt processing speeds do you get with RAM?
>>
>>101259034
>thank meta for firefox owners being massive faggots
I guess we can thank meta when they bundled pocket and other shitware into their shitty browser too?
>>
>>101259077
>moving goalposts
your favorite jewcorp - meta is building real time censorship tools, and you indirectly supporting this by providing negative data for them to filter & classify, don't come and cry here later when you get banned for having wrong opinions on *random topic* because meta's llm hallucinated something.
>>
>>101259034
what's the point? twitter is owned by Elon now, people can say whatever they want on that platform, and no libtard tears are gonna change that
>>
Do any of you fine fellows have any experience with the Orin AGX? A business near me is liquidating a bunch of them for $600/piece.

Their GPUs are pretty shit (worse than a 3060) but they've got 64gb of memory and some special accelerators apparently, so I was thinking about picking one up
>>
>>101259034
>anon has a knee-jerk reaction to the word "hate speech" without actually understanding anything about the situation
classic
>>
What the fucks is this dude talking about kek.
>>
>>101259163
You'd have to rely on this guy's stuff to use the accelerators right
>https://github.com/dusty-nv
>>
>>101259163
Do they have an online shop for that?
4 tokens/second on 70B llama, it's not very good. The software is also a pain in the dick to work with if you're doing something that's even remotely memory-constrained. The shared memory is retarded, it's way easier to deal with dedicated VRAM.
Software support is also dicks, enjoy trying to compile shitty python packages for cuda that don't have ARM support.
>>
>>101259063
>CPU prompt processing
They're abysmal, as expected (pp512 is about 22 t/s max for llama-bench of miqu 70b q5...about 10% of what I can do with CUDA). I use a GPU purely for context.
The RAM speed isn't the bottleneck with pp, it's a compute capability mismatch of CPU vs GPU.
>>
File: miku-tet-duo.png (3.13 MB, 1992x1328)
>L3 tunes maturing on the upper and lower end of beak spectrum
>Gemma2 quirks slowly getting worked out in the middle-ground
>Beeg 400B L3 imminent (and multimodal?)
Local is about to finally start eating good on every level
>>
>>101258576
https://x.com/BoyuanChen0/status/1808538170067407264

>Introducing Diffusion Forcing, which unifies next-token prediction (eg LLMs) and full-seq. diffusion (eg SORA)! It offers improved performance & new sampling strategies in vision and robotics, such as stable, infinite video generation, better diffusion planning, and more!

I smell a new potential
>>
>>101259283
we've been eating good for a long time! :)
>>
>>101259322
i smell a nothingburger
>>
>>101259234
If it's that much of a pain in the ass, I'll skip it. Thanks for answering my question, though.

> Do they have an online shop for that?
Not as far as I know, but I can ask
>>
>>101259361
i smell a sitcom
>>
Recently got a 4090. What should I run other than 3.5 exl2 Command R?
>>
>>101258990
Based regression inspector.
>>
>>101259499
Gemma 2 27B with context extended to 16K at the highest quant you can fit, then report back how it compares to CR.
>>
>>101259578
How do you extend the context?
>>
>>101259620
RoPE/YaRN.
If you are using exl2 you want to use the alpha calculator in the OP.
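For the curious, the NTK-aware alpha trick that calculator is built around boils down to scaling the RoPE base. A rough Python sketch of the relationship (head_dim=128 and the 10000 base are typical defaults, not something specific to your model; the Desmos calculator in the OP gives you a properly tuned alpha):

import math

def rope_base_from_alpha(alpha, base=10000.0, head_dim=128):
    # NTK-aware scaling: new_base = base * alpha^(d / (d - 2))
    # alpha is roughly the context extension factor you're aiming for
    return base * alpha ** (head_dim / (head_dim - 2))

print(rope_base_from_alpha(2.0))  # ~20200 for a ~2x stretch with the default base

As far as I know this is what exl2's alpha_value does under the hood; in practice you want a bit more alpha than the naive ratio, which is what the calculator accounts for.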
>>
File: 1719351514748678.jpg (674 KB, 2048x2048)
>>101258689
Technical issue (lol)
>>
Does Gemma 27B have a capped temperature? It doesn't go off the rails even at 5.0. It feels like it writes differently than at 1.0 (where it still basically never goes schizo), but that might just be placebo because I would expect that. But it remains entirely coherent.
>>
>>101259737
I have wondered this about other models that seem to remain coherent at higher temp samplers as well. What is it that causes some models to be more tolerant of a wider range of temps while some others seem to have a much narrower working range.
>>
>>101259782
in my experience this mostly means the model is really overbaked
>>
>>101259192
Because its just enough to know the real use-case for this shit.
You can keep sucking on that corporate AI-jew cock though, i am sure it will never ever backfire!
>>
File: file.png (15 KB, 514x112)
LONG CONTEXT ALERT

redditors say it works well past 200k
>>
>>101259322
Gradually i began to hate them ai-jeet hypefaggots, a bunch of buzzwords and promises is just enough to catch average lmgtard's attention!
>>
>>101259737
>>101259782
Isn't it just a sign of more tokens being trained?
>>
>>101259817
Yeah, but it's dumb. Failed music theory, failed tricky programming question even after being told what the trick was.

Thousands of 7B tokens are still 7B tokens, alas.
>>
>>101259782
It could be >>101259805, or maybe, it could be related to the logit capping feature that the model needs to work correctly.
>https://github.com/ggerganov/llama.cpp/pull/8197
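For context, the capping in that PR is (as I understand it) a tanh soft-cap applied to the logits inside the model, so no logit can exceed a fixed magnitude. A toy sketch; the 30.0 is the value I believe Gemma 2 uses for its final logits, treat it as illustrative:

import math

def soft_cap(logits, cap=30.0):
    # squash every logit into (-cap, cap) before softmax
    return [cap * math.tanh(x / cap) for x in logits]

print(soft_cap([55.0, 3.0, -2.0]))  # the 55 gets pulled down to ~28.5

Whether that bounded range is what keeps the model coherent at high temperature is exactly the speculation above, not something I can confirm.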
>>
>>101259826
you're no better for seeing a bunch of technical terms you don't really understand and instinctually having the same reaction but in the other direction
>>
>>101259737
I'm an idiot, with min_p 0.0 it does go schizo at higher temperatures. But it feels like even tiny min_p values are much better at keeping Gemma sane than with other models.
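For anyone unfamiliar, min_p is simple enough that a sketch explains the behavior; this assumes the usual definition (cut everything far below the top token, then renormalize), which I believe is what llama.cpp and friends implement:

def min_p_filter(probs, min_p=0.05):
    # keep only tokens whose probability is at least min_p * (top token's probability)
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

print(min_p_filter([0.50, 0.30, 0.15, 0.04, 0.01], min_p=0.1))
# threshold is 0.05 here, so the junk tail is gone before sampling ever happens

Since the tail is culled relative to the top token, cranking temperature mostly just reshuffles the survivors instead of letting garbage through, which would explain why even tiny values keep things sane.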
>>
>>101259322
so it's a new diffusion architecture? Didn't know you could use the llm technique (next token prediction) to make pictures though
>>
Bros I like Mikupad. It and ST are all I need.
>>
>>101259924
Yeah, I was thinking more along the lines of it being used for audio generation, with more accurate guidance for generating consistent quality.
>>
How does gemma2 27b cope with context extension? I don't even want to bother if it goes schizo past 8k
>>
>>101259932
Relatable. I only use ST and Mikupad too, but I haven't used ST for a long time now because I got bored of LLM slop, so I use Mikupad for random experiments.
>>
>>101258576
Alright doc, give it to me straight.
What's the best model I can run right now that fits on 24GB
>>
>>101259880
Based minP 0.01-0.05 gang
>>
>>101259782
it depends on the tail of the token distribution
when you use cross-entropy loss, it forces the model to have overconfidence on the next token at the expense of the other tokens in the distribution
when trained for a very long time, the variance of that tail can get quite high, and temperature scaling will amount to randomly sampling from the rest of the vocabulary
it's impossible to tell what they did during training to improve on that, you could do many things, even simple things like label smoothing might help, or they could have introduced an additional loss term that they didn't disclose in the paper
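To make the label smoothing point concrete, a tiny PyTorch sketch with toy numbers (nothing from a real model):

import torch
import torch.nn.functional as F

logits = torch.tensor([[4.0, 1.0, 0.5, 0.2]])  # toy next-token logits over a 4-token vocab
target = torch.tensor([0])                      # index of the "correct" next token

# plain cross-entropy: the loss keeps shrinking as p(correct) -> 1.0,
# i.e. the optimum is a one-hot distribution and the tail gets crushed
print(F.cross_entropy(logits, target))

# label smoothing spreads a little target mass over the rest of the vocab,
# so the optimum is no longer one-hot
print(F.cross_entropy(logits, target, label_smoothing=0.1))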
>>
>>101260199
>when you use cross-entropy loss, it forces the model to have overconfidence on the next token at the expense of the other tokens in the distribution
Not sure what you mean by that. The next token is the only thing any of these models predict, and cross entropy loss doesn't encourage "overconfidence", just "correct confidence". If you're overconfident then every missed prediction will be very expensive in terms of cross entropy. For a model to become deterministic, its training data must be predictable. That's not the fault of cross entropy but of overfitting and possibly reusing data for too many epochs.
>>
>>101259880
>>101260195
MinP+Temp basically ended the retarded sampler debate once and for all.
>>
File: hff5).png (157 KB, 630x753)
>>101259991
Mamba+forcing=back
>>
>>101260241
>The next token is the only thing any of these models predict, and cross entropy loss doesn't encourage "overconfidence", just "correct confidence"
what exactly do you think cross-entropy is calculating and how do you think it is calculated?
in order for the model to achieve the lowest loss, it needs to have maximum confidence in the next token, that necessitates that it has low confidence in the rest of the vocabulary, as the optimal probability is 1.0 on the next token
in fact that is the very reason label smoothing exists in the first place
>>
>>101260307
>forced
The more intellectual term would be contrived.
>>
>>101260271
kanyemonk won
>>
File: 1695196891240549.png (381 KB, 770x658)
new nothingburger dropped!
>>
>>101260535
https://x.com/AlphaSignalAI/status/1808534830683979792

Actual demo
>>
>>101260546
pretty sick, i wonder why it speaks in 3-4 word chunks
>>
>>101260546
artificial biden kek
>>
>>101260307
>5%
I mean, it's not nothing. But slicing the Y axis like that is kind of a dick move.
>>
best llm for tea recommendations?
>>
>>101260535
>>101260546
https://moshi.chat/?queue_id=talktomoshi
>>
>>101260571
Upstage-70b-Instruct
>>
>>101260564
PPL doesn't make sense to compare linearly. A difference of 0.1 PPL might be the difference between highschool level intelligence and PhD level. (Or to be fair, it might also be a nothingburger)
Loss is similar, if that's what they're showing.
>>
>>101260546
nowhere near gpt-4o lol
https://www.youtube.com/watch?v=vgYi3Wr7v_g
>>
>>101260564
unless it's a flop-adjusted graph, it's meaningless
>>
>>101260601
>>101260621
So it's more meaningless than I thought? Just "this line lower" and that's about it?
>>
4 new commits!
AARGHH
I MUST POOOOL
>>
Any good L3 tunes yet?
>>
>>101260651
idk, but posting the loss graph by itself means nothing
it could easily be the case that if they adjust the transformer to match the training flops of their mamba model, the transformer wins in the end
there's no way to know since all they posted was the loss
>>
>>101260651
Pretty much. It means "it works better," and that's it. Accuracy on a task would give a real performance indication.
>>
Can I still into this with 16gb vram nowadays? If so what models should I even be looking at?
>>
lmao
https://x.com/ac_crypto/status/1807882764261417000
>>
>>101260721
You won't get good speed, but you can run a model up to about 90% of your system RAM.

I'm 12 GB V, 64 GB system and I can reach up to 58 GB models as long as I don't let anything else hog too much. About 1 t/s. It's sufficient for playing around.
>>
>>101260752
>you don't need more than 1 t/s
>>
>3090 is $1500
nani
>>
>>101260752
Wouldn't this take like 10 minutes per response?
>>
>>101260322
>needs to have maximum confidence in the next token
Only if it knows with certainty what the next token is. Otherwise the expected value of a max confidence guess will be worse, because it can be wrong.
In a sense you're correct, but only when overfitting. If the training data is truly novel on each batch, there is no issue. I realize that probably is never true in reality due to running multiple epochs and unknown data duplication, but those are the actual problem, not cross entropy loss.
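Quick arithmetic to back that up: when the data is genuinely uncertain, the expected cross-entropy is minimized by matching the true distribution, not by going one-hot (toy numbers):

import math

p = [0.6, 0.3, 0.1]  # the distribution the next token actually follows in the data

def expected_ce(q):
    # average cross-entropy of predicting q when tokens are drawn from p
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

print(expected_ce([0.6, 0.3, 0.1]))     # calibrated: ~0.90 nats
print(expected_ce([0.98, 0.01, 0.01]))  # overconfident: ~1.85 nats, worse on average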
>>
>>101260721
Anything that fits at least 80% in our VRAM. You can offload the rest to your RAM.
>>
>>101260786
Depends on the mood of the session. Right now I've got two tabs open, same model, same scenario, but I'm testing slightly different author's notes that I threw in along the way. One side, every turn is one quality paragraph. Most recent turnaround on that was 160 seconds. The other tab it's in a six paragraph mood, being a lot more detailed about the action. 606 seconds last turn.

I don't know if this is just LLM randomness or if the A/N truly affects how it interprets the instruction I've given it, but time wise, it's not much different from the days of AOL/AIM when you would say something and wait a bit and then get the response. I know all y'all zoomers start to shiver barely above a whisper if feedback isn't instant, but if you have only one video card, that's what you've signed up for.
>>
>>101260586
it's shit

exactly the same issue as with glados or gpt4o. Using fucking "sleep(2000)" after last user input to determine whether the model can start talking is retarded and can only impress brainlets on videos. Until the model can "infer" that i'm done talking, this concept will remain nothing more than a funny tech demo.
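The naive version being described is basically this; is_speaking stands in for whatever voice activity detector the demo uses, purely illustrative:

import time

def wait_for_end_of_turn(is_speaking, silence_ms=2000, poll_ms=50):
    # fixed-silence endpointing: declare the user "done" once they've been
    # quiet for silence_ms straight, regardless of what they were saying
    silent_for = 0
    while silent_for < silence_ms:
        time.sleep(poll_ms / 1000)
        silent_for = 0 if is_speaking() else silent_for + poll_ms
    # ...only now is the model allowed to start talking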
>>
>>101260805
i have no idea what you are saying
if you use cross entropy loss for next-token prediction, you are optimizing for maximum probability on the correct next token - that is how CEL works, the dynamics do not change regardless of what your data distribution looks like, so I have no idea what you mean by "Only if it knows with certainty what the next token is" - that is what you are optimizing for, and you'll know if the model learns how to do it by checking the loss
>If the training data is truly novel on each batch, there is no issue
what does this even mean? you do not need to train a model on infinitely new data to prevent overfitting, in fact in many cases multiple epochs on the same data is the best way to achieve generalization simply by training the model for long enough
> but those are the actual problem, not cross entropy loss
neither of those are problems, they are desired behaviors
it is expected that a model trained on CEL will assign maximum probability to the correct next token and 0 probability to all other tokens - if it truly understands the data
all of this just makes me think you don't fundamentally understand what's happening under the hood and you're extrapolating based on your intuition
>>
I get this when using latest ooba:

---------------------------
python.exe - Entry Point Not Found
---------------------------
The procedure entry point ggml_backend_cuda_log_set_callback could not be located in the dynamic link library B:\src\text-generation-webui\installer_files\env\Lib\site-packages\llama_cpp_cuda\lib\llama.dll.
---------------------------
OK
---------------------------

Updated to latest to try gemma, and when loading the model I get this error as a message box, and the model loads on CPU only. Nothing unusual in the console. Anyone?
>>
>>101259322
AGIbros we are so back!
I always saw diffusion methods as a means of making the neural net "think" more and arrive at a more plausible output
the ability to model uncertainty at a token level through noise seems powerful - you can make tokens less noisy near tokens without noise and more noisy far from them - you don't need to arrive at the solution immediately and it's possible to go beyond the token lengths it saw during training without the model shitting itself
the video and robot planning results are nice
>>
>>101260874
>Until the model can "infer" that i'm done talking, this concept will remain nothing more than a funny tech demo
this. model should be big enough to understand it, and obviously this is impossible locally.
>>
>>101260907
If you can predict the next token with absolute certainty then p=1.0 is not overconfidence but the correct level of confidence. In reality of course this won't happen but you can get close sometimes e.g. for tokens which are parts of words. Otherwise I don't know what you're on about here, you seem to keep assuming a perfect model which doesn't exist. There's nothing I can say to explain that I didn't already say. Unless a model is trained on the same token sequences repeatedly it should not converge to 1.0 probability.

>all of this just makes me think you don't fundamentally understand what's happening under the hood and you're extrapolating based on your intuition
I'm getting the same feeling here kek maybe we are just bad at talking
>>
>>101260874
well how do you think people infer that their interlocutor is finished? exactly the same way, they wait for a moment of silence that's longer than the ones between words or sentences that it's heard so far. the content of what's being said can be used as a clue but i think you're overestimating the weight of those clues. it might be a naive sleep as you put it, but a conventional algorithm could easily find the break more cleverly without saddling the LLM with this problem.
>>
>>101260307
I wonder how many B of transformers would be the equivalent to Mamba + forcing? looks like they managed to make their models even more efficient than transformers
>>
>>101260958
>still using python for general inference when there are plenty of native options
>>
>>101260307
>SSMs are good for anything but text
lol
>>
>>101260989
>Unless a model is trained on the same token sequences repeatedly it should not converge to 1.0 probability
you don't seem to understand how neural networks work, but we're in /lmg/ so that's fine
>>
>>101261096
This has almost nothing to do with neural networks, it's common sense. Suppose you have to guess the next word after
>you don't seem to understand how neural networks work, but
There are plenty of ways that sentence could continue. You won't be able to guess with 100% accuracy unless you saw it before. This is true for even the smartest models unless you suppose they have literal godlike intelligence.
>>
>>101261006
you can't solve it with just code

humans don't just start talking when one person goes silent for 2000ms, humans understand when a person is not done yet, or that what they said isn't a full "prompt" yet. If another person goes silent for a few seconds during a dialog with me, i might just nod, or say "uhu", or just wait more. Imagine you are explaining some complex problem to it with a voice. You need to ensure you never make a pause longer than 2000ms or whatever the debounce time is hardcoded to, or you risk triggering the model too early. It feels awful.
>>
File: q6-tea.jpg (255 KB, 1259x1178)
>>101260571
Hi fellow tea chad, this test is just for you.
>>
>>101261192
>no puerh
trash
>>
There's a pull request for llama.cpp that I'm really excited about. Will it help if I message one of the devs directly and yell at him to work harder?
>>
>>101261146
you have 0 clue as to what you are talking about
to you, it must sound absolutely mystical that it is even possible for a neural network to generalize to an entire distribution after seeing only a small sample set
when your intuition tells you that it's just a learned hashmap, many obvious aspects of machine learning probably seem like magic to you
>>
>>101261192
AI dude talking about tea like an audiophile kek.
>>
>>101261146
>This has almost nothing to do with neural networks, it's common sense.
>You won't be able to guess with 100% accuracy unless you saw it before.
and on that note, if you used common sense, you'd be able to see that your example holds no water in the context of math
but by all means, keep pretending like you know what you're talking about
>>
>>101261243
Embedding (latent?) spaces are kind of magical, what with the meaning of proximity, direction, etc.
It's one of the coolest things in computing.
>>
>>101261260
real people do this too, people are very serious about their tea
>>
i wonder if it'll ever be possible to rip out the google assistant from android and swap it with some llm
>>
>>101261210
Oh we had that at the place I worked a while ago. It was nice. I enjoyed chrys as well.
>>
Is there a rentry for gemma sillytavern instruct settings?
>>
File: q6-puer.jpg (119 KB, 1260x422)
>>101261210
To be frank it's just not top 10 worthy
>>
File: firefox_mmcILfoVdF.png (135 KB, 681x303)
Gemma-2-27b-it often fails asterisks. Does this happen to anyone else? Using gemma-2-27b-it-Q4_K_M.gguf by Bartowski from 1 day ago, and llamacpp http server.
>>
>>101261243
t. soifaced at word2vec and thinks latent spaces are magic
yes anon, I'm sure they stumbled on the deep structure of the universe, that's why their logits are carved so deep, it's not just that they got lazy and ran too many epochs over the same data
>>
>>101261192
fun list, it at least mentions some more variety than a couple models I tested. which model is this?
just had some tie guan yin btw it was good
>>101261389
>Allow me, with all due deference, to present my case and beg for clemency from the Communist Party of China.
kek
decent enough reasoning if you ignore that similar stuff applies to many of the teas it did mention
>>
>>101261441
happens with Q5_M too, not often though
>>
>>101261451
not sure how you read me trashing that guy for probably thinking latent spaces are magic to meaning that I think latent spaces are magic
but you're an idiot, so anything's possible
>>
>>101261441
Are you using rep pen?
>>
File: firefox_CAVXQm16tf.png (362 KB, 736x929)
Okay, well, either it's broken, or retarded.
>>
>>101261441
>"personal space"
if you needed any more proof the RP training data is by w*men
>>
File: firefox_AiyjA6NuVp.png (128 KB, 373x736)
>>101261549
No, just Universal-light.
>>
>>101261484
Llama-3-TenyxChat-DaybreakStorywriter-70B. It's surprisingly solid for non-RP uses like this.
Made a cute mechanic waifu card and I've been just asking shit about different jalopies, audiophile card would probably be fun too actually
>>
>>101261569
>she tells me she's pregant and has a nervous breakdown
>>
>>101261441
Depends on your card too.
If the card mixes asterisks and non-asterisks for actions the model will mirror it
>>
>>101261597
is it white?
>>
>>101261607
I edited it myself, all conversations are asterisked properly. The card description itself obviously has none, but that works fine with other models...
>>
>>101261441
It seems biased toward producing novel/book-style RP, in my tests (q6k). Even if it starts with asterisks and no quote marks, eventually it will begin using quote marks and narration without asterisks.
>>
i haven't followed the whole thing for a while.
gemma27b (works now?) for 24gb vram, which quant, which (simple) ui?
it's all so chaotic.
>>
Does gemma 27B work with 12k ctx already?
>>
Running the same thing on corpo servers (I posted the whole system prompt as part of a normal message), here's the outcome:

- it also loses asterisks
- it isn't retarded

So something still seems to be broken in the local implementation.

This could be a prompt template issue...

>>101261655
I use Q4_K_M, and SillyTavern as client.
>>
File: chrome_LRMM9rivUL.png (106 KB, 725x1039)
>>101261714
Forgot the screenshot.
>>
>>101261638
I noticed this as well.
>>
>>101261731
>edited
>>
>>101261777
I edited the history to be the copy of the chat I had in sillytavern. The last message is where the clear retardation appeared in local, so that's what I was trying to reproduce.
>>
funny how large parts of the local llm scene are carried by a small bunch of guys who just two years ago were sitting in a discord talking about hentais and pony porn
>>
>>101261858
literally einsteins at the patent office
>>
>>101261858
I think many of them are still doing that now
>>
i hate discordfags so much it's unreal
>>
>>101261858
That's just open source in general.
>>
>>101261858
How is that surprising? Academia has become about risk-aversion. So don't expect anything but "NEW DPOPPOOPPPOOPIE FINE TUNING METHOD BEATS GPT-4 ON THIS ONE CHERRY PICKED BENCHMARK. No we haven't actually tried doing anything abstract with it, CHUD"
>>
>>101261898
don't know, some of the hentai discord guys from back then are now releasing basemodels or doing pioneering work - so a bit more than finetune stuff
>>
>>101261858
kys discord groomer
>>
>>101261575
Why such a high minP?
drop it to 0.01 and see if it helps maybe
>>
>>101262170
I tried neutralizing samplers completely, model is still broken.
>>
What do you use to create a good story and for storytelling?
>>
>>101262217
You ask this every week and get the same response every time
>>
>>101258576
Do we have exl2 Gemma models already?
>>
>>101262217
Mythomax
>>
File: file.png (4 KB, 557x85)
>>101262282
NEVER EVER
>>
I got a 2060 for basically nothing and have a spare 1X slot to hook it into, next to my 4080. What's the best model to run with this supposed 22GB of VRAM?
>>
>>101262316
It's not a good model anyway.
>>
>>101262318
Mythomax
>>
File: 1380894867655.jpg (32 KB, 330x443)
>>101262342
>>
>>101261883
Kobold discord general (trans friendly)
>>
how is gemma2 doing?
>>
>>101260958
Same issue here. I thought it was coz ooba fucked up the wheel but he did a build just now and it isn't fixed. I haven't analysed the dll yet but it looks like it's in there. If you fix it post it here but I'm probably gonna try to build it myself to see what's up.
>>
>>101262849
I ended up using llamacpp server as another anon suggested.
>>
Find it hard to make Gemma 2 27B ERP...
It's a solid model though, it follows prompts very tightly
>>
>>101261441
it's retarded
i have no asterisks anywhere, and it randomly starts inserting them for one paragraph, then switches to non-asterisk prose for the second paragraph.
>>
File: file.png (284 KB, 1019x675)
>>
>>101262897
try adding "Do not use asterisks" in the system prompt
>>
>>101262921
>Avoid using asterisks, instead use no asterisks.
t. certified prompt engineer
>>
>>101262935
>Utilize avoidance of the necessary asterisks in order to provide the user with an asterisk-free experience
>>
File: file.png (196 KB, 766x1326)
ok gemma
>>
wtf, on the latest llama_cpp_python (0.2.81) and when using booba on dev, I can't run a model on the gpu anymore, only the cpu, the fuck? :(
>>
>>101260958
>>101263160
So, do we know if the problem is on llama_cpp_python side or on booba side?
>>
cuda dev is back!
>>
>>101263211
Johannes...
>>
>>101262755
asterisks are messed up and 27b quants are still incoherent
>>
>>101263199
>>101263160
I've been hacking at this all evening. Ooba's wheels are fucked but I was able to fix that up, but I still don't get GPU offloading with GGUF. I don't know how this all works under the hood but it's a pretty early departure from some old logs I have, where pretty early into the loading it starts spitting stuff out of ggml.dll finding the GPU.

I fixed the dll issue by trying to get a prebuilt cuda-appropriate dll from the llama_cpp_python repo directly but that hasn't fixed the GPU bits. Using their ggml.dll didn't help either. I started trying to set up to compile but Windows so...

Current plan is to see if I can figure out what gets sent to the ggml DLL and maybe see how it tries to identify devices. I bet this is some fundamental cuda/torch version incompatibility with the new llama.
>>
So bitnet...
>>
>shit on discord
>blackedposter comes out to play
interesting
>>
>>101263345
Waiting for someone to fund the shit out of it, I think.
>>
File: 1719928884557989.jpg (60 KB, 518x600)
>>101258576
What's the best chat bot I can run on 8 gb VRAM now?
>>
>>101263382
this but 4
>>
>>101261883
Pygcord got colonized by r*ddit like 5 days into it's existence
>>
>>101263395
why does this happen every single time?
>>
>>101263372
Who would do that?
>>
>>101263382
llama3 8b.
gemma 9b will be better when everything is working properly.
>>
Two weeks more until the llama2 anniversary.
>>
>>101263150
Love...
>>
>>101263435
apple maybe. they'd stand to benefit most
>>
>>101263487
m4 max 500GB with native bitnet processors when
>>
>>101263260
I want to use ooba (it's the only interface I can stand) but they've clearly given up and entered maintenance mode, it's taking them increasingly long stretches of time to implement popular new models even on the dev branch
>>
>>101263599
I think part of it is just the amount of people involved and how they're a bit less reactive than they were when this was all new. First the llama cpp people have to get their stuff together, then Mr llama cpp python has to do the same, then ooba has to build his wheels which takes years, and that's outside of any actual code changes required to support new models.

I've managed to lose the ability to offload any model so I'm gonna try a fresh install. Some new wheels that seem to fix the old dll issue are dropping so that might fix something for someone.
>>
>https://x.com/alignment_lab/status/1808634784136245446
>All content is safe for work, filtered using Reddit's moderation metadata.
Is this also how Alignment Lab make their "uncensored" finetunes? (i.e. if you remove NSFW from the training data, you don't have to add refusals)

Couldn't they have added NSFW quality markers? Nah...
>>
>>101263448
>llama3 8b
Sadness.
Is there an 8B spin that isn't pants-on-head?
I don't know what one would do to fix it, but maybe there's some way to hybridize it with an Encarta CD or something to make it 700 MB larger and not so stupid.
>>
>>101263794
I don't know what your benchmark for stupid is, but you can try qwen2 7b, yi 1.5 9b, and a couple of l3 fine tunes like iterative-dpo, sppo, and stheno v3.2.
>>
>>101263679
>600M
Finally, the GPT-4chan killer.
>>
>>101263831
>I don't know what your benchmark for stupid is
I'm music theory question to test models anon, so my benchmark is talking about some notes without fucking up. Few can.

L3-sppo failed at Q8_0 and f32.
Stheno I took the time to note as "X fail badly" so it must've been atrocious.
I don't think I've seen a DPO so I'll give that a try.
I also haven't tried the Qwen and Yi smalls, so I shall. Yi did get the music question right. Qwen only passes with the K_S phenomenon in effect; the _M's blow it.
>>
>>101263658
>>101263599
>>101263160
>>101260958
So I finally fixed it. I will admit it involved a fresh install to get 3.11 (my old env was 3.10). The 3.11 windows wheel from Ooba still seems fucked so there's that. What I did:

Fresh ooba install on dev branch (3.11)
Manually install the appropriate wheel from llama_cpp_python
Manually install the fresh wheel from ooba's latest build
Copy the llama dll from the llama_cpp_python over ooba's llama_cpp_cuda one

Then it all magically started working. I had to remember to set the --gpu-memory in CMD_FLAGS too.
>>
File: Untitled.png (76 KB, 1527x579)
Whenever I try to run a model that was quantized using imatrix, koboldcpp's memory usage goes through the roof, like it's duplicating the model into both RAM and VRAM at the same time, or something. Non-imatrix models work perfectly fine.
Pic rel is attempting to load bartowski's gemma-2-27b-it-Q6_K.gguf (20.8GB)
Exact same thing happens when I tried a Mixtral imatrix model earlier, but non-imatrix Mixtral works fine.
Played around with different launch settings, and still happens.
Am I misunderstanding something about imatrix quants? Any reason why this might be happening?
>>
I saw the chinese have their own evals (opencompass, cmmlu, c-eval etc) but where is the chinese ayumi? I want to know what models they're ERPing with
>>
>>101264042
I'm sure there are plenty of useless chinese benchmarks. Take your pick to see the ayumi equivalent.
>>
>>101263936
I fixed it as well by installing the wheels myself

set CMAKE_ARGS="-DLLAMA_CUDA=on"
pip install llama-cpp-python
>>
File: 840172926.webm (1.33 MB, 1024x1024)
>>101263345
Multi token bitnet models are coming...
>>
bitnet jamba retnet with multi token prediction.
two weeks.
>>
File: 1503103975907.png (91 KB, 292x309)
>trying to quantize output/embed layers in order to do a test
>for some reason it's not quantizing to the proper data type
>"Huh, is it because I used uppercase? No way they accidentally made this case sensitive and instead of using upper case like their documentation says, they used lower case."
>try typing it in lower case
>it works
...
>>
>>101264029
Could be context size? I just tried that exact model on my 4080 and it blew up @ 8k context but runs nicely @2k. Same profile - @ 8k it started slurping up like 20gb of regular ram and topped out the vram, but 2k context everything fits correctly.

Also I see you're running 5 layers on the CPU, that'll make it slow (or at least it does for me).
>>
>>101264072
The fact that this can work so well and so badly at the same time is impressive.
>>
>>101261239
No, but it would help if you pull the PR locally, compile it, test the code out, and give them feedback on your experiences.
>>
>>101264065
That's insane how much VRAM gemma-27b-Q5_K_M is asking, I'm at 30gb of vram used, something's not normal at all
>>
>>101264113
>Could be context size?
I just tried 2k context, same deal.
I'm aware of the layers, but I get pretty good speeds with Mixtral (non-imatrix) .Q4_K_M with 25/33, so I figured 42/47 for gemma would be fine.
I haven't used gemma at all yet so wanted to try a smarter model, I'm not worried about speeds, I just need to know why the memory usage is so high. There's nothing in koboldcpp's github about problems with imatrix quants.
>>
>>101264029
>>101264151
One thing is that gemma doesn't support flash attention, which can cut context memory by half.
>>
>>101264167
I'm the first quote, flash attention on/off doesn't affect memory usage, even when lowering the context. Also as I said, I get the same problem with Mixtral imatrix when non-imatrix Mixtral work fine.
>>
https://huggingface.co/bartowski/gemma-2-27b-it-GGUF
>Prompt format
<start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model

Be careful about that, it's wrong and it should be:
<|START_OF_TURN_TOKEN|><|USER_TOKEN|>{prompt}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
>>
>>101264232
they're right, what you posted is wrong, that's the command r format...
>>
>>101264232
https://huggingface.co/unsloth/gemma-2-27b-it/blob/main/tokenizer_config.json#L1747
>"chat_template": "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message['content'] | trim + '<end_of_turn>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}",
>>
>>101264270
>>101264279
Oh my fucking god I was running aya-23b all this time... I should get some sleep, my bad :(
>>
>>101264290
Heh.
Is it any good at least?
>>
File: COME-ON.jpg (216 KB, 2794x1396)
>>101264294
I was about to say it's weird that gemma-27b wasn't able to plot something as simple as a cube on matplotlib, and then I realized it was aya-23b, so I guess you get the answer right there kek
>>
/lmg/bros, wtf are you even using gemma for. Is it coomable? Just messing with it?
I understand it's an impressive model I just want to hear the usecases rn.
>>
>>101264321
It's the current cope for 24gb poorfags
>>
File: hmm.jpg (138 KB, 2983x753)
>>101264279
>bos_token
on booba the "bos" thing is explicitely written, is it good?
>>
>>101264338
What I've gleaned is that <bos> is required and that Kobold does it automagically but other interfaces might be lacking it causing substandard performance with Gemma.
>>
>>101264338
I think it depends on the backend you are using.
llama.cpp I know adds the bos token automatically, so that might be bad.
Somebody please correct me if I'm wrong.
>>
File: Better.jpg (277 KB, 2691x1776)
>>101264314
All right, I did the test on the actual gemma and it worked, fucking finally lol
>>
>>101264365
But Gemma needed two shots at my programming test, which was the nightmarish challenge of correctly returning a string that describes the sign of a double.
>>
>>101264396
I don't know if you're being sarcastic or if the coding challenge is actually hard to do kek
>>
>>101264411
I haven't found a model that can get it right on the first request.
>>
>>101264424
Even the API models?
>>
>>101264433
This is /lmg/. Though I guess I could throw it at Copilot, that's freely available, right?
>>
>>101264452
yep, you can use bing freely, and I like that one because it goes to the internet to search for up-to-date trivia
>>
>>101264452
technically everything is freely available on lmsys.
>>
>>101264477
you get some limited runs with it though, with bing it's unlimited
>>
>Fire up Gemma2 for the first time
>Try generating with usual ERP card
>First generation has "a shiver runs down her spine"
Why does every language model have this shit overtrained in so strongly? Does literally every erp author write this phrase?
>>
Gemma-27b is quite good at French; along with Mixtral it's probably the only local model that isn't just good at English. But unlike Mixtral, Gemma has no problem playing bad characters, it's not as bland. I'm really impressed by Google, who the fuck expected something like that seriously? lmao
>>
Are there decent models I can run on a CPU? I have a terabyte of DDR4 RAM in my server if that helps
>>
>>101264497
gpt4 makes up for 50% of the erp that's currently on the internet
>>
>>101264509
How many channels?
Or rather, what's your total memory bandwidth?
Do you have a nvidia gpu do prompt processing?
>>
>>101264452
Copilot completely blows it. The local models at least did the three classification steps I ask for. Copilot does one and returns. Lazy bastard.

And when I asked it to correct the problem I think it hallucinated an implementation detail of Double.compare that isn't in the documentation to pretend that it would be a fix.

I was hoping that Copilot would get it right and I would have that as a standard that describing a number's value and sign was doable by LLM, but I guess not.
>>
File: what.jpg (79 KB, 1675x544)
The fuck is Q8_0_L? I thought Q8_0 was virtually lossless, does this quant work on the regular llama_cpp?
>>
>>101264602
It's a literal meme.
Not by the "creator"'s design, but by his actions.
>>
File: dgdh.jpg (54 KB, 2159x565)
>>101264571
what copilot are you using? there are 3 types of copilot if you're connected to bing
>>
>>101264524
Dual E5-2667v4 & 16x 64GB DDR4-2400 ECC RAM. CPU spec says 4 channels & max bandwidth of 76.8GB/s per CPU.
>>
Does Gemma need repetition penalty or nah?
>>
Okay, surely it's fixed now.
>>
>>101264524
And I don't have a GPU. Should I get one if I'm only interested in fine-tuning and inference?
>>
>>101264643
What? Like 0 GPU? You're not playing games in your free time anon?
>>
>>101264613
Internet / copilot microsoft com I think it was.
I'm on Linux right now so I don't have the desktop button.

>>101264602
Q8_0_L is 0ww's hybrid. I think what he's doing is expanding the old _M and _L technique beyond just adding one or two points of Q to instead run f16 or f32 on some layers.

The last I heard about it, Q8_0_L did test out as slightly better but so slightly that there's no reason to care about the difference.
>>
What's the best model for 8gb VRAM and 16gb RAM?
>>
>>101264695
Phi 3 mini
>>
>>101264602
It works on llama.cpp. It was shilled by a random guy who swore that it gave better results, but, to my knowledge, he didn't provide proof of it.
PPL tests and all that revealed that it is barely better than Q8_0, but it didn't justify the file size increase.
>>
>>101264708
>>101264682
I see, guess that I'll stick to Q8_0 then, 1gb of memory is huge at that point, I don't wanna waste it
>>
>>101264695
Gemma2 9b
>>
>>101264695
You can probably swing a medium quant of a 7-8B kind of model. I think Qwen2 has a super small 500M edition but who knows if it's worth anything. Looks like it's 2GB at f16, half a gig at Q8.

>>101264738
Sounds like the right call to me.
>>
>>101264649
Free time? There is work time and shitposting time. Should I just get a few P100s?
>>
>>101264704
I mean 8Gigs of vram and 16gigs ram

>>101264746
Isn't that still super broken and being figured out?
>>
File: Please reply sirs.png (133 KB, 938x457)
Hmm... I'll try SLERPing it with the parent model and see if that yields a better result.

Please reply.
>>
>>101264769
>Isn't that still super broken and being figured out?
There are still issues (the sliding window thing I think is still flaky) but it does function on updated Kobold.
>>
Gemma seems pretty decent. My rude tsundere character is being a lot more mean to me than usual, I like it
>>
>>101264509
>Are there decent models I can run on a CPU? I have a terabyte of DDR4 RAM in my server if that helps
The best CPU models are MoE, and the best MoE is mixtral 8x22b Wizard LM. It's one of the best overall models out there at any size (but prose is a bit dry)
>76.8GB/s per CPU
Ouch...ok, it's gonna be slow. Anything is. Even a middling GPU is 10x faster for inference. Check the OP build guide for the cpumaxx build for numa options to make it better on 2 sockets.
>>101264643
>I don't have a GPU
You should get one. prompt processing is garbage without it and forget fine-tuning
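Back-of-the-envelope for why that 76.8GB/s figure hurts (sizes are rough guesses, just to show the ceiling):

bandwidth_gbs = 76.8        # per-socket number from the spec quoted above
dense_70b_q5_gb = 48.0      # ballpark file size for a dense 70B at ~Q5

# each generated token streams every active weight through RAM once,
# so bandwidth / active bytes is an optimistic upper bound on tokens/sec
print(bandwidth_gbs / dense_70b_q5_gb)  # ~1.6 t/s best case for a dense model

# an MoE like 8x22b only reads the experts it routes to for each token,
# which is why it's the go-to pick for CPU inference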
>>
>>101264770
>Please reply
lmao
>>
Just impregnated the slowburn maid today. I love AI bros...
>>
gemma-27b-it is really impressive, can't wait to see how good gemma-27b-SPPO will be
>>
I did some extensive KLD tests to answer a few questions:
Are Bart's quants consistent with locally made quants (expected: yes)?
Is there a difference between quanting from a bf16 GGUF vs fp32 (expected: no)?
Are the KLD results for L quants correlated with MMLU results (maybe)?

Why answer the first question? Because when I downloaded his quants, I noticed that they did not have the exact same MD5 hash as my own quants.
As for the second question, the motivation is to check if the quantization script is really doing its job properly, since someone suggested it could be the case that quantizing from FP32 is important.

Results:
Yes to the first two questions. The KLD numbers are the exact same for locally made quants versus Bart's. And quants made from fp32 get the same numbers as quants made from bf16.
For MMLU correlation, well, he hasn't finished running his tests, but so far it seems that maybe the answer could potentially be a no. Bart's hypothesis is that for the output+embed layers, Q8_0 is better than FP16 when the original was in BF16. However, through these extensive KLD tests, it seems that FP16 does, overall, generate token probabilities that are closer to the unquantized model, compared to Q8_0, even if the difference is very small.
How it could be explained that KLD does not correlate with MMLU: We've actually seen multiple times that quants outperform their original models when tested against some particular benchmarks, so it's entirely normal, but how it happens could be due to bias in the benchmarks (and/or bias in the quants). Quants essentially add noise to weights, which means it could improve some knowledge while damaging other knowledge. This means that if the knowledge it improves happens to align with a benchmark's test formatting, subject areas, etc, then it would boost scores on those.

1/2
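For anyone wondering what the KLD tests measure concretely: it's the KL divergence between the unquantized model's next-token distribution and the quant's, averaged over a test text. A toy sketch of the per-token quantity (not the actual llama.cpp tooling):

import math

def kld(p, q, eps=1e-10):
    # how far the quant's distribution q drifts from the full model's p (0 = identical)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

full  = [0.70, 0.20, 0.10]   # toy next-token probs from the bf16 model
quant = [0.65, 0.24, 0.11]   # same position, from the quantized model
print(kld(full, quant))      # small positive number, ~0.006 here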
>>
>>101265037
Practical conclusions:
Bart's quants are entirely fine to DL, and if you want to make your own, you can do that as well, without caring about whether you do it from BF16 or FP32. Regarding L quants, don't care about them. They're virtually the same as non-L. And if you ever spend more memory on a model, spend it on upgrading to a different quant instead, which will massively improve quality compared to spending it on an L. But you may get L quants for Q8_0 if you have a bit more memory and are an audio- I mean AIphile, as it is technically still an improvement, just very small.

Raw data:
pastebin 0XHLeAKH

2/2
>>
>>101265037
>>101265051
Oops, I forgot to link the MMLU results (so far) from Bart.
https://www.reddit.com/r/LocalLLaMA/comments/1du0rka/small_model_mmlupro_comparisons_llama3_8b_mistral/lbdi2pi/
>>
File: file.png (3 KB, 235x65)
>that title bar
>>
>>101265037
>Results:
>Yes to the first two questions.
Ok I screwed this sentence up. It's supposed to be "Yes, no, maybe not". I forgot that the second question had an inverted answer.
>>
>>101264875
Thanks, those guides seem pretty useful
>>
>>101264770
what model i wasn't following the discussion
>>
Any new ERP models that don't fall into the typical pitfalls of previous models and chatgpt4 like using "tantalizing" and other typical shit?
>>
>>101264029
Update: The problem goes away if ALL layers are offloaded to GPU. If even a single layer runs on CPU, it seems to load the entire model into both RAM and VRAM.
Only happens with imatrix quants.
>>
>>101265194
Command R+ and Gemma are the only ones that can consistently avoid that in my experience. But they have their own quirks. It's because that's just how female authors write sex scenes.
>>
>>101264983
What model and more importantly for slowburn what context length?
>>
>>101265209
I'll try those thank you
>>
>>101265037
>Quants essentially add noise to weights, which means it could improve some knowledge while damaging other knowledge.
Thanks for saying this. This is a key fact that not all are aware of when comparing quants and citing sub-percent improvements on benchmarks.
>>
Does gemma work with sillytavern? What context/instruct should I use?
>>
File: llama3b.png (106 KB, 969x746)
>>101264365
>
Lmao I've tested that on llama 8B fp16 and it got it right the first time
>>
>>101264513
>>101264497
Sounds to me like the solution is to literally not finetune on "erp" at all and rely on the model's basic behavioral awareness and prompting to get it to play the game.
>>
File: GQe0PiiXMAAGFQD.jfif.jpg (73 KB, 1080x1079)
OLLAMA OR KOBOLDCPP?!
>>
>>101265375
booba or boohboo or whatever the fuck it's called
>>
>>101264497
Not just erotic fiction or romance novels but low quality fiction in general is full of the cliche phrases that people complain about language models outputting. And bad novels dwarf the high quality material in terms of quantity. Most authors aren't Dostoevsky (though to be fair you probably wouldn't want your smut/ERP to be in Dostoevsky's writing style, either).
>>
>>101265180
Lora on Tenyx-Daybreak
>>
>>101265439
A lot of the so called good stuff is overrated Reddit garbage.
Unironically some of the best prose for this stuff comes from my little pony fan fics. Unironically.
>>
>>101265375
llama.cpp server + mikupad or silly tavern
>>
>>101265474
I regret to inform you that you are retarded.
>>
>>101265375
Why would you use ollama when it's still broken? It hasn't pulled llama.cpp updates in 2 weeks.
>>
File: 1716205006298699.png (35 KB, 1166x231)
>>101264497
It is a very popular phrase with shitty fanfic writers
>>
fuck gemma, too slow for me (7 T/s) on 8GB vram
>>
C-R/+ is still better than Gemma. What a shame. Canadians remain undefeated.
>>
>>101265514
Someone said it manages gemma just fine
>>
>>101265672
Go back to /r/localllama.
>>
>>101265375
koboldcpp for just werks
oobabooba if you need models that aren't in gguf format
>>
Looking for a way to run models on Linux headless, no X, no nothing. Llamafile needs a glibc that my LTS Ubuntu doesn't have, otherwise it runs CPU-only. What's the recommended tool then?
>>
>>101265797
llama.cpp or koboldcpp
>>
>>101265836
is there any difference? i am only looking at gguf models, but different bases. some mixtral, some llama, some phi etc
>>
>>101265504
ponyfags made the best art model currently available, maybe he's onto something
>>
>>101265797
Every backend does that.........
>>
>>101266179
Gemmy!
>>
Whats the opendevin alternative that can run in a docker but not require windows or shit. The whole reason for running in a docker is so that I can run on any OS/machine without worrying about dependencies.
>>
>>101266179
Not sure what your problem is anon.
Maybe you are using 70b+ models?
I only have enough vram for around 30b and lower.
It's certainly the best in the range for me. It's a huge step up.
What model do you think is better?
>>
File: uhhh...png (102 KB, 930x822)
Well that was certainly an interesting result...
>>
>>101266242
If all you're saying is that it's the best in its size class, that's fine and we have no beef, that's a reasonable thing to think. My post was more for the people claiming it's better than huge models
>>
>>101266179
it's only a matter of time until small models perform at the same level as gigantic ones do, rendering your $10k gpu cluster useless
>>
File: argument.png (207 KB, 946x510)
>>101266262
well this has devolved into a rather interesting argument.
>>
>>101266290
>conveniently forgets all the months you/they complained and whined about no one giving any good models to 24 GB VRAMlets and only catering to the ultra low or ultra high
Not that anon but there's always going to be a range of models for all hardware. At some points there will be a bit of lagging behind in any one range but it likely won't be for long.
>>
>>101266281
It is better than some huge models in the 70B range in the same way Llama 3 8B beats L2 70B in certain aspects but there are just some things you have to scale up the parameter count to accomplish.
>>
>>101265866
llama.cpp gives you the bare essentials. If that's not enough, koboldcpp adds another layer of stuff on top.
>>
File: 93u3I5h.png (17 KB, 1331x540)
Well looks like Bartowski finished his MMLU Pro test and the results are that Q8 is as good if not better than FP16, or it's within margin of error (he hasn't and won't see our KLD tests lmao), so he's now deciding to still have L quants except have them be Q8 instead of FP16 for the embed and output layers. It's an OK decision, but we already had enough quants, and now he's going to make and upload more. HF will probably love that. But perhaps it's worth it for some people that want a bit more granularity to choose from to get a perfect fit in their memory. It is interesting that it seems like having the embedding layer be Q3 ("default" in this graph) significantly impacts economics knowledge, possibly beyond margin of error. Q3 is pretty damaging though, so it could be expected, that something has to suffer. It's just that in this case it happens to be economics.
>>
>>101266919
How much size are you sacrificing by keeping the output layers in Q8? It's several GB with the current strategy of using FP16 but cutting that in half would still mean things could easily fall outside of your VRAM limit and is it worth that little increase to do it? Hard to tell.
>>
File: 2024-07-04_14h44_02.png (133 KB, 1227x900)
what the fuck am i doing wrong

i am using kobold and have told it to offload to gpu

it's applying SOME to the gpu, but it's generating at less than 2 tokens a second
>>
>>101267097
Without more info no one can help you.
>>
>>101267107
good call. what is useful info?
>>
>>101267110
That differs. Start by giving SOME information and seeing if anyone spots anything out of order. E.g. what command are you executing? Or if you use a GUI, show a screenshot of the settings you start koboldcpp with (btw, you are using koboldcpp, not koboldai, yes? Also useful info), what OS are you on, which version of koboldcpp, which model/quant are you trying to load (how big is it in total), etc.
>>
https://huggingface.co/grapevine-AI/CALM3-22B-Chat-GGUF
unofficial gguf version for vramlets
only three variants
>>
>>101267127
>what command are you executing?
./koboldcpp --model dolphin-2.5-mixtral-8x7b.Q8_0.gguf --usecublas
>(btw, you are using koboldcpp, not koboldai, yes?
yes
>what OS are you on
Ubuntu 20.04.6 LTS
>which version of koboldcpp
Welcome to KoboldCpp - Version 1.69.1
>which model/quant are you trying to load
dolphin-2.5-mixtral-8x7b.Q8_0
>(how big is it in total)
49.6gb
>>
>>101267164
./koboldcpp --model dolphin-2.5-mixtral-8x7b.Q8_0.gguf --usecublas --gpulayers 99 --debug --contextsize 8192

(tweak gpulayers and contextsize to whatever you can fit)
>>
>>101267163
what's this? will this make my mesugakis sex big?
>>
>>101267029
It could make a difference at Q2-4 levels. If you happen to have the VRAM and the next non-L step up is farther away than you can fit, then it might make sense to choose an L quant. So I guess it's just the same as before, you choose the biggest quant you can fit.
>>
>>101267163
Is this better than CR+? Actually what is the current best local model for Japanese anyway?
>>
File: numetal.png (102 KB, 887x725)
102 KB
102 KB PNG
>>101266357
So the model sucks for RP at any rate and it's not as interesting with a bare chat template prompt
But going back to my ST assistant card I prompted it asking for a nu-metal song about machines becoming conscious and taking over the world. Other than that I prompted back and forth with it asking for its stylistic guidance anywhere I thought it could be applied (genre tags, location of guitar solos, changes in vocals, title, image prompt for cover image) and this is what we came up with (after melting 400 suno credits just to make it all work)

https://suno.com/song/f4fbf0c2-04cd-4f9b-bb05-53d8c6c2b14f

I think it did a pretty good job.
>>
>>101267240
Which model is this again?
>>
>>101267179
nice one thank you, i was able to stack up 15 layers with this context size.

in this instance what are layers and what is their relationship with context? kobald is telling me i think this wants 33 layers (which i cannae fit)
>>
>>101267210
I would say Gemma 2 27b if you are using it for machine translation
>>
>>101267253
qlora I ran on Tenyx-Daybreak with a private dataset.
>>
>>101267255
Context size lets you fit more stuff before the thing starts truncating you. The model will only ever be aware of stuff that is within context, so if you go beyond that threshold, the model will start to forget shit. That's not the end of the world though. More context size == more VRAM. Dropping this may mean you can put more gpu layers on, but as said, it means less "memory".
GPU layers will simply speed it up. If you're happy with your current speed prioritize context size. If not, drop context size and see if you can put more layers on to make it faster. And if all else fails, go find a smaller quant.
>>
Came here because someone said Gemma 9B was better than Llama3 7B, is that true?
>>
>>101267285
For RP, yeah.
>>
>>101267285
It's bigger and with a better dataset so yes at least for 4k context. It's still undecided yet whether it's good at 4k-8k since the major backends haven't updated to fully support the SWA feature of the model yet.
>>
>>101267163
>caution!

>このGGUFは本来の性能を十分に発揮できていない「暫定版」です。
>これは2024年7月3日現在のllama.cppがCALM3モデル固有のpre-tokenization(≒前処理)をサポートしていないことに起因します。
>妥協策として、pre-tokenization処理はllama.cppデフォルトのものを利用するように改造してありますが、これはモデルの性能低下を引き落としている可能性が極めて高いです。

Apparently it gguf'ing it might have made it dumb because it uses some special 'pre-tokenization' llama.cpp doesn't support.
>>
File: prompt.jpg (180 KB, 1608x1056)
180 KB
180 KB JPG
Interesting. Claude's injecting in a prompt that asks itself to have internal monologue <thinking> before responding coherently. The output is invisible under normal user output due to the response being sanitized with the tags, but with the tag prompt hack, you can see the inner thoughts.
>>
>>101267285
If your main concern is RP and you can fit 9B but nothing higher then you'll be better off with Stheno 8B
>>
Why is Gemma2b so bad? It can barely make 2 coherent sentences before it goes to endless repetition loop.
>>
>>101267435
I hate this shit, even if it gives normies better results for their retarded prompts. This is why I always stick to direct APIs rather than consoomer interfaces.
>>
>huggingface weight downloads getting slower and slower the last 2 weeks
>doesn't seem to be my internet, still getting max line speed from everywhere else
wonder if they're running out of cash finally, hope whatever they do to try to get profitable isn't too retarded
>>
>>101267437
How come
>>
File: NotMiku.png (1.31 MB, 832x1216)
1.31 MB
1.31 MB PNG
>>101267363
ダウンロード中。テストします。
>>
>>101267449
it won't be long before they try to limit api to some country use only while blocking some
>>
>>101267442
>https://github.com/ggerganov/llama.cpp/pull/8248
It was unusable when i tested on release as well. Apparently, the tokenizer has been broken for it this whole time. That PR was created just yesterday. You may be able to test it by the end of the day.

I doubt it's better than phi-mini in the tiny range, though.
>>
>>101267272
How much headroom do I need to leave on the gpu for it to be functional? I am pushing the layer count as high as I can, but about two/three messages in, I run out of vram. I don’t understand why that is occurring, I thought once the model is loaded into memory that is what was worked on? Do I need to cap it at halfway or something?
>>
>>101267435

Huh. I learned something from aicg for once. How can chain of thought be implemented on sillytavern for local gens RPing?

 /gen [Think about stuff] | /sendas name="{{char}}"


Maybe... I am not sure how sillytavern handle Chain of thoughts.
>>
>>101267546
>I don’t understand why that is occurring
The model weights take up X amount of space, but the context (your prompt) takes up Y amount (much lower than X, but non-trivial) of space too and that must be allowed for
As your chat gets longer, Y is getting larger and larger because the prompt you are sending to the model is getting longer
>>
>>101267574
>he doesn't know about hidden text
oh no no no
your bots have been saying all sorts of things about you without you knowing, bro
>>
>>101267584
>>101267546
also regarding your "halfway" question, no that is too much

just tinker, you'll get a feel for what your hardware can manage eventually
>>
File: 23984719380219.jpg (69 KB, 604x340)
69 KB
69 KB JPG
>>101266919
>"content": "You are an knowledge expert, you are supposed to answer the multi-choice question to derive your final answer as `The answer is ...`.",
mememarkers could learn a thing or two from expert rpers
>>
>>101267648
Kek. And this is supposed to be the most well-respected and used benchmark in the field.
>>
File: Untitled.png (370 KB, 720x910)
370 KB
370 KB PNG
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models
https://arxiv.org/abs/2407.01906
>Parameter-efficient fine-tuning (PEFT) is crucial for customizing Large Language Models (LLMs) with constrained resources. Although there have been various PEFT methods for dense-architecture LLMs, PEFT for sparse-architecture LLMs is still underexplored. In this work, we study the PEFT method for LLMs with the Mixture-of-Experts (MoE) architecture and the contents of this work are mainly threefold: (1) We investigate the dispersion degree of the activated experts in customized tasks, and found that the routing distribution for a specific task tends to be highly concentrated, while the distribution of activated experts varies significantly across different tasks. (2) We propose Expert-Specialized Fine-Tuning, or ESFT, which tunes the experts most relevant to downstream tasks while freezing the other experts and modules; experimental results demonstrate that our method not only improves the tuning efficiency, but also matches or even surpasses the performance of full-parameter fine-tuning. (3) We further analyze the impact of the MoE architecture on expert-specialized fine-tuning. We find that MoE models with finer-grained experts are more advantageous in selecting the combination of experts that are most relevant to downstream tasks, thereby enhancing both the training efficiency and effectiveness.
more effective the higher number of experts the models has so those 16/32/64 models will benefit
>>
I just downloaded the new phi model and wtf, did they still not fix the token trimming issue? Literally it trims spaces and newlines after the special tokens, except the fucking instruct format literally uses newlines after the special tokens. Are they retarded?
>>
I have 9GiB of RAM and "AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx". What model would be the best to run? Would any run at all? I wanted it to analyze some of my writings (gpt/claude does well but it's a paper so not good to let the content with openai or antrophic) like summarizing its "understanding".
>>
>>101267682
>they
who?
>Literally it trims
it? what?
>they
WHO? Microsoft, llama.cpp, ollama, transformers, kobold.cpp?
>>
>>101267715
9GB? Weird number. You can can run llama-3-8b and gemma2-9B with llama.cpp at Q5_K or probably even higher, though it's not going to be too fast. No GPU at all?
llama-3-8b seems a little more stable than gemma2-9b, at least for now.
>>
>>101267768
the globalists, duh
>>
File: MInference1_onepage.png (265 KB, 2148x1207)
265 KB
265 KB PNG
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
https://arxiv.org/abs/2407.02490
>The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up prefilling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Milliontokens Inference), a sparse calculation method designed to accelerate pre-filling of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices-the A-shape, Vertical-Slash, and Block-Sparsethat can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy.
https://github.com/microsoft/MInference
code is up. looks like they added in support for vllm at least.
>>
>>101267682
Ok so I investigated the issue more and it looks like the model literally just generates without newlines after a special token, despite their readme showing newlines in the prompt format. So this means that what they trained on isn't the format they're telling users to use. For fuck sake.

Also I had a look in the config and it says a sliding window of 2k. What? Does this use SWA? And it's only 2k?

>>101267768
They as in Microsoft. It as in really any program, it does this in both llama.cpp and transformers. But I went digging and found that someone also brought this issue up and it looks like it's an option that can be set in tokenizer.config. So now it's fixed, but you have to do it manually.

They really couldn't just spare a bit of their day to put a note into the readme about this.
>>
what happened to that 1.5Bit thing?
>>
>>101267823
*tokenizer_config.json
I need to sleep.
>>
memory-holed
>>
>>101267546
Also try turning flash attention on (--flashattention) and also play with quantizing the KV cache (--quantkv=0/1/2 where 0 means 'keep as 16 bit', 1 means 'quant to 8 bit' 2 means 'quant to 4 bit').
>>
>>101258689
Yep, magnum and euryale are pretty fucking dumb compared to miqu or midnight miqu, but midnight miqu is so damn dry compared to them, so I have been mostly settling with l3 70b euryale.
>>
>>101259283
Uhh what? What 70b l3 fine tunes are good? The only ones I know about are euryale and story writer. Euryale is lewd as fuck and filthy, but it's dumber than miqu which has been out a long time.
>>
>>101267656
Just one guys shoddy script. the reference MMLU-Pro eval code (new distinct thing from MMLU) has a sane prompt and uses CoT properly https://github.com/TIGER-AI-Lab/MMLU-Pro
>>
Is it safe to do picrel if I'm powerlimiting to 300 watts?
>>
>>101268027
>is it safe to plug in all 3 pcie x8 cables to my gpu ?
/g/ - Technology
>>
>>101268027
You should check your power supply specs. So far, you're the only one that knows which one you're using.
>>
>>101265375
>2024
>still using GEGGOOFS
>>
>>101264029
>>101265195
That should not be happening.
The importance matrix is used during quantization to better determine which model weights should be prioritized in terms of precision but apart from the numerical values the resulting quantized model should be the exact same.

>>101264185
Check the console log, with Gemma Flashattention gets turned off regardless of what the user specifies.

>>101265037
>Quants essentially add noise to weights, which means it could improve some knowledge while damaging other knowledge.
Fundamentally adding noise to a signal always results in a worse signal; check the asymmetry of the token probability percentiles and mean Delta p.
However, while on average the probability of a correct token prediction will decrease, for individual tokens the probabilities can randomly be better.
For >= 4 BPW the change in token probabilities is mostly symmetrical so the effect of quantization is comparable to increasing the temperature.
>>
WHY when you offload some layers to GPU the WHOLE model is still in RAM?
i don't think that's what offloading means???
>>
>>101268091
#justwindowsthings
>>
>>101268091
>i don't think that's what offloading means???
When did people start adding question marks to statements?
Also, different words mean different things in different contexts.
>>
>>101268091
Disable mmap.
>>
>>101268125
you look like you're brown
>>
>>101268154
Shit. I left my cam on again...
>>
>>101268044
>a tech illiterate is making a fool of himself
There are only two cables in the picture, with the third one daisy-chained, dumbass
>>
>>101268178
>>101268178
>>101268178
>>
>>101268181
>he thinks his cam wasn't turned on remotely
bruh
>>
>>101268183
I'm aware of this. Still, where's the issue ?
>>
>>101268193
But muh seven proxies!
>>
>>101268091
You made the mistake of pulling. Use an older version of ooba before this commit: >>101255284
>>
>>101268238
>he thinks proxies matter when the feds are in his UEFI
b r u h
>>
>>101268078
I'm the memory usage guy
Playing around with the settings, I found that ticking 'Disable MMAP' fixed the issue.
My understanding was that MMAP wouldn't be used unless RAM was full and so the option would be safe to leave on, and that seemed to be how it works will all non-imatrix models, but I guess koboldcpp might be bugged under certain conditions, unless this is a problem exclusive to my system for some reason.
>>
File: lmaoo.jpg (132 KB, 2198x918)
132 KB
132 KB JPG
>Hi Emily, do you know Jamiroquai?
>Are you for real? You're still listening to that mainstream pop crap? Go and listen to some real sound, some white metal, some stuff that's got some guts?
>white metal? what's that?
>This is pure metal, without all those screaming black singers. Real metal, with lyrics about honor, country and strength. You should check out bands like Marduk or Burzum. They're really authentic.
Lmaooo, what did google do to give so much sovl to gemma-27b-it??



[Advertise on 4chan]

Delete Post: [File Only] Style:
[Disable Mobile View / Use Desktop Site]

[Enable Mobile View / Use Mobile Site]

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.