/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101250468 & >>101243128

►News
>(07/02) Japanese LLaMA-based model pre-trained on 2T tokens: https://hf.co/cyberagent/calm3-22b-chat
>(06/28) Inference support for Gemma 2 merged: https://github.com/ggerganov/llama.cpp/pull/8156
>(06/27) Meta announces LLM Compiler, based on Code Llama, for code optimization and disassembly: https://go.fb.me/tdd3dw
>(06/27) Gemma 2 released: https://hf.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315
>(06/25) Cambrian-1: Collection of vision-centric multimodal LLMs: https://cambrian-mllm.github.io

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
►Recent Highlights from the Previous Thread: >>101250468

--Troubleshooting Custom Limarp Adapter for Wizard Mixtral: >>101255564 >>101255603 >>101255680 >>101255900 >>101255999 >>101255658 >>101255670 >>101255749
--Pluggable RAM Sticks for GPUs: A Potential AI Powerhouse: >>101250618 >>101250663
--LLaMA.cpp Flash Attention 2 on Gemma 2 and Cache Quantization Possibilities: >>101256405 >>101256491 >>101256539
--Gemma2 Training Support Added to QLora-Pipe, Shows Promise Over LLaMA 3 8B: >>101251790 >>101252075 >>101252240 >>101252250 >>101252824
--Gemma2 21b: A Promising Alternative to LLaMA: >>101254733 >>101254757 >>101254781 >>101254860 >>101254922
--Gemma Responds Well to Specific Instructions, Like Mixtral: >>101252897 >>101252919 >>101252947 >>101252957 >>101253002
--BigTech's Hyperparameter Tuning Secrets for LLMs: >>101250835 >>101250871 >>101250881
--Anon's Japanese Travel Phrases Get Judged by Gemma 2 9b: >>101256710
--Twin Peaks AI Model Impresses with Fandom Knowledge and Sentiment Analysis: >>101254235 >>101254247 >>101254277 >>101254326
--Time-Based RNG Issues in LLMs Need a Solution: >>101254647 >>101254707 >>101254722
--Phi3-Mini Update Surpasses LLaMA 3 8B Performance: >>101255873 >>101255918
--New cpumaxxfag Server Specs Discussion: >>101250702 >>101252875 >>101252929 >>101252938
--Microsoft's MInference for Faster Long-Context LLM Inference: >>101254324
--LLaMA CPP Python Version Bump and Gemma Model Compatibility: >>101254881
--Gemma 21b Passes the Watermelon Test: >>101256472 >>101256514 >>101256945
--CXL-GPU Image: Nothingburger Without Industry Support: >>101254227 >>101254253
--Pytorch 2.2.2 Downgrade Ruins VRAM Allocation: >>101255284 >>101255775
--InternLM2.5 Collection Released on Hugging Face: >>101255862 >>101255885 >>101257520 >>101257548
--Fixing Gemma with <bos> Token: >>101253884 >>101254036 >>101254050 >>101254423
--Miku (free space): >>101250518

►Recent Highlight Posts from the Previous Thread: >>101250472
>>101258584
>Gemma 21B
It's 27B
>>101258641
>Gemma
It's Kumma
>>101258641
Our Gemma is lighter, having hollow bones for more efficient flight.
magnum is dumb as fuck
>>101258689
It's hard to move forward from Mythomax
this is it
this is the month of llama 3.5 turbo
Gemma-2 (even 9B) is Claude Sonnet.
>>101258792
Then Gemini must be God
Good morning lmg!
>>101258845
If miku is a robot, perhaps it could have weaponized T-doll modules installed in conjunction with civilian behaviors...
https://www.youtube.com/watch?v=ioRhanH4xmE
Here's a bit of morning hopium:
I've been running llama-bench and logging the results for a few months now looking for regressions in specific patches to report back to the devs, but it's been a pretty steady march forward.
These are the unfiltered t/s results for the leaked miqu 70b q5
>>101258845
Good morning Miku
oops! time to uninstall firefox! you can thank meta for this btw
>>101258990
What kinds of prompt processing speeds do you get with RAM?
>>101259034
>thank meta for firefox owners being massive faggots
I guess we can thank meta when they bundled pocket and other shitware into their shitty browser too?
>>101259077
>moving goalposts
your favorite jewcorp - meta - is building real time censorship tools, and you're indirectly supporting this by providing negative data for them to filter & classify, don't come and cry here later when you get banned for having wrong opinions on *random topic* because meta's llm hallucinated something.
>>101259034
what's the point? twitter is owned by Elon now, people can say whatever they want on their platform, and no libtard tears are gonna change that
Do any of you fine fellows have any experience with the Orin AGX? A business near me is liquidating a bunch of them for $600/piece.
Their GPUs are pretty shit (worse than a 3060) but they've got 64GB of memory and some special accelerators apparently, so I was thinking about picking one up
>>101259034
>anon has a knee-jerk reaction to the word "hate speech" without actually understanding anything about the situation
classic
What the fuck is this dude talking about kek.
>>101259163
You'd have to rely on this guy's stuff to use the accelerators right
>https://github.com/dusty-nv
>>101259163
Do they have an online shop for that?
4 tokens/second on 70B llama, it's not very good. The software is also a pain in the dick to work with if you're doing something that's even remotely memory-constrained. The shared memory is retarded, it's way easier to deal with dedicated VRAM.
Software support is also dicks, enjoy trying to compile shitty python packages for cuda that don't have ARM support.
>>101259063
>CPU prompt processing
They're abysmal, as expected (pp512 is about 22 t/s max for llama-bench of miqu 70b q5... about 10% of what I can do with CUDA). I use a GPU purely for context.
The RAM speed isn't the bottleneck with pp, it's a compute capability mismatch of CPU vs GPU.
>L3 tunes maturing on the upper and lower end of beak spectrum
>Gemma2 quirks slowly getting worked out in the middle-ground
>Beeg 400B L3 imminent (and multimodal?)
Local is about to finally start eating good on every level
>>101258576
https://x.com/BoyuanChen0/status/1808538170067407264
>Introducing Diffusion Forcing, which unifies next-token prediction (eg LLMs) and full-seq. diffusion (eg SORA)! It offers improved performance & new sampling strategies in vision and robotics, such as stable, infinite video generation, better diffusion planning, and more!
I smell a new potential
>>101259283
we've been eating good for a long time! :)
>>101259322
i smell a nothingburger
>>101259234
If it's that much of a pain in the ass, I'll skip it. Thanks for answering my question, though.
>Do they have an online shop for that?
Not as far as I know, but I can ask
>>101259361
i smell a sitcom
Recently got a 4090. What should I run other than 3.5 exl2 Command R?
>>101258990
Based regression inspector.
>>101259499
Gemma 2 27B with context extended to 16K at the highest quant you can, then report back how it compares to CR.
>>101259578
How do you extend the context?
>>101259620
RoPE/YaRN.
If you are using exl2 you want to use the alpha calculator in the OP.
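For the curious, the "alpha" exl2 exposes comes from NTK-aware RoPE scaling. A minimal sketch of the underlying relation (the linked Desmos calculator uses an empirical fit, so treat this as the textbook approximation, not its exact formula; head_dim=128 is an assumed llama-family value):

```python
def ntk_scaled_base(base: float, alpha: float, head_dim: int = 128) -> float:
    # NTK-aware scaling multiplies the rotary base by alpha^(d/(d-2)),
    # which stretches the longest rotary wavelength by roughly a factor
    # of alpha - so alpha ~ your desired context extension factor.
    return base * alpha ** (head_dim / (head_dim - 2))

print(ntk_scaled_base(10000.0, 2.0))  # base for roughly 2x context
```

In practice people nudge alpha a bit above the pure ratio (e.g. ~2.6 for 2x), which is what the calculator in the OP is for.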
>>101258689
Skill issue (lol)
Does Gemma 27B have a capped temperature? It doesn't go off the rails even at 5.0. It feels like it writes differently than at 1.0 (where it still basically never goes schizo), but that might just be placebo because I would expect that. But it remains entirely coherent.
>>101259737
I have wondered this about other models that seem to remain coherent at higher temp samplers as well. What is it that causes some models to be more tolerant of a wider range of temps while others seem to have a much narrower working range?
>>101259782
in my experience this mostly means the model is really overbaked
>>101259192
Because it's just enough to know the real use-case for this shit. You can keep sucking on that corporate AI-jew cock though, i am sure it will never ever backfire!
LONG CONTEXT ALERT
redditors say it works well past 200k
>>101259322
Gradually i began to hate them ai-jeet hypefaggots, a bunch of buzzwords and promises is just enough to catch the average lmgtard's attention!
>>101259737
>>101259782
Isn't it just a sign of more tokens being trained?
>>101259817
Yeah, but it's dumb. Failed music theory, failed a tricky programming question even after being told what the trick was.
Thousands of 7B tokens are still 7B tokens, alas.
>>101259782
It could be >>101259805, or maybe it could be related to the logit capping feature that the model needs to work correctly.
>https://github.com/ggerganov/llama.cpp/pull/8197
>>101259826
you're no better for seeing a bunch of technical terms you don't really understand and instinctually having the same reaction but in the other direction
>>101259737
I'm an idiot, with min_p 0.0 it does go schizo at higher temperatures. But it feels like even tiny min_p values are much better at keeping Gemma sane than with other models.
>>101259322
so it's a new diffusion architecture? Didn't know you could use the llm technique (next token prediction) to make pictures though
Bros I like Mikupad. It and ST are all I need.
>>101259924
Yeah, I was thinking more along the lines of it being used for audio generation with more accurate guidance for generating consistent quality.
How does gemma2 27b cope with context extension? I don't even want to bother if it goes schizo past 8k
>>101259932
Relatable. I only use ST and Mikupad too, but I haven't used ST for a long time now because I got bored of LLM slop, so I use Mikupad for random experiments.
>>101258576
Alright doc, give it to me straight.
What's the best model I can run right now that fits on 24GB?
>>101259880
Based minP 0.01-0.05 gang
>>101259782
it depends on the tail of the token distribution
when you use cross-entropy loss, it forces the model to have overconfidence on the next token at the expense of the other tokens in the distribution
when trained for a very long time, the variance of that tail can get quite high, and temperature scaling will amount to randomly sampling from the rest of the vocabulary
it's impossible to tell what they did during training to improve on that, you could do many things, even simple things like label smoothing might help, or they could have introduced an additional loss term that they didn't disclose in the paper
>>101260199
>when you use cross-entropy loss, it forces the model to have overconfidence on the next token at the expense of the other tokens in the distribution
Not sure what you mean by that. The next token is the only thing any of these models predict, and cross-entropy loss doesn't encourage "overconfidence", just "correct confidence". If you're overconfident then every missed prediction will be very expensive in terms of cross entropy. For a model to become deterministic, its training data must be predictable. That's not the fault of cross entropy but of overfitting and possibly reusing data for too many epochs.
>>101259880
>>101260195
MinP+Temp basically ended the retarded sampler debate once and for all.
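min-p is also simple enough to sketch in a few lines. This is a hedged sketch of the idea, not llama.cpp's exact implementation (real backends let you reorder samplers; here temperature is applied before the filter):

```python
import math
import random

def min_p_sample(logits, min_p=0.05, temperature=1.0):
    # min-p filtering: keep every token whose probability is at least
    # min_p times the top token's probability, then sample from the
    # renormalized survivors. Higher temperature flattens the
    # distribution, but the cutoff scales with the top token, so
    # garbage-tail tokens stay excluded.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    cutoff = min_p * max(probs)
    kept = [(i, p) for i, p in enumerate(probs) if p >= cutoff]
    r = random.uniform(0.0, sum(p for _, p in kept))
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```

With min_p around 0.01-0.05, even temperature 2+ only ever samples from plausible candidates, which matches the Gemma observations above.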
>>101259991
Mamba+forcing=back
>>101260241
>The next token is the only thing any of these models predict, and cross entropy loss doesn't encourage "overconfidence", just "correct confidence"
what exactly do you think cross-entropy is calculating and how do you think it is calculated?
in order for the model to achieve the lowest loss, it needs to have maximum confidence in the next token, that necessitates that it has low confidence in the rest of the vocabulary, as the optimal probability is 1.0 on the next token
in fact that is the very reason label smoothing exists in the first place
>>101260307
>forced
The more intellectual term would be contrived.
>>101260271
kanyemonk won
new nothingburger dropped!
>>101260535
https://x.com/AlphaSignalAI/status/1808534830683979792
Actual demo
>>101260546
pretty sick, i wonder why it speaks in 3-4 word chunks
>>101260546
artificial biden kek
>>101260307
>5%
I mean, it's not nothing. But slicing the Y axis like that is kind of a dick move.
best llm for tea recommendations?
>>101260535
>>101260546
https://moshi.chat/?queue_id=talktomoshi
>>101260571
Upstage-70b-Instruct
>>101260564
PPL doesn't make sense to compare linearly. A difference of 0.1 PPL might be the difference between highschool-level intelligence and PhD-level. (Or to be fair, it might also be a nothingburger.) Loss is similar, if that's what they're showing.
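Concretely: perplexity is exp(mean cross-entropy loss), so equal-looking PPL gaps hide very different loss gaps. A tiny sketch of the arithmetic:

```python
import math

def loss_delta(ppl_a, ppl_b):
    # PPL = exp(loss), so the loss difference between two perplexities
    # is log(ppl_a) - log(ppl_b) - i.e. it depends on the *ratio*,
    # not the absolute gap.
    return math.log(ppl_a) - math.log(ppl_b)

print(loss_delta(2.1, 2.0))    # ~0.0488 nats
print(loss_delta(10.1, 10.0))  # ~0.00995 nats: same 0.1 PPL, ~5x smaller
```

Which is why a 0.1 PPL drop near the bottom of a plot can matter far more than the same drop higher up.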
>>101260546
nowhere near gpt-4o lol
https://www.youtube.com/watch?v=vgYi3Wr7v_g
>>101260564
unless it's a flop-adjusted graph, it's meaningless
>>101260601
>>101260621
So it's more meaningless than I thought? Just "this line lower" and that's about it?
4 new commits!
AARGHH
I MUST POOOOL
Any good L3 tunes yet?
>>101260651
idk, but posting the loss graph by itself means nothing
it could easily be the case that if they adjust the transformer to match the training flops of their mamba model, the transformer wins in the end
there's no way to know since all they posted was the loss
>>101260651
Pretty much. It means "it works better," and that's it. Accuracy on a task would give a real performance indication.
Can I still into this with 16gb vram nowadays? If so what models should I even be looking at?
lmao https://x.com/ac_crypto/status/1807882764261417000
>>101260721
You won't get good speed, but you can run a model up to about 90% of your system RAM.
I'm 12 GB VRAM, 64 GB system, and I can reach up to 58 GB models as long as I don't let anything else hog too much. About 1 t/s. It's sufficient for playing around.
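That rule of thumb is easy to write down. A sketch of the post's own heuristic (an assumption-laden rule of thumb, not an exact formula - KV cache, OS, and other processes eat into the headroom):

```python
def fits_in_memory(model_gb, vram_gb, ram_gb, headroom=0.9):
    # Rough rule from the post above: a GGUF is runnable (if slowly)
    # when it fits in ~90% of combined VRAM + system RAM, leaving
    # headroom for context/KV cache and everything else.
    return model_gb <= headroom * (vram_gb + ram_gb)

print(fits_in_memory(58, 12, 64))  # the 12GB VRAM + 64GB RAM setup above
print(fits_in_memory(70, 12, 64))  # a 70GB model would not fit
```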
>>101260752
>you don't need more than 1 t/s
>3090 is $1500
nani
>>101260752
Wouldn't this take like 10 minutes per response?
>>101260322
>needs to have maximum confidence in the next token
Only if it knows with certainty what the next token is. Otherwise the expected value of a max-confidence guess will be worse, because it can be wrong.
In a sense you're correct, but only when overfitting. If the training data is truly novel on each batch, there is no issue. I realize that probably is never true in reality due to running multiple epochs and unknown data duplication, but those are the actual problem, not cross entropy loss.
>>101260721
Anything that fits at least 80% in your VRAM. You can offload the rest to your RAM.
>>101260786
Depends on the mood of the session. Right now I've got two tabs open, same model, same scenario, but I'm testing slightly different author's notes that I threw in along the way. One side, every turn is one quality paragraph. Most recent turnaround on that was 160 seconds. The other tab is in a six-paragraph mood, being a lot more detailed about the action. 606 seconds last turn.
I don't know if this is just LLM randomness or if the A/N is truly affecting how it interprets the instruction I've given it, but time-wise it's not much different from the days of AOL/AIM when you would say something, wait a bit, and then get the response. I know all y'all zoomers start to shiver barely above a whisper if feedback isn't instant, but if you have only one video card, that's what you've signed up for.
>>101260586
it's shit
exactly the same issue as with glados or gpt4o. Using fucking "sleep(2000)" after the last user input to determine whether the model can start talking is retarded and can only impress brainlets on videos. Until the model can "infer" that i'm done talking, this concept will remain nothing more than a funny tech demo.
>>101260805
i have no idea what you are saying
if you use cross entropy loss for next-token prediction, you are optimizing for maximum probability on the correct next token - that is how CEL works, the dynamics do not change regardless of what your data distribution looks like, so I have no idea what you mean by "Only if it knows with certainty what the next token is" - that is what you are optimizing for, and you'll know if the model learns how to do it by checking the loss
>If the training data is truly novel on each batch, there is no issue
what does this even mean? you do not need to train a model on infinitely new data to prevent overfitting, in fact in many cases multiple epochs on the same data is the best way to achieve generalization simply by training the model for long enough
>but those are the actual problem, not cross entropy loss
neither of those are problems, they are desired behaviors
it is expected that a model trained on CEL will assign maximum probability to the correct next token and 0 probability to all other tokens - if it truly understands the data
all of this just makes me think you don't fundamentally understand what's happening under the hood and you're extrapolating based on your intuition
I get this when using latest ooba:

---------------------------
python.exe - Entry Point Not Found
---------------------------
The procedure entry point ggml_backend_cuda_log_set_callback could not be located in the dynamic link library B:\src\text-generation-webui\installer_files\env\Lib\site-packages\llama_cpp_cuda\lib\llama.dll.
---------------------------
OK
---------------------------

Updated to latest to try gemma, and when loading the model I get this error as a message box, and the model loads in CPU. Nothing unusual in the console. Anyone?
>>101259322
AGIbros we are so back!
I always saw diffusion methods as a means of making a neural net "think" more and arrive at a more plausible output
the ability to model uncertainty at a token level through noise seems powerful - you can make tokens less noisy near tokens without noise and more noisy far from them - you don't need to arrive at the solution immediately and it's possible to go beyond token lengths it saw during training without the model shitting itself
the video and robot planning results are nice
>>101260874
>Until the model can "infer" that i'm done talking, this concept will remain nothing more than a funny tech demo
this. model should be big enough to understand it, and obviously this is impossible locally.
>>101260907
If you can predict the next token with absolute certainty then p=1.0 is not overconfidence but the correct level of confidence. In reality of course this won't happen but you can get close sometimes e.g. for tokens which are parts of words. Otherwise I don't know what you're on about here, you seem to keep assuming a perfect model which doesn't exist. There's nothing I can say to explain that I didn't already say. Unless a model is trained on the same token sequences repeatedly it should not converge to 1.0 probability.
>all of this just makes me think you don't fundamentally understand what's happening under the hood and you're extrapolating based on your intuition
I'm getting the same feeling here kek maybe we are just bad at talking
>>101260874
well how do you think people infer that their interlocutor is finished? exactly the same way, they wait for a moment of silence that's longer than the ones between words or sentences that they've heard so far. the content of what's being said can be used as a clue but i think you're overestimating the weight of those clues. it might be a naive sleep as you put it, but a conventional algorithm could easily find the break more cleverly without saddling the LLM with this problem.
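That "longer than the pauses heard so far" heuristic can be sketched in a few lines. This is a hypothetical illustration of the idea, not code from any real VAD library:

```python
def speaker_done(pause_ms, prior_pauses_ms, margin=1.5, fallback_ms=2000):
    # Adaptive endpointing sketch: declare the speaker finished when the
    # current silence is noticeably longer than their own typical
    # inter-sentence pauses observed so far.
    if not prior_pauses_ms:
        return pause_ms > fallback_ms  # nothing observed yet: fixed debounce
    typical = sorted(prior_pauses_ms)[len(prior_pauses_ms) // 2]  # median-ish
    return pause_ms > margin * typical
```

A slow, deliberate speaker with 500ms pauses would only be cut off after ~750ms of silence, while a fast talker gets snappier turn-taking - exactly the behavior a fixed sleep(2000) can't give you.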
>>101260307
I wonder how many B of transformers would be the equivalent to Mamba + forcing? looks like they managed to make their models even more efficient than transformers
>>101260958
>still using python for general inferences when there are many native supports
>>101260307
>SSMs are good for anything but text lol
>>101260989
>Unless a model is trained on the same token sequences repeatedly it should not converge to 1.0 probability
you don't seem to understand how neural networks work, but we're in /lmg/ so that's fine
>>101261096
This has almost nothing to do with neural networks, it's common sense. Suppose you have to guess the next word after
>you don't seem to understand how neural networks work, but
There are plenty of ways that sentence could continue. You won't be able to guess with 100% accuracy unless you saw it before. This is true for even the smartest models unless you suppose they have literal godlike intelligence.
>>101261006
you can't solve it with just code
humans don't just start talking when one person goes silent for 2000ms, humans understand when a person is not done yet, or that what they said isn't a full "prompt" yet. If another person goes silent for a few seconds during a dialog with me, i might just nod, or say "uhu", or just wait more. Imagine you are explaining some complex problem to it with a voice. You need to ensure you never make a pause longer than 2000ms or whatever the debounce time is hardcoded to, or you risk triggering the model too early. It feels awful.
>>101260571
Hi fellow tea chad, this test is just for you.
>>101261192
>no puerh
trash
There's a pull request for llama.cpp that I'm really excited about. Will it help if I message one of the devs directly and yell at him to work harder?
>>101261146
you have 0 clue as to what you are talking about
to you, it must sound absolutely mystical that it is even possible for a neural network to generalize to an entire distribution after seeing only a small sample set
when your intuition tells you that it's just a learned hashmap, many obvious aspects of machine learning probably seem like magic to you
>>101261192
AI dude talking about tea like an audiophile kek.
>>101261146
>This has almost nothing to do with neural networks, it's common sense.
>You won't be able to guess with 100% accuracy unless you saw it before.
and on that note, if you used common sense, you'd be able to see that your example holds no water in the context of math
but by all means, keep pretending like you know what you're talking about
>>101261243
Embedding (latent?) spaces are kind of magical, what with the meaning of proximity, direction, etc.
It's one of the coolest things in computing.
>>101261260
real people do this too, people are very serious about their tea
i wonder if it'll ever be possible to rip out the google assistant from android and swap it with some llm
>>101261210
Oh we had that at the place I worked a while ago. It was nice. I enjoyed chrys as well.
Is there a rentry for gemma sillytavern instruct settings?
>>101261210
To be frank it's just not top 10 worthy
Gemma-2-27b-it often fails asterisks. Does this happen to anyone else? Using gemma-2-27b-it-Q4_K_M.gguf by Bartowski from 1 day ago, and the llamacpp http server.
>>101261243
t. soifaced at word2vec and thinks latent spaces are magic
yes anon, I'm sure they stumbled on the deep structure of the universe, that's why their logits are carved so deep, it's not just that they got lazy and ran too many epochs over the same data
>>101261192
fun list, it at least mentions some more variety than a couple models I tested. which model is this?
just had some tie guan yin btw it was good
>>101261389
>Allow me, with all due deference, to present my case and beg for clemency from the Communist Party of China.
kek
decent enough reasoning if you ignore that similar stuff applies to many of the teas it did mention
>>101261441
happens with Q5_M too, not often though
>>101261451
not sure how you read me trashing that guy for probably thinking latent spaces are magic as meaning that I think latent spaces are magic
but you're an idiot, so anything's possible
>>101261441
Are you using rep pen?
Okay, well, either it's broken, or retarded.
>>101261441
>"personal space"
if you needed any more proof the RP training data is by w*men
>>101261549
No, just Universal-light.
>>101261484
Llama-3-TenyxChat-DaybreakStorywriter-70B
it's surprisingly solid for non-RP uses like this. Made a cute mechanic waifu card and I've been just asking shit about different jalopies, audiophile card would probably be fun too actually
>>101261569
>she tells me she's pregnant and has a nervous breakdown
>>101261441
Depends on your card too.
If the card mixes asterisks and non-asterisks for actions the model will mirror it
>>101261597
is it white?
>>101261607
I edited it myself, all conversations are asterisked properly. The card description itself obviously has none, but that works fine with other models...
>>101261441
It seems biased toward producing novel/book-style RP, in my tests (q6k). Even if it starts with asterisks and no quote marks, eventually it will begin using quote marks and narration without asterisks.
i haven't followed the whole thing for a while.
gemma 27b (works now?) for 24gb vram, which quant, which (simple) ui?
it's all so chaotic.
Does gemma 27B work with 12k ctx already?
Running the same thing on corpo servers (I posted the whole system prompt as part of a normal message), here's the outcome:
- it also loses asterisks
- it isn't retarded
So something seems to be broken still in the local implementation. This could be a prompt template issue...
>>101261655
I use Q4_K_M, and SillyTavern as client.
>>101261714
Forgot the screenshot.
>>101261638
I noticed this as well.
>>101261731
>edited
>>101261777
I edited the history to be a copy of the chat I had in sillytavern. The last message is where the clear retardation appeared in local, so that's what I was trying to reproduce.
funny how large parts of the local llm scene are carried by a small bunch of guys who just two years ago were sitting in a discord talking about hentais and pony porn
>>101261858
literally einsteins at the patent office
>>101261858
I think many of them are still doing that now
i hate discordfags so much it's unreal
>>101261858
That's just open source in general.
>>101261858
How is that surprising? Academia has become about risk aversion. So don't expect anything but "NEW DPOPPOOPPPOOPIE FINE TUNING METHOD BEATS GPT-4 ON THIS ONE CHERRY PICKED BENCHMARK. No we haven't actually tried doing anything abstract with it, CHUD"
>>101261898
don't know, some of the hentai discord guys from back then are now releasing base models or doing pioneering work - so a bit more than finetune stuff
>>101261858
kys discord groomer
>>101261575
Why such a high minP?
drop it to 0.01 and see if it helps maybe
>>101262170
I tried neutralizing samplers completely, model is still broken.
What do you use to create a good story and for storytelling?
>>101262217
You ask this every week and get the same response every time
>>101258576
Do we have exl2 Gemma models already?
>>101262217
Mythomax
>>101262282
NEVER EVER
I got a 2060 for basically nothing and have a spare 1X slot to hook it into, next to my 4080. What's the best model to run with this supposed 22GB of VRAM?
>>101262316
It's not a good model anyway.
>>101262318
Mythomax
>>101262342
>>101261883
Kobold discord general (trans friendly)
how is gemma2 doing?
>>101260958
Same issue here. I thought it was coz ooba fucked up the wheel but he did a build just now and it isn't fixed. I haven't analysed the dll yet but it looks like it's in there. If you fix it post it here but I'm probably gonna try to build it myself to see what's up.
>>101262849
I ended up using llamacpp server as another anon suggested.
Find it hard to make gemma 2 27B ERP...
It's a solid model though, follows prompts very tightly
>>101261441
it's retarded
i have no asterisks anywhere, and it randomly starts inserting them for one paragraph, then switches to non-asterisk prose for the second paragraph.
>>101262897
try adding "Do not use asterisks" in the system prompt
>>101262921
>Avoid using asterisks, instead use no asterisks.
t. certified prompt engineer
>>101262935
>Utilize avoidance of the necessary asterisks in order to provide the user with an asterisk-free experience
ok gemma
wtf, on the latest llama_cpp_python (0.2.81) and when using booba on dev, I can't load a model onto the gpu anymore, only the cpu, the fuck? :(
>>101260958
>>101263160
So, do we know if the problem is on llama_cpp_python side or on booba side?
cuda dev is back!
>>101263211
Johannes...
>>101262755
asterisks are messed up and 27b quants are still incoherent
>>101263199
>>101263160
I've been hacking at this all evening. Ooba's wheels are fucked but I was able to fix that up, but I still don't get GPU offloading with GGUF. I don't know how this all works under the hood but it's a pretty early departure from some old logs I have, where pretty early into the loading it starts spitting stuff out of ggml.dll finding the GPU.
I fixed the dll issue by trying to get a prebuilt cuda-appropriate dll from the llama_cpp_python repo directly but that hasn't fixed the GPU bits. Using their ggml.dll didn't help either. I started trying to set up to compile but Windows so...
Current plan is to see if I can figure out what gets sent to the ggml DLL and maybe see how it tries to identify devices. I bet this is some fundamental cuda/torch version incompatibility with the new llama.
So bitnet...
>shit on discord
>blackedposter comes out to play
interesting
>>101263345
Waiting for someone to fund the shit out of it, I think.
>>101258576
What's the best chat bot I can run on 8 GB VRAM now?
>>101263382
this but 4
>>101261883
Pygcord got colonized by r*ddit like 5 days into its existence
>>101263395
why does this happen every single time?
>>101263372
Who would do that?
>>101263382
llama3 8b.
gemma 9b will be better when everything is working properly.
Two weeks more until the llama2 anniversary.
>>101263150
Love...
>>101263435
apple maybe. they'd stand to benefit most
>>101263487
m4 max 500GB with native bitnet processors when
>>101263260
I want to use ooba (it's the only interface I can stand) but they've clearly given up and entered maintenance mode, it's taking them increasingly long stretches of time to implement popular new models even on the dev branch
>>101263599
I think part of it is just the amount of people involved and how they're a bit less reactive than they were when this was all new. First the llama cpp people have to get their stuff together, then Mr llama cpp python has to do the same, then ooba has to build his wheels which takes years, and that's outside of any actual code changes required to support new models.
I've managed to lose the ability to offload any model so I'm gonna try a fresh install. Some new wheels that seem to fix the old dll issue are dropping so that might fix something for someone.
>https://x.com/alignment_lab/status/1808634784136245446
>All content is safe for work, filtered using Reddit's moderation metadata.
Is this also how Alignment Lab make their "uncensored" finetunes? (i.e. if you remove NSFW from the training data, you don't have to add refusals)
Couldn't they have added NSFW quality markers? Nah...
>>101263448
>llama3 8b
Sadness.
Is there an 8B spin that isn't pants-on-head?
I don't know what one would do to fix it, but maybe there's some way to hybridize it with an Encarta CD or something to make it 700 MB larger and not so stupid.
>>101263794
I don't know what your benchmark for stupid is, but you can try qwen2 7b, yi 1.5 9b, and a couple of l3 fine tunes like iterative-dpo, sppo, and stheno v3.2.
>>101263679
>600M
Finally, the GPT-4chan killer.
>>101263831
>I don't know what your benchmark for stupid is
I'm the music-theory-question-to-test-models anon, so my benchmark is talking about some notes without fucking up. Few can.
L3-sppo failed at Q8_0 and f32.
Stheno I took the time to note as "X fail badly" so it must've been atrocious.
I don't think I've seen a DPO so I'll give that a try.
I also haven't tried the Qwen and Yi smalls, so I shall. Yi did get the music question right. Qwen only passes with the K_S phenomenon in effect, the _M's blow it.
>>101263658>>101263599>>101263160>>101260958So I finally fixed it. I will admit it involved a fresh install to get 3.11 (my old env was 3.10). The 3.11 windows wheel from Ooba still seems fucked so there's that. What I did:
Fresh ooba install on dev branch (3.11)
Manually install the appropriate wheel from llama_cpp_python
Manually install the fresh wheel from ooba's latest build
Copy the llama dll from llama_cpp_python over ooba's llama_cpp_cuda one
Then it all magically started working. I had to remember to set the --gpu-memory in CMD_FLAGS too.
Whenever I try to run a model that was quantized using imatrix, koboldcpp's memory usage goes through the roof, like it's duplicating the model into both RAM and VRAM at the same time, or something. Non-imatrix models work perfectly fine.
Pic rel is attempting to load bartowski's gemma-2-27b-it-Q6_K.gguf (20.8GB).
Exact same thing happens when I tried a Mixtral imatrix model earlier, but non-imatrix Mixtral works fine.
Played around with different launch settings, and it still happens.
Am I misunderstanding something about imatrix quants? Any reason why this might be happening?
I saw the chinese have their own evals (opencompass, cmmlu, c-eval etc) but where is the chinese ayumi? I want to know what models they're ERPing with
>>101264042I'm sure there are plenty of useless chinese benchmarks. Take your pick to see the ayumi equivalent.
>>101263936I fixed it as well by installing the wheels myself:
set CMAKE_ARGS=-DLLAMA_CUDA=on
pip install llama-cpp-python
>>101263345Multi token bitnet models are coming...
bitnet jamba retnet with multi token prediction.two weeks.
>trying to quantize output/embed layers in order to do a test>for some reason it's not quantizing to the proper data type>"Huh, is it because I used uppercase? No way they accidentally made this case sensitive and instead of using upper case like their documentation says, they used lower case.">try typing it in lower case>it works...
>>101264029Could be context size? I just tried that exact model on my 4080 and it blew up @ 8k context but runs nicely @2k. Same profile - @ 8k it started slurping up like 20gb of regular ram and topped out the vram, but 2k context everything fits correctly.Also I see you're running 5 layers on the CPU, that'll make it slow (or at least it does for me).
>>101264072The fact that this can work so well and so badly at the same time is impressive.
>>101261239No but it would help if you pull the pull request to local, compile it, test the code out, and give them feedback on your experiences.
>>101264065It's insane how much VRAM gemma-27b-Q5_K_M is asking for, I'm at 30gb of vram used, something's not normal at all
>>101264113>Could be context size?I just tried 2k context, same deal.I'm aware of the layers, but I get pretty good speeds with Mixtral (non-imatrix) .Q4_K_M with 25/33, so I figured 42/47 for gemma would be fine. I haven't used gemma at all yet so wanted to try a smarter model, I'm not worried about speeds, I just need to know why the memory usage is so high. There's nothing in koboldcpp's github about problems with imatrix quants.
>>101264029>>101264151One thing is that gemma doesn't support flash attention, which can cut context memory by half.
>>101264167I'm the first quote, flash attention on/off doesn't affect memory usage, even when lowering the context. Also as I said, I get the same problem with Mixtral imatrix when non-imatrix Mixtral works fine.
https://huggingface.co/bartowski/gemma-2-27b-it-GGUF
>Prompt format
<start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
Be careful about that, it's wrong and it should be:
<|START_OF_TURN_TOKEN|><|USER_TOKEN|>{prompt}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
>>101264232they're right, what you posted is wrong, that's the command r format...
>>101264232https://huggingface.co/unsloth/gemma-2-27b-it/blob/main/tokenizer_config.json#L1747>"chat_template": "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message['content'] | trim + '<end_of_turn>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}",
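For anyone wiring this up by hand, the Jinja template above boils down to a simple string format. A minimal Python sketch of the same logic (not HF's actual `apply_chat_template`, just a plain reimplementation for illustration):

```python
def gemma_prompt(messages, add_generation_prompt=True, bos_token="<bos>"):
    """Render a message list into Gemma 2's chat format (mirrors the Jinja template above)."""
    if messages and messages[0]["role"] == "system":
        raise ValueError("System role not supported")
    out = bos_token
    for msg in messages:
        # the template renames 'assistant' to 'model' in the turn header
        role = "model" if msg["role"] == "assistant" else msg["role"]
        out += f"<start_of_turn>{role}\n{msg['content'].strip()}<end_of_turn>\n"
    if add_generation_prompt:
        out += "<start_of_turn>model\n"
    return out

print(gemma_prompt([{"role": "user", "content": "Hello"}]))
```

Note the user/assistant alternation check from the real template is omitted here; the point is just that the format is `<start_of_turn>role`, newline, trimmed content, `<end_of_turn>`, with `<bos>` at the very front.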
>>101264270>>101264279Oh my fucking god I was running aya-23b all this time... I should get some sleep, my bad :(
>>101264290Heh.Is it any good at least?
>>101264294I was about to say it's weird that gemma-27b wasn't able to plot something as simple as a cube on matplotlib, and then I realized it was aya-23b, so I guess you get the answer right there kek
/lmg/bros, wtf are you even using gemma for. Is it coomable? Just messing with it?I understand it's an impressive model I just want to hear the usecases rn.
>>101264321It's the current cope for 24gb poorfags
>>101264279
>bos_token
on booba the "bos" thing is explicitly written, is that good?
>>101264338What I've gleaned is that <bos> is required and that Kobold does it automagically but other interfaces might be lacking it causing substandard performance with Gemma.
>>101264338I think it depends on the backend you are using.llama.cpp I know adds the bos token automatically, so that might be bad.Somebody please correct me if I'm wrong.
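The failure mode being discussed is the frontend's template text containing <bos> while the backend also auto-prepends it, giving you two. A hypothetical helper (not any backend's real API) showing the dedup a frontend would need if it can't trust the backend:

```python
def ensure_single_bos(prompt, bos="<bos>"):
    # strip however many <bos> copies are already at the front, then add exactly one,
    # so a frontend template and a backend that auto-prepends don't stack duplicates
    while prompt.startswith(bos):
        prompt = prompt[len(bos):]
    return bos + prompt

print(ensure_single_bos("<bos><bos><start_of_turn>user\nhi"))
```

In practice the cleaner fix is knowing what your backend does: if it adds <bos> itself (as llama.cpp reportedly does), leave it out of the template entirely.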
>>101264314All right, I did the test on the actual gemma and it worked, fucking finally lol
>>101264365But Gemma needed two shots at my programming test, which was the nightmarish challenge of correctly returning a string that describes the sign of a double.
>>101264396I don't know if you're being sarcastic or if the coding challenge is actually hard to do kek
>>101264411I haven't found a model that can get it right on the first request.
>>101264424Even the API models?
>>101264433This is /lmg/. Though I guess I could throw it at Copilot, that's freely available, right?
>>101264452yep, you can use bing freely, and I like that one because it searches the internet for up-to-date trivia
>>101264452technically everything is freely available on lmsys.
>>101264477you get some limited runs with it though, with bing it's unlimited
>Fire up Gemma2 for the first time>Try generating with usual ERP card>First generation has "a shiver runs down her spine"Why does every language model have this shit overtrained in so strongly? Does literally every erp author write this phrase?
Gemma-27b is quite good at french; besides Mixtral, it's probably the only one that isn't just good at english. But unlike mixtral, gemma has no problem playing bad characters, it's not as bland. I'm really impressed by Google, who the fuck expected something like that seriously? lmao
Are there decent models I can run on a CPU? I have a terabyte of DDR4 RAM in my server if that helps
>>101264497gpt4 makes up 50% of the erp that's currently on the internet
>>101264509How many channels? Or rather, what's your total memory bandwidth? Do you have an nvidia gpu to do prompt processing?
>>101264452Copilot completely blows it. The local models at least did the three classification steps I ask for. Copilot does one and returns. Lazy bastard.And when I asked it to correct the problem I think it hallucinated an implementation detail of Double.compare that isn't in the documentation to pretend that it would be a fix.I was hoping that Copilot would get it right and I would have that as a standard that describing a number's value and sign was doable by LLM, but I guess not.
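For reference, the classification itself is only a few lines once the floating-point edge cases are spelled out. A plausible Python version of such a test (the anon's exact spec is unknown; -0.0 and NaN are assumed to be part of it, since that's where Double.compare-style subtleties live):

```python
import math

def describe_sign(x: float) -> str:
    # NaN compares false with everything, so handle it before the comparisons
    if math.isnan(x):
        return "not a number"
    if x > 0.0:
        return "positive"
    if x < 0.0:
        return "negative"
    # x == 0.0 here, but it may be -0.0; copysign is how you tell the zeros apart
    # (the kind of detail Java's Double.compare also has to handle)
    return "negative zero" if math.copysign(1.0, x) < 0 else "zero"
```

The trap for LLMs is exactly that last branch: `-0.0 == 0.0` is True, so a naive three-way comparison silently merges the two zeros.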
The fuck is Q8_0_L? I thought Q8_0 was virtually lossless, does this quant work on the regular llama_cpp?
>>101264602It's a literal meme.Not by the "creator"'s design, but by his actions.
>>101264571what copilot are you using? there are 3 types of copilot if you're connected to bing
>>101264524Dual E5-2667v4 & 16x 64GB DDR4-2400 ECC RAM. CPU spec says 4 channels & max bandwidth of 76.8GB/s per CPU.
Does Gemma need repetition penalty or nah?
Okay, surely it's fixed now.
>>101264524And I don't have a GPU. Should I get one if I'm only interested in fine-tuning and inference?
>>101264643What? Like 0 GPU? You're not playing games in your free time anon?
>>101264613Internet / copilot microsoft com I think it was. I'm on Linux right now so I don't have the desktop button.
>>101264602Q8_0_L is 0ww's hybrid. I think what he's doing is expanding the old _M and _L technique beyond just adding one or two points of Q, to instead run f16 or f32 on some layers. The last I heard about it, Q8_0_L did test out as slightly better, but so slightly that there's no reason to care about the difference.
What's the best model for 8GB VRAM and 16GB RAM?
>>101264695Phi 3 mini
>>101264602It works on llama.cpp. It was shilled by a random guy who swore that it gave better results, but, to my knowledge, he didn't provide proof of it. PPL tests and all that revealed that it is barely better than Q8_0, but it didn't justify the file size increase.
>>101264708>>101264682I see, guess that I'll stick to Q8_0 then, 1gb of memory is huge at that point, I don't wanna waste it
>>101264695Gemma2 9b
>>101264695You can probably swing a medium quant of a 7-8B kind of model. I think Qwen2 has a super small 500M edition but who knows if it's worth anything. Looks like it's 2GB at f16, half a gig at Q8.>>101264738Sounds like the right call to me.
>>101264649Free time? There is work time and shitposting time. Should I just get a few P100s?
>>101264704I mean 8Gigs of vram and 16gigs ram>>101264746Isn't that still super broken and being figured out?
Hmm... I'll try SLERPing it with the parent model and see if that yields a better result.Please reply.
>>101264769>Isn't that still super broken and being figured out?There are still issues (the sliding window thing I think is still flaky) but it does function on updated Kobold.
Gemma seems pretty decent. My rude tsundere character is being a lot more mean to me than usual, I like it
>>101264509
>Are there decent models I can run on a CPU? I have a terabyte of DDR4 RAM in my server if that helps
The best CPU models are MoE, and the best MoE is mixtral 8x22b Wizard LM. It's one of the best overall models out there at any size (but prose is a bit dry).
>76.8GB/s per CPU
Ouch... ok, it's gonna be slow. Anything is. Even a middling GPU is 10x faster for inference. Check the OP build guide for the cpumaxx build for numa options to make it better on 2 sockets.
>>101264643
>I don't have a GPU
You should get one. Prompt processing is garbage without it, and forget fine-tuning.
>>101264770>Please replylmao
Just impregnated the slowburn maid today. I love AI bros...
gemma-27b-it is really impressive, can't wait to see how good gemma-27b-SPPO will be
I did some extensive KLD tests to answer a few questions:
Are Bart's quants consistent with locally made quants (expected: yes)?
Is there a difference between quanting from a bf16 GGUF vs fp32 (expected: no)?
Are the KLD results for L quants correlated with MMLU results (maybe)?
Why answer the first question? Because when I downloaded his quants, I noticed that they did not have the exact same MD5 hash as my own quants. As for the second question, the motivation is to check if the quantization script is really doing its job properly, since someone suggested it could be the case that quantizing from FP32 is important.
Results:
Yes to the first two questions. The KLD numbers are the exact same for locally made quants versus Bart's. And quants made from fp32 get the same numbers as quants made from bf16.
For MMLU correlation, well, he hasn't finished running his tests, but so far it seems that maybe the answer could potentially be a no. Bart's hypothesis is that for the output+embed layers, Q8_0 is better than FP16 when the original was in BF16. However, through these extensive KLD tests, it seems that FP16 does, overall, generate token probabilities that are closer to the unquantized model, compared to Q8_0, even if the difference is very small.
How it could be explained that KLD does not correlate with MMLU: We've actually seen multiple times that quants outperform their original models when tested against some particular benchmarks, so it's entirely normal, but how it happens could be due to bias in the benchmarks (and/or bias in the quants). Quants essentially add noise to weights, which means it could improve some knowledge while damaging other knowledge. This means that if the knowledge it improves happens to align with a benchmark's test formatting, subject areas, etc, then it would boost scores on those.
1/2
>>101265037Practical conclusions:
Bart's quants are entirely fine to DL, and if you want to make your own, you can do that as well, without caring about whether you do it from BF16 or FP32.
Regarding L quants, don't care about them. They're virtually the same as non-L. And if you ever spend more memory on a model, spend it on upgrading to a different quant instead, which will massively improve quality compared to spending it on an L. But you may get L quants for Q8_0 if you have a bit more memory and are an audio- I mean AIphile, as it is technically still an improvement, just very small.
Raw data: pastebin 0XHLeAKH
2/2
>>101265037>>101265051Oops, I forgot to link the MMLU results (so far) from Bart.https://www.reddit.com/r/LocalLLaMA/comments/1du0rka/small_model_mmlupro_comparisons_llama3_8b_mistral/lbdi2pi/
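For context, the KLD metric in these tests compares the quant's next-token probability distribution against the full-precision model's at each position, then averages. A minimal sketch of the math (not llama.cpp's actual implementation):

```python
import math

def kl_divergence(p, q, eps=1e-10):
    # KL(P || Q): information lost when Q (the quant's token probabilities)
    # is used to approximate P (the full-precision model's)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mean_kld(p_dists, q_dists):
    # average over token positions, which is roughly what gets reported
    return sum(kl_divergence(p, q) for p, q in zip(p_dists, q_dists)) / len(p_dists)

print(kl_divergence([0.7, 0.2, 0.1], [0.7, 0.2, 0.1]))  # identical dists -> ~0
```

Zero means the quant's token probabilities match the original exactly; higher means more divergence, which is why identical KLD numbers between two quant files is strong evidence they behave the same.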
>that title bar
>>101265037>Results:>Yes to the first two questions.Ok I screwed this sentence up. It's supposed to be "Yes, no, maybe not". I forgot that the second question had an inverted answer.
>>101264875Thanks, those guides seem pretty useful
>>101264770what model i wasn't following the discussion
Any new ERP models that don't fall into the typical pitfalls of previous models and chatgpt4 like using "tantalizing" and other typical shit?
>>101264029Update: The problem goes away if ALL layers are offloaded to GPU. If even a single layer runs on CPU, it seems to load the entire model into both RAM and VRAM. Only happens with imatrix quants.
>>101265194Command R+ and Gemma are the only ones that can consistently avoid that in my experience. But they have their own quirks. It's because that's just how female authors write sex scenes.
>>101264983What model and more importantly for slowburn what context length?
>>101265209I'll try those thank you
>>101265037>Quants essentially add noise to weights, which means it could improve some knowledge while damaging other knowledge.Thanks for saying this. This is a key fact that not all are aware of when comparing quants and citing sub-percent improvements on benchmarks.
Does gemma work with sillytavern? What context/instruct should I use?
>>101264365>Lmao I've tested that on llama 8B fp16 and it got it right the first time
>>101264513>>101264497Sounds to me like the solution is to literally not finetune on "erp" at all and rely on the model's basic behavioral awareness and prompting to get it to play the game.
OLLAMA OR KOBOLDCPP?!
>>101265375booba or boohboo or whatever the fuck it's called
>>101264497Not just erotic fiction or romance novels but low quality fiction in general is full of the cliche phrases that people complain about language models outputting. And bad novels dwarf the high quality material in terms of quantity. Most authors aren't Dostoevsky (though to be fair you probably wouldn't want your smut/ERP to be in Dostoevsky's writing style, either).
>>101265180Lora on Tenyx-Daybreak
>>101265439A lot of the so called good stuff is overrated Reddit garbage. Unironically some of the best prose for this stuff comes from my little pony fan fics. Unironically.
>>101265375llama.cpp server + mikupad or silly tavern
>>101265474I regret to inform you that you are retarded.
>>101265375Why would you use ollama when it's still broken? It hasn't pulled llama.cpp updates in 2 weeks.
>>101264497It is a very popular phrase with shitty fanfic writers
fuck gemma, too slow for me (7 T/s) on 8GB vram
C-R/+is still better than Gemma. What a shame. Canadians remain undefeated.
>>101265514Someone said it manages gemma just fine
>>101265672Go back to /r/localllama.
>>101265375koboldcpp for just werksoobabooba if you need models that aren't in gguf format
Looking for a way to run models on Linux headless, no X, no nothing. Llamafile needs a glibc that my LTS Ubuntu doesn't get, otherwise it runs CPU-only. What is the recommended tool, then?
>>101265797llama.cpp or koboldcpp
>>101265836is there any difference? i am only looking at gguf models, but different bases. some mixtral, some llama, some phi etc
>>101265504ponyfags made the best art model currently available, maybe he's onto something
>>101265797Every backend does that.........
>>101266179Gemmy!
Whats the opendevin alternative that can run in a docker but not require windows or shit. The whole reason for running in a docker is so that I can run on any OS/machine without worrying about dependencies.
>>101266179Not sure what your problem is anon.Maybe you are using 70b+ models?I only have enough vram for around 30b and lower.Its certainly the best in the range for me. Its a huge step up.What model do you think is better?
Well that was certainly an interesting result...
>>101266242If all you're saying is that it's the best in its size class, that's fine and we have no beef, that's a reasonable thing to think. My post was more for the people claiming it's better than huge models
>>101266179it's only a matter of time before small models perform at the same level as gigantic ones do, rendering your 10k$ gpu cluster useless
>>101266262well this has devolved into a rather interesting argument.
>>101266290>conveniently forgets all the months you/they complained and whined about no one giving any good models to 24 GB VRAMlets and only catering to the ultra low or ultra highNot that anon but there's always going to be a range of models for all hardware. At some points there will be a bit of lagging behind in any one range but it likely won't be for long.
>>101266281It is better than some huge models in the 70B range in the same way Llama 3 8B beats L2 70B in certain aspects but there are just some things you have to scale up the parameter count to accomplish.
>>101265866llama.cpp gives you the bare essentials. If that's not enough, koboldcpp adds another layer of stuff on top.
Well looks like Bartowski finished his MMLU Pro test and the results are that Q8 is as good if not better than FP16, or it's within margin of error (he hasn't and won't see our KLD tests lmao), so he's now deciding to still have L quants except have them be Q8 instead of FP16 for the embed and output layers. It's an OK decision, but we already had enough quants, and now he's going to make and upload more. HF will probably love that. But perhaps it's worth it for some people that want a bit more granularity to choose from to get a perfect fit in their memory. It is interesting that it seems like having the embedding layer be Q3 ("default" in this graph) significantly impacts economics knowledge, possibly beyond margin of error. Q3 is pretty damaging though, so it could be expected, that something has to suffer. It's just that in this case it happens to be economics.
>>101266919How much size are you sacrificing by keeping the output layers in Q8? It's several GB with the current strategy of using FP16 but cutting that in half would still mean things could easily fall outside of your VRAM limit and is it worth that little increase to do it? Hard to tell.
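The size at stake is easy to ballpark: the embed/output tensor is vocab × hidden, and Gemma 2's 256k vocabulary is what makes it balloon. A sketch, assuming Gemma 2 27B's dims (vocab 256000, hidden 4608; check the config.json if the exact numbers matter to you):

```python
def tensor_gb(rows, cols, bits):
    # bits per weight -> bytes -> GiB
    return rows * cols * bits / 8 / 1024**3

VOCAB, HIDDEN = 256_000, 4_608  # assumed Gemma 2 27B dims
fp16 = tensor_gb(VOCAB, HIDDEN, 16)
q8 = tensor_gb(VOCAB, HIDDEN, 8.5)  # Q8_0 costs ~8.5 bits/weight incl. block scales
print(f"fp16: {fp16:.2f} GB, Q8_0: {q8:.2f} GB, saved: {fp16 - q8:.2f} GB")
```

So FP16 output+embed costs roughly 2.2 GB on this model and Q8_0 about half that, i.e. halving the L-quant penalty to around 1 GB, which is exactly the "is it worth it" margin being argued about.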
what the fuck am i doing wrong
i am using kobold and have told it to offload to gpu
its applying SOME to the gpu, but its generating at less than 2 tokens a second
>>101267097Without more info no one can help you.
>>101267107good call. what is useful info?
>>101267110That differs. Start by giving SOME information and seeing if anyone spots anything out of order. E.g. what command are you executing? Or if you use a GUI, show a screenshot of the settings you start koboldcpp with (btw, you are using koboldcpp, not koboldai, yes? Also useful info), what OS are you on, which version of koboldcpp, which model/quant are you trying to load (how big is it in total), etc.
https://huggingface.co/grapevine-AI/CALM3-22B-Chat-GGUF
non-official gguf ver for vramlets
only three variations
>>101267127
>what command are you executing?
./koboldcpp --model dolphin-2.5-mixtral-8x7b.Q8_0.gguf --usecublas
>(btw, you are using koboldcpp, not koboldai, yes?
yes
>what OS are you on
Ubuntu 20.04.6 LTS
>which version of koboldcpp
Welcome to KoboldCpp - Version 1.69.1
>which model/quant are you trying to load
dolphin-2.5-mixtral-8x7b.Q8_0
>(how big is it in total)
49.6gb
>>101267164./koboldcpp --model dolphin-2.5-mixtral-8x7b.Q8_0.gguf --usecublas --gpulayers 99 --debug --contextsize 8192(tweak gpulayers and contextsize to whatever you can fit)
>>101267163what's this? will this make my mesugakis sex big?
>>101267029It could make a difference at Q2-4 levels. If you happen to have the VRAM and the next non-L step up is farther away than you can fit, then it might make sense to choose an L quant. So I guess it's just the same as before, you choose the biggest quant you can fit.
>>101267163Is this better than CR+? Actually what is the current best local model for Japanese anyway?
>>101266357So the model sucks for RP at any rate, and it's not as interesting with a bare chat template prompt. But going back to my ST assistant card, I prompted it asking for a nu-metal song about machines becoming conscious and taking over the world. Other than that, I prompted back and forth with it, asking for its stylistic guidance anywhere I thought it could be applied (genre tags, location of guitar solos, changes in vocals, title, image prompt for cover image), and this is what we came up with (after melting 400 suno credits just to make it all work):
https://suno.com/song/f4fbf0c2-04cd-4f9b-bb05-53d8c6c2b14f
I think it did a pretty good job.
>>101267240Which model is this again?
>>101267179nice one thank you, i was able to stack up 15 layers with this context size. in this instance what are layers and what is their relationship with context? kobold is telling me i think this wants 33 layers (which i cannae fit)
>>101267210I would say Gemma 2 27b if you are using it for machine translation
>>101267253qlora I ran on Tenyx-Daybreak with a private dataset.
>>101267255Context size lets you fit more stuff before the thing starts truncating you. The model will only ever be aware of stuff that is within context, so if you go beyond that threshold, the model will start to forget shit. That's not the end of the world though. More context size == more VRAM. Dropping this may mean you can put more gpu layers on, but as said, it means less "memory". GPU layers will simply speed it up. If you're happy with your current speed prioritize context size. If not, drop context size and see if you can put more layers on to make it faster. And if all else fails, go find a smaller quant.
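The context part of that tradeoff can be estimated: the KV cache stores a K and a V vector per layer per token, so its size scales linearly with context length. A rough sketch with assumed Gemma-2-27B-ish dims (46 layers, 16 KV heads, head_dim 128; treat these as illustrative, not gospel):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context, bytes_per_elem=2):
    # 2x for K and V; fp16 cache by default (bytes_per_elem=2)
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem / 1024**3

# assumed Gemma-2-27B-ish dims: 46 layers, 16 KV heads, head_dim 128
print(kv_cache_gb(46, 16, 128, 2048))  # ~0.72 GB
print(kv_cache_gb(46, 16, 128, 8192))  # 4x the context -> 4x the cache
```

This is why dropping context frees VRAM for more GPU layers: going from 8k to 2k context on dims like these gives back a couple of GB.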
Came here because someone said Gemma 9B was better than Llama3 7B, is that true?
>>101267285For RP, yeah.
>>101267285It's bigger and with a better dataset so yes at least for 4k context. It's still undecided yet whether it's good at 4k-8k since the major backends haven't updated to fully support the SWA feature of the model yet.
>>101267163
>caution!
>This GGUF is a "provisional version" that cannot fully deliver the model's original performance.
>This is because llama.cpp, as of July 3, 2024, does not support the CALM3 model's specific pre-tokenization (i.e. preprocessing).
>As a workaround, it has been modified to use llama.cpp's default pre-tokenization, but it is extremely likely that this degrades the model's performance.
Apparently gguf'ing it might have made it dumb because it uses some special 'pre-tokenization' llama.cpp doesn't support.
Interesting. Claude's injecting a prompt that asks itself to have an internal monologue in <thinking> before responding coherently. The output is invisible to the normal user because the response is sanitized of the tags, but with the tag prompt hack, you can see the inner thoughts.
>>101267285If your main concern is RP and you can fit 9B but nothing higher then you'll be better off with Stheno 8B
Why is Gemma2b so bad? It can barely make 2 coherent sentences before it goes to endless repetition loop.
>>101267435I hate this shit, even if it gives normies better results for their retarded prompts. This is why I always stick to direct APIs rather than consoomer interfaces.
>huggingface weight downloads getting slower and slower the last 2 weeks
>doesn't seem to be my internet, still getting max line speed from everywhere else
wonder if they're running out of cash finally, hope whatever they do to try to get profitable isn't too retarded
>>101267437How come
>>101267363Downloading it now. Will test.
>>101267449it won't be long before they try to limit the api to certain countries only while blocking others
>>101267442>https://github.com/ggerganov/llama.cpp/pull/8248It was unusable when i tested on release as well. Apparently, the tokenizer has been broken for it this whole time. That PR was created just yesterday. You may be able to test it by the end of the day.I doubt it's better than phi-mini in the tiny range, though.
>>101267272How much headroom do I need to leave on the gpu for it to be functional? I am pushing the layer count as high as I can, but about two/three messages in, I run out of vram. I don’t understand why that is occurring, I thought once the model is loaded into memory that is what was worked on? Do I need to cap it at halfway or something?
>>101267435Huh. I learned something from aicg for once. How can chain of thought be implemented in sillytavern for local RP gens? /gen [Think about stuff] | /sendas name="{{char}}" maybe... I am not sure how sillytavern handles chain of thought.
/gen [Think about stuff] | /sendas name="{{char}}"
>>101267546
>I don't understand why that is occurring
The model weights take up X amount of space, but the context (your prompt) takes up Y amount (much lower than X, but non-trivial) of space too, and that must be allowed for. As your chat gets longer, Y gets larger and larger because the prompt you are sending to the model is getting longer.
>>101267574>he doesn't know about hidden textoh no no noyour bots have been saying all sorts of things about you without you knowing, bro
>>101267584>>101267546also regarding your "halfway" question, no that is too much. just tinker, you'll get a feel for what your hardware can manage eventually
>>101266919>"content": "You are an knowledge expert, you are supposed to answer the multi-choice question to derive your final answer as `The answer is ...`.",mememarkers could learn a thing or two from expert rpers
>>101267648Kek. And this is supposed to be the most well-respected and used benchmark in the field.
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Modelshttps://arxiv.org/abs/2407.01906>Parameter-efficient fine-tuning (PEFT) is crucial for customizing Large Language Models (LLMs) with constrained resources. Although there have been various PEFT methods for dense-architecture LLMs, PEFT for sparse-architecture LLMs is still underexplored. In this work, we study the PEFT method for LLMs with the Mixture-of-Experts (MoE) architecture and the contents of this work are mainly threefold: (1) We investigate the dispersion degree of the activated experts in customized tasks, and found that the routing distribution for a specific task tends to be highly concentrated, while the distribution of activated experts varies significantly across different tasks. (2) We propose Expert-Specialized Fine-Tuning, or ESFT, which tunes the experts most relevant to downstream tasks while freezing the other experts and modules; experimental results demonstrate that our method not only improves the tuning efficiency, but also matches or even surpasses the performance of full-parameter fine-tuning. (3) We further analyze the impact of the MoE architecture on expert-specialized fine-tuning. We find that MoE models with finer-grained experts are more advantageous in selecting the combination of experts that are most relevant to downstream tasks, thereby enhancing both the training efficiency and effectiveness.more effective the higher number of experts the models has so those 16/32/64 models will benefit
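The selection step in ESFT reduces to: rank experts by how often the task's tokens route to them, keep the smallest set covering most of the routing mass, freeze the rest. A toy sketch of that selection (not the paper's actual code):

```python
def select_experts(routing_counts, coverage=0.9):
    """Pick the smallest set of experts whose activation share reaches `coverage`;
    in ESFT terms, everything outside this set would stay frozen during fine-tuning."""
    total = sum(routing_counts.values())
    chosen, acc = [], 0.0
    for expert, count in sorted(routing_counts.items(), key=lambda kv: -kv[1]):
        chosen.append(expert)
        acc += count / total
        if acc >= coverage:
            break
    return chosen

# a concentrated routing distribution: a handful of experts dominate the task
counts = {"e0": 500, "e1": 300, "e2": 120, "e3": 50, "e4": 30}
print(select_experts(counts))  # the top 3 already cover 92% of the routing mass
```

The paper's observation that routing is highly concentrated per task is what makes this viable: with finer-grained experts, the chosen set is a small fraction of total parameters.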
I just downloaded the new phi model and wtf, did they still not fix the token trimming issue? Literally it trims spaces and newlines after the special tokens, except the fucking instruct format literally uses newlines after the special tokens. Are they retarded?
I have 9GiB of RAM and "AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx". What model would be the best to run? Would any run at all? I want it to analyze some of my writings (gpt/claude does well, but it's a paper, so it's not good to leave the content with OpenAI or Anthropic), like summarizing its "understanding".
>>101267682>theywho?>Literally it trimsit? what?>they WHO? Microsoft, llama.cpp, ollama, transformers, kobold.cpp?
>>101267715
9GB? Weird number. You can run llama-3-8b and gemma2-9B with llama.cpp at Q5_K or probably even higher, though it's not going to be too fast. No GPU at all?
llama-3-8b seems a little more stable than gemma2-9b, at least for now.
>>101267768the globalists, duh
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attentionhttps://arxiv.org/abs/2407.02490>The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up prefilling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Milliontokens Inference), a sparse calculation method designed to accelerate pre-filling of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices-the A-shape, Vertical-Slash, and Block-Sparsethat can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy. https://github.com/microsoft/MInferencecode is up. looks like they added in support for vllm at least.
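Of the three patterns, A-shape is the easiest to picture: every query attends to a few global "sink" tokens at the start of the sequence plus a local causal window. A toy boolean-mask sketch of that pattern (nothing to do with MInference's actual GPU kernels):

```python
def a_shape_mask(seq_len, n_sink=4, window=8):
    # mask[i][j] is True where query i may attend to key j (causal)
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(i + 1):                 # causal: only keys up to position i
            if j < n_sink or i - j < window:   # global sink tokens + local window
                mask[i][j] = True
    return mask

m = a_shape_mask(16)
# kept entries grow linearly with seq_len instead of quadratically
print(sum(row.count(True) for row in m))
```

Each row keeps at most n_sink + window entries, so the attention cost becomes O(n) instead of O(n^2), which is where the claimed pre-fill speedup at million-token contexts comes from.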
>>101267682Ok so I investigated the issue more and it looks like the model literally just generates without newlines after a special token, despite their readme showing newlines in the prompt format. So this means that what they trained on isn't the format they're telling users to use. For fuck sake.Also I had a look in the config and it says a sliding window of 2k. What? Does this use SWA? And it's only 2k? >>101267768They as in Microsoft. It as in really any program, it does this in both llama.cpp and transformers. But I went digging and found that someone also brought this issue up and it looks like it's an option that can be set in tokenizer.config. So now it's fixed, but you have to do it manually.They really couldn't just spare a bit of their day to put a note into the readme about this.
what happened to that 1.5Bit thing?
>>101267823
*tokenizer_config.json
I need to sleep.
memory-holed
>>101267546
Also try turning flash attention on (--flashattention) and play with quantizing the KV cache (--quantkv=0/1/2, where 0 means 'keep as 16 bit', 1 means 'quant to 8 bit', and 2 means 'quant to 4 bit').
>>101258689
Yep, magnum and euryale are pretty fucking dumb compared to miqu or midnight miqu, but midnight miqu is so damn dry compared to them, so I've mostly been settling for l3 70b euryale.
>>101259283
Uhh what? What 70b l3 fine-tunes are good? The only ones I know about are euryale and story writer. Euryale is lewd as fuck and filthy, but it's dumber than miqu, which has been out a long time.
>>101267656
Just one guy's shoddy script. The reference MMLU-Pro eval code (a new thing, distinct from MMLU) has a sane prompt and uses CoT properly: https://github.com/TIGER-AI-Lab/MMLU-Pro
Is it safe to do picrel if I'm powerlimiting to 300 watts?
>>101268027
>is it safe to plug in all 3 pcie x8 cables to my gpu?
/g/ - Technology
>>101268027
You should check your power supply specs. So far, you're the only one who knows which one you're using.
>>101265375
>2024
>still using GEGGOOFS
>>101264029
>>101265195
That should not be happening. The importance matrix is used during quantization to better determine which model weights should be prioritized in terms of precision, but apart from the numerical values the resulting quantized model should be exactly the same.
>>101264185
Check the console log; with Gemma, FlashAttention gets turned off regardless of what the user specifies.
>>101265037
>Quants essentially add noise to weights, which means it could improve some knowledge while damaging other knowledge.
Fundamentally, adding noise to a signal always results in a worse signal; check the asymmetry of the token probability percentiles and the mean Delta p.
However, while on average the probability of a correct token prediction will decrease, for individual tokens the probabilities can randomly be better.
For >= 4 BPW the change in token probabilities is mostly symmetrical, so the effect of quantization is comparable to increasing the temperature.
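The "quantization is roughly symmetric noise" point is easy to see with a naive round-to-nearest quant in numpy. To be clear, this is just an illustration, not how llama.cpp's block-scaled quants or imatrix-weighted rounding actually work: per-weight errors go in both directions and average out near zero, but the RMS error, i.e. the damage to the signal, is strictly positive.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)  # stand-in "weights"

def quantize_rtn(w, bits=4):
    """Naive round-to-nearest quantization with a single scale.
    Real GGUF quants use per-block scales (and, with an imatrix,
    importance-weighted rounding); this just shows the noise behavior."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

err = quantize_rtn(w) - w
print(f"mean error {err.mean():+.5f}, RMS error {np.sqrt((err**2).mean()):.5f}")
```

The mean error lands near zero (individual weights are as likely to be nudged up as down), while the RMS error stays well above it, matching the post: some tokens randomly get better, but the average prediction gets worse.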
WHY when you offload some layers to GPU the WHOLE model is still in RAM?
i don't think that's what offloading means???
>>101268091
#justwindowsthings
>>101268091
>i don't think that's what offloading means???
When did people start adding question marks to statements?
Also, different words mean different things in different contexts.
>>101268091
Disable mmap.
>>101268125
you look like you're brown
>>101268154
Shit. I left my cam on again...
>>101268044
>a tech illiterate is making a fool of himself
There are only two cables in the picture, with the third one daisy-chained, dumbass
>>101268178
>>101268178
>>101268178
>>101268181
>he thinks his cam wasn't turned on remotely
bruh
>>101268183
I'm aware of this. Still, where's the issue?
>>101268238
But muh seven proxies!
>>101268091
You made the mistake of pulling. Use an older version of ooba from before this commit: >>101255284
>>101268078
>he thinks proxies matter when the feds are in his UEFI
b r u h
>>101268078
I'm the memory usage guy.
Playing around with the settings, I found that ticking 'Disable MMAP' fixed the issue. My understanding was that MMAP wouldn't be used unless RAM was full, and so the option would be safe to leave on, and that seemed to be how it worked with all non-imatrix models, but I guess koboldcpp might be bugged under certain conditions, unless this is a problem exclusive to my system for some reason.
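For anyone confused by the mmap back-and-forth: with mmap the loader maps the model file into its address space and the kernel pages it in on demand, so the file's pages show up in cache/RSS accounting even though they're reclaimable; with mmap disabled the weights are instead read into private allocations that only the process owns. A minimal sketch of the mapping side, using a throwaway stand-in file rather than a real GGUF:

```python
import mmap
import os
import tempfile

# Write a small stand-in "model file", then map it instead of read()ing it.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"GGUF" + b"\x00" * 1020)
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    magic = mm[:4]  # pages are faulted in on access, not at map time
    mm.close()
os.unlink(path)
print(magic)  # b'GGUF'
```

The bytes only become resident when touched, and the kernel can drop them again under memory pressure, which is why memory-usage tools can make a mapped model look like it's "still in RAM" alongside the offloaded layers.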
>Hi Emily, do you know Jamiroquai?
>Are you for real? You're still listening to that mainstream pop crap? Go and listen to some real sound, some white metal, some stuff that's got some guts?
>white metal? what's that?
>This is pure metal, without all those screaming black singers. Real metal, with lyrics about honor, country and strength. You should check out bands like Marduk or Burzum. They're really authentic.
Lmaooo, what did google do to give so much sovl to gemma-27b-it??