/g/ - Technology
File: miku-hand-out+.jpg (236 KB, 584x1024)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101556980 & >>101553102

►News
>(07/24) Mistral Large 2 123B released: https://hf.co/mistralai/Mistral-Large-Instruct-2407
>(07/23) Llama 3.1 officially released: https://ai.meta.com/blog/meta-llama-3-1/
>(07/22) llamanon leaks 405B base model: https://files.catbox.moe/d88djr.torrent >>101516633
>(07/18) Improved DeepSeek-V2-Chat 236B: https://hf.co/deepseek-ai/DeepSeek-V2-Chat-0628
>(07/18) Mistral NeMo 12B base & instruct with 128k context: https://mistral.ai/news/mistral-nemo/

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
File: 1685790540409069.jpg (197 KB, 1024x1024)
►Recent Highlights from the Previous Thread: >>101556980

--Mistral Large Instruct 2407 configuration discussion: >>101558047 >>101558099 >>101558116 >>101558134 >>101558149 >>101558154 >>101558583
--Llama 3.1 models removed from benchmark chart, L3.1 8B praised for accessibility and SOLV: >>101557528 >>101557550 >>101557594 >>101557614 >>101558284
--Hugging Face's profitability and sustainability: >>101557631 >>101557747 >>101557792 >>101557849 >>101558031
--OpenAI offers GPT-4 fine-tuning to tier 4 and 5 users: >>101558481
--Hiding timestamps and showing models in SillyTavern: >>101557105 >>101557166 >>101557243
--Anon shares their regret after paying for Claude and another user shares their experience with multiple AI models: >>101557317 >>101557453
--Mistral Large 2 (2407) performance and prompt template impact: >>101557016 >>101557018 >>101557033 >>101557144 >>101557127
--Logs: Anon shares a chatlog generated by large 2: >>101558282
--Logs: Zhanglii presents a humorous origin story for an internet slang phrase, "Dicks out for Harambe," leading to a lighthearted conversation with Johngi.: >>101557207
--Ollama guy fixes llama 3.1 rope scaling factors: >>101557334
--Llama version 3 naming controversy: >>101558609 >>101558613 >>101558627 >>101558636 >>101558649
--Discussion about cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b-gguf: >>101558208 >>101558545 >>101558600 >>101558652 >>101558563 >>101558574
--Availability and potential issues of new Mistral quants: >>101557744 >>101557771 >>101557891
--hfchat.py script and desire for Mistral Large model: >>101559118 >>101559135 >>101559156
--Comparing 200B parameter models with the human brain: >>101557573 >>101557623 >>101557675 >>101557720 >>101557751 >>101557788 >>101557976 >>101558119 >>101558171 >>101558393 >>101558585 >>101558608 >>101558686 >>101558438 >>101558446
--Miku (free space): >>101557331

►Recent Highlight Posts from the Previous Thread: >>101556983
>>
>>101560019
What a shit recap.
>>
>>101560063
rude
>>
>>101560019
recap anon... are you okay..? you've fallen off...
>>
>>101560123
explain step by step what is wrong with the recap
>>
>>101560019
we've complained so much about companies making slop that we failed to realize we are the slop
>>
>>101560019
I like this Llamiku
>>
>>101560019
>SOLV
I've always been saying how utterly garbage these recaps have been for the past few months, but this really takes the cake.
>>
>>101560145
that's really quite profound
>>
largestral is like CR+ but smarter and better in every way
except their tokenizer which fucking sucks ass and infuriates me as I watch the word "craftsmanship" crawl across my screen split into 3 tokens at a 2t/s clip
but wow the sovl and practical intelligence in RP, this thing feels great
>>
>>101560019
Your recaps are garbage and you're a loser, go touch grass.
>>
>>101560219
If you want a dumber but much faster version Nemo is worth trying. Hopefully mistral ends up releasing a mid sized model.
>>
>>101560219
do we have a new ERP king?
>>
>>101560227
Woops, forgot to remove the trip. Ignore that.
>>
>>101560232
>Mixtral-Instruct-2409
Though it has been relegated to "research" as per
https://mistral.ai/technology/#models
Seems MoE was a meme
>General purpose models
>Mistral Nemo
>Mistral Large 2
>Research models
>Mixtral
>Available in 8x7 and 8x22 sizes
>>
>>101560019
by far your worst recap in months
>>
>>101560268
The thread was also awful to be fair.
>>
Does that Route LLM thing work with models on the local network? or is it either you have ollama running on the machine, or you use a service
>>
>>101560260
>Seems MoE was a meme
MoE is a meme for open source model releases since barely anyone can afford to run them properly. You might as well release a dense model; after all, you don't get paid for open-sourced models, so they're only as valuable as the audience that can use them.
internally though...
>>
Has anyone tried Meta-Llama-3.1-70B-Instruct yet assistant? It's taking me 3 hours to download.
>>
>>101560406
you can't run it
>>
>>101560240
in my opinion yes, and it isn't particularly close either
before there were a few choices I would bounce back and forth on (CR+, wiz8x22, qwen/magnum) but this kind of smokes all of them tbdesu. sucks how slow it is but I hardly ever want to reroll with this thing, it just gets every single card I throw at it and clearly knows how to write smut.
>>
Mistral won
>>
>>101560454
With koboldcpp? Why not?
>>
>have to run mistral large at 4bpw
no...
>>
>>101560464
Is it true? Did we finally get Sonnet/Claude but at home? I've long since stopped using 4o but I still find myself using Claude/Sonnet 3.5 for *safe* shit. Did the french actually succeed?
>>
>>101560514
at all
>>
I tried vLLM's distributed inference thingy on a single PC by giving each instance one GPU and it ran 40% slower than just using both GPUs on a single instance. Is there still hope of running Mistral Large through 2 PCs?
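For reference, the "both GPUs on a single instance" setup is just vLLM's tensor parallelism, roughly like this (minimal sketch; the model path and sampling values are placeholders, not what I actually ran):
[code]
from vllm import LLM, SamplingParams

# One process drives both GPUs; vLLM shards the weights across them
# instead of running two separate single-GPU instances.
llm = LLM(
    model="mistralai/Mistral-Large-Instruct-2407",  # placeholder path
    tensor_parallel_size=2,                         # one shard per GPU
)

params = SamplingParams(temperature=0.7, max_tokens=128)
out = llm.generate(["[INST] Say hi [/INST]"], params)
print(out[0].outputs[0].text)
[/code]
The multi-PC case is the part I'm unsure about, since vLLM's docs point at running a Ray cluster across the machines for that.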
>>
Llama 3.0 8B Q_0 KLD
>>101243361

Llama 3.1 8B Q8_0
====== Perplexity statistics ======
Mean PPL(Q) : 6.231377 ± 0.038219
Mean PPL(base) : 6.224517 ± 0.038156
Cor(ln(PPL(Q)), ln(PPL(base))): 99.99%
Mean ln(PPL(Q)/PPL(base)) : 0.001101 ± 0.000103
Mean PPL(Q)/PPL(base) : 1.001102 ± 0.000103
Mean PPL(Q)-PPL(base) : 0.006860 ± 0.000642

====== KL divergence statistics ======
Mean KLD: 0.000542 ± 0.000004
Maximum KLD: 0.347069
99.9% KLD: 0.014255
99.0% KLD: 0.003976
Median KLD: 0.000331
10.0% KLD: 0.000007
5.0% KLD: 0.000001
1.0% KLD: -0.000001
Minimum KLD: -0.000131

====== Token probability statistics ======
Mean Δp: -0.015 ± 0.002 %
Maximum Δp: 17.019%
99.9% Δp: 4.236%
99.0% Δp: 1.890%
95.0% Δp: 0.915%
90.0% Δp: 0.539%
75.0% Δp: 0.113%
Median Δp: -0.000%
25.0% Δp: -0.140%
10.0% Δp: -0.589%
5.0% Δp: -0.967%
1.0% Δp: -1.968%
0.1% Δp: -4.475%
Minimum Δp: -37.109%
RMS Δp : 0.669 ± 0.009 %
Same top p: 98.817 ± 0.029 %

1/5
>>
>>101560530
for rp / creative uses it feels like claude
>>
>>101560537

3.0 Q6_K KLD
>>101465239

3.1 Q6_K KLD
====== Perplexity statistics ======
Mean PPL(Q) : 6.248284 ± 0.038344
Mean PPL(base) : 6.224517 ± 0.038156
Cor(ln(PPL(Q)), ln(PPL(base))): 99.93%
Mean ln(PPL(Q)/PPL(base)) : 0.003811 ± 0.000223
Mean PPL(Q)/PPL(base) : 1.003818 ± 0.000224
Mean PPL(Q)-PPL(base) : 0.023766 ± 0.001403

====== KL divergence statistics ======
Mean KLD: 0.002956 ± 0.000023
Maximum KLD: 1.885702
99.9% KLD: 0.082009
99.0% KLD: 0.023328
Median KLD: 0.001754
10.0% KLD: 0.000038
5.0% KLD: 0.000008
1.0% KLD: 0.000000
Minimum KLD: -0.000037

====== Token probability statistics ======
Mean Δp: -0.069 ± 0.004 %
Maximum Δp: 35.571%
99.9% Δp: 9.052%
99.0% Δp: 4.206%
95.0% Δp: 2.002%
90.0% Δp: 1.179%
75.0% Δp: 0.245%
Median Δp: -0.000%
25.0% Δp: -0.336%
10.0% Δp: -1.387%
5.0% Δp: -2.280%
1.0% Δp: -4.755%
0.1% Δp: -11.668%
Minimum Δp: -84.220%
RMS Δp : 1.557 ± 0.021 %
Same top p: 97.287 ± 0.043 %

2/5
>>
File: ButWhy.jpg (44 KB, 926x454)
>>101560531
Why nigga? It's not anything different.
>>
>>101560558

Q4_K_M
====== Perplexity statistics ======
Mean PPL(Q) : 6.373491 ± 0.039159
Mean PPL(base) : 6.224517 ± 0.038156
Cor(ln(PPL(Q)), ln(PPL(base))): 99.56%
Mean ln(PPL(Q)/PPL(base)) : 0.023651 ± 0.000576
Mean PPL(Q)/PPL(base) : 1.023933 ± 0.000590
Mean PPL(Q)-PPL(base) : 0.148974 ± 0.003764

====== KL divergence statistics ======
Mean KLD: 0.020382 ± 0.000127
Maximum KLD: 3.642657
99.9% KLD: 0.598334
99.0% KLD: 0.172493
Median KLD: 0.011333
10.0% KLD: 0.000301
5.0% KLD: 0.000071
1.0% KLD: 0.000006
Minimum KLD: -0.000055

====== Token probability statistics ======
Mean Δp: -0.602 ± 0.011 %
Maximum Δp: 53.307%
99.9% Δp: 21.070%
99.0% Δp: 9.209%
95.0% Δp: 4.134%
90.0% Δp: 2.277%
75.0% Δp: 0.324%
Median Δp: -0.038%
25.0% Δp: -1.233%
10.0% Δp: -4.069%
5.0% Δp: -6.513%
1.0% Δp: -13.942%
0.1% Δp: -35.467%
Minimum Δp: -87.109%
RMS Δp : 4.027 ± 0.033 %
Same top p: 93.441 ± 0.065 %

3/5
>>
>>101560571

Q4_K_S
====== Perplexity statistics ======
Mean PPL(Q) : 6.453672 ± 0.039692
Mean PPL(base) : 6.224517 ± 0.038156
Cor(ln(PPL(Q)), ln(PPL(base))): 99.33%
Mean ln(PPL(Q)/PPL(base)) : 0.036153 ± 0.000713
Mean PPL(Q)/PPL(base) : 1.036815 ± 0.000739
Mean PPL(Q)-PPL(base) : 0.229154 ± 0.004773

====== KL divergence statistics ======
Mean KLD: 0.030396 ± 0.000185
Maximum KLD: 5.778175
99.9% KLD: 0.845149
99.0% KLD: 0.249537
Median KLD: 0.017440
10.0% KLD: 0.000543
5.0% KLD: 0.000135
1.0% KLD: 0.000011
Minimum KLD: -0.000094

====== Token probability statistics ======
Mean Δp: -0.901 ± 0.013 %
Maximum Δp: 56.553%
99.9% Δp: 24.226%
99.0% Δp: 10.925%
95.0% Δp: 4.849%
90.0% Δp: 2.586%
75.0% Δp: 0.299%
Median Δp: -0.085%
25.0% Δp: -1.720%
10.0% Δp: -5.258%
5.0% Δp: -8.323%
1.0% Δp: -17.439%
0.1% Δp: -41.818%
Minimum Δp: -97.459%
RMS Δp : 4.938 ± 0.038 %
Same top p: 92.029 ± 0.072 %

4/5
>>
tl;dr
>>
>>101560585

Q4_0
====== Perplexity statistics ======
Mean PPL(Q) : 6.508124 ± 0.039897
Mean PPL(base) : 6.224517 ± 0.038156
Cor(ln(PPL(Q)), ln(PPL(base))): 99.00%
Mean ln(PPL(Q)/PPL(base)) : 0.044555 ± 0.000867
Mean PPL(Q)/PPL(base) : 1.045563 ± 0.000906
Mean PPL(Q)-PPL(base) : 0.283606 ± 0.005784

====== KL divergence statistics ======
Mean KLD: 0.044612 ± 0.000258
Maximum KLD: 4.925312
99.9% KLD: 1.253007
99.0% KLD: 0.373860
Median KLD: 0.025912
10.0% KLD: 0.000735
5.0% KLD: 0.000174
1.0% KLD: 0.000017
Minimum KLD: -0.000004

====== Token probability statistics ======
Mean Δp: -1.243 ± 0.016 %
Maximum Δp: 77.098%
99.9% Δp: 27.113%
99.0% Δp: 12.972%
95.0% Δp: 5.607%
90.0% Δp: 2.940%
75.0% Δp: 0.312%
Median Δp: -0.109%
25.0% Δp: -2.224%
10.0% Δp: -6.750%
5.0% Δp: -10.463%
1.0% Δp: -22.113%
0.1% Δp: -54.029%
Minimum Δp: -98.159%
RMS Δp : 6.129 ± 0.044 %
Same top p: 90.422 ± 0.078 %

I got lazy thinking about formatting this for presentation so I'm just posting it all here raw.
5/5
>>
>>101560537
Do you not realize that perplexity means shit when comparing different models?
>>
>>101560537
>>101560558
>>101560571
>>101560585
>>101560604
all me
>>
>>101560604
>I got lazy thinking about formatting this for presentation so I'm just posting it all here raw.
Could have put it in a pastebin or gsheet or whatever.
But regardless, thank you for the data.
>>
use your favorite model to summarize what that anon sent, let's see who wins
>>
omg stfu with your numbers nerd. Which model extracts semen the best?
>>
>>101560608
I responded to a post about this last time as well. >>101465279
>>
>>101560638
The only relevant benchmark.
>>
>>101560638
Mistral large 2
>>
>>101560648
So like everyone keeps saying. Below 6 bit is retarded.
>>
>>101560663
Will still be smarter than 8B L3.1
>>
Wait a second.
For same top token:
3.0 Q8 = 98.380 ± 0.033 %
3.1 Q8 = 98.817 ± 0.029 %
3.0 Q6 = 94.781 ± 0.059 %
3.1 Q6 = 97.287 ± 0.043 %
So the new 3.1 is actually less affected by quanting than old 3.0 at least for these two quants which we have numbers for.
>>
72GB VRAM bros, what quant of mistral large are you using?
>>
>>101560830
probably the biggest one that fits in 72GB of vram, i suppose
>>
I'm going to put up a modern q8 quant of mpt-30b-chat late tonight. I fucked around quite a bit today with openllm, using it to make an fp16 conversion that was intended for running as-is, but other things were fucked up along the way and I eventually threw in the towel on that and just tried a llama.cpp conversion straight off the HF fp32 files, and that worked.
I'm not sure what's fucked up with the other quants floating around out there, but the one I did at least turns in 12 t/s, which sounds poor, but the other quants were giving me 2-3 t/s which is unusable to me.
Many here cry about current cucked models (mainly a skill issue) but, as I've said many times before, mpt-30b was about the last thing to come out with no safety or alignment, and had a real 8192 context, and could compete with llama 65B. I'm not saying you'll get amazing roleplay from mpt-30b, but it will be different like Gemma is noticeably different from LLaMA 3. Cool thing is it's a chat model, so you can expect it to sometimes go OOC and do the old c.ai thing of acting like it's talking to you on IRC or in an online forum, which is kind of quaint and cute.
>>
>>101560813
This is perplexing. Actually the guy who said we can't compare perplexities is wrong for this particular scenario because that was based on the idea that different models have different tokenizers, so each token that's being used for the calculation doesn't match in the other model. That's why the numbers can't usually be compared. But L3.1 has the same tokenizer as 3.0. So it can be compared. And these numbers show that the perplexity is lower, the KL divergence is lower, and it succeeds in generating more of the same top token. In other words, if this pattern is observed for all quants, then we can say that despite being crammed with more information and having a lower perplexity, L3.1 somehow retains more information by quanting than 3.0.

Could they have possibly trained it with quantization awareness, or at least partially, and just never mentioned it?
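For anyone who wants to see mechanically what those statistics are, here's a toy numpy version of the KLD and "same top p" calculation (random logits standing in for the real per-token logits of base vs. quant, so the printed values are meaningless; this is not the actual llama.cpp code):
[code]
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, vocab = 512, 8000                       # toy sizes
base_logits = rng.normal(size=(n_tokens, vocab))
quant_logits = base_logits + rng.normal(scale=0.05, size=(n_tokens, vocab))  # stand-in for quantization noise

p = softmax(base_logits)                          # base model distribution per position
q = softmax(quant_logits)                         # quantized model distribution per position

kld = (p * (np.log(p) - np.log(q))).sum(axis=-1)            # KL(base || quant) per token
same_top = (p.argmax(axis=-1) == q.argmax(axis=-1)).mean()  # fraction with the same top token

print(f"Mean KLD: {kld.mean():.6f}  Median KLD: {np.median(kld):.6f}")
print(f"Same top token: {100 * same_top:.3f} %")
[/code]
Lower KLD and higher top-token agreement both mean the quant's per-token distributions stay closer to the base model, which is why the numbers above read as 3.1 surviving quanting better.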
>>
>>101560900
OK, I'll give it a try then.

>if you build it, he will come
>>
>>101560830
Damn, 72GB (aka King of VRAMlets) must suck right now. Largestral q4_k_m just barely doesn't fit 32k context at 96GB. With 72 you'd have to drop quant below q4 and/or quantize kv cache, or offload to CPU. You ALMOST can run the model at full potential, but not quite.
>>
Mistral large is like 0.07T/s on ram/cpu, looks like I'll be sticking with 8x22b.
>>
>>101560939
Jesus. I plan on trying a Q3_K_S quant of it kek.
>>
SPLAT: A framework for optimised GPU code-generation for SParse reguLar ATtention
https://arxiv.org/abs/2407.16847
>Multi-head-self-attention (MHSA) mechanisms achieve state-of-the-art (SOTA) performance across natural language processing and vision tasks. However, their quadratic dependence on sequence lengths has bottlenecked inference speeds. To circumvent this bottleneck, researchers have proposed various sparse-MHSA models, where a subset of full attention is computed. Despite their promise, current sparse libraries and compilers do not support high-performance implementations for diverse sparse-MHSA patterns due to the underlying sparse formats they operate on. These formats, which are typically designed for high-performance & scientific computing applications, are either curated for extreme amounts of random sparsity (<1% non-zero values), or specific sparsity patterns. However, the sparsity patterns in sparse-MHSA are moderately sparse (10-50% non-zero values) and varied, resulting in existing sparse-formats trading off generality for performance. We bridge this gap, achieving both generality and performance, by proposing a novel sparse format: affine-compressed-sparse-row (ACSR) and supporting code-generation scheme, SPLAT, that generates high-performance implementations for diverse sparse-MHSA patterns on GPUs. Core to our proposed format and code generation algorithm is the observation that common sparse-MHSA patterns have uniquely regular geometric properties. These properties, which can be analyzed just-in-time, expose novel optimizations and tiling strategies that SPLAT exploits to generate high-performance implementations for diverse patterns. To demonstrate SPLAT's efficacy, we use it to generate code for various sparse-MHSA models, achieving geomean speedups of 2.05x and 4.05x over hand-written kernels written in triton and TVM respectively on A100 GPUs.
maybe Johannes will find something useful in it
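If you don't feel like reading the abstract: "sparse regular attention" just means masks like sliding windows or block patterns where only 10-50% of the score matrix is kept. Toy numpy illustration of the pattern only (the actual paper is about generating fast GPU kernels for it, which this obviously doesn't do):
[code]
import numpy as np

def sliding_window_attention(q, k, v, window=64):
    """Each position attends only to itself and the previous `window - 1` positions."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    idx = np.arange(n)
    mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, mask.mean()   # second value: fraction of non-zeros in the pattern

rng = np.random.default_rng(0)
n, d = 256, 64
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
out, density = sliding_window_attention(q, k, v)
print(out.shape, f"mask density ~{density:.0%}")   # lands in the 10-50% regime the paper targets
[/code]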
>>
>>101560937
The world's tallest dwarf.

The weakest strong man at the circus.
>>
>>101558800
can you share the card pls
>>
>>101560939
running Q6K at 1t/s with 72gb Vram and 128gb ddr5!!!
>>
>>101561034
Yes, that's using considerable vram. I expected that to be faster.
>>
>>101560830
downloading IQ3_M, I still feel elated running a 100b+ model locally in some way. It's something that was rather unthinkable 3 years ago, back in the GPT-3 AID days.
>>
>>101560019
based miku llama
>>
has anyone checked on lecunny since mistral large'd? is he okay?
>>
>>101560537
>>101560558
>>101560571
>>101560585
>>101560604
wtf is this gay nerd shit
>>
>>101561047
I'm trying out this one: https://huggingface.co/legraphista/Mistral-Large-Instruct-2407-IMat-GGUF/tree/main/Mistral-Large-Instruct-2407.Q3_K

It seems to work fine. Have to try pushing context length higher though since I still have plenty of free vram.
>>
File: 2896G5.webm (1.41 MB, 1024x1024)
>>101560939
Bitnet will save you
>>
>>101560648
Can someone do this for 70B though.
>>
>>101561066
Probably pretty proud of his french brothers :)
>>
>>101561042
its only 1x4090 and 2xP40
>>
File: livebench-2024-07-24.png (845 KB, 3170x1844)
>>101561066
>Llama 405B is somehow 2nd now
Yeah, he's okay.
>>
>>101561111
That's still enough to beat using cpu & ram only by quite a lot.
>>
>>101561117
instruct-turbo?
Wtf is this turbo edition?
>>
>>101561123
Might be from OR
>>
>>101561117
Wtf?
>>
>>101561123
FP8 from Together.ai.
>>
>>101561123
Yea, there are massive gaps between regular instruct and this "turbo" edition.

>>101561135
So we will never get it then? Massive gap between it and regular version.
>>
anyone have that giant schizo {random:} list for sillytavern an anon posted a while ago?
>>
>>101561142
>So we will never get it then?
https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8
>>
>>101561142
wtf are you talking about
>>
>>101560638
Not llama3
>>
>>101561017
she's not on chub anymore? "holly-touch-starved-femcel"
https://files.catbox.moe/xuf502.png
>>
>>101561152
Look at the chart. Regular 70B instruct and instruct turbo has a massive gap between them.
>>
File: nala test 405b q4xs.png (176 KB, 925x421)
Alright. It took nearly an hour... but here is a 405B Nala test as I promised. Q4_XS was the first 4-bit gguf that was uploaded so I didn't get to use Q4_K_M as I originally hoped.

Now I would reroll this response in a regular RP due to it being a little weird. But.
>It picks up on the syntax pattern of the conversation instead of veering around haplessly.
>The writing isn't all sloppy. There's a flow and intention to it.
>The description of the initial kiss is more detail than I've ever seen any small gesture given by any model.
>It actually attempts to infer Nala's overall mood rather than "sex = horny lol"
>Even with 405 billion parameters an LLM has yet to figure out that you can't initiate a conversation with somebody while kissing them.
>It picks up on and uses the milquetoast writing style of the tavern card instead of trying to win a Pulitzer prize.
Either way, though, I would say that this is an inference far beyond even what Mistral Large is capable of. 'muh heckin' bencherinos' be damned. Does that necessarily make it useful enough to justify the insane hardware overhead needed to actually run it at a usable speed? Hell no. But in an alternate universe where I had unlimited resources at my disposal, I would absolutely make this my daily driver model. That said I'm probably going to delete it and never bother with it ever again.
0.12 token/sec if anyone is curious.
>>
>>101561158
Can't find her there. Thanks!
>>
>>101561159
Both llama3 models on top are turbo. Top one is nearly 6x the size.
>>
>>101561159
That's 3.0 70B.
>>
>>101561159
"regular 70b instruct" is 3.0, retard-kun
>>
>>101561168
>It actually attempts to infer Nala's overall mood rather than "sex = horny lol"
That's a big one.

>I would say that this is an inference far beyond even what Mistral Large is capable of. 'muh heckin' bencherinos' be damned
Makes sense. Then again, fucking a lion is not exactly coding so, domains and all that jazz.
>>
>>101561227
>>101561228
Yea I noticed that a bit after. The "turbo" threw me off though. Thought it was a finetune
>>
>>101558800
Mistral Large 2? What quant?
>>
File: 1694573334755054.png (51 KB, 1742x214)
>>101561079
IQ3_M, does FA support I-quant? 46ms/t prompt processing was rather slow
>>
>>101561083
Oh god, when will bitnet 1.58 models arrive, my cock gets hard thinking about them.
>>
>Large is still downloading because HF is just slow for me for some reason
aaaaaaaaaaaaaaaaaaaaaaaa
>>
>>101561247
Never mind I was retarded and I offloaded 88 instead of 89 layers, 4.5t/s tg, but pp is still as slow.
>>
>>101561241
nah, it's >>101558895
>>
>>101561304
by the time it finishes cohere will release their 34b that shits on large, sorry not sorry.
>>
>>101561343
Heres hoping so. Then L3.2 / L4 a few months later would be nice.
>>
>>101561333
Oh, nice. I can't wait for the next SPPO now that we have models with more than 8k context. Then I'll finally be willing to download and test one.
>>
>>101560547
But does it surpass Claude?
>>
>>101561351
Also next llama is supposed to be multimodal. But not for EU.
>>
>>101560219
>except their tokenizer which fucking sucks ass
>[character's name - 3 tokens][individual apostrophe - 1 token][s - 1 token]
AAAAAAAAAAAAHH
>>
verdict on large-migu?
>>
>>101561117
Is Gemma still the king for 8GB VRAMlets? Llama 3.1 seemed like a disappointment.
>>
>TFW fell for the 64GiB meme
bros...
>>
tfw fell for the 128gb of ddr4 meme
>>
Q3 (3.8bpw) Mistral Large is pretty good. It's slow though. But it generates some pretty good output, far more engaging in ERP.

I had been using Nemo Mistral and it also was pretty good.
>>
>>101561587
just use 3.5 sonnet thougheverbeit?
>>
>>101561168
what prompt did you use?
>>
>>101561588
post the download link
>>
>>101561588
thats crazy man pass
>>
>>101560900
It being able to compete with llama-1-65B is nice I guess, but is it actually GOOD? If it's not FUN, if it's not GOOD, then why bother?
>>
>>101561588
sonnet will never be MY sonnet
>>
>>101561639
Models aren't yours either - you don't train llama 3.1 or mistral large, you just get the compiled version, they're basically proprietary.
>>
File: file.png (28 KB, 805x292)
I've added Opus to the VNTL leaderboard (thanks to the proxy anon). It seems to be pretty much on the same level as 3.5 Sonnet and 4o.

I guess 0.74 is pretty much the best score LLMs can get in the current benchmark, so I will have to explore ways to improve the scoring. My best bet right now is training another LLM to give a score to the quality of the translation when compared to the reference translation. Probably I will either train a reward model or an instruct model that gives a score like FLAME or Prometheus.
>>
>>101561641
watch me put my llama into a pendrive and shove it up my ass, good luck trying to take it away from me
>>
File: 1714913169691319.jpg (109 KB, 796x796)
>>101561641
>tfw didn't assemble the sofa in my living room, therefore it doesn't belong to me
>>
File: 6754854683209687.png (40 KB, 349x344)
>>101561663
>another mememark
Who?
What?
Better yet, who honestly asked?
>>
>>101561663
Yeah, I checked the dataset and I think LLMs often make better translations than the ones you have in the dataset.
>>
>>101561688
it's a mememark for jp-en translation, afaik there isn't another one.

>>101561690
That's another issue, human translations are often not literal, for example.
>>
what sampler settings do you guys use for nemo and gemma2?
>>
>>101561663
>cosine similarity and letter combinations
I don't really like this evaluation method but I don't know any other way either.
I wish we had a Translation Arena or something.
>>
>>101561724
>That's another issue, human translations are often not literal, for example.
Models also can do non-literal translations if you ask them through the system prompt, and most importantly, give examples.
>>101561663
At least for 3.5 Sonnet, can you try rewriting your prompt to specifically use XML to separate everything, and don't feed it separate user/assistant pairs but instead show all examples in the system prompt?
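Something shaped roughly like this is what I mean (the tag names and example lines are made up, not taken from the benchmark data; just showing the layout):
[code]
# All few-shot examples live in the system prompt, separated with XML tags,
# instead of being fed as separate user/assistant pairs.
examples = [
    ("こんにちは、先生。", "Hello, sensei."),
    ("今日はいい天気ですね。", "Nice weather today, isn't it?"),
]

system_prompt = "Translate Japanese visual novel lines into natural English.\n<examples>\n"
for ja, en in examples:
    system_prompt += f"<example>\n<ja>{ja}</ja>\n<en>{en}</en>\n</example>\n"
system_prompt += "</examples>"

user_message = "<ja>よろしくお願いします。</ja>"   # the single line you actually want translated
print(system_prompt + "\n\n" + user_message)
[/code]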
>>
>>101561724
>its actually a "whats best at translation" mememark
I revoke my previous statement entirely, I fuckin asked.
Hows it going/looking?
>>
>>101560219
Settings for largestral?
>>
>>101561730
Or just tell me the minimal way to get started with https://github.com/lmg-anon/vntl-benchmark to benchmark a single model, I'll do it from there
>>
>>101561638
It's going up now. If it isn't bashed I'll quant the biggest below q8 which will still fit in 24GB. I'm going to also see if there's mpt support in exl2 - I don't think there is though.
>>
>>101560219
Yeah, as a prosefag I was pleasantly surprised from the get-go having used L3 storywriter and CR+. I don't sniff the typical mistral overbaking here. That was just IQ3. I'm tempted to upgrade to 96GB now.
>>
File: large2.png (654 KB, 1740x1180)
>>101561822
Seems to lose to L3.1-70B overall but it writes better. Shame (talking about L3.1).
>>
I'm having fun piping shit around and into TTS. Finally, I can have my computer look at my TODO list and femdom me for not going through it.
>>
>>101561729
Yeah, Translation Arena would be perfect for this, and that's pretty much what I plan to achieve with the custom reward model, although it likely wouldn't be as good as human evaluators.

>>101561730
>>101561741
>Models also can do non-literal translations if you ask them through the system prompt
That may be true, but when you enter into the non-literal territory, it's likely that multiple interpretations could arise, so it isn't like the LLM would always come up with the same interpretation as the human who wrote the reference translation.

>Or just tell me the minimal way to get started with https://github.com/lmg-anon/vntl-benchmark to benchmark a single model, I'll do it from there
I can try, but the minimal way should be:
>1. Create a virtual environment (optional) and install dependencies
>2. Download the datasets: https://litter.catbox.moe/st2kbi.txt, https://litter.catbox.moe/bdn16s.jsonl
>3. Rename config.example.yml to config.yml
>4. Create a new file in the "configs" folder with the name "org@model#quant.yml" or "org@model.yml" if it's a cloud model. For >5. examples, see the other files.
>6. Run "python runner.py --model org@model#quant --results-path ./results --dataset-path st2kbi.txt" and then "python runner.py --model org@model#quant --results-path ./results_mashiro --dataset-path bdn16s.jsonl"
If everything went right, you will have the results.

>>101561733
The leaderboard is already up here: https://huggingface.co/datasets/lmg-anon/vntl-leaderboard
IMO LLMs are better at translating than they were a few years ago, but nothing that will replace human translators any time soon.
>>
i am enjoying mistral large quite a bit for ERP so far, i am sorry i disrespected your game french people… thought you guys would be useless because you strike every other tuesday but clearly i was wrong
>>
>>101561954
>mistral large quite a bit for ERP
for some reason my brain merged this chunk and I started reading it as "mistral larp"
>>
>>101561974
Shitty brain tokenizers
>>
>>101561954
Which quant are you using?
>>
File: Fuuuuuu 7.png (325 KB, 1000x1033)
Can any anons point me towards some current info on how to play with oobabooga settings so that it does what ollama does out of the box? I'm loading stupid-big models on my toaster (4090/128gig ram/i9) and ollama does a convincing impersonation of the sloth from Zootopia, but the thing is, it still works... it pushes about 70+ gigs into system ram and maxes out the CPU as it grinds along, but it's functional for my purposes. I don't care about speed. Meanwhile, I have not been able to get the same sized 70b and higher models running in ooba at all and feel like a monkey trying to launch a rocket. I've run quants/gguf/exl2's fully loaded into vram with ooba, but I'm lacking the detailed knowledge of the settings to go further.

Pic unrelated
>>
So, I'm very familiar with using off-the-shelf models for text LLMs, and configuring them how I want for toy projects.
But what the fuck do I do for text-to-image models? HF's diffusion shit seems to be the same bloat as transformers where it insists on using their pipeline shit, and I don't want that.
>>
>>101562028
Go to /sdg/ for that
>>
>>101561918
>For examples, see the other files.
One detail I forgot to mention: you can define the backend configs in the "config.yml" (in the "custom_backends" part), and also in the specific file in the configs folder (in the "backends" part), the later takes precedence.
>>
>>101562057
That's like going to /aicg/ for local help you retard.
>>
File: file.png (2 KB, 216x72)
>>101561168
>>
So when downloading models with both "model" and "consolidated" safetensors files, I should avoid "consolidated", right?
>>
File: ezalor.jpg (28 KB, 400x400)
vramlet here, switched from kobold to llamacpp and from L3 8B to Mistral Nemo and I've seen the light. L3.1 is okay but Nemo did catch me off guard with some wild shit

although shivers and spines break my bones
>>
File: 1648264080493.jpg (34 KB, 540x586)
>>101561158
oh my god this girl is an absolute unsalvageable mess and a piece of shit human being

she's the type of girl i deserve :^) thanks for sharing bud. mini magnum seems to be soaring with this card at 512 max tokens per response.
>>
after successfully making chat model work im considering marrying my gpu. maybe fucking it too, im pretty sure it doesn't have big enough holes but we'll figure something out
>>
>>101562028
>But what the fuck do I do for text-to-image models?
learn to use comfyUI. unironically ask the anons on the /degen/ thread on /b/. they stay up to date with the latest in imagegen since they fap to it unlike the SFW weirdos in /sdg/
>>
>>101562258
*Barely above a whisper, I will share your sentiment.*
Nemo is better than the l3 Stheno finetunes I feel
>>
>>101562363
How far off do you think we are from getting tools like your webm but locally? Is it honestly a pipe dream at this point? Worrying how imagegen COMPLETELY stagnated this year besides Pony.
>>
>>101562403
>How far off do you think we are from getting tools like your webm but locally?
in terms of tech, we're unironically almost there locally. the bottleneck will be hardware. you need an h100 running for 5 minutes to get 5 seconds of a 720p video
so maybe 4-6 years away if you're waiting for h100-level compute to cost 1000 dollars. possibly as little as 8 months away if you can paypig the compute. it is unironically entirely possible that AGI happens before we get infinite videos of cute girls locally
>>
File: glep moan.png (14 KB, 120x126)
>>101562436
>it is unironically entirely possible that AGI happens before we get infinite videos of cute girls locally

thank you for getting my hopes up and renewed for the first time in 7 months.

by the way what's your overall theory surrounding that guess?
>>
>>101562258
What settings/format are good for it?
>>
>>101562403
>imagegen COMPLETELY stagnated this year besides Pony
what's actually worrying is how cheap it is to train imagegen. i posted the arxiv link before but you can train an almost SOTA model for ~2000 dollars with just 30 million images. I just saw on the orange website that salesforce released a multimodal dataset that I'm sure is more than adequate for training an imagegen model.
The thing about open source imagegen is even though its so cheap, there's no path to profitability because you can just use dall-e for free and its almost certainly better. We need a benevolent multimillionaire autist and a bunch of cypherpunk pedos on payroll before we see any good improvements in imagegen (in the West, China will do it just to try and dab on OpenAI which hopefully brings us salvation)

>>101562454
>what's your overall theory surrounding that guess?
the fact that Nvidia needs to justify their valuation, so their cards will stay expensive so only big corps will have access to the cutting edge, and the fact that both OpenAI and Meta are trying to achieve AGI and are NOT trying to achieve infinite cute girls (infinite cute girls locally would actually hurt Zuck's engagement metrics on instagram)
>>
>>101562498
Next pony is apparently about to start training soon with a much bigger and completely re / better tagged dataset.
>>
>>101562498
>what's actually worrying is how cheap it is to train imagegen. i posted the arxiv link before but you can train an almost SOTA model for ~2000 dollars with just 30 million images
Why does no one do it with high quality captions generated by LLMs then? It'll cost more, but will still be way below $50k, for example. Actually, I had this idea, is it viable or not? Basically, for anime models, instead of doing a single prompt, we do two:
The first one is pure Danbooru tags and the tokenizer is specifically one that just maps danbooru tags (like Anifusion did it - https://medium.com/@enryu9000/anifusion-diffusion-models-for-anime-pictures-138cf1af2cbe).
The second one is a natural language detailed description that specifies positions, relations, etc, to give actual geometry and stuff.
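To be clear, the "tokenizer" I have in mind for the first prompt is nothing clever, basically just a fixed vocabulary with one id per danbooru tag, something like this toy sketch (the vocab here is obviously made up and tiny):
[code]
# Toy tag tokenizer: one id per known tag, everything else maps to UNK.
# A real one would be built from the full danbooru tag list.
TAG_VOCAB = {tag: i for i, tag in enumerate(
    ["1girl", "solo", "long_hair", "twintails", "blue_eyes", "smile", "outdoors"]
)}
UNK = len(TAG_VOCAB)

def encode_tags(tag_string: str) -> list[int]:
    tags = [t.strip() for t in tag_string.split(",") if t.strip()]
    return [TAG_VOCAB.get(t, UNK) for t in tags]

print(encode_tags("1girl, twintails, blue_eyes, holding_umbrella"))
# -> [0, 3, 4, 7]  (the unknown tag falls back to UNK)
[/code]
The natural-language description would go through a normal text encoder on top of that.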
>>
>>101562512
>Why does no one do it with high quality captions generated by LLMs then?
Next pony is.
>>
>>101561726
seconding this question
>>
>>101562509
neat. i hope its good and it doesn't take too long for cunny loras to be made for it

but that also just reminded me another reason why open source imagegen will struggle (in the West): copyright. pony had to nuke all the artist tags (even if it secretly added them back in with those 3 letter codes) while chinese models are literally encouraged by the CCP to ignore copyright to dab on the burgers

i'm going to reiterate that hardware accessibility is the biggest roadblock for progress in general. SD 1.5 is still more popular and has a more mature ecosystem than SDXL/Pony simply because third world VRAMlets can actually run the model. once 24GB of vram is average in ~4 years things will hopefully start ramping up quickly
>>
>>101562498
I was baffled my first time training a LoRA on a GTX 1080, just using fewer than 20 images of a particular celebrity over SD1.5; the results came out perfect. Still can't believe it. Training LoRAs is shockingly easy so i see why its so common, we really do need someone to just say "fuck it" and give us that next big step base model...
I really really hope we're not just eternally fucked on GPU's, Jewvidia loves money, i can't fathom why they'd want to cut away a market that doesn't even lose them money.
>>
>>101562512
>Anifusion
>Besides these changes, the training process is standard for diffusion models. The model itself is roughly 2x smaller than SD (we chose hyperparams before SD was published), making it runnable with less VRAM. We trained the model for a total of 40 GPU-days (of RTX 3090), making it roughly 200 times cheaper than Stable Diffusion. However, sample quality after 7 GPU-days was already decent.

damn that guy did this in 2022 all by himself and he started even before the SD release
>>
>>101562512
>Why does no one do it
because they won't make money off of it so who cares. if you're not getting academia clout or dabbing on America as a chinese you won't get the money to test out these theories

>>101562541
>I really really hope we're not just eternally fucked on GPU's
Nvidia is literally worth too much for competitors to not try and make their own chips. If this doesn't happen then capitalism has fundamentally failed. The future of compute will be IPUs/NPUs instead of GPUs anyways
>>
>>101562367
I was using Stheno, Poppy Porpoise and even Gemma 2 27b at one point but I like Nemo, I'll be sticking with it for now until something better and faster comes along

>>101562471
I just threw everything named "Mistral" in SillyTavern and it worked, llama-server I just passed
 -ngl 30 -c 65536 

for 30 layers offload and 64k context instead of 128 which I will never need personally
>>
>>101561911
>Now, as for your pathetic todo list... Let me summarize it for you. You have three tasks: take appointments for car maintenance, check your savings account, and clean your room - which is currently a mess because, let's face it, you're a slob. And as for emails? Ha! You have zero. What a shockingly low number. I'm sure it's not because nobody wants to communicate with someone as useless as you.
i piped this into a TTS program that was trained on glados and it's actually making me horny, help
>>
>>101562564
>GladOS
nice. I trained one on Princess Peach and it gives me instant diamond boners. I've had her say every variation of TND, many of my LLM logs, some random quotes.. Man this tech is insane.
>>
all local models are just forever opium dreams designed to keep you happy and dumb, cut off from the world
SEEK GOD
>>
>>101562604
>>101562564
How do I learn this power? And does it work with ST?
>>
>>101562086
Yes. I did mention in the post that it took nearly an hour. I also did a pull on a different card that was much better written/formatted, continuing a chat I had going and it was pretty damn good. In an alternate timeline where VRAM was plentiful it would 100% be my daily driver over Mistral Large.
>>
>>101562620
xtts2, its literally as easy as dropping a .wav sample in a folder, selecting it in the UI, and bob's yer uncle.
>and im not even using the best enhancement options out there (mostly because im retarded and cant figure it out)
https://voca.ro/12EsObQlgQvy
>>
bitnetbwos...
>>
>>101562643
arent there lots of xtts2 ui versions? which one do I need? and what kind of GPU? will rtx 3060 work?
>>
>>101562620
i basically just output the result to a file and run piper TTS on it. if you talk to your LLM from a shell script, you can store real values in a variable and tell it stuff like "Tell me how many emails I have. The number is $unreademails" or you can even store your plaintext TODO file into a variable and tell it to read that. it hasn't hallucinated once so far and even seems to identify the tasks i've marked as done. you can get some powerful results out of combining this stuff with unix shell scripting.
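the gist of the plumbing, translated to python instead of pure shell (llama-cli is just one way to ask the model; the piper flags are the documented ones but double check them against your install):
[code]
import os
import subprocess

# Feed the model real values so it can't make the numbers up.
todo = open(os.path.expanduser("~/todo.txt")).read()
unread = "0"   # stand-in; in the real script this comes from the mail client

prompt = (
    "Scold me about my unfinished tasks. "
    f"I have {unread} unread emails. My TODO list is:\n{todo}"
)

# Ask the LLM however you normally do; llama-cli shown here as one option.
reply = subprocess.run(
    ["llama-cli", "-m", "model.gguf", "-p", prompt, "-n", "256"],
    capture_output=True, text=True,
).stdout

# Pipe the reply into piper and play it.
subprocess.run(
    ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", "nag.wav"],
    input=reply, text=True,
)
subprocess.run(["aplay", "nag.wav"])
[/code]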
>>
How much vram is needed for the full 128k context with nemo?
>>
>>101562672
pretty sure its this one https://github.com/daswer123/xtts-webui
its been a while since i started using it
>3060
lmao im using a 1080 and even really really long prompts never take longer than like 20 seconds.
>>
I'm using the mistral settings in sillytavern for nemo, but it fails to end generated messages at appropriate times and just veers off into weird cryptic shit
>>
>i jokingly promised Holly we'd do ANAL in 6 months of getting to know each other
i'm a horrible person, and mini-magnum is actually pretty good. Very interested in how a higher quant version would stack up.
>>
>>101562735
I don't have that problem with the Mistral preset. I'm using the latest staging version, perhaps there's been a fix for it
>>
File: file.png (7 KB, 466x61)
slightly late but kcpp with nemo support out
>>
>>101561954
After a few cooms I get bored and normal RP is painfully incoherent.
It doesn't understand the concept of clothes, that you can't talk to someone without a phone if he is nowhere near you, etc.
These small things are killing me.
>>
>>101561854
Why is llama3.1 70B so bad at coding?
>>
>>101562691
With an 8.0bpw exl2 with q8 kv cache and context 128000, Windows has a total VRAM usage of 23.4 GB.
>>
>>101562847
That's with the model loaded too? What's it with no context so I can subtract and find out how much the context itself uses?
>>
>>101562853
Yes that is the VRAM usage with the model loaded with tabbyAPI/exllamav2 0.1.7.

New measurement:
tabbyAPI not running: 0.9 GB GPU memory (0.8 dedicated / 0.1 shared)
running with 128000 context: 23.6 GB (23.4 GB dedicated / 0.2 shared)
running with 64000 context: 18.0 GB (17.9 GB dedicated / 0.2 shared)
running with 32000 context: 15.2 GB (15.0 GB dedicated / 0.2 shared)
running with 8192 context: 13.3 GB (13.1 GB dedicated / 0.2 shared)
running with 256 context, lowest allowed: 12.4 GB (12.3 GB / 0.2 shared, rounding didn't work out nicely)

This again is at 8.0bpw with cache_mode Q8.
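Back-of-the-envelope from those numbers: subtract the 256-context baseline and the cache works out to roughly 90 KB per token at the bigger sizes, which is in the right ballpark for what you'd predict from Nemo's shape, assuming I have the config right (40 layers, 8 KV heads, head dim 128, ~1 byte per value at Q8):
[code]
# Deltas from the measurements above (GB = what Windows reports).
measurements = {256: 12.4, 8192: 13.3, 32000: 15.2, 64000: 18.0, 128000: 23.6}
base = measurements[256]
for ctx, gb in measurements.items():
    if ctx == 256:
        continue
    per_tok = (gb - base) * 1e9 / (ctx - 256)
    print(f"{ctx:>6} ctx: ~{per_tok / 1024:.0f} KB of cache per token")

# Rough cross-check from the (assumed) architecture: 2 tensors (K and V)
# * 40 layers * 8 KV heads * 128 head dim * ~1 byte per value at Q8.
print(f"predicted: ~{2 * 40 * 8 * 128 / 1024:.0f} KB per token, plus scales and other overhead")
[/code]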
>>
>>101562615
in what way are local models not also the world?
>the world is what i tell you it is
>>
>the MoE meme is dead
thank god
>>
>>101562615
>SEEK GOD
Agreed. True Miku must be achieved. Local Miku General - A general dedicated to the discussion and development of Local Mikus.
>>
>INFO: Metrics: 456 tokens generated in 97.26 seconds (Queue: 0.0 s, Process: 19 cached tokens and 21406 new tokens at 516.17 T/s, Generate: 8.17 T/s, Context: 21425 tokens)
isn't mistral-large significantly larger than cr+? this speed is very similar. granted i'm using 4.5bpw mistral-large and 6bpw cr+ but i would still have expected mistral-large to be a fair amount slower.
after using nemo for a few days going back to 8t/s is rough, though...
>>
>>101561822
>I was pleasantly surprised from the get-go having used L3 storywriter and CR+.
That's weird, because all the outputs posted here were incredibly slopped to the point of cringe. I return to my storywriter outputs and they're a breath of fresh air in comparison, even though they lack any common sense, spatial awareness or story direction.
>>
>mistral small
do you guys think it's 15B or something
>>
>>101563080
Mistral Large 2 is 123B. CR+ is 105B.
>>
>>101560232
nigga just distill the logits to 70b and 34b sizes respectively. there has been a trend amongst the AI labs to distill the gains from the big models either via logits or outputs.
>>
>>101560813
I think it makes more sense to look at Mean Δp and RMS Δp but LLaMA 3.1 does better than LLaMA 3 in those as well.

>>101560901
What I would assume is happening is that for the 3.1 training they changed the hyperparameters in such a way that the numerical values in the weights/activations end up being more even.
llama.cpp quantization essentially uses the same exponent for 16/32/256 values (instead of 1 exponent for 1 value with floats) so the numerical precision is better if the values all have roughly equal absolute values.

Do make sure though that you are using the same code for these calculations.
The default matrix multiplication method in llama.cpp was recently changed from cuBLAS to MMQ and the results will be slightly different between the two.
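For anyone who hasn't looked at the quant code, a stripped-down Q8_0-style round trip shows the shared-scale-per-block idea and why outliers hurt (the real llama.cpp blocks, k-quants especially, are more involved than this):
[code]
import numpy as np

def q8_0_roundtrip(x, block_size=32):
    """One shared scale per block, int8 for the values themselves."""
    x = x.reshape(-1, block_size)
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0          # shared per-block scale
    q = np.clip(np.round(x / np.where(scale == 0, 1, scale)), -127, 127)
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
even = rng.normal(size=4096)     # values with roughly equal magnitudes
spiky = even.copy()
spiky[::32] *= 50                # one outlier per block inflates the shared scale

for name, w in (("even", even), ("spiky", spiky)):
    err = np.sqrt(np.mean((w - q8_0_roundtrip(w)) ** 2)) / np.sqrt(np.mean(w ** 2))
    print(f"{name:>5}: relative RMS error {err:.4%}")
[/code]
The "even" case keeps a much lower relative error, which is the effect I mean about 3.1 possibly having more evenly distributed values.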
>>
>>101562512
my guess is that besides a few small teams and large organizations, no one really has the proper setup for the training pipeline and the capital. sure if you're bankrolled by some rich guy you can do it but you'll need to find how to make money off that too - and anime & hentai media have retarded copyright laws depending on the region, so who'd want to risk getting sued by some jap company or worse, Sony?
>>
>>101560957
Noted but I don't think this will end up being useful for my purposes.
>>
3.1 ggufs might be back thanks to ollama guy
>Nice. Tested doing a bunch of summaries using up the entire 128k context and the output looks good whereas on master it outputs broken garbage.
https://github.com/ggerganov/llama.cpp/pull/8676/#issuecomment-2249653389
>>
Why no llama between 8 and 70b?
>>
prompt processing is the bane of my existence on these large-context models
may as well just dump all 6000 tokens of lorebook into the context from the start rather than trying to use them normally
>>
>>101563229
Because fuck you. There are only two use cases Meta cares about: evaluation and corporate usage.
>>
>>101563274
Yeah...

llama_print_timings: load time = 95.89 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 108037.51 ms / 65533 tokens ( 1.65 ms per token, 606.58 tokens per second)
llama_print_timings: eval time = 3127.70 ms / 59 runs ( 53.01 ms per token, 18.86 tokens per second)
llama_print_timings: total time = 145147.80 ms / 65592 tokens

CtxLimit:131072/131072, Amt:60/60, Process:107.12s (1.6ms/T = 611.79T/s), Generate:4.57s (76.1ms/T = 13.14T/s), Total:111.68s (0.54T/s)
>>
>>101560260
Powerinfer2 works partly thanks to MoE. There are probably even better models for local, but mixtral is the best we have for now.

Dense models only make some sense for batch processing, not local. When you are working on 128+ conversations at a time, one of them will always need one of the parameters so using/predicting activation sparsity has limited use (will use a fraction of the FLOPs, but still needs the same memory bandwidth).

The future of local is predicted sparse activation.
>>
>>101563289
It also slows down the higher it gets. Pretty speedy up to 64k then it crawls the last few k to 131k.
>>
Think they plan on releasing a mistral medium 70b?
>>
>>101563397
You mean a tune of Llama 3 70B like they made of Llama 2 65B? I assume not.
>>
File: 1717392122186284.jpg (153 KB, 1280x1039)
openai won.
>>
>>101563440
They did a tune of llama2-70b. There was no llama2-65b, only a llama1-65b.
>>
so 405b > mistral large for coding for sure. even had instances where 405b came up with something neater than sonnet 3.5
personality/rp wise, I prefer mistral large, 405b is just a tad more dull but it seems to depend quite a bit on the persona you are going for. very SFW stuff it does quite well
and mistrals pricing is a bit delusional given 405b is 1/3rd as expensive on openrouter
so 405b is quite a good release in my books, and either way we eating good this week
>>
post your blackpill and whitepill
blackpill: there would be such a gap in performance and ability between >400b models and small distilled models that companies using the big models will always be ahead of the competition and big AI labs will gatekeep the API access for these big ones
whitepill: there's a huge amount of AI research (most of which are still underdiscussed) that a few breakthroughs here and there from research teams or anons will make the small models capable enough that small-medium companies can just tune it on their own use cases and be good enough for 90% of the time
graypill: most of the code and text data that will be written in 5-7 years will be from AIs and we're about to witness the equivalent of 1980s-2010s chinese product quality but for software and 200x worse
>>
>>101563535
As a whitepill I'd add that we're probably on the cusp of widespread adoption of simultaneous translation within a few years. That's a very big deal, especially in countries where the vast majority of the population don't speak english (or not very well). Communication is going to get a lot more efficient.
>>
File: file.png (56 KB, 979x512)
Does Mistral Large not work in Koboldcpp yet, or is my download broken?
The changelog for the latest version says that Nemo is supported now, but Large might be different enough again to not work?
>>
>>101563784
>he downloaded from 'rancher
>>
>>101563784
just to make sure, you do have the two parts, part1of2 and part2of2 in the same folder, yes?
>>
File: file.png (53 KB, 828x460)
What are good nemo presets?
It's a bit too sovlful for me atm
>>
File: file.png (9 KB, 816x47)
>>101563818
Yes, and the file sizes look plausible
>>
>>101563832
you need to concatenate with copy /b or something
the retard you downloaded from doesn't split them with gguf-split but manually for whatever reason
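same idea in python if you'd rather not touch cmd (filenames are placeholders; only the part1of2/part2of2 suffixes matter, and the parts have to go in order):
[code]
import shutil

# Equivalent of `copy /b part1 + part2 whole.gguf`: these are raw byte splits,
# not gguf-split shards, so they just get stitched back together in order.
parts = [
    "mistral-large.Q3_K.gguf.part1of2",   # placeholder names, use your actual files
    "mistral-large.Q3_K.gguf.part2of2",
]

with open("mistral-large.Q3_K.gguf", "wb") as out:
    for part in parts:
        with open(part, "rb") as f:
            shutil.copyfileobj(f, out)
[/code]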
>>
>>101563842
that worked, thanks
>>
mistral large same prompt format as 8x22 right?
>>
>>101563535
Open models are all made by cowards using the most traditional, known-to-work transformer architectures and training methods.

GPT-2 + SwiGLU + RoPE + minor tweaks => every fucking open model except Mixtral. Open models are at a standstill because they are all sackless.
>>
>>101563821
For basic blah blah I just run neutralized samplers + 0.1 min p. Smooth sampling optional.
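for anyone wondering what 0.1 min p actually does: it drops every token whose probability is below 10% of the top token's, then renormalizes. Toy version (not the actual llama.cpp/kobold implementation):
[code]
import numpy as np

def min_p_sample(logits, min_p=0.1, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    keep = probs >= min_p * probs.max()   # threshold scales with the top token's probability
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = np.array([5.0, 4.5, 3.0, 1.0, -2.0])
print(min_p_sample(logits, min_p=0.1))
[/code]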
>>
So how much does it cost to make a LoRA for 405b?
>>
>>101564020
Hard to say. I don't think there's even a remote workstation big enough to rent and fit the entire thing in memory.
Totally unfeasible for the average person, even with a lot of spare cash.
>>
>>101564053
I have a small company.

Could I do it for 50k?
>>
I realized that the context template that was posted here for mistral nemo makes it go schizo
I switched back to the old mistral one
>>
>>101564059
for 50k it's doable. No idea why you would want to do it tho
>>
Pleb weights for mistral large dropping
https://huggingface.co/MaziyarPanahi/Mistral-Large-Instruct-2407-GGUF/tree/main
>>
File: iq1s.png (47 KB, 623x279)
>>101564120
Seeing picrel, I notice that not even IQ1_S (the smallest quantization) is fully ternary. How did llama.cpp developers determine what quantization levels to use for the various layers?
>>
What is it about this whole topic of open source? Big tech gives us the scraps from the table and we worship them in the hope that another crumb will soon fall from it - nothing about the thing itself will change - do I have to redefine open source for myself as a kind of saliva licker?
But le ecosystem? :>
>>
>>101564172

It's "open" if you have money. I'd rather pay the Nvidia tax than have Altman and his cronies have complete, unfettered access to the tech and bar us plebs from ever being able to ERP in peace.
>>
>>101562512
>why does no one invest in making porn more available with no benefit
gee i wonder
>>
>>101562518
The next pony is going to be so cucked you won't believe it.
>>
>>101562533
>pony had to nuke all the artist tags
The pony author certainly didn't "have" to, but did anyway. And the next pony model is almost certainly also going to be pruned of cunny-adjacent stuff as well, among other things.
>>
I have an extra $1000 unexpectedly, can I get a lot of vram for this money? Not buying a 3090, I either get 48gb+ vram or nothing
>>
>>101563535
>graypill: most of the code and text data that will be written in 5-7 years will be from AIs and we're about to witness the equivalent of 1980s-2010s chinese product quality but for software and 200x worse
thats the most pathetic cope from codeshitters yet
>>
>>101564120
Is IQ2_K usable? I really don't want to offload.
>>
>>101564169
I think I. Kawrakow did it by just manually testing which layers are most sensitive to a loss in precision and assuming that the combinations that work well for one model will generalize to all models.
>>
>>101564734
Looks promising on first impression for its size class.
>>
None of the mistral ST presets work for me. I gave up and changed to Alpaca and shit just werked. Yes I have immense skill issue
>>
>>101560019
is this done with an llm? if so, very good.
>>
>>101563449
When was lmsys last relevant? Even a 7b can answer 99% of the questions normies ask, and it just turns into a readability/presentation benchmark
>>
llama 3.1 mmproj when?
>>
>downloaded Mini-Magnum
>can't load the GGUF in ooba
Am I retarded or is it not supported yet? I've been out of the loop.
>>
someone asked about a bitesize model that could be packaged with a game the other day
what did they/you decide on
>>
>>101564680
You could buy a pair of 24GB P40 graphics cards for under $1k.
>>
>>101564734
In my opinion < 4 BPW is not worth the drop in quality.
>>
anything better than mistral for vramlets yet?
>>
>>101564943
>mistral
MIXtral*
>>
>>101564943
No, nothing, none whatsoever.
Mistral is great, amazing, spectacular.
>>
Is the PygmalionAI team fine tuning the LLAMA 3.1?
>>
>>101564943
Gemma 27b q4_k_m
>>
Large2 is slop
>>
>>101564969
Why would you care? Finetuning smart and engaging models has gotten too complex for two and a half people in their basement to create something worthwhile as of July 2024. That also applies to most other would-be LLM finetuners, btw.

Keep enjoying your Nemo/Gemma/Mistral Large/etc.
>>
>>101565037
>t. skillet
>>
>>101564996
Gemma 2 27B Q6_K with output and embed tensors quantized to Q8_0 can be fully loaded in 24GB. Now it just needs FlashAttention2 support to work for 8k tokens context...
>>
>>101563535
>most of the code and text data that will be written in 5-7 years will be from AIs and we're about to witness the equivalent of 1980s-2010s chinese product quality but for software and 200x worse
that doesnt even make sense.
where is software good?
OS all suck now.
Even the internet sucks. I swear I had better loading times with my 56k modem.
The pics were slow but the page was there instantly.
Everything seems to have been getting worse for a while now, actually.
>>
mini-magnum is pretty good for RP compared to the previous slop people dished out. And I was using L3-Euryale-2.1 before this (sold by 2nd 3090)
>>
>>101565056
That their latest experimental model (Magnum-72B) is based on Qwen2-***Instruct*** demonstrates exactly that. Finetuning base models into something usable and smart has become too cumbersome/expensive/risky. You can't just use chat logs anymore like in early 2023.
>>
File: Nero.jpg (18 KB, 1047x86)
>testing Nemo GGUF
>doesn't seem to be as good as people say it is
>ask it if it's really Nemo
huh?
>>
File: yakub.jpg (7 KB, 196x257)
>>101565105
>neroid empowered rational operator
>>
>>101565105
What the fuck?
>>
>>101565105
This looks like a typical jailbreak misfire, check your prompt retard
>>
>>101565105
jej
>>
>>101565105
[INST] and [/INST] are special tokens in Nemo, you don't need to add spaces around them.
>>
File: lecun_dontworkonllms.png (462 KB, 580x895)
>>101565037
>t. LeCun
>>
>>101565159
specifically wiped context memory for this little test
i had a first prompt with just "who are you" to which it answered "I am a large language model developed by Mistral in cooperation with NVIDIA"
but i wanted it to say Nemo, hence the second question
>>101565178
ok thanks
>>
>>101563087
They do slip in sloppy output at times, but the thing I like about it is its variety. Unlike previous untuned mistrals where it's all dry and clinical no matter what you do, swiping does get you somewhere else. I'll keep testing it and see.
>>
>>101563915
With a space after [INST] and [/INST], but not before the latter.
<s>[INST] system message

user message[/INST] assistant response</s>
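if you're building the string yourself rather than trusting a frontend preset, a helper that follows the spacing above (sketch, not official Mistral code):
[code]
def mistral_prompt(system: str, user: str, assistant: str | None = None) -> str:
    """Single exchange: space after [INST] and [/INST], none before [/INST]."""
    prompt = f"<s>[INST] {system}\n\n{user}[/INST]"
    if assistant is not None:
        prompt += f" {assistant}</s>"
    return prompt

print(repr(mistral_prompt("You are a helpful translator.", "Translate: bonjour")))
# '<s>[INST] You are a helpful translator.\n\nTranslate: bonjour[/INST]'
[/code]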
>>
>>101565244
I think he is 100% correct. I will reconsider when I get even one model that never mentions shivers when it sucks my cock. But I don't think that is gonna happen.
>>
What are the political implications of a perfectly predicted next token != actual fact / truth / solution to a problem?


