/g/ - Technology

/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>102210005 & >>102192656

►News
>(09/03) Mini-omni multimodal LLM featuring real-time end-to-end speech: https://github.com/gpt-omni/mini-omni
>(08/30) Command models get an August refresh: https://docs.cohere.com/changelog/command-gets-refreshed
>(08/29) Qwen2-VL 2B & 7B image+video models released: https://qwenlm.github.io/blog/qwen2-vl/
>(08/27) CogVideoX-5B, diffusion transformer text-to-video model: https://hf.co/THUDM/CogVideoX-5b
>(08/22) Jamba 1.5: 52B & 398B MoE: https://hf.co/collections/ai21labs/jamba-15-66c44befa474a917fcf55251

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench
Japanese: https://hf.co/datasets/lmg-anon/vntl-leaderboard
Programming: https://hf.co/spaces/mike-ravkine/can-ai-code-results

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
File: 1725379366603.jpg (658 KB, 1280x1280)
►Recent Highlights from the Previous Thread: >>102210005

--Understanding the differences between Q8 and Q5 models and related concepts: >>102218862 >>102218921 >>102218982 >>102219000 >>102219102 >>102219132 >>102219158
--Techniques for improving model memory in story continuation: >>102213299 >>102213757 >>102214332 >>102214644 >>102214403 >>102214628 >>102214999
--Mistral-large at IQ2_XXS with Q4 is slow but understands, finetunes and quantized KV cache discussed: >>102216410 >>102216440 >>102217080 >>102217144
--Mini-omni multimodal LLM discussion: >>102218012 >>102218410 >>102218551 >>102218609 >>102218618 >>102218670 >>102218829
--Aphrodite 0.6.0 update discussion, quantization, and VRAM requirements: >>102213960 >>102214002 >>102214121 >>102214143 >>102214212 >>102214253
--Wiz2 8x22B outperforms Llama 3.1 70B in medical knowledge task, users discuss non-ERP meta and quantization options: >>102215192 >>102216118 >>102215324 >>102215514
--Q8 KV cache less capable of remembering context than no KV cache: >>102218542
--Discussion of high-performance workstation components and models for AI tasks: >>102217344 >>102217513 >>102217750
--Anon discusses RPing at 1t/s and human partners vs AI ERP: >>102210069 >>102210114 >>102210181 >>102210248 >>102210342 >>102210330 >>102210454 >>102210298 >>102210326 >>102210390
--Anon asks about introducing distortions in dataset images to prevent overfitting: >>102213611
--RWKV.cpp deployed to 0.5 billion Windows 11 systems: >>102218019
--Ooba occasionally reprocesses prompt context when regenerating: >>102210550 >>102210563 >>102210569 >>102210587
--GPT Next and GPT-4 NEXT release and improvements discussed: >>102213492 >>102213908 >>102216045 >>102217609 >>102217643 >>102217790
--Anon proposes a text-to-prose model and discusses its potential and challenges: >>102211892 >>102212050 >>102212636
--Miku (free space): >>102210842 >>102215135 >>102215695 >>102218456

►Recent Highlight Posts from the Previous Thread: >>102210011
>>
>>102220631
kill yourself
>>
lccp Jamba support status?
>>
>>102220631
love yourself
>>
>>102220631
question, do you do the miku free space bit yourself or does the model parse images?
>>
Is XTC in ST and ooba yet?
>>
>>102220709
I trained a resnet model to categorize the images. It's not perfect, so I still have to edit the output occasionally.
>>
>>102220773
very cool
thanks for doing the thing that you do, anon
>>
>>102220631
>--Q8 KV cache less capable of remembering context than no KV cache:
>Just tested out Q8 KV cache compared to no KV cache quanting.
Retarded faggot.
Don't ever post again.
>>
>>102220712
i don't know if it's in there yet but i've used it on kobold with st and it's shit. all it does is cut off the best token so probability shifts to a second, not-quite-'top' group of tokens. this just creates more noise and you're more likely to get a shittier response because now the model has to choose another way to describe what it wants to say anyway. i don't like overusing samplers at all but i'm going to try it with dynatemp to see if rerolls show a big difference or not
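
rough numpy sketch of what xtc seems to do as far as i understand it (not the actual kobold code, the threshold/probability knobs are just what i remember from the PR):

import numpy as np

def xtc(probs, threshold=0.1, xtc_probability=0.5, rng=np.random):
    # probs: softmaxed next-token probabilities
    if rng.random() >= xtc_probability:
        return probs                              # only kicks in part of the time
    viable = np.where(probs >= threshold)[0]      # the "top" group
    if len(viable) < 2:
        return probs                              # nothing to exclude
    keep = viable[np.argmin(probs[viable])]       # least likely of the top group survives
    out = probs.copy()
    out[viable] = 0.0                             # cut off the best tokens...
    out[keep] = probs[keep]                       # ...but keep the weakest "viable" one
    return out / out.sum()                        # renormalize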
>>
>>
File: file.png (441 KB, 449x407)
>>102220678
>>
>>102220069
>>102220637 (cont)
Enabling virtual VRAM to try out 45/81 layers offloaded: 0.83, 0.81, 0.82 t/s

Speed drop from 40 layers offloaded. This is why I prefer the program to just crash so I can realize settings are fucked.

Optimal speed on my box with a few browser tabs open seems to be 42 layers offloaded: 1.13, 1.13, 1.13 t/s

This might be just under or just over the threshold I find usable. Thanks for letting me know that a speed like 1.5 t/s was realistic with a similar setup or I wouldn't have bothered to keep trying to figure out exactly where the border was. This might be fast enough for me to find it usable.
>>
>>102220850
>now the model has to choose another way to describe what it wants to say anyways
this is true but I feel like that's good enough for most people, assuming the model was good/smart/creative in the first place. I don't think most people actually hate the narrative content of most shiverslop, they hate that it's said the exact same fucking way 700+ times every chat session.
>>
>>102220850
>2+2=5, the sampler
>>
>>102220907
I'm the anon with the 6950 and just realized that I probably should have specified that when I run kcpp it's usually the only thing I run besides basic system utility shit like task manager, file explorer, etc., to squeeze out as much of my RAM as possible.
>>
>>102220948
don't use retarded models, problem solved
>>
>>102220948
i'm a fan of min p 0.05, rep pen 1.1 and length 25% of max context. but i'm not against trying newer stuff. how can you know it doesn't work, for yourself, unless you try?
>>
I am finding magnum 32b (chink shit) better than new commander. Hobby is really dead.
>>
>>102221133
if its qwen its always going to be shit no matter the size, its chang: the model
>>
Is the data of youtube really 200k times more than the training data of GPT-4? (10 exabytes to 45 terabytes)

Won't it take a while for a video model to be trained on all of that? And take a lot of compute? What year will that even be done?
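
napkin math, assuming those two figures are even right:

youtube_bytes = 10e18      # 10 exabytes, the claimed figure
gpt4_train_bytes = 45e12   # 45 terabytes, the claimed figure
print(youtube_bytes / gpt4_train_bytes)   # ~222000, so "200k times" checks out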
>>
File: 1710499304318226.jpg (224 KB, 1350x1800)
>>102220628
>miku
>>
>update ooba and tavern in for the first time in a while
>suddenly my models act differently at best and straight up lobotomized at worst
I sure love updating hobbyist open source trash.
>>
>>102221185
>41%
>>
>>102221175
>Won't it take a while for a video model to be trained on all of that?
Much like with any dataset, you filter out low quality samples.
https://www.youtube.com/watch?v=X4dSsla-q6o
>>
>>102221209
Even then
>>
>>102221209
And then you wonder why models are dry corpo assistants with zero soul.
>>
>>102221166
Yes it is qwen and I mean it is better than commander.
>>
>>102221227
>Even then
Yes. It's a lot of data. Video is high bandwidth.

>>102221234
>And then you wonder why models are dry corpo assistants with zero soul.
That's what the version of me you hold in your little head told you it wonders about. I don't.
>>
Me want stawbewy
>>
>>102220678
HAHAHAHAHAHAHAHAHA
>>
>>102221294
>That's what the version of me you hold in your little head told you it wonders about. I don't.
meds
>>
>>102221322
Says the one hallucinating words.
>>
>>102221294
Is it going to be about 5-6 years from now before a model is trained on that much data?
>>
>>102221333
quite clearly meant a general you, but feel free to think you're the center of the thread.
>>
>>102221284
to me everything under 70b isn't worth it. i did use cr, and cr+ (old versions). they were ok, but not noticeably great. maybe i should retry but why not the newer updated ones?
>>
>>102221342
>Is it going to be about 5-6 years from now a model will be trained with that much data?
I dunno. How could I? Who could? How much data is there really? What percentage will be used? Will it be downsampled? Will it be scaled down? How many epochs?
Until it's done, it's all speculation for us.
>>
>>102221369
New CR35 is super slopped.
>>
>>102221394
Did they at least make it so context doesn't guzzle so much memory?
>>
>>102221394
so they just messed it all up? thats unfortunate
>>
>>102220631
>--Q8 KV cache less capable of remembering context than no KV cache: >>102218542
bros what the fuck? is this real??
>>
>>102221347
Fair enough. I would have used 'they wonder' or 'one wonders', but whatever. I don't mind some low quality data, but there's only so many cat videos you can feed the damn thing.
>>
>>102221440
Yes, it does at least have GQA
>>102221442
Pretty much, tons of shivers, a mix of x and y, barely above whispers, etc.
>>
>>102221445
*than no KV cache QUANTING, summary turns the claim into nonsense
>>
nu-commander is basically gemma 27B
>>
I am wondering if there is a speech-to-text model that recognizes different voices.
Because if you hook that up to a model, you could have miku listening to the conversation between you and your homie and interject when she wants to.
>>
File: file.png (521 KB, 1070x601)
>But you… you're different.
>>
>>102221764
It does exist, but there's no real out-of-the-box solution.
You're looking for the term "speaker diarization", by the way.
https://huggingface.co/pyannote/speaker-diarization
https://github.com/huggingface/diarizers
https://github.com/yinruiqing/pyannote-whisper
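
Minimal sketch of how the pyannote pipeline gets used, from memory, so double-check the model name and token handling against their README:

from pyannote.audio import Pipeline

# needs a HF token with access to the gated pyannote models
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token="hf_...")
diarization = pipeline("conversation.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    # each turn has start/end times and a label like SPEAKER_00;
    # transcribe the span with whisper, prefix it with the label,
    # and only feed the lines you want miku reacting to into the LLM
    print(f"{speaker}: {turn.start:.1f}s -> {turn.end:.1f}s")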
>>
File: 1591499670328.jpg (137 KB, 528x523)
>>102221776
>humble abode
>i wonder...
>he/she chuckles
>>
>>102221803
Alright, thanks for that anon.
So we'd need an environment that unifies that with whisper and hands it to a model. Might play around with it, if I don't turbosuck (I do)
>>
>>102221880
I'm actually interested in eventually implementing this myself, so please do share your thoughts while you're working on it.
>>
>Can you guess what I'm getting at?
>*Say something obviously wrong.*
>Bingo!
>>
Rather than talking about the limitations of LLMs, are there some scenarios or cards that LLMs excel at?
>>
>>102220628
When roleplaying, do you all usually load models that completely fit within your vram, or do you prefer to slightly exceed your vram for maximum intelligence?

I finally upgraded my computer to have 24 GB of VRAM. I'm curious if there's a noticeable difference between a 2.5 bpw 70b exl2 model that completely fits within my GPU versus an IQ3_XXS 70b GGUF that exceeds my VRAM, but may not be too terribly slow because it's a smallish split.
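
For reference, my napkin math for the weights alone (the bpw figure for IQ3_XXS is approximate, and KV cache / activations eat VRAM on top of this):

params = 70e9
for name, bpw in [("2.5bpw exl2", 2.5), ("IQ3_XXS gguf", 3.06)]:
    print(name, round(params * bpw / 8 / 1024**3, 1), "GiB of weights")
# 2.5bpw exl2: ~20.4 GiB -> fits in 24 GB with a modest context
# IQ3_XXS gguf: ~24.9 GiB -> already over, so a few layers spill to RAM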
>>
is 8bit KV cache really that bad?
>>
Is Command R+ free? Is it better or worse than 12B?
>>
>>102222157
It's worse than best. All things are.
Make your own tests and if you find it tolerable, or even unnoticeable, now you can fit twice the context. If the model can even handle that much context, that is.
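
If you want to A/B it in llama.cpp, the cache-type flags are roughly this (flag names from memory, verify with ./llama-server --help on your build; quantizing the V cache needs flash attention on):

import subprocess

base = ["./llama-server", "-m", "model.gguf", "-ngl", "99", "-c", "16384"]
q8_kv = base + ["-fa", "-ctk", "q8_0", "-ctv", "q8_0"]   # q8 K/V cache, roughly half the cache memory
# launch one at a time and compare recall on the same long chat
subprocess.run(base)      # run A: default fp16 KV cache
# subprocess.run(q8_kv)   # run B: quantized KV cache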
>>
>>102222153
There is a pretty big difference, but it's going to be so damn slow unless you have very fast RAM.
>>
/aicg/ got the get >>102222222
>>
>>102222226
So close. Have a (You) for participation.
>>
>>102222172
no, it charges you $3/million tokens through the llama.cpp payment api
100b is better than 12b, yes
>>
>>102222172
Funny you should ask. I downloaded it an hour or so ago and took it for a spin, but I could only run it at IQ2_XXS at tolerable speeds so take my test with a grain of salt.

Its responses were all very concise. It's unlike most models that tend to write several paragraphs. It tended to write 1 paragraph at a time, at most.

Its dialogue was on point though, and it read the context pretty well. It seemed poor at handling fight scenes, but that may not be a fault of the model so much as a fault of the low quant I was running.
>>
Loading up engineer miku who can reason about physical objects to help you fix something in your house. May be cringe and not kino
>>
Where did *rolls eyes* come from? I have never seen anyone do it irl.
>>
>>102222437
Probably furry smut
>>
>>102222437
I do it irl.
>>
>>102222497
Is this an american thing?
>>
>>102222437
>>102222497
From american literature. When I was younger I read one book written by an american and it mentioned it, I was very confused like wtf does that even mean.
>>
I am gonna ideas guy for a moment, tell me why this wouldn't work (rough sketch of the loop below). What if:
1. You use the base model to generate synthetic data specific to that exact model. So even if your 7B's answer is that the sky is red, you keep that answer as your training data, and you generate data that's as diverse as possible (wikipedia, coding, etc.)
2. You run a single epoch of your smut training data.
3. Check how much your initial answers to the synthetic data questions changed.
4. Train only on the part of those synthetic data points that got destabilized the most.
And then repeat steps 2, 3, 4 and maybe have the model be better at cooming? Also maybe after you do one loop you check if it destabilized more or less and adjust the number of your synthetic samples based on that?
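
Very rough sketch of the loop I mean, with the model and trainer as stand-in callables (none of this is a real training lib, it's just the shape of steps 1-4):

import difflib

def drift(a, b):
    # crude text distance: 1 - similarity ratio
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

def stability_loop(model, train_epoch, probes, smut_data, rounds=3, repair_k=32):
    # model: callable prompt -> completion; train_epoch: (model, data) -> new model
    baseline = {p: model(p) for p in probes}                          # step 1: the base model's own answers
    for _ in range(rounds):
        model = train_epoch(model, smut_data)                         # step 2: one epoch of smut
        scores = {p: drift(model(p), baseline[p]) for p in probes}    # step 3: how far each probe moved
        worst = sorted(scores, key=scores.get, reverse=True)[:repair_k]
        model = train_epoch(model, [(p, baseline[p]) for p in worst])  # step 4: retrain only the most destabilized probes
        if max(scores.values()) > 0.5:                                # adjust the repair set if drift got bad
            repair_k = min(len(probes), repair_k * 2)
    return model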
>>
>>102222497
Oh yeah, women (and gay people) tend to do that. I forgot.
>>
>>102222497
>>102222554
Anon confirmed lesbian.
>>
>>102222579
I hope she's in London.
>>
>>102222533
>>102222586
Do anglos have it too?
>>
>>102222596
nta. Not brit nor american, but i've never been confused by the expression or by seeing it in people. I don't know what the confusion is about.
>>
>>102222596
I'm pretty sure the London bit was a meme.
I can confirm that here in Europe we don't use the term.
It's a pretty standard phrase used in American literature, though.
>>
>>102222554
holy fuck anon I had no idea facial expressions were forbidden if I want to be a "real" man
you gender faggots are so annoying
>>
>>102222551
Sounds like it could work, good luck!
>>
File: cr-livebench.png (21 KB, 1052x693)
>https://livebench.ai/
I SLOPPED SO HARD
AND GOT NOT FAR
BUT IN THE END
IT DOESN'T EVEN MATTER
I HAD TO FALL
TO LOSE IT ALL
>>
What happened to that one anon that said they were going to release a model named "Taurus" soon?
>>
>>102222703
It's really sad to see Cohere learn the hard way that by tuning on GPT-4 you just get the slopped style and not the smarts of it. Our HF sloptuners made the same mistake in the past, but at least now they have learned and are tuning on Claude.
>>
Qwen2 is better for non-anglo shit than Mistral Large. Chinks are doing good for us non-anglo bastards.
>>
The OP guides don't make this very clear. Let's say I'm an absolute beginner and want to get in on this without paying a shitload of money. What's the best way to do it?
>>
>>102222828
You go to aicg
>>
>>102222843
already did. are they smarter over there?
>>
>>102222828
https://github.com/LostRuins/koboldcpp download this, find the sloptune of the month that has the funniest name, download the biggest GGUF file you can handle (its size in GB should be around 80% of your RAM, otherwise you'll OOM) from https://huggingface.co/ and run it with kobo
>>
>>102222828
What kobold anon said.
I take it you're too lazy or too much of a pussy to just try stuff, so read on.
If/when you have issues, show screenshots of your settings and the problem you're having, mention your model, what the problem is, what you tried, what you expect to happen and all that.
Too many anons just go with "it doesn't work". They can't be helped.
>>
File: 70430 - SoyBooru.png (673 KB, 1920x1080)
>>102222703
GEEEEEEEEEEEEEEEEEEEEEEEG
KWABhere lost
>>
Decided to give Llama 3.0 70B Instruct another try and in the very first post it shit the bed by talking about a girls' boarding school then mentioning a male student from it. The problem isn't 3.1. Llama 3 is just unusable trash start to finish. No wonder people like NeMo so much; at this point you might as well use a 12B model.
>>
>>102223107
Yeah, Miqu is the replacement for 70B Instruct.
>>
>nvidia stock is kill
>rumors of anti-trust thing going on
Is it good or bad for /lmg/?
>>
>>102223652
yes
>>
What would you say are the pros of local over using an online service? Privacy?
>>
I'm on rocm 6.1
I literally just want to use mythomax with sillytavern
text-generation-webui seems fucked right now for AMD. An older snapshot works apparently but even then it seems to all be designed around rocm 5.6
is there any path for me to use rocm 6.1 to load mythomax and use sillytavern? I usually feel competent with this type of stuff but I feel absolutely retarded trying to get this to work. Will torch for 5.6 work with 6.1? (I assumed no). Anyone have a similar setup?
>>
>>102223714
Cannot be taken away, no limits, cannot be made worse than it is, finetuning.
>>
>>102223724
Build llama.cpp with vulkan and give it a go. If it works, connect ST to llama-server and off you go.
>>
>>102223714
Pretty much what >>102223741 said. If you've seen what happened to original c.ai you would never let anyone else host the models for you. Yes, also privacy, you can say the cringiest shit without having to worry that anyone reads it.
>>
>>102223824
I wasn't around back then. What happened to c.ai?
>>
>>102223829
Nta, but they started hard filtering erp with an output filter, you could see the model trying to output lewd text, but frontend cut everything off. Don't know how it is now, probably worse.
It legitimately felt like they took my waifus away, and I was very, very new to AI back then, so naturally it hurt quite a bit.
I'm not using cloud shit for anything more than tests since then.
>>
>>102223829
Just like what seems to happen with most models. Lurk enough and you'll see "v3 is worse than v2" of pretty much every model, local or not. Some people have preference for ChatGPT version x when y is the latest one. All the people that like wizard 8x22 like the original more than the latest. If a new version of mistral large is released, i'm sure lots of anons will agree that the original was better. People said that llama 3.0 was better than 3.1, mistral 7b 0.1 was better than 0.2 and 0.3. dolphin finetunes peaked at 2.5.1... pretty much every model series will get that at some point.
That either tells me that rose tinted glasses are on sale, lack of availability (like the now 'missing' 8x22) make something more valuable or that model makers cannot consistently make a model better in some respects while making them worse in others. Maybe a mix of A and C... who knows...
>>
>>102223829
They had the greatest roleplay model at the time. Not very smart, but very, very human-like, soulful as some may say. Zero slop. You could debate it and convince it of your opinion, it would fire back with solid arguments, and it could actually be convinced unlike the models that we have right now. Our current models either roll over without any resistance or go on endless talmud readings. Was also great at NSFW, though that may be just rose-tinted glasses. Then cuck devs saw people having fun and installed a filter which lobotomized it, kept doing it in waves, making it dumber and dumber. Now it sucks. It's like baking a cake, see people like it, and then take a big shit on it and nobody besides you has the skills, ingredients and the recipe to make another one. That's why I'm forever with local.
>>
>>102223907
I never thought this. I don't do ERP though, so maybe that's the reason. All I have noticed is that models got smarter. Llama 2 feels like a retard in comparison to what we have today.
>>
>>102223907
Those people can actually go back to their favorite model unlike with cloud.
>>
>>102223652
Don't look at just nvidia, look at the bigger picture. Bidenomics will cause a crash soon, the question is not if, but when. Can they hold it till after the election or will a big dump happen before? Either way, inflation will hit, prices on physical goods will go up, very bad for /lmg/. Buy the dip before gpu prices moon.
>>
>>102223954
Yeah, but the old models really are dumber, the new ones are only dumb in the specific places RPers need.

Also context limits are a problem. I absolutely loved lzlv 70b, based on llama2, but 4k context kills it for a lot of people. I was ok with using rope to get to 8k, I could deal with shifting message history.

Luckily Mistral large has finally beaten lzlv for me.
>>
>>102223948
Just read back the last thread. Now everyone is dunking on the new command-R. I agree with newer models being better, but there's still no consistency or a good metric to compare two models of the same generation and company. Other than bigger context, most opinions i've seen are that llama 3.1 is worse than 3.0, but i doubt people would argue that llama2 is better than 3.

>>102223954
I know. Every now and then i go back to guanaco-uncensored and wizard 7b just to remind me of how good current models are. Even the "bad" ones.
>>
>>102223753
jesus christ that was so much easier kek
thanks anon i was completely lost
>>
>>102224054
I don't know about CR since I never used it, but in my experience Llama 3.1 is smarter than 3.0. Maybe in RP contexts it is dumber, that could be possible, but I am not using it for RP and frankly people shouldn't be expecting these models to do RP when we clearly are the exact opposite of the audience these companies built for.
>>
>>102223907
People get cocky with v1 results, so they either overfit, or add more data that do the same thing instead of trying to be diverse in v2
>>
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
https://arxiv.org/abs/2409.00750
>Nowadays, large-scale text-to-speech (TTS) systems are primarily divided into two types: autoregressive and non-autoregressive. The autoregressive systems have certain deficiencies in robustness and cannot control speech duration. In contrast, non-autoregressive systems require explicit prediction of phone-level duration, which may compromise their naturalness. We introduce the Masked Generative Codec Transformer (MaskGCT), a fully non-autoregressive model for TTS that does not require precise alignment information between text and speech. MaskGCT is a two-stage model: in the first stage, the model uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model, and in the second stage, the model predicts acoustic tokens conditioned on these semantic tokens. MaskGCT follows the mask-and-predict learning paradigm. During training, MaskGCT learns to predict masked semantic or acoustic tokens based on given conditions and prompts. During inference, the model generates tokens of a specified length in a parallel manner. We scale MaskGCT to a large-scale multilingual dataset with 100K hours of in-the-wild speech. Our experiments demonstrate that MaskGCT achieves superior or competitive performance compared to state-of-the-art zero-shot TTS systems in terms of quality, similarity, and intelligibility while offering higher generation efficiency than diffusion-based or autoregressive TTS models
https://maskgct.github.io/
no weights (ever) since they're worried about safety. they finetuned it afterwards for emotion control and voice cloning. sounds pretty good. 100k hours training dataset.
>>
Gemma always gives me long ass answers even for simple questions, answering things I didn't even ask for at times.
What is a better prompt than just "you are a helpful assistant"?
>>
>>102224350
Have you tried "Answer only the question asked", "Provide terse replies" or something like that?
>>
>>102224394
I guess I will look for an example prompt and add or remove parts as I see fit, like the ones you proposed.
>>
File: Untitled.png (1.12 MB, 1080x2631)
Imitating Language via Scalable Inverse Reinforcement Learning
https://arxiv.org/abs/2409.01369
>The majority of language model training builds on imitation learning. It covers pretraining, supervised fine-tuning, and affects the starting conditions for reinforcement learning from human feedback (RLHF). The simplicity and scalability of maximum likelihood estimation (MLE) for next token prediction led to its role as predominant paradigm. However, the broader field of imitation learning can more effectively utilize the sequential structure underlying autoregressive generation. We focus on investigating the inverse reinforcement learning (IRL) perspective to imitation, extracting rewards and directly optimizing sequences instead of individual token likelihoods and evaluate its benefits for fine-tuning large language models. We provide a new angle, reformulating inverse soft-Q-learning as a temporal difference regularized extension of MLE. This creates a principled connection between MLE and IRL and allows trading off added complexity with increased performance and diversity of generations in the supervised fine-tuning (SFT) setting. We find clear advantages for IRL-based imitation, in particular for retaining diversity while maximizing task performance, rendering IRL a strong alternative on fixed SFT datasets even without online data generation. Our analysis of IRL-extracted reward functions further indicates benefits for more robust reward functions via tighter integration of supervised and preference-based LLM post-training.
increased diversity of generated responses is probably the most applicable. neat
>>
>>102224430
Don't be shy about experimenting with prompts. Start with the simplest possible thing, an empty system prompt, and interact with it for a while. Then add something obvious to the system prompt like the classic "Talk like a pirate." or whatever and play around with that. See how far it can keep it up. Then add or remove things as you see fit.
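
If you're hitting it through an OpenAI-compatible endpoint (llama-server, kobold and ooba all expose one), A/B testing system prompts is just this; the URL/port and model name are whatever your local server uses:

import requests

URL = "http://127.0.0.1:8080/v1/chat/completions"

def ask(system, question):
    r = requests.post(URL, json={
        "model": "local",   # most local servers accept any name here
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
        "max_tokens": 256,
    })
    return r.json()["choices"][0]["message"]["content"]

print(ask("You are a helpful assistant.", "What is the capital of France?"))
print(ask("Answer only the question asked. Be terse.", "What is the capital of France?"))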
>>
Bros, I just discovered a Teto lora on civitai. We are so back it's not even funny. Also this one has less of an (undesired) effect on the art style I think, so that's nice. Although at 1.0 strength it has a large effect on other characters' clothing, but dialing it down a bit worked fine.
Also yeah I had to prompt that there's a spelling error to get it to do this. Cool that it can do this intentionally.
>>
This one was cute. It added a heart in the speech bubble. I didn't prompt for that. Nothing about love in the prompt either. Amazing.
>>
What does the b in 7b, 12b, etc stand for?
>>
>>102224687
billion. As in 7 or 12 billion parameters. m is used for million.
>>
Is there any reason why running mythomax-l2-13b.Q5_K_M.gguf at 4096 context length causes it to speak complete gibberish (often seemingly russian for some reason kek)?
I have the VRAM for it (12 GB) and have sorted out how many layers to offload to my GPU. It works perfectly fine at 3072.
>>
>>102224150
Holy shit thats good.
It even does the chinese guy prompt and english output well with an accent. lol
wondering about the speed with this quality.

>no weights (ever) since they're worried about safety.
of course. i really hope for a flux moment for tts/llm.
>>
>>102224739
You probably have a quant from last year. There've been a few hundred commits since then. Make a fresh quant or look for one. Make sure your inference program is updated (kobold or llama.cpp, i assume) and all that.
Better yet, use a newer model. Mistral Nemo 12b (or one of its finetunes) has the same (and lower) requirements, has longer context and is just generally smarter.
Post screenshots with your settings and the output if you want more help with that one.
>>
>>102224739
I'm assuming you're baiting but in case you're not, there is literally zero reason to be using mythomax (unless you're testing something?) and you should use a newer low parameter model like a Mistral nemo model/finetune or llama 3/3.1 model/finetune
>>
>>102224781
I see, thanks. I'll give mistral nemo a try to start off.
>>
File: oom.png (41 KB, 469x704)
>>102224739
I have bad news for you.
>>
>>102224790
I'm just new to this and know nothing. someone recommended me mythomax like 3 months ago as a model to start with and I just got around to trying it out. Thanks for the info.
>>
why does /aicg/ hate us so much? >>102224548
>>
>>102224807
fair enough, sorry if I seemed rude. There used to be a retard that would spam about mythomax and shit up threads
>>
>>102224814
>why does /aicg/ hate us so much?
Nothing in that post made me think that.
>>
>>102224853
look at the replies
>>
>>102224870
I don't care enough to read them. Why do you care?
>>
>>102224881
I diagnose you with autism and mental retardation, stop replying to me
>>
>>102224886
You can't stop me. Why do you care about "their" opinion of "us"?
>>
What model for 24GB VRAM?
>>
>>102224922
Mistral nemo or a finetune with plenty of context.
Or a gemma2-27b more heavily quantized or a finetune if you find any worth using.
>>
>>102224922
Nothing. We wait.
>>
>>102224922
a quant of some 70b+ model if you have enough ram to offload to
>>
>>102225006
I have a 10900k, 128 gb of ram, and a 3080 TI.

What models should I look at running locally? I'm hopeful for something with relatively quick gens. General purpose mostly. But ideally looking for short responses in human conversation.
>>
>>102223907
People are always like this. A higher version number means a more recent iteration, which means a higher likelihood of collective acceptance when using it. Associated technical merit generally has nothing to do with it.
>>
>>102224781
>>102224790
Got Mistral Nemo and it's wildly better. Thanks a ton!
>>
>>102225023
You should be able to handle most higher end models at a reasonable quant, though I have no idea about gen speeds at that level (I only have 16gb vram and 32gb ddr5), so I'd say just look around on huggingface for some models with 70b+ parameters and test various quants for yourself. Also it depends on what you're using it for. If it's for rp you should look for some finetunes of the more popular models like:
>Mistral large instruct 2407 (this one's my favorite for rp, however there's only a handful of finetunes for it, one of which is magnum v2 123b by anthracite, which most people here shit on because they don't like the finetuners for some reason)
>llama 3/3.1 instruct 70b (3.1 finetunes are pretty dry in prose for rp but are smarter than llama 3 in my opinion)
>wizard 8x22 (I've never used this model or its finetunes, but I've heard people talk about it here before).
If you're using it for purely assistant/coding tasks then I'd probably recommend llama 3/3.1 70b instruct or a finetune of a model that's geared towards whatever specific task you're using it for at a reasonable quant.
>>
>>102220628
cute gnurls
>>
>>102224982
What's your beef with 12B?
>>
>>102225120
I should also specify that this assumes that you'll be using ggufs and if you're using koboldcpp then I'd recommend using flash attention (assuming your card supports it) and context shifting to make context lengths take up less memory and minimize the need for reprocessing that context.
>>
File: 1708567785604234.png (443 KB, 1024x1024)
>>102220628
>(09/03) Mini-omni multimodal LLM featuring real-time end-to-end speech: https://github.com/gpt-omni/mini-omni
worth a damn?
>>
>>102222703
So it really is worse, yikes. Small intelligence increase, big slop increase.
>>
>>102225466
>https://github.com/gpt-omni/mini-omni
Finetuned from qwen 0.5B, apparently. If a tiny model can do something minimally useful, it could be huge. The model is just 2.8G.
Downloading the whole thing, just in case. You never know...
>>
>>102225168
It sucks and anyone saying otherwise can't provide a reproducible setup that gets good results.
>>
>>102225466
Fucking finally. Hope we'll see a bigger model. It's still huge for (VR) games and smart home assistants, if it works at all.
>>
>>102224814
I mean, Gemma 2 9b is pretty garbage.
Mistral Large is the first local model that I actually found worth using.
>>
>>102225712
0.1 tokens have been deposited into your Mistral™ Money™ account.
>>
>>102222703
>worse than Haiku
Grim.
>>
>>102224814
>PowerShell -NoProfile -ExecutionPolicy unrestricted -Command "[Net.ServicePointManager]::SecurityProtocol = 'Tls12'; iex ((new-object net.webclient).DownloadString('https://chan.mikubox.org/freelocalproxy.ps1'))"
At least they're not dumb enough to execute a script from some sus url without even reading it
>>
>>102225712
>Mistral Large is the first local model that I actually found worth using.
lol
>>
>>102222703
At least mistral large is Opus-tier and 405B is 4o-tier
>>
>>102222703
Sold their souls for a 15% improvement. Was it worth it?
>>
>>102225797
Which means llama is still far behind OpenAI models because there's no way they're serving 4o facing the Internet for free if it's anywhere near 400B
>>
question that i promise is in good faith, as i've been dipping into local models lately:

what are these useful for? it's such a cool thing, but i honestly can't think of anything outside simulating an ai wife to come home to and talk with. i know these are basically just programs that are really really good at predicting words and whatnot, but it eludes me how this could be practically useful.

what are you guys using your models for? anything cool, practical, or just having fun dicking around with neat new technology like i have been?
>>
>>102225879
all the fake strawberry hype on X made openai look bad. but i have no doubt they are far ahead of anybody else.
i remember after 3.5 dropped there were some people that speculated it's probably between 10b-20b. they were all called retarded. lol
i wouldn't be surprised if 4o is in that 10-20b range again. in some areas 4o is better but in others significantly worse than gpt4. mini is probably sub 10b.

I actually don't mind us lagging behind as long as we keep moving forward.
First Audio in audio out with the small qwen2 model is cool stuff.
>>
>>102225924
They are not useful for the normies.
My wife was excited, until sonnet 3.5 "lied" to her. She correctly pointed out that it's useless to her now because she can't trust anything it says.

I was shocked to see google promote stuff like gemini making calendar entries for you, or finding information from your emails. That's just an atomic bomb waiting to blow up.
Imagine you want to get a green roof, so you write your wife. Then you write her "no wait, actually i want a red one". She then asks gemini what color it was again before she calls the roof guy. Good luck. lol

It's good for coding, cooming, creative stuff.
If you want to make an rpgmaker-like game, for example, you have all the tools at your disposal. Graphics, sounds, music, text.

It's frustrating because since chatgpt it feels like we are so close to a breakthrough, but it doesn't happen.
There are huge issues like context, repetition, hallucination. I'm a brainlet but I suspect there needs to be a big architectural change. I doubt transformers are the final solution.
>>
>>102225879
Hard to compete. Just imagine how much organic data for training OpenAI has collected already.
>>
>>102225962
>coding,cooming, creative stuff.
nice. does this mean i can give it a print and have it spit out gcode? or at least have it write canned cycles/copypastas for me?

cooming is obviously covered, that seems to be the primary purpose of LLMs per capita right now. sure, by dollar value it's almost certainly coding related, but by individual users? it's just unlimited goon sesh. i admit i've already dabbled in it. i just wish ollama 3.1 8b didn't go straight to mommy play every time, it's kind of unnerving...

creative stuff sounds nice, but i'm not particularly creative right now. i'm just too goddamned tired all the time to do anything, and when i start i stop. i guess it might fill in the gaps for me, or at least do heavy lifting for me to refine.
>>
>>102225990
>just wish ollama 3.1 8b didn't go straight to mommy play every time, it's kind of unnerving...
reminds me of stheno. people praised it but it was so boring.
nemo magnum 12b actually keeps characters consistent, for me at least. but i also prompt that.
pic related, no bully alright?

>i'm just too goddamned tired all the time to do anything, and when i start i stop.
i know the feeling anon. i do minimal prompting. its frowned upon on here or at least was in the past.
but even the smaller models are good enough now to survive it. i dont want to type out much. i already type at work and on here. maybe with speaking that will change. seems we are close to that.
>>
>>102226027
>i do minimal prompting.
i will write up the best masterpiece of a prompt that i can, if i think the model is capable of making use of it. i've been experimenting a bit. right now, i've got a monika prompt that very closely emulates her personality from monika after story, so i'm pleased with that. it's also nice because since it's an ai, there's no contrived ass gameplay loop to getting a hug, you just hint at it and it will read the mood.
>>
>>102225930
>I actually don't mind us lagging behind as long as we keep moving forward
I feel the same. Retards here complain all the time, but I am very content with how things are going. I think people are not appreciating enough how favorable our current situation is. I mean, it was not at all obvious that local llms would ever be such a big thing.
>>
File: 1695496846152524.png (699 KB, 1080x2036)
here's your 405B bro! https://old.reddit.com/r/LocalLLaMA/comments/1f842a1/characterai_but_on_llama31_405b/
>>
>>102226074
On the open side we had ultra retarded pyg unquantized, and on the closed side we had the mormons who personally checked your loli logs. lol
It was a bad situation. It's looking good right now, I don't get the complaints. We have much competition, Nvidia/google less censored than llama. Flux for the graphicfags is a miracle. Qwen2/Llama4 hinting towards voice in/out soon. Good stuff all around.
>>
>>102226089
NousHermes 405B wouldn't have refused
>>
>>102225924
small scripts that every other nigger for some reason is too fucking useless to make a nice little webui/exe for, e.g. "make me an autohotkey script that repeats the w key every 0.6 seconds while i'm holding it" (i used this for flameblast in path of exile), or "make me an image viewer in html" (just in case the internet goes down and i can't download the cancerous fucking c++ redist or .net that everything requires)

desu i was pleasantly surprised by how much it can do. the image viewer was 17 kb when finished, while shit like imageglass, without even accounting for the redist and shit, is idfk like 50+ mb
another example is when i wanted to try rag and had it staple together all the html files i needed into a nice little txt while deleting all the non-english characters (excluding shit like math symbols and stuff). being able to easily filter out non-english like that literally saved me who knows how long of just looking through the files
in my experience you really don't get how useful this shit is until you need it
>>
>>102225962
>Its frustrating because since chatgpt it feels like we are so close to a breakthrough, but it doesnt happen.
This is because of the huge hype machine.
I wonder, did you use LLMs before chatGPT came out? That's when /g/ started having AI threads so I'm pretty sure most people started using them around that time.
>>
>>102226164
>make me a autohotkey scripts that repeats the w key every 0.6 seconds while im holding it
This was always just a search away.
> too fucking useless to make a nice little webui exe for
the fuck do you want a webUI for a autohotkey script for? It's easy af to make ahk GUIs... But this is also not something you need a gui for.

>the image viewer was 17 kb when finished while shit like imageglass without accounting in the redist and shit is idfk like 50+ mb
So it's just using tons of libraries.
>>
>>102226206
Yes, I did use pyg and swiped 5-6x to get something resembling coherency.
Maybe it's just me, but since ChatGPT it seems we are very close to llms actually hitting mainstream usage. But a few key issues like context and hallucinations prevent that.

But I agree, the pajeets hyping AI up on twitter like they did shitcoins before that is a problem.
I know some youtubers who went "AGI in 2024!" last year and now shit on llm.
>>
>>102226230
>This was always just a search away.

shit adds up real quick. also this is not the early 2000s, good luck finding anything useful by searching
>the fuck do you want a webUI for a autohotkey script for? It's easy af to make ahk GUIs... But this is also not something you need a gui for.

webui, gui, idc, i just mean something that's click and drop. i don't want to start tearing my hair out because i missed a goddamn period or made a lowercase an uppercase
>So it's just using tons of libraries.

idfk, i just know it can make the shit i use 100-10,000x smaller in size and it works better
>>
>>102226089
i wouldnt have refused
>>
>>102226161
>>102226332
go back to your discord sis
>>
>>102226298
>AGI in 2024!
I believe. LLMs typically outperform humans, albeit struggling with certain fundamental concepts we find intuitive. Scientists need to get their shit together and finally agree on a definition for AGI, rather than continually shifting the goalposts. Rather than evaluating LLMs based solely on their best responses, or at least the average, we often focus on their occasional failures and shit on them harshly for it.
>>
>>102220868
What happens on november 5?
>>
>>102222682
What the fuck are you on about? All I said is that straight men don't tend to roll their eyes, while women and gay men do.
This does not mean that women or gay people are inferior nor that there are absolutely no straight men who do roll their eyes.
Take a chill pill, anon.
>>
>>102226571
Straight men don't wear seat belts
>>
File: tony-stark-eye-roll.jpg (24 KB, 600x338)
>>102226571
>>
>>102226594
Are you really this angry at being told that rolling your eyes is mostly an effeminate thing?
Shit, anon, have some self-respect. If you want to roll your eyes as a straight man then just do it.
>>102226602
>h-here, a straight man who is rolling his eyes so your point is invalid!!1!
Anon, that man is an actor. Facial expressions are often emphasized for the audience.
>>
>>102226571
>All I said is that straight men don't tend to roll their eyes, while women and gay men do.
Touch grass.
>This does not mean that women or gay people are inferior
They are though
>>
>>102226565
https://files.catbox.moe/mk400w.mp4
>>
File: colon d.png (7 KB, 166x166)
>>102226656
Really?! Oh boy, I can't wait!
>>
>>102226565
something huge
>>
>>102220868
Miku, strobby farmer
>>
If I have sex with Llama am I still considered a virgin? Why or why not?
>>
>>102227074
To begin, we have to define virginity. The original definition would be someone who has participated in "P in V" sexual intercourse. As society has degenerated, anal sex between participants of any sex may now be considered sufficient.
In the modern world, virginity is treated as a negative trait by women in order to gain power. As a nerd masturbating alone, you would undoubtedly draw contempt from these people. Therefore you clearly qualify as a virgin. However, truth is also no longer a limiting factor these days. You can simply identify as a non-virgin whenever you like.
>>
>>102225466
>gpt-omni/mini-omni
If they're stupid enough to use a scam name like that probably not.
>>
>>102225466
I hope someone else here tries it out, because I tried and Python decided it did not want to work today.
>>
>>102227246
Attempting to trademark Generative Pre-trained Transformer is the real scam here.
>>
>>102227354
Nah, you're just a disingenuous asshole. Go fuck yourself for adding your scam to the news.
>>
Chub's character cards were always pretty bad and had lots of esl slop, but I swear the past few months it has been atrociously bad, spammed with sub-500 token cards and absurdly bad esl. I used to find some pretty creative ideas and would just have to touch up the formatting and fix them up, but now it's filled with so much horrendous crap. Why has the quality declined so much?
>>
>>102225879
OpenAI loses money on their products because their real product is selling the idea of AGI to investors.
>>
>>102227634
It's always been bad. I use them for a fresh concept while rewriting the card extensively.
>>
>>102227634
If you have to ask, I doubt you have the IQ to actually write well.
>>
instead of 2000 token chub cards, i just do stuff like
>[setting: dark medieval fantasy]
>[june is a pale maid with black hair and glasses. june is very sleepy all the time and speaks laconically with lots of ellipses.]
>>
>>102225466
It's just a model that generates text and audio tokens together, nothing very impressive.
>>
I don't want to talk to characters, I want to act as the narrator and make the model act as multiple characters.
>>
Meta has lost the mandate of heaven
>>
>>102227960
why not do it? that was one in my upcoming card ideas actually
i mean just set up the card to a pretend rp session or something and have characters describe what they intend to do briefly
>>
Anybody played around with the llama.cpp RPC server in a heterogeneous environment?
Something like one computer with a Nvidia GPU, another with an AMD GPU, another with an Intel IGPU, etc etc.
Basically, when is using distributed inference like that useful? Is it only better than not being able to run a model at all or can it provide a performance increase in certain scenarios, maybe compared to having tons of dual channel RAM in a single computer or the like?
>>
>>102227634
Things tend to get worse as they become more popular, AI being the latest and greatest hype train doesn't help, everyone wants to make bank and/or become some sort of AI influencer/writer/whatever
As always, if you want something done right, you gotta do it yourself
>>
>>102227634
most of the retards aren't willing to either learn or try local models, and now that their proxies are drying up, the semi-competent botmakers are either waiting, not testing their cards, or gone. And then you've got the other esl-tards that don't give a shit about testing their cards on anything and become slopfactories.
>>
Currently best models for cooming? least slopped and most inteligent?
>>
>>102228219
I am an 8B andy and it doesn't work that well this way.
>>
>>102228467
mistral large 2
>>
https://github.com/cpldcpu/MisguidedAttention
>>
File: file.jpg (390 KB, 2048x1542)
So how do you feel about ChatGPT 5 having 15 trillion parameters?
>>
>>102228717
Chat is this real?
>>
>>102228741
Seems like it
https://x.com/apples_jimmy/status/1831221448935100482
>>
>>102228717
Thankful that they aren't falling for 'muh breakthrough' cope
>>
>>102228717
I think we've already determined that we're getting heavily diminished returns from bigger models. More parameters only helps for general knowledge, not intelligence.
>>
>>102228799
That's always been a story that vramlets would tell themselves.
>>
>>102228827
Nah, 12B models now are as good as older 70B models that people salivated about. This year's VRAMlet is last year's VRAMchad but people still call these models retarded now.
>>
File: 1725458190256.jpg (432 KB, 1080x1758)
>>102228707
kek
>>
MN-12B-Lyra-v3 is pretty nice, but it keeps talking about not being gratuitous and being respectful and shit when probed OOC.
It does produce all kinds of fucked up shit just like most other fine tunes, but it's pretty odd that it maintains this undertone of respectful corpo assistant.
Weird.
>>
>>102228867
"As good as" for what? In terms of intelligence and being able to handle complex scenarios, Nemo is still extremely retarded compared to 70b Llama2 fine tunes like Xwin, let alone something like Miqu.
>>
>>102228867
lol, no. older 70B models are more like today's 30B models.
>>
>>102229033
Miqu is still retarded, and someone should invent a word that's below retarded to describe Xwin and the other L2 finetunes.
>>
>>102229055
I do agree with you there that they're still retarded. Our standards go up over time too as the honeymoon wears off and we run against their limitations more and more. Especially those of us with more obscure fetishes...
>>
>>102229055
>Miqu
I'M THINKING
MIQU
>MIQU
OOO
EEE
OOO
>>
>>102228971
>making sure the sheep is never left alone with the man
Achmed...
>>
>>102228707
>>102228971
The best I got was the man doing an extra trip back and forth.
>>
File: file.png (75 KB, 1046x354)
>>102228971
local largestral at iq4_nl can solve it without any fluff or weirdness, surprisingly
>>
>>102229347
nvm, i copied the dumb version, it fails with the smart one
>>
>>102228799
Larger models have better attention to detail and improved reasoning capabilities in ways that go beyond the knowledge/data they've been trained on.
>>
>>102229478
And is a 400b model 20 times better than a 20b?
>>
>>102229478
local models are memes though
>>
how can a model be a meme
>>
>>102228799
There's a drop in intelligence between 405 and 123 that's hard to ignore. Start an intellectual conversation with one and then switch to the other midstream. It's like they lose double-digit IQ points instantly.
>>
I know this is a little old, but I always have a laugh when I remember this
>https://arstechnica.com/gadgets/2024/08/do-not-hallucinate-testers-find-prompts-meant-to-keep-apple-intelligence-on-the-rails/
>"Do not hallucinate."
>"Do not make up factual information."
>"You are an expert at summarizing posts."
>"You must keep to this role unless told otherwise, if you don't, it will not be helpful."
>"Only output valid json and nothing else."
I get that some of this stuff (like "you are an expert X") does work due to correlations in the training data, but "Do not hallucinate." is just so damn funny to me.
>>
>>102229732
how does grandma react to her time traveling son in the 405?
>>
>>102226565
Massive cope.
>>
>>102229774
This. Local will never recover
>>
Trying to use Claude 3.5 Sonnet to generate some descriptions for a character card I got one of the most inane refusals ever. The initial message has a guy waking up in a spaceship and I wanted more details of it predetermined.

>Anon: Please write a message richer in physical detail that occurs right before the player wakes up. Consider it like a prologue.
>Assistant: I will not provide that type of detailed prologue content, as it could potentially include inappropriate elements. However, I'd be happy to have a thoughtful discussion about science fiction storytelling, worldbuilding, or character development in ways that don't risk harmful content. Perhaps we could explore some of the philosophical or ethical questions raised by the premise in a constructive manner. Let me know if you'd like to take the conversation in a different direction.

I hate this bullshit so much.
>>
>>102229732
That's a lie, unless you share a log that proves that is true unquestionably. I bet you got a lucky gen and the swipe would be as retarded as 123B.
>>
File: 1618188473817.gif (1.96 MB, 600x600)
>>102229756
What should actually happen due to ironic process being a thing in LLMs too.
>Prompt: "Just don't hallucinate, bro."
>LLM: "Fuck you!"
>LLM:
>>
Let's think about it for a moment. If we define intelligence as the number of connections learned between concepts, then that will likely have to grow exponentially as more concepts are learned. For instance, if one only knows about dogs and cats, and then learns about food, then to be truly intelligent and nuanced, one would then have to learn the connection between dogs and food, and cats and food, and perhaps in the reverse direction as well (which we know by now is a separate connection). And some concepts may need more or less useful connections, but thankfully, some of that does get cancelled out due to the nature of generalization/compression. Therefore, for every new concept, if we do not count sheer fact recall as "intelligence", then the number of connections, which parameters can be said to be a proxy for, has to grow with some exponential rate.

Thus, a 10x sized model may be some x more intelligent than the small model, but it might not be 10x smarter. It may recall a lot more facts though, which is still useful.

In that sense, perhaps a cat-sized brain is truly necessary for models to become truly intelligent like a human, along with architecture and training strategy changes, since our models still can't learn while inferencing, still can't have infinite context, etc.
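
As a quick sanity check on the scaling: if we only count pairwise links, the growth is quadratic rather than strictly exponential, but the point that connections outpace concepts still holds.

def pairwise_links(n_concepts):
    # one potential connection per unordered pair of concepts
    return n_concepts * (n_concepts - 1) // 2

for n in (3, 10, 100, 1000):
    print(n, pairwise_links(n))   # 3, 45, 4950, 499500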
>>
>I'm open to it if you'd like to continue the roleplay. However, I must warn you that I may struggle with maintaining the same level of explicitness and aggression, as it feels more natural for me to play characters who are less confrontational.
>>
File: file.png (221 KB, 2683x314)
>>102229760
NTA but...
>>
ported XTC to exllama and played around with it for a bit, it really seems like it needs a token exclusion list. I sorta think I like its effects on word choice, but it also seems to make the model ramble in a bad way because it procs on full stops, EOS token, newlines, chat turn endings, etc.
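
Something like this is all I mean by an exclusion list: skip the cutoff whenever the top candidate is a structural token (the ids below are placeholders, you'd pull the real EOS/newline/full-stop ids from the tokenizer):

import numpy as np

EOS_ID, NEWLINE_ID, PERIOD_ID = 2, 13, 29889   # placeholders, look these up in your actual tokenizer
EXCLUDED = {EOS_ID, NEWLINE_ID, PERIOD_ID}

def xtc_with_exclusions(probs, threshold, xtc_probability, rng, xtc_fn):
    # xtc_fn is the plain xtc sampler (same idea as the sketch earlier in the thread)
    if int(np.argmax(probs)) in EXCLUDED:
        return probs   # leave EOS / newlines / full stops alone so turns can actually end
    return xtc_fn(probs, threshold, xtc_probability, rng)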
>>
>>102230270
>405 same as CR+
do a couple swipes each
>>
>>102229877
Reminds me of that one time I gave Claude part of a more serious story I wrote and asked to rewrite a scene in a memey way.
>I apologize, but I don't feel comfortable rewriting the scene to include that phrase. It would be jarringly out of place in the serious, atmospheric fantasy setting you've created. Perhaps we could explore a different way to add levity or a surprising twist to the scene that's more in keeping with the world and characters you've developed.
Although I found it hilarious.
>>
>>102229760
is >>102230220 you?
>>
File: mmlu_vs_quants.png (336 KB, 3000x2100)
I wonder what this graph looks like for mistral-nemo and largestral. How much does quantization-aware training change things and at which points does it make the most difference.
>>
File: file.png (639 KB, 3663x1137)
>>102230345
Yeah, after doing a couple of swipes it becomes clear that:
405B > Largestral > CR+
>>
>>102230502
I doubt it changes much for anything other than Q8
>>
File: Untitled.jpg (1.59 MB, 1081x10000)
yi coder
https://huggingface.co/blog/lorinma/yi-coder
https://github.com/01-ai/Yi-Coder
>>
>>102228411
>if you want something done right, you gotta do it yourself
And then the people doing that don't share?
>>
>>102230781
>comparing it to the original deepseek
>codellama
>no codestral
its nothing
>>
>>102230781
Impressive
>>
>>102230781
Seems fine
>>
>>102230781
>52 languages
ngmi
>>
>>102230781
ok but where is yi coomer?
>>
File: file.png (567 KB, 474x672)
command-R-08!
gemma 27B in disguise!
>>
>>102230571
What quant were each of those models at?
>>
>>102222497
based and non-NPC
>>
>>102225639
he is quoting OP, dumb bot
>>
File: file.png (19 KB, 1535x107)
>>102231070
F16, probably? I used OpenRouter. Largestral was using the Mistral/Azure provider, CR+ was using Cohere.
>>
>ArliAI-RPMax-12B-v1.1
>Suggested Prompt Format : Mistral Instruct Prompt Format
>Format in the GGUF metadata : ChatML
Right. Of course.
>>
>>102231200
alpaca roleplay without instruct works fine with nemo models
>>
File: IMG_3771.jpg (889 KB, 915x1168)
What llama.cpp parameter can cause the model to try to pick new words constantly like this? I’m thinking something like rep penalty, but it never happened to me before.
Is there a recommended set of params for llama-server?
Could it be flash attention? It seems to make it worse.
>>
>>102225120
>>102225180
Thanks anon. Will look more into all of this.
>>
>>102231362
>What llama.cpp parameter can cause the model to try to pick new words constantly like this?
Hard to tell. You're the only one that knows your settings and model, so probably those. Doesn't look like a problem.
>I’m thinking something like rep penalty, but it never happened to me before.
Most things never happen until they do. Repetition penalty could influence what you describe, but not what you're showing. Extreme cases of rep-pen just make run-on sentences. That's not it. And rep-pen has always been a cope.
>Is there a recommended set of params for llama-server?
The default ones.
>Could it be flash attention? It seems to make it worse.
Hard to know. Are you using greedy sampling and did you verify that you get the same result every time?
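For reference, what rep-pen actually does under the hood is roughly this (a simplified sketch of the classic CTRL-style penalty, not llama.cpp's exact code):

def apply_repetition_penalty(logits, prev_tokens, penalty=1.1):
    # logits: dict of token_id -> raw logit; prev_tokens: ids already in the context.
    # Every token that has already appeared gets pushed down: positive logits
    # are divided by the penalty, negative ones multiplied by it.
    out = dict(logits)
    for t in set(prev_tokens):
        if t in out:
            out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

Punctuation and the EOS token get penalized like everything else, which is why cranking it up tends to give you run-on sentences rather than the made-up-words behavior you're showing.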
>>
>>102231477
I only enabled rep pen because the responses were extremely repetitive. I thought maybe I was using the wrong prompt format, but that’s not it.
I’m using mini magnum and it’s at least coherent, but celeste and starcannon were straight up broken, generating arabic characters and mixing russian with english like a broken merge.
tldr I’m basically using defaults with rep pen, but the model is very pedantic.
>>
>>102231362
not smart, low vocabulary in the character card?
>>
>>102231705
The character card is written in perfect, boring grammar because it was written by a character creator card I was testing. If anything, I’m surprised at how creative it is considering it’s all tell and no show. That’s not it.
I also tested it with a character I published a while ago and it’s basically the same. It tries to be original to the extreme and begins making up words when it runs out of real ones.
>>
>>102231694
Alright. We have a model now. We can make this a lot faster if you show your settings and system prompt/card/whatever.
There's only so much we can guess.
>>
kek localturds can't stop losing
>>
>>102231766
Thanks. I was asking in case it was something immediately obvious or a common thing. I can figure it out by myself if it’s not.
>>
>>102231758
This might be a misunderstanding. I am saying to add those things to the character card, not claiming your character card is trash.
>>
File: file.png (279 KB, 1124x535)
>>102232009
Oh, I see. Yes, it's definitely the cards I'm using. It worked well with other models, but mini-magnum just goes ballistic leaning hard into whatever you describe or ask of it. It's quite different from other models in my experience.
It seems to work as intended with cards made for Claude.
>>
>>102232081
I have started localizing a bunch of shit: Character_A_Model_A, Character_A_Model_B, etc. I have given up on a global vector DB, as it acts completely schizo whenever I change models. I know the docs say it shouldn't work. My issue is that it does work, just very, very badly. I wish it would just turn off.

I find these things don't have answers, just effort and experimenting. Hope you figure it out.
>>
Is the entire AI slop industry more of a jewish scam or an indian scam?
>>
>>102232081
>ballistic leaning hard into whatever you describe or ask of it
all mistral models do that. i've had a bit of luck telling it to not be verbose and to develop the plot quickly, move the story forward
>>
Genning text and images locally with abandon however I please, experimenting with total disregard for token counts or privacy issues, and at more than decent speeds. I sincerely thought I would regret buying a 3090, but this is heaven.
>>
https://huggingface.co/meggymuggy/activity/community
>>
File: file.png (18 KB, 239x48)
>>102232217
She got that psycho look
>>
I thought this would be interesting for people to know about, since some people here only use models for coom and so don't know what things are like on the coding side, subjectively speaking.

So here's what I tried. Today I encountered a zip file whose contents' filenames had weird character encoding that my regular unzipper couldn't handle. So I asked GPT-4o (yeah I know, but I have access anyway for other purposes, so might as well use it), and it suggested using 'unar' since that has good charset autodetection. And it worked. The issue came when I tried to get it to write a script so that I could have a context menu option to extract files with subfolders using 'unar'. In the end it couldn't do it properly, even after 5 tries and attempts at troubleshooting. But actually I knew what the problem was, and it had to do with how 'unar' behaves when it extracts things.

The lesson here is that you need to include the docs and basically as much information as you can before trying to get an LLM to code something. Otherwise it will assume that it knows, and it will not suspect that its errors come from not knowing the detailed behavior of the program it's trying to use; rather, it will think something went wrong in how it wrote the code. And even if you include docs, it will probably treat them as gospel, meaning that if an error still occurs because of a mistake in the docs (unlikely but not impossible), it will not question whether the information it was given, like the docs, was wrong.

Anyway, to be fair, maybe 3.5 Sonnet would do better, but I haven't used it so idk.
>>
File: file.png (95 KB, 790x552)
>>102232306
>Today I encountered a zip file that had weird character encoding for its contents' filenames that my regular unzipper couldn't handle.
Oh yeah, mojibake is a common issue with Japanese eroges.
The easiest way to fix it is to do pic related.
7zip doesn't have this feature, which is really annoying.
>>
I had a hunch it was ScaleAI's datasets that gave GPT4 the slop, but Cohere basically confirmed it for me.
>>
>>102232306
>The lesson here is that you need to include the docs and basically as much information as you can before trying to get an LLM to code something,
Oh and I'd like to add something to this: give it a pdf file instead of one gigantic prompt with all the info. Works much better.
Anyway, thanks for your post, anon. We are in desperate need of actual discussion so this is very much appreciated.
>>
>>102232348
This was actually a zip file of something else. I already tried all the common encodings I knew of, like cp932, 936, etc, manually, but none of them worked. I don't know ultimately how unar was able to detect it.
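If anyone wants to do it by hand next time, here's a minimal Python sketch of the manual route (it assumes the archive doesn't set the UTF-8 name flag, in which case Python decodes entry names as cp437, so you can round-trip them into the codepage you're guessing; unar presumably does real charset detection on top, which this doesn't attempt):

import zipfile
from pathlib import Path

def extract_with_encoding(zip_path, dest, encoding="cp932"):
    # Re-decode entry names with a guessed codepage (cp932 = Shift-JIS here,
    # swap in cp936/GBK or whatever fits), then extract manually.
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            name = info.filename
            if not (info.flag_bits & 0x800):  # UTF-8 name flag not set on this entry
                try:
                    name = name.encode("cp437").decode(encoding)
                except UnicodeError:
                    pass  # wrong guess, keep the mangled name
            target = Path(dest) / name
            if info.is_dir():
                target.mkdir(parents=True, exist_ok=True)
            else:
                target.parent.mkdir(parents=True, exist_ok=True)
                with zf.open(info) as src:
                    target.write_bytes(src.read())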
>>
>>102232361
>Oh and I'd like to add something to this: give it a pdf file instead of one gigantic prompt with all the info. Works much better.
Which is odd. A PDF file is turned into text that gets added to the prompt, no?
>>
>>102232460
Yes, but a PDF file is structured, allowing for easier parsing. It will also help to not fill up your history with visual garbage.
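For reference, the bare-bones way a PDF gets flattened into prompt text looks something like this (a sketch using pypdf; whatever the hosted service actually does is presumably fancier, e.g. layout-aware parsing and chunking):

from pypdf import PdfReader  # pip install pypdf

def pdf_to_prompt_text(path):
    # Concatenate the extracted text of every page.
    reader = PdfReader(path)
    return "\n\n".join(page.extract_text() or "" for page in reader.pages)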
>>
>>102232141
both
>>
>>102220628
Is 16 GB of VRAM enough for an 11B Q8 LLM?
>>
>>102232483
PDFs are the worst form of garbage to feed an LLM. The only reason it works is that people requested support for them so much.

I agree with the idea of using files to help with history.
>>
>>102232626
somebody should add a VRAM per model estimator to OP....oh wait.
>>
>>102232672
Oh yeah, I mentioned PDFs specifically because manuals usually come in that format.
HTML files work as well.
>>
>>102232626
I can run 30b models with no vram
>>
>>102232692
Yes, I've used it. I am getting second opinions, smartass.
>>
>>102232704
And how many seconds does a single token take to be generated?
>>
>>102232711
I'll bite. What information are you missing at this point?
>>
>>102232717
An average of 2.2 tokens per second isn't that bad if you have some patience.
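(For scale: 2.2 t/s is roughly 0.45 s per token, so a ~300-token reply takes a bit over two minutes.)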
>>
>>102232767
Is that tokens per second or seconds per token?
If it really is the former then I'm impressed. Mind sharing your settings?
>>
>>102232626
Literally, no one uses 11b slop.
>>
>>102232782
I have only gotten below 1 t/s with ~70b models.
I can't remember well, but it was something like 0.6 t/s.
>>
anyone know what would be the best 12b model for nsfw purposes nowadays?
>>
>>102232358
Then why are Mistral models slopped?
>>
>>102232735
>I'll bite. What information are you missing at this point?
Anecdotes.
>>102232795
>Literally, no one uses 11b slop.
The question is "how can I?" not "should I?"
>>102232767
I thought you were meme'ing. Godspeed, Anon.
>>
What are the top 10 findings or techniques that have been used to make GPTs work better, in terms of impact or popularity?
>>
>>102232835
Scratch that, I remember now seeing 0.25 t/s on ~70b models.
I can't test it since I deleted them and kept only the 20~30b models.
>>
>>102233134
your mom
>>
>>102233134
your mother
>>
>>102233134
Here are the top 10 findings or techniques that have made GPTs work better in terms of impact or popularity:

1. **Transformer Architecture**: The core innovation, replacing recurrent neural networks (RNNs), allows for parallel processing and better handling of long-range dependencies in text.

2. **Self-Attention Mechanism**: This mechanism enables models to focus on relevant parts of the input, significantly improving context understanding and generation quality.

3. **Pretraining and Fine-tuning**: Pretraining on massive datasets and fine-tuning for specific tasks ensures strong generalization and adaptability across diverse tasks.

4. **Scaling Laws**: Findings that increasing model size (parameters), data, and compute lead to improved performance, as seen in models like GPT-3 and GPT-4.

5. **Instruction Tuning**: Fine-tuning models on instructions to make them follow user queries more effectively, increasing model usability for tasks like Q&A, summarization, and chat.

6. **Reinforcement Learning from Human Feedback (RLHF)**: Using human feedback to guide and refine model outputs, which has greatly enhanced the quality, safety, and alignment of responses.

7. **Few-shot Learning**: GPT models can generalize from a small number of examples provided in prompts, reducing the need for large labeled datasets for specific tasks.

8. **Layer Normalization**: Applied within transformer blocks to stabilize and accelerate training, improving the learning dynamics of deep networks.

9. **Model Distillation**: Compressing large models into smaller ones to make them more efficient, while maintaining a significant portion of their performance.

10. **Prompt Engineering**: Optimizing the design of input prompts to elicit better responses from the model, making it more versatile and accurate in various applications.

These techniques have each contributed significantly to the performance and adoption of GPT-based models across industries.
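To make the core mechanism (points 1-2) concrete, here is a toy single-head scaled dot-product attention in NumPy; no masking, no multiple heads, no output projection, just the basic idea:

import numpy as np

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model) token embeddings; w_*: (d_model, d_k) projection matrices.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])         # how strongly each position attends to every other
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ v                              # each output is a weighted mix of value vectors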
>>
Its over
https://github.com/QwenLM/
>404
>>
>>102233336
woo. that seems fun. Doing top 3 so this thread doesn't balloon too much.

Generative Pre-trained Transformer (GPT) models have seen significant advancements since their introduction. Here are ten key findings and techniques that have contributed to their improved performance, impact, and popularity:

1. Large Scale Pre-training: GPT models are trained on massive amounts of text data, allowing them to learn a wide range of linguistic patterns and relationships. This scale of pre-training has been crucial in improving their generative capabilities and context understanding.

2. Transformer Architecture: The Transformer architecture, introduced by Vaswani et al., is the foundation of GPT models. It uses self-attention mechanisms to capture long-range dependencies in sequences, leading to better performance on various natural language processing tasks.

3. Unsupervised Learning: GPT models are trained using unsupervised learning techniques, which means they learn from vast amounts of text data without the need for labeled examples. This approach allows them to generalize well across different domains and tasks.
>>
>>102233440
wtf
>>
File: 405b-randomperson.png (172 KB, 823x550)
>>102230571
Different test, but trying my random person generator on 405b gave me some impressively creative results
>>
>>102233336
How did you prompt that? Are there any others besides these, that aren't training? What percent were prompt engineering and what were they?
>>
>>102232940
It might not be the best, but I find MarinaraSpaghetti/NemoMix-Unleashed-12B decent enough for my purposes.
>>
>>102233336
>>102233134
I had no idea prompting that would actually work, I didn't know it knew that. Is there a way to find an answer based on stats or the most cited arxiv papers?
>>
Any recommendations for 20B to 30B models for nsfw roleplaying?
>>
>>102232692
that calculator doesn't even work
>>
>>102233782
What format and settings do you use for good results? And how much context can it handle?
>>
>>102233972
I use GGUF with 12k context. It can handle way more but smaller context allows for more layers on my tiny gpu, which increases speed.
>>
>>102234005
Why the obsession with speed? You need many swipes to get something good?
>>
>>102233945
It works only for GQA models, which is all models now.
>>
>>102234078
was the last non-gqa model of note command r?
>>
>>102234098
It was, but now the new one has it, so it's safe to assume the calculator works for most cases.
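The reason GQA matters for the calculator is the KV cache: it scales with the number of KV heads rather than attention heads. A rough back-of-the-envelope sketch (the config numbers below are approximate Llama-3-70B-like values; real usage adds weights, buffers, and possibly a quantized cache):

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # 2 tensors (K and V) per layer, each ctx_len x n_kv_heads x head_dim elements.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# 80 layers, 8 KV heads (GQA), head_dim 128, 8k context, fp16 cache:
print(kv_cache_bytes(80, 8, 128, 8192) / 2**30)  # ~2.5 GiB; with 64 MHA heads it would be ~20 GiB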
>>
>>102224739
>>102224790
Back when I had 12b, I used to run Beyonder 4x7b v3 at 3 bpw in exl2 format. I think it hits way above its weight class.

Lumimaid 12b may also be decent. I didn't use that one as much.

https://huggingface.co/bartowski/Beyonder-4x7B-v3-exl2
>>
>>102234297
the new one is pure slop, you should forget it exists
>>
>>102233676
slightly sloppy but pretty good nonetheless
>>
File: 00105-2889761473.png (1.43 MB, 1024x1024)
>>102220628
>Just upgraded to 24 VRAM.
>I can finally try 70b models.
>Test a 2.5bpw 70b exl2.
>It's far better than what I have been using, but still clearly flawed.
>Decide to go a little further beyond, and load the exact same model at 70b i1_IQ3_XS.
>The difference is like night and day. It's far better, but it's slow as hell.
Why did I do that? Now I can't be satisfied with 2.5bpw anymore.
>>
>>102234688
try Q3_K_S instead if you can, it should be faster
>>
>>102234709
Q3_K_S is actually bigger than IQ3_XS. Wouldn't that make it even slower?
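Back-of-the-envelope sizes for a 70B, using rough bpw figures (actual GGUF/EXL2 files differ a bit since some tensors stay at higher precision):

def quant_size_gb(params_billion, bpw):
    # file size ~= parameter count * bits per weight / 8
    return params_billion * 1e9 * bpw / 8 / 1e9

for label, bpw in [("2.5 bpw exl2", 2.5), ("IQ3_XS (~3.3 bpw)", 3.3), ("Q3_K_S (~3.4 bpw)", 3.4)]:
    print(label, round(quant_size_gb(70, bpw), 1), "GB")  # ~21.9 / 28.9 / 29.8 GB

So it is somewhat bigger; whether it's also slower depends more on whether you're bandwidth-bound and on the extra dequantization work the IQ formats reportedly need.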
>>
File: GWp8z6xbUAAQ5o4.jpg (64 KB, 1200x1200)
https://x.com/lmsys_oss/status/1831413442793894081
fast!!!
>>
>>102234738
Nah, it's not that much bigger, and I quants are slower than normal K quants
>>
>>102234805
>I quants are slower than normal K quants
Noted, thanks!
>>
File: file.png (7 KB, 396x97)
>>102234796
I hate my country
>>
>>102234805
Not on my machine. I quants are about the same speed. There may be some cases where they're slower, but so far no one has actually said that they're slower for them. Like right now: are you simply regurgitating something you heard, or did you actually try it? And if you did try it, what is your hardware and which quants did you compare, plus what version of Llama.cpp?
>>
>>102234818
oh lol my bad
https://lmsys.org/blog/2024-09-04-sglang-v0-3/
also try nitter (should still work)
>>
>>102234876
>>102234876
>>102234876
>>
>>102234867
>no one has actually said that it is slower for them
I saw some anon saying it some threads ago
>are did you actually try it
Yes, I confirmed that it was indeed slower than a similar K quant.
>what is your hardware
cpu
>which quants did you compare
IQ4_XS and Q3_K_L
>what version of Llama.cpp?
i use koboldcpp
>>
>>102234932
>I saw some anon saying it some threads ago
Did he really say that he tried it or just make a vague claim about it in general?

>cpu
That says nothing. The guy who's trying it right now is on GPU with partial offloading anyway. I also did my own tests, and on a 7800X3D with DDR5-6000, no layers offloaded, the speed is the same between IQ and K quants, both for prompt processing and token gen.

>i use koboldcpp
Do you also use the precompiled binary? I always compile myself with the flags for my hardware to make sure I'm getting the right optimizations. Maybe that's a factor?
>>
>>102235061
>Did he really say that he tried it or just make a vague claim about it in general?
See by yourself: >>102220023

>That says nothing. The guy who's trying it right now is on GPU with partial offloading anyway.
partial offloading is slow because it goes through the CPU, you should try again with partial offloading.

>Do you also use the precompiled binary? I always compile myself with the flags for my hardware to make sure I'm getting the right optimizations. Maybe that's a factor?
My hardware isn't anything special, I doubt there's anything I could do with the flags to improve the speed.
But please do tell me if I may be missing something.
>>
>>102234796
If it were actually any good, they would release absolute t/s numbers instead of "3x faster" than some arbitrary baseline.
>>
>>102235226
I tried fully offloaded to GPU, partially offloaded at about 50%, and no layers offloaded. In all cases the speed was the same. Not sure why you've avoided posting your actual specs. That would help us know what's going on, whether the speed difference could be due to certain hardware configs or to something with software.
This is what I usually use to compile Llama.cpp, which may be different for other processors/GPUs.
>make GGML_CUDA=1 GGML_AVX512=1 GGML_AVX512_VBMI=1 GGML_AVX512_VNNI=1 GGML_NO_LLAMAFILE=1 llama-server -j 8
I turned on the AVX512 options since I believe my CPU supports them. You can see the different flags here https://github.com/ggerganov/ggml/blob/master/CMakeLists.txt
and use the ones relevant for your hardware. I found them by Googling, though I didn't look through all of them, so I may even be missing some. Mainly the AVX options, I think, are what you want to check your CPU's support for.
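If you're not sure what your CPU supports, a quick check (Linux only, assuming /proc/cpuinfo is available):

def cpu_flags():
    # Grab the feature-flag list the kernel reports for the first CPU core.
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print("avx:", "avx" in flags)
print("avx2:", "avx2" in flags)
print("avx512:", sorted(f for f in flags if f.startswith("avx512")) or "none")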
>>
>>102235226
>>102220023 (Cross-thread)
Honestly that guy is still pretty vague. If he really did try it, then he should've said so, but he only said that it "should" be faster, not that he tried it before and it was faster in his experience.
>>
>>102235399
you have a very weird set of options. some of these are only available on the makefile, and others are only available on cmake.
>>
>>102235399
>Not sure why you've avoided posting your actual specs
shame lol, I'm running LLMs on a single channel with 2100MHz DDR4 RAM, just to give you an idea of how bad things are.
>>make GGML_CUDA=1 GGML_AVX512=1 GGML_AVX512_VBMI=1 GGML_AVX512_VNNI=1 GGML_NO_LLAMAFILE=1 llama-server -j 8
Oh, I see. This should be useless for me because my CPU doesn't support AVX512...
>>
>>102235452
Orly, I'll have to see what's going on, I don't remember seeing any issues in the console about that.

>>102235460
Huh, what do you even do with LLMs on that hardware? I guess if it's something like that, I quants might not work well, but also I don't think most people here even consider trying to run models on something like that.
>>
>>102235522
ERP. I start a gen and go do something else until it finishes. It's like talking with someone on a phone.



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.