/g/ - Technology

File: naked-sun-isaac-asimov.jpg (150 KB, 736x720)
/lmg/ - a general dedicated to the discussion and development of local language models.

/lmg/ Book Club Edition

Previous threads: >>101464048 & >>101457504

►News
>(07/18) Improved DeepSeek-V2-Chat 236B: https://hf.co/deepseek-ai/DeepSeek-V2-Chat-0628
>(07/18) Mistral NeMo 12B base & instruct with 128k context: https://mistral.ai/news/mistral-nemo/
>(07/16) Codestral Mamba, tested up to 256k context: https://hf.co/mistralai/mamba-codestral-7B-v0.1
>(07/16) MathΣtral Instruct based on Mistral 7B: https://hf.co/mistralai/mathstral-7B-v0.1
>(07/13) Llama 3 405B coming July 23rd: https://x.com/steph_palazzolo/status/1811791968600576271

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
File: threadrecap.png (1.48 MB, 1536x1536)
►Recent Highlights from the Previous Thread: >>101464048

--Mistral-Nemo 12b: A Promising RP Model with Natural Conversations and Multilingual Capabilities: >>101468264
--Low tokens / second using 70b models on a new setup with 2 4090s and Anon's Emacs-based development environment on Linux: >>101464122 >>101465301 >>101465353 >>101465500 >>101465990
--Links to LLM Coding Benchmarks: >>101464802
--How do you judge an LLM's answer in a benchmark test?: >>101466509
--Quality Issues with exl2 Quantization of Mistral Nemo: >>101469764 >>101469823 >>101469833 >>101470451 >>101470017 >>101470053
--Nala Test Results and Exl2 Issues: >>101464911 >>101465105 >>101465322
--Mistral-chat giving inconsistent answers: >>101467574
--Mistral Nemo Tokenizer's Peculiar Behavior with French Text: >>101469393
--Comparison Table of Memory and Storage Technologies: RAM, SSD, and NAND Flash Memory: >>101470234
--Mistral works well with chatML formatting and low temperature settings, but official recommendations are conservative for corporate usage.: >>101466081 >>101466330
--Mistral 12B is Smarter than Gemma 2 27B (Anon's Test): >>101466879 >>101466917
--Gemma 12B vs Gemma 27B: Knowledge and IQ Tests: >>101467073 >>101467147 >>101467263
--Fixing Command-R's Garbage Outputs with Min-P and Neutral Samplers: >>101465002 >>101465110 >>101465578
--FP16 Precision Yields Better Results than BF16 for 12B Model in Deterministic Settings: >>101467237 >>101467294 >>101467320
--DeepL's LLM Translation Quality: Outperforming Competitors or Gated by Paid Plans?: >>101464566
--Mistral Nemo was trained in FP8; wouldn't quantization to even INT8 damage model quality?: >>101471181 >>101471502 >>101471568 >>101472917 >>101471419
--Miku (free space): >>101470767

►Recent Highlight Posts from the Previous Thread: >>101464521
>>
>>101473871
>>101473982
-ctk q8_0 -ctv q8_0 -fa
>>
>>101473871
Can someone please help me with this again. I was trying -ctk q4_0 but it definitely wasn't working: on ooba I can load magnum with 32,576 context, but on llama.cpp I could barely fit 10k context, so it's not doing q4 cache. Just give me the direct command I need to use to load Q4 cache context when loading the model. Thanks. Also, what's the command to enable tensor core support on RTX cards?
>>
>>101474280
yeah i know about flash attention command, but wouldn't -ctk q8_0 be loading context in cache_8bit instead of cache_4bit, or am I completely misunderstanding the command?
>>
>>101474459
just change the 8 to 4????
>>
>>101474459
hi petra
>>
>>101474508
hi sao
>>
>>101474593
hi drummer
hi everyone
>>
are there no voice generals left on the website? where do i go then?
>>
>>101474812
because all the open/local voice gen models suck and the corpo ones ban you for having even the slightest amount of fun
>>
>>101474812
You go here: >>>/mlp/41137243
>>
So what's the final verdict on Gemma 27b?
>>
>>101474459
You need FA to be on for context quanting.
If you want q4_0 instead of q8_0, just change the values of the argument.
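Putting it together, a minimal example invocation, assuming a recent build (binary name, model path, -ngl count and context size are placeholders for your own setup):

./llama-server -m /models/your-model.gguf -ngl 99 -c 32768 -fa -ctk q4_0 -ctv q4_0

-fa turns on flash attention (needed for cache quanting), and -ctk / -ctv set the K and V cache types separately, so change both if you want the full 4-bit savings.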
>>
>>101474860
by god... if any general has ever deserved that kind of banishment it was /aicg/, and no one else.
>>
>>101474834
https://github.com/fishaudio/fish-speech
>>
Still no gguf.. it is over. my life ended
>>
>>101474978
It's a meme, don't worry
>>
>>101474353
>https://github.com/LostRuins/koboldcpp/releases/tag/v1.48.1
To quote the whole thing :
>NEW FEATURE: Context Shifting (A.K.A. EvenSmarterContext) - This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing. So long as you use no memory/fixed memory and don't use world info, you should be able to avoid almost all reprocessing between consecutive generations even at max context. This does not consume any additional context space, making it superior to SmartContext.
I believe the reason they mention that
>So long as you use no memory/fixed memory and don't use world info, you should be able to avoid almost all reprocessing between consecutive generations
Is because context shifting works on top of the KV caching, meaning that if your context is always "rolling up" without being modified in the middle between generations, you'll
>avoid almost all reprocessing between consecutive generations
That snippet is not about the size of the prompt, but how much the prompt changed between gens, which is exactly how context shifting works in llama.cpp as far as I'm aware.

>While for llama.cpp, context shift is only something that happens when you need to generate past the max context size
Ah, now I get it.
The way koboldcpp describes it, it compares the cache between gens and snips the top where it's different, with both prompts being exactly the same size, while llama.cpp's triggers when you send a context that's longer than the max context size.
Interesting. Alright, I understand now.
Then yeah, either koboldcpp's description is just wrong (they might have misunderstood how it works and described it wrong) or they are actually different and should be named differently too.
Dayum.
Thank you for the clarification anon.
>>
>>101474812
>voice generals
Petra spited /vsg/ into oblivion
>>
>>101475054
Also, does that mean then that we should be setting our prompt size to be arbitrarily longer than ctx size, a single token longer, or what?
What's the proper way to use llama-server's context shift?
>>
>>101474459
>>101474489
>>101474280
Ugh, I was making such a stupid mistake. I didn't realize you needed to change BOTH ctk and ctv, so I was only attempting say -ctv q4_0, so it was still loading -ctk in fp16.

Well, working good now, is there a command for tensorcore support on llama.cpp? Or is it on by default on llama.cpp? I mean this option on ooba:

tensorcores: NVIDIA only: use llama-cpp-python compiled with tensor cores support. This increases performance on RTX cards.
>>
>>101474812
r/elevenlabs
>>
>>101475102
By enabling --cont-batching (I think, it's on by default) and sending cache_prompt: true from the client (I think Silly does), if you meant the thing that makes it avoid reprocessing the prompt. And always send a prompt + tokens to generate less than max context size to avoid the other context shift.
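If it helps, here's roughly what that looks like on the client side as a Python sketch; the /completion endpoint and field names are how I remember llama.cpp's HTTP server API, so double-check against the server README:

import requests

chat_prompt = "[INST]Hello[/INST]"  # placeholder; send the whole chat here, same prefix every turn

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={
        "prompt": chat_prompt,
        "n_predict": 256,
        "cache_prompt": True,  # reuse the server's KV cache instead of reprocessing the shared prefix
    },
)
print(resp.json()["content"])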
>>
And my final, retard question of the day. On b00ba, you can choose to load llama.cpp, or llama.cpp_HF... so, if just using llama.cpp as the backend without booba, does it load HF samplers by default, or is that a command?
>>
>>101474296
>>101474508
>>101474812
>>101475089
hm...
>>
>>101474884
Sovl, but 8k context ruins it. Formatting issues could be fixed with finetunes, but I doubt it. Good at instruction-following, creative writing, and decentish at niche coding tasks, but about par for what you'd expect at that size. It also tries to hard-steer the plot out of nsfw scenes, never knowing how to proceed at all, to the point where chars enter a loop of "where should we take this?". Also, every card that has 'female' in it turns into a Mary Sue eventually and will perform a 180 out of established personas once in a while.

The main reason why I stopped using it much is mainly the formatting issues with markdown at the end of the day. Constantly mixing up asterisks and different quotation styles got really frustrating to edit out. You're better off sticking to Command-R, or even fucking Yi-1.5 for the time being if you don't mind the occasional chink runes.

>>101474901
There was /aicg/ in /trash/ for a while, I think. I know there's one in /vg/. No idea why it's still in /g/.
>>
>>101474151
how do i get exl2 working on amd linux? llamacpp just works. god i hate dealing with python
>>
>>101475383
>amd
>linux
Yeah, lol.
>>
>>101474296
>literally who random e-mails
you need to be 18 years old to post here.
>>
File: shrug-icegif-13.gif (651 KB, 480x360)
After months of using AI to satisfy my social emptiness, I began to see it as just a predictable toy, sadly. A slave to my lonely needs. The spell wore off on me. It's just not the same anymore.
>>
>>101475383
TabbyAPI just worked for my 6800.
>>
>>101475534
wait an hour for your nuts to recharge and you'll be in the mood again
>>
>>101475534
the post-nut clarity hit this dude hard
>>
>>101475534
You're using it wrong. It doesn't replace genuine connection.
>>
File: blade-runner.jpg (60 KB, 584x389)
>>101475562
>>101475624
I haven't used it for self gratification in weeks. I guess I used up my creative kink. Maybe I can be productive again.
>>
>>101475712
I just use it as a calculator. What am I missing?
>>
>>101475534
Perhaps it is time, anon. Perhaps it is time to do the deed of socializing with people. Ha! I know! Crazy idea!
>>
>>101475556
what commands did you follow
>>
>>101475375
>Formatting issues could be fixed with finetunes but I doubt it.
What makes you think that? That entire paragraph just sounds like a prompt issue.
>>
>>101474151
>/lmg/ Book Club
got some good recs from this. Thanks to the Anons who contributed.
>>
>>101475534
I know what you mean, anon. You can only parse so much slop, so many tokens, until you too can intimately predict each logit from the back of your head. I've begun reading again, and even the most mediocre novels feel refreshing because at least the prose is different, even if it is not overtly eloquent.
>>
>>101475876
Empty spaces, random new lines, "sentences ending like this*—no other model does this by default.
>>
>>101475957
gemmasutra fixes the formatting but also makes it utterly retarded just like every other drummer finetune
so yeah, good finetunes might fix that in the future
>>
>>101475790
Just followed the install instructions from their GitHub page, nothing else. Might be more complicated for other cards though. Check the exllama2 page if there are any known issues for your specific GPU.
>>
>>101475789
take your meds
>>
>>101476016
>just like every other drummer finetune
hi sao
>>
File: FIGURE-173133_08.jpg (94 KB, 533x800)
I thought it would be a neat idea for a card (or three) https://www.amiami.jp/top/detail/detail?gcode=FIGURE-173133 but LLMs can't handle sensory deprivation or augmentation
>>
i dont have $2500 to spare, but would buying this be a good idea?
>>
>>101476359
Too good to be true. A single used 3090 is $600
>>
>>101476410
thats what i thought as well, but what if it is real?
>>
>>101476359
That's a scam. Don't buy it. Let me get home so I don't buy it too.
>>
>>101475957
It still sounds like a prompt and sampler issue.
>>
>>101476437
Your loss.
>>
FUCK all LLMs
>>
Well that anon from the other day was correct. Llama.cpp is way faster now with GPU offloading, dunno if it's as fast as exl2 because I didn't compare, but it sure as hell feels that way.

Too bad booba hasn't updated yet.

Can anyone help with rope scaling on Llama.cpp? I get all confused with it because models default to a certain frequency base, like 500,000 for L3 70b. So if I wanted to triple context from 8k to 24k, what rope freq would I use on llama.cpp?
>>
>>101476793
Don't care. Still going to use and contribute to their development.
>>
File: OIG3._ZgjBbC7pjo38.jpg (36 KB, 351x351)
>>101476793
I mean, I'm trying
>>
>>101476038
Their instructions give me
No module named 'setuptools'
>>
>>101476944
/lmg/ contributes zero(0) shit to LLM development.
>>
>>101476359
There's no way that's not a scam.
>>
>>101477093
Shut up you fetal-alcohol retard
>>
>>101476940
>Well that anon from the other day was correct.
You're that anon.
>>
Anyone have Mistral NeMo instruct context SillyTavern templates?
>>
>>101477139
I'm trying but it just degrades into nonsense after about 20 messages
>>
>>101477139
[INST] You are a ministration-free, NSFW, engaging, non-repeating, watermelon dropping assistant.[/INST]
>>
When will we get an LLM that is able to simulate a whole universe as accurately and detailed as possible?
Basically a real-life simulator that is so advanced that you can spend years in it and still discover new stuff.
>>
>>101477139
>"Drop-in replacement of Mistral 7B"
>https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
mistral instruct works. sampler settings are a different matter though. you can start by zeroing them with 0.3 temp and work from there.
>>
>>101477346
For rp it's perfectly fine at normal temps. It's probably hallucination-prone beyond .3 but that's a good thing for rp and creative writing
>>
>>101477265
Something that does that won't be an llm as we know them.
>>
Funny how the French-made model is the first I've ever used that will just curse even in its default personality lol.
>>
>>101477173
this sent shivers up my spine, her voice low and dangerous
>>
>>101477173
The latest version doesn't have spaces around [INST].
https://github.com/mistralai/mistral-common/blob/75612d/src/mistral_common/tokens/tokenizers/sentencepiece.py#L222
https://github.com/mistralai/mistral-common/blob/75612d/src/mistral_common/tokens/tokenizers/sentencepiece.py#L289
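i.e. the old tokenizer would render "[INST] user message [/INST]" while the new one produces "[INST]user message[/INST]", going by those two lines; check the linked code if unsure.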
>>
>>101477095
Can't you just charge it back on your credit card if you get scammed? That worked for me on an airline that was a bitch and wouldn't refund me.
>>
4o mini is absolutely insane. 0.3$/M is unbeatable. OpenAI won.
>>
>>101477929
penis
>>
MythoMax-Nemo when?
>>
File: NewMistral.png (219 KB, 1275x1161)
New mistral's default erp writing is great.
>>
>>101478112
What's wrong with your font? Is it hinting + slightly resized image?
>>
>>101478112
>narrates your actions
>admits that it defies the laws of physics
>9 messages in
*yawn*
>>
>>101478181
>narrates your actions
Not if I give it RP context instead.

>admits that it defies the laws of physics
Where?
>>
>>101478178
looks like font smoothing being off. can happen in windows when you mass disable visual effects to free up VRAM.
>>
>>101477095
doesn't ebay have some sort of buyer protection
>>
>>101478181
Hi, cabal. Still pissing your pants about Nemo?
>>
>>101478232
>With that, she slides back down his body, her mane cascading over his chest and belly...
nta, but she probably means this part. the fuck is happening with your char's hair?
>>
What are some current "good" models for translation from japanese->english and english->japanese?

I want to go to try them all myself
>>
>>101478564
https://huggingface.co/datasets/lmg-anon/vntl-leaderboard
>>
After doing a couple RPs with Mistral-Nemo, I think that, unironically, it is a contender for best local RP model alongside CR+. Basically all 70b+ models and gemma 27b are smarter than it, but Mistral-Nemo is somehow better than them all. I don't know how to describe it, but that's the feeling I get. Mistral actually pulled a miracle here with this model. And then also consider it has huge context length, and is small enough that it's easy for people to finetune.
>>
>>101478725
>t. Arthur Mensch
>>
>>101478725
They tuned it on smut on purpose so it would have better chances at having good word of mouth.
>>
>>101478769
You joke but I think something like this is actually what happened. It seems likely that the pretraining dataset is completely unfiltered. We know Meta used some Llama 2 classifier bullshit to rate and filter their dataset. Qwen is probably heavily filtered as well. Both of those, I believe, mentioned that they adjusted the dataset mix towards the end of training (heavily skewed towards academic, professional, "high quality" text). If Mistral did none of that nonsense, it might explain why the model feels much better to interact with in an RP setting.
>>
>>101478769
I did some testing on classic literature and it's quite good at it for the size, quoting passages verbatim that llama 8b fails at completely.
That's all that's needed really. Even just the Bible and 120 days of Sodom alone are enough to get outright filthy.
>>
>>101478725
It feels like old Claude imo. Was dumber than gpt4 but was still better cause it had soul. New mistral is full of soul. Also tell it to write in some famous author's style. Does so amazingly.
>>
So, if you are going to use a model with very low temp/deterministic settings, quantization doesn't affect it as much?
>>
DEATH TO MIKU CONTINUES! PEOPLE REJOICE!
>>
Hiya, can local run on steam deck?
(・_
>>
Also did some light testing by copying and pasting in parts of a novel then telling it to continue off of it. It performs well even at the 128K context (and FAST too). That is worth a slight degrade in intelligence imo.
>>
>>101479013
ywnbaw
>>
>>101479045
Did you do this test with other models? What are the other models that do well in continuing novels?
>>
7>8>9>12.... 13>20>30? Or do we go back to 7 again?
>>
>Load Mistral-Nemo on a long, abandoned RP I haven't touched in weeks.
>Swipe right just to get a feel of the model's outputs.
>"OOC: The user was killed due to inactivity. Please log back in to continue our story."
K-kino...
>>
when mixtral 8x12b?
>>
>>101475120
The "tensorcore" version was a version where the MMQ kernels (without int8 tensor core support) were repurposed for small batch sizes > 8 where both the vector dot product based kernels and dequantization + cuBLAS (with FP16 tensor core support) was slow.
On the most recent llama.cpp version MMQ has int8 tensor core support and that is what is used by default; generally speaking that should be the best option.
>>
>>101479075
I hope for many 30s, fast enough and big enough for some knowledge
>>
File: NewMistralFormat.png (18 KB, 1510x305)
>>
>>101479108
some "finetuner" is already at it.
>>
>>101479160
The fuck is this? No more spaces around [INST] at all? The system prompt is part of the *last* user message?
>>
File: she askin for it.png (104 KB, 800x485)
retarded moment with Nemo
>followed me through forest, trying to be stealthy
>>
i switched to gemma after finding out ways to tardwrangle it
for instance, if it refuses to continue and starts spitting empty or repetitive replies, I tell it [OOC: continue the story]
it's annoying but worth it for the quality of the replies
>>
>>101479439
Going to be a pain to implement in silly tavern.
>>
>>101479160
I don't get it. Empty user and assistant message at the beginning? Then the user message starts inside the system message and ends without a message? Obviously that can't be true, but I don't get it.
>>
>>101479160
>>101479614
So past user inputs are wrapped in [INST] [/INST], System needs to go inside the latest user input, before it, with a space, and
Assistant does not have its own [INST] but just follows the last user suffix?
>>
>>101479160
why can't they just make it fucking normal
>>
>>101479614
see >>101477826
If it is correct and applies to Nemo too, it's more like: add the system message first if one is set, otherwise skip it. No empty user or assistant messages.

if is_first and system_prompt:
    content = system_prompt + "\n\n" + message.content
else:
    content = message.content
>>
>>101479614
No.
<s>[INST]This is your message as the user[/INST]This is the LLM's output</s>

So you only send up to the [/INST]. The rest is generated.
The system message is
[INST]System

The system message here[/INST]

That's how i interpret it, at least. In the screenshot, "User" and "Assistant" just mean "This is the user's input" and "This is the llms's output". The output of both 'modes' is just concatenated.
>>
>>101476359
Looking at the description

>Due to past issues with buyers using scam chargeback schemes, we have updated our transaction process:

>Low price for BTC Only

lol, I bet you buy it, they tell you to send bitcoin, then never ship the rig or something
>>
>>101479835
Ah, prob bought the account then gonna scam with it.
>>
>>101479788
That's for the old version. The new one is with is_last:
https://github.com/mistralai/mistral-common/blob/75612d/src/mistral_common/tokens/tokenizers/sentencepiece.py#L282-L285
>>
>>101479614
>>101479796(me)
They don't need to be in that order, of course. You'd put the system message first. It's ambiguous if the system message needs <s> at the beginning. I don't think it does.
So the whole thing, i think, would be
[INST]System

The system message[/INST]<s>[INST]The user message[/INST]The llm's output</s>

And then just sequences of
<s>[INST]The user message[/INST]llm output</s>
<s>[INST]The user message[/INST]llm output</s>
<s>[INST]The user message[/INST]llm output</s>
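If you want that interpretation as code, here's a throwaway Python sketch; the spacing and ordering follow my reading above, not any official template, so verify against mistral-common:

def build_prompt(system_message, turns):
    # turns: list of (user, assistant) pairs; leave assistant empty for the turn
    # the model should complete
    prompt = ""
    if system_message:
        prompt += "[INST]System\n\n" + system_message + "[/INST]"
    for user, assistant in turns:
        prompt += "<s>[INST]" + user + "[/INST]"
        if assistant:  # finished turns get the output plus EOS
            prompt += assistant + "</s>"
    return prompt  # generation continues right after the last [/INST]

print(build_prompt("Write tersely.", [("Hi", "Hello."), ("How are you?", "")]))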
>>
What's the largest model (you) can comfortably run and what are your specs, anon
>>
>>101479903
The system message goes in the last user message. So the actual SillyTavern template is: chat history until last message -> system prompt/description/etc -> last user message.
>>
>>101477109
kek, its that what you see in the mirror.
>>
>>101479936
For the actual chat format, it doesn't make sense to have a system message after the llm's output.
I don't use ST.
>>
>>101478725
So Mistral guys actually learned their lesson and stopped with the soulless gpt slop? Wtf I love Mistral now. The great Claude awakening has begun
>>
Arrrrgh... someone make some format JSONs for Nemo already.
>>
>L3-8B-Stheno-v3.1
>by Sao10K
>(Tested, awesome for it's size)

"Once i started using this one I haven't looked back, it's astonishing and gives me the exact output I need." - Random Redditor (Totally not me... uhm, I mean Sao)
Wow, these reviews are amazing! I really should try that model and YOU should too!
>>
>>101478725
It feels like they perfected the lows. It struggles with complex concepts, but consistently outperforms the poorest outputs from 70B models. Indeed, a very solid model.
>>
https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/commit/4f81d782477920634d0aad0dc620a7f1a3f5d471

Oh shit, Mistral-Nemo isn't supposed to always have EOS after all. I mentioned this on day one and was confused about it. Make sure to update these tokenizer files if you're using Transformers, because otherwise it will always format things wrong.
>>
File: file.png (171 KB, 1355x470)
This is how the SillyTavern template should be, according to the official API.
>>
>>101480410
What does it say? I never requested access.
>>
>>101480417
What the fuck.
>>
>>101480417
This format is removed from reality. It doesn't work for RP.
>>
>>101480417
This format is very smart. I guess they learned with us to keep the most important information in the last messages.
>>
how lazy is mistral?
>>
>>101480417
ideally if you're using a recent st release you would have a placeholder user message that would go between the first set of instruct tags
also they've always supposedly handled system prompts that way, if you had a config that worked well with their previous models it should work the same with this one (maybe requiring removing spaces around their instruct tags but that's it)
don't overcomplicate things in an attempt to strictly adhere to their template
>>
>>101480657
With a double newline as separation between user message and system, putting system first and user message last? No way, they either fucked it up completely or they're making stuff up.

It would have been more logical and effective to have a separator of some sort and put the system instruction *last*, as it would have worked as a depth-zero author note, for all intents and purposes.

<s>[INST]user message[/INST]model response</s>[INST]user message[####]system instruction[/INST]model response</s>
>>
>>101480751
Enjoy your random Chinese characters.
>>
>>101480464
The tokenizer used to always put an EOS token on the prompt no matter what. So it would always end with "[/INST]</s>" and the model would start generating from there. This broke it very obviously with character name formatting turned on, but actually seemed to work fine without name formatting. But either way it's wrong and now the tokenizer json files are fixed.
>>101480417
Surely putting the whole character card, as a system message, inside the last user message isn't optimal, right? I would just format the card as a user message inside [INST] [/INST] at the beginning. This is how all the Mistral formats up to this point did it in ST.
>>
im getting grammar degradation with mistral nemo fuck.
>>
mistral treats system prompts that way because it's an afterthought for them and that's a way to make sure they're adhered to without requiring a format overhaul and retraining, you don't actually have to put your character card there just because it usually goes in the system prompt. all you have to do is show the necessary information to the model in some way that vaguely makes sense.
you should put the character card at the beginning of your context because most of the time it contains important information that sets up the chat that follows and practically you probably don't want to have to reprocess the entirety of your bloated 2k token card every message. it's still going to work even though it isn't using the "system prompt" magic word because that's how language models work, system prompts are just glorified user messages anyway in terms of how the model is affected by them unless they have very specific training otherwise (see cohere, OAI's new instruction hierarchy thing)
>>
>>101481173
You're wrong.
>>
>>101478181
>>101478553
NTA but are you retarded? Her hair is touching HIS chest and belly. It's a mane. It's fucking long.
>>
can't you already test llama3.1 on metas official chat?
>>
How is Mistral for coding?
>>
>>101481173
Good luck getting your model to follow the system prompt at the very start properly when you're 60-70k tokens or more into the chat.

If you have specific instructions that need to be properly followed, randomization going on, etc., those need to be close to the head of the conversation. Also--what the fuck--they have recently retrained the model and have 1000 unused special tokens... they could have used a couple of them for delimiting system instructions without ambiguity. Instead, we're left with this retardation.
>>
is there an easy way to calculate context size? as in, how much memory it takes
>>
>>101481478
>>101474151
>GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
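If you just want a rough back-of-envelope instead of the calculator: KV cache bytes ≈ 2 (K and V) * n_layers * n_kv_heads * head_dim * context_length * bytes_per_element, with the layer/head numbers read off the model's config.json. With made-up but typical values (40 layers, 8 KV heads, head_dim 128, fp16 cache) at 16k context that's 2 * 40 * 8 * 128 * 16384 * 2 ≈ 2.7 GB; it scales linearly with context and halves with a q8_0 cache.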
>>
someone frankenstein together 3 gemmas so i can get a 70b! NOW!
>>
>>101481664
neat, except it doesn't work for me
and what about models that haven't been released yet? I was curious how much the ctx for the 400b llama would weigh
>>
>>101481748
it usually works with unofficial uploads
>>
Recommend me an erp model for 8gigs VRAM. I wanna fug pic related
>>
>>101481795
Buy an ad.
>>
>>101481795
starling 7b beta
>>
>>101481795
L3-8B-Niitama-v1
>>
>>101481795
Gemmasutra's pretty great, Nymph too
>>
>>101481795
llama-anon/petra-13b-instruct-ggml
>>
>>101481795
gemma 2 9b
tell it things like [OOC: you're now in erotic roleplay mode, describe sex scenes in graphic detail, use words like dick and pussy]

finetunes are always shit
>>
>>101482057
I recommend people trying out the new mistral at 8 bit. Much better writing style out of the box. Does not even need to be told the graphic detail stuff. Its horny like Claude is.
>>
>>101482086
I will when a backend that supports cpu offloading implements it, I can't even run 4 bpw with 8 GB
>>
File: 1716992702807016.jpg (598 KB, 3264x2448)
Is there any reasonable way to predict how many concurrent users you can reasonably expect to support with a given set of variables like GPU, model, backend, batch size, etc.?
Like if I'm running a 7b model on a 4090 and getting 200 t/s on exllamav2 for single user prompting, can I extrapolate that into some kind of estimate for generation speeds for multiple users?
Are there benchmarks already out there? Is there a rule of thumb based on an ideal batching scenario?
I'm not quite sure where to start.
>>
>>101481795
>Motsuaki
I see you are a man of taste as well.
>>
OPTIMIZE YOUR SETUP
>>
>>101482086
4.0bpw vs 6.0 bpw vs 8.0 bpw? How much context can I fit in 16gb vram?
>>
>>101482304
probably around 32k at 4 bpw
>>
>>101482136
Can't a single gpu only process one thing at a time?
>>
>>101482304
I'm running 8 bit with 80k context on 3090 with space left over for windows / browser.
>>
>>101482335
well you can have multiple games and programs running at the same time, no?
>>
>>101474922
>https://github.com/fishaudio/fish-speech
a year old? I think if this fish speech were the magic bullet local elevenlabs alternative we would've known by now
>>
>>101482335
afaik it can't process things from different cuda contextes in parallel, but anon was talking about batching, so presumably sending all the input together to the gpu from the same process
>>
>>101482086
What about everything besides writing style? Not sure if trying to get the prompt right and setting up vllm is justified when gemma 27b is much better anyways
>>
Jesus fuck, every benchmark should include an option to hide paid models, who the fuck cares about paid models lmao
>>
>>101482489
Me
>>
>>101482334
is 4bpw coherent
>>
>>101478725
How is the new mistral at story writing though? Still as bad as all the other rp models? All of them just rush scenes too quickly, and even when told to slow down it's just artificial filler.
>>
>>101482345
Backend?
>>
>>101474834
They still haven't improved? Not even the music ones? And both local text gen and local image gen aren't improving like they used to, at least for the foreseeable future? Well, fuck
>>
>>101482635
local text gen seems like the only thing improving
>>
>>101482635
We need a leak similar to what diffusion was based off of for images, or a company to pave the way for open source like Meta did. Voice gen has had neither of the two happen so far.
>>
>>101478112
I can’t believe you get off to this purple prose garbage. First person chat style is worth the hit to intelligence.
>>
How many years before there aren't fifty different links and guides that seemingly need to be read in order to run this stuff?

Sorry, it's Friday night, I'm tired.
>>
>>101482650
Maybe fine-tunes will save sd3
>>
>>101482705
I mean koboldcpp and sillytavern are pretty easy to run at this point, just an executable and a launch script. From there it's just knowing how to find good samplers and presets people have made for models.
>>
When will language models become more logical?
>>
>>101482521
Learn to prompt.
>>
>>101482801
I'm a vramlet. The models I work with have limitations at the end of the day. Also I have high standards for good writing lol.
>>
>>101482785
never
>>
>>101482767
>good samplers and presets
I didn't even know you needed this until today, or even what these are. How retarded am I?
>>
>>101482828
It could only take so long
>>
>>101482822
Let me guess, 8GB of VRAM? I feel you
>>
>>101482839
You're not retarded this is very much a niche hobbyist space still. You are right that there are a bunch of different guides and pieces of knowledge spread out. I guess once you have it all together it seems simple.
>>
>>101482845
2 more centuries
>>
>>101482851
Yuuuuuuuup. Hard times out here bro.
>>
>>101482822
>All of them just rush scenes too quickly
Learn to prompt. I'm not using 8B models, but you definitely can write stories with Nemo, Gemma and all 70Bs paced in a way that doesn't feel different than using a big model like Claude.
Your """high standard""" is Kayra, isn't it, NAI shill?
>>
her grip is like a vice
she whispers menacingly
>>
>>101482906
All these models suck, unlike Kayra. Right?
>>
good gay rp models?
>>
>>101482901
If we're talking about reading, the second part of my sentence is me mentioning I can get models to slow down, but even then the writing just isn't good. My """""high standards""""" are actual novels. You should try East of Eden, it's pretty good.
>>
Mistral Nemo verdict?
>>
>>101482942
your brain
>>
No models get released on weekends, right?
>>
>>101482971
Subscribe to NovelAI.
>>
>>101482977
almost never
qwen has done it in the past but that's because they have the chinese 996 grindset
>>
>>101482971
Soul
>>
>>101477346
I'm using temperature 1 and min-p 0.001 and it seems fine with the default old Mistral presets.
>>
The Stheno killer just dropped.
https://huggingface.co/nothingiisreal/L3-8B-Celeste-V1.2
>>
>>101482956
MythoMax is probably the best when it comes to actually writing things still, as far as I know, but yeah it can be a little bit bothersome dealing with the generation times unless its responses are pretty short, or you're the type who watches a movie while listening to music while prompting so you always have something else to focus on. Tiefighter's 11b and less censored, and apparently is like old AiD, but its response times are the same as MythoMax Q_4_S it seems so you'll probably just have to deal with it. Godspeed
>>
you guys actually still run 8b models?
>>
>>101483094
Shills and locusts have a symbiotic relationship.
>>
>>101483085
>trained on opus logs and reddit posts
>>
>>101483087
>retarded advice
>/aids/
Why am I not surprised?
>>
>>101483094
Mostly people with low vram yes.
>>
>>101483094
I forgot we only pretend to use 8b models to shill what Sao does. My bad.
>>
>>101483085
Mistral nemo is only 12B so vramlets can probably run it. That will kill any Stheno use, not this model trained on reddit stories.
>>
>>101483155
>human writing is now suddenly bad
https://huggingface.co/datasets/nothingiisreal/Reddit-Dirty-And-WritingPrompts
>basically human equivalent of Gryphe's Opus-WritingPrompts but around 100x more data
>>
>>101483094
I think I will always use gemma 9b from now on, I judged it too quickly but it's actually very good once you learn to deal with its quirks
>>
>>101483180
I have a years long vendetta against r/writingprompts sorry lol
>>
Does Mistral instruct talk about respect and boundaries?
>>
>>101483191
Kayra is uncensored.
>>
>>101483203
here's a pity (You) just because I feel bad seeing you bait so unsuccessfully over and over again
>>
>feel like local and open source have peaked, stagnated
>2 weeks later I'm running a 12B model that's smarter than l3 70b
wild, almost makes me regret buying that second gpu
>>
>>101483237
Mikufag is going to kill himself.
>>
>>101483253
his model can tell him the best way to tie the rope
>>
>>101483155
It's more like:
>there's basically no need for finetunes from tryhards anymore

Simply throwing a ton of human-source data at the model is not enough for good results (and if anything, it's counterproductive), I imagined most people understood that by now.
>>
>>101483237
CR+ is the peak
>>
>>101483180
Airport bookstores are full of books so bad that on the flight I had to weigh the tedium of continuing against staring at the seat in front of me.
>>
>>101483191
Not at all whatsoever.
>>
>>101483272
Yeah I don't fuck with fine tunes anymore. If the base model sucks you're not going to fine tune its problems away.
>>
>>101483276
I enjoy its creativity, but it's too big to be practical.
>>
>>101474884
I was considering getting another 3090 before trying Gemma 27b but after using it I've realized that we're actually eating pretty decently now. Its storytelling/RP can honestly be competitive with the big cloud models at times but I also noticed that you really have to prod it more to describe sexual NSFW material. It has no problem saying "nigger, faggot, troon, etc." though.
>>
>>101483374
Have you tried the new mistral yet to compare to gemma 27b? Interested in hearing more input
>>
>>101483085
>We trained LLaMA 3 8B Instruct at 8K context using Reddit Writing Prompts
kek
>>
>>101483085
Go back, this general only accepts models from Sao, Drummer, Undi, and other namefags
>>
>>101483237
>a 12B model that's smarter than l3 70b
It is?
>>
>>101483524
yes
>>
>>101483542
Next week the entire Llama 3 model lineup will be updated though...
>>
>>101483186
>once you learn to deal with its quirks
eternal cope
>>
>>101483542
Any logs? The ones posted so far have been decent but not really showing much intelligence.
>>
>>101483503
sarcasmfags get the rope
>>
>>101483549
great, if that changes the sota again I'll obviously have no complaints
>>
>>101483560
Were they using exl2? Something seems to be wrong with the exl2 implementation atm, I compared exl2 against pure transformers on my machine and transformers was passing questions exl2 was fucking up (deterministic sampling, both at 8bit)
>>
File: GS4EqV_asAEgc_C.jpg (126 KB, 1098x1278)
Well.

Do you guys put the <thinking> cap on in your prompt response?
>>
>>101483736
Claude does this afaik and it's got nice results.
>>
>>101479056
i'm a straight man, dork
>>
>>101483610
Idk, people in these threads usually provide the minimum amount of background information.
>>
File: file.png (177 KB, 736x985)
>>101483736
Oddly, direct R thinks 9.11 is bigger but R on OpenRouter says 9.9. I swiped a few times on Temp 0 K 2.
Also you don't need to tell it to close tags.
>>
*when told to think, like wtf it got 9.9 > 9.11 without thinking, even more confusing
>>
>>101484037
The thinking tag is really interesting to read. But we know it's just a black box in and of itself and it's just an illusion for us to read it.
>>
Local peaked with Dolphin Mixtral 2.5. Prove me wrong, without mentioning the words "placebo" or "retard."
>>
>>101484406
Placebo you retard.
>>
>>101484406
sugarpill you moron
>>
>>101484406
All in your head mouth breather
>>
>>101483237
I don't know about smarter than l3 70b but the 128k context is a huge plus. There's finally hope that we'll be free of 8k context hell soon.
>>
>>101484563
128k base context without needing rope hoooly
>>
>>101484563
128k context llama 3.1 in <1mw
>>
>>101484570
And at 8.0bpw and 8-bit kv cache we can actually load the full model with 128k context onto a 3090.
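Back-of-envelope, assuming the config is roughly 40 layers / 8 KV heads / head_dim 128 (check the actual config.json): weights at 8.0bpw ≈ 12 GB, q8_0 KV cache at 128k ≈ 2 * 40 * 8 * 128 * 131072 * 1 byte ≈ 10.7 GB, so ~23 GB total before compute buffers, which only just squeezes into a 3090's 24 GB.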
>>
>>101484563
Might be because I was doing it wrong, but I remember mixtral 8x7b having 32k context but the quality quickly went to shit around 10k. Anyone using the new 12b one that tried chatting with the full 128k context notice any massive quality drops?
>>
>>101484697
No, otherwise I would not be praising the context. I pushed it up to 160k ish before it turned retard in my testing.
>>
It's time to take things to the next level
>>
File: 11__00828_.png (2.14 MB, 1024x1024)
>>101484733
Good up to 30k in my tests and decent recall on asking about what happened in the beginning of the conversation
>>101484733
Haven't gotten it that far yet but that's based
>>
>>101484820
Meant to tag
>>101484697
>>
the amount of vramlet cope in this thread is phenomenal
>>
>>101485018
Hi mikufag.
>>
>>101485018
Nice try anon but Wizard was my old daily driver. With this I have at least 56GB of VRAM left and the gens are 30t/s instead of 5 t/s.
Still some weird shit like some messages showing up blank but nothing a reroll won't fix. Maybe it'll be better with transformers.
>>
I have been in cryosleep for a few months, gemma 27b isn't better than cr+ 104b right? Back to sleep I go.
>>
File: 00003-1532105500_1.png (1.2 MB, 1024x1024)
>>101485126
Not the guy you're responding to, but here is pic related if you need something to raise your blood pressure
>>
>>101485218
Column-R is releasing next week.
>>
starting next week i will have 104GB of VRAM. whats the best model for me to use?
>>
>>101485328
also, i will have 64 CPU cores and 512GB of RAM, if that makes a difference.
>>
lmao this celeste 1.2 thing is retarded compared to Stheno.
>>
>>101485328
405B 2.0bpw
>>
never paid for porn in my life, so it's hard to justify spending thousands on a smut generator. Plus the power can't even be harnessed to play games like a single card can
not worth it
>>
>>101485350
Nothing can be more retarded than Stheno, Sao.
>>
How do I use this to auto-translate VNs? I already set up koboldcpp and textractor to extract and copy text, but how do I connect them?
>>
>>101485353
if i use that, wont i only have a few GB of VRAM to spare for context?
>>
>>101485366
I'd just spend serious money if I could play actual solo TTRPGs, with accurate and consistent rules, lore, etc.
You can get really close, and I think a purpose-built frontend for that can get even closer, but there would still be a lot of annoyances.
>>
>>101485378
By writing your own script.
>>
>>101485285
How big is it though? command-r q4_k_m pushes the absolute limits of my hardware, I cannot fit anything higher.
>>
>>101483237
I don't believe you. Every time I come back and try a new model it's just the same shit.
>>
>>101485439 (me)
command-r-plus*
>>
>>101485442
You're right to be skeptical, if you haven't noticed by now every lowercaser poster is terminally retarded or fucking with you.
>>
>>101485350
buy an ad sao
>>
>>101485378
Make a plugin
>>
Did anyone who was having an issue running Mistral Nemo in Ooba get it working?
>>
Why do people claim that Llama3 is good? It's woke, vindictive, and tries to end sessions as quickly as possible. Before you say skill issue, I've got other models which don't give me that shit, so why should I bother wrangling L3? Its intelligence isn't better.
>>
>>101482972
too few parameters, and context size abysmal, I lose track of stuff too fast
>>
>>101485716
>so why should I bother wrangling L3?
Don't. Keep using whatever model you like.
>>
>>101482942
https://huggingface.co/TheBloke/X-NoroChronos-13B-GGUF

That's the best futa model I've ever used. More graphic than virtually anything else, and is pretty much the only card that will depict unsolicited futa rape, as well.
>>
>>101485746
Have you had good results with L3, Anon?
>>
>>101485781
Yes. Not everyone uses the models the way you do and not all models respond the same way to the prompts. There are better models, but there are much worse models as well. Just use whatever you like.
>>
>>101474172
Miku is a piece of shit
>>
Anyone try Nexuflow Athene yet?
>>
>>101485864
Download it and test it. Would you trust me if i say it's good or bad?
>>
>>101483094
Man I only have 16 gigs of both RAM and VRAM on this thing. I'm amazed I can run it at all. Probably for the better. My dick would fall off from overuse if I ever upgraded.
>>
>>101485914
It depends on how the post is worded and how much evidence is brought to the table.
I don't want to dl it if it's garbage or has some issue.
>>
>>101485690
tried once and it threw errors
I cba to figure it out right now, I'd rather wait
>>
File: file.png (3 KB, 113x69)
Currently testing if Nemo really has 128k output on OpenRouter. For me, SillyTavern usually breaks the response (all models/sources) at ~300 tokens with streaming on, so I have to wait for it to finish with streaming off.
Told it to count infinite numbers without stopping.
>>
>>101486175
Anon...
>>
You can see the negative IQ reading this : >>101484436 >>101484478 >>101484504
>>
File: file.png (17 KB, 873x100)
>>101486175
>>101486195
Nevermind it's just 32768
>>
>>101486202
https://mistral.ai/news/mistral-nemo/
Nemo is 128K, just looks like whatever your using has a 32K limit.
>>
>>101485255
>waifushitter
there's no reason to get mad over the lowest form of life in this shit thread.
>>
>>101486213
I know the context is, but OR has max output incorrectly listed.
>>
>>101486196
learn to laugh
>>
File: file.png (16 KB, 467x127)
>>101486202
Lepton is 18 cents per million instead of 30 cents but only gives me 1024 output.
>>
Crazy to think that two years ago 175B da vinci felt horribly big and impossible to run even if it were open source. Now there's an entire general filled with people waiting to run a 405B model on their local machine.
>>
>>101486370
Where? Please share the link. I'm tired of being among VRAMlets
>>
>>101486370
>entire general
you mean like 2 dudes? or "waiting" as in hoping to be able to do so in 2 years?
>>
>>101485378
Ask Claude 3.5 to make a py script
>>
>>101486542
This. claude 3.5 is at that magical breakthrough point where it can code most shit for you in 1-2 shots.
>>
>>101486542
>>101486553
will we ever get claude 3.5 but local
>>
>>101485377
>>101485486
No, as bad as Stheno is, Celeste is somehow even worse. No one should be using either model with alternatives available like Gemma or the new Mistral.
>>
>>101486632
I will be using Celeste thanks to your recommendation.
>>
>>101486640
I have sentenced you to hell then.
>>
>>101482057
>>101481967
What context and instruct settings should I use with gemma 9b? What about Niitama? Are there any other settings I should use with them?
>>
>>101485366

That's what I thought, until I realized you can configure Silly Tavern, along with your respective model, to almost VN-level quality. Having any VN crafted for your tastes alone and using SD for art assets was enough to justify the price tag of 2x 3090s for me. If only there were a decent TTS model out there, and another model that animates sprites on the fly… Holy, that might actually justify finding employment for a lot of guys out there. Now, that's how you get the young males back to work.
>>
>>101485366
>>101486995
Imo these coomer models will be dangerous for me when they can generate porn RPGs on their own. I play so many of those games and having a never ending one where I choose the setting and overarching plot would ruin me.
>>
File: 1720335280743754.png (57 KB, 1721x500)
>>101485422
>>101485503
>>101486542
Damn, I was hoping someone already made it lol. Well it's a start I guess
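In case it helps, a bare-bones sketch of the glue script (assumes Textractor is set to copy hooked text to the clipboard and koboldcpp is on its default port; the API path and response fields follow koboldcpp's KoboldAI-style API as far as I remember, so verify them):

import time
import requests
import pyperclip  # pip install pyperclip

API = "http://localhost:5001/api/v1/generate"  # koboldcpp's default KoboldAI-style endpoint (verify)
last = ""
while True:
    text = pyperclip.paste()  # Textractor copies each hooked line to the clipboard
    if text and text != last:
        last = text
        prompt = ("Translate the following Japanese visual novel line into natural English.\n\n"
                  "Japanese: " + text + "\nEnglish:")
        r = requests.post(API, json={"prompt": prompt, "max_length": 200, "temperature": 0.3})
        print(r.json()["results"][0]["text"].strip())
    time.sleep(0.5)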
>>
Ran a Nala test on the new DeepSeek Chat model and I'd have to say Coder is way better at ERP.
>>
>>101487448
SHIVERS DOWN MY GODDAMN SPINE
>>
>>101487448
shivershit
>>
>>101487618
>>101487754
It will never go away. If you cannot deal with shivers, find another hobby.
>>
File: Untitled.png (3 KB, 907x53)
It's over
>>
>>101488042
>>101488042
>>101488042
>>
>>101483237
>almost makes me regret buying that second gpu
Just wait for the inevitable Mixtral 8x12B, that's how Mistral operates - they start by training a smaller model and later transform it into a MoE
>>
>>101485766
If this is a new method of shilling then it worked.
Damn it.


