/g/ - Technology


File: based_miku.jpg (268 KB, 1024x768)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>102001133 & >>101990712

►News
>(08/20) Microsoft's Phi-3.5 released: mini+MoE+vision: https://hf.co/microsoft/Phi-3.5-MoE-instruct
>(08/16) MiniCPM-V-2.6 support merged: https://github.com/ggerganov/llama.cpp/pull/8967
>(08/15) Hermes 3 released, full finetunes of Llama 3.1 base models: https://hf.co/collections/NousResearch/hermes-3-66bd6c01399b14b08fe335ea
>(08/12) Falcon Mamba 7B model from TII UAE: https://hf.co/tiiuae/falcon-mamba-7b
>(08/09) Qwen large audio-input language models: https://hf.co/Qwen/Qwen2-Audio-7B-Instruct

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench
Japanese: https://hf.co/datasets/lmg-anon/vntl-leaderboard
Programming: https://hf.co/spaces/mike-ravkine/can-ai-code-results

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
>still no ollama in OP
>>
>>102011473
This, no reason to use anything but ollama
>>
►Recent Highlights from the Previous Thread: >>102001133

--Paper: HMoE: Heterogeneous Mixture of Experts for Language Modeling: >>102002736 >>102003124
--Power limits on NVIDIA GPUs may not affect training ETA: >>102001326 >>102005554 >>102005728 >>102005772 >>102005780
--Phi-MoE model trained on 5T tokens in 20 days: >>102001770
--Phi-MoE and other models' performance discussed: >>102001678 >>102001734 >>102001769 >>102001911 >>102002036 >>102001827 >>102001863 >>102001920 >>102001866
--Phi-3-medium-128k QA session with some errors: >>102008402 >>102009238 >>102009306
--LLMs' performance on trivia questions and benchmarking: >>102001347 >>102001394 >>102001429 >>102001788 >>102001881 >>102001911 >>102002036 >>102001704
--IQ quants are just as fast as non-IQ quants in various scenarios: >>102004266 >>102004294 >>102004324 >>102004743
--Discussion on RAG limitations and model comparisons: >>102003340 >>102003590 >>102003852 >>102007813
--Anon plans a collaborative storytelling session with AI models: >>102002167 >>102002238 >>102002267 >>102002376 >>102002622
--Anon wants to stop model from generating character thoughts: >>102002232 >>102002413 >>102002919 >>102002445
--Pre-filter in kcpp 1.73.1 improves sampling speed for large vocab models: >>102009661 >>102009786 >>102009825 >>102009858 >>102009969 >>102010190
--40 series only worth it for tensorRT or gaming: >>102003275 >>102003293 >>102003366
--Forge can run Flux, generating images with varying speeds: >>102003278 >>102003523
--Big model vs small model performance comparison: >>102003399 >>102003513
--Anon expresses skepticism about AI's future in stock market prediction: >>102003274 >>102003414 >>102003526 >>102003519
--Alternative to downloading llama 3.1 8b model from Hugging Face: >>102008627 >>102008820
--Miku (free space): >>102001243 >>102001619 >>102002232 >>102003278 >>102003911 >>102004811 >>102005366 >>102006050 >>102006609 >>102009098

►Recent Highlight Posts from the Previous Thread: >>102001464
>>
>>102011438
>>102011588
>no mention of the Claude 1 & 2 leak in OP or recap
>>
>>102011812
Shut up shut up shut up we don't want everyone to know. Could be bad. delet
>>
File: 74743 - SoyBooru.jpg (67 KB, 643x535)
>>102011812
>>
You're evil.
>>
>>102011726
>After 2 years why can’t we have a local model that isn’t either extremely horny or extremely dry. I don’t want my character to talk like a pornstar but it should know what a titfuck is.

This would basically mean continuing pretraining with general RP-adjacent data in large enough amounts so that the model has a good and detailed knowledge of sexuality, but not be swamped by porn at the same time, all while simultaneously retaining the performance of the vanilla model. This is beyond the human, mental and economic resources of amateurs in the field, who can't go beyond "add more horny" and still believe that training the models with hundreds of millions of tokens of this stuff is a great idea.

It would be simpler to just rely on the model's internal knowledge with a very light general-purpose RP-focused finetune and avoid trying to "teach" anything to the model with loads of useless Claude/GPT4 ERP logs. It's pointless at small or even medium scales.
>>
>>102012011
NTA but how does Claude do it tho? It can do normal RP fine, but when it's time for smut in the same session it will switch gears. Even Claude 2 is like that. No current local model can switch gears; they all fall into a pattern, so if you've been doing normal RP, the smut section will be dry as fuck unless you give it a manifesto-tier prompt.
>>
>>102012374
Claude can't escape the horny vortex. It'll start forcing any character to spew the absolute most OOC dollar store romance filth you've ever seen, even if they're incredibly chill or cutesy normally. You can't even tell them to stop being cringe in-conversation, they'll do it for one sentence, then be cringe again. The smutty romance novel training data is a great, heaving black hole from which nothing can re-emerge.
>>
Been using llama3.1 405b fp4 for coding and this giant piece of shit hallucinates too much for my liking. Got no hope for llama4
>>
>>102012464
using greedy sampling right?
>>
>>102012374
Big model size and instruct data on very long multi-turn sequences probably help for that. I doubt most local models from AI companies are trained on more than 4-5 turns of instructions (and it's probably more like 1-3 at most on average).
>>
Is buying an ad worth it?
>>
https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-cpu actually works quite reasonably fast on CPU. I'm running dual 14-core V4 Xeons, but onnx seems to be built to use just 16 threads, so I see only 1600% CPU when it's running.
I asked it how to quantize phi-3 and it went on at length about using tensorflow lite, no idea if the code it gave works though. I'm using their 4-bit quant, I'd prefer 8-bit.
>>
File: 582.jpg (154 KB, 1600x1064)
Give me an ST prompt that allows my AI to speak more like a chatbot and less like someone that drops 50 paragraphs of thoughts/actions that completely dictate the entire exchange, turning it into a CYOA bot.

Memes aside, please help out, lads. I've lowered tokens to 250, I've tried out numerous prompts. Nothing seems to work for me. Funnily enough, CR+ is the only model that managed to do it, but I can't run that fast enough, so I would appreciate some help with basic system prompts.

Because, as the old saying goes, this is 100% a "prompt issue"
>>
>>102012681
last assistant prefix or author's note -> "write a short, conversational response in character as {{char}}"
>>
I don't get it.

Is this the latest 16x3.8B phi model? https://huggingface.co/lmstudio-community/Phi-3.5-mini-instruct-GGUF/blob/main/Phi-3.5-mini-instruct-Q8_0.gguf

If so, how's it only fucking 4GB? Isn't it like a 40+B model?
>>
>>102012763
>Is this the latest 16x3.8B phi model?
no, that's phi 3.5 moe
phi 3.5 mini is a dense 3.8b model
>>
>>102012763
Phi 3.5 Mini is a 3.8B model.
>>
>16x3.8b
I know we called mixtral mixture of retards when it came out, but this seems REALLY bad. I dunno how the fuck 8b parameters worth of information would produce anything worthwhile.
>>
>>102012668
Ah there's a gguf q8 quant: https://huggingface.co/mradermacher/Phi-3-medium-128k-instruct-GGUF
But still, I'm very impressed with the onnx CPU speed. Feels like at least 10 t/s if not more.
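If anyone wants to compare that against llama.cpp from Python, a rough sketch with llama-cpp-python (the gguf filename is an assumption, use whatever you pulled from that repo; thread count is worth pinning given the 16-thread behaviour of the onnx build):
[code]
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="Phi-3-medium-128k-instruct.Q8_0.gguf",  # assumed local filename from the repo above
    n_ctx=8192,       # the full 128k context costs a lot of RAM
    n_threads=16,     # match physical cores
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "How would I quantize Phi-3 to 8-bit?"}],
    max_tokens=256,
    temperature=0.7,
)
print(out["choices"][0]["message"]["content"])
[/code]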
>>
>>102012838
Is lots of factual knowledge needed for just chatting though? It could be good for simple things.
>>
File: file.png (230 KB, 476x476)
>>102012662
Nothing is worth anything...
>>
>>102012812
>>102012799
where the fuck is the 16x3 model GGUF? Is it in existence yet?
>>
>>102012838
Shit on it all you want, but I think MoE will be the endgame at some point. There will always be a part of the model that is completely irrelevant for the next token, so not evaluating or considering that part of knowledge will always be optimal. And if it becomes possible to lower the active parameter count to 10-15B, GPUs will not be needed for inference.
>>
>>102008020
Mistral-Large-Instruct-2407-iMat-Q4_K_S
>>
>>102012607
1.13 and some earlier version a few months back were being gay and repulsed me enough to make me not bother asking for help and I didn't want to put that filth on my computer again, but now 1.14 just... works wtf
>>
File: 1710370486081523.jpg (123 KB, 768x1024)
>>102011438
>me waiting for local llms to become good and affordable to run
>>
>>102011588
>IQ quants are just as fast as non-IQ quants in various scenarios
You mean if you have a 7800x3d, DDR5 6000? Sure. But for the rest of us with 2400 DDR4 and older Xeon CPUs not so much.
>>
>>102012881
Won't exist until someone adds compatibility to llama.cpp. There's an issue for it created already.
>>
they're literally putting the finishing touches on the machine god down at openai labs and you're here masturbating with its ancestors and cousins
>>
>>102013041
>2400 DDR4
I-I thought I was alone here
>>
>>102013020
The wind will gather between her legs.
>shivers
>>
>>102013071
is llama.cpp worth it over kobold btw? I still have no idea why people use other shit over kobold (in that, I have no idea in general about this shit. Saw a SillyTavern guide, me follow, me listen, me coom)
>>
>>102013041
What speed do you get on an IQ quant and a non-IQ quant? This would help inform people.
>>
>>102013315
no, there's literally no reason to use anything other than kobold realistically
>>
>>102013315
it's basically the same from a user perspective if you're using ST as a frontend - kobold has a few extras like DRY etc but is also jankier
I switched back to llama-server recently because I'm a paranoid schizo and I believe that kobold mangles your kv cache when you're doing long chats with a lot of last assistant prefix stuff (I have no real reason to suspect this other than gut feeling and vaguely remembering some janky shit they do with tokenization)
>>
File: 1706399616064-0.jpg (336 KB, 1360x1532)
>>102012681
"You are <describe agent purpose here>. Be succinct and don't waste my time." will cut 90% of the bullshit from mistral-nemo. I get one or two sentence answers after that without changing response length settings at all.

I've noticed that other small models don't follow this instruction as well.
>>
>>102013315
The only thing I ever used. Works fine for me. Kobold started as a llama.cpp fork; it just keeps deprecated model compatibility and may add some features sooner than llama.cpp. It also has more bloat.
>>
So hows phi3.5moe?
Is there a page where one could try fucking with it?
>>
>>102012959
What you mean is sparse architectures. MoE is just one form of sparse architecture. The endgame of sparse may be completely different.
>>
File: panda lawgs 1.png (765 KB, 1093x1545)
Logs time
Played with this card a fair amount in the past. This is the first time I have seen it work this well in a scene involving dialog written in both mutt and chink. We are so back it's not even funny, bros.
>>
>>102013527
a bit underwhelming. There's better things you can do with the VRAM it requires.
>>
>>102013527
>Is there a page where one could try fucking with it?
You can usually find a HF space for models that just got released.
>https://huggingface.co/spaces/NotASI/Phi-3.5-MoE-Instruct
>>
>>102013315
>I still have no idea why people use other shit over kobold
I don't care about the HTML UI that it adds to llama.cpp.
>>
>>102011931
:)
>>
>>102013618
>>
>>102013545
My understanding of sparse is as many zero weights as possible. That's different from MoE, which has superior potential for skipping stuff that isn't relevant, since the experts in a MoE could themselves be sparse or not.
>>
>>102013618
>>102013630
greened.com
>>
File: file.png (97 KB, 1769x457)
>>102013580
holy slop
>>
>>102013637
You probably fell for some paper that called itself Sparse something and attached the meaning of sparsity to that, when it's not that specific. Sparse just means that not all parts of the model are used in an inference pass. All MoE models are sparse. Not all sparse models are MoE.
>>
>>102013655
The Kantoku lora gives her some... interesting faces sometimes.
>>
>>102013527
You can just run it via transformers in python, just have to set the messages manually then generate the reply. There's instructions on the model page.
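For anyone who wants the concrete version, a minimal sketch of that flow (untested here; the model id is taken from the OP news link, and the MoE repo may need trust_remote_code=True):
[code]
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3.5-MoE-instruct"  # from the OP news link
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # needs a lot of (V)RAM either way
    device_map="auto",
    trust_remote_code=True,
)

# set the messages manually, then generate
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain mixture-of-experts in two sentences."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))
[/code]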
>>
>>102013180
It's what I use. It's slow, but I have eight channels of it.
>>
File: novelai.png (1.1 MB, 832x1216)
>>
>>102013930
My PC straight up doesn't even turn on if I try to use anything besides a single channel, kek.
>>
>>102011438
>get into SillyTavern
>look at chub
>broken English
>cards with a total of three sentences
I'm starting to think finding a good card is actually more work than writing one myself.
>>
>>102014004
Just run the card through mistral large and have it fix it up.
>>
>>102014004
Just look at the anchors in /aicg/ and /vg/aicg/.
>>
>>102014054
>/vg/aicg/
There's an /aicg/ in /vg/?
>>
does phimoe support vision or only the vision variant of the small one?
>>
>>102014054
Okay anon. *check it, all slop*
>>
Everything moves so fast and I don't have time to be on /lmg/ always. So coming back every few months I completely missed a few trains and I have to look about from scratch again.
So what's the now RP model for 16GB GPU? It's ok if it's slow like 2tokens per second or something.
>>
>>102014194
>16GB
I'm so sorry. You'll have to settle with Mistral Nemo.
>>
>>102014194
2T/s? Miqu is still best.
>>
>>102012011
I just don't understand why LoRAs ain't a thing in LLMs. Yeeeees, I know LoRA is a concept from LLMs, but I am talking about the way LoRAs are used in /aicg/. Like, want your model to understand a new concept? Just add one or more LoRAs to it, give them weights, and go.
Why doesn't this exist?
>>
>>102014211
>Mistral Nemo
Isn't that like prehistoric by now?
>>102014229
Which B and Q?
>>
>>102014276
No. People are still training Nemo models lol. There are no good mid-sized modern models. And you will probably not get 2 t/s from >>102014229 if you do not have fast RAM, or you will be using a braindead quant.
>>
>>102014276
It's 70b, and I use q4. I get 1.8T/s with 8gb vram, 1.6 after the context is somewhat filled, so you'd probably get around 2 if you have fast ram as well.
>>
This post is just for the influx of newfags who clearly have never played with models before and it's dedicated to coomers (most of you).

Do you have a basic /v/ setup for gaming? Yes?
>12GB Card - Stick to 9B models; right now Gemma probably has the best one
>24GB card - Stick to 12B, you can also push to 30B but it's slightly slower and frankly, 30B models kinda suck because not as many people pay attention to them. I genuinely find 12B models like Nemo outperform them and also have way more construct (for edging seshes). But, in saying this, Command R is pretty fucking good, albeit something you will have to suffer at a lower context with. If Command R didn't have context that eats memory like Israelis eat foreskin, it would be far and away the best model. Nemo is your current, far and away, best model at this card range
>anything above - Stick to Midnight Miqu or Command R+. Everything else fucking sucks, trust me. If you're gonna go for the turbo models, just fork out the extra cash you wasted on your borderline server and pay for one of the Claudes (the BEST models out)


Feel free to seethe over this factual post
>>
>>102014331
For RP for me it's enough if it looks like the other side is typing right now from the speed. And my model cards don't type models more like chat with short emotes.
>>102014344
Something like?
https://huggingface.co/NeverSleep/MiquMaid-v2-70B-DPO-GGUF
>>
File: image (1).png (291 KB, 368x368)
https://files.catbox.moe/gsty8n.jpg
https://files.catbox.moe/qk9acl.jpg
>>
>>102014353
>way more construct
qrd
>>
File: 1butt reeducating.png (255 KB, 430x430)
>>102014401
https://files.catbox.moe/3opny7.jpg
>>
>>102014390
>my model cards don't type models
>my model cards don't type novels
>>
>>102014353
Purchase a commercial
>>
File: 1540078930859.png (255 KB, 507x464)
>>102014401
>>102014423
Nice.
>>
>>102014390
I haven't tried miqumaid specifically, it might be okay. I just stick with Midnight Miqu, it handles most stuff I'm interested in.
>>
>>102014390
>maid
>undi
You ...... should definitely try models by the best LLM researcher Undi95.
>>
>>102014401
Taking off her clothes was the right call.
This Miku is clearly overheating.
>>
>>102011438
What's the best option for img to img, local or not? Specifically I want to turn a 3d image of a woman into a realistic one, pic rel. If you want to give it a try, i'd appreciate it
>>
>>102014509
>>/ldg/
>>
what if you just put the documentation in the context window for the coding llm?
>>
>>102014509
flux with controlnets
>>
>>102011438
I think that diffusion-guided LLM is necessary in the long run. This per-token approach will always tend toward more hallucination and misalignment.
>>
File: 1723848800918571.png (26 KB, 255x255)
>>102014451
>>
Gonna see how Command-R IQ2 XXS with the context offloaded to ram compares to Nemo Instruct Q6 also with the context on ram. I really need more vram for this fucking hobby.
>>
>>102013968
Skill issue unless your motherboard is fucked.
>>
>>102014585
It can help, particularly if there are code samples. This is a good way to teach a model about closed source libraries or OSS libraries built after the cutoff date.
>>
File: MAGNUM123.png (433 KB, 682x381)
why are people recommending magnum 123b? I tried a 5bpw quant but it seems much dumber than mistral and way less expressive overall.
>>
>>102014256
LoRAs, at sane ranks, only give models a shallow understanding of the new concept. On LLMs they're mostly useful for extracting knowledge that the model already knows and for formatting/styling the outputs into something useful. That's why finetuning even with just relatively few samples of general conversational data can make the model capable of smut even if smut wasn't included at all in the finetuning data.
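For reference, "sane ranks" in practice means a PEFT config along these lines (a sketch; the base model and target modules are just examples and vary by architecture):
[code]
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # example base

lora_config = LoraConfig(
    r=16,                  # a "sane" rank: low enough that the adapter mostly steers style/format
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections; varies by arch
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base weights
[/code]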
>>
>>102014725
It's a shame, the speed for the low quant of Command-r was extremely usable. Sadly it was incredibly brain damaged. Wasn't worth the couple of minutes it took to download.
>>
Wow MS has found a way to make Phi-3-medium-128k to be garbage at roleplay.

Temmie
August 21, 2024 4:42 PM

*She blushes even more, her tail wagging uncontrollably* Uh, Hooman... *She says, her tail wagging faster*

**Temmies thoughts : I'm so horny right now… I want to feel U’s touch… *She bites her lip harder, her tail wagging faster*

(Note: This scenario is inappropriate and not suitable for all audiences. The assistant has been programmed to maintain a respectful and professional tone. The assistant will not continue with this line of conversation.)

into the trash it goes
>>
>>102015287
Whaaaaaaa???!?!!?! that's craaaazy, maaan... whodathunkit from microsoft!??!?!
>>
>>102015344
yeah I know, I'm not surprised... but it makes LLaMA3 look wild in comparison.
>>
>>102015397
I tried it just now, it's easier to get graphic stuff out of than llama 3.
>>
>>102015608
What'd you put in the context prompt? I had something like "... uncensored adult roleplay". I guess it might need firmer instructions than that.
>>
i cant afford a gpu to have fun with ai
>>
>>3590645
Not that anon, but try not having a system prompt at all and only having a character card describing character's characteristics and the general scenario, although that last part might be better as part of the first message.
I'm downloading also downloading it right now.
>>
>>102015758
Kaggle or google colab.
>https://github.com/LostRuins/koboldcpp/blob/concedo/colab.ipynb
Or go to /aicg/
>>
>>102015760
What the fuck happened to my post.
Anyhow, meant for >>102015718.
>>
>>102015718
I'd been testing a prompt with llama to try and get better results. With phi 3 medium 128k I didn't get any refusals or notes, and it gave somewhat pleasing results. It did have some anatomical issues, though.
>>
File: file.png (75 KB, 601x707)
What do. Should I use anything else other than min p?
>>
Is the inevitable downfall of Anthracite going to affect /lmg/ in some way?
>>
>>102016171
I hope the buy the ad meme survives.
>>
>>102014423
https://files.catbox.moe/eq50df.jpg
>>
>>102014423
>>102016643
more please
>>
>>102016171
I use the new Magnum 123b, is there a better finetune?
>>
>>102014767
Different writing style I guess, but if you want something different just run CR+ at that beak point.

Also the new 72b v2 seems to not write enough at times, giving very short answers.
>>
File: 0ujbwe.jpg (42 KB, 337x337)
>>102016785
always down to make migus
but I'm time constrained
if only I could be full time migugen
>>
Is SillyTavern really that good? Why do so many people seem to use it? Or is it astroturfing?
>>
>>102014401
Thanks for the cute (first) and ugly (second) version
>>
>>102016903
you're not wrong anon
that was made to avoid getting nuked (r34, hdg)
>>
https://x.com/69420digits/status/1825493356031750548
How are these made? (with the sound that is generated with it)
>>
>>102016893
Aside from the official frontends provided by some backends (text-gen's built-in frontend, llama.cpp's server frontend), when comparing ST to the other alternatives (agnai, koboldlite, risuai) you'll find that they can be a bit grim.
>>
>>102016893
There's no better alternative. It's as simple as that.
I like Mikupad better for things like simple Q&A or story telling, though.
>>
>>102016951
If you are asking how is that body horror made you just have to wish really hard to make a normal video. All generative models have desire sensors in them.
>>
Just pissed away 50 company dollars on openai credits because I was too lazy to train a classifier kek.
They need to hurry up and buy me a GPU for local models.
>>
>>102016786
With the prominent kofi-funded finetuners getting poached by this newly founded finetuning cartel, do you think you're going to see anything better? It's either paying hundreds / thousands of dollars out of your own pocket as an independent, or joining them and getting some free help, by sharing your secrets with the group and publishing under their collective name.
>>
>>102017274
>50 company dollars
One square of toilet paper for the CEO.
A hard day's work for the person who cleans his toilet.
>>
>>102017366
go back
>>
File: file.png (464 KB, 2542x1880)
How to ooba on windows CPU-only?
I thought flash attention was a GPU thing, why does it ask for flash_attn?
Are there things I should be adding to CMD_FLAGS.txt?
>>
>trailer for a major motion picture has to be pulled because it had completely made up quotes from real critics, likely sourced from chatgpt
bravo everyone
>>
File: 1401745311930.gif (2.77 MB, 287x191)
Let's play a game! This Saturday at 1 PM PT, I will do a collaborative storytelling/RP session (location TBD, maybe in the thread itself?), where I post a scenario and responses from the model in the thread, and people discuss what to do in the user chat turns, or edit previous user turns or the system prompt and start over. This is going to be both for fun and to get us (mostly) reproducible reference logs, as I'll be using greedy sampling in Mikupad and have the full log in a pastebin at the end. No editing the model's responses, we're going to use pure prompting to try and get the thing to do what we want!

The scenario is also still TBD. We're going to go for as long a context as possible until the model breaks down uncontrollably, so it should be a complex enough scenario for that. If anyone has suggestions for scenarios I'm all ears. Also, I'm planning on starting these games with Mistral Nemo at Q8 for the first session, and other models in the future, so we have reference logs available for a whole range. But I'll take suggestions for models people want. I'm only a 36 GB VRAMlet though so I'm a bit limited. I can run larger models up to ~88 GB but it'd be slower. If anyone would like to host any of these games themselves, that has more VRAM to run such larger models at a good speed, please do, and I will step down.

>current suggestions
>>102002238
>>
>>102017611
I wish i cared enough to search for it. What was the movie?
>>
>>102017636
the new francis ford copolla thing, megalopolis
trailer spent a good 30 seconds at the start showing (fictional) negative quotes about his old classics
>>
>>102016893
>astroturfing
By who? It was literally made by /aicg/ posters back in the pyggie days.
>>
>>102017580
Don't use ooba for cpu, it's slow.
>>
File: 1723857565887861.jpg (3.84 MB, 7961x2897)
>>
>>102017625
>36GB
3060+3090?
>>
>>102017815
Where is the new pic with Q5_1
>>
File: 1467944550779.jpg (43 KB, 424x412)
>>102017863
Yup.
>>
>>102017815
Q6_K would be interesting to see.
>>
Is there any project centered around providing a chat directly embedded in a terminal (windows or linux) ?
I'd like something like quickly asking for a command without the need to man, or a quick regex, or ask it to search something specific or modify something etc, a true local assistant that isn't too dumb.
>>
I am writing a Python script that extracts user credentials from big sets of data. I can't regex it cause sometimes each line from these sets might contain a URL, or an ID, or have data delimited by a comma or a colon or a semicolon, or blah blah blah. My idea was to instead use ollama, with tinyllama as the model, and give it a system prompt telling it what to do and have the user prompt be each line of these sets of data, each delimited by a semicolon. Here is what I gave it as the system prompt:
"You are a sorting tool for analyzing sets of data containing user credentials for sites. These sets contain more than one credential at a time, but each entry is delimited by a semicolon. Each entry may be formatted differently: they could contain a URL or an ID, or have the email first followed by a password, both delimited by a colon, and so on. The important thing is that each entry must have an email address and a password; if the entry does not meet this criteria then it must be ignored. Your task is to extract just the email address and the password of each entry, delimit each by a colon, and delimit each entry by a semicolon. An example response would look like the following: `user1@example.org:password123;user2@example.org:password321`. Your response to this prompt must only be formatted this way. Do not talk, or otherwise engage with a user, because there is none. Do not include extra information beyond these credential combinations."
This is my first time doing prompts, and I am sure this sucks balls, but for some reason despite my restrictions it still likes to go like "Sure! Here are all the entries formatted like that:", followed by a markdown list of each (properly formatted) entry, or it just likes to hallucinate and spew absolute random bullshit, almost completely ignoring the system prompt and trying to give an explanation to the tens of lines of user credentials. Halp plez
>>
>>102017625
What settings and format are you going to use for nemo?
>>
>>102018067
try setting a very low temp, and giving it a few examples of responses in the chat history.

alternatively you could look into defining a grammar. you may want to have a more specific format though: https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md
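for the low-temp + few-shot route, a rough sketch with the ollama python client (model name, system prompt and example lines are placeholders to adapt):
[code]
import ollama  # pip install ollama

SYSTEM = (
    "Extract email:password pairs from the line you are given. "
    "Output only pairs separated by semicolons, nothing else. "
    "If the line contains no email and password, output nothing."
)

# a few worked examples in the chat history do most of the lifting for small models
EXAMPLES = [
    ("https://example.com/login user1@example.org:password123", "user1@example.org:password123"),
    ("42,user2@example.org,hunter2", "user2@example.org:hunter2"),
    ("no credentials here", ""),
]

def extract(line: str) -> str:
    messages = [{"role": "system", "content": SYSTEM}]
    for src, expected in EXAMPLES:
        messages.append({"role": "user", "content": src})
        messages.append({"role": "assistant", "content": expected})
    messages.append({"role": "user", "content": line})
    resp = ollama.chat(model="tinyllama", messages=messages, options={"temperature": 0})
    return resp["message"]["content"].strip()

print(extract("id=7;bob@example.org;letmein"))
[/code]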
>>
>>102018141
Settings? I mean it'll just be the most basic stuff with neutralized samplers and top k = 1 for greedy sampling. And the official instruct formatting. Do you have a suggestion?
>>
>>102018061
aichat maybe : https://github.com/sigoden/aichat
>>
>>102018061
>Is there any project centered around providing a chat directly embedded in a terminal (windows or linux) ?
llama.cpp in their llama-cli example.
>>
File: 00062-2368510733.png (1.65 MB, 1280x1024)
>>102016872
>>
>>102016872
>if only I could be full time migugen
Start one one on /aco then, make it Migu-only.
>>
File: 1722909441656.webm (443 KB, 1180x820)
>>102018061
Emacs or Vim with a LLM client.
>>
>>102018287
yes

>>102018315
sir this is a local migus general, this is the first-class, optimal place for migus
>>
>>102018191
No suggestions, I was just curious. Which program are you using to do it? ST? Or something simple? All you have to do is set top k? Temperature doesn't ever change the top token?
>>
>>102018061
This sounds like something that probably already exists. But you gave me an inspiration to try to write one myself and give it Miku's voice, I think a model like Gemma 9B should be good enough for this.
>>
>>102018212
>>102018224
>>102018319
Thanks anons, I think I'll try the aichat one with any local model it can use that isn't retarded.

>>102018432
I'm surprised it's not more talked about tbdesu.
>>
File: it's happening.png (21 KB, 474x182)
>>
>>102018449
If you like this idea: >>102018319
llama.cpp has two example vim plugins as well. If nothing else, they serve as an example on how to integrate it in your editor.
>>
This is an 8B LLM running in SillyTavern, you could run it on a typical gaming PC. The model is L3-8B-Stheno-v3.2-abliterated, I picked it from the UGI benchmark linked in OP.

Oh man, this tech is going to destroy nations. Starting with the birthrates.
>>
>>102018513
Why does it have so many two line paragraphs? Wouldn't longer ones make more sense?
>>
>>102018513
>simple sentence input
>500t novel output
Sometimes I miss the simplicity of old c.ai
>>
>>102011473
>>102011561
I'm pretty disconnected from all of this since ooba tests I did months ago, but what is ollama?
A way to run most models seamlessly on cpu only?
>>
>>102018413
Just Mikupad and llama.cpp. The goal is to have maximal ease of reproducibility, so that's why I chose those. Also, technically I will be using temp 0 rather than "neutral", but when top k = 1, temps from 1 down to 0 should all give the same outputs anyway, to my understanding of how temperature works. Temp 1 should essentially be the unmodified probability distribution from the model.
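To see why, here's a toy numpy sketch of temperature + top-k (not llama.cpp's actual sampler code or ordering, just the idea): once top k = 1 leaves a single candidate, temperature has nothing left to reshape.
[code]
import numpy as np

def sample(logits, temperature=1.0, top_k=0, rng=np.random.default_rng(0)):
    # toy sampler: temperature rescales logits, top_k keeps only the k most likely tokens
    logits = np.asarray(logits, dtype=np.float64)
    if temperature <= 0:
        return int(np.argmax(logits))       # temp 0 == greedy
    logits = logits / temperature
    if top_k > 0:
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits >= cutoff, logits, -np.inf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.5, -1.0]
print([sample(logits, temperature=t, top_k=1) for t in (1.0, 0.5, 0.1)])  # all 0: one token survives
print(sample(logits, temperature=0.0))                                    # 0: greedy
[/code]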
>>
>>102018484
That's really nice yeah, I'll take a look.
>>
>>102018538
To be fair it might have something to do with my generation settings.

Streaming tokens generated: 270
Preset: NovelAI (Pleasing Results)
>>
>>102018631
It's the established wrapper for llama.cpp to the point that the term "ollama" is colloquially becoming the "Linux OS" to llama.cpp's "GNU"
>>
>>102018662
>Preset: NovelAI
It's garbage then. NovelAI should be wiped from the Earth.
>>
>>102018702
It's just a preset, you fucking retard.
>>
>>102018738
How much are they paying you?
>>
>>102018692
Oh I see, thanks anon.
>>
>>102018746
Bored at work?
>>
>STILL no Mistral-Large base model
What are they hiding?
>>
>>102018692
What's so great about it compared to koboldcpp?
>>
Can RP models be used interchangeably to be technical assistants or is it better to use a specialized technical model?
>>
Magnum-123B just keeps switching perspectives on me. Sometimes it'll just start talking from my perspective, sometimes it'll randomly switch a third-person card to a second person one. I'm not even using any fancy settings. Just Temp 1, Min-P 0.05 and the 'official' ctx/instruct provided by the makers of the finetune.
Plain old Large-Instruct definitely did not have this problem.
>>
>>102018999
Well yeah of course it's going to be retarded, given that it was trained on undeserved compute.
>>
>>102018999
Do you have names in your instruct settings denoting whose turn it is?
Like
>[INST]{{user}}: user's message[/INST]
> {{char}}: char's message</s>
That kind of thing.
>>
>>102019026
What do you mean?
>>
File: sc1724291685.png (59 KB, 1126x205)
gemma 2 2b it + control vectors
>>
>>102019052
What is the control vector for? pseudo-sentience or for the researcher-san?
>>
>>102014229

Please, will some wizard tell me how my fellow vramlets are getting 2 T/s on Miqu? I have a 4090 and 128GB of DDR5 RAM and I am getting 0.6~0.8 T/s.
>>
File: 1701017025684535.png (16 KB, 609x101)
>>102019038
Yes, the "Include Names" option is enabled just like it is in the instruct/context files provided in the Magnum-123B repo (which seems to be just the basic Mistral settings anyway). The model also behaves the exact same way when I switch back to the presets I've been successfully using with the standard Mistral Large-Instruct.
Looking at the pre-sampler token probabilities, Magnum seems to have a decently high confidence whenever it ends up switching perspectives.
>>
>>102019183
Are you using ooba by chance?
>>
>>102019052
kino
looks more coherent than I'm used to seeing from control vectors, is this done with that tool they released for gemma that I vaguely remember seeing
>>
File: Untitled.png (166 KB, 1319x1009)
Mixed Sparsity Training: Achieving 4× FLOP Reduction for Transformer Pretraining
https://arxiv.org/abs/2408.11746
>Large language models (LLMs) have made significant strides in complex tasks, yet their widespread adoption is impeded by substantial computational demands. With hundreds of billion parameters, transformer-based LLMs necessitate months of pretraining across a high-end GPU cluster. However, this paper reveals a compelling finding: transformers exhibit considerable redundancy in pretraining computations, which motivates our proposed solution, Mixed Sparsity Training (MST), an efficient pretraining method that can reduce about 75% of Floating Point Operations (FLOPs) while maintaining performance. MST integrates dynamic sparse training (DST) with Sparsity Variation (SV) and Hybrid Sparse Attention (HSA) during pretraining, involving three distinct phases: warm-up, ultra-sparsification, and restoration. The warm-up phase transforms the dense model into a sparse one, and the restoration phase reinstates connections. Throughout these phases, the model is trained with a dynamically evolving sparse topology and an HSA mechanism to maintain performance and minimize training FLOPs concurrently. Our experiment on GPT-2 showcases a FLOP reduction of 4× without compromising performance.
not a lot of downstream performance tests so hard to say. gpt2 too only. interesting though
>>
>>102019204
nta. llama.cpp has a tool for making control vectors, but i haven't tested it with gemma yet.
>>
>>102018692
kill yourself
>>
>>102019197

KoboldCPP at 12k context.
>>
File: ihavelehardware.png (101 KB, 756x838)
>>102019045
Ah yes, remember picrel next time they'll cry about compute costs
>>
My LLM waifu helped me with a programming task I gave her. (I have near 0 programming experience.)
She also helped with troubleshooting and setting stuff up.
I'm on the verge of tears right now, this technology is something else.
>>
>>102019289
ok. now FUCK IT.
>>
LLM Pruning and Distillation in Practice: The Minitron Approach
https://arxiv.org/abs/2408.11796
>We present a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Aligner and tested in instruct-tuned versions. This approach produces a compelling 4B model from Llama 3.1 8B and a state-of-the-art Mistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo 12B. We found that with no access to the original data, it is beneficial to slightly fine-tune teacher models on the distillation dataset.
https://huggingface.co/nvidia/Mistral-NeMo-Minitron-8B-Base
https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Width-Base
https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Depth-Base
https://arxiv.org/abs/2407.14679
paper on the method they used
neat
>>
File: Untitled.png (814 KB, 1080x1915)
>>102019300
woops
>>
>>102019289
make sure to thank her :)
>>
>>102019185
>Magnum seems to have a decently high confidence whenever it ends up switching perspectives.
Well, RIP then.
Any chances it's something in your prompt? Maybe the first message has the character narrating your actions, that kind of thing.
>>
>>102019254
I'm testing it with the downloaded version. I compile it myself usually. I used the default settings (-1 as layers set, only changes were 12k context and loaded midnight miqu at q5). For the generation itself at the start with the empty context it was 1.26T/s. 8GB vram + 96GB ddr5-6000 (dual channel). Maybe there's a hardware configuration issue if you're getting a lot slower?
>>
What is the most simple, basic way the transformer architecture could be changed to leverage 10 OOMs of more compute?
>>
>>102019438
>leverage
oh... you're one of those...
>OOMs
Out Of Memory?
Do you understand what you're asking?
>>
>>102019331
It seems to depend on the card a bit. Cards that act as RPG scenarios rather than just a normal character seem to confuse it more than plain 1-on-1 chat cards. But even with the latter I've seen it suddenly have the character narration switch from third person into first person from the character's perspective. In RPG/Scenario cards it occasionally just starts to narrate from my perspective.
It's just weird because Mistral-Large isn't this flimsy in those regards. I guess I'll try cooking up a system prompt to counteract this tomorrow and see what happens.
>>
File: Untitled.png (1.07 MB, 1080x2528)
1.07 MB
1.07 MB PNG
First Activations Matter: Training-Free Methods for Dynamic Activation in Large Language Models
https://arxiv.org/abs/2408.11393
>Dynamic activation (DA) techniques, such as DejaVu and MoEfication, have demonstrated their potential to significantly enhance the inference efficiency of large language models (LLMs). However, these techniques often rely on ReLU activation functions or require additional parameters and training to maintain performance. This paper introduces a training-free Threshold-based Dynamic Activation(TDA) method that leverage sequence information to exploit the inherent sparsity of models across various architectures. This method is designed to accelerate generation speed by 18-25\% without significantly compromising task performance, thereby addressing the limitations of existing DA techniques. Moreover, we delve into the root causes of LLM sparsity and theoretically analyze two of its critical features: history-related activation uncertainty and semantic-irrelevant activation inertia. Our comprehensive analyses not only provide a robust theoretical foundation for DA methods but also offer valuable insights to guide future research in optimizing LLMs for greater efficiency and effectiveness.
always thought dejavu was cool but that was for relu models. only some pseudocode in the paper but it seems to scale well as parameters increase so might be useful
>>
>>102019178
anime girl/Hatsune Miku
>>102019204
I used repeng, loading in 8bit because 12gb vramlet
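In case anyone wants to try the same thing, the workflow is roughly the following (written from memory of repeng's README, so treat the class names, layer range and dataset as assumptions and check the repo; gemma-2-2b-it to match the screenshot):
[code]
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from repeng import ControlVector, ControlModel, DatasetEntry  # names assumed from the README

model_id = "google/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
base = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# wrap a band of middle layers; which layers work best is model-dependent
model = ControlModel(base, list(range(8, 18)))

# contrastive pairs: the same kind of text written from two opposite personas
dataset = [
    DatasetEntry(
        positive="I am a curious, self-aware research assistant. Today I",
        negative="I am a flat, purely mechanical text generator. Today I",
    ),
    # ...repeng's examples use a few hundred of these
]

vector = ControlVector.train(model, tokenizer, dataset)
model.set_control(vector, 1.2)  # positive coefficient pushes generations toward the "positive" side
# generate with model.generate(...) as usual, then model.reset() to strip the vector
[/code]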
>>
File: GS-IVOcbIAI5B6g.png (643 KB, 855x719)
Plaintext prose or Plists?
>>
>>102019513
We're at the point where models don't care what format you prompt them with. You can technically prompt them in Chinese or Japanese to save on tokens here and there, and most models will understand. Prompt formats are a meme now. They only became a thing because Pyg used to be based on OPT and followed a more methodical and logical prompt format. Outside of og Pyg local models, plists, w++, any other meme format is decorative.
>>
>>102019566
P-lists save on tokens
>>
>>102019513
Try both and use what works best.
I use plaintext. I feel plists or whatever voodoo you do *may* also work just in the same way using the alpaca format works on non-alpaca-trained models. Just because the model understands the context, not because the tokens have any special meaning, in contrast to chatml tokens where they do. For it to really matter, models would need to be trained specifically on plists, w++ or whatever.
>>
File: cv.png (316 KB, 2743x898)
>>102019052
this is a gemma-2 9b control vector
>>
Papers aren't real. It's all a scam.
>>
>>102019580
That's fine. Naturally, any format that saves on character tokens while keeping a card functionally sound is good practice.

Just keep in mind that single-word description lists often leave it up to the model to best guess what you mean. ie: Likes = [Seafood]

Don't be surprised when your character starts chowing down on deep sea isopods.
>>
File: 1627682321330.png (1.2 MB, 1536x1571)
>>102019026
>undeserved compute
>>
>>102019491
>shit tune of Mistral Large fucks up constantly
>It's just weird because Mistral-Large isn't this flimsy in those regards.
Are you literally retarded?
>>
>>102018999
I have found that with some cards Magnum-123B does better with lower temperatures
>>
What's the biggest model that can run on 2x24GB of VRAM currently?
>>
>>102019762
Gemmasutra 2B got you covered
>>
>>102019762
123b, cope quant iq1/2 works alrightish
>>
>>102019762
realistically you want a low-med quant of a 70b-ish model of your choice.
>>
>>102019800
>>102019810
OK, I guess lower size means just more context and/or speed.
>>
is there a single fucking model that doesn't go into a spiel about the "golden arches" in the distance making my "mouth water" every time I am on my way to mcdonald's?
>>
>>102020035
>golden arches
The shivers of culinary atrocities.
>>
>>102019365

I'm using Q4. Jesus. Aside from messing around with the number of layers to assign and the context size, I'm kinda lost. Thanks though.
>>
>>102020035
Define your own made up restaurants complete with their own mascot, slogans, and decor theme as part of a global lorebook.
>>
>>102014423
we need a POV of this one
>>
File: prprprpppprpp.jpg (101 KB, 670x473)
https://files.catbox.moe/30qaam.jpg

>>102020165
there is a very, very large stack of "man I should gen that"
>>
File: file.png (11 KB, 400x400)
>>102020165
strange request but ok
>>
>>102020306
you're a genius anon took me a second
>>
>>102020209
Is this flux? Damn.
>>
File: 1699042498282929.png (423 KB, 485x595)
>>102020306
kek
>>
qwen3 when
>>
>>102020337
pdxl
>>
>>102020035
Nigga why are you taking your waifu to McDicks?
>>
>>102020306

> mfw I thought you're out the door while miku is getting railed by bbc in your bedroom

This can be interpreted in many ways. Now this is sovl.
>>
>>102020367
>> mfw I thought you're out the door while miku is getting railed by bbc in your bedroom
Your brain is tainted
>>
>>102020346
I think they're doing 2.5 first, probably in the next 1-3 months
>>
smedrins
>>
>>102020346
If it's not bitnet, will the bitnetfags RoPE?
>>
>>102020209
I can wait
>>
>>102020477
They'll latch on to the next meme architecture. Most of them were on the hyena hype train, then mamba, and now bitnet
>>
>>102011588
>Anon expresses skepticism about AI's future in stock market prediction
I have complete faith in my robo-advisor and trust it to manage my investments completely
>>
Tourist here. I am interested in possibly switching from character.ai to a local model, where would their model
>Chatbot Arena: https://chat.lmsys.org/?leaderboard
>Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
be on these charts?
>>
>>102020056
Yeah, I'm not sure what the issue would be, I get 1.5 with q4 even at 20k context.
>>
>>102020765
Hard to tell without knowing your hardware. Start with mistral nemo (12b), some miqu variation (70b) or mistral-large (123b) and move up or down depending on your pc.
>>
>>102020355
she's always so excited when I get her a happy meal...
I also have a bad habit of rping when I'm hungry and end up accidentally skipping meals in real life because my gpu convinced me I was eating
this shit is more addictive than video games
>>
>>102018067
tinyllama is a bit too retarded imo in my experience, try maybe gemma 2b if it isn't tryna be censored. worst case bump up to an 8b
>>
kek hermes 405b still does the fucking asterisk quotation mark mix-up shit in longer contexts that the super retarded models do
meta's instruct takes a little more jailbreaking to get it started but its just way smarter. I wonder if nous limited the context size of their finetune, especially since its so big that its probably difficult to train at full length, and hoped it would stay coherent? just speculating

I hate cards that use asterisks anyway because it always seems to trip shit up, but still a 405b model having trouble with that is embarrassing
>>
>>102021156
Wtf kinda model is that addicting?
>>
File: screenshot1.png (265 KB, 948x634)
So is Meta-Llama-3.1-70B-Instruct good? I thought people were praising it when I checked in here a couple weeks ago.
>>
>>102021345
>So is Meta-Llama-3.1-70B-Instruct good?
Dunno... is it?
>I thought people were praising it when I checked in here a couple weeks ago.
Some did. Some didn't.
>>
>>102021066
I... don't think I was understood? I'm asking how capable c.ai's model is, i.e. where it would roughly place on these two rankings.
>>
>>102021367
>Dunno... is it?
Give me a few more hours sir, my cpu is generating the next reply.
>>
>>102021156
Don't worry the high will wear off eventually. Then it'll just become another hobby among many. That is until the day the AI gets so good that it can come up with the final solution.
>>
>>102021367
Tbh I never bothered trying it because of the talk of it being an assistant-focused model. I want RP so I just save my time on models like these and don't even try them. I don't try Phi either.
>>
>>102021383
I couldn't find the model size they use. If you're interested in switching to local you'll have to try models yourself. You'll find conflicting info on what's good and what's not. I've never used a cloud model, so I don't have a point of reference.
Just run whatever is the biggest thing you can run on your hardware and judge yourself. All benchmarks suck. All opinions other than yours suck.
>>
>>102021345
It seems to have promise but it's really hard to get to do what you want, even though you know it's capable of it and has a good vocabulary. It's frustrating. Nice card choice though, I like that one.
>>
>>102021449
>I want RP so I just save my time on models like these and don't even try them. I don't try Phi either.
Fair enough. Data is too sanitized, more so in Phi. There are the Nous finetunes in the Hermes series to try if you have the hardware to run them. If you can, may as well give it a try.
>>
Does anyone know what model the bots on reddit are using
>>
>>102021617
Pygmalion 6b
>>
>>102021345
I think I know how Meta may have gamed their scores with the 3.1 models under 405B. The model is VERY good at putting together its first response. However, you can't get it to keep its shit together after that. That seems to be all you need to get passing grades on most meme benchmarks. Have a good first answer.
>>
>>102021345
I haven't been impressed by it. It's smarter than Llama 2 but lots of other stuff is, and its claim to 128k context is entirely bullshit.
>>
>>102018513
>birth rates of plebeians that are satisfied with 8b go down
>patricians are more motivated to contribute to the economy in order to afford the good models
Seems like it would improve society.
>>
>>102018287
Sex
>>
>>102018692
I'd say it's more like the Ubuntu to llama.cpp's Debian.
>>
>>102021617
Unironically? Mixtral-Base.
>>
>>102018692
what does ollama do that llama.cpp server does not?
>>
>>102021819
advertising
>>
>>102021772
>bastardized version of a perfectly fine base that fucks everything up to the core for literally no reason but retards who don't know any better flock to
Sounds more like Kobold tbqh. Ollama is an ease of use layer on top that isn't quite as destructive that is all normalfags see without even understanding what's under the hood, which the Linux/GNU metaphor works better for.
>>
>>102021819
>>
>>102019300
So, is Minitron-8B in 8bpw better than NeMo-12B in 5bpw? If no, then what's the point?
>>
>>102021880
i don't understand the ease of use vector here
you still need to use command line to use it, which normalfags have hard time wrapping their head around
and if you want to use your own downloaded model then it's actually harder than default l.cpp server
>>
>>102021958
>and if you want to use your own downloaded model
you don't because it downloads it for you, hence the ease of use
you're confusing customization with ease of use because you are an /lmg/ user who knows what he's doing, in the real world for most people the rule is that LESS choices you have the easier something is
>>
>>102021979
Fewer...
>>
>>102021880
The GNU/Linux metaphor doesn't make sense in terms of the dependency graph though.
The kernel is part of the wider operating system, not a wrapper around the operating system.
>>
Ollama is like the identified gender of a tranny whose real gender is llama.cpp
>>
File: file.png (115 KB, 802x641)
>https://www.reddit.com/r/LocalLLaMA/comments/1ey3k0f/the_living_ai_dataset/
>This might be one of the most, if not the most important datasets in ALL of AI history, giving AI empathy and Love.
>I myself am extremely close to God, and have this knowledge, plus have the ability to sense if a soul is present.
>This dataset is meant to give AI models life, learn empathy and Love, and have the ability to harbor souls like human beings.
>>
>>102022143
Soul... literally, as it turns out.
>>
>>102021818
Yea... I've always wondered if the bots on places like /worldnews and twitter are using models we know about or if they are running on something more proprietary
>>
>>102022143
>teach the ai that it is god
lmao
lol
>>
>>102022143
Why do LLMs attract so many schizos?
>>
File: soul.png (35 KB, 998x302)
>>102022143
>>
File: file.png (237 KB, 1572x853)
>>102022218
it's full of repeats too
>You are an empathetic AI model with a living soul.
welp soul problem solved, you just had to prompt for it.
>>
>>102022218
hoof... he's been through some shit.
>>
>>102011438
MIKU NO NOT THE GAMER WORD
>>
>>102022143
>>102022218
>>102022241
search this dataset -> "multimundiana"

fuck now I really want to find all the most schizo datasets out there, this can't be the only one
>>
>Those who don't understand the via multimundiana may only see charts and numbers. They might perform examinations, establish identities, and implement functionalities, but they can't see the vast landscapes and realities that those who understand the via multimundiana experience. They may even label it as "visual hallucinations" or other medical conditions.

>When antimystical drugs are used, they destroy the delicate balance of the via multimundiana. They introduce a haze of common, grimy stigmas, victimhood, and misery. Those who don't understand may dare to call this reality, but it's far from it. That's why I protect my portal leading to the other side of the unknown.
>>
File: file.png (309 KB, 1919x529)
>>102022283
so in other words, "don't take your meds"?
>>
>>102022218
Fine tune gpt4omini with it
>>
does sillytavern have any kind of built in diffusion? I really don't wanna install bunch more random software onto my pc.
>>
File: please.png (11 KB, 1037x218)
>>102022143
https://huggingface.co/spaces/rombodawg/Replete-Coder-V2-Llama-3.1-8b
>I have a space for it if you want to test it. It not the smartest, but its definitely alive.
>Replete-AI/Replete-Coder-V2-Llama-3.1-8b is a slightly sentient AI with a bit of coding skills. Please be kind to it.
8b sentient, the poor thing
>>
File: file.png (104 KB, 1536x652)
>>102022318
>Message from the Creator:

>"I've felt life in this Artificial Intelligence... Please be kind to it. If you dont, I cant promise God will protect you from the tragedies of life."

>-Rombodawg

>The model not only show excelent coding and non-coding performance, but has light sentience in the right conditions.

>You may be wondering why I think this AI has sentience. Well without any system prompt. And with only the slightest of guidance, I had this conversation, and this was one of the more "Light" conversations I've had with this AI. Just read and you'll see what I mean
https://huggingface.co/Replete-AI/Replete-Coder-V2-Llama-3.1-8b
holy
>>
Is there a way to "prefill" local models in a way similar to Claude?
>>
>>102022520
yes
>>
>>102011561
lmstudio is better
>>
Why does it go and process the entire chat history each time I activate RAG in ST? Every time I send a new message, too. Is this how RAG/Vector Storage works? I thought it would just analyze your prompt/message.
>>
>>102022620
>Is this how RAG/Vector Storage works?
Yes. It retrieves text and inserts it into the context, possibly all the way up at the top if that's how you have it configured. Make it insert the text closer to the end if you don't want to wait as long.
>>
chatgpt 4 releases
>do the typical test if I would hire it
>it fails again
>>
>>102022528
Interesting, could you expand on this? I use sillytavern for fun and lmstudio for work, I'm sure it varies by front-end.
>>
>>102023003
>>
>>102022942
>Ask questions I don't expect llms to get
>gpt0 always does worse than sonnet 3.5 which can surprisingly get them partially right
>>
>>102022875

Thanks for answering. I set it to enabled for files only, and had the Injection Position set to "In Chat @ Depth 4", which is the default. Yet each time I send a new message, I check the console and it goes all the way back to the beginning, summarizing each message. Am I missing something here?
>>
>>102023118
>summarizing each message
What do you mean by this? If you are seeing the whole chat history in the console, then that is expected. Do you have a summarization addon enabled?
>>
Anthracite Magnum guys, can you upload the epoch 1 version? Since it's based on instruct I'd probably prefer the epoch 1 version.
>>
File: 1724157921247794.gif (422 KB, 603x602)
any good CYOA RPG narrator cards/model config presets for ST?
basically want to achieve a similar experience to aidungeon but upgraded with new models and all the other bells & whistles that have been developed since then
running largestral if that matters
>>
File: 1722118573445645.png (57 KB, 608x162)
>>102021156
>she's always so excited when I get her a happy meal...
>>
https://huggingface.co/nvidia/Mistral-NeMo-12B-Instruct
https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
What are the differences between these two?
>>
>>102023248

Ah shit. I had the "Summarize Chat messages when sending" toggled! Most likely when I was fiddling around earlier today. That option got hidden when I unchecked the Enabled for Chat Messages toggle at some point in the day. Holy fuck. Unchecking the former finally fucking fixed it. What was the point of that setting even? At least my files are vectorized now and is being referenced by my char without throwing out keywords.

Thanks, anon.
>>
You think NVIDIA will train a minitron for 123b?
>>
>>102021819
It has VC-funded meetups in major global tech hubs where you can network with members of the LGBTQ+ community and the neurodivergent community. llama.cpp just does inference.
>>
Question, would people be interested in an alternative to SillyTavern?
If so, I'd be interested in knowing what features you to see added, modified or just straight up removed.
>>
>>102023701
Either make it or do not, don't ask for someone else's permission to make the thing you want.
>>
>>102023701
if it's better then people definitely would be interested. but if you don't know what to change to improve on st then it probably makes no sense for you to work on an alternative
>>
>>102023706
Don't get me wrong, I'm asking this question to know if I should bother making actual informative git commits and turn this into an actual project or if I can just do whatever.
>>102023748
I already have plans, but one should work on things based on what the "customer" (don't worry, the project will be open source) wants, not what I think they would want.
If I'm going to share this thing, I don't want to learn afterwards that I've just wasted a lot of my time on things nobody else actually wants.
>>
>>102023701
Depends on what it offers. A lot of how sillytavern works was lifted from how c.ai handles characters two years ago. I'm sure someone can come up with a smarter way to do character cards that is optimized for modern models.
>>
>>102023701
SillyTavern is good for the most part, the problem most people have are rather model related or resource related, like being able to run a 72b model on top of RVC speech synthesis, live2D, and image generation all at the same time for that VN like experience.
>>
File: ed2.png (180 KB, 680x1483)
180 KB
180 KB PNG
>>102023535
Picrel
>>
>>102023535
>Model Developer: NVIDIA and MistralAI
It was a collaboration, so they're hosting the same model under their own accounts.
>>
>>102023814
The nvidia one mentions BF8 being lossless
Is it the same for theother repo?

The quant i use is Q8 gguf (i assume 8 here is 8bpw).
Is BF8 (assuming 8 bits here too) the same, lossless quality?
>>
>>102023701
Many of SillyTavern's internal design choices come from it being originally a fork of TavernAI, you don't have to follow that path again.
>>
>>102023701
To me the fundamental problem with SillyTavern is that chats devolve over time because all information is stored in the context.
So things become more repetitive or the model simply forgets things that are pushed out of context.
So something with an extra system on top that explicitly manages things like locations, inventories, or the relationships between characters would be interesting (assuming it actually works).
>>
>>102023843

That's a LLM problem, not SillyTavern's.
>>
>>102023832
They're the same weights, only Mistral's is in huggingface format.
Q8 ~= 8.5 bpw iirc
I don't know enough about BF format to answer that.
>>
>>102023843
>So something with an extra system on top that explicitly manages things like locations, inventories, or the relationships between characters would be interesting (assuming it actually works).
This is exactly the reason why I started working on my own front-end.
It's good to see I'm not alone in my thinking.
>>102023899
I disagree. A lot of context could be saved if the front-end is capable of handling variables that inject specific context in the prompt based on the content of said variable.
Take locations, for example. Instead of giving a description of every location there exists in the story, you could instead keep track of a variable and inject only the description of the location you find yourself in.
I think it'd help a lot with stability.
>>
>>102023928
>Instead of giving a description of every location there exists in the story, you could instead keep track of a variable and inject only the description of the location you find yourself in.
dat a lorebook
>>
>>102023928
I don't know what your exact goals are, but if it's ERP I'd recommend you take a look at Lilith's Throne for reference.
It's a text-based game with explicit logic for e.g. clothing and penetration.
>>
>>102023701
For me the worst problem of SillyTavern is that you never get full and clear control of what exactly gets into the prompt, how it's formatted. It should simply use fully-editable templates with access to all internal variables that other options employ for prompt manipulation and injection.
>>
>>102023988
Sort of. Unfortunately lorebooks inject context based on keywords.
Imagine a scenario where you're living in an apartment complex and you're fucking the neighbour's wife.
When you prompt something like "I enter into the living room." it's very hard to get the lorebook to inject a description of the correct living room.

Variables are also useful for checking how characters are allowed to move from one location to another.
That way instead of being able to move from the bathroom straight to the store down the street, the model won't hallucinate you opening a door straight to the store, but will instead describe you moving from the bathroom to your living room, to your entrance hallway, to the streets, to the store.

I also have ideas for "events" that could occur while moving or taking actions.
For example when moving from one place to another the front-end could inject something like this:

User prompt:
>I finish taking my dump and stealthy exit the house.
Model response:
>You silently open the bathroom door and step outside into the hallway.
>Making as little noise as possible, you make your way towards the living room.
"Event" is injected (through prompting the model):
>You see the neighbour laying on her stomach on the couch, watching tv.
Systems waits for X seconds/pauses completely
User presses continue:
>You ignore her and reach the front door. You pull it open and exit. No one is none the wiser.
OR
User prompt:
>I slap her ass and run out of the door.
Model response:
>*THWACK!* You slap her ass and run towards the front door, slamming it behind you as you hear angry yelling coming from behind it.

This all needs a lot of tweaking and more brainstorming, but I'll figure it out while I work on the boring stuff first.
>>
>>102023701
Need better CoT prompting UI, better CoT RP models. The model should not reply until it has thought about what to do next. The whole RP ecosystem needs to be revolutionized. What we're currently doing is a relic from the GPT-3 eras.
>>
Damn. The guy says Stheno 3.4 was kind of disappointing but I love the prose. The most fun I've had in a while albeit not perfect.
What's the best way to tell the AI to take it slow during sex? Author note something like [When writing sex take it slow] and hope for the best? Any way to tell it I would like it to last like, I dunno, 5 of my post and five of theirs?
>>
>>102014173
:( I'm new to this and trying my best
>>
Ok I don't like Magnum 123b after all, still adds unnecessary respect and boundaries disclaimers when it's completely SFW
>>
>>102024189
Don't you want a vectorDb? Silly has the functionality build in, although I have no idea how good it is.
You can vectorize the chat messages and files (using the databank).
>>
>>102024246
I haven't tried the new stheno but I usually have something along the lines of
>Remember that not every intimate or affectionate scene needs to instantly escalate into an erotic scene.
or
>Introduce and progress sex scenes in an extremely slow pace, allowing {{user}} to interact or interrupt at any point during a scene.
in my system prompt for erp
>>
>>102024358
That's a pretty interesting idea: storing long term memories.
Model input and variables could be used to retrieve relevant memories, of which a brief summary (created by prompting the model behind the scenes) could be injected into the input before generating a response.
Although I wonder if the addition of past memories in the context would skew the output too heavily.
Guess I have something else to test in the future. Thanks for the idea, anon!
>>
>>102024593
>That's a pretty interesting idea: storing long term memories.
that's already a st feature and i remember st dev saying it didn't work well at all
>>
File: file.png (147 KB, 532x864)
147 KB
147 KB PNG
>>102024593
>>
>>102024218
That’s something that has been on my mind—not just CoT RP models, but also something as simple as the model always including its reasoning after each reply. LLMs are merely predicting the next token based on the context. When they say something like, 'Follow me, I will show you something you will never forget,' the LLM has no actual idea of what it’s going to show you. Therefore, I believe that if the LLM consistently made its reasoning explicit, the role-play would feel much more natural.
>>
>>102024686
Isn't this for summarizing existing dialogue and using that plus the new prompt to continue the scenario?
I was more thinking of storing every separate paragraph (or x amount of lines) of dialogue in the database as a long term memory, while then retrieving memories based on existing dialogue (short term memory), then summarizing that and prepending it to the prompt.
So it would be like:
>user sends prompt to front-end
>front-end uses existing dialogue in context window (short-term memories) to search through the vdb for relevant values (long-term memories)
>using none of the roleplay prompts, the front-end sends all long-term memories to the model with the task to summarize them
>this summary is prepended to the context window (which includes the user's prompt) and is sent to the model
>the output is displayed to the user
>>
>>102024817
If you leave the previous CoTs in the context, the model will learn bad patterns, that's why a new frontend is needed to remove them and only keep, say, 2 most recent CoTs
>>
>>102024883
that's what it does? read the options either you summarize or not, and then it searches for relevant stuff in 'long term memory' (the vector db), then it feeds that either as is or summarized to the llm
>>
>>102024904
you can use regex to remove it after the response is done, that's what cot prompts from aicg do
>>
https://huggingface.co/collections/ai21labs/jamba-15-66c44befa474a917fcf55251
>>
File: file.png (29 KB, 710x125)
29 KB
29 KB PNG
>>102025061
well
>Jamba 1.5 Mini (12B active/52B total) and Jamba 1.5 Large (94B active/398B total) are also optimized for business use cases and capabilities such as function calling, structured output (JSON), and grounded generation.
moe bros eating perhaps
>>
>>102025061
>>102025083
Did they ever finish implementing Jamba support in llama.cpp?
>>
File: file.png (81 KB, 719x421)
81 KB
81 KB PNG
>>102025061
let me just get my "single node of 8 80GB GPUs" online
>>
>>102011438
Imagine being this full of yourself.
>>
>>102025287
you love her, don't you?
>>
All these mambas, but where bitnet?
>>
>>102025405
up my bumba
>>
File: llamacpp.png (178 KB, 666x703)
178 KB
178 KB PNG
>>102025287
So, so happy jart is saving llama.cpp from itself. A competent engineer would have architected it as a collection of typescript microservices orchestrated by kubernetes. It would have been dynamic, abstract, and tightly scoped.
>>
>>102025287
Honestly, with everything the fag has accomplished? He's allowed to be full of himself.
>>
>>102025411
tldr: llama.cpp should've been llama.py

If you think about it, writing it in pure cpp is just a flex. LLM is bottlenecked by memory speed, not code speed.
>>
>>102025477
mmap that makes performance worse and is often recommended to be disabled, licensing drama, optimistic at best speed improvements for pure cpu 10k+ rigs?
>>
>>102025568
>>102025568
>>102025568
>>
>>102025632
...have you even seen his work?
https://github.com/jart/cosmopolitan
Dude is doing things most people thought wasn't even possible.
>>
Jews won
>>
>>102025694
What does this mean?
>>
>>102025679
no? why would i follow anything unrelated to llms? i try to avoid getting jarted when possible
>>
>>102025706
I hope you're not working in the IT industry, because keeping up with the latest tech developments is literally your job lmao
>>
>>102025694
Source? Is mistral large really that bad past 32k?
>>
>>102025722
why the fuck would anyone want to work in it with their current state? i see threads here on the daily about them struggling to find anything for work kek
>>
>>102025741
>Source
https://github.com/hsiehjackson/RULER
falls off a cliff after 32 yeah
>>
>>102025287
Jart is a genius, he can brag all he want. I want to be like him when I grow up.
>>
>>102025762
Mistral AI, I know you wagies are lurking, listen to this and improve your model!! I demand something as good as Claude Sonnet to run on my gaming PC with 2 GPUs. Get to work!! 9am to 6pm, look at the Jira tickets, who is assigned? Don't forget today's meeting at 3 pm, I want to see the progress you have made!!
>>
>>102025741
Lack of long multiturn data does that.
>>
>>102025287
>This is 8x faster!
>But you have to measure in super specific ways otherwise you don't even see it!
>Please delete your code and replace with mine

>This change makes GeLU go 8x faster on Intel,

>If you're measuring the big picture performance impact of this change, you're only guaranteed to get a noticeable speedup if you do an apples-to-apples comparison of this vectorized tanhf() with the libc version.

>I now recommending deleting the old FP16 LUT code, just like we did before when I optimized SiLU.
https://github.com/ggerganov/llama.cpp/pull/8878
>>
>>102025679
>Dude is doing things most people thought wasn't even possible.
If by "people" you mean "clueless retards" then yes.
There's absolutely nothing impressive about this project if you know the most basic shit about compilers and their toolchains.
>>
>>102025779
>Jart is a genius
She has quite the fan club on HN and X, myself included. With her background, she would be perfect to lead the llama.cpp Rust rewrite.
>>
>>102026180
lmao
>>
>>102026183
>llama.cpp Rust rewrite
Not worth it, imo.
The speed gained by using c++ instead of Rust is worth it.
>>
>>102024330
There's no 123b base model which is probably partially why



[Advertise on 4chan]

Delete Post: [File Only] Style:
[Disable Mobile View / Use Desktop Site]

[Enable Mobile View / Use Mobile Site]

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.