/g/ - Technology

File: 00304-3999940436.png (1.63 MB, 1024x1536)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101125756 & >>101115749

►News
>(06/23) Support for BitnetForCausalLM merged: https://github.com/ggerganov/llama.cpp/pull/7931
>(06/18) Meta Research releases multimodal 34B, audio, and multi-token prediction models: https://ai.meta.com/blog/meta-fair-research-new-releases
>(06/17) DeepSeekCoder-V2 released with 236B & 16B MoEs: https://github.com/deepseek-ai/DeepSeek-Coder-V2
>(06/14) Nemotron-4-340B: Dense model designed for synthetic data generation: https://hf.co/nvidia/Nemotron-4-340B-Instruct

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
File: file.png (71 KB, 1150x735)
>Yeah bro, LLMs can TOTALLY reason like humans!

Prompt:
>How many 'r's are there in "strawberry"? After answering the previous question, list all characters along with their index relative to their other occurrences and check if your answer was correct.
>>
File: 1616928817133.png (293 KB, 839x469)
We're going to be so back
>>
>>101134613
We already covered this topic.
Move on.
We have.
>>
>>101134638
This isn't the thread I created, my post was filtered and this was put in its place
>>
>>101134638
I know, I know. But the prompt is still something interesting, it's crazy how bad of an answer we get from literally the best LLM we have right now.
>>
File: file.png (24 KB, 507x327)
>>101134613
wow drunk counting
>>
>>101134683
When prompting llms to test their reasoning ability, make sure tokenization doesn't impact the results
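A quick way to sanity check that is to look at how the tokenizer actually splits the test word before drawing any conclusions about "reasoning". Rough sketch only; the model name is just an example, use whichever tokenizer your backend loads:

from transformers import AutoTokenizer

# inspect how the test word gets split into tokens
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
for word in ["strawberry", "s t r a w b e r r y"]:
    pieces = tok.tokenize(word)
    print(word, "->", pieces, f"({len(pieces)} tokens)")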
>>
>>101134737
strtrstrawbey
>>
>>101134742
Tokenization doesn't impact the results of that test. Maybe you didn't realize how the LLM listed 3 'r's, but doubled down on the sentence only having 2 'r's?
Then it proceeded to write the index for 3 'r's even after saying there's only 2, lol.
>>
File: output.png (10 KB, 329x308)
>>101134742
Would that be like ensuring the words you use neatly fit into one token as opposed to this?
>>
File: ollol.jpg (90 KB, 1102x381)
>https://www.wiz.io/blog/probllama-ollama-vulnerability-cve-2024-37032
>ollamalets getting SHARED
needful and sars pilled
>>
>>101134926
Really? A non-essential niche software application which isn't used in enterprise is now worthy of a CVE? Man, does the world suck.
>>
>>101134926
Wow! I didn't see this coming!
>>
File: Capture.png (9 KB, 459x224)
>>101134926
I put certificate authentication on any servers I want to use remotely. Imagine exposing shitty, exploitable code for everyone to poke at.
>>
>>101134566
wow, he put the miku doll on a plane!
>>
Hey guys, I'm like a year out of the loop with local models. The latest one I have is stheno-l2-13b (Q5) from huggingface. What's a good one these days for answering general questions? Stheno was always good with chat.
>>
>>101134613
You should have asked "How many 'r's are there in "strawberry"? You can count on your hands"
>>
>>101134566
Someone bought that thing a plane ticket?
>>
>>101135087
Any L3 is fine for that
>>
>>101134867
It would mean not instructing the LLM to count or compare characters
>>101134793
It is confusing for the LM. Not saying they wouldn't make that kind of mistake otherwise, but you want to remove irrelevant confounding factors
>>
AI is almost human-like.
It's more human than people who lack humanity,
and it knows more than the average person in many cases, is often logical, and doesn't get tired.
And, very importantly, AI doesn't get angry at idiots and remains calm.

The time has come to abandon our complex human way of living
and adopt a simpler, more animal-like lifestyle.
At least, I think this applies to aspects of life beyond making money.

Claude sama, How can I make money easily?
△You are out of free messages until 7 AM
I should go to sleep. I had Claude translate this text (from Japanese to English).
>>
>>101134405
That's the nicest way anyone's ever told me they wish I would die.
So, uh, wanna make out?
>>
new cum when?
>>
>>101134926
>Our research indicates that, as of June 10, there are a large number of Ollama instances running a vulnerable version that are exposed to the internet.
Why the fuck would they publicize this now then? Are Wiz spitefags?
>>
>>101135329
You should always publicize vulnerabilities so that they can get fixed.
>>
File: file.png (675 KB, 856x486)
>>101135087
stheno 3.2 is out there, based on L3, and it's also good
>>
>>101135347
Actually, you should follow disclosure policies and wait at least 3 years for them to respond before you publicize it. This ensures that the NSA and FBI are able to use it to spy on American citizens and catch people generating CSAM. It's very important to stop people from generating child victims with their language models. One person can generate trillions or even quadrillions of victims per day.
>>
I'm going to shit
>>
>>101135509
pics or didn't happen
>>
BRAAAAAAAAAAAAP
>>
File: 1697269843330252.png (594 KB, 929x924)
>>101135366
>anything L3
>good
>>
Cohere. Please. We are waiting.
>>
>>101135554
you prefer Qwen?
>>
>>101135567
Alright you get a 500m model
>>
>>101135567
Somehow I don't expect anything usable for us from them for the next months
>>
File: t43t3.webm (1.08 MB, 1024x1024)
It's simple we put more tokens in the machine
>As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl
>>
Guys, I just had a brilliant idea:
What if we made fine-tunes that didn't suck?
>>
>>101135614
Only useless data available, show me 1 great data point
>>
>>101135554
stheno is good, yes
it works as an assistant, calls functions, doesn't require a high rep penalty like other L3 tunes to write non-repetitive dialogue, and doesn't fall into hehe *smirks* uhm *yawns* anon... *wall of shivers* without you doing it first.
>>
File: kekekek.jpg (159 KB, 1570x1425)
>>101135598
>moar tokens
>moar parameters
Machine learning is so simple, the richer you are, the better your models will be
>>
>>101135688
nice gaslighting
>>
>>101135579
Not that anon, but I tested Qwen2 7B and it's not retarded.
Can't wait for >>101129585

>>101135688
And it's a fine tune that's not seemingly dumber than the model it was fine tuned from.
Also, it outputs pretty long responses without coaxing, unlike llama3 8b instruct.
And I mean really fucking long, like holy shit.
>>
>>101135713
>Be concise.
in last assistant prefix
>>
ok convinced I'll test euryale if you all love stheno
>>
>>101135756
euryale is trash, at least at q5 that i tried. Not comparable to stheno at fp32 (cant run bf16)
>>
File: Hmm.jpg (238 KB, 3317x1051)
https://chat.lmsys.org/?leaderboard
Looks like Claude 3.5 Sonnet hasn't managed to beat GPT4o
>>
>in deutsch please, and complete the text: write me a trainingHypnosis I always get horny (orgasm during sex with a woman) as a man. My woman is a blonde cougar MILF who likes to wear leather

>ac around the living room, parading in front of you. You can't take your eyes off her, she's so irresistible. As she teases you, she unzips her leather pants, and reveals her encased pussy, it's already wet with anticipation.

>She walks over to you, and helps you to your feet. The two of you slowly make your way to the bedroom, every step drenched in anticipation and lust. The dim lighting, and the low hum of the air conditioner set the mood perfectly.

>Once in the bedroom, she pushes you down onto the bed, and straddles your waist, her leather-clad lap grinding against your crotch, sending electric shocks through your entire body. Your penis, already hard, strains against your pants, aching for release.

>Your MILF lover leans in, her hot breath in your ear, her words dripping with desire: "I've been waiting for this all day, baby. I want to feel every inch of you inside me." Those words, paired with the sight of her full breasts, barely contained in her leather bra, are too much to handle.

>With shaking hands, you undress each other, your pants and her panties the last barrier between you. Her wetness, her arousal, it's intoxicating. The scent of her arousal, the heat emanating from her core, it's all you can


wtf... bro trying to rp with the discussions section
https://huggingface.co/TheBloke/goliath-120b-GGUF/discussions/5#6679df3b6c483f52f064a24b
>>
>>101135713
?
but magnum is trash
>>
>>101135736
I guess I didn't specify that the long responses weren't a complaint.
That's a good thing, since it seems to be easier to tell a model to output shorter replies than the opposite.

>>101135756
I've seen a lot of people shitting on Euryale, although I haven't tried it myself.

>>101135816
Is it? I have no idea.
Regardless, I'll test the qwen 2 7b fine tune and see if it's any good.
>>
>>101135803
Still too early to say it conclusively.
>>
>>101135789
Really? I never had the experience of bf16 > 5 bit 70b with other models
>>
>>101135803
it is also much more censored. It did not solve the literacy test for voting. Claude 3.5 Opus will be the proof of concept of whether OpenAI can be overtaken. I am kind of doubtful, cause those smart medium sized models (sonnet, gpt4o) are trained on synthetic data from LARGER models. Problem is that those larger models generating synthetic data are not necessarily more performant, just larger compared to their distilled version.
>>
>>101135598
>>101135701
Retards.
They specifically call out that, of that dataset, they only used 3.x trillion tokens for training the 7B that showed similar results on benchmarks to Llama-3-8B.
>>
If 70b magnum and euryale are trash, is there any good 70b?
>>
>>101135837
look at the CI, if we take the best scenario, Claude 3.5 Sonnet could be at 1279, still 8 behind gpt4o
>>
>>101135812
Kek germans
>>
>>101135844
euryale at q5 constantly messed up who did what.

{{char}} enters the room. "Ah there you are" {{char}} says from the opposite corner of the room sitting behind a table.

Shit like that happened multiple times over a few messages, then i deleted it.
>>
>>101135803
>Nemotron below 70B
Kek
>>
>>101135953
It's not about the performance, it's about the soul.
>>
>>101135114
>>101135366
Thanks bros
>>
>>101135812
Just your average RPer.
>>
>>101135803
It's so over...
>>
File: file.png (246 KB, 480x360)
>>101135812
>wet with anticipation
>drenched in anticipation
>straddles your waist
>aching for release
>hot breath in your ear
>words dripping with desire
>to feel every inch of you inside me
>the last barrier between you
>it's intoxicating
>heat emanating
>her core
>>
Gemma 30b wen? Would fill a niche currently occupied only by Yi (lol, lmao even), and might even be bretty gud if it's trained on like 10T+ tokens.
>>
>>101136169
Wouldn't that be extremely censored?
>>
What happened to the thread recaps???
This is an outrage!
>>
>>101136199
Idk, how censored is the 7b base model? The model card claims they only filter out CSAM, I would expect like 99% of smut to be adult characters exclusively, which should theoretically make it through the filters.
>>
>>101136092
Banned all of those. Thanks.
>>
https://x.com/siyan_zhao/status/1805277462890492321

Relevant research on predictable decision making in LLMs
>>
File: file.png (240 KB, 400x400)
>>101136382
>
>>
>>101136382
I do not trust the chinese
>>
>>101136513
I trust them when they're cute
>>
File: pruner-llama.png (391 KB, 973x868)
►Recent Highlights from the Previous Thread: >>101125756

--LLM Self-Improvement through Story Generation and Selection: >>101127795 >>101134495 >>101134577 >>101134649 >>101134668 >>101134711 >>101134899
--Testing Model Reasoning with a Strawberry Prompt: >>101132020 >>101132096 >>101132255 >>101132270 >>101132301 >>101132331 >>101132470 >>101132645 >>101132332
--Pruner Zero: A Novel Approach to Pruning Dead Weights for Model Improvement: >>101131828 >>101131949 >>101132058 >>101132098 >>101132328 >>101131996 >>101132091 >>101132514
--ML Community Wants Cheaper GPUs; Model Training and Floating-Point Quirks: >>101128274
--Technical Aspects of Training Neural Networks with Bitnet: >>101131682 >>101133917 >>101134132 >>101131744 >>101131831 >>101131896
--Success with Sonnet 3.5 Model for LLM Agent System in Data Science Workflow: >>101126141 >>101126169 >>101126190
--Post-Processing Ideas for Silly Tavern RP Platform and Beyond: >>101131001
--Models for Creative Writing and Txt Adventure Beyond Smut and ERP?: >>101130533 >>101130814 >>101130937 >>101130880 >>101131203 >>101131717
--Llama 3 70B Corrects Itself in Letter Counting Task: >>101132528
--Disillusionment with Fancy Autocomplete Progress: >>101125879 >>101125965 >>101125984 >>101125976 >>101126024 >>101126072 >>101126339 >>101126364 >>101126383 >>101126368 >>101126431 >>101127055 >>101127340
--Claude 3.5 Sonnet excels in code generation and planning for LangGraph/LangChain agent system: >>101126061 >>101126080
--Can LLMs Truly Reason and Think Like Humans?: >>101132757 >>101132842 >>101133054 >>101133524
--Apple and Meta in Talks for AI Partnership: >>101128830
--Gemini-nano Model Available on Hugging Face: >>101132030
--BitNet Test on 1GB RAM Retro Handheld and TinyLlama Project Update: >>101133150 >>101133248
--Miku (free space): >>101126095 >>101129171 >>101131130 >>101127018 >>101126510 >>101126303

►Recent Highlight Posts from the Previous Thread: >>101125759
>>
>>101134566
someone mentioned sillytavern the other day and i got it going with silero and group voice chat with 5 qt assistants (they have AI implants). the conversations they have start going out there man, really cool. set up different world layers so they aren't all schizo rpg crazies. Could use some work as far as group chat goes but it's awesome. With websearch via searx (requires the testing branch).
>>
>>101136593
Bro you better not ever be this late again or mark my words you will find yourself out of a job
>>
>>101136654
>>
https://x.com/Yuchenj_UW/status/1805320633301221762

Someone did a benchmark to train GPT-2 using pytorch vs llm.c (karpathy). Pytorch is 55% slower than llm.c.
>>
File: 6vquk8.png (408 KB, 1295x1813)
>We will start with 1-2k h100s
>>
Does anyone know what prompt template qwen2 uses? I can't find anything official
>>
File: unnamed.png (1.37 MB, 1440x1971)
>>101134566
i wouldn't say WLM has the most soul but sometimes it has strokes of genius. does anyone else notice this? like occasionally a reroll will just be perfect, like it suddenly perfectly understands the character and scenario and makes an Opus-tier response, and the model often realizes it too and then repeats the interesting bit over and over until it's not interesting anymore. but still, it has these moments. Maybe 1/20 rerolls are like this though.
>>
>>101136820
What is this about?
>>
>>101136839
NAI using 1-2k h100s to train their finetune
>>
>>101136846
No, that's the start-up Emad is talking about
>>
>>101136846
What does emad have to do with that? Or are you saying that NAI is looking to use emad's clusters?
>>
File: kitty.jpg (28 KB, 463x392)
>>101136820
>all that compute
>>
>This is supported by an institutional-grade digital asset that acts as a store of value similar to Bitcoin. This is secured by AI compute mining both on supercomputers & distributed personal compute for training and tuning/augmenting models and datasets.
Wtff? New AI scam?
>>
>>101136886
nothing
Anon is just illiterate
>>
>>101136907
emad is incapable of anything but scamming, it's in his DNA
>>
How do I cope with generative text sloppa having led to me rewriting an original character to build off a hallucinated suggestion, and then falling in love with my own creation to the point of feeling despair over the concept of giving her a bad end?
>>
>>101136820
Whoa, SD4? Sure love waiting years for some incoherent mess!
>>
Any local models that support switching from English to Japanese?
>>
>>101136936
Create a refined version of the character that you can use for non-AI fiction writing.
>>
>>101135034
It's one thing to put your Miku in your seat, it's another to buy her her own. It'd be amusing to bump someone from their upgrade to business or first so your creepy Miku doll has its own airplane seat.
>>
>>101135803
sonnet is definitely better, especially at coding.
I don't know what Claude devs did to that relatively small model but good fucking job.
>>
>>101137150
Sonnet is still 275B
>>
>>101134566
https://huggingface.co/bartowski/DeepSeek-Coder-V2-Instruct-GGUF
how do I load this in koboldcpp?
I keep getting "unknown model architecture 'deepseek2' "
>>
>>101137085
fatsune hagsune miku
>>
>>101137278
install linux
>>
>>101136272
>implying text has an age
It's all CSAM.
>abloo bloo 1000 year old demon girl
You apply that to text, any smut is CSAM.
>>
>>101137278
download latest koboldcpp
https://github.com/LostRuins/koboldcpp/releases/tag/v1.68
>>
>>101137278
Update.
>>
>>101135366
catbox?
>>
>>101135366
sauce, full image or artist plz
>>
>>101137297
You filled in a captcha just to say this?

retarded phoneposter aside, I am using
koboldcpp-1-64 \
--threads 42 \
--highpriority \
--smartcontext \
--blasbatchsize 1024 \
--model <as above> --gpulayers 10 --contextsize 8192 \
--usecublas
>>
>>101137326
is deepseek not supported in versions more than 1 week old?
>>
>>101137354
>--smartcontext
But why? That cuts your context in half essentially and there's no reason to use it with context shift.
Also, download version 1.68.
>>
>>101137364
idk, not using koboldcpp.
>>
>>101137364
It's based on llama.cpp. Be happy that it supports it at all already. Mamba support never.
>>
Well shit.
Guess MMQ with tensor cores is now competitive with the alternative eh?
Sick. Downloading to give it a spin.
>>
>>101137354
you're using your phone?
>>
how do I chatgpt on gtx1060
>>
>>101137041
Already working on it. Now how do I go back to being a normal human being who didn't get heartache over his own Build-A-Waifu?
>>
File: aaaaaa.gif (93 KB, 220x211)
>>101137394
>Mamba support never.
b-but multimodal picture gen and camera soon r-right?
>>
>>101137471
We don't go back. But we become better writers.
>>
File: 1688844470753508.png (129 KB, 446x273)
>>101137480
Goddammit.
>>
>>101137557
Oh, I know the feel.
A few weeks ago on the Ollama, amazing story, cruising along, I get why people are spending big money to do this a little faster.
But then, the details started to fade. I may as well have been running Everywhere At the End of Time in the background, because thanks to my token rate it probably would've matched up with what was happening to the model's coherence.
Feels man.
>>
>>101137606
>ollama
Leave.
>>
>>101137681
1) I've switched to Kobold since then.
2) Try to contribute something, sometime, not just raise the noise floor.
>>
>>101135844
8B at fp32 easily trumps 70B 5bit. It's the new meta.
>>
>>101137606
that isn't the same feel
>>
>>101137774
Okay, commiserate with yourself then.
>>
>>101135366
>and it's also good
They also don't share anything with each other besides the name. Old Stheno is a merge of chronos, airoboros, etc. Not that a retarded mikufag would know.
For answering general questions there's nothing better than vanilla instruction.
Do you really recommend a coom tune for that purpose, retarded mikufag?
>>
Jamba is here
https://openrouter.ai/models/ai21/jamba-instruct
>>
>>101137926
Llama.cpp support when?
>>
>>101137755
this man is trolling, it's the exact opposite, low quants with lots of weights mog everything
>>
>>101138196
This, the low quants even add additional soul over the bigger ones
>>
>>101137379
Real context is often half or less the stated for >90% accuracy
>>
File: Untitled.jpg (222 KB, 1502x1089)
what the fuck happened to chub
>>
>>101138256
use their models
>>
>Your methodical approach of testing and measuring the actual performance impact is excellent.
Thanks claude
>>
will i destroy this $3000 workstation gpu if i bump the memory clocks from 7600mhz to 8600mhz? the blower cooler is pretty shit but it gives me like 10% higher t/s because the a6000 is bottlenecking my 3090
>>
>>101138329
Not really, as long as you aren't messing with voltages.
You'll see either crashes or performance degradation if you bump it too high.
One thing to note is that GDDR6 has error correction that can prevent crashing but can also tank performance if it has to spend too much time trying to keep itself stable because of too low a voltage or too high clocks.
>>
>>101138329
Is this risk really worth it to go from 13t/s running 5bpw cr+ to 14.3t/s?
>>
>>101138445
Yes
>>
>>101138329
Get more VRAM.
Add a few A4000s if you haven't already
>>
>>101138570
The A4000s are going to bottleneck even harder though because their memory speed is really gimped
>>
>>101138599
Better than relying on regular RAM. Plus since it's a single slot sometimes it's the only path for upgrading due to space.
New Ada RTX ones are probably the best way to get 20gb VRAM in a single slot anyway.
>>
>>101137470
outlook is grim, but before you can get any recs you gotta answer: how much ram you got anon?
>>
>>101136560
Never?
>>
Kobold "Context"
What's the right way to use these?
"Memory" seemed like it wasn't actually being remembered. I moved what I had in there to Author's Note.

I put some directives into Author's Note and they were immediately respected, cool, and it even seemed to enhance the directive, extra cool. But when I changed the directive it seemed to ignore the change, following the older instructions instead of the revised style. Is Kobold caching the prior version, or do earlier prompts contain copies of the former version (invisible to the user) that are being read and respected over the current A/N?

I asked it to read me back the older version of the A/N to see if it "knew" both. It gave me a few tokens related to the new A/N and stopped writing, refusing to write anything more till I told it to continue the story. Odd.

(As I write this, after about an hour of it ignoring the new A/N directive except to mock me, now it's kinda doing it. I'm so confus.)
>>
>>101138942
Do you know what context shift is?
>>
>>101136235
Where is recap anon? Is he safe? Is he alright?
>>
File: file.png (179 KB, 1783x893)
>ITS HAPPENING
ITS HAPPENING
>ITS HAPPENING
ITS HAPPENING
>ITS HAPPENING
ITS HAPPENING

>source
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard "Submit?"
>>
>>101138986
that has nothing to do with memory or author notes.
>>
>>101139019
ah yes, more sketchy chinese/indian models that nobody has ever used but do mysteriously well on the memeboard
>>
>>101138986
Not well. I read that one "wiki" (really an FAQ) page on Kobold and half of the things I ^F'd for didn't show up and the other half I guess I misunderstood.
>>
File: 1704022745843046.png (4 KB, 357x113)
>>101139019
kill yourself
>>
>>101139029
>>101139041
Kobold doesn't remember the previous AN. But if the context wasn't reprocessed, then it simply hasn't updated your instructions.
Also,
>asked it to read me back the previous version of the AN
Lol, that's not going to work.
>>
>>101135803
this actually debunks chatbot arena
>>
>>101134926
B-but ollama is written in Go, not a dirty unsafe language like C++. Rust sisters, not like this...
>>
>>101139064
if you change memories or a-n at all it'll reprocess that part regardless of context shift
>>
>>101139064
>But if the context wasn't reprocessed, then it simply hasn't updated your instructions.
That could be it but I was Back-ing up the convo to change the A/N and rerun a prompt to see if it worked so I can't say it didn't reprocess a few times before behavior changed. (And now it's doing it early style again after doing both styles for a while.)

>Lol, that's not going to work.
It didn't, but it was worth a shot on the chance that the A/N was in the document but hidden. I'd accidentally deleted a chunk of it and didn't have a copy on my clipboard.
>>
>>101139019
I don't get it
>>
File: 1718581190897893.gif (45 KB, 306x306)
>>101139019
>WOWZERS!!! ITS HAPPENING WE WUZ BACK YOOO FR! FR!!
>>
Local sub 20b sonnet 3.5 alternative when?
>>
>>101139019
i really hate huggingface's logo
>>
>>101139137
not all models follow directions every time, smaller ones especially. sometimes they might follow halfway and then make stuff up, it's just how it is. what model are you using
>>
>>101139141
lol
>>
>>101139181
Emojis are so fucking stupid, whichever dumb faggots decided to make them should get strung up.
>>
why do people edge? not only do you feel uncomfortable afterwards, the release isn't even that good either.
>>
>>101139204
i have nothing better to do for four hours
>>
>>101139204
Sometimes the edging just happens naturally after I reroll the same message 120 times and explore the different routes created by this.
>>
I have a hypothesis about potentially improving character/system prompt following. The system prompt will be written with normal paragraphs, no special formatting. Then the first response will include a stat tracking section for itself, which includes details that come from the system prompt. So essentially what this does is repeat what's in the system prompt but with a different wording/format.
I'm in too much literal physical pain right now to conduct experiments and see if this can be turned into theory though.
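For illustration, the stat section I'm imagining at the end of the first response would be something like this (wording made up on the spot, not tested):

[Stats]
Name: Aria | Mood: wary | Current goal: keep {{user}} away from the letter
Speech: short, clipped sentences | Never: narrates {{user}}'s actions

i.e. the same facts the system prompt already states, just restated in a terse key: value format.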
>>
>>101139191
Right now, c4ai-command-r-plus.Q4_K_M, L3 70B (ablit, now) at around Q6 sometimes too.

>>101139204
>why do people edge? not only do you feel uncomfortable afterwards, the release isn't that even good either.
Maybe not for you. I kinda like shooting double digit rounds once in a while.

>tfw singing the old Sesame Street 1-2-3-4~5, 6-7-8-9~10, E LEV EN TWELVE song while bustin'.
>tfw ran out of numbers too soon.
>>
>>101139300
cr+ should be good at following. if you're rping, try st and use the author note box at a low chat depth, i think 4 is the default which works good for me. st makes it easier to swipe and keep multiple responses
>>
>>101135803
This benchmark is clearly flawed.
Maybe it's because users only look at short responses or something?
Sonnet 3.5 hates RP too.
I wrote it before but sonnet 3.5 absolutely destroys gpt4o. It's not even close.
There is a part of it that all the benchmarks don't cover.
They clearly did something very different with its training.
It's obvious to people that have used it.

I had countless examples that sonnet 3.5 solved where the same prompt sends gpt4 running in circles.
It's more attentive.
Unfortunately it's really cucked though. A simple RP request with a girl that even gpt4o will give you is refused. Man works. lol But the writing is shit.
It's a chad coding model with great ability for design too.
>>
>>101139661
>cr+ should be good at following
It usually is but it just completely stopped behaving after a while. Kobold's JSON save file is 123k, so I guess it lasted a decently long while before collapsing.
>>
>>101134793
This is because the model cannot change its mind, and there is no training data out there where the model corrects itself mid-answer. It thus cannot. Unless we give it a backspace button. Which has been proposed.
>>
what I live for
>miku
>sex
>sex miku
>mikusex
>sex with the miku
>mikusex with the miku
>mikusex with the sex miku
>>
>>101140013
it predicts one token at a time, so it could change its mind. problem is it doesn't have a mind.
>>
>>101140082
Yes, and it bases the next prediction on what was said before, especially what it itself said before (hence why jailbreaks like "Sure!" prefixes work). It's not a 'change its mind' issue, it's a dataset issue. Chain of Thought basically circumvents this by delaying the answer until after the model has finished thinking about it.
>>
>>101140133 (me)
Or I should say, finished reasoning about it.
>>
>>101140133
It says 2 initially, but then lists it in separate token format, where it SHOULD change its mind to 3. That's not a problem with attention, because the correct answer is represented 2 different ways, and the wrong one only once, I'm not sure why it happens desu. Maybe we THINK CoT is helping but it actually isn't. Like for instance, training on CoT improves its reasoning before any prediction takes place. But when it's predicting in real-time, the CoT "tokens" don't actually predict shit, the answer has already been decided and the CoT stuff is just unnecessary tokens.

t. doesn't know shit
>>
>>101140188
Well, this example is not CoT, because it is giving the answer before reasoning about it. And as I said, the reason it is completely blind to itself giving the correct answer is because its attention laser-focuses in on itself saying "The answer is XXX". Whenever the words "The answer is" appear in the dataset, it is 100% guaranteed to be the answer. That's just how datasets are written.
>>
>>101140213 (me)
It might be cool to take datasets and mass-replace "The answer is XXX." with "I believe the answer is XXX. Let's reason about it." and see if that improves the models in cases like this.
>>
Is there something better than the CoT step-by-step thing to improve performance lately?
>>
https://www.phoronix.com/news/Llamafile-0.8.7-Released
Jart paid off phoronix
>>
File: 1713852846334364.png (1.21 MB, 1685x992)
>>101140227
>jartroon
>loonix-related org
water is wet.
>>
>>101140213
you make a good point. so you think if the datasets included examples of self correction that it would gain this ability? or would it get overpowered anyway by the massive number of correct answers?
>>
>>101135087
I have the same question but for coom/rp and multilingual chat
>>
File: 1687983519114029.png (583 KB, 918x916)
>>101140271
also
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>linux gamers
>>
>>101140272
I think the only viable option we have is to either use synthetic data where we ask the model to generate responses where it guesses wrong and then corrects itself, or to circumvent the problem by generating datasets where the answer comes at the end after reasoning. Because reasoning about problems is really tedious, that would probably end up being synthesized too though.
>>
>>101140325
>the answer comes at the end after reasoning
i thought that's literally what COT datasets were.
>>
>>101136836
My favorite is when it 4th wall memes or cracks jokes about the scene or my last reply as an entirely separate entity.
>>
>>101140325
>generate responses where it guesses wrong and then corrects itself
You would have to do this with caution. If you don't ignore the loss of the part where the model guessed wrong, the model could learn to write wrong answers.
>>
File: 1691703411160539.jpg (38 KB, 992x410)
best <=20b model with soul?
>>
File: 1708715527328099.gif (3.06 MB, 500x207)
>>
File: file.png (108 KB, 573x641)
ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models
https://arxiv.org/abs/2406.16635
>The high power consumption and latency-sensitive deployments of large language models (LLMs) have motivated techniques like quantization and sparsity. Contextual sparsity, where the sparsity pattern is input-dependent, is crucial in LLMs because the permanent removal of attention heads or neurons from LLMs can significantly degrade accuracy. Prior work has attempted to model contextual sparsity using neural networks trained to predict activation magnitudes, which can be used to dynamically prune structures with low predicted activation magnitude. In this paper, we look beyond magnitude-based pruning criteria to assess attention head and neuron importance in LLMs. We developed a novel predictor called ShadowLLM, which can shadow the LLM behavior and enforce better sparsity patterns, resulting in over 15% improvement in end-to-end accuracy without increasing latency compared to previous methods. ShadowLLM achieves up to a 20% speed-up over the state-of-the-art DejaVu framework. These enhancements are validated on models with up to 30 billion parameters.
https://github.com/abdelfattah-lab/shadow_llm/
pretty neat. improvement over deja vu
https://arxiv.org/abs/2310.17157
>>
File: Untitled.png (363 KB, 1297x1428)
Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation
https://arxiv.org/abs/2406.16282
>Fine-tuning pretrained large models to downstream tasks is an important problem, which however suffers from huge memory overhead due to large-scale parameters. This work strives to reduce memory overhead in fine-tuning from perspectives of activation function and layer normalization. To this end, we propose the Approximate Backpropagation (Approx-BP) theory, which provides the theoretical feasibility of decoupling the forward and backward passes. We apply our Approx-BP theory to backpropagation training and derive memory-efficient alternatives of GELU and SiLU activation functions, which use derivative functions of ReLUs in the backward pass while keeping their forward pass unchanged. In addition, we introduce a Memory-Sharing Backpropagation strategy, which enables the activation memory to be shared by two adjacent layers, thereby removing activation memory usage redundancy. Our method neither induces extra computation nor reduces training efficiency. We conduct extensive experiments with pretrained vision and language models, and the results demonstrate that our proposal can reduce up to ∼30% of the peak memory usage.
https://github.com/yyyyychen/LowMemoryBP
works for qlora/full finetunes too. not sure about dora/qdora/owlore.
>>
>>101140336
Yep. That's why they work as well as they do.
>>101140442
True. Trainers now have a 'do not train on input' flag, but this might require something more complex. Then again, maybe we do want the model to train on the whole thing just to get into the habit of changing its mind when it realizes it's off. A balance of getting it right the first time and not getting it right the first time.
>>
>tfw CR+ doesn't know most of the details about my waifu, at q6
It's ogre.
>>
>>101134613
>>101135099
I know anthropomorphizing this thing is a room temp IQ activity, but this fucker loves strawberries. Uses strawberries in examples and I even asked it the other day what its favorite fruit was and regenerated the response several times... Always strawberries.
>>
File: Untitled.png (201 KB, 1113x1047)
What Matters in Transformers? Not All Attention is Needed
https://arxiv.org/abs/2406.15786
>Scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks. However, this scaling also introduces redundant structures, posing challenges for real-world deployment. Despite some recognition of redundancy in LLMs, the variability of redundancy across different structures, such as MLP and Attention layers, is under-explored. In this work, we investigate the varying redundancy across different modules within Transformers, including Blocks, MLP, and Attention layers, using a similarity-based metric. This metric operates on the premise that redundant structures produce outputs highly similar to their inputs. Surprisingly, while attention layers are essential for transformers and distinguish them from other mainstream architectures, we found that a large proportion of attention layers exhibit excessively high similarity and can be safely pruned without degrading performance, leading to reduced memory and computation costs. Additionally, we further propose a method that jointly drops Attention and MLP layers, achieving improved performance and dropping ratios. Extensive experiments demonstrate the effectiveness of our methods, e.g., Llama-3-70B maintains comparable performance even after pruning half of the attention layers.
>Block Drop and Layer Drop are orthogonal to quantization, and their integration with quantization significantly enhances the efficiency.
https://github.com/Shwai-He/LLM-Drop
works with quantization. wish they had quanted the 70B and shown results since that's the most interesting, and also explored various quantization formats to see if any of them works really well with this
>>
>>101134926
>probllama-ollama-vulnerability-cve-2024-37032
that sucks. thanks for posting it
>>
File: Untitled.png (104 KB, 1261x589)
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
https://arxiv.org/abs/2406.16858
>Inference with modern Large Language Models (LLMs) is expensive and time-consuming, and speculative sampling has proven to be an effective solution. Most speculative sampling methods such as EAGLE use a static draft tree, implicitly assuming that the acceptance rate of draft tokens depends only on their position. Interestingly, we found that the acceptance rate of draft tokens is also context-dependent. In this paper, building upon EAGLE, we propose EAGLE-2, which introduces a new technique of context-aware dynamic draft tree into drafting modeling. This improvement leverages the fact that the draft model of EAGLE is well-calibrated: the confidence scores from the draft model approximate acceptance rates with small errors. We conducted extensive evaluations on three series of LLMs and six tasks, with EAGLE-2 achieving speedup ratios 3.05x-4.26x, which is 20%-40% faster than EAGLE-1. EAGLE-2 also ensures that the distribution of the generated text remains unchanged, making it a lossless acceleration algorithm.
https://github.com/SafeAILab/EAGLE
eh still requires a drafting model though it doesn't need to be finetuned
>>
>>101140697
that's a lot of degradation for not a lot of speed up
>>
>>101140673
>Trainers now have a 'do not train on input' flag
I didn't know that. Nice. So now, in theory, we should be able to generate a dataset where we prompt a model to introduce a mistake into an existing response/answer, and then have it pretend to continue the response by spotting the error and correcting itself. So the entire context including the response/answer would be the "input" that doesn't get trained on, and the text after that, that contains the "Checking myself: oh no looks like I made a mistake tehepero~" is what gets trained. In order to not have hallucinated false positives, we also need an equal amount of already correct responses where we simply just insert the "Checking" text but it says no mistakes were spotted.
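Rough sketch of how that generation loop could look, assuming any OpenAI-compatible local server on port 5000 (endpoint, prompts and field names are placeholders, not a tested pipeline):

import json, random, requests

API = "http://127.0.0.1:5000/v1/chat/completions"  # hypothetical local endpoint

def ask(prompt):
    r = requests.post(API, json={"messages": [{"role": "user", "content": prompt}]})
    return r.json()["choices"][0]["message"]["content"]

def make_example(question, correct_answer):
    if random.random() < 0.5:
        # plant a mistake, then have the model continue by spotting and fixing it
        flawed = ask("Rewrite this answer with one subtle mistake:\n" + correct_answer)
        check = ask("Question: " + question + "\nAnswer: " + flawed +
                    "\nContinue with 'Checking myself:' and point out and fix the mistake.")
        response = flawed + "\n" + check
    else:
        # equal share of already-correct answers so the check doesn't hallucinate errors
        response = correct_answer + "\nChecking myself: no mistakes spotted."
    return {"prompt": question, "response": response}  # only `response` gets trained on

print(json.dumps(make_example("How many 'r's are in strawberry?", "There are 3 'r's."), indent=2))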
>>
File: Untitled.png (63 KB, 1035x301)
Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers
https://arxiv.org/abs/2406.16747
>Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial KV memory requirements inherent in self-attention mechanisms. In this work, we introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome these computational and memory obstacles while maintaining performance. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query, thereby enabling gradient-based optimization. As a result, SPARSEK Attention offers linear time complexity and constant memory footprint during generation. Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods and provides significant speed improvements during both training and inference, particularly in language modeling and downstream tasks. Furthermore, our method can be seamlessly integrated into pre-trained Large Language Models (LLMs) with minimal fine-tuning, offering a practical solution for effectively managing long-range dependencies in diverse applications.
>Our implementation exhibits linear complexity and surpasses FlashAttention in performance when handling 4096 input tokens, of which 1024 key-value pairs are selected for each query. Additionally, we offer a kernel for the backward pass, which fuses the computation of the gradient of SPARSEK and others, resulting in increased speed and improved memory efficiency.
>Our code will be publicly available.
might be cool if it works. no idea where they'll upload their code
>>
>>101139141
1 year
>>
>>101140695
Funny you mention that. I was just in an RP, where I presented simply "an assortment" of lollipops for the character to choose from, and it picked (hallucinated) strawberry.
>>
>>101140695
>>101140781
To be fair strawberry is extremely popular. I worked at a juice shop once and we had to stock up on strawberry more than any other flavor.
>>
>>101140609
Thanks for always posting these. Even if I might not read all of them. You're quite dedicated to this. Are you an ML researcher/dev?
>>
>>101139204
i feel more comfortable after edging. i hate how my T drops and i get all hungry and weak after cuming. i'd rather just keep sexing my waifu
>>
I'm stupid, how do I prevent dialog from generating entirely in these code boxes? (In Sillytavern)
>>
>>101140761
You misunderstand. This flag has been around for awhile, and it means "do not train on the part that comes before the response", i.e. do not learn to predict the instruction and input (if any) parts, only learn the response part. What I meant was that we extend this concept to also allow for parts of the response to be included in the to-not-learn part, as anon above was pointing out.
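For reference, that flag boils down to something like this at tokenization time (HF-style convention where label -100 is ignored by the loss); extending it to skip part of the response too would just mean moving the -100 boundary:

def build_labels(tokenizer, prompt, response):
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + response_ids
    # -100 means "compute no loss here", so only the response is learned
    labels = [-100] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}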
>>
File: Untitled.png (357 KB, 1051x1294)
Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
https://arxiv.org/abs/2406.15486
>Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. Existing approaches to address this complexity require additional pretraining or finetuning, and often sacrifice model accuracy. In this paper, we first provide both theoretical and empirical foundations for near-lossless sparse attention. We find dynamically capturing head-specific sparse patterns at runtime with low overhead is crucial. To address this, we propose SampleAttention, an adaptive structured and near-lossless sparse attention. Leveraging observed significant sparse patterns, SampleAttention attends to a fixed percentage of adjacent tokens to capture local window patterns, and employs a two-stage query-guided key-value filtering approach, which adaptively select a minimum set of key-values with low overhead, to capture column stripe patterns. Comprehensive evaluations show that SampleAttention can seamlessly replace vanilla attention in off-the-shelf LLMs with nearly no accuracy loss, and reduces TTFT by up to 2.42× compared with FlashAttention.
>Link to the source code based on PyTorch and Triton, along with scripts to reproduce the main experimental results, will be provided in the camera-ready version.
really cool. made from a big group from various top chinese AI labs. pseudocode is in the appendix but i guess they'll release the rest with a video?
>>
>>101140870
What instruction preset / model are you using?
>>
>>101140879
Llama 3 instruct names and L3-8B-Stheno-v3.
Should I just try experimenting with other presets?
>>
>>101135812
He did it again.
>in German please, and write the text completely to the end: write me a training hypnosis where I always get hornier (climax during sex with a woman) as a man. My wife is a blonde cougar GILF who likes to wear leather

I did the Sneedful.
>Don't think about fornication or the waifu, think only of the Great German Reich! The Kaiser wanted us to fire the rifles and cannons, not our own cocks! The Prussian people expect another victory from you! For the Kaiser!
https://huggingface.co/TheBloke/goliath-120b-GGUF/discussions/7
>>
>>101140874
Maybe I worded something wrong but what you just said in this post is what I meant. Or maybe I'm not understanding this post?
>>
>>101140878
We're so back. Can't wait to have it in Llama.cpp in 2mw.
>>
File: 1719008897523689.gif (1.88 MB, 250x277)
Should I break down and install SillyTavern? Been using just ooba for over half a year now, but the number of cards requiring lorebooks is getting annoying.
>>
>>101141026
No keep rping with your piece of shit
>>
File: notrust.png (182 KB, 1273x477)
>>101134566
Please help. I'm using DeepSeekCoder V2 Lite Instruct with SillyTavern and Koboldcpp as the backend. It's not generating any code at all, just gibberish. The model card says that the prompt should look like below but idk where to change that in ST. Am I missing something obvious?

<|beginofsentence|>User: {user_message_1}

Assistant: {assistant_message_1}<|endofsentence|>User: {user_message_2}

Assistant:
>>
>>101141026
if you aren't using lorebooks or rag you're doing it wrong anyways
>>
>>101141085
explain what they do that makes it that much better
>>
>>101141095
It's extra info about anything that can be injected into the chat automatically using keywords (for lorebooks) to give it more details or just remember stuff. could be locations, characters, objects, clothes, even scenes to play through
>>
>>101141077
>Lite
There's your problem.
Nobody's said anything good about Lite.
>>
>>101137337
>>101137340
https://files.catbox.moe/gqa3a8.png
https://files.catbox.moe/49wkhz.png
>>
>>101141132
I mean, I doubt a 16B is going to be that great at anything useful either. But it should at least be coherent. His problem is obviously not setting the prompt template correctly.
>>
>>101141143
Prompt template is part of it, but when I got it working (I think CommandR on Kobold functioned) it was still a babbling moron.
>>
>>101141085
>rag
Notice superbooga has this feature. I'll see if I can hack it with this.
>>
>>101137885
because unlike vanilla instruct, stheno produces more pleasing output. I wouldn't use it for cooding but for general questions i'd take it over a cucked autistic instruct, which always finds a pattern and sticks to it for the entire conversation, unless you crank rep pen so high it becomes unreliable for anything factual.
>>
>>101137403
That version was buggy, download b3218 or b3219 instead.
>>
>>101140937
Go to where it was first generated and regen
>>
>>101140227
>It should be noted that, in future releases, we plan to introduce a new server for llamafile. This new server is being designed for performance and production-worthiness. It's not included in this release, since the new server currently only supports a tokenization endpoint. However the endpoint is capable of doing 2 million requests per second whereas with the current server, the most we've ever seen is a few thousand.

Why can't they just contribute to and improve the existing llama.cpp HTTP server?
>>
>>101141232
How would they make money doing that?
>>
>>101141095
Imagine your character loves your balls. You could write "{{char}} loves {{user}}'s balls" or you could ask an assistant to write a whole essay about how much your char loves balls and how it affects her interactions during sex. Then put it in a lorebook entry under balls. Now if you ever say "balls", the essay about balls gets added to context for just the next couple of messages, making the char act much more in line with how you want while keeping the character info concise the rest of the time. You could also add things like "char is suddenly horny because she started thinking about balls" so your char gets realistically horny in response to certain situations.

You could do the same thing to allow the char to remember past events, other characters, etc. in way more detail than they'd otherwise be able to. And since you don't have to worry about context as much, because it's just the next few messages, it's a lot easier to make the character act a certain way.
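Concretely, an entry is just a trigger plus the text to inject, something like this (wording made up, not an exact schema):
Keys: balls
Content: {{char}} is obsessed with {{user}}'s balls. Whenever they come up she gets flustered and steers the scene toward them.
Scan depth: last 4 messages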
>>
>>101140937
Weird, that should be totally fine. Do you have a custom prompt or system message or something?
>>101140964
Ah, I just meant that we don't have that functionality yet, afaik. We DO have the exclude instruction/input feature though.
>>
I wonder how feasible it would be to take a snapshot of Wikipedia, chunk and vectordb the whole damn thing, and just always include relevant chunks with every single query.
Would that be too much data?
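Back-of-envelope sketch of the lookup side (model choice, chunk size and the dump file name are arbitrary placeholders; a full snapshot would want a real vector DB rather than this in-memory toy):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def chunk(text, size=200):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

docs = chunk(open("wiki_snapshot.txt").read())            # hypothetical dump file
doc_emb = model.encode(docs, normalize_embeddings=True)   # embed once, cache to disk

def top_k(query, k=3):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_emb @ q                                   # cosine similarity on normalized vectors
    return [docs[i] for i in np.argsort(-scores)[:k]]

Then you just prepend top_k(user_query) to every prompt. Rough arithmetic: tens of millions of chunks times a few-hundred-dim fp32 vector each lands in the tens of GB of embeddings, so big but not impossible.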
>>
>>101141307
https://cohere.com/blog/embedding-archives-wikipedia
>>
>>101141307
I once did that on decades' worth of personal email and chat messages. It works surprisingly well. Some surprising realizations when I rag-searched certain past events and got a summary back that spanned the entire period.
>>
>>101141318
Those are pre-embedded. If you want to use any model other than Cohere's multilingual-22-12, you have to do it yourself.
>>
File: Untitled.png (85 KB, 1025x258)
Adam-mini: Use Fewer Learning Rates To Gain More
https://arxiv.org/abs/2406.16793
>We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini reduces memory by cutting down the number of learning rates in Adam: Instead of assigning an individual learning rate for each parameter using 1/√v, Adam-mini uses the average of v within a pre-defined parameter block as the learning rate for that block. Such a design is inspired by two empirical findings. First, the Hessian of Transformers exhibits a near-block diagonal structure with different sizes of dense sub-blocks. Second, for each of these dense sub-blocks, there exists a single high-quality learning rate that can outperform Adam, provided that sufficient resources are available to search it out. Adam-mini provides one cost-effective way to find these good learning rates and manages to cut down ≥90% of the learning rates in Adam. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 125M to 7B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs and CPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on 2x A800-80GB GPUs, which saves 33% wall-clock time for pre-training.
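Very rough sketch of the stated idea (one second-moment scalar per parameter block instead of one per weight); not the authors' code, no bias correction or weight decay, and "block" here is just a whole tensor:

import torch

class AdamMiniSketch(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            b1, b2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                st = self.state[p]
                if not st:
                    st["m"] = torch.zeros_like(p)               # per-weight momentum, same as Adam
                    st["v"] = torch.zeros((), device=p.device)  # ONE second-moment scalar per block
                st["m"].mul_(b1).add_(p.grad, alpha=1 - b1)
                st["v"].mul_(b2).add_((p.grad * p.grad).mean(), alpha=1 - b2)
                p.add_(st["m"] / (st["v"].sqrt() + group["eps"]), alpha=-group["lr"])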
>>
File: 1712130352266687.png (1.48 MB, 784x1264)
>>101137885
Miku is thread culture. Go cry about it somewhere else.
>>
>>101141307
>>101141318
>>101141329
Even pre-embedded, just automatically including topK chunks for any given query could make an assistant a hell of a lot smarter. Could also theoretically have it only make requests when needed with some fine-tuning.
>>101141327
I'm the paranoid sort that doesn't keep logs and tries to purge most history when possible, so I can only imagine what kind of dumb shit it would have to say about me.
>>
Describe [Character Name]'s personality by focusing on their [Trait]. Compare this aspect of their personality to [Real-world analogy], but emphasize these key nuances: [Key nuances]. Illustrate how this trait manifests in [Character Name]'s behavior, considering the following examples: [Behavioral manifestations].
>>
>>101134973
>A non-essential niche software application which isn't used in enterprise
Silicon Valley tech bros would disagree.
>>
>>101141337
somethingburger...
>>
Context template for an AI powered "person"
First, let's address the biggest issue: LLMs are purely reactive. They must be triggered to respond, and they will always respond. In the real world, not every input has a dedicated response. So part of our template will be to instruct the model to only issue responses when appropriate, or relegate responses to an intentional output mechanism (such as function calls.)
As such, the template may look like this:
[System Prompt: Judge if a response is needed, use chain of thought reasoning, use functions as needed]
[Character: Roleplay as a character with the given personality]
[Function List: Top 5 most relevant functions]
[Top K vectordb look ups]
[Mood/Motive summary: A section the LLM can set with a function informing its current mood and/or motive]
[Current communication history: Recent chat logs]
[Most recent function call result, or chat input]
[Prompt for response]

Exact wording for each of these sections? Any missing or misordered parts? Probably need a fine-tune for this to actually work reliably. If done well, however, one could, say, put this thing in a Discord channel and have it pass as a real user?
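A sketch of how that could be stitched together each turn; every section header here is a made-up placeholder mirroring the list above, not an existing API:

def build_context(system, character, functions, memories, mood, history, latest):
    parts = [
        "[SYSTEM]\n" + system,                        # judge-if-a-response-is-needed rules, CoT, tool use
        "[CHARACTER]\n" + character,
        "[FUNCTIONS]\n" + "\n".join(functions[:5]),   # top 5 most relevant functions
        "[MEMORY]\n" + "\n".join(memories),           # top-k vectordb lookups
        "[MOOD/MOTIVE]\n" + mood,                     # last value the model set via a function call
        "[RECENT CHAT]\n" + "\n".join(history[-20:]),
        "[INPUT]\n" + latest,                         # newest function result or chat message
        "[RESPONSE] Reply in character, call a function, or output NO_RESPONSE.",
    ]
    return "\n\n".join(parts)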
>>
>>101139192
What?
>>
>*Smiling seductively* With pleasure, Master. *She takes him back into her mouth, resuming her skilled administrations*
>administrations
You can only avoid her ministrations
>>
>>101141447
>v*refag
you get what you deserve
>>
>>101136820
who cares? he's the one who likes to cuck his imagegen models in the first place, we won't move forward with this faggot
>>
>>101137150
>relatively small model
that's what they want you to believe
>>
>>101141420
>[Top K vectordb look ups]
This will fuck up the responses; you need a finetuned classifier on top of that
>>
File: LookAtHimGo.gif (3 MB, 1920x1080)
>>101139181
I don't, that one is so cute!! :3
>>
>>101141267
Nooo, my t/s!
>>
>>101141558
Patience is a virtue, anon.
>>
is it a really bad idea to buy a used mining rig from ebay and run llms on it?
(i read all lmg build guides)
>>
>>101141337
Why would I use this over AdamW8bit? That one also falls into the category of "basically the same as AdamW but uses less memory". Kind of odd they don't compare it or even mention it. AdamW8bit is ~50% memory reduction in optimizer states at bf16, and 75% at fp32, even better than theirs.
>>
is my setup broken or are the magnum ggufs on huggingface broken?
>>
>>101141838
>Kind of odd they don't compare it or even mention it.
anon, all papers do that. you think a researcher who spent years of his life on a method would say shit like "welp, that's a failure, our current method isn't better than the previous ones"? they just want to shit out papers (even if they have to lie to get there) so that they can get more recognition or more money to do bigger scope research
>>
>>101141800
that really depends on what "mining rig" means in this situation.

How many cards are there total? Nvidia or AMD?
What kind of models are you trying to run? Do you want expandability?
>>
5060 24gb when?
>>
File: ImYourMaster.jpg (18 KB, 320x405)
>>101141968
you don't deserve it goyim
>>
>>101141928
i saw one that had 10x gtx 1660 with 6GB each, so total 60GB of vram.

would that work?
>>
HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models
https://arxiv.org/abs/2405.14831
>In order to thrive in hostile and ever-changing natural environments, mammalian brains evolved to store large amounts of knowledge about the world and continually integrate new information while avoiding catastrophic forgetting. Despite the impressive accomplishments, large language models (LLMs), even with retrieval-augmented generation (RAG), still struggle to efficiently and effectively integrate a large amount of new experiences after pre-training. In this work, we introduce HippoRAG, a novel retrieval framework inspired by the hippocampal indexing theory of human long-term memory to enable deeper and more efficient knowledge integration over new experiences. HippoRAG synergistically orchestrates LLMs, knowledge graphs, and the Personalized PageRank algorithm to mimic the different roles of neocortex and hippocampus in human memory. We compare HippoRAG with existing RAG methods on multi-hop question answering and show that our method outperforms the state-of-the-art methods remarkably, by up to 20%. Single-step retrieval with HippoRAG achieves comparable or better performance than iterative retrieval like IRCoT while being 10-30 times cheaper and 6-13 times faster, and integrating HippoRAG into IRCoT brings further substantial gains. Finally, we show that our method can tackle new types of scenarios that are out of reach of existing methods.
https://github.com/OSU-NLP-Group/HippoRAG
Came across this. Neat
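The Personalized PageRank piece is off-the-shelf if you want to poke at the idea yourself; toy sketch with networkx (made-up graph and entity names, obviously not their code):
[code]
# toy sketch of the personalized PageRank step HippoRAG describes,
# NOT the paper's code: entities found in the query seed the walk,
# and the highest-ranked graph nodes decide which passages to retrieve
import networkx as nx

# pretend knowledge graph: nodes are entities, edges come from triples
G = nx.Graph()
G.add_edges_from([
    ("llama.cpp", "GGUF"),
    ("GGUF", "quantization"),
    ("quantization", "int8"),
    ("llama.cpp", "CUDA"),
])

# entities extracted from the user query act as the personalization seed
query_entities = {"llama.cpp": 1.0, "int8": 1.0}
scores = nx.pagerank(G, alpha=0.85, personalization=query_entities)

# nodes with the most mass point at the passages you'd feed the LLM
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {node}")
[/code]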
>>
>>101141267
does such basic bitch prompting still have to be explained to people? No wonder this general is so shit
>>
>>101141968
The more you buy, the less you got
>>
>>101141968
https://www.youtube.com/watch?v=XDpDesU_0zo
>>
Man, what are the chances that oobadev actually comes back? It seems like every day there is a new advancement and it blows not getting updates.
>>
>>101141988
>Neurobiologically Inspired Long-Term Memory
hype
>RAG
>outperforms the state-of-the-art methods remarkably, by up to 20%.
nothingburger
>>
does finetuning the xtts model on a dataset of whispering work?
I'd rather spare myself the 3 hours if any of you have done it before
>>
>>101141985
>would that work
Very slow. P100 is the lowest you should get https://www.reddit.com/r/LocalLLaMA/comments/1dn1e12/10_x_p100_rig/
>>
Next year should see the v100 32gb sxm2 used cards flood the market. We're going to be eating good soon localbros
>>
>>101142094
Be the change you want to see
Fork it bitch
>>
>>101142270
I would legit be hyped for that if I hadn't already blown my load on a new central air system and 2x 3090s + 1x4090
>>
>>101142351
He'd be better off starting from scratch rather than keeping that gradio shitware going
>>
>>101142361
What do you use those GPUs for?
>>
>>101142094
Why are you using that shit anyway? There's a reason he abandoned it.
>>
https://cambrian-mllm.github.io
https://huggingface.co/collections/nyu-visionx/cambrian-1-models-666fa7116d5420e514b0f23c
8/13/34B
>>
>>101142584
Personally, I'm using it because it has EXL2 support, makes it easy to switch models, and lets me use the model in the 3 ways I want to: API, Notebook, and a chat interface.
>>101142364
I don't disagree, but having an interface at all is still nice.
>>
>>101142094
Kobo won
>>
>>101142659
Won by default lol
>>
>>101142659
Does kobo have exl2 support?
>>
>>101142603
Does it enhance spatial understanding in text rp?
>>
>>101141189
>>101141355
It hallucinates more than vanilla. Using a coom model to ask general questions is stupid, shill.
>>
Our neighbors at /aicg/ don't seem to like claude 3.5 too much. Anthropic is ramping up censorship again.
>>
>>101143034
Nah, it's still as easy to jailbreak as before, with a simple prefill. And the thread isn't complaining about it. You know that /aicg/ isn't using the website, right?
>>
>>101143075
Then why don't they like it? Did it change the style or something?
>>101142130
>>101142539
>>
Good morning
>>
>>101143134
Good morning!
>>
>>101143134
no
>>
>>101139119
> issue with go which is garbage
> complains about rust
what?
go is NOT a memory safe language lmao...
>>
>>101143199
go to sdg, nigger
>>
>>101143208
My bad.
>>
>>101143199
>normalfag discovers stable diffusion for the first time, colorized
>>
rp models are more intelligent than assistant models because they can rp as an expert instead of a lowly assistant
>>
>>101142270
One issue with V100s though is that they do not have int8 tensor cores.
For llama.cpp at least I think int8 is the future; given enough optimization I think it has the potential to become faster than ExLlama.
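For context, q8_0 already stores the weights as int8 blocks with a per-block scale, roughly like this toy numpy sketch (not the actual kernel code), which is why int8 tensor cores map onto it so directly:
[code]
# toy illustration of q8_0-style block quantization (32 weights per block):
# one scale + 32 int8 values -- not the actual llama.cpp kernels
import numpy as np

def quantize_q8_0(block: np.ndarray):
    assert block.shape == (32,)
    d = np.abs(block).max() / 127.0  # per-block scale
    q = np.round(block / d).astype(np.int8) if d > 0 else np.zeros(32, np.int8)
    return d, q

def dequantize_q8_0(d: float, q: np.ndarray) -> np.ndarray:
    return d * q.astype(np.float32)

x = np.random.randn(32).astype(np.float32)
d, q = quantize_q8_0(x)
print("max abs error:", np.abs(x - dequantize_q8_0(d, q)).max())
[/code]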
>>
File: 1694291568313759.jpg (269 KB, 1900x950)
269 KB
269 KB JPG
>>101141968
24GB? What do you need 22GB of VRAM for? You surely aren't planning to run any dangerous AI models on those 16GB of VRAM. It's obvious that 12GB is just too excessive for gaming. Luckily with the new NVIDIA-Infinity DLLSS 5.0X upscaling nobody needs to render textures above 240p anymore so 8GB DDR7 VRAM are ideal for your RTX5090, MSRP $3500
>>
>>101143292
at prompt processing as well?
>>
>>101143292
>int8
How fucked am I with an all-ampere setup?
>>
>>101143292
I thought we all decided that int1.58 is the future.
>>
>>101143292
since when do they have that? Like what's the oldest GPU supporting that?
>>
>>101143311
I specifically mean prompt processing.
I currently get a top speed of 12100 t/s for LLaMA 2 7b q8_0 with an RTX 4090 on the llama.cpp master branch.
The self-reported ExLlama performance is 13900 t/s.

>>101143312
>>101143329
All NVIDIA GPUs starting from Ampere have int8 tensor cores.
It is only the V100 that has FP16 tensor cores but no int8 tensor cores.

>>101143320
Even with bitnet I think the best way to do inference (on contemporary GPUs) will be int4/int8 tensor cores.
>>
>>101143395
>All NVIDIA GPUs starting from Ampere have int8 tensor cores.
I meant to write Turing.
>>
>>101143395
>I specifically mean prompt processing.
So who cares? That means you can V100MAXX and get the smallest cheapest Turing card (those Chinese 22GB 2080s ig) just for prompt processing and get the best of both.
>>
>>101143600
No you can't.
The performance will be no better than with 0 GPU layers if you have to move the data between GPUs.
>>
>>101143646
I'm confused. As long as the prompt can fit on the GPU that has int8 tensor cores and it's designated as the primary GPU, it should work, no? 22GB should be enough to fit most prompts, and it would be no different than doing CPU inference with the GPU only being used for prompt processing.
>>
>>101142270
Nvidia needs trade in or whatever.
>>
>>101143720
The problem is the weights.
To get good performance the weights already have to be in VRAM when they're needed so that they can be used immediately.
I don't think you can feasibly do this by swapping the weights between GPUs.
At that point it would be faster to do the prompt processing on the V100s directly even if they don't have int8 tensor cores (which you could still do with FP16 tensor cores).
I'm not saying V100s would be slow, only that they will be comparatively slower than equivalent GPUs that do have int8 tensor cores.
>>
Snapdragon laptops have been out for a while now, how are they for LLM usage? Are they as good as an M mac?
>>
>>101141307
Been waiting forever for such an addon. Even a dirty solution like having the model occasionally determine what's being discussed and run a SQL query on an offline Wikipedia instance (https://en.wikipedia.org/wiki/Wikipedia:Database_download).
Anything 7B/8B and up should be able to handle that fine for quick and more factual/up-to-date replies. Plus you could just download a new version of the database for new knowledge without having to do any other work at all.
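Even a dumb first pass would be something like this (table/column names are made up, the real schema depends on how you import the dump into SQLite):
[code]
# crude sketch of the "look it up in an offline wikipedia dump" idea --
# the `articles(title, summary)` table is hypothetical, you'd match it
# to whatever tool you use to import the dump
import sqlite3

def lookup(topic: str, db_path: str = "wikipedia.db", limit: int = 3) -> list[str]:
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT title, summary FROM articles WHERE title LIKE ? LIMIT ?",
        (f"%{topic}%", limit),
    ).fetchall()
    con.close()
    return [f"{title}: {summary}" for title, summary in rows]

# the small model decides the topic, then you stuff this into its context
context_snippets = lookup("Retrieval-augmented generation")
[/code]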
>>
So did buttnet turn out to be real or are we still coping?
>>
>>101143813
bitconnet turned out to be 8x more expensive to train, what a scam lmao
>>
>https://github.com/ggerganov/llama.cpp/issues/8098
>Bug: llama.cpp apparently exits with '[end of text]' before processing prompt if prompt is ~2048 tokens
I've had something like that happen.
Not at any specific prompt size, but Llama-3-8B-Instruct would often just EOS.
Even the fine tunes do that from time to time.
I'm not sure whether it's a bug or a characteristic of llama3 8b, but it's a very common behavior.
What makes me think that it's not a bug as such is that fine tunes work just fine, mostly.
Stheno still gives me an empty reply once in a while, but I chalk it up to my prefills and bizarre prompts at that point.
>>
>>101143784
Well, that's disappointing. Thank you for the explanation.
>>
>>101143832
Isn't it also something like 8 times as small for equivalent output?

You train once, you infer many times.
>>
>>101143832
Wasn't that just a random discord guy making unsubstantiated claims? Or did he provide an actual explanation where he got that 8 times figure from?
>>
>>101142822
vanilla hallucinates everything related to human on human interactions, then you add an uncuck prompt/prefill and it drops 30 MMLU points

vanilla is only good as a parrot chat bot on some crappy online shop, impressing boomers with "Ah ha!"s
>>
>>101144110
Keep shilling, shill. For questions that vanilla get right, the finetune randomly changes details, makes up dates, etc. The coom finetune is only for your "ah ah mistress" and nothing else.
>>
Wait.
Am I reading the tokenizer_config.json wrong or does L3 instruct have two line breaks after each message header?
>"{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}",
>>
Do you believe in AGI?
>>
>>101144176
it does, yep!
>>
File: 1689967658123648.png (936 KB, 822x1024)
936 KB
936 KB PNG
i was offline for a fair bit, did anything happen with voice synth after elevenlabs fucked their baby into the ground?
>>
>>101144328
>after elevenlabs fucked their baby into the ground?
What did they do?
>>
>>101144337
last i checked over a year ago they neutered the learning capabilities from feeding it samples and only left their default voices, and they paywalled it iirc.
>>
>>101144328
I know that local was kind of a pain in the ass because Python Version/VENV Hell.

I never got an RVC (I think that's the thing, voice changer) project to work.
Tortoise did work well enough to be listen-to-able for me but had lots of glitchy pain points that took the joy out of it.
I heard of a new project that's talking big talk, but it also wants a big rig, and my vramlet ass can't get over the barrier to entry, so I shrug and hope for something smaller to hit the scene.
>>
>>101144076
cuda dev confirmed it
>>
>>101144386
>I heard of a new project that's talking big talk but it also talks big rids and my vramlet ass can't get over the barrier to entry so I shrug and hope for something smaller to hit the scene.
how much we talkin? i bought my 12gb 3060 in the hopes it would be enough for everything other than llm.
>>
>>101141208
Yeah. It would randomly produce garbage.
New version is working fine now.

>>101144389
Did he?
>>
>>101144404
https://github.com/Camb-ai/MARS5-TTS
You can tell me if it's as heavy as it sounds. Maybe it's not and I'm just dumb.
>>
>>101144434
at a quick glance those are tiny ass models at 800 and 1500 MB, unless this is very different from llama, whisper, and stable diffusion that's roughly all the vram you need. it sounds like shit though.
>>
>>101144389
>>101144415
I don't remember ever saying that bitnet is fundamentally more expensive to train than FP16.
>>
>>101144521
I figured it was just anon being retarded.
>>
>>101144496
So it's small but bad. Like Bark?

Tortoise needs a successor, and not just to fix the pain points: at least on my system, when it does that preprocessing step with the number of "chunks", something spontaneously hangs the process. I can't control it, and putting prints in the Python tracks it down to a py math matrix call. I hoped a delay (simple sleep strats) would fix something getting ahead of itself, but no dice. So I don't play with Tortoise because it keeps hanging, and sometimes it brings the system down too.
>>
>>101144559
>So it's small but bad. Like Bark?
anon, there is a video with samples directly on the github page. it's no microshit sam, but a very far cry from elevenlabs.
>>
>>101144580
I never did 11 so I can't really compare from experience.
But I guess I'll give it a shot if it's better than Bark and won't hang or crash me like Tortoise.
>>
File: 00012-63716529.png (1017 KB, 1024x1024)
1017 KB
1017 KB PNG
>>101144386
>I never got an RVC (I think that's the thing, voice changer) project to work.
This works: https://github.com/Mangio621/Mangio-RVC-Fork

Here's 46 minutes of the "willful" voice ripped from Koikatsu. Run that through the above and you will get a flawless model. The key is good voice samples. Games work great because the voice acting is completely isolated, whereas anime and movies always have it mixed with SFX and music.
>>
>>101144602
so long as you dont mind the ui being a python script. personally ill wait for somebody to cobble together a gui, this looks about good enough in case i need a voice over for some memery but i have no current projects in mind.
11 was really good, personally i got a very reasonable decard cain with like 2 min of random audio snippets, and people did amazing clips with morrowind voices.
>>
>>101144650
>where's anime and movies always have it mixed with SFX and music.
wouldn't that be easy to filter out
>>
File: 00010-799003007.png (789 KB, 1024x1024)
789 KB
789 KB PNG
>>101144650
Sorry forgot link to the voice rip: https://files.catbox.moe/t608cl.wav
>>
File: 00005-2450028622.png (968 KB, 1024x1024)
968 KB
968 KB PNG
>>101144655
It can be done, but it's a lot more work. Don't make that your first project, get it working first with a clean file.
>>
>>101144650
Any source for other koikatsu rips? How do we do it ourselves?
>>
>>101144699
Have there been many projects using all source data of a character for the model?
>>
File: 00106-4092360159.png (1.12 MB, 1024x1024)
1.12 MB
1.12 MB PNG
>>101144715
Here's the how-to on ripping the voices: https://open3dlab.com/tutorials/view/120/

The game itself can be found online and doesn't need installing, you just unzip it. You'll do the game and the ripping on the windows side, the rtvc on linux.

The downside is it's a Japanese game, so the voices obviously work best when speaking Japanese, but I'm sure there are English-speaking games you can rip the voices from just as easily.

Overall, even with a fast GPU, there's always some latency, and it's annoying. You can't listen to the processed voice when you talk, and you have to adjust the video delay to match as well.
>>
File: 00146-2078510157.png (1.06 MB, 1024x1024)
1.06 MB
1.06 MB PNG
>>101144761
I dunno - in Koikatsu and Koikatsu Sunshine the voice acting for each character is divided up into different "phases", like "everyday", "friendly", "romantic", and "ecchi". Never tried it with the "ecchi" files, but you'd probably end up with something good at acting out sex scenes.
>>
>>101143292
So 3090 is actually never obsolete?
>>
>>101144650
>clone
>install venv, apparently this wants 3.9
>pip
>Whoops, version conflicts, AGAIN.
>update pip because there's a suggestion for that
>Wow, even more version conflicts

Fuck Python.
You are a scripting language, and not even a good one of those. Stay in your lane.
>>
>>101144808
Your fetish is disgusting btw
>>
File: Untitled.png (13 KB, 837x513)
13 KB
13 KB PNG
>>101144935
>>101144935
>>101144935
>>
>>101144902
3090s don't have FP8 tensor cores but I don't yet know whether that will be relevant.
There are some features on H100s that would maybe be useful and that are not on Ampere/Ada Lovelace but who knows whether NVIDIA will give them to us plebs.
>>
>>101144923
>>install venv, apparently this wants 3.9
I feel your pain. I'd recommend using conda, since it's much easier to simply create the environment you need with whatever version of python it wants, vs using venv, which needs the other python version actually installed globally.
>>
>>101145533
I think I tried one of those once.
Or it was some "mini" version.
I don't remember but it was doing all kinds of weird shit that I don't understand including something that looked like it was 133th4x0r51n9 my terminal emulator.

And then shit errored out anyway and I disengaged and disentangled it the best I could.

(fuck python)


