llama-server --hf-repo bartowski/Phi-3.5-mini-instruct-GGUF --hf-file Phi-3.5-mini-instruct-Q5_K_M.gguf
> ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 51539607584
I'm trying to run this on CPU with only ~8 GB of RAM available. Why does it try to allocate 51 GB by default?
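My guess is that the failed buffer is the KV cache, sized for the model's full training context rather than anything I asked for. A quick sanity check of that hypothesis, assuming the shape I've seen in the Phi-3.5-mini config (32 layers, hidden size 3072, no grouped-query attention) and an f16 cache:

```python
# Hypothesis: the ~51 GB buffer is an f16 KV cache allocated for
# Phi-3.5-mini's full 131072-token training context.
# Model-shape assumptions (not from the log): 32 layers, n_embd 3072,
# as many KV heads as attention heads, 2 bytes per f16 element.
n_ctx = 131072       # n_ctx_train for Phi-3.5-mini (128K context)
n_layer = 32
n_embd = 3072
bytes_per_elem = 2   # f16

# Two tensors (K and V) per layer, each n_ctx x n_embd.
kv_cache_bytes = 2 * n_layer * n_ctx * n_embd * bytes_per_elem
print(kv_cache_bytes)  # 51539607552 -- within 32 bytes of the 51539607584 in the error
```

If that's right, capping the context with `-c` (llama.cpp's `--ctx-size` flag), e.g. `-c 4096`, should shrink the allocation to something that fits in 8 GB, but I'd like to confirm whether defaulting to the trained context is the intended behavior.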