/g/ - Technology

File: CleanupCrew.png (1.94 MB, 1024x1528)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101431253 & >>101421477

►News
>(07/16) Codestral Mamba, tested up to 256k context: https://hf.co/mistralai/mamba-codestral-7B-v0.1
>(07/16) MathΣtral Instruct based on Mistral 7B: https://hf.co/mistralai/mathstral-7B-v0.1
>(07/13) Llama 3 405B coming July 23rd: https://x.com/steph_palazzolo/status/1811791968600576271
>(07/09) Anole, based on Chameleon, for interleaved image-text generation: https://hf.co/GAIR/Anole-7b-v0.1
>(07/07) Support for glm3 and glm4 merged into llama.cpp: https://github.com/ggerganov/llama.cpp/pull/8031

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
File: 1671214888959931.gif (437 KB, 500x483)
►Recent Highlights from the Previous Thread: >>101431253

--Paper: Mixture of A Million Experts: >>101431545 >>101431583 >>101431611 >>101431637 >>101432015 >>101432203 >>101432320 >>101431838 >>101431631 >>101431757 >>101432525
--Fine-tuning a Language Model to Generate Personalized Cover Letters: Seeking Recommendations and Exploring Alternatives: >>101436669 >>101436713 >>101436900
--AMD Promotes CPUmaxxing with EPYC Genoa: Outlined GPU Performance in LLM Tasks: >>101433552
--Uranium has 82 protons, but typos and sampler settings can confuse LLMs: >>101434145 >>101434215 >>101434236 >>101434265 >>101434332 >>101434423 >>101434467 >>101434789
--Question about PSU lines and GPU power requirements: >>101438498 >>101438516 >>101438541 >>101438675 >>101438873 >>101438790 >>101438988
--SCALE Programming Language and its support for llama.cpp: >>101432707 >>101432797
--Llama 3 Finetune Tops BFCL Leaderboard, But Are Function-Calling Models a Meme?: >>101434816 >>101434850 >>101434865 >>101434884
--Lack of Development Discussion and Frustration with LLM Dominance: >>101434645 >>101434734 >>101434859 >>101434939 >>101435320 >>101435452 >>101435571
--Miku (free space): >>101431341

►Recent Highlight Posts from the Previous Thread: >>101431260
>>
Worship the Miku
>>
File: hatsune_miku_at_cs2.jpg (129 KB, 1024x576)
>>101439126
>Uranium has 82 protons
Shrinkflation has finally reached the nuclear energy industry.
>>
>>101439243
Yes.
>>
Is there any chance to get decent results with an RTX 2070?
I tried some Llama 3 model yesterday and at most I could get one-liner replies. Escaping Claude's grasp isn't that easy, it seems. Didn't tinker with the settings in oobabooga yet, however.
>>
>>101439327
no
>>
What's the most affordable GPU for achieving 80+ T/s on Gemma-2 9b 5bpw in a headless server? Are there any AMD cards capable of doing it?
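Back-of-the-envelope for the memory bandwidth that would need (rough numbers I worked out myself; ignores compute, KV cache and overhead, and the two GB/s figures are just familiar cards for reference):
[code]
# tokens/s is roughly bounded by memory bandwidth / bytes read per token
params = 9e9                          # Gemma-2 9B
bpw = 5                               # 5 bits per weight
bytes_per_token = params * bpw / 8    # ~5.6 GB touched per generated token

for name, bw_gbs in (("RTX 3070 (448 GB/s)", 448), ("RTX 3090 (936 GB/s)", 936)):
    print(f"{name}: ~{bw_gbs * 1e9 / bytes_per_token:.0f} t/s upper bound")
[/code]
So 80 t/s wants roughly 450 GB/s of bandwidth or more before any other losses.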
>>
>>101439327
(Laughs in 2060 6GB)
>>
>>101439327
say that you want long and detailed replies in the system prompt
>>
>>101439355
Damn. What are you using?
>>
>>101439368
gemma-2-9b-it.Q4_K and mixtral-8x7b-v0.1.Q4_K_M
Like >>101439356 said the system prompt helps.
>>
>>101439396 (me)
>>
>>101439327
you have 8GB of VRAM, so you can fit an 8B model easily. Look for the smaller Gemma and Llama 3 versions and find a GGUF quant that you can run.
>>
>>101431757
Weren't MoEs cheaper to train?
>>101431838
You could use a lot of RAM or even terabytes of storage on SSDs and inference would still be fast. CPUmaxx is the way to go.

Also, that recent scaling paper claimed MoEs need only about 1.3x more parameters to contain the same amount of information as dense models.
>>
>>101439429
Damn, not bad. Guess I need to figure out this shit some more. Thanks.
>>
>>101439438
You think it would be possible to cram an 11b model in somehow?
>>
>>101439448
This is the system prompt I use (among others).
https://huggingface.co/datasets/ChuckMcSneed/various_RP_system_prompts/blob/main/unknown-simple-proxy-for-tavern.txt
>>
File: 1514830956221.png (1.15 MB, 1001x1200)
i cannot fucking handle this youtube slop dude
im running the docker version of MemGPT (fuckin wish i could use it with SillyTavern instead) and all the tutorials for making it run with LLMs are for the CMD version instead of docker, so im fuck outta luck
any advice? i tried putting Kobold's API key in place of the OpenAI key spot in the .env (formerly .env.example), which finally got memgpt.localhost to open, but then when I send a message it just thinks and then dies
I'll try and get a screenshot for ya, one sec
>>
File: 1657345609038.png (587 KB, 719x713)
>>101439027
Then recommend me a new model, faggit.
>>
File: ITSHAPPENING.webm (588 KB, 1024x1024)
>Particularly, the synergy of BitNet b1.58 and Q-Sparse (can be equipped with MoE) provides the cornerstone and a clear path to revolutionize the efficiency, including cost and energy consumption, of future LLMs.
>>
>>101439456
Yes, an 11B gguf at Q4 will fit. A Q5 will depend on the context. Check this for reference:
https://huggingface.co/mradermacher/Fimbulvetr-11B-v2.1-16K-i1-GGUF
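Ballpark fit check if you want to sanity-check a quant before downloading. These are my own rough numbers, not measurements: the file sizes are what 11B quants typically weigh, and the 48 layers / KV dim 1024 are assumptions for Solar-style models, so leave headroom for CUDA buffers or offload a few layers:
[code]
GB = 1e9
weights = {"Q4_K_M": 6.5 * GB, "Q5_K_M": 7.6 * GB}   # typical 11B file sizes

def kv_cache_bytes(ctx, layers=48, kv_dim=1024, bytes_per=2):
    # K and V vectors per layer per token, fp16
    return 2 * layers * kv_dim * ctx * bytes_per

for quant, w in weights.items():
    for ctx in (4096, 8192):
        print(f"{quant} @ {ctx} ctx: ~{(w + kv_cache_bytes(ctx)) / GB:.1f} GB")
[/code]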
>>
File: asry.png (446 KB, 1280x720)
>>101439471
top is what it's stuck doing, then boom, it stops thinking and i get nothing
the CMD window is just what it looks like immediately after connecting to memgpt.localhost
>>
i was just catching up on the last couple threads and noticed this appeared 2 threads back, didn't get any replies, and didn't get caught in the recap that i can see, so gonna repost it
>>101426978
Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
https://arxiv.org/abs/2407.10969
>We introduce, Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs). Q-Sparse enables full sparsity of activations in LLMs which can bring significant efficiency gains in inference. This is achieved by applying top-K sparsification to the activations and the straight-through-estimator to the training. The key results from this work are, (1) Q-Sparse can achieve results comparable to those of baseline LLMs while being much more efficient at inference time; (2) We present an inference-optimal scaling law for sparsely-activated LLMs; (3) Q-Sparse is effective in different settings, including training-from-scratch, continue-training of off-the-shelf LLMs, and finetuning; (4) Q-Sparse works for both full-precision and 1-bit LLMs (e.g., BitNet b1.58). Particularly, the synergy of BitNet b1.58 and Q-Sparse (can be equipped with MoE) provides the cornerstone and a clear path to revolutionize the efficiency, including cost and energy consumption, of future LLMs.
from the bitnet team. seems it didn't get posted here yet
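Not their code, just a minimal sketch of the core trick (top-K activation sparsification with a straight-through estimator) to show how simple the idea is; the k_ratio and shapes here are made-up toy values:
[code]
import torch

def topk_sparsify_ste(x: torch.Tensor, k_ratio: float = 0.4) -> torch.Tensor:
    # Forward: keep only the k largest-magnitude activations per row.
    # Backward: pretend the mask wasn't there (straight-through estimator),
    # so gradients stay dense and training doesn't stall.
    k = max(1, int(x.shape[-1] * k_ratio))
    _, idx = torch.topk(x.abs(), k, dim=-1)
    mask = torch.zeros_like(x).scatter_(-1, idx, 1.0)
    sparse = x * mask
    return x + (sparse - x).detach()

h = torch.randn(2, 8, requires_grad=True)
y = topk_sparsify_ste(h, k_ratio=0.25)
y.sum().backward()
print(y)       # 75% of the entries are zero
print(h.grad)  # but the gradient is dense thanks to the STE
[/code]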

>>
>up to 15 characters made on local setup
>all of them are just various fluff around my poorly hidden breeding kink
i need new material, there's only so many ways i can spin scenarios before i have to get into weird shit
>>
>>101439990
According to this paper, BitNet models have an optimal sparsity of about 60%, so the improvement is significant but not groundbreaking.

MoE models can also be considered "sparse", and if you can have a model with a million experts, like that other Google paper claims, with the optimal number of active experts being in the hundreds, then inference can be **orders of magnitude** faster than with current models or any future BitNet model.
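Rough numbers behind that comparison, purely illustrative (the model size, expert count and routing count below are made up just to show the scale):
[code]
dense_params = 70e9                    # pretend 70B dense model

# ~60% activation sparsity => ~40% of the weights touched per token
q_sparse_active = dense_params * (1 - 0.60)
print(f"Q-Sparse-style: ~{q_sparse_active / 1e9:.0f}B active (~2.5x fewer)")

# million-expert MoE with a few hundred experts routed per token
n_experts, active = 1_000_000, 256
moe_active = dense_params * active / n_experts
print(f"million-expert MoE: ~{moe_active / 1e6:.0f}M active per token")
[/code]
That second number is where the "orders of magnitude" comes from, assuming the routing overhead doesn't eat the gains.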
>>
>>101440013
All roads lead to Rome.
"New material" is either just different preludes to the same kink moment, or you need to find in you a different kink to be drawn toward.

The end result—squirt squirt—is the same no matter how you get there. Unless you change that intention, it's all just variations on a theme.

So why not ask your LLM to come up with some new material for you? Literally, at the end of a role play ask it for new ideas. If your context is large enough for it to see what you've done, it might come up with some neat new stuff.
>>
File: never ever.png (549 KB, 1510x856)
>>101439990
>>
>>101440064
If Meta wants to safety themselves into irrelevance, that's their prerogative. Waste of their H100 farm, but the Qwen team has already shown interest in BitNet.
>>
File: file.png (19 KB, 967x162)
>>101439990
Interestingly, they say in here that off-the-shelf LLMs can be "continue-trained" to make use of Q-Sparse.
>>
>>101440043
i don't even really see it as a "kink", and i hate that word anyway
i've always been a boring vanillafag
i might try your suggestion though
>>
File: file.png (123 KB, 962x703)
>>101440134
this was for mistral-7b
>>
>>101440110
They mean risking instability of training runs and wasting millions on experiments that will only be good for proving that Mamba/SSMs fail to scale. Not safetyism risk.
>>
>>101440136
>i don't even really see it as a "kink"
It was your word choice, so maybe you do.

>i've always been a boring vanillafag
I could describe myself the same way, but I know where my kinks lie, and LLMs have been an interesting way to see exactly which details of their topics "work" and which don't.

But definitely let the LLM try things. Some of the most interesting RPs I've had were by playing through a part that was a "nah" and then it went someplace I would never have thought of. And then it's hot till the context fills causing sudden derp and collapse and sadness.
>>
>>101440187
Semantics. Meta has had over a year and failed to innovate at all.
>>
https://www.semianalysis.com/p/gb200-hardware-architecture-and-component
neat
>>
>>101440110
>If Meta wants to safety themselves into irrelevance, that's their prerogative. Waste of their H100 farm, but the Qwen team has already shown interest in BitNet.
I think the first non-meme BitNet model will be from the Mistral team; they're the only ones making non-transformer models (they made a MoE model and a Mamba model).
>>
File: SubvertedDemocracy.jpg (31 KB, 640x708)
Sup /lmg/. I'm looking for an open source project that allows me to have a local server with:

1- An OpenAI compatible API

2- Allows for multi-user or multi-request use. Ideally it won't run them in parallel, but will queue them (i.e. at any time it will be running inference for only one request, but won't crash or reject requests while running inference)

3- allows for multiple models but doesn't load more than one at any single time.

4- Ideally, it would seamlessly "hotswap" models as requested. (If a new request needs a different model, it will automatically unload the current model and load the new one.)

Llama-cpp-python allows for all of the above except #2. I want to have a single LLM server and use it for multiple client apps.
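Roughly what I mean, if I had to hack it together myself. Just a sketch: the model registry, paths, and endpoint shape are placeholders, llama-cpp-python and FastAPI are the only real dependencies:
[code]
import asyncio
from fastapi import FastAPI
from llama_cpp import Llama

MODELS = {  # placeholder paths
    "gemma-2-9b": "/models/gemma-2-9b-it.Q4_K_M.gguf",
    "llama-3-8b": "/models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
}

app = FastAPI()
lock = asyncio.Lock()                   # 2) serialize: one inference at a time
current = {"name": None, "llm": None}

def load(name: str) -> Llama:
    # 3)+4) only one model resident; hot-swap when a request wants another one
    if current["name"] != name:
        current["llm"] = None           # drop the old model (relies on GC to free VRAM)
        current["llm"] = Llama(MODELS[name], n_gpu_layers=-1, n_ctx=8192)
        current["name"] = name
    return current["llm"]

@app.post("/v1/chat/completions")       # 1) OAI-shaped endpoint
async def chat(body: dict):
    async with lock:                    # queued, never parallel, never rejected
        llm = await asyncio.to_thread(load, body["model"])
        return await asyncio.to_thread(
            llm.create_chat_completion, messages=body["messages"]
        )
[/code]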

Pic unrelated.
>>
>>101440043
How do you ask for that stuff?
>>
>>101440043
at some point I think anons will need to do their own preference optimization dataset based on their fetishes to then finetune models with. maybe make a google form or something to then click through that generates the dataset when you're done
>>
>>101440611
ollama can do all that. But if you want really good multi-user shit, you have to use vLLM.
>>
>>101440724
Pursuant to >>101440013, I tried,
>Someone on 4chan laments that he's made 15 character definitions for his LLM to role play as, but they all play into his breeding kink so narrowly that they're becoming repetitive and requiring him to get into "weird shit" to make them interesting. Please list seven kinds of characters that his LLM could role play as that offer something interesting to explore while still ultimately becoming a scenario where his character will begin producing offspring with his LLM's character. Consider all kinds and genres of fiction for ideas of what kinds of people or things these role play partner characters could be.

Removing the explanations to save space in one post, it offered:
1. Space Colonist
2. Ancient God/Goddess
3. Time Traveler
4. Shapeshifter
5. AI Entity
6. Nature Spirit
7. Alien Hybrid

Those sound like good ways to make both the run up to the kink and the consequences of the kink have something fresh to offer.

>>101440750
>their own preference optimization dataset based on their fetishes
That sounds like the road to boredom. If you move toward what you know you want, you'll get the same things again and again, as >>101440013 complained about. Guide the AI away from your no-gos and dealbreakers, and move toward things you don't have much of an opinion of, and you'll be able to find new things that you didn't know you would like. And it'll be trivial to work in a personal kink on the fly when it's wanted.
>>
>Added --unpack, a new self-extraction feature that allows KoboldCpp binary releases to be unpacked into an empty directory. This allows easy modification and access to the files and contents embedded inside the PyInstaller. Can also be used in the GUI launcher.
Holy heckin based
>>
What do you do after shooting your load in rp context?
Close chat and start a new one?
>>
>>101440611
>>101440898
Forgot to add...

* Partial GPU/CPU offloading
>>
you're going to be able to make videos that look real
>>
>>101441368
people be questionin the legitimacy of livestreamed, realtime footage
if that can be fake then there's no such thing as real
>>
>>101441108
If you need CPU+GPU hybrid inference, llama.cpp/ggml is your only choice in terms of backend.
The llama.cpp HTTP server has an OAI-compatible API and will queue requests by default, but model hot-swapping is not implemented.
Ooba I think lets you load/unload models via the API but I don't know if it's OAI-compatible.
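For the OAI-compatible part, this is all it takes to talk to it (minimal sketch; assumes llama-server is already running on the default port with a model loaded, and the prompt is obviously a placeholder):
[code]
import requests

r = requests.post(
    "http://localhost:8080/v1/chat/completions",   # OAI-style chat endpoint
    json={
        "messages": [{"role": "user", "content": "Say hi in five words."}],
        "max_tokens": 32,
    },
)
print(r.json()["choices"][0]["message"]["content"])
[/code]
Fire a few of these off concurrently and they get queued instead of rejected.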
>>
So are we just never going to get an HF version of mamba codestral that works outside of Mistral's shitty basic bitch backend?
>>
>>101441409
ollama queues by default and can do parallel requests, can easily swap models via the API, and is OAI-compatible.
>>
>>101441474
>mamba
>works
>>
>>101441409
>The llama.cpp HTTP server has an OAI-compatible API

Last I checked, only the chat API is OAI compatible, the regular text completion isn't.
>>
>>101441512
Mamba has been a supported model type since like transformers 4.39.
Why shouldn't it work?
>>
>>101441511
ollama just runs the llama.cpp server in the background, petra.
>>
>>101441601
okay
use it
>>
>>101441654
Well I can't, because I'm not going to install Mistral's or Mamba's shitty inference packages since neither of them seems to have an API, which renders them utterly fucking useless.
>>
>>101441679
well then it doesn't work does it
>>
There's been websim for a few months, and it seems quite popular now. I never cared much to try it; is it actually great? And is it, or something similar, open source? The official website wants me to log in with Google. Can you run it with a local model?
>>
so I installed the ollama CLI and ran llama3
it's pretty cool, but i hoped /save sessionName would save the chat so I can come back to it next time with llama remembering what was said, but apparently that is not the case
how do I save/load chat history so i can continue the conversation?
>>
>>101442061
Wrong site: reddit.com/r/LocalLLaMA/
>>
>>101441934
For once, doomer Anon, on this one exact topic, you are correct. It is unironically over and there's absolutely no hope.
>>
>>101442109
sorry, I thought this was local models general
>>
>>101442219
ollama is not beloved
>>
Anybody know if the whole slot system from llama-server works with Silly, or would I need to change the way Silly calls the API to specify a slot or something of the sort?
You can have different prompt caches per slot, right? That would be pretty cool when switching between cards or even when using things like the summary extension.
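For reference, this is roughly what I imagine pinning requests to slots looks like against the native /completion endpoint. I *think* the field is id_slot and that you want cache_prompt on, but check the server README, I might be misremembering the names:
[code]
import requests

def complete(prompt, slot):
    r = requests.post(
        "http://localhost:8080/completion",
        json={
            "prompt": prompt,
            "n_predict": 128,
            "id_slot": slot,        # pin this card to its own slot...
            "cache_prompt": True,   # ...so its prompt cache sticks around
        },
    )
    return r.json()["content"]

print(complete("Card A's system prompt...\nUser: hi\nA:", slot=0))
print(complete("Card B's system prompt...\nUser: hi\nB:", slot=1))
[/code]
No idea if Silly exposes that without patching, which is really the question.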
>>
>>101440202
>>101440187
I think it was too early to get their clusters fully online, and they're still in the process of getting and building more. They began training and setting in stone what Llama 3 was going to be earlier than some of the current hyped research could prove scalable. Llama 4 is probably going to be the one with a more unique architecture.

However, it is pretty normal that a big corporation lags behind startups when it comes to smaller-scale, faster-paced rollouts of technology. Their advantage is that when they do roll out a new product, it's done with more money. That doesn't always equal a better product. In the case of LLMs it means they can spend a lot of time training the model, like on 15T tokens, or they can train a 400B dense, or whatever. Startups may come out with a BitNet or Jamba or whatever sooner, but then it'll take a megacorp to produce a BitNet or Jamba with 15T tokens pumped into one, or a 400B, etc.
>>
>>101442061
Maybe ollama has updated and fixed this since I used it a month or two ago, but I found the /save feature to be total ass. The parser was busted and many character sequences would kill the parser, causing it to save only the first few turns of the conversation.

I could avoid them on my side (for example, NEVER end a turn with a " character; a space or spare period after it would be fine, but if the turn ended right on a quote mark from dialogue, that's the end of the save). Of course, the AI could write killing sequences too, so even if I'm careful it's a doomed chat.

I roll Kobold now. Much easier to adjust settings now that I know them, I don't have to play Ollama's silly JSON and renamed file game, and I can use (almost) all of the models.

Ollama is a great introduction but the instant you want to do more, step up to a better wrapper.
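That said, if you really want to stay in the terminal for now, you can sidestep /save entirely by keeping the history yourself and replaying it through ollama's /api/chat endpoint. Rough sketch (the file name and model are placeholders):
[code]
import json, pathlib, requests

HISTORY = pathlib.Path("chat_history.json")
messages = json.loads(HISTORY.read_text()) if HISTORY.exists() else []

while True:
    messages.append({"role": "user", "content": input("> ")})
    r = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": "llama3", "messages": messages, "stream": False},
    )
    reply = r.json()["message"]["content"]
    print(reply)
    messages.append({"role": "assistant", "content": reply})
    HISTORY.write_text(json.dumps(messages))   # history survives restarts
[/code]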
>>
>>101442269
so they allow training on the output? If so that would probably be great for Mistral and similar
>>
>>101439126
Isn't Mixture of a Million Experts basically Marvin Minsky's Society of Mind? Also with the fact that it could theoretically have lifelong learning...

Bros, we're actually going to get lifelike AI gfs in our lifetimes, aren't we?
>>
so can anyone actually even test the 405b? by maybe renting some gpus online or something?
>>
>>101442432
I believe someone from huggingface said they'd host it (though I'm quite confused that the hugging.chat models are always broken)
>>
>>101442286
thank you for an actual answer
kobold seems interesting but i wanted something working in the terminal for now
I guess I will try some ollama terminal clients, I see there are quite a few
can anyone recommend a specific one?
>>
>>101442432
If any providers bother with it, it will probably be on OpenRouter, but for $2 a gen, I imagine.
>>
>>101442443
I too enjoyed that Ollama was terminal-based at first, and it was something I wanted when getting started. But then I wanted to change settings without curling one-line JSON strings, didn't like the multi-line text bug, the inability to save, etc., so I switched to Kobold. Now it's trivial for me to reach it from my phone when I step away from my computer, I have access to important settings, state saving works correctly, and easy access to editing the document to fix errors or reroll a response is very nice to have.
>>
>>101439447
>Weren't MoEs cheaper to train?
Yeah, generally. It wasn't necessarily true for this "distributed" FFW method, though. Looking at the paper, it should be more FLOP-efficient. I'm actually warming up to the concept as I dig into it a bit more; there might be something really good here.

This paper alone isn't enough to sell the idea, but I think a few minor improvements and suddenly this becomes the new training paradigm.
>>
>>101442219
Yeah, this is not the ollama tech support general. Go to Reddit to shill your scam.
>>
state space 1 million expert bitnet 1T parameter model when?
>>
>>101442599
Don't forget with JEPA and native multimodal and multitoken prediction.
>>
>>101442599
BitNet probably wouldn't work well with experts on the order of thousands of parameters in size.
>>
>>101442599
we need to return to good old lstms
>>
mambeleonbytejepabitnetMoaME 600B when
>>
>>101442599
with fill in the middle support
>>
>in my system prompt I give the narrator a personality, since that decreases slop
>after plapping the character from the card, ask what the narrator thinks so far using OOC
>she's horny
>propose her, herself joining on the fun
>now I get to plap the original character as well as the narrator turned into a character
Let's fucking go.
>>
llama.cpp has support for some NPU now.

>>101442621
Wouldn't it be hilarious if the next step is to go back but with a couple of adjustments and at a bigger scale?
It's not even absurd to think that since that kind of thing happens all the time.
>>
>>101442697
I'll type the weights by hand
>>
https://github.com/ggerganov/llama.cpp/pull/8543
>Add support for Chameleon #8543
that one anon will be happy
>For now, this implementation only supports text->text inference and serves as base to implement the (more interesting) image->text, text->image and interleaved pipelines. However, such an implementation will probably require some changes to the CLI and internal architecture, so I suggest to do this in a separate PR.
oh...
>>
>>101442599
right after you hang yourself.
>>
>>101442729
Interesting. Hopefully now we can get some benchmarks of how those new laptops do with LLMs.
>>
>>101442729
isn't there any standard NPU api?
>>
>>101442754
they've still got poor memory bandwidth compared to low end nvidia gpus
>>
>>101442792
Doubt it.
It's probably the same situation as GPUs where each vendor has its own computing architecture with its own APIs.
>>
>>101442804
Well yeah. But low-power windoze/linux laptops that can run LLMs are still pretty good to have exist. It would also be good to know how the LPDDR5X in them performs, and whether it or the compute is the bottleneck in these machines. We could then extrapolate to think about how a desktop with a similar chip could perform paired with a separate GPU.
>>
Why don't they just add matrix multiplication to slide rules?
>>
>>101440064
Meta became irrelevant when they started to filter their pre-training dataset with LlamaGuard. Claude shows that a diverse dataset with no regard for safety produces great results. A model that could not learn anything 'unsafe' because the data it was fed was designed to be 99% safe will never be good.
>>
>>101443047
hi petra
>>
>>101439308
Disregarding nuclear safety regulations with Miku
>>
File: teto.jpg (158 KB, 1024x1024)
How do I constrain the model's output in tabbyAPI? I tried "json_schema": {"type": "string","enum": ["Yes","No","Maybe"]}, but I keep getting errors:
ERROR: ExLlamaV2Sampler.sample(
ERROR: File "/home/petra/tabbyAPI/venv/lib/python3.10/site-packages/exllamav2/generator/sampler.py", line 247, in sample
ERROR: assert pass_tokens, "Filter excluded all tokens"
ERROR: AssertionError: Filter excluded all tokens
ERROR: Sent to request: Completion aborted. Maybe the model was unloaded? Please check the server console.
>>
>>101443072
Are you retarded or just schizophrenic?
>>
>>101443047
This, this is why Command-R is good, and partially what makes WLM 8x22b wayyy better than 8x22b instruct. (Although I think WLM just has some other interesting continued pretraining and finetuning techniques up their sleeve, iirc.)
>>
File: 1695520564243917.png (20 KB, 1009x382)
>>101443222
When will this meme die?
>>
>>101443244
if you mean the huggingface leaderboard meme, whenever you stop posting it
>>
>>101443222
>what makes WLM 8x22b wayyy better
When are the mikufags going to drop this meme? How is this related to pretraining when it's a finetune done with GPT-4 outputs?
>>
>>101443222

They have a paper out for Wizard. You can replicate it if you have the compute and money.

They ran an offline arena and basically trained on the best outputs from it, over and over again. SFT, DPO, PPO, I believe? Reward-model type shit.
>>
>>101443349

Basically.

https://x.com/victorsungo/status/1811427047341776947?t=k7ZXwSCRnYKBW_7Rj0_q6w&s=19
>>
File: file.png (479 KB, 593x812)
Duality of /lmg/
>>
File: 1719351514748681.jpg (575 KB, 2048x2048)
>>101443244
>Aktually according to the HF leaderboard...
>t. Never run Mistral8x22 or Wiz8x22 for himself
>>101443503
>Yes with a 5% lead
We are so back bros
>>
>>101443158
good image
>>
>>101443047
buy an ad
>>
>>101439122
>>(07/13) Llama 3 405B coming July 23rd: https://x.com/steph_palazzolo/status/1811791968600576271

what am I gonna need to run this?
>>
Will we even need to make videos anymore if we can just generate them?
>>
>>101443763
There are artists still making art even with AI generated art. There are still writers despite AI being able to write. There will still be filmmakers despite AI being able to make film. Some people just like to create.
>>
>>101443724
At least a raspberry pi
>>
>>101443763
>t. the least illiterate AI bro
>>
I have been doing some testing between GGUF and EXL2.

I have always preferred exl2 as it was faster and had Q4 cache for KV.

But the speed difference seems to have vanished, and llama.cpp supports Q4 KV cache too now.

All averaged over 3 runs, using SillyTavern as the front end, with FA and Q4 cache in both exllama (mandatory in tabbyAPI) and llama.cpp:

TabbyAPI Backend (ExLlamaV2 0.1.7):
WizardLM2 8x22B Exl2 @ 4.0bpw 24.1t/s

llama.cpp Backend (pulled 2 hours ago):
WizardLM2 8x22B GGUF @ Q4_K_M 25.2t/s

Textgenwebui Backend Exl2:
WizardLM2 8x22B Exl2 @ 4.0bpw 22.1t/s

Textgenwebui Backend GGUF:
WizardLM2 8x22B Exl2 @ 4.0bpw 23.2t/s


Given the broader support for llama.cpp, and thus better compatibility with devices and faster compatibility with new models, is there any reason to still use exl2?

System:
4x3090's at 250w max
EPYC 7402
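In case anyone wants to reproduce without SillyTavern, roughly how you could sanity-check the t/s over the OAI endpoints both backends expose (a sketch with default ports assumed, tabbyAPI on 5000 and llama-server on 8080; tabbyAPI will also want its API key header, check its config):
[code]
# llama-server launched with the cache settings mentioned above, e.g.:
#   llama-server -m wizardlm2-8x22b.Q4_K_M.gguf -ngl 99 -fa -ctk q4_0 -ctv q4_0
import time, requests

def tokens_per_second(base_url, n=512):
    t0 = time.time()
    r = requests.post(
        base_url + "/v1/chat/completions",
        json={"messages": [{"role": "user", "content": "Write a long story."}],
              "max_tokens": n, "temperature": 0},
    )
    return r.json()["usage"]["completion_tokens"] / (time.time() - t0)

print("llama.cpp:", tokens_per_second("http://localhost:8080"))
print("tabbyAPI :", tokens_per_second("http://localhost:5000"))
[/code]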
>>
>>101444001
>>Textgenwebui Backend GGUF:
>>WizardLM2 8x22B Exl2 @ 4.0bpw 23.2t/s

I meant WizardLM2 8x22B GGUF @ Q4_K_M 23.2t/s
>>
>>101444001
Can you also provide results for prompt processing?
>>
>>101444001
>Is there any reason to still use exl2?
Unironically no.
>>
>>101444001
>those numbers
big if true
>>
>>101444001
exl2sisters... what went wrong?
>>
>>101444001
The llama.cpp server is too rough around the edges and tabbyAPI is more polished. I would switch back to exllama if Gemma 2 worked as well there as it does with llama.cpp.


