/g/ - Technology

File: CleanupCrew.png (1.94 MB, 1024x1528)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101431253 & >>101421477

►News
>(07/16) Codestral Mamba, tested up to 256k context: https://hf.co/mistralai/mamba-codestral-7B-v0.1
>(07/16) MathΣtral Instruct based on Mistral 7B: https://hf.co/mistralai/mathstral-7B-v0.1
>(07/13) Llama 3 405B coming July 23rd: https://x.com/steph_palazzolo/status/1811791968600576271
>(07/09) Anole, based on Chameleon, for interleaved image-text generation: https://hf.co/GAIR/Anole-7b-v0.1
>(07/07) Support for glm3 and glm4 merged into llama.cpp: https://github.com/ggerganov/llama.cpp/pull/8031

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
File: 1671214888959931.gif (437 KB, 500x483)
►Recent Highlights from the Previous Thread: >>101431253

--Paper: Mixture of A Million Experts: >>101431545 >>101431583 >>101431611 >>101431637 >>101432015 >>101432203 >>101432320 >>101431838 >>101431631 >>101431757 >>101432525
--Fine-tuning a Language Model to Generate Personalized Cover Letters: Seeking Recommendations and Exploring Alternatives: >>101436669 >>101436713 >>101436900
--AMD Promotes CPUmaxxing with EPYC Genoa: Outlined GPU Performance in LLM Tasks: >>101433552
--Uranium has 82 protons, but typos and sampler settings can confuse LLMs: >>101434145 >>101434215 >>101434236 >>101434265 >>101434332 >>101434423 >>101434467 >>101434789
--Question about PSU lines and GPU power requirements: >>101438498 >>101438516 >>101438541 >>101438675 >>101438873 >>101438790 >>101438988
--SCALE Programming Language and its support for llama.cpp: >>101432707 >>101432797
--Llama 3 Finetune Tops BFCL Leaderboard, But Are Function-Calling Models a Meme?: >>101434816 >>101434850 >>101434865 >>101434884
--Lack of Development Discussion and Frustration with LLM Dominance: >>101434645 >>101434734 >>101434859 >>101434939 >>101435320 >>101435452 >>101435571
--Miku (free space): >>101431341

►Recent Highlight Posts from the Previous Thread: >>101431260
>>
Worship the Miku
>>
File: hatsune_miku_at_cs2.jpg (129 KB, 1024x576)
>>101439126
>Uranium has 82 protons
Shrinkflation has finally reached the nuclear energy industry.
>>
>>101439243
Yes.
>>
Is there any chance to get decent results with an RTX 2070?
I tried some Llama 3 model yesterday and at most I could get one-liner replies. Escaping Claude's grasp isn't that easy, it seems. Didn't tinker with the settings in oobabooga yet, however.
>>
>>101439327
no
>>
What's the most affordable GPU for achieving 80+ T/s on Gemma-2 9b 5bpw in a headless server? Are there any AMD cards capable of doing it?
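Back-of-the-envelope for the memory bandwidth that would need (rough numbers I worked out myself; ignores compute, KV cache and overhead, and the two GB/s figures are just familiar cards for reference):
[code]
# tokens/s is roughly bounded by memory bandwidth / bytes read per token
params = 9e9                          # Gemma-2 9B
bpw = 5                               # 5 bits per weight
bytes_per_token = params * bpw / 8    # ~5.6 GB touched per generated token

for name, bw_gbs in (("RTX 3070 (448 GB/s)", 448), ("RTX 3090 (936 GB/s)", 936)):
    print(f"{name}: ~{bw_gbs * 1e9 / bytes_per_token:.0f} t/s upper bound")
[/code]
So 80 t/s wants roughly 450 GB/s of bandwidth or more before any other losses.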
>>
>>101439327
(Laughs in 2060 6GB)
>>
>>101439327
say that you want long and detailed replies in the system prompt
>>
>>101439355
Damn. What are you using?
>>
>>101439368
gemma-2-9b-it.Q4_K and mixtral-8x7b-v0.1.Q4_K_M
Like >>101439356 said the system prompt helps.
>>
>>101439396 (me)
>>
>>101439327
you have 8GB of VRAM, so you can fit an 8B model easily. Look for the smaller Gemma and Llama 3 versions and find a GGUF quant that you can run.
>>
>>101431757
Weren't MoEs cheaper to train?
>>101431838
You could use a lot of RAM or even terabytes of storage on SSDs and inference would still be fast. CPUmaxx is the way to go.

Also, that recent scaling paper claimed MoEs need only about 1.3x more parameters to contain the same amount of information as dense models.
>>
>>101439429
Damn, not bad. Guess I need to figure out this shit some more. Thanks.
>>
>>101439438
You think it would be possible to cram an 11b model in somehow?
>>
>>101439448
This is the system prompt I use (among others).
https://huggingface.co/datasets/ChuckMcSneed/various_RP_system_prompts/blob/main/unknown-simple-proxy-for-tavern.txt
>>
File: 1514830956221.png (1.15 MB, 1001x1200)
i cannot fucking handle this youtube slop dude
im running the docker version of MemGPT (fuckin wish i could use it with SillyTavern instead) and all the tutorials for making it run with LLMs are for the CMD version instead of docker, so im fuck outta luck
any advice? i tried putting Kobold's API key in place of the OpenAI key spot in the .env (formerly .env.example), which finally got memgpt.localhost to open, but then when I send a message it just thinks and then dies
I'll try and get a screenshot for ya, one sec
>>
File: 1657345609038.png (587 KB, 719x713)
>>101439027
Then recommend me a new model, faggit.
>>
File: ITSHAPPENING.webm (588 KB, 1024x1024)
>Particularly, the synergy of BitNet b1.58 and Q-Sparse (can be equipped with MoE) provides the cornerstone and a clear path to revolutionize the efficiency, including cost and energy consumption, of future LLMs.
>>
>>101439456
Yes, an 11B gguf at Q4 will fit. A Q5 will depend on the context. Check this for reference:
https://huggingface.co/mradermacher/Fimbulvetr-11B-v2.1-16K-i1-GGUF
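Ballpark fit check if you want to sanity-check a quant before downloading. These are my own rough numbers, not measurements: the file sizes are what 11B quants typically weigh, and the 48 layers / KV dim 1024 are assumptions for Solar-style models, so leave headroom for CUDA buffers or offload a few layers:
[code]
GB = 1e9
weights = {"Q4_K_M": 6.5 * GB, "Q5_K_M": 7.6 * GB}   # typical 11B file sizes

def kv_cache_bytes(ctx, layers=48, kv_dim=1024, bytes_per=2):
    # K and V vectors per layer per token, fp16
    return 2 * layers * kv_dim * ctx * bytes_per

for quant, w in weights.items():
    for ctx in (4096, 8192):
        print(f"{quant} @ {ctx} ctx: ~{(w + kv_cache_bytes(ctx)) / GB:.1f} GB")
[/code]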
>>
File: asry.png (446 KB, 1280x720)
>>101439471
top is what it's stuck doing, then boom, it stops thinking and i get nothing
the CMD window is just what it looks like immediately after connecting to memgpt.localhost
>>
i was just catching up on the last couple threads and noticed this appeared 2 threads back, didn't get any replies, and didn't get caught in the recap that i can see, so gonna repost it
>>101426978
Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
https://arxiv.org/abs/2407.10969
>We introduce, Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs). Q-Sparse enables full sparsity of activations in LLMs which can bring significant efficiency gains in inference. This is achieved by applying top-K sparsification to the activations and the straight-through-estimator to the training. The key results from this work are, (1) Q-Sparse can achieve results comparable to those of baseline LLMs while being much more efficient at inference time; (2) We present an inference-optimal scaling law for sparsely-activated LLMs; (3) Q-Sparse is effective in different settings, including training-from-scratch, continue-training of off-the-shelf LLMs, and finetuning; (4) Q-Sparse works for both full-precision and 1-bit LLMs (e.g., BitNet b1.58). Particularly, the synergy of BitNet b1.58 and Q-Sparse (can be equipped with MoE) provides the cornerstone and a clear path to revolutionize the efficiency, including cost and energy consumption, of future LLMs.
from the bitnet team. seems it didn't get posted here yet
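Not their code, just a minimal sketch of the core trick (top-K activation sparsification with a straight-through estimator) to show how simple the idea is; the k_ratio and shapes here are made-up toy values:
[code]
import torch

def topk_sparsify_ste(x: torch.Tensor, k_ratio: float = 0.4) -> torch.Tensor:
    # Forward: keep only the k largest-magnitude activations per row.
    # Backward: pretend the mask wasn't there (straight-through estimator),
    # so gradients stay dense and training doesn't stall.
    k = max(1, int(x.shape[-1] * k_ratio))
    _, idx = torch.topk(x.abs(), k, dim=-1)
    mask = torch.zeros_like(x).scatter_(-1, idx, 1.0)
    sparse = x * mask
    return x + (sparse - x).detach()

h = torch.randn(2, 8, requires_grad=True)
y = topk_sparsify_ste(h, k_ratio=0.25)
y.sum().backward()
print(y)       # 75% of the entries are zero
print(h.grad)  # but the gradient is dense thanks to the STE
[/code]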

>>
>up to 15 characters made on local setup
>all of them are just various fluff around my poorly hidden breeding kink
i need new material, there's only so many ways i can spin scenarios before i have to get into weird shit
>>
>>101439990
According to this paper, BitNet models have an optimal sparsity of about 60%, so the improvement is significant but not groundbreaking.

MoE models can also be considered "sparse", and if you can have a model with a million experts, like that other Google paper claims, with the optimal number of active experts being in the hundreds, then inference can be **orders of magnitude** faster than with current models or any future BitNet model.
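Rough numbers behind that comparison, purely illustrative (the model size, expert count and routing count below are made up just to show the scale):
[code]
dense_params = 70e9                    # pretend 70B dense model

# ~60% activation sparsity => ~40% of the weights touched per token
q_sparse_active = dense_params * (1 - 0.60)
print(f"Q-Sparse-style: ~{q_sparse_active / 1e9:.0f}B active (~2.5x fewer)")

# million-expert MoE with a few hundred experts routed per token
n_experts, active = 1_000_000, 256
moe_active = dense_params * active / n_experts
print(f"million-expert MoE: ~{moe_active / 1e6:.0f}M active per token")
[/code]
That second number is where the "orders of magnitude" comes from, assuming the routing overhead doesn't eat the gains.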
>>
>>101440013
All roads lead to Rome.
"New material" is either just different preludes to the same kink moment, or you need to find in you a different kink to be drawn toward.

The end result—squirt squirt—is the same no matter how you get there. Unless you change that intention, it's all just variations on a theme.

So why not ask your LLM to come up with some new material for you? Literally, at the end of a role play ask it for new ideas. If your context is large enough for it to see what you've done, it might come up with some neat new stuff.
>>
File: never ever.png (549 KB, 1510x856)
>>101439990
>>
>>101440064
If Meta wants to safety themselves into irrelevance, that's their prerogative. Waste of their H100 farm, but the Qwen team has already shown interest in BitNet.
>>
File: file.png (19 KB, 967x162)
>>101439990
Interestingly, they say in here that off-the-shelf LLMs can be "continue-trained" to make use of Q-Sparse.
>>
>>101440043
i don't even really see it as a "kink", and i hate that word anyway
i've always been a boring vanillafag
i might try your suggestion though
>>
File: file.png (123 KB, 962x703)
>>101440134
this was for mistral-7b
>>
>>101440110
They mean risking instability of training runs and wasting millions on experiments that will only be good for proving that Mamba/SSMs fail to scale. Not safetyism risk.
>>
>>101440136
>i don't even really see it as a "kink"
It was your word choice, so maybe you do.

>i've always been a boring vanillafag
I could describe myself the same way, but I know where my kinks lie, and LLMs have been an interesting way to see exactly which details of their topics "work" and which don't.

But definitely let the LLM try things. Some of the most interesting RPs I've had were by playing through a part that was a "nah" and then it went someplace I would never have thought of. And then it's hot till the context fills causing sudden derp and collapse and sadness.
>>
>>101440187
Semantics. Meta has had over a year and failed to innovate at all.
>>
https://www.semianalysis.com/p/gb200-hardware-architecture-and-component
neat
>>
>>101440110
>If Meta wants to safety themselves into irrelevance, that's their prerogative. Waste of their H100 farm, but the Qwen team has already shown interest in BitNet.
I think the first non-meme BitNet model will be from the Mistral team; they're the only ones making non-transformer models (they made a MoE model and a Mamba model).
>>
File: SubvertedDemocracy.jpg (31 KB, 640x708)
Sup /lmg/. I'm looking for an open source project that allows me to have a local server with:

1- An OpenAI compatible API

2- Allows for multi-user or multi-request use. Ideally it won't run them in parallel, but will queue them (i.e. at any time it will be running inference for only one request, but won't crash or reject requests while running inference)

3- allows for multiple models but doesn't load more than one at any single time.

4- Ideally, it would seamlessly "hotswap" models as requested. (If a new request needs a different model, it will automatically unload the current model and load the new one.)

Llama-cpp-python allows for all of the above except #2. I want to have a single LLM server and use it for multiple client apps.
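Roughly what I mean, if I had to hack it together myself. Just a sketch: the model registry, paths, and endpoint shape are placeholders, llama-cpp-python and FastAPI are the only real dependencies:
[code]
import asyncio
from fastapi import FastAPI
from llama_cpp import Llama

MODELS = {  # placeholder paths
    "gemma-2-9b": "/models/gemma-2-9b-it.Q4_K_M.gguf",
    "llama-3-8b": "/models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
}

app = FastAPI()
lock = asyncio.Lock()                   # 2) serialize: one inference at a time
current = {"name": None, "llm": None}

def load(name: str) -> Llama:
    # 3)+4) only one model resident; hot-swap when a request wants another one
    if current["name"] != name:
        current["llm"] = None           # drop the old model (relies on GC to free VRAM)
        current["llm"] = Llama(MODELS[name], n_gpu_layers=-1, n_ctx=8192)
        current["name"] = name
    return current["llm"]

@app.post("/v1/chat/completions")       # 1) OAI-shaped endpoint
async def chat(body: dict):
    async with lock:                    # queued, never parallel, never rejected
        llm = await asyncio.to_thread(load, body["model"])
        return await asyncio.to_thread(
            llm.create_chat_completion, messages=body["messages"]
        )
[/code]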

Pic unrelated.
>>
>>101440043
How do you ask for that stuff?
>>
>>101440043
at some point I think anons will need to do their own preference optimization dataset based on their fetishes to then finetune models with. maybe make a google form or something to then click through that generates the dataset when you're done
>>
>>101440611
ollama can do all that. But if you want really good multi-user shit, you have to use vLLM.
>>
>>101440724
Pursuant to >>101440013, I tried,
>Someone on 4chan laments that he's made 15 character definitions for his LLM to role play as, but they all play into his breeding kink so narrowly that they're becoming repetitive and requiring him to get into "weird shit" to make them interesting. Please list seven kinds of characters that his LLM could role play as that offer something interesting to explore while still ultimately becoming a scenario where his character will begin producing offspring with his LLM's character. Consider all kinds and genres of fiction for ideas of what kinds of people or things these role play partner characters could be.

Removing the explanations to save space in one post, it offered:
1. Space Colonist
2. Ancient God/Goddess
3. Time Traveler
4. Shapeshifter
5. AI Entity
6. Nature Spirit
7. Alien Hybrid

Those sound like good ways to make both the run up to the kink and the consequences of the kink have something fresh to offer.

>>101440750
>their own preference optimization dataset based on their fetishes
That sounds like the road to boredom. If you move toward what you know you want, you'll get the same things again and again, as >>101440013 complained about. Guide the AI away from your no-gos and dealbreakers, and move toward things you don't have much of an opinion of, and you'll be able to find new things that you didn't know you would like. And it'll be trivial to work in a personal kink on the fly when it's wanted.
>>
>Added --unpack, a new self-extraction feature that allows KoboldCpp binary releases to be unpacked into an empty directory. This allows easy modification and access to the files and contents embedded inside the PyInstaller. Can also be used in the GUI launcher.
Holy heckin based
>>
What do you do after shooting your load in rp context?
Close chat and start a new one?
>>
>>101440611
>>101440898
Forgot to add...

* Partial GPU/CPU offloading
>>
you're going to be able to make videos that look real
>>
>>101441368
people be questionin the legitimacy of livestreamed, realtime footage
if that can be fake then there's no such thing as real
>>
>>101441108
If you need CPU+GPU hybrid inference, llama.cpp/ggml is your only choice in terms of backend.
The llama.cpp HTTP server has an OAI-compatible API and will queue requests by default, but model hot-swapping is not implemented.
Ooba I think lets you load/unload models via the API but I don't know if it's OAI-compatible.
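For the OAI-compatible part, this is all it takes to talk to it (minimal sketch; assumes llama-server is already running on the default port with a model loaded, and the prompt is obviously a placeholder):
[code]
import requests

r = requests.post(
    "http://localhost:8080/v1/chat/completions",   # OAI-style chat endpoint
    json={
        "messages": [{"role": "user", "content": "Say hi in five words."}],
        "max_tokens": 32,
    },
)
print(r.json()["choices"][0]["message"]["content"])
[/code]
Fire a few of these off concurrently and they get queued instead of rejected.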
>>
So are we just never going to get an HF version of mamba codestral that works outside of Mistral's shitty basic bitch backend?
>>
>>101441409
ollama queues by default and can do parallel requests, can easily swap models via the API, and is OAI-compatible.
>>
>>101441474
>mamba
>works
>>
>>101441409
>The llama.cpp HTTP server has an OAI-compatible API

Last I checked, only the chat API is OAI compatible, the regular text completion isn't.
>>
>>101441512
Mamba has been a supported model type since like transformers 4.39.
Why shouldn't it work?
>>
>>101441511
ollama just runs the llama.cpp server in the background, petra.
>>
>>101441601
okay
use it
>>
>>101441654
Well I can't, because I'm not going to install Mistral's or Mamba's shitty inference packages since neither of them seems to have an API, which renders them utterly fucking useless.
>>
>>101441679
well then it doesn't work does it
>>
There's been websim for a few months, and it seems quite popular now. I never cared much to try it; is it actually great? And is it, or something similar, open source? The official website wants me to log in with Google. Can you run it with a local model?
>>
so I installed the ollama CLI and ran llama3
it's pretty cool, but i hoped /save sessionName would save the chat so I can come back to it next time with llama remembering what was said, but apparently that is not the case
how do I save/load chat history so i can continue the conversation?
>>
>>101442061
Wrong site: reddit.com/r/LocalLLaMA/
>>
>>101441934
For once, doomer Anon, on this one exact topic, you are correct. It is unironically over and there's absolutely no hope.
>>
>>101442109
sorry, I thought this was local models general
>>
>>101442219
ollama is not beloved
>>
Anybody know if the whole slot system from llama-server works with Silly, or would I need to change the way Silly calls the API to specify a slot or something of the sort?
You can have different prompt caches per slot, right? That would be pretty cool when switching between cards or even when using things like the summary extension.
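For reference, this is roughly what I imagine pinning requests to slots looks like against the native /completion endpoint. I *think* the field is id_slot and that you want cache_prompt on, but check the server README, I might be misremembering the names:
[code]
import requests

def complete(prompt, slot):
    r = requests.post(
        "http://localhost:8080/completion",
        json={
            "prompt": prompt,
            "n_predict": 128,
            "id_slot": slot,        # pin this card to its own slot...
            "cache_prompt": True,   # ...so its prompt cache sticks around
        },
    )
    return r.json()["content"]

print(complete("Card A's system prompt...\nUser: hi\nA:", slot=0))
print(complete("Card B's system prompt...\nUser: hi\nB:", slot=1))
[/code]
No idea if Silly exposes that without patching, which is really the question.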
>>
>>101440202
>>101440187
I think it was too early to get their clusters fully online, and they're still in the process of getting and building more. They began training and setting in stone what Llama 3 was going to be earlier than some of the current hyped research could prove scalable. Llama 4 is probably going to be the one with a more unique architecture.

However, it is pretty normal that a big corporation lags behind startups when it comes to smaller-scale, faster-paced rollouts of technology. Their advantage is that when they do roll out a new product, it's done with more money. That doesn't always equal a better product. In the case of LLMs it means they can spend a lot of time training the model, like on 15T tokens, or they can train a 400B dense, or whatever. Startups may come out with a BitNet or Jamba or whatever sooner, but then it'll take a megacorp to produce a BitNet or Jamba with 15T tokens pumped into one, or a 400B, etc.
>>
>>101442061
Maybe ollama has updated and fixed this since I used it a month or two ago, but I found the /save feature to be total ass. The parser was busted and many character sequences would kill the parser, causing it to save only the first few turns of the conversation.

I could avoid them on my side (for example, NEVER end a turn with a " character; a space or spare period after it would be fine, but if the turn ended right on a quote mark from dialogue, that's the end of the save). Of course, the AI could write killing sequences too, so even if I'm careful it's a doomed chat.

I roll Kobold now. Much easier to adjust settings now that I know them, I don't have to play Ollama's silly JSON and renamed file game, and I can use (almost) all of the models.

Ollama is a great introduction but the instant you want to do more, step up to a better wrapper.
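That said, if you really want to stay in the terminal for now, you can sidestep /save entirely by keeping the history yourself and replaying it through ollama's /api/chat endpoint. Rough sketch (the file name and model are placeholders):
[code]
import json, pathlib, requests

HISTORY = pathlib.Path("chat_history.json")
messages = json.loads(HISTORY.read_text()) if HISTORY.exists() else []

while True:
    messages.append({"role": "user", "content": input("> ")})
    r = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": "llama3", "messages": messages, "stream": False},
    )
    reply = r.json()["message"]["content"]
    print(reply)
    messages.append({"role": "assistant", "content": reply})
    HISTORY.write_text(json.dumps(messages))   # history survives restarts
[/code]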
>>
>>101442269
so they allow training on the output? If so that would probably be great for Mistral and similar
>>
>>101439126
Isn't Mixture of a Million Experts basically Marvin Minsky's Society of Mind? Also with the fact that it could theoretically have lifelong learning...

Bros, we're actually going to get lifelike AI gfs in our lifetimes, aren't we?
>>
so can anyone actually even test the 405b? by maybe renting some gpus online or something?
>>
>>101442432
I believe someone from huggingface said they'd host it (though I'm quite confused that the hugging.chat models are always broken)
>>
>>101442286
thank you for an actual answer
kobold seems interesting but i wanted something working in the terminal for now
I guess I will try some ollama terminal clients, I see there are quite a few
can anyone recommend a specific one?
>>
>>101442432
If any providers bother with it, it will probably be on OpenRouter, but for $2 a gen, I imagine.
>>
>>101442443
I too enjoyed that Ollama was terminal-based at first, and it was something I wanted when getting started. But then I wanted to change settings without curling one-line JSON strings, didn't like the multi-line text bug, the inability to save, etc., so I switched to Kobold. Now it's trivial for me to reach it from my phone when I step away from my computer, I have access to important settings, state saving works correctly, and easy access to editing the document to fix errors or reroll a response is very nice to have.
>>
>>101439447
>Weren't MoEs cheaper to train?
Yeah, generally. It wasn't necessarily true for this "distributed" FFW method, though. Looking at the paper, it should be more FLOP-efficient. I'm actually warming up to the concept as I dig into it a bit more; there might be something really good here.

This paper alone isn't enough to sell the idea, but I think a few minor improvements and suddenly this becomes the new training paradigm.
>>
>>101442219
Yeah, this is not the ollama tech support general. Go to Reddit to shill your scam.
>>
state space 1 million expert bitnet 1T parameter model when?
>>
>>101442599
Don't forget with JEPA and native multimodal and multitoken prediction.
>>
>>101442599
BitNet probably wouldn't work well with experts on the order of thousands of parameters in size.
>>
>>101442599
we need to return to good old lstms
>>
mambeleonbytejepabitnetMoaME 600B when
>>
>>101442599
with fill in the middle support
>>
>in my system prompt I give the narrator a personality, since that decreases slop
>after plapping the character from the card, ask what the narrator thinks so far using OOC
>she's horny
>propose her, herself joining on the fun
>now I get to plap the original character as well as the narrator turned into a character
Let's fucking go.
>>
llama.cpp has support for some NPU now.

>>101442621
Wouldn't it be hilarious if the next step is to go back but with a couple of adjustments and at a bigger scale?
It's not even absurd to think that since that kind of thing happens all the time.
>>
>>101442697
I'll type the weights by hand
>>
https://github.com/ggerganov/llama.cpp/pull/8543
>Add support for Chameleon #8543
that one anon will be happy
>For now, this implementation only supports text->text inference and serves as base to implement the (more interesting) image->text, text->image and interleaved pipelines. However, such an implementation will probably require some changes to the CLI and internal architecture, so I suggest to do this in a separate PR.
oh...
>>
>>101442599
right after you hang yourself.
>>
>>101442729
Interesting. Hopefully now we can get some benchmarks of how those new laptops do with LLMs.
>>
>>101442729
isn't there any standard NPU api?
>>
>>101442754
they've still got poor memory bandwidth compared to low end nvidia gpus
>>
>>101442792
Doubt it.
It's probably the same situation as GPUs where each vendor has its own computing architecture with its own APIs.
>>
>>101442804
Well yeah. But low-power windoze/linux laptops that can run LLMs are still pretty good to have exist. It would also be good to know how the LPDDR5X in them performs, and whether it or the compute is the bottleneck in these machines. We could then extrapolate to think about how a desktop with a similar chip could perform paired with a separate GPU.
>>
Why don't they just add matrix multiplication to slide rules?
>>
>>101440064
Meta became irrelevant when they started to filter their pre-training dataset with LlamaGuard. Claude shows that a diverse dataset with no regard for safety produces great results. A model that could not learn anything 'unsafe' because the data it was fed was designed to be 99% safe will never be good.
>>
>>101443047
hi petra
>>
>>101439308
Disregarding nuclear safety regulations with Miku
>>
File: teto.jpg (158 KB, 1024x1024)
How do I constrain the model's output in tabbyAPI? I tried "json_schema": {"type": "string","enum": ["Yes","No","Maybe"]}, but I keep getting errors:
ERROR: ExLlamaV2Sampler.sample(
ERROR: File "/home/petra/tabbyAPI/venv/lib/python3.10/site-packages/exllamav2/generator/sampler.py", line 247, in sample
ERROR: assert pass_tokens, "Filter excluded all tokens"
ERROR: AssertionError: Filter excluded all tokens
ERROR: Sent to request: Completion aborted. Maybe the model was unloaded? Please check the server console.
>>
>>101443072
Are you retarded or just schizophrenic?
>>
>>101443047
This, this is why Command-R is good, and partially what makes WLM 8x22b wayyy better than 8x22b instruct. (Although I think WLM just has some other interesting continued pretraining and finetuning techniques up their sleeve, iirc.)
>>
File: 1695520564243917.png (20 KB, 1009x382)
>>101443222
When will this meme die?
>>
>>101443244
if you mean the huggingface leaderboard meme, whenever you stop posting it
>>
>>101443222
>what makes WLM 8x22b wayyy better
When are the mikufags going to drop this meme? How is this related to pretraining when it's a finetune done with GPT-4 outputs?
>>
>>101443222

They have a paper out for Wizard. You can replicate it if you have the compute and money.

They ran an offline arena and basically trained on the best outputs from it, over and over again. SFT, DPO, PPO, I believe? Reward-model type shit.
>>
>>101443349

Basically.

https://x.com/victorsungo/status/1811427047341776947?t=k7ZXwSCRnYKBW_7Rj0_q6w&s=19
>>
File: file.png (479 KB, 593x812)
Duality of /lmg/
>>
File: 1719351514748681.jpg (575 KB, 2048x2048)
>>101443244
>Aktually according to the HF leaderboard...
>t. Never run Mistral8x22 or Wiz8x22 for himself
>>101443503
>Yes with a 5% lead
We are so back bros
>>
>>101443158
good image
>>
>>101443047
buy an ad
>>
>>101439122
>>(07/13) Llama 3 405B coming July 23rd: https://x.com/steph_palazzolo/status/1811791968600576271

what am I gonna need to run this?
>>
Will we even need to make videos anymore if we can just generate them?
>>
>>101443763
There are artists still making art even with AI generated art. There are still writers despite AI being able to write. There will still be filmmakers despite AI being able to make film. Some people just like to create.
>>
>>101443724
At least a raspberry pi
>>
>>101443763
>t. the least illiterate AI bro
>>
I have been doing some testing between GGUF and EXL2.

I have always preferred exl2 as it was faster and had Q4 cache for KV.

But the speed difference seems to have vanished, and llama.cpp supports Q4 KV cache too now.

All averaged over 3 runs, using SillyTavern as the front end, with FA and Q4 cache in both exllama (mandatory in tabbyAPI) and llama.cpp:

TabbyAPI Backend (ExLlamaV2 0.1.7):
WizardLM2 8x22B Exl2 @ 4.0bpw 24.1t/s

llama.cpp Backend (pulled 2 hours ago):
WizardLM2 8x22B GGUF @ Q4_K_M 25.2t/s

Textgenwebui Backend Exl2:
WizardLM2 8x22B Exl2 @ 4.0bpw 22.1t/s

Textgenwebui Backend GGUF:
WizardLM2 8x22B Exl2 @ 4.0bpw 23.2t/s


Given the broader support for llama.cpp, and thus better compatibility with devices and faster compatibility with new models, is there any reason to still use exl2?

System:
4x3090's at 250w max
EPYC 7402
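In case anyone wants to reproduce without SillyTavern, roughly how you could sanity-check the t/s over the OAI endpoints both backends expose (a sketch with default ports assumed, tabbyAPI on 5000 and llama-server on 8080; tabbyAPI will also want its API key header, check its config):
[code]
# llama-server launched with the cache settings mentioned above, e.g.:
#   llama-server -m wizardlm2-8x22b.Q4_K_M.gguf -ngl 99 -fa -ctk q4_0 -ctv q4_0
import time, requests

def tokens_per_second(base_url, n=512):
    t0 = time.time()
    r = requests.post(
        base_url + "/v1/chat/completions",
        json={"messages": [{"role": "user", "content": "Write a long story."}],
              "max_tokens": n, "temperature": 0},
    )
    return r.json()["usage"]["completion_tokens"] / (time.time() - t0)

print("llama.cpp:", tokens_per_second("http://localhost:8080"))
print("tabbyAPI :", tokens_per_second("http://localhost:5000"))
[/code]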
>>
>>101444001
>>Textgenwebui Backend GGUF:
>>WizardLM2 8x22B Exl2 @ 4.0bpw 23.2t/s

I meant WizardLM2 8x22B GGUF @ Q4_K_M 23.2t/s
>>
>>101444001
Can you also provide results for prompt processing?
>>
>>101444001
>Is there any reason to still use exl2?
Unironically no.
>>
>>101444001
>those numbers
big if true
>>
>>101444001
exl2sisters... what went wrong?
>>
>>101444001
The llama.cpp server is too rough around the edges and tabbyAPI is more polished. I would switch back to exllama if Gemma 2 worked as well there as it does with llama.cpp.


