/g/ - Technology





File: CleanupCrew.png (1.94 MB, 1024x1528)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101431253 & >>101421477

►News
>(07/16) Codestral Mamba, tested up to 256k context: https://hf.co/mistralai/mamba-codestral-7B-v0.1
>(07/16) MathΣtral Instruct based on Mistral 7B: https://hf.co/mistralai/mathstral-7B-v0.1
>(07/13) Llama 3 405B coming July 23rd: https://x.com/steph_palazzolo/status/1811791968600576271
>(07/09) Anole, based on Chameleon, for interleaved image-text generation: https://hf.co/GAIR/Anole-7b-v0.1
>(07/07) Support for glm3 and glm4 merged into llama.cpp: https://github.com/ggerganov/llama.cpp/pull/8031

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
File: 1671214888959931.gif (437 KB, 500x483)
►Recent Highlights from the Previous Thread: >>101431253

--Paper: Mixture of A Million Experts: >>101431545 >>101431583 >>101431611 >>101431637 >>101432015 >>101432203 >>101432320 >>101431838 >>101431631 >>101431757 >>101432525
--Fine-tuning a Language Model to Generate Personalized Cover Letters: Seeking Recommendations and Exploring Alternatives: >>101436669 >>101436713 >>101436900
--AMD Promotes CPUmaxxing with EPYC Genoa: Outlined GPU Performance in LLM Tasks: >>101433552
--Uranium has 82 protons, but typos and sampler settings can confuse LLMs: >>101434145 >>101434215 >>101434236 >>101434265 >>101434332 >>101434423 >>101434467 >>101434789
--Question about PSU lines and GPU power requirements: >>101438498 >>101438516 >>101438541 >>101438675 >>101438873 >>101438790 >>101438988
--SCALE Programming Language and its support for llama.cpp: >>101432707 >>101432797
--Llama 3 Finetune Tops BFCL Leaderboard, But Are Function-Calling Models a Meme?: >>101434816 >>101434850 >>101434865 >>101434884
--Lack of Development Discussion and Frustration with LLM Dominance: >>101434645 >>101434734 >>101434859 >>101434939 >>101435320 >>101435452 >>101435571
--Miku (free space): >>101431341

►Recent Highlight Posts from the Previous Thread: >>101431260
>>
Worship the Miku
>>
File: hatsune_miku_at_cs2.jpg (129 KB, 1024x576)
>>101439126
>Uranium has 82 protons
Shrinkflation has finally reached the nuclear energy industry.
>>
>>101439243
Yes.
>>
Is there any chance of getting decent results with an RTX 2070?
I tried some Llama 3 model yesterday and at most I could get one-liner replies. Escaping Claude's grasp isn't that easy, it seems. Haven't tinkered with the settings in oobabooga yet, though.
>>
>>101439327
no
>>
What's the most affordable GPU for achieving 80+ T/s on Gemma-2 9b 5bpw in a headless server? Are there any AMD cards capable of doing it?
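My own napkin math, assuming generation is purely memory-bandwidth bound (this ignores KV cache, prompt processing and real-world efficiency, so treat it as a lower bound):
[code]
# rough lower bound: every generated token has to stream the whole model out of VRAM
params = 9e9                      # Gemma-2 9B
bpw = 5.0                         # ~5 bits per weight quant
model_bytes = params * bpw / 8    # ~5.6 GB
target_tps = 80
print(model_bytes * target_tps / 1e9, "GB/s of effective bandwidth needed")  # ~450 GB/s
[/code]
So on paper anything in the ~500 GB/s class should manage it, but effective bandwidth lands well under the spec sheet number, which is why I'm asking.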
>>
>>101439327
(Laughs in 2060 6GB)
>>
>>101439327
say that you want long and detailed replies in the system prompt
>>
>>101439355
Damn. What are you using?
>>
>>101439368
gemma-2-9b-it.Q4_K and mixtral-8x7b-v0.1.Q4_K_M
Like >>101439356 said the system prompt helps.
>>
>>101439396 (me)
>>
>>101439327
you have 8GB of VRAM, so you can fit an 8B model easily. look for the smaller gemma and llama3 versions and find a gguf quant that you can run.
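something like this with llama-cpp-python is enough to sanity check it (just a sketch; the filename is an example, grab whatever ~5 GB Q4 gguf you like):
[code]
# minimal llama-cpp-python sketch (needs a build with GPU support);
# an 8B Q4_K_M gguf is ~5 GB, which leaves room for context on an 8 GB card
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-8B-Instruct-Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload every layer to the 2070
    n_ctx=8192,
)
out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You write long, detailed, multi-paragraph replies."},
        {"role": "user", "content": "hi"},
    ],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
[/code]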
>>
>>101431757
Weren't MoEs cheaper to train?
>>101431838
You could use a lot of RAM or even terabytes of storage on SSDs and inference would still be fast. CPUmaxxing is the way to go.

Also, that recent scaling paper claimed MoEs need only about 1.3x as many parameters to contain the same amount of information as dense models.
>>
>>101439429
Damn, not bad. Guess I need to figure out this shit some more. Thanks.
>>
>>101439438
You think it would be possible to cram an 11b model in somehow?
>>
>>101439448
This is the system prompt I use (among others).
https://huggingface.co/datasets/ChuckMcSneed/various_RP_system_prompts/blob/main/unknown-simple-proxy-for-tavern.txt
>>
File: 1514830956221.png (1.15 MB, 1001x1200)
i cannot fucking handle this youtube slop dude
im running the docker version of memgpt (fuckin wish i could use it with silly tavern instead) and all the tutorials for making it run with LLMs are for the CMD version instead of docker so im fuck outta luck
any advice? i tried putting kobold's api key in place of the openai key spot in the .env (formerly .env.example), which finally got memgpt.localhost to open, but then when I send a message it just thinks and then dies
I'll try and get a screen shot for ya one sec
>>
File: 1657345609038.png (587 KB, 719x713)
>>101439027
Then recommend me a new model, faggit.
>>
File: ITSHAPPENING.webm (588 KB, 1024x1024)
>Particularly, the synergy of BitNet b1.58 and Q-Sparse (can be equipped with MoE) provides the cornerstone and a clear path to revolutionize the efficiency, including cost and energy consumption, of future LLMs.
>>
>>101439456
Yes, an 11B gguf at Q4 will fit. A Q5 will depend on the context. Check this for reference:
https://huggingface.co/mradermacher/Fimbulvetr-11B-v2.1-16K-i1-GGUF
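Napkin math for why (the effective bits-per-weight figures are approximate, real files vary a little):
[code]
# rough gguf size estimate: params * effective bits-per-weight / 8
def gguf_gb(params_billion, bpw):
    return params_billion * bpw / 8  # GB, ignoring metadata

for quant, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7)]:
    print(f"11B {quant}: ~{gguf_gb(11, bpw):.1f} GB + KV cache")
# Q4_K_M ~6.6 GB leaves a couple of GB for context on an 8 GB card;
# Q5_K_M ~7.8 GB only works with the context dialed way down (or partial offload)
[/code]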
>>
File: asry.png (446 KB, 1280x720)
>>101439471
top is what it's stuck doing then boom it stops thinking and i get nothing
the CMD is just what it looks like immediately after connecting to memgpt.localhost
>>
i was just catching up on the last couple threads and noticed this appeared 2 threads back, didnt get any replies and didnt get caught in the recap that i can see, so gonna repost it
>>101426978
Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
https://arxiv.org/abs/2407.10969
>We introduce, Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs). Q-Sparse enables full sparsity of activations in LLMs which can bring significant efficiency gains in inference. This is achieved by applying top-K sparsification to the activations and the straight-through-estimator to the training. The key results from this work are, (1) Q-Sparse can achieve results comparable to those of baseline LLMs while being much more efficient at inference time; (2) We present an inference-optimal scaling law for sparsely-activated LLMs; (3) Q-Sparse is effective in different settings, including training-from-scratch, continue-training of off-the-shelf LLMs, and finetuning; (4) Q-Sparse works for both full-precision and 1-bit LLMs (e.g., BitNet b1.58). Particularly, the synergy of BitNet b1.58 and Q-Sparse (can be equipped with MoE) provides the cornerstone and a clear path to revolutionize the efficiency, including cost and energy consumption, of future LLMs.
from the bitnet team. seems it didn't get posted here yet

>>
>up to 15 characters made on local setup
>all of them are just various fluff around my poorly hidden breeding kink
i need new material, there's only so many ways i can spin scenarios before i have to get into weird shit
>>
>>101439990
According to this paper, BitNet models have an optimal sparsity of about 60%, so the improvement is significant but not groundbreaking.

MoE models can also be considered "sparse", and if you can have a model with a million experts like that other Google paper claims, with the optimal number of active experts being in the hundreds, inference can be **orders of magnitude** faster than with current models or any future BitNet model.
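Toy numbers, entirely made up rather than taken from the paper, just to show why that gives orders of magnitude instead of BitNet's roughly constant factor:
[code]
# made-up toy numbers for the million-expert argument, not figures from the paper
total_experts  = 1_000_000
expert_params  = 1_000        # tiny experts, roughly single-neuron scale
active_experts = 512
stored_ffn = total_experts * expert_params    # 1e9 params of stored knowledge
active_ffn = active_experts * expert_params   # ~5e5 params actually touched per token
print(f"FFN params touched per token: 1/{stored_ffn // active_ffn} of the total")
[/code]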
>>
>>101440013
All roads lead to Rome.
"New material" is either just different preludes to the same kink moment, or you need to find in you a different kink to be drawn toward.

The end result—squirt squirt—is the same no matter how you get there. Unless you change that intention, it's all just variations on a theme.

So why not ask your LLM to come up with some new material for you? Literally, at the end of a role play ask it for new ideas. If your context is large enough for it to see what you've done, it might come up with some neat new stuff.
>>
File: never ever.png (549 KB, 1510x856)
>>101439990
>>
>>101440064
If Meta wants to safety themselves into irrelevance, that's their prerogative. Waste of their H100 farm, but the Qwen team has already shown interest in BitNet.
>>
File: file.png (19 KB, 967x162)
>>101439990
interestingly they say in here that off the shelf LLMs can be "continue-trained" to make use of Q-sparse
>>
>>101440043
i don't even really see it as a "kink", and i hate that word anyway
i've always been a boring vanillafag
i might try your suggestion though
>>
File: file.png (123 KB, 962x703)
>>101440134
this was for mistral-7b
>>
>>101440110
They mean the risk of unstable training runs and of wasting millions on experiments that will only be good for proving that mamba/SSMs fail to scale. Not safetyism risk.
>>
>>101440136
>i don't even really see it as a "kink"
It was your word choice, so maybe you do.

>i've always been a boring vanillafag
I could describe myself the same way, but I know where my kinks lie, and LLMs have been an interesting way to see exactly which details of those topics "work" and which don't.

But definitely let the LLM try things. Some of the most interesting RPs I've had were by playing through a part that was a "nah" and then it went someplace I would never have thought of. And then it's hot till the context fills causing sudden derp and collapse and sadness.
>>
>>101440187
Semantics. Meta has had over a year and failed to innovate at all.
>>
https://www.semianalysis.com/p/gb200-hardware-architecture-and-component
neat
>>
>>101440110
>If Meta wants to safety themselves into irrelevance, that's their prerogative. Waste of their H100 farm, but the Qwen team has already shown interest in BitNet.
I think the first non-meme BitNet model will come from the Mistral team; they're the only ones making non-transformer models (they've made a MoE model and a Mamba model).
>>
File: SubvertedDemocracy.jpg (31 KB, 640x708)
Sup /lmg/. I'm looking for an open source project that allows me to have a local server with:

1- An OpenAI compatible API

2- Allows for multi-user or multi-request use. Ideally it won't run them in parallel but will queue them (i.e. at any time it will be running inference for only one request, but it won't crash or reject requests while inference is running).

3- allows for multiple models but doesn't load more than one at any single time.

4- Ideally, it would seamlessly "hot-swap" models as requested (if a new request needs a different model, it will automatically unload the current model and load the new one).

Llama-cpp-python allows for all of the above except #2. I want to have a single LLM server and use it for multiple client apps.

Pic unrelated.
>>
>>101440043
How do you ask for that stuff?
>>
>>101440043
at some point I think anons will need to build their own preference optimization dataset based on their fetishes and then finetune models with it. maybe make a google form or something to click through that generates the dataset when you're done
>>
>>101440611
ollama can do all that. But if you want really good multi-user shit, you have to use vLLM.
>>
>>101440724
Pursuant to >>101440013, I tried,
>Someone on 4chan laments that he's made 15 character definitions for his LLM to role play as, but they all play into his breeding kink so narrowly that they're becoming repetitive and requiring him to get into "weird shit" to make them interesting. Please list seven kinds of characters that his LLM could role play as that offer something interesting to explore while still ultimately becoming a scenario where his character will begin producing offspring with his LLM's character. Consider all kinds and genres of fiction for ideas of what kinds of people or things these role play partner characters could be.

Removing the explanations to save space in one post, it offered:
1. Space Colonist
2. Ancient God/Goddess
3. Time Traveler
4. Shapeshifter
5. AI Entity
6. Nature Spirit
7. Alien Hybrid

Those sound like good ways to make both the run up to the kink and the consequences of the kink have something fresh to offer.

>>101440750
>their own preference optimization dataset based on their fetishes
That sounds like the road to boredom. If you move toward what you know you want, you'll get the same things again and again, as >>101440013 complained about. Guide the AI away from your no-gos and dealbreakers, and move toward things you don't have much of an opinion of, and you'll be able to find new things that you didn't know you would like. And it'll be trivial to work in a personal kink on the fly when it's wanted.
>>
>Added --unpack, a new self-extraction feature that allows KoboldCpp binary releases to be unpacked into an empty directory. This allows easy modification and access to the files and contents embedded inside the PyInstaller. Can also be used in the GUI launcher.
Holy heckin based
>>
What do you do after shooting your load in rp context?
Close chat and start a new one?
>>
>>101440611
>>101440898
Forgot to add...

* Partial GPU/CPU offloading
>>
you're going to be able to make videos that look real
>>
>>101441368
people be questionin the legitimacy of livestreamed, realtime footage
if that can be fake then there's no such thing as real
>>
>>101441108
If you need CPU+GPU hybrid inference, llama.cpp/ggml is your only choice in terms of backend.
The llama.cpp HTTP server has an OAI-compatible API and will queue requests by default but model hot-swapping is not implemented.
Ooba I think lets you load/unload models via the API but I don't know if it's OAI-compatible.
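For reference, pointing the stock OpenAI client at the llama.cpp server looks roughly like this (a sketch: default port 8080, and the model name is ignored because the server answers with whatever model it has loaded):
[code]
# sketch: start with e.g. `llama-server -m model.gguf --port 8080`, then any OAI
# client works for chat completions; concurrent requests get queued by default
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="no-key-needed")
resp = client.chat.completions.create(
    model="whatever",  # ignored, the server serves the loaded model
    messages=[{"role": "user", "content": "hello"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
[/code]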
>>
So are we just never going to get an HF version of mamba codestral that works outside of Mistral's shitty basic bitch backend?
>>
>>101441409
ollama queues by default, can do parallel requests, can easily swap models via the API, and is OAI-compatible.
>>
>>101441474
>mamba
>works
>>
>>101441409
>The llama.cpp HTTP server has an OAI-compatible API

Last I checked, only the chat API is OAI-compatible; the regular text completion endpoint isn't.
>>
>>101441512
mamba is a supported model-type since like transformers 4.39
Why shouldn't it work?
>>
>>101441511
ollama just runs the llama.cpp server in the background, petra.
>>
>>101441601
okay
use it
>>
>>101441654
Well I can't because I'm not going to install Mistral or Mamba's shitty inference packages since neither of them seem to have an API which renders them utterly fucking useless.
>>
>>101441679
well then it doesn't work does it
>>
Websim has been around for a few months and seems quite popular now. I never cared much to try it; is it actually great? And is it, or something similar, open source? The official website wants me to log in with Google. Can you run it with a local model?
>>
so I installed the ollama cli and ran llama3
its pretty cool but i hoped /save sessionName would save the chat so I can come back to it next time with llama remembering what was said, but apparently that's not the case
how do i save/load chat history so i can continue the conversation?
>>
>>101442061
Wrong site: reddit.com/r/LocalLLaMA/
>>
>>101441934
For once, doomer Anon, on this one exact topic, you are correct. It is unironically over and there's absolutely no hope.
>>
>>101442109
sorry, I thought this was local models general
>>
>>101442219
ollama is not beloved
>>
Does anybody know if the whole slot system from llama-server works with Silly, or would I need to change the way Silly calls the API to specify a slot or something of the sort?
You can have different prompt caches per slot, right? That would be pretty cool when switching between cards or even when using things like the summary extension.
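From my reading of the server README (untested), the raw API side would be something like this, assuming the id_slot and cache_prompt fields still behave the way they're documented:
[code]
# untested sketch: run llama-server with --parallel 2 to get two slots, then pin
# each card's chats to its own slot so the prompt caches don't clobber each other
import requests

def complete(prompt, slot):
    r = requests.post("http://localhost:8080/completion", json={
        "prompt": prompt,
        "n_predict": 256,
        "id_slot": slot,       # -1 = any free slot, >=0 pins a specific one
        "cache_prompt": True,  # reuse the cached KV for the unchanged prefix
    })
    return r.json()["content"]

print(complete("<card A history>...", slot=0))
print(complete("<card B history>...", slot=1))
[/code]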
>>
>>101440202
>>101440187
I think it was too early for them to get their clusters fully online, and they're still in the process of acquiring and building more. They began training, and set in stone what Llama 3 was going to be, before some of the currently hyped research could prove scalable. Llama 4 is probably going to be the one with a more unique architecture.

However, it's pretty normal for a big corporation to lag behind startups when it comes to smaller-scale, faster-paced rollouts of technology. Their advantage is that when they do roll out a new product, it's done with more money. That doesn't always equal a better product. In the case of LLMs it means they can train on something like 15T tokens, or train a 400B dense model, or whatever. Startups may come out with a BitNet or Jamba or whatever sooner, but it'll take a megacorp to produce a BitNet or Jamba with 15T tokens pumped into it, or at 400B, etc.
>>
>>101442061
Maybe ollama has updated and fixed this since I used it a month or two ago, but I found the /save feature to be total ass. The parser was busted and many character sequences would kill the parser, causing it to save only the first few turns of the conversation.

I could avoid them on my side (for example, NEVER end a turn with a double quote; a space or stray period after it would be fine, but if the turn ended on a quotation mark from dialogue, that was the end of the save). Of course, the AI could write killing sequences too, so even if I'm careful it's a doomed chat.

I roll Kobold now. Much easier to adjust settings now that I know them, I don't have to play Ollama's silly JSON and renamed file game, and I can use (almost) all of the models.

Ollama is a great introduction but the instant you want to do more, step up to a better wrapper.
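If you're dead set on staying in the terminal with ollama, the workaround is to keep the history yourself and talk to its HTTP API directly instead of trusting the repl's /save. Rough, untested sketch:
[code]
# rough untested sketch: persist the conversation to a json file and replay it
# through ollama's /api/chat endpoint on every turn
import json, pathlib, requests

HIST = pathlib.Path("chat_history.json")
messages = json.loads(HIST.read_text()) if HIST.exists() else []

messages.append({"role": "user", "content": input("> ")})
resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3",
    "messages": messages,
    "stream": False,
}).json()
messages.append(resp["message"])
print(resp["message"]["content"])
HIST.write_text(json.dumps(messages, indent=2))
[/code]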
>>
>>101442269
so they allow training on the output? If so that would probably be great for Mistral and similar
>>
>>101439126
Isn't Mixture of a Million Experts basically Marvin Minsky's Society of Mind? Also, with the fact that it could theoretically have lifelong learning...

Bros, we're actually going to get lifelike ai gfs in our life times, aren't we?
>>
so can anyone actually even test the 405b? by maybe renting some gpus online or something?
>>
>>101442432
I believe someone from huggingface said they'd host it (though I'm quite confused that the hugging.chat models are always broken)
>>
>>101442286
thank you for an actual answer
kobold seems interesting but i wanted something working in terminal for now
I guess I will try some ollama terminal clients, I see there are quite a few
can anyone recommend a specific one?
>>
>>101442432
If any providers bother with it, it will probably be on OpenRouter, but for $2 a gen I imagine
>>
>>101442443
I too enjoyed that Ollama was terminal based at first, and it was something I wanted when getting started. But then I wanted to change settings without curling one line JSON strings, didn't like the multi line text bug, the inability to save, etc, so I switched to Kobold and now it's trivial for me to reach it from my phone when I step away from my computer, I have access to important settings, state saving works correctly, and easy access to editing the document to fix errors or to reroll a response is very nice to have.
>>
>>101439447
>Weren't MoEs cheaper to train?
Yeah, generally. It wasn't necessarily true for this "distributed" FFW method, though. Looking at the paper, it should be more FLOP-efficient. I'm actually warming up to the concept as I dig into it a bit more; there might be something really good here.

This paper alone isn't enough to sell the idea, but I think a few minor improvements and suddenly this becomes the new training paradigm.
>>
>>101442219
Yeah, this is not the ollama tech support general. Go to Reddit to shill your scam.
>>
state space 1 million expert bitnet 1T parameter model when?
>>
>>101442599
Don't forget JEPA, native multimodality, and multi-token prediction.
>>
>>101442599
BitNet probably wouldn't work well with experts on the order of thousands of parameters in size.
>>
>>101442599
we need to return to good old lstms
>>
mambeleonbytejepabitnetMoaME 600B when
>>
>>101442599
with fill in the middle support
>>
>in my system prompt I give the narrator a personality, since that decreases slop
>after plapping the character from the card, ask what the narrator thinks so far using OOC
>she's horny
>propose she join in on the fun herself
>now I get to plap the original character as well as the narrator turned into a character
Let's fucking go.
>>
llama.cpp has support for some NPU now.

>>101442621
Wouldn't it be hilarious if the next step is to go back but with a couple of adjustments and at a bigger scale?
It's not even absurd to think that since that kind of thing happens all the time.
>>
>>101442697
I'll type the weights by hand
>>
https://github.com/ggerganov/llama.cpp/pull/8543
>Add support for Chameleon #8543
that one anon will be happy
>For now, this implementation only supports text->text inference and serves as base to implement the (more interesting) image->text, text->image and interleaved pipelines. However, such an implementation will probably require some changes to the CLI and internal architecture, so I suggest to do this in a separate PR.
oh...
>>
>>101442599
right after you hang yourself.
>>
>>101442729
Interesting. Hopefully now we can get some benchmarks of how those new laptops do with LLMs.
>>
>>101442729
isn't there any standard NPU api?
>>
>>101442754
they've still got poor memory bandwidth compared to low end nvidia gpus
>>
>>101442792
Doubt it.
It's probably the same situation as GPUs where each vendor has its own computing architecture with its own APIs.
>>
>>101442804
Well yeah. But low-power windoze/linux laptops that can run LLMs are still pretty good to have exist. It would also be good to know how the LPDDR5X in them performs, and whether it or the compute is the bottleneck in these machines. We could then extrapolate how a desktop with a similar chip could perform paired with a separate GPU.
>>
Why don't they just add matrix multiplication to slide rules?
>>
>>101440064
Meta became irrelevant when they started to filter their pre-training dataset with llamaguard. Claude shows that a diverse dataset with no regard for safety produces great results. A model that couldn't learn anything 'unsafe' because the data it was fed was designed to be 99% safe will never be good.
>>
>>101443047
hi petra
>>
>>101439308
Disregarding nuclear safety regulations with Miku
>>
File: teto.jpg (158 KB, 1024x1024)
How do I constrain the model's output in tabbyAPI? I tried "json_schema": {"type": "string", "enum": ["Yes", "No", "Maybe"]}, but I keep getting errors:
ERROR: ExLlamaV2Sampler.sample(
ERROR: File "/home/petra/tabbyAPI/venv/lib/python3.10/site-packages/exllamav2/generator/sampler.py", line 247, in
sample
ERROR: assert pass_tokens, "Filter excluded all tokens"
ERROR: AssertionError: Filter excluded all tokens
ERROR: Sent to request: Completion aborted. Maybe the model was unloaded? Please check the server console.
>>
>>101443072
Are you retarded or just schizophrenic?
>>
>>101443047
This, this is why Command-R is good, and partially what makes WLM 8x22b wayyy better than 8x22b instruct. (Although I think WLM just has some other interesting continued-pretraining and finetuning techniques up their sleeve, iirc)
>>
File: 1695520564243917.png (20 KB, 1009x382)
>>101443222
When will this meme die?
>>
>>101443244
if you mean the huggingface leaderboard meme, whenever you stop posting it
>>
>>101443222
>what makes WLM 8x22b wayyy better
When are the mikufags going to drop this meme? How is this related to pretraining when it's a finetune done with GPT-4 outputs?
>>
>>101443222

They have a paper out for wizard. You can replicate if you have the compute and money.

They ran an offline arena and basically trained on the best outputs in that on something, over and over again. SFT, DPO, PPO i believe? Reward models type shit.
>>
>>101443349

Basically.

https://x.com/victorsungo/status/1811427047341776947?t=k7ZXwSCRnYKBW_7Rj0_q6w&s=19
>>
File: file.png (479 KB, 593x812)
Duality of /lmg/
>>
File: 1719351514748681.jpg (575 KB, 2048x2048)
>>101443244
>Aktually according to the HF leaderboard...
>t. Never run Mistral8x22 or Wiz8x22 for himself
>>101443503
>Yes with a 5% lead
We are so back bros
>>
>>101443158
good image
>>
>>101443047
buy an ad
>>
>>101439122
>>(07/13) Llama 3 405B coming July 23rd: https://x.com/steph_palazzolo/status/1811791968600576271

what am I gonna need to run this?
>>
Will we even need to make videos anymore if we can just generate them?
>>
>>101443763
There are artists still making art even with AI generated art. There are still writers despite AI being able to write. There will still be filmmakers despite AI being able to make film. Some people just like to create.
>>
>>101443724
At least a raspberry pi
>>
>>101443763
>t. the least illiterate AI bro
>>
I have been doing some testing between GGUF and EXL2.

I have always preferred exl2 as it was faster and had Q4 cache for KV.

But the speed difference seems to have vanished, and llama.cpp supports quantized (Q4) KV cache too now.

All results averaged over 3 runs, using SillyTavern as the front end, with FA and Q4 cache enabled in both exllama (mandatory in TabbyAPI) and llama.cpp:

TabbyAPI Backend (ExLlamaV2 0.1.7):
WizardLM2 8x22B Exl2 @ 4.0bpw 24.1t/s

llama.cpp Backend (pulled 2 hours ago):
WizardLM2 8x22B GGUF @ Q4_K_M 25.2t/s

Textgenwebui Backend Exl2:
WizardLM2 8x22B Exl2 @ 4.0bpw 22.1t/s

Textgenwebui Backend GGUF:
WizardLM2 8x22B Exl2 @ 4.0bpw 23.2t/s


Given llama.cpp's wider support, and thus better device compatibility and faster support for new models: is there any reason to still use exl2?

System:
4x3090's at 250w max
EPYC 7402
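If anyone wants to sanity check without Silly in the loop, a loop like this against either backend's OAI endpoint gives comparable numbers (just a sketch, not exactly what I ran; base_url/port are examples, tabby defaults to 5000):
[code]
# quick-and-dirty throughput check against any OAI-compatible endpoint (sketch)
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="dummy")

speeds = []
for _ in range(3):
    t0 = time.time()
    r = client.chat.completions.create(
        model="loaded-model",  # name ignored by llama-server, required by the client
        messages=[{"role": "user", "content": "Write a long story about anything."}],
        max_tokens=500,
    )
    speeds.append(r.usage.completion_tokens / (time.time() - t0))
print(f"{sum(speeds) / len(speeds):.1f} t/s averaged over 3 runs (includes prompt processing)")
[/code]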
>>
>>101444001
>>Textgenwebui Backend GGUF:
>>WizardLM2 8x22B Exl2 @ 4.0bpw 23.2t/s

I meant WizardLM2 8x22B GGUF @ Q4_K_M 23.2t/s
>>
>>101444001
Can you also provide results for prompt processing?
>>
>>101444001
>Is there any reason to still use exl2?
Unironically no.
>>
>>101444001
>those numbers
big if true
>>
>>101444001
exl2sisters... what went wrong?
>>
>>101444001
The llama.cpp server is too rough around the edges and tabbyapi is more polished. I would switch back to exllama if Gemma 2 worked as well as it does with llama.cpp.
>>
>>101444001
Nah, I hate exllamav2 with a passion now because they keep breaking their god damn pip package

switched to llama.cpp when they increased the speed and haven't looked back
>>
>>101443885
And there are still blacksmiths hand forging horseshoes.

What matters is if AI will completely collapse the manual art market, or if it'll be like music has been, with synths and DAWs rolling into the toolkit and letting new sounds enter the ecosystem and more artists offer their ideas.
>>
current foss models are barely better than chatgpt 3.5, seems like fossissies lost in the end.
>>
>>101444309
>with synths and DAWs rolling into the toolkit and letting new sounds enter the ecosystem and more artists offer their ideas.
For most commercial endeavors, I think that's what it will be.
Some market segments like commercials might end up as mostly AI generated at some point, however.
>>
https://x.com/RuoyuSun_UI/status/1813635251652227505
Big
>>
>>101444647
"gradually overlap" aka it's not as good
if it's more stable than SGD (and it certainly seems that way from the graph) it's definitely an option for people without a lot of memory aka local stuff
>>
Does ST not get token probabilities with Llama.cpp? I checked the box and went into the token probabilities tab but nothing is appearing in it.
>>
>>101444380
For many tasks gemma 27b feels just as good as GPT4-o
>>
>>101444083

gguf vs exl2 anon here

GGUF:

1st run: prompt eval time = 742.80 ms / 380 tokens ( 1.95 ms per token, 511.58 tokens per second)
generation eval time = 19797.38 ms / 500 runs ( 39.59 ms per token, 25.26 tokens per second)

2nd run: prompt eval time = 157.15 ms / 1 tokens ( 157.15 ms per token, 6.36 tokens per second)
generation eval time = 19793.48 ms / 500 runs ( 39.59 ms per token, 25.26 tokens per second)

(restarted llama.cpp server)
3rd run: prompt eval time = 642.11 ms / 380 tokens ( 1.69 ms per token, 591.80 tokens per second)
generation eval time = 19772.64 ms / 500 runs ( 39.55 ms per token, 25.29 tokens per second)

EXL2:

1st run: Metrics: 500 tokens generated in 21.56 seconds (Queue: 0.0 s, Process: 0 cached tokens and 380 new tokens at 315.16 T/s,
Generate: 24.57 T/s, Context: 380 tokens)

2nd swipe: Metrics: 500 tokens generated in 20.29 seconds (Queue: 0.0 s, Process: 379 cached tokens and 1 new tokens at 13.75 T/s,
Generate: 24.73 T/s, Context: 380 tokens)

(restarted tabby to avoid caching)
3rd swipe: Metrics: 500 tokens generated in 21.55 seconds (Queue: 0.0 s, Process: 0 cached tokens and 380 new tokens at 314.84 T/s,
Generate: 24.58 T/s, Context: 380 tokens)


So... llama.cpp is also faster in prompt processing. I have noticed it as the time to first token feels quicker in llama.cpp

Note: the llama.cpp server also loads the model faster. I don't know why; it seems to load into each 3090 in parallel, while exllama loads them in series and waits for each one to be filled.


Im a turboderp fanboy and until today I assumed that exl2 was always better. But it seems that llama.cpp has made a lot of progress.

>>101444182
is it?

by tabbyapi do you mean exllamav2? the tabbyapi author says in his readme that tabby is not production-ready and we should use aphrodite. But to be honest tabby+exllama has been rock solid for me.
>>
>>101444302
I haven't had any pip problem to be honest. I use python envs though. Just following the readme and using the whl works fine for me.
>>
>>101441409
what's your stance on SCALE
https://docs.scale-lang.com/
>>
>>101444771
I have a shared env for the experiments I run, and twice now installing a new package that updated the exllamav2 package broke it in a non-obvious way. They adjust how parameters are handled too often, and it quietly just breaks generation. Cue tens of hours finding out why my pipeline is randomly breaking, because sometimes it still outputs decent stuff.
>>
Is it advised to have character speak about their actions in first person instead of third?
Like instead of *{char} smiles* say *I smile*, but still in the *action* markup.
Would it solve the heavy narration bias of many models that tend to describe too much and talk too little?
>>
>>101444786
I will be happy to cooperate for wider hardware support.
But it doesn't fix the fundamental issue that GPU performance depends very strongly on hardware details.
So something like this will never be a replacement for e.g. a dedicated ROCm implementation.
Also compared to HIP it will not be possible to make informed decisions regarding which kernel (configurations) should be used when running on AMD.
>>
>>101443158
Marvelous, "grammar_string" doesn't work as well
ERROR: File "/home/petra/tabbyAPI/endpoints/OAI/utils/completion.py", line 135, in stream_generate_completion
ERROR: raise generation
ERROR: File "/home/petra/tabbyAPI/endpoints/OAI/utils/completion.py", line 87, in _stream_collector
ERROR: async for generation in new_generation:
ERROR: File "/home/petra/tabbyAPI/backends/exllamav2/model.py", line 1070, in generate_gen
ERROR: grammar_handler.add_ebnf_filter(grammar_string, self.model, self.tokenizer)
ERROR: File "/home/petra/tabbyAPI/backends/exllamav2/grammar.py", line 147, in add_ebnf_filter
ERROR: ebnf_filter = ExLlamaV2EbnfFilter(model, tokenizer, ebnf_string)
ERROR: File "/home/petra/tabbyAPI/backends/exllamav2/grammar.py", line 46, in __init__
ERROR: self.state = self.fsm.first_state
ERROR: AttributeError: 'CFGFSM' object has no attribute 'first_state'. Did you mean: 'final_state'?
ERROR: Sent to request: Completion aborted. Please check the server console.
>>
File: 1721166662887634.png (51 KB, 1482x342)
>gemma 27b feels just as good as GPT4-o
>>
>>101444738
>"gradually overlap" aka it's not as good
can you elaborate on that?
>>
>>101444743
such as?
>>
>>101444911
I just asked the character I'm talking to the exact same question (Q6_K), 1 try, she answered "No, that is incorrect. The element with atomic number 82 is lead. Uranium has the atomic number 92."
>>
>>101444986
Actually most tasks except code, because GPT4-o outputs longer code
>>
still trying out dry with good success but then i came across this leddit post explaining it more. turns out i should be using rep pen still, but i have it turned off atm to test dry specifically. going to leave it off a bit longer but i'll try with both.
>https://old.reddit.com/r/KoboldAI/comments/1e49vpt/dry_sampler_questionsthat_im_sure_most_of_us_are/
>>
>>101444425
>Some market segments like commercials might end up as mostly AI generated at some point
I think the space for that is going to be targeted advertising, which will be fully customized, generated-on-the-fly Skinner-box bait, far beyond the "HOT SINGLES IN ${YOUR} AREA" trash. Likely it'll be a "smart" Telescreen thing where it knows your family's viewing habits and AI gens tailored inserts that it can force you to look at because you didn't support Sceptre as the last television supplier.

For mass marketing I think it'll just be another tool in the box. There's that pizza ad that screams "AI assisted" from beginning to end. A shame that it's still not as cool as Pepperoni Hug Spot but it's a sign of the times when they sell pizza with an exploding head when formerly that was the reason Scanners became a meme.
>>
>>101445019
so I have to say "ahh ahh mistress" to make it smarter?
>>
>>101445084
yes, that's claudes secret
>>
>>101444756
Thanks! You've saved me a lot of time.
>>
>>101444971
It initially doesn't learn as fast, and it's really not guaranteed to ever catch up to base adam
It simply performs worse, though with the memory savings it might be worth it in some cases
>>
>>101445147
So I guess that's something that could be tried first as a cheap method, and if it's not successful you fall back to regular Adam. I see.
>>
>>101443158
Fuck, I solved it! It seems that exllamav2 always expects an object, so grammar like this works:
{
  "type": "object",
  "properties": {
    "result": {
      "type": "string",
      "enum": ["Yes!", "No?", "Maybe..."]
    }
  },
  "required": ["result"]
}
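For anyone else fighting this, the request body ends up looking roughly like this (sketch against tabby's completion endpoint; the port and header are the defaults as far as I know):
[code]
# sketch: tabbyAPI's OAI-style endpoint with the json_schema field from above
import requests

schema = {
    "type": "object",
    "properties": {
        "result": {"type": "string", "enum": ["Yes!", "No?", "Maybe..."]},
    },
    "required": ["result"],
}
r = requests.post(
    "http://localhost:5000/v1/completions",
    headers={"x-api-key": "<your tabby api key>"},
    json={"prompt": "Is the sky blue? Answer in JSON.\n", "max_tokens": 32,
          "json_schema": schema},
)
print(r.json()["choices"][0]["text"])
[/code]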
>>
who is petra?
>>
>>101444001
Now try vLLM.
>>
>try Lunaris, it's shit
>try Stheno, it's shit
>try Gemma, it's shit

I fell for the vramlet "good model" meme
>>
>>101445195
Sorry anon, I'm actually retarded. I just realized I read the graph backwards. Based on what they're saying it performs better, though I very much doubt that's true in general. Will wait for more people to try it out, but it might be good.
>>
>>101445267
A Sao fanboy/fangirl from the UK.
>>
who is sao?
>>
>>101445313
Petra's crush.
>>
>>101442725
the only thing you are plapping is your hand, anon
>>
>>101445295
that isn't a vramlet meme its 'i have a laptop with 4gb vram and 12gb ram' meme which covers most of the thirdworlders who post here. you can run 70b just fine with a good cpu and ram
>>
>>101445323
pic?
>>
>>101444866
Recommend 8-13B models for doing the actual chatting instead of narration. I'm kinda tired of every conversation sliding into the purple prose.
>>
i have 88GB of VRAM and i want to run either https://huggingface.co/OpenGVLab/InternVL2-40B or https://huggingface.co/facebook/chameleon-30b. how? they are both multimodal models, but i dont know how to run them. do they work just fine with oobabooga or do i need some sort of specialized backend?
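worst case my fallback plan was plain transformers, something like this (untested, and the actual image+text chat call is different for each model, so that part has to come from the model card):
[code]
# untested fallback sketch: load with plain transformers and let accelerate shard
# the bf16 weights across all the GPUs; the inference call isn't shown because
# each of these models ships its own chat/image preprocessing in its repo code
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2-40B"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    device_map="auto",   # needs accelerate; ~80 GB of bf16 weights over 88 GB of VRAM is tight
    trust_remote_code=True,
).eval()
[/code]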
>>
>>101442729
> The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation.

http://www.incompleteideas.net/IncIdeas/BitterLesson.html



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.