/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101431253 & >>101421477

►News
>(07/16) Codestral Mamba, tested up to 256k context: https://hf.co/mistralai/mamba-codestral-7B-v0.1
>(07/16) MathΣtral Instruct based on Mistral 7B: https://hf.co/mistralai/mathstral-7B-v0.1
>(07/13) Llama 3 405B coming July 23rd: https://x.com/steph_palazzolo/status/1811791968600576271
>(07/09) Anole, based on Chameleon, for interleaved image-text generation: https://hf.co/GAIR/Anole-7b-v0.1
>(07/07) Support for glm3 and glm4 merged into llama.cpp: https://github.com/ggerganov/llama.cpp/pull/8031

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
►Recent Highlights from the Previous Thread: >>101431253

--Paper: Mixture of A Million Experts: >>101431545 >>101431583 >>101431611 >>101431637 >>101432015 >>101432203 >>101432320 >>101431838 >>101431631 >>101431757 >>101432525
--Fine-tuning a Language Model to Generate Personalized Cover Letters: Seeking Recommendations and Exploring Alternatives: >>101436669 >>101436713 >>101436900
--AMD Promotes CPUmaxxing with EPYC Genoa: Outlined GPU Performance in LLM Tasks: >>101433552
--Uranium has 82 protons, but typos and sampler settings can confuse LLMs: >>101434145 >>101434215 >>101434236 >>101434265 >>101434332 >>101434423 >>101434467 >>101434789
--Question about PSU lines and GPU power requirements: >>101438498 >>101438516 >>101438541 >>101438675 >>101438873 >>101438790 >>101438988
--SCALE Programming Language and its support for llama.cpp: >>101432707 >>101432797
--Llama 3 Finetune Tops BFCL Leaderboard, But Are Function-Calling Models a Meme?: >>101434816 >>101434850 >>101434865 >>101434884
--Lack of Development Discussion and Frustration with LLM Dominance: >>101434645 >>101434734 >>101434859 >>101434939 >>101435320 >>101435452 >>101435571
--Miku (free space): >>101431341

►Recent Highlight Posts from the Previous Thread: >>101431260
Worship the Miku
>>101439126
>Uranium has 82 protons
Shrinkflation has finally reached the nuclear energy industry.
>>101439243
Yes.
Is there any chance to get decent results with an RTX 2070?
I tried some Llama 3 model yesterday and at most I could get one-liner replies. Escaping Claude's grasp isn't that easy, it seems. Didn't tinker with the settings in oobabooga yet, however.
>>101439327
no
What's the most affordable GPU for achieving 80+ T/s on Gemma-2 9b 5bpw in a headless server? Are there any AMD cards capable of doing it?
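For a rough sanity check on that 80 T/s target: single-user decode is mostly memory-bandwidth bound, so a back-of-envelope sketch (assuming every weight is re-read per generated token, ignoring KV cache and compute) gives the bandwidth floor a card would need:

```python
def required_bandwidth_gbs(params_b: float, bits_per_weight: float, tokens_per_s: float) -> float:
    """Lower bound on memory bandwidth: every weight byte read once per token."""
    model_gb = params_b * bits_per_weight / 8  # params_b is in billions, result in GB
    return model_gb * tokens_per_s

# Gemma-2 9B at 5 bits per weight, targeting 80 tokens/s
bw = required_bandwidth_gbs(9, 5, 80)
print(f"~{bw:.0f} GB/s of memory bandwidth needed")  # ~450 GB/s
```

That ~450 GB/s floor rules out most budget cards on bandwidth alone; compare it against a candidate card's spec sheet before buying. It also ignores prompt processing and KV cache reads, so treat it as optimistic.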
>>101439327
(Laughs in 2060 6GB)
>>101439327
say that you want long and detailed replies in the system prompt
>>101439355
Damn. What are you using?
>>101439368
gemma-2-9b-it.Q4_K and mixtral-8x7b-v0.1.Q4_K_M
Like >>101439356 said, the system prompt helps.
>>101439396 (me)
>>101439327
you have 8GB of VRAM, so you can fit an 8B model easily. Look for the smaller Gemma and Llama 3 versions and find a GGUF quant that you can run.
>>101431757
Weren't MoEs cheaper to train?
>>101431838
You could use a lot of RAM or even terabytes of storage on SSDs and inference would still be fast. CPUmaxx is the way to go.
Also, that recent scaling paper claimed MoEs need only 1.3x more parameters to contain the same amount of information as dense models.
>>101439429
Damn, not bad. Guess I need to figure out this shit some more. Thanks.
>>101439438
You think it would be possible to cram an 11b model in somehow?
>>101439448
This is the system prompt I use (among others). https://huggingface.co/datasets/ChuckMcSneed/various_RP_system_prompts/blob/main/unknown-simple-proxy-for-tavern.txt
i cannot fucking handle this youtube slop dude
im running the docker version of memgpt (fuckin wish i could use it with silly tavern instead) and all the tutorials for making it run with LLMs are for the CMD version instead of docker so im fuck outta luck
any advice? i tried putting kobold's api key in place of the openai key spot in the .env (formerly .env.example), which made it finally open memgpt.localhost, but then when I send a message it just thinks and then dies
I'll try and get a screenshot for ya one sec
>>101439027
Then recommend me a new model, faggit.
>Particularly, the synergy of BitNet b1.58 and Q-Sparse (can be equipped with MoE) provides the cornerstone and a clear path to revolutionize the efficiency, including cost and energy consumption, of future LLMs.
>>101439456
Yes, an 11B GGUF at Q4 will fit. A Q5 will depend on the context. Check this for reference:
https://huggingface.co/mradermacher/Fimbulvetr-11B-v2.1-16K-i1-GGUF
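A quick way to ballpark whether a quant fits in 8 GB, using approximate bits-per-weight figures for the K-quants (the 4.8 and 5.5 values below are rough community estimates, not exact):

```python
def quant_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate in-VRAM size of the weights alone, in GB (params_b in billions)."""
    return params_b * bits_per_weight / 8

for name, bits in [("Q4_K_M", 4.8), ("Q5_K_M", 5.5)]:
    size = quant_size_gb(11, bits)
    print(f"11B {name}: ~{size:.1f} GB weights, ~{8 - size:.1f} GB left on an 8 GB card for KV cache")
```

This matches the rule of thumb above: Q4 leaves room for context on 8 GB, while Q5 barely squeezes in and depends on how much context you allocate.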
>>101439471
top is what it's stuck doing, then boom, it stops thinking and i get nothing
the CMD is just what it looks like immediately after connecting to memgpt.localhost
i was just catching up on the last couple threads and noticed this appeared 2 threads back, didnt get any replies and didnt get caught in the recap that i can see, so gonna repost it

>>101426978
Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
https://arxiv.org/abs/2407.10969
>We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs). Q-Sparse enables full sparsity of activations in LLMs, which can bring significant efficiency gains in inference. This is achieved by applying top-K sparsification to the activations and the straight-through estimator to the training. The key results from this work are: (1) Q-Sparse can achieve results comparable to those of baseline LLMs while being much more efficient at inference time; (2) We present an inference-optimal scaling law for sparsely-activated LLMs; (3) Q-Sparse is effective in different settings, including training-from-scratch, continue-training of off-the-shelf LLMs, and finetuning; (4) Q-Sparse works for both full-precision and 1-bit LLMs (e.g., BitNet b1.58). Particularly, the synergy of BitNet b1.58 and Q-Sparse (can be equipped with MoE) provides the cornerstone and a clear path to revolutionize the efficiency, including cost and energy consumption, of future LLMs.

from the bitnet team. seems it didn't get posted here yet
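The top-K activation sparsification the abstract describes is simple to sketch. A numpy toy version of the forward pass (the straight-through estimator the paper mentions is a training-side trick: gradients flow through as if the hard top-K mask weren't there):

```python
import numpy as np

def top_k_sparsify(x: np.ndarray, k: int) -> np.ndarray:
    """Zero all but the k largest-magnitude activations (forward pass only)."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]  # indices of the k largest |x|
    out[idx] = x[idx]
    return out

x = np.array([0.1, -2.0, 0.3, 1.5, -0.05, 0.7])
sparse = top_k_sparsify(x, 2)
print(sparse)  # only -2.0 and 1.5 survive, the rest are zeroed
```

The efficiency claim follows directly: if only k of n activations are nonzero, the next matmul only needs the corresponding k columns of the weight matrix.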
>up to 15 characters made on local setup
>all of them are just various fluff around my poorly hidden breeding kink
i need new material, there's only so many ways i can spin scenarios before i have to get into weird shit
>>101439990
According to this paper, BitNet models have an optimal sparsity of about 60%, so the improvement is significant but not groundbreaking.
MoE models can also be considered "sparse", and if you can have a model with a million experts like that other Google paper claims, with the optimal number of active experts being in the hundreds, inference can be **orders of magnitude** faster than with current models or any future BitNet model.
>>101440013
All roads lead to Rome. "New material" is either just different preludes to the same kink moment, or you need to find in you a different kink to be drawn toward.
The end result—squirt squirt—is the same no matter how you get there. Unless you change that intention, it's all just variations on a theme.
So why not ask your LLM to come up with some new material for you? Literally, at the end of a role play ask it for new ideas. If your context is large enough for it to see what you've done, it might come up with some neat new stuff.
>>101439990
>>101440064
If Meta wants to safety themselves into irrelevance, that's their prerogative. Waste of their H100 farm, but the Qwen team has already shown interest in BitNet.
>>101439990
interestingly they say in here that off-the-shelf LLMs can be "continue-trained" to make use of Q-Sparse
>>101440043
i don't even really see it as a "kink", and i hate that word anyway
i've always been a boring vanillafag
i might try your suggestion though
>>101440134
this was for mistral-7b
>>101440187
They mean the risk of instability in training runs and wasting millions on experiments that will only be good for proving that mamba/SSMs fail to scale. Not safetyism risk.
>>101440136
>i don't even really see it as a "kink"
It was your word choice, so maybe you do.
>i've always been a boring vanillafag
I could describe myself the same way, but I know where my kinks lie, and the LLM has been an interesting way to see exactly which details of their topics "work" and which don't.
But definitely let the LLM try things. Some of the most interesting RPs I've had were by playing through a part that was a "nah" and then it went someplace I would never have thought of. And then it's hot till the context fills, causing sudden derp and collapse and sadness.
>>101440187
Semantics. Meta has had over a year and failed to innovate at all.
https://www.semianalysis.com/p/gb200-hardware-architecture-and-component
neat
>>101440110
>If Meta wants to safety themselves into irrelevance, that's their prerogative. Waste of their H100 farm, but the Qwen team has already shown interest in BitNet.
I think the first non-meme BitNet model will be from the Mistral team; they're the only ones making non-transformer models (they made a MoE model and a mamba model)
Sup /lmg/. I'm looking for an open source project that allows me to have a local server with:
1- An OpenAI-compatible API
2- Multi-user or multi-request support. Ideally it won't run them in parallel, but will queue them (i.e. at any time it will be running inference for only one request, but won't crash or reject requests while running inference)
3- Support for multiple models, but without loading more than one at any single time.
4- Ideally, it would seamlessly "hotswap" models as requested (if a new request needs a different model, it will automatically unload the current model and load the new one).
Llama-cpp-python allows for all of the above except #2. I want to have a single LLM server and use it for multiple client apps.
Pic unrelated.
>>101440043
How do you ask for that stuff?
>>101440043
at some point I think anons will need to do their own preference optimization dataset based on their fetishes to then finetune models with. maybe make a google form or something to then click through that generates the dataset when you're done
>>101440611
ollama can do all that. But if you want really good multi-user shit, you have to use vllm.
>>101440724
Pursuant to >>101440013, I tried:
>Someone on 4chan laments that he's made 15 character definitions for his LLM to role play as, but they all play into his breeding kink so narrowly that they're becoming repetitive and requiring him to get into "weird shit" to make them interesting. Please list seven kinds of characters that his LLM could role play as that offer something interesting to explore while still ultimately becoming a scenario where his character will begin producing offspring with his LLM's character. Consider all kinds and genres of fiction for ideas of what kinds of people or things these role play partner characters could be.
Removing the explanations to save space in one post, it offered:
1. Space Colonist
2. Ancient God/Goddess
3. Time Traveler
4. Shapeshifter
5. AI Entity
6. Nature Spirit
7. Alien Hybrid
Those sound like good ways to make both the run-up to the kink and the consequences of the kink have something fresh to offer.

>>101440750
>their own preference optimization dataset based on their fetishes
That sounds like the road to boredom. If you move toward what you know you want, you'll get the same things again and again, as >>101440013 complained about. Guide the AI away from your no-gos and dealbreakers, and move toward things you don't have much of an opinion of, and you'll be able to find new things that you didn't know you would like. And it'll be trivial to work in a personal kink on the fly when it's wanted.
>Added --unpack, a new self-extraction feature that allows KoboldCpp binary releases to be unpacked into an empty directory. This allows easy modification and access to the files and contents embedded inside the PyInstaller. Can also be used in the GUI launcher.
Holy heckin based
What do you do after shooting your load in rp context?
Close chat and start a new one?
>>101440611
>>101440898
Forgot to add...
* Partial GPU/CPU offloading
you're going to be able to make videos that look real
>>101441368
people be questionin the legitimacy of livestreamed, realtime footage
if that can be fake then there's no such thing as real
>>101441108
If you need CPU+GPU hybrid inference, llama.cpp/ggml is your only choice in terms of backend.
The llama.cpp HTTP server has an OAI-compatible API and will queue requests by default, but model hot-swapping is not implemented.
Ooba I think lets you load/unload models via the API, but I don't know if it's OAI-compatible.
So are we just never going to get an HF version of mamba codestral that works outside of Mistral's shitty basic bitch backend?
>>101441409
ollama queues by default and can do parallel requests, can easily swap models via API, and is OAI-compatible.
>>101441474
>mamba
>works
>>101441409
>The llama.cpp HTTP server has an OAI-compatible API
Last I checked, only the chat API is OAI-compatible; the regular text completion isn't.
>>101441512
mamba is a supported model type since like transformers 4.39
Why shouldn't it work?
>>101441511
ollama just runs the llama.cpp server in the background, petra.
>>101441601
okay
use it
>>101441654
Well I can't, because I'm not going to install Mistral's or Mamba's shitty inference packages, since neither of them seems to have an API, which renders them utterly fucking useless.
>>101441679
well then it doesn't work, does it
There's been websim for a few months; it seems quite popular now. I never cared much to try it. Is it actually great? And is it or something similar open source? The official website wants me to log in with Google. Can you run it with a local model?
so I installed the ollama cli and ran llama3
its pretty cool but i hoped /save sessionName would save the chat so I can come back to it next time with llama remembering what was told, but apparently it is not the case
how to save/load chat history so i can continue the conversation?
>>101442061
Wrong site: reddit.com/r/LocalLLaMA/
>>101441934
For once, doomer Anon, on this one exact topic, you are correct. It is unironically over and there's absolutely no hope.
>>101442109
sorry, I thought this was local models general
>>101442219
ollama is not beloved
Anybody know if the whole slot system from llama-server works with Silly, or would I need to change the way Silly calls the API to specify a slot or something of the sort?
You can have different prompt caches per slot, right? That would be pretty cool when switching between cards or even when using things like the summary extension.
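If Silly doesn't expose it, the request-side change is small. A hypothetical sketch of pinning a chat to one llama-server slot so its prompt cache survives across requests; the field names (`cache_prompt`, `id_slot`) are taken from the llama.cpp server docs at time of writing, so verify them against your build before relying on this:

```python
import json

def build_completion_request(prompt: str, slot: int) -> str:
    """Build a /completion payload that pins the request to one server slot."""
    payload = {
        "prompt": prompt,
        "n_predict": 256,
        "cache_prompt": True,  # reuse this slot's cached prefix across requests
        "id_slot": slot,       # pin the conversation to a specific slot's cache
    }
    return json.dumps(payload)

# e.g. card A always goes to slot 0, card B to slot 1
req = build_completion_request("### Card A system prompt...", slot=0)
print(json.loads(req)["id_slot"])  # 0
```

The idea is that each card keeps its own slot, so switching cards doesn't evict the other card's processed prompt.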
>>101440202
>>101440187
I think it was too early to get their clusters fully online, and they're still in the process of acquiring and building more. They began training and setting in stone what Llama 3 was going to be earlier than some of the currently hyped research could prove scalable. Llama 4 is probably going to be the one with a more unique architecture.
However, it is pretty normal that a big corporation lags behind startups when it comes to smaller-scale, faster-paced rollouts of technology. Their advantage is that when they do roll out a new product, it's done with more money. That doesn't always equal a better product. In the case of LLMs it means they can spend a lot of time training the model on like 15T tokens, or they can train a 400B dense model, or whatever. Startups may come out with a BitNet or Jamba or whatever sooner, but then it'll take a megacorp to produce a BitNet or Jamba with 15T tokens pumped into one, or a 400B, or etc.
>>101442061
Maybe ollama has updated and fixed this since I used it a month or two ago, but I found the /save feature to be total ass. The parser was busted and many character sequences would kill it, causing it to save only the first few turns of the conversation.
I could avoid them on my side (for example, NEVER end a turn with "). A space or spare period after would be fine, but if it ended on a quote mark from dialog, that's the end of the save. Of course, the AI could write killing sequences too, so even if I'm careful it's a doomed chat.
I roll Kobold now. Much easier to adjust settings now that I know them, I don't have to play Ollama's silly JSON and renamed-file game, and I can use (almost) all of the models.
Ollama is a great introduction, but the instant you want to do more, step up to a better wrapper.
>>101442269
so they allow training on the output? If so, that would probably be great for Mistral and similar
>>101439126
Isn't Mixture of a Million Experts basically Marvin Minsky's Society of Mind? Also with the fact that it could theoretically have lifelong learning...
Bros, we're actually going to get lifelike ai gfs in our lifetimes, aren't we?
so can anyone actually even test the 405b? by maybe renting some gpus online or something?
>>101442432
I believe someone from huggingface said they'd host it (though I'm quite confused that the hugging.chat models are always broken)
>>101442286
thank you for an actual answer
kobold seems interesting, but i wanted something working in the terminal for now
I guess I will try some ollama terminal clients, I see there are quite a few
can anyone recommend a specific one?
>>101442432
If any providers bother with it, it will probably be on OpenRouter, but at $2 a gen I imagine
>>101442443
I too enjoyed that Ollama was terminal-based at first, and it was something I wanted when getting started. But then I wanted to change settings without curling one-line JSON strings, didn't like the multi-line text bug, the inability to save, etc., so I switched to Kobold. Now it's trivial for me to reach it from my phone when I step away from my computer, I have access to important settings, state saving works correctly, and easy access to editing the document to fix errors or reroll a response is very nice to have.
>>101439447
>Weren't MoEs cheaper to train?
Yeah, generally. That wasn't necessarily true for this "distributed" FFW method, though looking at the paper it should be more flop-efficient. I'm actually warming up to the concept as I dig into it a bit more; there might be something really good here.
This paper alone isn't enough to sell the idea, but I think a few minor improvements and suddenly this becomes the new training paradigm.
>>101442219
Yeah, this is not the ollama tech support general. Go to Reddit to shill your scam.
state space 1 million expert bitnet 1T parameter model when?
>>101442599
Don't forget with JEPA and native multimodal and multitoken prediction.
>>101442599
Bitnet probably wouldn't work well with experts on the order of thousands of parameters in size.
>>101442599
we need to return to good old lstms
mambeleonbytejepabitnetMoaME 600B when
>>101442599
with fill-in-the-middle support
>in my system prompt I give the narrator a personality, since that decreases slop
>after plapping the character from the card, ask what the narrator thinks so far using OOC
>she's horny
>propose her, herself joining in on the fun
>now I get to plap the original character as well as the narrator turned into a character
Let's fucking go.
llama.cpp has support for some NPUs now.
>>101442621
Wouldn't it be hilarious if the next step is to go back but with a couple of adjustments and at a bigger scale?
It's not even absurd to think that, since that kind of thing happens all the time.
>>101442697
I'll type the weights by hand
https://github.com/ggerganov/llama.cpp/pull/8543
>Add support for Chameleon #8543
that one anon will be happy
>For now, this implementation only supports text->text inference and serves as base to implement the (more interesting) image->text, text->image and interleaved pipelines. However, such an implementation will probably require some changes to the CLI and internal architecture, so I suggest to do this in a separate PR.
oh...
>>101442599
right after you hang yourself.
>>101442729
Interesting. Hopefully now we can get some benchmarks of how those new laptops do with LLMs.
>>101442729
isn't there any standard NPU api?
>>101442754
they've still got poor memory bandwidth compared to low end nvidia gpus
>>101442792
Doubt it.
It's probably the same situation as GPUs, where each vendor has its own computing architecture with its own APIs.
>>101442804
Well yeah. But low-power windoze/linux laptops that can run LLMs are still pretty good to have exist. It would also be good to know how the LPDDR5X in them performs, and whether it or the compute is the bottleneck in these machines. We could then extrapolate to think about how a desktop with a similar chip could perform paired with a separate GPU.
Why don't they just add matrix multiplication to slide rules?
>>101440064
Meta became irrelevant when they started to filter their pre-training dataset with llamaguard. Claude shows that a diverse dataset with no regard for safety produces great results. A model that could not learn anything 'unsafe' because the data it was fed was designed to be 99% safe will never be good.
>>101443047
hi petra
>>101439308
Disregarding nuclear safety regulations with Miku
How to restrain the model's output in tabbyapi? I tried
"json_schema": {"type": "string", "enum": ["Yes","No","Maybe"]}
but I keep getting errors:
ERROR: ExLlamaV2Sampler.sample(
ERROR: File "/home/petra/tabbyAPI/venv/lib/python3.10/site-packages/exllamav2/generator/sampler.py", line 247, in sample
ERROR: assert pass_tokens, "Filter excluded all tokens"
ERROR: AssertionError: Filter excluded all tokens
ERROR: Sent to request: Completion aborted. Maybe the model was unloaded? Please check the server console.
>>101443072
Are you retarded or just schizophrenic?
>>101443047
This. This is why Command-R is good, and partially what makes WLM 8x22b wayyy better than 8x22b instruct. (Although I think WLM just has some other interesting continued-pretraining and finetuning techniques up their sleeves iirc)
>>101443222
When will this meme die?
>>101443244
if you mean the huggingface leaderboard meme, whenever you stop posting it
>>101443222
>what makes WLM 8x22b wayyy better
When are the mikufags going to drop this meme? How is this related to pretraining when it's a finetune done with GPT-4 outputs?
>>101443222
They have a paper out for Wizard. You can replicate it if you have the compute and money.
They ran an offline arena and basically trained on the best outputs from that, over and over again. SFT, DPO, PPO I believe? Reward-model type shit.
>>101443349
Basically.
https://x.com/victorsungo/status/1811427047341776947?t=k7ZXwSCRnYKBW_7Rj0_q6w&s=19
Duality of /lmg/
>>101443244
>Aktually according to the HF leaderboard...
>t. Never ran Mistral 8x22 or Wiz 8x22 for himself
>>101443503
>Yes with a 5% lead
We are so back bros
>>101443158
good image
>>101443047
buy an ad
>>101439122
>>(07/13) Llama 3 405B coming July 23rd: https://x.com/steph_palazzolo/status/1811791968600576271
what am I gonna need to run this?
Will we even need to make videos anymore if we can just generate them?
>>101443763
There are artists still making art even with AI-generated art. There are still writers despite AI being able to write. There will still be filmmakers despite AI being able to make films. Some people just like to create.
>>101443724
At least a raspberry pi
>>101443763
>t. the least illiterate AI bro
I have been doing some testing between GGUF and EXL2. I have always preferred exl2 as it was faster and had Q4 cache for KV. But the speed difference seems to have vanished, and llama.cpp supports KV caching too now.
All results averaged over 3 runs, using SillyTavern as the front end, with FA and Q4 cache in both exllama (mandatory in tabbyapi) and llama.cpp:

TabbyAPI backend (Exllamav2 0.1.7):
WizardLM2 8x22B Exl2 @ 4.0bpw: 24.1 t/s
Llama.cpp backend (pulled 2 hours ago):
WizardLM2 8x22B GGUF @ Q4_K_M: 25.2 t/s
Textgenwebui backend, Exl2:
WizardLM2 8x22B Exl2 @ 4.0bpw: 22.1 t/s
Textgenwebui backend, GGUF:
WizardLM2 8x22B GGUF @ Q4_K_M: 23.2 t/s

Given llama.cpp's broader support and thus better compatibility with devices and faster compatibility with new models: is there any reason to still use exl2?
System:
4x 3090s at 250W max
EPYC 7402
>>101444001
>>Textgenwebui Backend GGUF:
>>WizardLM2 8x22B Exl2 @ 4.0bpw 23.2t/s
I meant WizardLM2 8x22B @ Q4_K_M 23.2t/s
>>101444001
Can you also provide results for prompt processing?
>>101444001
>Is there any reason to still use exl2?
Unironically no.
>>101444001
>those numbers
big if true
>>101444001
exl2sisters... what went wrong?
>>101444001
The llama.cpp server is too rough around the edges and tabbyapi is more polished. I would switch back to exllama if Gemma 2 worked as well as it does with llama.cpp.
>>101444001
Nah, I hate exllamav2 with a passion now because they keep breaking their god damn pip package
switched to llama.cpp when they increased the speed and haven't looked back
>>101443885
And there are still blacksmiths hand-forging horseshoes.
What matters is whether AI will completely collapse the manual art market, or whether it'll be like music has been, with synths and DAWs rolling into the toolkit, letting new sounds enter the ecosystem and more artists offer their ideas.
current foss models are barely better than chatgpt 3.5, seems like fossissies lost in the end.
>>101444309
>with synths and DAWs rolling into the toolkit and letting new sounds enter the ecosystem and more artists offer their ideas.
For most commercial endeavors, I think that's what it will be. Some market segments like commercials might end up as mostly AI-generated at some point, however.
https://x.com/RuoyuSun_UI/status/1813635251652227505
Big
>>101444647
"gradually overlap" aka it's not as good
if it's more stable than SGD (and it certainly seems that way from the graph) it's definitely an option for people without a lot of memory, aka local stuff
Does ST not get token probabilities with Llama.cpp? I checked the box and went into the token probabilities tab but nothing is appearing in it.
>>101444380
For many tasks gemma 27b feels just as good as GPT4-o
>>101444083
gguf vs exl2 anon here
GGUF:
1st run: prompt eval time = 742.80 ms / 380 tokens (1.95 ms per token, 511.58 tokens per second); generation eval time = 19797.38 ms / 500 runs (39.59 ms per token, 25.26 tokens per second)
2nd run: prompt eval time = 157.15 ms / 1 token (157.15 ms per token, 6.36 tokens per second); generation eval time = 19793.48 ms / 500 runs (39.59 ms per token, 25.26 tokens per second)
(restarted llama.cpp server)
3rd run: prompt eval time = 642.11 ms / 380 tokens (1.69 ms per token, 591.80 tokens per second); generation eval time = 19772.64 ms / 500 runs (39.55 ms per token, 25.29 tokens per second)
EXL2:
1st run: 500 tokens generated in 21.56 seconds (Queue: 0.0 s, Process: 0 cached tokens and 380 new tokens at 315.16 T/s, Generate: 24.57 T/s, Context: 380 tokens)
2nd swipe: 500 tokens generated in 20.29 seconds (Queue: 0.0 s, Process: 379 cached tokens and 1 new token at 13.75 T/s, Generate: 24.73 T/s, Context: 380 tokens)
(restarted tabby to avoid caching)
3rd swipe: 500 tokens generated in 21.55 seconds (Queue: 0.0 s, Process: 0 cached tokens and 380 new tokens at 314.84 T/s, Generate: 24.58 T/s, Context: 380 tokens)

So... llama.cpp is also faster in prompt processing. I have noticed it, as the time to first token feels quicker in llama.cpp.
Note: the llama.cpp server also loads the model faster. I don't know why; it seems to load the model in parallel on each 3090, while exllama loads it in series and waits for each one to be filled.
I'm a turboderp fanboy and until today I assumed that exl2 was always better. But it seems that llama.cpp has made a lot of progress.

>>101444182
is it? by tabbyapi do you mean exllamav2? the tabbyapi author says in his readme that tabby is not production-ready and we should use aphrodite. But to be honest tabby+exllama has been rock solid for me.
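The two backends report generation speed in different units (llama.cpp in ms per token, exllama in T/s), so here's the conversion used to compare them, checked against the llama.cpp numbers above:

```python
def ms_per_token_to_tps(ms_per_token: float) -> float:
    """Convert llama.cpp's ms-per-token figure to tokens per second."""
    return 1000.0 / ms_per_token

# 39.59 ms/token from the generation eval lines above
print(round(ms_per_token_to_tps(39.59), 2))  # 25.26, matching the reported t/s
```

Same conversion works in reverse for prompt processing, where llama.cpp's ~512-592 t/s vs exllama's ~315 T/s is the bigger gap.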
>>101444302
I haven't had any pip problems, to be honest. I use python envs though. Just following the readme and using the whl works fine for me.
>>101441409
what's your stance on SCALE
https://docs.scale-lang.com/
>>101444771
I have a shared env for the experiments I run, and twice now installing a new package that updated the exllamav2 package broke it in a non-obvious way. They adjust how parameters are handled too often, and it quietly just breaks generation. Cue tens of hours finding out why my pipeline is randomly breaking, because sometimes it still outputs decent stuff.
Is it advised to have the character speak about their actions in first person instead of third?
Like instead of *{char} smiles*, say *I smile*, but still in the *action* markup.
Would it solve the heavy narration bias of many models that tend to describe too much and talk too little?
>>101444786
I will be happy to cooperate for wider hardware support.
But it doesn't fix the fundamental issue that GPU performance depends very strongly on hardware details. So something like this will never be a replacement for e.g. a dedicated ROCm implementation.
Also, compared to HIP it will not be possible to make informed decisions regarding which kernel (configurations) should be used when running on AMD.
>>101443158
Marvelous, "grammar_string" doesn't work either:
ERROR: File "/home/petra/tabbyAPI/endpoints/OAI/utils/completion.py", line 135, in stream_generate_completion
ERROR: raise generation
ERROR: File "/home/petra/tabbyAPI/endpoints/OAI/utils/completion.py", line 87, in _stream_collector
ERROR: async for generation in new_generation:
ERROR: File "/home/petra/tabbyAPI/backends/exllamav2/model.py", line 1070, in generate_gen
ERROR: grammar_handler.add_ebnf_filter(grammar_string, self.model, self.tokenizer)
ERROR: File "/home/petra/tabbyAPI/backends/exllamav2/grammar.py", line 147, in add_ebnf_filter
ERROR: ebnf_filter = ExLlamaV2EbnfFilter(model, tokenizer, ebnf_string)
ERROR: File "/home/petra/tabbyAPI/backends/exllamav2/grammar.py", line 46, in __init__
ERROR: self.state = self.fsm.first_state
ERROR: AttributeError: 'CFGFSM' object has no attribute 'first_state'. Did you mean: 'final_state'?
ERROR: Sent to request: Completion aborted. Please check the server console.
>gemma 27b feels just as good as GPT4-o
>>101444738
>"gradually overlap" aka it's not as good
can you elaborate on that?
>>101444743
such as?
>>101444911
I just asked the character I'm talking to the exact same question (Q6_K), 1 try; she answered "No, that is incorrect. The element with atomic number 82 is lead. Uranium has the atomic number 92."
>>101444986
Actually most tasks except code, because GPT4-o outputs longer code
still trying out dry with good success, but then i came across this leddit post explaining it more. turns out i should be using rep pen still, but i have it turned off atm to test dry specifically. going to leave it off a bit longer but i'll try with both.
>https://old.reddit.com/r/KoboldAI/comments/1e49vpt/dry_sampler_questionsthat_im_sure_most_of_us_are/
>>101444425
>Some market segments like commercials might end up as mostly AI generated at some point
I think the space for that is going to be targeted advertising, which will be fully customized, generated-on-the-fly baited Skinner boxes, far beyond the "HOT SINGLES IN ${YOUR} AREA" trash. Likely it'll be a "smart" Telescreen thing where it knows your family's viewing habits and AI-gens tailored inserts that it can force you to look at because you didn't support Sceptre as the last television supplier.
For mass marketing I think it'll just be another tool in the box. There's that pizza ad that screams "AI assisted" from beginning to end. A shame that it's still not as cool as Pepperoni Hug Spot, but it's a sign of the times when they sell pizza with an exploding head, when formerly that was the reason Scanners became a meme.
>>101445019
so I have to say "ahh ahh mistress" to make it smarter?
>>101445084
yes, that's claude's secret
>>101444756
Thanks! You've saved me a lot of time.
>>101444971
It initially doesn't learn as fast, and it's really not guaranteed to ever catch up to base Adam. It simply performs worse, though with the memory savings it might be worth it in some cases.
>>101445147
So I guess that's something that could be used first as a cheap method, and if it's not successful then going for regular Adam, I see
>>101443158
Fuck, I solved it! It seems that exllamav2 always expects an object, so grammar like this works:
{
  "type": "object",
  "properties": {
    "result": {
      "type": "string",
      "enum": ["Yes!", "No?", "Maybe..."]
    }
  },
  "required": ["result"]
}
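The workaround generalizes: wrap any bare string enum in a top-level object before handing it to the filter. A sketch of that as a reusable helper (the "result" property name is just a convention from the post above, not anything exllamav2 requires by name):

```python
import json

def wrap_enum_schema(choices):
    """Wrap a bare string enum in the top-level object the exllamav2 filter expects."""
    return {
        "type": "object",
        "properties": {
            "result": {"type": "string", "enum": list(choices)},
        },
        "required": ["result"],
    }

schema = wrap_enum_schema(["Yes!", "No?", "Maybe..."])
print(json.dumps(schema, indent=2))
```

The model then answers as {"result": "Yes!"} instead of a bare string, so you read the choice back out of the "result" field.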
who is petra?
>>101444001
Now try vLLM.
>try Lunaris, it's shit
>try Stheno, it's shit
>try Gemma, it's shit
I fell for the vramlet "good model" meme
>>101445195
Sorry anon, I'm actually retarded. I just realized I read the graph backwards. Based on what they're saying it performs better, though I very much doubt that's true in general. Will wait for more people to try it out, but it might be good.
>>101445267
A Sao fanboy/fangirl from the UK.
who is sao?
>>101445313
Petra's crush.
>>101442725
the only thing you are plapping is your hand, anon
>>101445295
that isn't a vramlet meme, it's an 'i have a laptop with 4gb vram and 12gb ram' meme, which covers most of the thirdworlders who post here. you can run 70b just fine with a good cpu and ram
>>101445323
pic?
>>101444866
Recommend 8-13B models for doing the actual chatting instead of narration. I'm kinda tired of every conversation sliding into purple prose.
i have 88GB of VRAM and i want to run either https://huggingface.co/OpenGVLab/InternVL2-40B or https://huggingface.co/facebook/chameleon-30b. how? they are both multimodal models, but i dont know how to run them. do they work just fine with oobabooga or do i need some sort of specialized backend?
>>101442729
>The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation.
http://www.incompleteideas.net/IncIdeas/BitterLesson.html