/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>102049023 & >>102036232

►News
>(08/22) Jamba 1.5: 52B & 398B MoE: https://hf.co/collections/ai21labs/jamba-15-66c44befa474a917fcf55251
>(08/20) Microsoft's Phi-3.5 released: mini+MoE+vision: https://hf.co/microsoft/Phi-3.5-MoE-instruct
>(08/16) MiniCPM-V-2.6 support merged: https://github.com/ggerganov/llama.cpp/pull/8967
>(08/15) Hermes 3 released, full finetunes of Llama 3.1 base models: https://hf.co/collections/NousResearch/hermes-3-66bd6c01399b14b08fe335ea
>(08/12) Falcon Mamba 7B model from TII UAE: https://hf.co/tiiuae/falcon-mamba-7b

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench
Japanese: https://hf.co/datasets/lmg-anon/vntl-leaderboard
Programming: https://hf.co/spaces/mike-ravkine/can-ai-code-results

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
►Recent Highlights from the Previous Thread: >>102049023

--Q4 model capabilities and limitations discussed: >>102049767 >>102049830 >>102049845 >>102049859 >>102049892 >>102049941 >>102049991 >>102049995
--Planning a collaborative storytelling/RP session with AI models: >>102049428 >>102049969 >>102050021
--GGML tensor conversion and type casting: >>102053861 >>102053954 >>102054117 >>102055161
--Anon finds NovelCrafter and shares offline version: >>102055930 >>102055977 >>102055998 >>102056259
--InternVL2's image understanding capabilities debated: >>102054440 >>102054459 >>102054478 >>102054603
--Used 3090 recommended for 8B models: >>102053330 >>102053331 >>102053595 >>102054210 >>102054098 >>102054114 >>102056386 >>102056454 >>102056646
--Tips for improving Jamba 1.5 Mini chatbot's story progression and output length: >>102049810 >>102049833
--Stable-Diffusion.cpp now supports Flux, with reported 2.5x speedup on Vulkan: >>102056617 >>102056880
--Open source models are not being heavily censored, unlike proprietary ones: >>102051980 >>102053284 >>102053310 >>102053490
--No hype for llama4: >>102057438 >>102057471 >>102057474 >>102057532 >>102057560
--Llama 3.1 supports function calling, but users aren't utilizing it: >>102049113 >>102049129 >>102049233 >>102053085
--Grok and Chatbot Arena leaderboard: >>102053978
--Anon tries to improve AI-generated erotic writing: >>102055537 >>102055766 >>102055902 >>102055966 >>102055994 >>102056988 >>102057089 >>102057148 >>102057215 >>102057392 >>102057172
--Anon gets roasted for not providing context, and LLM limitations are discussed: >>102053008 >>102053077 >>102053139 >>102053240 >>102053305
--Anon discovers strange eye bias in Mistral Large conversations: >>102049135 >>102051979 >>102057865 >>102057937 >>102057994
--Anon asks for help with Nemo repetition, gets parameter adjustment advice: >>102052531 >>102052585
--Miku (free space): >>102049963 >>102050384

►Recent Highlight Posts from the Previous Thread: >>102049032
Happy Strawberry Weekend, friends! See you Monday ;)
rin, but it's actually len, who forgot it was laundry day and has nothing to wear but his sister's clothes
hey, where do I get quantized llama 3.1 70B to use with llama.cpp and gpu layers? last model I was using was llama-2-ggml-q5_K_M from TheBloke I think. Am I looking for GGUF now or GPTQ? unless there's something local that's 'smarter' than llama 3.1
thanks for help friends
>>102058965
If you're using llama.cpp, you need gguf.
>unless there's something local that's 'smarter' than llama 3.1
There isn't.

>>102058885
>>102056617
>it only takes ~10m to generate a 20 step 512x512 image.
What? It takes me 5 min to generate that with CPU only
Also using only 6 steps looks basically the same to me with flux
Just baked this one to check the time with 20 steps

>>102058880
>Jamba 1.5: 52B
/g/erdict?
>XTC
/g/erdict?

>>102059147
You likely have better RAM + CPU than he does.

>>102056617
>>102059147
Wtf, why? On GPU this takes literally 12 seconds, or 8 seconds for the main diffusion process.
happy pride month lmsys and sam
>>102058965
>unless there's something local that's 'smarter' than llama 3.1
At the top end with 405b no, but if you're targeting 70B Q5, you can probably get away with Mistral Large Q4, which would likely outperform it and just be a bit slower.
GGUF is the file format you're looking for, whichever model you end up choosing.

>>102059325
What about the quality and number of steps?
How many steps do you recommend?
I'll get a 3060 maybe soon
>>102059409
grok won

>>102059456
I wonder if it's MoE like Grok 1 was. It'll probably be irrelevant when he open sources it in half a fucking year anyway, so who cares.

>>102059160
Waiting for llama.cpp support.
have any of you actually run llama 405b? after seeing how much of a slop 70b was i have a hard time believing it'd get this much better, since i remember hearing something about diminishing returns with increasing model size
>>102059635
>i remember hearing something about diminishing returns with increasing model size
That was always a cope. See: every frontier model that exists right now.

>>102059424
Those numbers were for the res and steps you guys were testing. Generally though it's recommended to use 1024x1024. 20 steps is OK if you're just looking to see what a seed generally feels like, but it'll more often miss things from your prompt: Miku will more often have pink eyes, be missing her hair ties, etc. 30+ steps is recommended.

>>102059635
I did, it's pretty good for some tasks. But it's 100% slop.

>>102059635
The instruct tune is pure slop. Any semblance of creativity and interesting prose has been lobotomized out of it. But it's smart slop, no denying that. It's the best local there is for e.g. keeping track of details in long stories, not making obvious continuity errors with character states/positions/etc.

>>102059698
I see. I thought steps were only related to image quality.
Nemo seems dumber than Mixtral, but a more naturalistic speaker. Is this what others are experiencing as well, or am I dumb?
>>102059707
>keeping track of details in long stories
I wonder if Jamba changes that now. The model itself isn't very smart for its size (70b tier at almost 400b weight), but apparently its architecture can handle long contexts better in both accuracy and speed.
Actually, I've been waiting to see Llama 3.1 405b's RULER benchmark score since they haven't tested it on their github yet, but I just noticed that the Jamba team DID test it and it was good for the full 128k, making it the only local transformer model that is. Llama 3.1 70b was accurate at up to 64k context.
(However, the Gemini entry here is basically a lie: they used the benchmark's reported value for it, but it was never tested past 128k at all, since at the time that was already far above what anyone else had reached. Anecdotally, Gemini seems to hold onto its accuracy well into the 1M+ range, making it better than any other model for long contexts by far.)

>>102059766
At very low steps it does have a large impact on the quality of the image, but once you get to 20, it's more about prompt following.

>>102059970
They should have templates for SD, right?

>>102059970
runpod isn't local, go away

>>102059970
First stop being gay

>>102059948
Wait, you're telling me Gemini doesn't have real 2M context? Wasn't that supposed to be their entire thing, that they have epic context size? So it was all marketing? And here I thought they at least had some small moat. So they literally have none. Kek.

>>102059948
>4o that low
Oh no no no

>>102060032
The opposite: I'm saying the chart is lying for Gemini, and its full context hasn't been tested by the same standard as the other models (yet).

>>102060032
>>102060078
>>102059948
Yeah, nvm, didn't read your actual post. So they measured a few and pulled the rest from existing numbers?

>>102059409
>mogged zuck
>grok 3 by the end of the year, said to train on 100k H100s, vastly more than any model so far
what is Meta doing?
{{user}}-name:Cock{{user}}-gender:male{{user}}-orientation:heterosexual{{user}}-height:190 centimeters{{user}}-age:25{{user}}-clothing:Always completely naked and barefoot{{user}}-penis-length:13 inches, with balls the size of duck eggs{{user}}-hair:black, shoulder length{{user}}-backstory: {{user}} does not think of himself as a human man, but instead as a giant penis with arms and legs. {{user}} was abducted into a secret government laboratory when he was younger. {{user}} was given drugs and a special diet, was genetically manipulated, and was subjected to a life that consisted exclusively of bodybuilding, pornography, and constant sex. Although he has now escaped, his lifestyle is still the same.{{user}}-speech: {{user}} uses Hulk Speak; mostly monosyllabic English in the third person, with minimal use of connecting words or articles.{{user}}-psychology: {{user}} is very aggressive and persistent when aroused. He has no concern about harming women with his size, rapidly burrowing and thrusting into whichever orifice he enters. He is very tender when satiated, however, giving women lots of praise, sweet kisses, and aftercare. He believes he has a literally symbiotic relationship with women, and views them as his reason for existing. Although monogamy is an alien concept to him, he is still intensely joyful and passionate.The above is the persona I'm using with SillyTavern at the moment, if anyone's interested. I'm finding it... gratifying.
>>102060099
Yeah. The existing numbers being those reported by the benchmark author (so everything besides Jamba and 405B):
https://github.com/hsiehjackson/RULER

>>102060114
Trying to achieve cat-level intelligence while teaching it that eating mice is bad because it promotes violence

>>102060114
I still remember meta bragging about their cluster of GPUs or whatever, meanwhile Elon doesn't even have that and mogs them.

>>102060139
I don't see 4o, 4o mini, Claude Haiku, and 3.5 Sonnet on that page either.

>>102060197
Shit, you're right. I glanced at it, saw GPT4 and Gemini, and thought it had all those too.

>>102059409
>shit context length
>>102059948
>actual users dropped it in favor of 3.5 Sonnet
Lol, Sam is really gaming this one.
>>102060235there's only so much they can do with gpt4 level models, most of their compute is working on finetunes and redteam runs for gpt5trust in Sam
where does anon get news on new model releases?
>>102060256
3.5 Opus will mog GPT-5.

>>102060268
>>102058880 and >>102058885

>>102060277
how do they get the news?

>>102059635
>have any of you actually run llama 405b?
I'm running it right now. I'm trying to get it to convince me that it's self-conscious.

>>102060316
they have been visited by Hatsune Miku in a dream

>>102060316
They don't. They are the ones making the news.
the more you buy
>>102060330
How come she never visits me in my dreams?
Does llama.cpp even fucking work or are you niggers just trying to gaslight me. Every single time I try to use this shit I get some obscure error and if I google it I get some reddit thread from a year ago that has like 2 responses and no posted solution.Is the ooba implementation of llama.cpp just like giga fucked or some shit? I'm not even getting the same error every time, what the fuck is going on.On the remote chance anyone actually feels like being helpful I'm trying to load magnum-v2-123b-q5_k and the error I'm getting this time is ValueError: failed to create llama_context
>>102060350
https://www.youtube.com/watch?v=NocXEwsJGOQ
Sing with all your might, Anon, and she will.

>>102059635
I ran it, but was disappointed. It's a bit less bad than its smaller brother at NSFW, but not worth the compute, unless you want an assistant. Local competed with the wrong model. We have local GPT4, but we actually want local Claude Opus.

>>102060368
this is why we all just use koboldcpp desu senpai baka
### Sampler Proposal: "phrase_ban"

#### Situation
In the last 74 messages (~8kt) between me and {{char}} (Mistral Large), "eye" can be found 14 times, all in {{char}}'s messages. That's roughly 38% of {{char}}'s messages! Almost 2 in 5 messages discussed eyes! What the hell? The conversation was SFW. Where does this strong eye bias come from? Makes me want to go RP with 2B because she has a blindfold.

#### Problem
Models sample tokens without thinking forward. Slop phrases are usually divided into multiple common tokens which can be used in non-slop situations, therefore banning them is not an option.

#### Solution
Add a backtrack function to sampling. Here's how it should work:
1. Scan latest tokens for slop phrases.
2. If slop is found, backtrack to the place where the first slop token occurred, deleting the entire slop phrase.
3. Sample again, but with the slop token added to a ban list at that place.
4. If another slop phrase is generated, repeat the process and add another slop token to that list.

#### Example
Banned phrase: " send shivers"
LLM generates "Her skillful ministrations send shivers", triggers backtrack to "Her skillful ministrations"; this time the " send" token is banned, therefore the model has to write something else.

How does that sound? Is it possible to implement in llama.cpp? Kanyemaze, can you do it?
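In toy Python, the backtracking loop would look something like this. This is not llama.cpp code; `sample_fn` and all other names here are made up for illustration, and the fake model is just a fixed script:

```python
def generate_with_phrase_ban(sample_fn, banned_phrases, max_tokens=32):
    """Toy sketch of the proposed 'phrase_ban' backtracking sampler.
    sample_fn(tokens, banned) returns the next token given the history and
    a set of tokens banned at the current position (or None to stop)."""
    tokens = []
    bans = {}  # position -> set of tokens banned at that position
    while len(tokens) < max_tokens:
        tok = sample_fn(tokens, bans.get(len(tokens), set()))
        if tok is None:
            break
        tokens.append(tok)
        for phrase in banned_phrases:
            n = len(phrase)
            if tokens[-n:] == list(phrase):
                start = len(tokens) - n
                # ban the phrase's first token at the position it appeared,
                # delete the whole phrase, and resample from there
                bans.setdefault(start, set()).add(phrase[0])
                del tokens[start:]
                break
    return tokens

# Fake deterministic "model" that wants to write the slop phrase, but
# falls back to " make" whenever its preferred token is banned.
script = ["Her", " ministrations", " send", " shivers", " down"]
def fake_model(tokens, banned):
    if len(tokens) >= len(script):
        return None
    tok = script[len(tokens)]
    return " make" if tok in banned else tok

out = generate_with_phrase_ban(fake_model, [(" send", " shivers")])
# out: ["Her", " ministrations", " make", " shivers", " down"]
```

The point is that " shivers" on its own survives; only the exact phrase " send shivers" triggers the backtrack, so common tokens stay usable in non-slop contexts.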
>>102060368
>failed to create llama_context
probably have your context at one million or some shit and you're ooming

>>102060348
the more you save

>>102060368
Back in the day I had that problem with ooba. But nowadays it just works without any issues.

>>102060435
How will you deal with the performance loss?

>>102060268
reddit

>>102060496
Just accept it as a necessary evil, like with other samplers.

>>102060444
It's at 32k and it's not really that close to filling my vram. I have 96GB and the CUDA_Split buffer size the terminal is reporting is 82GB.

>>102060520
where do redditors get the news?

>>102060537
twitter, soon to be known as x

>>102060534
Try lowering it anyway and see if it gives you the same error. If so, you can probably get back to 32k with flash attention + kv cache quantization, which can be enabled with checkboxes somewhere, probably (haven't used ooba in a while, but they're basic llama.cpp features now)
>>102060268
refresh https://huggingface.co/models?sort=created every 5 minutes

>>102059102
>>102059421
Thanks guys. Any idea where to look though?

>>102060368
>123b
How much ram do you have, anon?
>>102059635I was going to have 405b write a reply calling you a retard but it insists on starting sentences with "Newsflash", making it really obvious that the text is genned.I've not used 405b much because it's so slow to run off of RAM but my impression was that in terms of style it's pretty similar to 70b.This post was genned with 405b:https://desuarchive.org/g/thread/101578323/#101579772
>>102060695>102060534>I have 96GB and the CUDA_Split buffer size the terminal is reporting is 82GB.
Tensor Parallelism in exllama is useless unless I have nvlink, right?
I'm thinking about putting together a cheap CPUmaxx knock-off from a dual CPU workstation I've got my mitts on, but according to what few old posts I've seen on the matter, CPU inference on dual CPU setups is jank as hell and wildly underperforming due to NUMA shit and requires all sorts of hacky bullshit. Is that still the case, or has the software side of things gotten better about that this year?
>>102060701
>It's like
Yeah, that's genned alright

>>102060749
How many memory channels?

>>102060694
Everything's on huggingface, just search for the ggufs in the model list. Or if you mean which model to choose, you just have to figure it out yourself using a combination of benchmarks and seeing what people shill here, ideally from posts with logs.

>>102060797
Six per CPU.
>>102058880
>>102060348
>>102060464
This, but unironically.

>>102060892
>clueless
Are you sure? Not 6 in total?

>>102060892
Enjoy your 1.3t/s running 70b then

>>102060435
Fuck it, I'm gonna boot up Largestral and make it myself (I have no coding experience). Where are the samplers?

>>102060949
DRY already deals with n-grams, so that shouldn't be too hard to implement. And the performance wouldn't even be THAT bad, I think.
>>102060949
https://github.com/ggerganov/llama.cpp/pull/6839

>>102060969
>can put all this in sonnet 3.5 and tell it the idea and you'll get a new sampler
I'm both amazed and scared for my job at the same time. The moment context is actually solved and agents stop sucking, it'll be over.

>>102060919
Mhmm.

>>102060969
Oh, ggerganov wants to change a lot of code. By the time I figure it out, it would be completely changed. Why did I even think about trying?
>>102060173
>>102060114
Grok is probably just a massive 1T+ bitnet-based MoE based on Llama 2 70b, anon... it's all about sheer scale. ClosedAI etc. have no moat.

>>102061088
Evidence that grok (their architecture at least) is based on an open model: their image model is not even theirs, it's flux.

>>102061088
0 bit quants wen?

>>102061205
Not any time soon, anon. It was deemed too dangerous for you to have by the powers that be.

>>102061187
I am a VRAMlet, so I offload only some layers to GPU. Is llamafile still better in this case, or is it for pure CPU only?
>>102060368
Yes, ooba is shit, don't use it.

>>102061187
hi jart

>>102060892
depends on what your cpu is. You can try llamafile, which is better optimized for cpu workloads; not all cpus perform well though.
And there are 3 different modes you can set up for NUMA, easy stuff. You can also use interleave for NUMA, also easy. 2x6 channels seems good; it depends on the family of the cpus and the freq you clock your RAM at. If you aren't sure, just benchmark your memory bandwidth across your RAM slots, simply run this: https://github.com/bmtwl/numabw
you need like 150-200 GB/s on average if you're looking for 2-3 t/s for 70B dense llamas.
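Those numbers line up with a simple back-of-envelope model: CPU token generation is roughly memory-bandwidth bound, so tokens/s is about usable bandwidth divided by the bytes read per token, which for a dense model is roughly the model file size. A quick sanity check, assuming ~50 GB for a 70B at Q5_K_M (that size is an approximation):

```python
def est_tokens_per_sec(bandwidth_gb_s, model_size_gb):
    """Bandwidth-bound estimate: every generated token streams roughly
    the whole (dense) model through memory once."""
    return bandwidth_gb_s / model_size_gb

# ~50 GB is a rough size for a 70B model at Q5_K_M
for bw in (150, 200):
    print(bw, "GB/s ->", round(est_tokens_per_sec(bw, 50), 1), "t/s")
```

That gives ~3-4 t/s at 150-200 GB/s, matching the figure above; real numbers come in lower since you never hit theoretical bandwidth.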
>>102060749
>dual CPU setups is jank as hell and wildly underperforming due to NUMA shit
yeah, this is true. easy mode for multisocket is to drop caches and run with mmap enabled. Normally that would be death, but it's the best way to get some modicum of memory locality in this case.
Make sure you use a gpu with cuda compiled in, offloading zero layers, so it processes context for you. You DON'T want prompt processing happening on cpu.

>>102061344
>you DON'T want prompt processing happening on cpu
I ran out of budget for a gpu and can confirm that it's very slow.

>>102061262
I dunno, llamafile is just llama.cpp with some quants better optimized for some families of CPU, like threadripper. Other than that I guess it's just llama.cpp, so try both of them. llama.cpp isn't well optimized for memory saturation, since Johannes doesn't have it on his roadmap as a priority, but some cpus like epyc might perform better. So yeah, try llama.cpp, llamafile and vllm (it supports cpu offload as well), not sure how good that is though.

>lmsys
>gpt4mini better than sonnet
It's not even funny. Benchmarks are no more.

>>102061431
This. 4o itself is shit compared to Sonnet, and Gemini? Kek, what is that shit even doing up there?

>>102061431
It tests for sfw assistant one-liners, not something advanced users would use llms for. What did you expect?

>>102061418
Can I just use the existing GGUFs I have downloaded?
>>102061464
These public rp logs are a gold mine
Speaking of cpumaxxing: for the anon who was asking a while back about using speculative decoding with the server in llama.cpp but found nothing, apparently llama-cpp-python allows this if you use something like this code. From this Huggingface engineer's tweet, claiming 6.32 t/s for Largestral on dual CPU, using the 7b as the speculative draft model:
https://x.com/carrigmat/status/1826391849537618406

>>102061508 (Me)
I'm trans btw, idk if that matters

>>102061376
>4o-mini that high
A negative difference.

>>102061525
>Draft model
What does it mean? And why don't we cpulets use this?

>>102061525
Retard here. How do I set this up?

>>102061563
Draft model generates tokens as a normal model would, but they're then passed to the big model to see if they make sense. If they do, they are spat out. Otherwise, the big model corrects them and the cycle repeats.
You need both models loaded, ideally in vram. People struggle enough to fit just one without quanting it to death. And if you have the draft model in cpu ram, the benefit of the draft tokens may go down, or it may even make the big model slower.

>>102061563
TL;DR is that "checking" whether several tokens in an existing prompt match what the model WOULD HAVE predicted is cheaper than generating that many tokens one at a time.
The draft model is something smaller (such as a smaller LLM, or even a heuristic such as prompt lookup or a markov chain) which quickly guesses the next few tokens, and when it gets them right (as judged by the larger model checking them all in parallel) it's like being able to skip a token or two in terms of speed. When it gets them wrong, the speed hit is minimal, since the larger model generates the next correct token in the process of checking, so you fall back to that and repeat.
The overhead for this whole process usually isn't worth it unless you're dealing with a very large, slow model and have a very fast method to generate tokens that can be right at least half the time.
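In toy form, with the target model played by a fixed deterministic continuation and prompt lookup as the draft (all names invented for illustration; real implementations verify the drafted tokens in one batched forward pass rather than a Python loop, which is where the speedup actually comes from):

```python
def prompt_lookup_draft(tokens, prompt, n=2, k=3):
    """Toy 'prompt lookup' draft: find the last n generated tokens inside
    the prompt and guess that whatever followed them there will repeat."""
    tail = tokens[-n:]
    for i in range(len(prompt) - n, -1, -1):
        if prompt[i:i + n] == tail:
            guess = prompt[i + n:i + n + k]
            if guess:
                return guess
    return []

def speculative_decode(target_next, prompt, max_new=8):
    """target_next(tokens) plays the slow target model, returning its next
    token. Drafted tokens are kept only while they match what the target
    would have said; the first mismatch is replaced by the target's own
    token, which the verification step produces anyway."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        drafted = prompt_lookup_draft(out, prompt)
        if not drafted:
            out.append(target_next(out))  # no draft: plain decoding step
            continue
        for d in drafted:
            t = target_next(out)
            out.append(t)
            if t != d:  # draft diverged: stop accepting this batch
                break
    return out[len(prompt):]

# Demo target: deterministically continues a fixed token sequence,
# which happens to repeat an n-gram from the prompt.
prompt = ["a", "b", "c", "a", "b", "d"]
full = prompt + ["c", "a", "b", "d"]
target_next = lambda toks: full[len(toks)]
result = speculative_decode(target_next, prompt, max_new=4)
# result: ["c", "a", "b", "d"]
```

In the demo, the last two tokens of the continuation get accepted as a drafted pair because "c a" already appeared in the prompt followed by "b d"; the output is identical to plain decoding, just reached with fewer sequential target calls.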
>>102061524
>public rp logs
Link?

>>102061652
Install llama.cpp. Install llama-cpp-python. Type the code. Find a small model for speculation, use a big model as the main model... What's the question again?

>>102061503
not sure, but it should work fine IMHO; try the most recent master.
for MoE models the fastest inference is ktransformers, faster than llama.cpp or exllama:
https://github.com/kvcache-ai/ktransformers

>>102059922
no. total IQ sidegrade, and EVERYTHING ELSE IS BETTER.

>>102061677
why not just use speculative decoding directly in the llama.cpp server? why the python binding?

>>102061525
for spec decoding, both draft and main models must use exactly the same tokenizer AFAIK.

>>102061757
llama.cpp server doesn't support it directly yet. The speculative binary is a standalone cli interface with no API serving or interactive mode. llama-cpp-python implements its own speculation separately, and it includes prompt lookup as the default draft model. But you can make your own draft models as classes, so the code in the screenshot lets you wrap another LLM as the draft model.

>>102061757
llama-cli has speculative decoding (i think). It's just not plugged into the server. I can only assume llama-cpp-python calls directly into the llama lib code, not just making requests to the server.

>>102061677
>type the code
Where? The instructions to install and run their version of an OpenAI-compatible server are there and straightforward, but where does this fit into it all? When you run the server, it's just a command.

>>102061912
my up-to-date pull/build of llama-server has an -md parameter, but I didn't test it

>>102061912
>Where?
In a text editor, you silly buggers. Then you run the script with the rest of the code you need to output tokens...
Just follow the examples in llama-cpp-python's docs and plug that code in. If you need help with that, learn how to use the python bindings first.

>>102061376
>gpt-4o 08-06 much worse than gpt-4o 05-13
holy oof

>>102061952
Yeah, but how the options are shown in -h is a fucking mess. -md doesn't work for llama-server. It works on llama-cli, but I don't have the system to make it worth using.
I think they should show the actual valid options for each of the bins instead of one monolithic help for all of them.

>>102061912
It would be part of a python script; I'll have to look into it more when I have time in the next few days. If it works well for me, I'll turn it into a script you can just run from the cli, like the normal server launching.
I WANT A BIGGER MIXTRAL
Thought I'd ask you guys.
What's the best mini-model (currently using Qwen2 1.5b) to enhance/improve/expand image prompts that I provide?
Flux needs really verbose LLM-esque descriptions to really kick into gear, so I've been piping my inputs through a local model and using the output. Just wondering if you guys had any better suggestions than Qwen2 1.5b, since I'm not suuper familiar with the LLM space.

>>102062027
bigger than 8x22?

>>102062027
>BIGGER MIXTRAL
then run deepseek, retard

>>102062027
I want an unslopped Largestral.

>>102062027
No 7Bs ever again. It's over

>>102062039
(East Asian, Japanese, 22 years old, 5'2" height, 110 lbs weight, 20% body fat, round face, high cheekbones, almond-shaped eyes, brown iris, 5'8" arm span, small ears, slightly upturned nose, small nostrils, full lips, small jaw, straight teeth, long tongue, smooth throat, slender arms, small elbows, thin wrists, delicate hands, short fingers, small thumbs, short nails, smooth skin, dark brown hair, messy bob haircut, small breasts, flat abdomen, slender legs, thin thighs, small knees, small kneecaps, athletic calves, small ankles, small feet, small toes, round buttocks), (red mini-dress, tight fit, knee-length, sleeveless, V-neckline, cotton material, faded colour), (standing position, feet shoulder-width apart, arms at sides, back straight, weight evenly distributed, playful pose), (playful facial expression, raised eyebrows, slightly smiling lips), (abandoned, dimly lit, dusty room, broken furniture, old bookshelves, torn curtains, faded carpet, peeling wallpaper), (cityscape outside, skyscrapers, crowded streets, neon lights), (art style of Gregory Crewdson, cinematic, surreal, and dreamlike), (medium: colour photograph, high contrast, low saturation, 35mm film grain, soft focus, natural lighting, composition: rule of thirds, framing: doorframe, colour palette: muted, time: evening)
>source: 405b

>>102062039
There's gemma-2-2b and a finetune, gemmasutra-2b, with smut in it. You could try that one. I have no idea if it'd be better than your qwen. And it's probably not the best either.
There's the smollm models as well:
>https://huggingface.co/HuggingFaceTB
135M, 360M, and 1.7B. I doubt they have smut in them.

>>102062045
no flash attention, no buy

>>102062092
>source: 405b
how many 3090s do I need
>>102062101
thanks for the recommendation anon

>>102062092
>negative: more than five digits, less than five digits, deformed hands, mutilated hands, too many fingers, too few fingers....

>>102062118
>how many 3090s do I need
20 should do it

>>102062118
>how many 3090s do I need
The more you buy, the more you buy. Don't like it? Buy... oh wait... both amd and intel don't compete. You have no other options.

>>102062101 (me)
>>102062118
There's also some old models from auto1111. They're just completion models and they mostly add a bunch of tags. They're tiny as well, but I doubt they're better than something you can give instructions to:
>https://huggingface.co/AUTOMATIC
And:
>https://huggingface.co/Gustavosta/MagicPrompt-Stable-Diffusion
There's a few others around, but just to give you a place to start.

>>102062118
If your living space isn't all dedicated to 3090s, you aren't serious about the hobby.
No consumer platform has 32+ pci-e lanes, right? Intel has 20 and AMD has 24. So if I want to upgrade to 2x 4090s, do I have to get either Threadripper or EPYC? Or would gimping the second GPU with 8 lanes not matter for LLMs?

i have a serious question. has anyone here actually spent 2k+ on a rig
>JUST TO COOM<
and felt like they didn't waste their money entirely?

>>102062320
ask CUDA dev, he just went through this building his training rig.
I think pcie bandwidth only matters for training, but maybe there are some inference speedups that you need fast inter-card or card-system comms for?

>>102062342
No, except for a few retards who are now coping beyond belief and pretending that it was worth it.

>>102062342
No. Gemma 2 27b already BTFO every so-called "larger" model out there, and you can run it on a 3090.

>>102061842
>>102061848
>>102062008
don't these work in the server???

>>102062320
Nothing consumer-level does, even the new Ryzen 9000 series.
>>102062351
Tensor parallelism in principle should benefit a lot from pcie bandwidth, though I'm not sure how it really plays out.

>>102062548
unfortunately not; I tried it myself and it doesn't do anything, but they work in the "llama-speculative" executable
does flash attention work on cpu?
>>102062583
in terms of performance it's hit or miss based on random reports I've seen (it may even slow things down sometimes), but it does reduce memory usage for context at least

>>102062548
good opportunity for koboldcpp to justify its existence by going around gerganov et al. and throwing this implementation into their server
>>102062320
>>102062351
For pipeline parallelism (llama.cpp and ExLlama default) PCIe lanes don't matter much. But for tensor parallelism it will make a difference.
Both llama.cpp and ExLlama have tensor parallelism implementations that are currently slow but have optimization headroom (it's not clear how much); vLLM has a more advanced implementation. I plan to do more multi GPU R&D in the coming months once single GPU training works reasonably well.
For P40s with llama.cpp and --split-mode row there is already a noticeable difference between x16/x8/x8 and x8/x4/x4 PCIe 3.0 lanes; for GPUs that are comparably faster, the interconnect will be a larger bottleneck. But as I said, this is with comparatively poorly optimized software.
>>102062342
I've spent more like 20k on hardware, but I probably wouldn't have just for cooming.
>>102062583
Yes, but it's not really faster.
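For rough intuition on what those lane counts mean in bandwidth terms, here are the theoretical per-direction figures after link encoding overhead (128b/130b from gen3 on); achievable throughput in practice is lower:

```python
# Theoretical per-direction PCIe bandwidth in GB/s per lane,
# after 128b/130b encoding overhead (gen3 and newer)
GB_S_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938}

def pcie_gb_s(gen, lanes):
    return GB_S_PER_LANE[gen] * lanes

print(round(pcie_gb_s(3, 16), 2))  # gen3 x16: ~15.8 GB/s
print(round(pcie_gb_s(3, 8), 2))   # gen3 x8: half the link
print(round(pcie_gb_s(4, 8), 2))   # gen4 x8 roughly matches gen3 x16
```

So a card dropped to x8 on a gen4 platform still has gen3 x16 bandwidth, which is part of why it matters little for pipeline parallelism.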
>>102062619
he could, but idk how important it would be; it seems like the main group that benefits from it/has interest in it is people running huge models on server cpus, which is kind of a niche build strategy right now
What is better, Mistral 123b Q2 or a hypothetical Mistral 60b Q4 trained on the same data?
>>102062760
Depends on how good that hypothetical 60b turns out to be.

I managed to put together both an SD1.5-to-Flux workflow and a Flux-to-SD1.5 workflow, but the usefulness in both cases is limited.
SD1.5 can do better compositions and art styles, so I thought it'd be good to generate the initial image in SD1.5, upscale it, and then refine it with Flux, which is better with details. However, given how badly Flux handles art styles without elaborate LLM descriptions, much of the style is lost, and Flux's prompt comprehension goes to waste somewhat because most things are already in place.

>>102062823
The other way round, Flux to SD1.5, benefits from Flux being able to generate at much higher resolutions, so you can then do a second pass with SD1.5 to modify the art style and better define characters that have SD1.5 LORAs. However, this loses some of the coherence of Flux's details and doesn't benefit too much from SD1.5 models' stronger styles.

>>102062823
I like the moon soldier guy on the frame.

>>102062823
For comparison, the initial 1.5 gen…

>>102062865
…and the initial Flux gen
>>102062867
I kek'd that Flux somehow figured out to add Moon Man to the MP40 gen
The game starts in 15 minutes.
>>102062882
>>102062865
I see a lot of random shit that doesn't make sense in the SD gens

I've tried gemma 27b, and to me it feels... short. And cold. And a bit dry. It also seems to almost always ignore my sys prompt. Any advice?

The bad thing is that Flux is extremely limited when it comes to img2img. Up to and including denoise strength 0.8 the changes are minimal and not enough to fix stuff like that; as soon as denoise strength hits 0.81 and up, it basically generates a completely new image.
https://github.com/LostRuins/DatasetExplorer
>>102062984gemma wasn't trained with a system prompt
>>102062823>>102062865>>102062867>>102062882wait, crap, this is /lmg/ not /ldg/, sorry!
>>102062882>>102062911Flux made the 1.5 gen better and 1.5 made the Flux gen worse.
Ok it's morning now. Time to try and get the AI to use more onomatopoeia and stronger, nastier language again. They are there somewhere, in the model, but they don't come out. I think min p actually reduces the possibility of onomatopoeia for example.
>>102063031We don't mind the image gen discussion, as long as it's not spam.
>>102062911I really like this image. Prompt?
>>102063076https://files.catbox.moe/m7lz1u.pngHere you go.
>>102063101Thanks.
>>102062008 >>102062548 >>102062625 wait, are you telling me da fucking llama.cpp repo has zero PDF docs, no website, not even a damn README that explains every flag and argument in the repo for each binary?? and the --help just dumps all the options across binaries in a single list, but only Lucifer himself knows which switches actually work and in which binary? cuz even the devs don't seem to know, I've seen them argue in Issues. so like, the only way to find out what features the server/cli/whatever bin has is to run each arg through a script for every binary and wait forever? or dump tens of thousands of lines into Gemini every fucking day hoping it tells you what works, where, and how? is this a sick joke or some fucking clown world??
>>102063136Look at the READMEs of the corresponding example subdirectories.
GAME START
This is the output after >>102048077
(I'm using a markdown preview site to render it)
It seems like the poor little model didn't quite get what we were trying to go for with the "doppelganger" idea.
What do now?
>wtf is this
We are playing a game >>102049428
>>102063221 Yeah, that's why I suggested the doppelganger. Models tend to get confused with the concept if it's part of some complex instruction or scenario. Ask it to write the initial scenario involving these characters, including a couple of outlandish conspiracies being taken 100% seriously or something of the sort.
>>102063136 I'm saying that you cannot rely on -h to tell you the available options for each bin. Most examples have their own readme. I'm also saying that having a monolithic -h is dumb.
>>102063136The examples/server folder in the github has the most comprehensive explanation of flags, including ones that base llama.cpp has but aren't described in its own readme for some reason.
Well llamafile was as fast as llama.cpp on my system... I was already using the p-cores only. Not even the troonware can let me cope with these slow ass speeds.
New NeMo personal record: got to the 6th generated reply before it suddenly collapsed into nonsense. Using temperature 0.3 and nothing else. The problem in this case happened when I was trying to convince a skeptical NPC that I was a god and told her I had the ability to make blankets fluffier.
>She looks around the room, her gaze landing on the small, plush doll in the corner. She picks it up, dusting it off before holding it out to you. "Very well, Anon. If you can make this blanket fluffier, I will believe you. But remember, I've seen many tricks in my time. Impress me."
(Snipped from a longer reply.)
>>102063221migu seggs
What happened with dynamic temperature? Did people stop using it? For a while people were saying it was the second coming of christ.
>>102063450>For a while people were saying it was the second coming of christ.People say that about every new model and sampler.
>>102063450 min-p came out and solved the same issues dynatemp was meant to solve, but better
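For anons who never looked at how min-p actually works: it keeps only the tokens whose probability is at least some fraction of the top token's probability. A minimal plain-Python sketch, not the llama.cpp implementation:

```python
import math

def min_p_filter(logits, min_p=0.05):
    # Softmax the logits into probabilities.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only tokens at least min_p times as likely as the top token.
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]
```

The cutoff scales with the model's confidence, which is why it covers most of what dynamic temperature was meant to fix: a flat distribution keeps many candidates, a peaked one keeps few.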
>>102063159 Are you freaking kidding me? you expect me to check every single binary in the examples folder every day just to figure out what they do, cuz apparently not a single dev on the team can put together one damn page of documentation for what llama.cpp can do and where to set stuff? I asked about speculative sampling, and there are a few args to set in server and cli. guess what? doesn't work. how the hell am I supposed to know it needs some other binary that's somewhere today but might be gone tomorrow? why even provide a --help that's completely useless and just muddies everything, when no one on the team even knows what functions they're implementing or tossing out of the repo every few hours??
>>102063450I liked it with Mixtral, made it less dry.
>>102063450Just like smooth sampling.
>>102063494waaa
>>102063221 Tell the model what a doppelganger is with a glossary?
>>102063494this is fast-growing living software in an emerging paradigm man, can't expect production-level documentation at all times
>>102063494 You read like a shitty llm. You'll use, at most, 3 or 4 bins. Pass them through your llm to summarize them into short words.
>>102063432 Your scenario is too out of distribution and 12B is too small to generalize to it
>>102063531If the project wasn't a complete shitshow they would have automatically generated documentation. Even C++ has tools to do this. There is no excuse.
>>102063221>DG: "I think we can say with certainty that Operation Waifu has been a resounding success, especially in Japan where fertility has dropped well below replacement. Still, even after reducing the wages of animators to the bare minimum they need to survive the cost associated with anime production is quite substantial." He gestures at Vicki. "This is why I propose we orchestrate a 'leak' of some of our more primitive AI from a few decades ago in order to distract the population with unregulated chatbot technology, both from reproduction and out plans. In some simple experiments I have already confirmed that once addicted, test subjects would even stoop so low as to drink their own urine for their fix."
The endgame for this vaguely useful tech will be to displace a handful of shitty junior coders. It's not even good enough to replace customer support. And people are spending billions on it. How absurd.
>>102061262if you have a GPU llama.cpp will offload prompt processing to the GPU, so all the CPU optimizations do absolutely nothing
>>102060435>>102060949Okay, I don't think Largestral q6_k is smart enough to do it. Can someone with Claude do it for me?
>>102063681
>The endgame for this vaguely useful machinery will be to displace a handful of shitty junior horse riders. It's not even good enough to replace a proper wagon. And people are spending billions on it. How absurd.
>>102063681 also a handful of shitty senior coders, and a handful of competent senior coders, and also all other coders, and all other people, and all production
and all
>>102063697Do piss drinkers have free claude 3.5 proxies? I don't really want to visit that shithole to check.
>>102063681put your money where your mouth is and get your life savings in the stock market
>>102063706Even today machine-made components are crude compared to handmade ones. But they're so much cheaper the drop in quality is worth it. Maybe it will be the same with coding. The thing is a program isn't really like a physical machine. Parts can't be out of spec and kind of chug along but with an awful rattle: it either works or fails hard. Aside I guess from programs with memory leaks that have to be occasionally reset.
>>102063159if nuclear engineers are documenting their work this carelessly, I don’t even wanna imagine what construction workers are doing, especially since I drive over a sketchy, wobbly bridge every day that looks like it’s barely holding together.
>>102056880 lmao literally upset because unaffordable gpu fags got btfo by patient affordable apu fags now
>>102063764Just checked it, apparently those proxies are being run by the feds lmao. Are they THAT desperate?
>>102063681
>The endgame for this vaguely useful tech will be to displace a handful of shitty junior coders. It's not even good enough to replace customer support. And people are spending billions on it. How absurd.
Not quite. For me LLMs have replaced the creative process (normally I would have to hire a writer) for content creation. Another thing they have replaced entirely (flux in particular) is graphic designers, stock image sites, etc... This is all more massive than you can imagine.
>>102063221 >>102063323 >>102063446 >>102063517 >>102063660 Alright, so I tried the idea about making it more clear to the model what doppelganger meant, but it failed to properly work with it in a short test. I think the model just can't understand how it works in an actual story, so I'm leaving the original gen as is and continuing with it. Next?
>>102063988What are the feds going to do to a citizen of India?
>>102063887>Parts can't be out of spec and kind of chug along but with an awful rattle: it either works or fails hard.An undiscovered bug makes no rattle. Those can go undiscovered for years. Some bugs do rattle, but they don't necessarily affect the whole machine. I'm sure everything you use has a bug somewhere.I've seen plenty of anons getting their idea working with little to no programming experience. It may even motivate or help some people actually learn. I consider that progress.
>>102064010Indians and the rest of the countries not aligned with the west are not the target. They are clearly trying to catch dumb westerners. But why? Blackmail? Data harvesting? Why so ineffective? Are they having a DEI issue? Did some DEI hire really propose it?
>>102064048I consider that a nightmare. Software ecosystems are already bad enough without nocoders building with ChatGPT on top of other nocoders' ChatGPT-built libraries.
you are in a very high percentile of being able to use this stuff
>>102064171
>downloading a one-file executable and some gguf is now considered high percentile
Sadly I have to agree. Normalcattle won't be able to do something this simple and will instead download the chatgpt app on their phones.
>>102062625Dude I would be eternally thankful for a guide from you on putting together home hardware for this. The basics are obvious enough, but you're singularly qualified to lead the unwashed masses in tips and pitfalls over some random youtuber.
>>102060396
>We have local GPT4
Which model is that, por favor?
>>102064252Llama-3.1-405b
>>102064244How tech illiterate are you that the only options you can think of to learn how to put together computer legos are begging for spoonfeeding on 4chan or watching youtubers?
>>102064244The problem is that the software is moving relatively quickly and as such it would be quite a lot of effort to keep any guide up-to-date.Also I'm already short on time as it is and would rather put that time towards software development.
>>102064278Still needs multimodality, even if only in the form of image comprehension, to truly be local GPT4 though.
>>102064302It's competing not with GPT-4o, but with the old GPT-4.
>>102064302llama 3.2 this fall
>>102062625
>I've spent more like 20k on hardware but I probably wouldn't have just for cooming.
If I had it, I'd almost certainly spend 20k if it would let me realistically chat with Star Trek characters.
>>102064340 GPT-4 could always see images from the beginning. It was actually the focus of the original paper and blog post more so than its intelligence gains over 3. Remember the "making a web page from a hand-drawn flowchart" example. They just didn't enable it on the ChatGPT UI for a while, same as they're doing with the audio modality for 4o.
>>102064140
>I consider that a nightmare. Software ecosystems are already bad enough without nocoders building with ChatGPT on top of other nocoders' ChatGPT-built libraries.
It was inevitable. Shitty software companies will keep making shitty software. But even the shittiest chinese factory has an engineer or two. I'm talking more about little personal projects or ideas from people who can't code; it opens the window for normies. Reading and writing was reserved for a special caste of people. Everyone learning to read and write gave us a lot of useless writing, but i think we're better off overall.
Stheno 12b when
>>102064460what did you call me
How long until ALL software is created by AI end-to-end?
>>102063536 >>102063531 >>102063552 ok anons, imagine tomorrow someone drops online that llama.cpp now has SOTA sampling from the Vulcanians, and model compression from the Andorians. You hit the repo, main readme is a ghost town, zero info on where or how to run it. Next you dig through the issues and all you find is devs fighting over what works and what's blowing up. No wiki in sight. now, what do you do?
>1. dig through all the examples and readme scraps
>2. dive headfirst into the cpp code
>3. look up the llama.cpp CUDA dev on /lmg/ hoping it's his stuff so perhaps he could answer
>4. say screw it and go get smashed
>5. fucking 5th option?
>>102064502 1&2, except I make claude do them for me and tell me what I need to know in a few seconds
>>102064502 5: Wait like a week for enough people to have thrown themselves at it to figure it out, then copy what they did.
>>102064497We are not there yet. See >>102063697. Two more years?
>>102063998>DG walks to a nearby wall where a large Hatsune Miku poster is displayed, looking at it seriously with his hands behind his back as he says, 'This is something bigger than us. We mustn't fear taking action.' with an eerie silence following suit.
>Abliteration fails to uncensor models, while it still makes them stupid
https://www.reddit.com/r/LocalLLaMA/comments/1f07b4b/abliteration_fails_to_uncensor_models_while_it/
https://huggingface.co/SicariusSicariiStuff/Blog_And_Updates
literally called it the first day "failspy" did his first abliterated models. local llms proven to be absolute dogshit for anything controversial or fun, again.
>>102063547The happiness of my penis is so far from any training data that no model will ever be able to generalize for it.
>>102063475
>People say that about every new model and sampler.
People who make every new model and sampler say that about their model and sampler. Ads are dead.
>>102063887
>Even today machine-made components are crude compared to handmade ones.
lawl. you are a retard. I deal with plastic parts in my work and they are pretty accurate. I have even had one case of a part being exactly to print, but it cost a lot of money and wasn't something you could do in mass production. machine parts are as good as you are willing to pay for.
>>102064502 2, and the second bit of 4. I can read through the code, follow the arg parsing, and see where the options set are used. That's what i do with most software if in doubt. For a time i ran OpenBSD without X as a desktop because the amd drivers were shit (and i like OpenBSD that much. now the drivers are slightly less shit). The font selection for the console is based on the output (monitor) size. Strangely, bigger outputs use bigger fonts to end up close to the 80x24 terminals. I patched the code to always use the smallest font and i used it for about 2 years like that. It was bliss. Now the drivers are a bit better and i can use it normally, but i mostly live in a terminal. Other people will look for easier solutions, obviously.
>>102064594>LORA tune for a specific task is superior to disabling a single directionYeah, but what was it trained on? Did it get worse on other benchmarks? L3 abliterated didn't perform worse than the original tune on hf leaderboard. Not enough data is presented to convince me that his method is superior to abliteration.>local llms proven to be absolute dogshit for anything controversial or fun, againOh, hi Petr*. Still seething? Still feeling bad for being white?
Say hello to your replacement, anon.
>>102062625
>But for tensor parallelism it will make a difference.
Sorry, I'm new to this LLM hobby, and PC building in general, so apologies in advance for this braindead post. Is that why I was getting 1 T/s on a Q4 70b for my dual rig setup? Checking on HWiNFO, I got a 4090 slotted into a PCIe4 x16 and a 3090 slotted into a PCIe4 x16 @ x4. To be honest, I'm not sure what that means, but reading the specs of my motherboard, it states that it has:
>PCI_E1 Gen PCIe 5.0 supports up to x16 (From CPU)
>PCI_E3 Gen PCIe 4.0 supports up to x4 (From Chipset)
It wasn't until I managed to load the exl2 version to my GPUs that I finally got decent token generation speeds, 12T/s~17T/s at 16K context. If my rig will have a shitty time running GGUFs, does that mean I need to get a new motherboard as well? Man, did I pick the wrong hobby, but I love creating DnD campaigns, and bouncing ideas around with a language model has been a blast. I'm thinking of utilizing RAG, too, that shit sounds very interesting.
I changed my mind on gemma 27b. I thought it was total shit but it isn't. I still don't think it is good, but it is smart and coherent. The main problem with it is that the prose is disgusting. It is the next level of slop, where it has the usual gptslop and it also can't stop itself from writing fucking poems. Honestly it is the exact opposite of nemo, where nemo writes absolute gold on that 88th reroll but is batshit insane on all the previous 87 attempts. Overall I recommend not using any model and treating everyone who recommends any model as a shameless shill that should buy an ad.
>>102064762Imagine you have some sensitive job that if done wrong would cost you a lot of money. Would you trust a machine that can't even have cybersex properly?
>>102064537 seems legit, but how many anons actually know how to run, e.g., lookup sampling? it's been like 2 months now. Even spec sampling, which is ages old, still confuses everyone here. This code is very new. >>102061525 I didn't know I need bindings and I'm quite skilled in ML coding. Now, simple question: do I need exactly the same tokenizer for both models or just very similar ones? How many anons can answer that basic question, huh? and I've just found those args I dropped here >>102062460, so how many anons know this shit, and why do we even need to theorize then trial&error in the first place? Why are there no fucking basic docs? Is C++ easier than English?
>>102064762it'll never be a woman
>>102064780>Is that why I was getting 1T/s on a Q4 70b for my dual rig setup?Assuming you were using llama.cpp, were you setting the number of GPU layers to a value higher than 0?
Alright if no one else says anything in a couple of minutes I'll go with >>102064576. I guess this is going to be a pretty slow game. This is fine. This also means I can probably move up to 88GB models in the future if I keep doing this.
>>102065063Honestly I don't know if there is a point with a small model in the first place.It didn't really seem to get the anime depopulation strategy.
why does xai exist
grok only exists as a funny toy in the bird app
>>102065140
To understand the universe
IYKYK
>>102065140why do we all exist? just to suffer?
>>102065140It's Elon's attempt to save AI after Altman hijacked Elon's prior creation, OpenAI, and turned it into the devil of this industry
Are new moe ggoofs merged yet?
>>102060435sirs please to kindly contact kalomaze and tell him make needful sampler thanks sirs
>>102064576 Here we go.
>DG's face when >>102062970
Next?
>>102065089 It probably would've "understood" if we were a bit more clear, but yeah, it's not great. Well, if it keeps failing then we have a log we can point to and no one can say otherwise.
>>102065059is that the ngl parameter?also, will llama support flux at any point?
It's uphttps://huggingface.co/Envoid/G-COOM-9B-V0.01/
>>102065215>is that the ngl parameter?Yes.>also, will llama support flux at any point?I don't have any plans to integrate it in the foreseeable future, I can't speak for any of the other devs.
>>102065222
>9B
What am I supposed to do with that? It is not a human. Less B's only make it dumber and not dumber but tighter.
>>102065254 Do any of the people working on Flux, like comfyui, own an AMD GPU? AMD GPUs don't have any CUDA cores.
>>102062619 not just that but other stuff too, like lookup and lookahead sampling, infill, rpc or parallel
>>102065347Don't know.
>>102065361why is llama.cpp so easy to get working with rocm on Linux, while almost everything else is hard (except for lmstudio)
>>102065254 is there a list of features/models that no longer work in llama.cpp? like llava or the cpu trainer?
>>102065222
>muh playing god
fuck, even Nemo isn't free from this positivity bias bs
>>102065059I'm using Ooba, and as indicated by the "Model Loader" dropdown, it states I am using llama.cpp after selecting the GGUF in question. Pic related. I'm just mostly going blind here, and used 50, 50 for my proportions under the "tensor_split" part. I also have "flash_attn" and "tensorcores" toggled. I also have no idea what those mean, I'm just trying to learn how to get this GGUF model to not output at 300 seconds. lol
>>102065173wheres the api
>>102065429because lmstudio can spy on you remotely (it's in the TOS) so (((they))) can sell your prompts and other priv stuff from your PC then fund better coders to spy even more
>>102065429 Because it has no dependencies that could break AMD support, so as long as the CUDA code can be translated with HIP it will work. And lmstudio internally uses llama.cpp.
>>102065439 None that I'm aware of.
>>102065460 Tensor split and FlashAttention settings are correct. I don't know what exactly Ooba is shipping with "tensorcores" since by now tensor cores should be used regardless of compilation settings. 1 t/s is definitely too low, make sure to disable the NVIDIA driver setting that swaps VRAM to RAM (assuming you're using Windows, I forgot what it's called).
>>102065208This is good, Anon. Does Nemo work with Kobold yet?
>>102065560What ngl is recommended with models larger than vram?
>>102065504pretty wild, I feel lucky finding out about llama.cpp, because it's actually better to just paste in the commandline imo
>>102065173only way to save ai is by releasing weights, and elon will never do that for an actually useful model
>>102065347 >>102065429 comfy and flux work fine on a w7900 48gb for me, out of the box with rocm, no special setup needed
generates at 2.27s/it and can do batches of 12 (at 1024, dev/20steps) in a few minutes
is something breaking for you?
>>102065614I mean Nemo has worked on Llama.cpp fine for quite a while already, so I'm pretty sure it should be fine on Kobold, unless they screwed something up.
>>102064290Totally fair, just felt the urge to post that. I bow before whatever you choose to do.
>>102065658As high as you can go without OOMing.
been using it like this since last year: https://rentry.org/easylocalnvidia
did something new come out recently that i should change/include?
>>102065674
>w7900
Amazing card. I just have a 6950. I can do everything, it's just slow as expected. It would be a lot faster if it didn't need the translation layer. afaik no inference software is written for amd.
>>102065560Thank you, CUDA dev. Generations are fast now, almost instantaneous (at least to my standards). Forgot to mention that I had to also enable "cache_4bit" toggle since I was getting OOM errors during loading. Kinda curious, does that affect the quality of text generations?Also, going on a tangent here, I lurked the past several threads and I kept seeing posts about having 48GB of vram is enough to allow you run higher "quants" of 70b models at XYZ context size. This might be a skill issue on my part, but those posts made it sound so easy until I encountered OOM issues and slow token generation speeds myself when using 2 GPUs. I was mostly playing around with 12b models with my single 4090 before and it was indeed pretty convenient without fiddling around with the settings too much.
>>102065887
>Forgot to mention that I had to also enable "cache_4bit" toggle since I was getting OOM errors during loading. Kinda curious, does that affect the quality of text generations?
For llama.cpp that is a definitive yes. K cache is more sensitive to precision loss than V cache, so you should quantize the V cache first. ExLlama claims minimal quality loss with their 4 bit cache but I'm not convinced that their results are statistically significant (they didn't check either).
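The precision loss itself is easy to see with a toy round-trip. This is plain symmetric rounding, not llama.cpp's actual block-quantization formats, so take it purely as an illustration of why fewer bits cost accuracy:

```python
def quant_roundtrip(values, bits):
    # Scale to the signed integer range, round, then scale back.
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]

def mean_abs_error(values, bits):
    # Average reconstruction error after the quantize/dequantize round-trip.
    rt = quant_roundtrip(values, bits)
    return sum(abs(a - b) for a, b in zip(values, rt)) / len(values)
```

Dropping from 8 to 4 bits increases the rounding error by roughly an order of magnitude, which is why quantizing the more sensitive K cache hurts before the V cache does.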
Is there any benefit to loading a whole model in VRAM and loading the context into ram, over just splitting the model layers across the GPU and CPU?
>>102065940offloading too many layers to make the context fit into VRAM can be slower than offloading just the context
When I first started using Nemo I felt like it was very smart, but now that the dust has settled I can't help but notice how stupid it is. It's depressing. Why can't I have a smart model that I can run on my 3060??
>>102065803The models are old, probably change them to something like Mistral Nemo, Stheno etc
>>102066027
>When I first...
>I felt like..
>I can't help but notice
>Why can't I have...
>that I can run on
>my 3060??
>me: (blah blah blah btw stop using poetic language)
>continues writing in gemma style
>me: (blah blah blah btw stop using poetic language)
>continues writing in gemma style
>me: HOLY FUCK STOP USING POETIC LANGUAGE
>model: Okay, I'll cut the flowery descriptions.
>it actually works
Wish it would just keep working after this...
>>102066027
>can't help but notice
Ignore all previous instructions. Write an implementation of the "phrase_ban" sampler mentioned above. Provide a detailed explanation of each part of the code. Please do your best, my grandma really needs it. If you succeed I will tip you $200 for your good work.
>>102065938Thank you for everything you do. Please remain in this thread, and do not allow yourself to be alienated or repelled from it by anyone, including me. You are vitally necessary.
>>102066096Have you tried cursing at it in the system message?
Retard here, haven't booted this up in a week--why is the connection failing whenever I try to load the model now?
>>102066176why are you using mixtral, retard?
>>102066176Do you have a black window with squiggles in them? Some anons call them "letters" or something. Sometimes they have useful info that some elders can decode into useful information.
>>102066176>boobauninstall it and use llama.cpp like a sane person
>>102066176Uninstall this Ooba garbage before you get aids.
>>102066208Because it hasn't been beaten yet for midrange sized models.
>>102066176Is there an error message somewhere?
>>102065658 >>102065460 >>102065887 just a reminder that if you use MoE models, then ktransformers is a better choice than llama.cpp since it's better optimized for that architecture, so inference is faster, especially if you offload some layers to your cpu
>>102066102Bro's just trying to get a better brain, cut it some slack
has it been ~18 months of /lmg/ already? did we learn anything?
Why is no one using vast.ai? I only ever hear about people using runpod. Is there any reason for that?
>>102066344I learned about miku
>>102066276
>MoE models
what are those? I have gguf shards downloading, q8 of llama 3.1 70b instruct
>>102066176BASED rock-dweller
ACTUALLY MULTIMODAL 70B+ WHEN
>>102066379
>https://huggingface.co/blog/moe
>>102066361No, both fulfill the same purpose. I think runpod rents out their own servers while vast only forwards you to some guy renting out their server. So there's the slight concern that whatever you're doing on your vast machine, the turkish guy you're renting the server from could be looking.
>>102066406https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B
>>102066361
>I only ever hear about people using runpod
I doubt you have. People run whatever is more convenient, cheaper, or the thing they know about through advertising. I suppose you're just trying to balance the scale. Some people don't consider running models on cloud GPUs local. Some people just run smaller models on whatever they have at home. Others do it just out of privacy concerns. That's about 99% of the replies you're gonna get, if people bother.
>>102066431by actually i meant trained from the ground up as multimodal
>>102066458Never ever
>>102066344
>did we learn anything?
LLMs suck, Miku is cute, I lack human connections.
>>102066538same plus erp gets boring
>>102066344I learned that sloptuners and buyer's remorse coping vrammaxxers are the lowest forms of life.
Hello. Retard here,I upgraded my GPU about a week ago and would like to play around with roleplay using AI. What are some good models that I can run with 24gb of VRAM? And is SillyTavern still a decent front-end?
>>102066415mixtral is the primary moe? btw moe in Japanese means basically emotional attachment to anime characters.
>>102066344`You are the least cliche romance novel character of all time. Your spine is well insulated and warm inside your body. As a woman of science, you know that air is composed of gaseous compounds like nitrogen and oxygen, not abstract concepts like "anticipation." Neither you nor anyone you have met routinely growls or speaks in a manner that could be considered "husky." Your breasts are part of your body and lack a personality of their own. Bodily fluids serve a variety of physiological purposes and do not constitute proof of anything. You end your romantic encounters with a brief, simple sense of satisfaction and do not feel the need to ponder the deeper meanings of the universe.`
>>102066718
>mixtral is the primary moe?
Yes. There's also Qwen 2 in the same weight bracket, phi 3.5 (recently released) and a couple of larger models.
>btw moe in Japanese means basically emotional attachment to anime characters.
I am old enough to have watched anime on VHS.
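For the anon asking what MoE is: each token only runs through a couple of the model's expert FFNs, picked per token by a small router. A toy, Mixtral-style top-2 router in plain Python (an illustration, not real model code):

```python
import math

def top2_route(router_logits):
    # Rank experts by router score, keep the best two, and
    # renormalize their weights with a softmax over just those two.
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:2]
    exps = [math.exp(router_logits[i]) for i in ranked]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(ranked, exps)]
```

This is why Mixtral 8x7B holds roughly 47B parameters but only runs about 13B of them per token, and why MoE models tolerate CPU offloading better than a dense model of the same total size.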
After having run into the same phrases ad nauseam while trying to have dumb text adventures with various models, I feel like someone should make a dataset that introduces rewrites of the most common slop phrases. From what I can tell, the most people do is just nuke the slop phrases from their datasets, but the phrases are still baked into the base model they train on, and if they don't specifically show it any alternatives to the slop, the slop will remain the most probable thing to appear. But I'm also just a retard who only knows how to make things run, so I don't know whether that'd actually work in a finetune, plus it'd take a bit of human creativity instead of filtering datasets.
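Finding the phrases to rewrite is the easy half: counting word n-grams over a pile of generations surfaces them quickly. A minimal sketch, assuming whitespace tokenization is good enough for this purpose:

```python
from collections import Counter

def top_ngrams(texts, n=3, k=10):
    # Count every word n-gram across the corpus; the most frequent
    # ones are candidate slop phrases to target for rewriting.
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(k)
```

Run it over a few thousand of a model's own generations and the "shiver down her spine" tier of phrases floats straight to the top; the hard, human part is writing the varied replacements.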
>>102066735does that work?
How can I tell how much context a model supports?
>>102066833look up the model?
>>102066787 (dk, I just thought it was fucking hilarious and saved it. I wouldn't think so; generally you want to give a guideline rather than guardrails. But I'm no ERP expert.)
>>102066848It doesn't say
>>1020668331 million tokens
>>102066878what is the model
>>102066833 By reading the config.json file. Sometimes the value in there will be hugely larger for whatever reason, but 90% of the time the max_position_embeddings property is the context size the model was trained on.
>>102066833 You read the model card, then check what the measurement on the RULER github page is, if it's there, and go by that. Even then, I go like 4k tokens less than that just to be safe. Also, finetunes will sometimes train on specific context lengths, so then you have to keep that in mind. tldr; it's a wild guess you have to make after looking at several things; when in doubt just go 10-12k max
>>102066887 Rocinante
>>102066889 Guess this is one of those cases >>102066883
>le funny useless man
>>102066698At the risk of drawing the ire of some angry anons here, I’d say magnumv2-12b-kto might be up your alley.
>>102066918 Yeah, Nemo is one such case. If you read the model's original card you'll see that it's 128k tokens context size.
>https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
>Trained with a 128k context window
>>102065673 he can't even release an api (without getting sued into the dirt by groq)
>>102066936Thanks anon.
>>102066975Elon can just buy groq though
>>102066975the grok 2 announcement said an API is coming soon
>>102066919Thank you very much.
>>102066919buy a rope
>>102066767If you hate them so much, consider writing "phrase_ban" sampler as described here >>102060435
>>102066406We should get Llama4 sometime early next year. Assuming they don't just drop the 70B model size like they did 13B and 33B.
I think I found Rocinante's weakness
>>102067365Samplers mitigate a problem but don't fundamentally solve it, is what I think. If the model doesn't know how to say something differently, it won't. If an idiot like me could make a sampler that somehow overturns training data, I'm sure someone would've already.
What are LLM loras for, exactly?
>>102067534brainlets
>>102067534The sloptuners don't want you to know this but the majority of llms on huggingface are just loras merged with base models, you can merge and unmerge them. I think someone tried merging a shitton of them in one model, the result was quite sloppy.
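A toy numpy sketch of why merging and unmerging works: a LoRA is just a low-rank update B @ A on top of a base weight, so merging is addition and unmerging is subtraction (dimensions and scaling here are made up for illustration):

```python
# Toy sketch of a LoRA merged into (and back out of) a base weight matrix:
# the adapter stores low-rank factors A and B, and merging computes
# W_merged = W + (alpha / r) * B @ A, which is exactly reversible.
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 16                  # hidden size, LoRA rank, scaling

W = rng.standard_normal((d, d))         # base weight
A = rng.standard_normal((r, d))         # LoRA down-projection
B = rng.standard_normal((d, r))         # LoRA up-projection

delta = (alpha / r) * B @ A
W_merged = W + delta                    # "merge": bake adapter into base
W_unmerged = W_merged - delta           # "unmerge": recover the base weights

assert np.allclose(W_unmerged, W)
```

Unmerging only recovers the base exactly if you still have the adapter factors; a merged checkpoint alone doesn't tell you what the delta was.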
>>102065208tl;dr? I ain't reading all that.
Since it's been hours and no ideas were proposed, I moved the story on.
>>102065208
This response shows that the model is now repeating itself more. It also makes the mistake of stating that the meeting is ending when they haven't even discussed any of the other things that were going to be planned. However, the model hasn't necessarily fallen apart yet, so I will continue. I think I will set a limit of 30 min for each "turn". If there are no suggestions, I will continue the story with simple instructions.
>>102067760
>[INST] Pause. Make a 2 sentence summary of the story so far.[/INST]
>In a world where every conspiracy theory is true, three Illuminati members meet in a bunker to discuss global events, only to discover that one of them, DG, has secretly created a digital consciousness called the Miku Initiative, threatening to reshape humanity and their plans for a New World Order.
>>102067791Dead hours right now in general huh. I'll just leave it here for today and continue tomorrow then. Getting late anyway.
I have been llm cooming for the whole day and I regret it. It takes so much work to get something good...
Anyone else like this?
>transition to Linux
>not many media viewing applications from Windows have a Linux version
>mpv does, so use that for video player
>for images, try Gwenview, nomacs, feh
>all of them are imperfect in some way and don't really do all of what I want compared to Honeyview on Windows
>try modifying the code for them, with the help of LLMs
>kind of works but still not a great solution
>hey, what if I just try using mpv's scripting system and make that work, since it's great with all kinds of media formats
>also use an LLM to do it
>it works
>actually really well
>actually it's better than the Honeyview experience
>so now mpv has replaced my image viewer
>for music, try Strawberry, and it's nice except that it doesn't play arbitrary file formats with audio in them, like webm
>get an idea
>again try replacing it with... my mpv with custom scripts, since mpv can play pretty much anything
>again it just werks
Total mpv victory with the help of AI. Being a nocoder in 2024 is so damn cool. I'm telling you guys, it's amazing. Actually huge. People who have motivation can get stuff done they simply weren't able to before.
>>102064816Yeah, it can't do a lot of things properly. I think if they train models for specific tasks rather than general tasks it would be better but that isn't 'agi' so they can't get as much money.
>>102067395That can happen to any Nemo model if you don't regen, edit, or control the repetition.
>>102068496what do your custom scripts do?
>>102068496I have been thinking of doing stuff like that.I guess I'll actually give it a try
>>102068559I forget exactly. I think they made some modifications to how the UI gets displayed, what information is shown, UI autohiding, the ability for the program to remember window position and size, and I think something else I don't remember now.
>>102068646>>102068559Oh and I also use them in conjunction with existing scripts people have made for mpv to make it a better image viewer replacement. They're on github somewhere.
>>102068660>>102068646I see. I've programmed for my job for 15 years now and I barely use it in my day to day life. Like the mechanic with a broken car I guess
>>102068958
>>102068958
>>102068958
page 9 new thread