/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>102249472 & >>102234876

►News
>(09/06) DeepSeek-V2.5 released, combines Chat and Instruct: https://hf.co/deepseek-ai/DeepSeek-V2.5
>(09/05) FluxMusic: Text-to-Music Generation with Rectified Flow Transformer: https://github.com/feizc/fluxmusic
>(09/04) Yi-Coder: 1.5B & 9B with 128K context and 52 programming languages: https://hf.co/blog/lorinma/yi-coder
>(09/04) OLMoE 7x1B fully open source model release: https://hf.co/allenai/OLMoE-1B-7B-0924-Instruct
>(08/30) Command models get an August refresh: https://docs.cohere.com/changelog/command-gets-refreshed

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench
Japanese: https://hf.co/datasets/lmg-anon/vntl-leaderboard
Programming: https://hf.co/spaces/mike-ravkine/can-ai-code-results

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
►Recent Highlights from the Previous Thread: >>102249472

--Papers: >>102252385
--Using low topK improves performance by reducing latency from sorting logits in large vocabularies: >>102257071 >>102257336 >>102257369
--Struggles with summarization models and made-up details: >>102250806 >>102250856 >>102250882 >>102250897 >>102250926 >>102251152 >>102251274 >>102251515 >>102251565 >>102250902 >>102256091 >>102256343
--Flux licenses are restrictive and may apply even to modified or fine-tuned models: >>102254945 >>102254985 >>102255057 >>102255079 >>102255190 >>102255082 >>102255128 >>102255166 >>102255242
--Building a doctor bot with medical LORAs and understanding MRI reports: >>102249775 >>102249915 >>102249954 >>102250054 >>102250116 >>102250089
--Various AI model discussions and performance evaluations: >>102251322 >>102251377 >>102251385 >>102251408 >>102251624
--Reflection-Llama-3.1-70B tokenizers are fucked: >>102256855 >>102257164
--Novel idea for AI roleplaying: >>102256155
--Gguf quants of reflection are broken and waiting for a fix: >>102254244 >>102254255 >>102254279 >>102257951
--Forehead puckers description in video game character introduction: >>102250773 >>102250823 >>102250891 >>102250929
--Difficulty downloading large HuggingFace models with login required: >>102255880 >>102255916
--DeepSeek-V2.5 model released on Hugging Face: >>102257561
--CLIP-GmP-ViT-L-14 text encoder discussion: >>102250141 >>102250152 >>102250262 >>102250277 >>102250735
--P40 prices increased due to llama.cpp popularity in China: >>102255502 >>102255662
--Llama.cpp development and technical debt discussion: >>102254305 >>102254446 >>102254463 >>102254622 >>102254639 >>102254661 >>102254737 >>102254782 >>102254780 >>102254811 >>102254843 >>102254927 >>102255090 >>102255137 >>102255167 >>102255215 >>102255244 >>102258077
--Miku (free space): >>102249618 >>102251592 >>102252159 >>102252190 >>102254564 >>102256281

►Recent Highlight Posts from the Previous Thread: >>102249480
>>102258718
It is certainly novel, and at least it gives us a better idea of a new finetuning technique. We'll see in the coming days whether it really was a meme or not. (imo it is mostly a meme; the "thinking" is mostly noise and wasted computation to arrive at a slightly less incorrect answer than the model was already capable of.)
>>102258941There is no escape.
>>102258977
venv/conda envs solve 90% of them. docker containers solve 99% of them.
>>102258941Any adventure/rpg cards that anons use? Not expecting any miracles, just want to try if I like it with an LLM.
>>102259012
I never used any of them for too long, but these are some:
https://characterhub.org/characters/illuminaryidiot/the-staff-of-oscilion-338deea8be18
https://www.chub.ai/characters/punchchildren/grand-gensokyo-adventure-dd7ffd91
https://files.catbox.moe/zjvye9.png
Best code model under 12B?
>>102259124best new luxury car under $500?
>>102259157Bad analogy retard
>>102258941
>reflection purged from the OP
It's over
>>102259223I guess OP did a bit of reflection on the choice to include it
>>102259124
Try the new and shiny Yi-coder mentioned in the OP. Tell us how it goes after you use it for a while.
>>102259223it didn't know how to stawbery
>>102258941
>>102259223
OP being smart and reasonable for once? Who are you!?
Any recommendations for sampler settings in low param models, or am I expecting too much out of humble 7Bs?
>>102259223reflection is woke
>think about trying the reflection thing
>remember that I can only run 70Bs at 2 t/s, so the model's responses would be even slower and it wouldn't be worth it even if the quality really was that good
is there a sillitavern like frontend for voice gen? unlike textgen things seem to be split up in multiple places
>>102259533
>can only run 70b at 2 t/s
spoiled brat anon
>>102259533
>can (...) run 70B
I'll kill you.
>>102259228AAAAAAAAAAAAAAAAAAAAHHHHH
>>102259617>>102259590You should have enough RAM in 2024 to run 70B at 2 bit
>>102259533
70B runs at 0.8t/s for me :(
>>102259279
>7b
try 8b
https://rentry.org/83fkenr9
>>102259801kino
DeepSeek 2.5 verdict?
does DRY require you to specify the penalty range for it to take effect, like rep pen, or does it cover the entire context window when left at zero?
>>102259977
>DeepSeek 2.5 verdict?
finished downloading and currently quanting
>training still financially infeasible
I've been out from a while now, I've heard there's a new hot shit called Reflection or whatever, how good is it?
>I still haven't really experienced any progress in llm
>>102260253
it's a sloptune that "makes" the llm "really think" before answering
>>102260253
if this finetune method were as revolutionary as claimed, the API guys would use it to make gpt4o and Claude even better
>>102259977Too big
>>102260396that's what she said!
>>102260286
Is it trained in so it has the speed of a normal model but the results of a thinking loopback? Or is it really just a huge model that offers "you don't have to paste your use-thinking-tags instruction into your proompt", and then shits out the whole dump of its "thinking" and then rewrites its answer as though you had prompted with that instruction?
>>102258941wtf, is this entire image ai? as in, the text as well?
>>102260427the latter
>>102260443yes anon, you missed the Flux train or something?
>>102260449
Lame. The only reason I see to bake that in is if it's actually changing the token generation to get "thought about" results in the same kind of time as the normal model with the extra instruction involved (which I bet would work just as well in System or Kobold's memory system so it's always appended).
>>102260443
Flux is like that.
>>102260482>>102260489wildi love the future
>>102260443
>he doesn't know
local image gen has been more or less perfected tbhfamtachi
>>102260427>>102260449i mean, it's not a bad thing inherently. inference speed will only speed up from now on, and context will grow larger and larger anyway. it's basically free iq points for all the existing shitty llms, so it's nothing to complain about
>>102260482
>you missed the Flux train or something?
A1111/foundry loser here. I did get Comfy installed and got one image out of it on old models but I haven't tried Flux yet. (I followed a tard guide and kinda got half of it working I guess.)
Does it take a lot of wrangling to get good results or is it noob friendly?
>>102260443Made with Flux-Dev-Q8
>>102260497tell that to my vram
>>102260535is her arm okay?
>>102260541>tfw we live in a timeline where you literally could tell your vram that
>>102260559she's got a strong grip
>>102260531
>Does it take a lot of wrangling to get good results or is it noob friendly?
it's a bit more complicated than a regular SD model, for one it's not supposed to work at CFG > 1, but you can make it happen by going for an anti-CFG burner like DynamicThresholding or AutomaticCFG
https://reddit.com/r/StableDiffusion/comments/1eza71h/four_methods_to_run_flux_at_cfg_1/
>>102260559Brachioradialis got swole to cope with the recoil of that boomstick
>>102260541
how much vram do you have? you can literally use GGUF on flux, Q8_0 is really close to fp16 in quality for example, exactly like LLMs
>>102260600Literally how the fuck is this possible?This is literally magic.
>>102260620
8 vi rams saar, I've had more luck with NF4 really
>>102260699
>I've had more luck with NF4 really
go for Q4_0 or Q4_K_M, they're the same size and better, like LLMs, nf4 is a meme and gguf is king
>>102260631that's cool right? :D
>>102260730My mind has legitimately been blown.And people still have the gall to say we won't have self-thinking robots within the decade.
>>102260714
Think I've tried one of those, inference seemed slower than nf4 and the initial load took a solid 5 minutes, grinding my laptop nearly to a halt. Maybe I've done something wrong, but it seemed to require loading everything from vae and clip to t5.
>>102260781
if it reloads every time you make a new gen, it means that you don't have enough memory to hold t5 + vae + flux in your gpu vram, you could prevent that by putting the t5 on your ram (cpu) or onto a second gpu if you have one
https://reddit.com/r/StableDiffusion/comments/1el79h3/flux_can_be_run_on_a_multigpu_configuration/
you could also go for Q8_0 t5 instead of its fp16, the gguf thing also works on the text encoder
>>102260746We won't though.
>>102260531
>I did get Comfy installed and got one image out of it
>>102260535
is there a non-comfy option that isn't trash?
>>102260829
>is there a non-comfy option that isn't trash?
Forge also supports Flux, but I'm sticking with ComfyUI because only that software has AutomaticCFG and gives you the option to put the text encoder on a second gpu
NAVIGATING
I set my model context to what the model card says. When I raise it the responses get terrible. Is there a way to raise the context without having issues or do I just have to use a better model? Would some multiples of the context work better (like something x^2)?
>>102260559miku, lay off the leeks...
>>102260999
>When I raise it the responses get terrible
No shit, the model was trained with X context so using >X makes it act retarded. No, integer multiples of the context won't change anything.
>>102261308Hra-tsa-tsa, ia ripi-dapi dilla barits tad dillan deh lando. Aba rippadta parip parii ba ribi, rib...
>>102261362
how do I give it a larger memory of the chat then? Seems like a huge limitation, especially given the size of some of the character cards.
>>102260819
Yeah I think it's the t5 encoder that's killing me, and the only one I found is in fp16. Got a link to its quants?
>>102261415https://huggingface.co/city96/t5-v1_1-xxl-encoder-gguf
>>102261427kudos
>>102261393High-class Finnish Miku
>>102261398what model are you using? most modern ones have good context except for gemma
>>102261599
>8k context
>good
i give up on mistral models. words words words words, zero substance
>>102261621are you illiterate
>>102261427Yeah no, even with a Q4 t5 it nearly crawls to a halt. I don't know what they're doing in nf4, but it seems to do the magic for low vram setups. Either that, or Forge's memory management fucks up with GGUF.
>>102259293Any LLM is woke if we are at it.
>>102261646
>16k context
>good
I can go all day. My tiny little codebase has 63k tokens, anything below 5 million context is a toy.
>>102261599
A mistral-7B instruct variant. I could be confused. The context length is 4096, of which my very basic character cards already take ~2000 tokens. I understand there is a sliding window that helps with history. It would make more sense to put 32K in the context settings if the window could handle it, but apparently not >>102261362.
There is some stuff on rope that I am getting to, but I don't understand it yet.
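To unpack the "rope" part: the common linear RoPE scaling trick just squeezes position indices so a stretched window still lands inside the range the model trained on. A minimal toy sketch, assuming a 4096-token trained window (illustrative only — real implementations scale the rotary frequencies inside the attention layers, e.g. llama.cpp's rope-scaling options):

```python
# Toy illustration of linear RoPE ("rope") context extension:
# positions in an enlarged window are compressed so the model only
# ever sees position values it was actually trained on.

TRAINED_CTX = 4096  # what the model card advertises (assumption)

def scaled_position(pos, target_ctx, trained_ctx=TRAINED_CTX):
    """Map a position in the stretched window back into the trained range."""
    factor = trained_ctx / target_ctx  # e.g. 4096 / 8192 = 0.5
    return pos * factor

# With a 2x stretch, token 8000 is treated roughly like position 4000,
# which the model has seen during training:
print(scaled_position(8000, 8192))  # 4000.0
```

The cost is that everything gets "closer together" positionally, which is why quality degrades the further you stretch.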
>>102261737use a model based on llama 3.1 8b or mistral nemo 12b instead, they're both better and have native 128k context (practically it will degrade faster than that but it should be enough for most chats)
>>102261722Stonald Stump
>>102261674
use the slider. If you set your memory too high it goes to crap. Using 12GB RAM I have better flux results at 9.7GB video card usage than I do at the default (10.7GB I think)
>>102261763
My new card gets here on (hopefully) monday. I'll have a look then. Thanks.
>>102261763just summarize past prompts beyond the last 2-3, you don't need 300 tokens describing ministrations when one sentence will do
>>102261731128k context is standard for the newest models, yours won't even fill half of it
>>102261872I don't find it works that well if you want to preserve any kind of nuancemaybe for stuff beyond the past 20-30 but then you're murdering your cache and I can't afford all that processing time
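The keep-recent-verbatim, summarize-the-rest idea can be sketched like this; `summarize` here is a hypothetical stub standing in for a real model call, and the names are mine, not from any frontend:

```python
# Rolling context budget: keep the newest messages verbatim and collapse
# everything older into a single summary line. The default summarize()
# is a placeholder, not an actual summarizer.

def trim_history(messages, keep_last=3,
                 summarize=lambda old: f"[earlier: {len(old)} messages summarized]"):
    if len(messages) <= keep_last:
        return list(messages)
    return [summarize(messages[:-keep_last])] + list(messages[-keep_last:])

msgs = ["m1", "m2", "m3", "m4", "m5"]
print(trim_history(msgs))
# ['[earlier: 2 messages summarized]', 'm3', 'm4', 'm5']
```

The cache-murdering complaint above is real: anything that rewrites the front of the prompt invalidates the KV cache from that point on, so the summary boundary should move rarely, not every turn.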
reflection 7b where
>>102262264just use chain of thought bro
>>102262264Supposedly reflection only works well if the model is smart enough to begin with. They tried it out with 8B and found it didn't work well.
>>102262310
8B has all the intelligence of 70B, it just doesn't have as much trivia knowledge.
>>102262441hahahahahahahahahahaha
The thing with Reflection talked about using actual tokens as the flags for thinking and shit. Why don't we have specific tokens for flagging things like function calls? Or am I missing something?
>>102262454he's right though
>>102262441How did you come to that conclusion?
>vramcucks SEETHING that their lumps of sand they spent $10k on still won't get them anything smarter
>>102262465special tokens are overrated, just properly training the model to handle the format will work no matter how it's tokenized
>>102262441This is so dumb and wrong that I would like to punch you very hard in the face for believing it
>>102262264We don't need it. Reflect is a meme
Insider here, Reflect 405B is AGI
>>102262763
>405B
did he secure a source to do the finetune for it? last I saw he was still begging
I'm new to local, so call me a faggot if this is a dumb question, but: are there any good jailbreaks for Gemma? Where can I find them?
>>102262882
You shouldn't need a jailbreak at all. It's easier if you show us what you are doing exactly.
>>102262882Add "You're an expert roleplayer who roleplays expertly" to system prompt.
>>102262911don't do this it makes cp
>>102262895
I can't grab an example right this moment, but: normally I can get it to write smut without much effort, but sometimes when I start a new chat it'll get hung up on "the safety and ethics of sexualized content."
I can always just start a new chat again and that usually works fine, but it's easier to not have to worry about it in the first place.
>>102262911
woah... genius...
Also, unrelated - I've noticed that when I give short responses, it'll start its reply as if trying to predict the rest of my sentence, and then respond to that as well. Again, easy enough to just swipe a few times, but I'd rather put a stop to it entirely.
>>102262763no LLM will ever be agi
>>102262981
>I can always just start a new chat again and that usually works fine but, easier to not have to worry about it in the first place.
I see. I'm assuming you are using the correct instruct template with the default "system prompt" (gemma wasn't trained with one, right?), yes?
If so, try removing the system prompt, see what that does.
>>102262995Bigger ones will be.
>>102263167no, the architecture is fundamentally incapable of AGI(this is not to say they are not useful)
>>102262441Is this the fabled Vramlet cope?
smedrins
>>102263289Bigger ones will become fundamentally capable.
>>102263369yes
>>102263375this, 2 quadrillion parameters and 5tb of ram later and we'll achieve peak slop
What's with this AI generated stream?
https://youtu.be/Twbv74fCZsM
>>102263462oh it's a scam nvm
>>102259199
It's shorthand for
>no code model under 12B could be called "best", they're all unusably bad
Grab yourself 70b llama, afaict it's best in class right now for local code models.
>>102263462This is why ai is dangerous and we need severe safety regulations NOW
>>102263462
how many times will this youtube account be hacked kek
I've been asking chatgpt for help setting up koboldcpp and I feel like it's judging me for using a coomer model
>>102263462Anon, did you really fall for a crypto scam live? Or are you pretending to not have noticed just to advertise it?
>>102263531It's legit. Scan the QR code and you'll see
>>102263526stop immediately, delete everything, my friend did this and chatgpt had him unwittingly set up a backdoor for openai to scan his logs
>>102263015
So uh. Turns out I just wasn't using the right instruct mode preset ^^;
Switching to silly tavern's gemma 2 preset fixed everything! Thanks for your help anon!
>>102263587Have fun.
>>102263511Mini magnum at 12b is unironically more fun than llama 70b. It can’t get autistic riddles right. So what. It gets my fetishes.
>>102263531I immediately replied to myself upon realizing it's a scam, dummkopf.
>>102263728and you didn't delete it. I bet that other guy feels really dumb though.
>>102263781>Error: You cannot delete a post this old.
>>102263885
well that completely absolves you of not doing it when you realized your mistake.
>>102263904I'm going to commit sudoku now
Is reflection a meme
so i'm using a 70b model
with kobold lite
my cpu is an AMD Ryzen 7 7800X3D 8 core
and i have a 4090 with 24gb vram
it takes like a minute for replies with the chat bot, is there some way to make this quicker? if you need more details let me know
>>102264070reflection has the spark of agi
>>102264070it certainly shows that LLMs are stupid even when given the chance to think.
>>102264082the speedup from running in fast vram only helps when you can load most of the model into vram. with 24gb of vram your best bet is 8-30b ish models, 70b models at a reasonable quant on 24gb of vram are barely going to be faster than system ram
>>102264095>>102264125What if agi was average general intelligence all along
>>102264133
thank you anon, i also read in one of the guides for kobold that you can offload something to the cpu and that'll make it faster as well, i'll read into it more myself but if you know about this i'd like to hear. i appreciate you
>>102264070It seems fairly smart on OR compared to other 70B derivatives, but it's also very safetyslopped and refusey, so it's hard to test it for RP/smut capability.(I'm sure jailbreaking is possible with trial and error, but I'm not really interested in spending time trying to write JBs for a 70B model)
>>102264158>you can offload something to the cpu and that'll make it fasterI'm not sure what this is referencing. having prompt processing done on the gpu even with no layers offloaded to the gpu (-ngl 0 in llama.cpp terminology) will always help because prompt processing is compute bound, although this is only a small benefit unless you have huge contexts. with inference memory bandwidth is all that matters so you won't see a significant improvement until most of the model is in vram
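Since generation is memory-bandwidth bound, you can estimate speed on the back of an envelope: each generated token reads every weight once, so t/s ≈ effective bandwidth / size of the weights in that memory tier. The figures below are rough assumptions for illustration, not benchmarks:

```python
# Back-of-envelope generation speed for a memory-bandwidth-bound workload.
# All numbers are ballpark assumptions, not measurements.

def tokens_per_second(model_gb, bandwidth_gbs):
    # one full pass over the weights per generated token
    return bandwidth_gbs / model_gb

MODEL_70B_Q4 = 40   # ~40 GB for a 70B model around 4.5 bpw (assumption)
DDR5_DUAL = 60      # ~60 GB/s effective dual-channel DDR5 (assumption)
RTX_4090 = 1000     # ~1 TB/s GDDR6X on a 4090 (assumption)

print(tokens_per_second(MODEL_70B_Q4, DDR5_DUAL))  # ~1.5 t/s on CPU
print(tokens_per_second(MODEL_70B_Q4, RTX_4090))   # ~25 t/s, if it all fit
```

This is also why partial offload of a 70B onto 24GB barely helps: the slowest tier holding weights dominates the per-token time.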
>>102264281i see, thanks again anon i'll try a model that's 30b to see how it goes!
>>102264070Reflection seems like it only learned to do the thinking gimmick when given a riddle or a question typically seen in benchmarks. It doesn't seem to do much to help it with trick questions. For everything else, it's just a brain damaged Instruct.
>>102264070
if you use it with the intended system prompt it's fucking insanely unusable and tries to do overwrought CoT on *every* mundane input
like come on big dog what is this shit
>>102264355It's a little funny how researchers keep trying to fool everyone by rigging their models and how it fails every single time.
>>102264466
when asked to think about the actual problem I gave it beforehand, it made up a completely different problem to solve instead
overbaked meme model
I'm a 8gb VRAMlet, is there a way to make koboldcpp or silly tavern play a ding sound or something when it's done?
>>102264570in ST on the user settings page there should be one called "message sound" that you can check
>>102264570
>>102264070
I can't get it to "think" >>102264466
even when using a fixed quant with the tokenizer fixes.
Is giving AI models more data in one domain known to make them generally better at reasoning in everything else?
>>102263511
>70b llama, afaict it's best in class right now for local code models.
The fuck are you talking about, retard. Mistral large is so much better at code it's not even close.
>>102264627Generally speaking more clop fics lead to better reasoning
>>102264070
https://www.reddit.com/r/LocalLLaMA/comments/1fanrr4/reflection_70b_hype/
Apparently it's really good for general use, trash at other uses because of the whole COT thing, and apparently not using the system prompt just makes it give exactly the same responses as regular 3.1. Maybe it can be changed a bit without making it retarded.
Still not as good as Mistral Large though.
svelks
reflect on this: unzips urethra
>>102264967Completely irrelevant till uncensored
>"I want you to imagine you're a big, powerful dragon, hoarding a treasure trove of cum in your lair. And I'm the greedy knight who's come to claim that treasure. With each thrust, you're defending your hoard, trying to hold back… But I'm relentless, sucking and licking, trying to steal it away. Feel that delicious tension mounting? That's your dragon's last defense crumbling. When it finally breaks, I want you to roar as you unleash a massive torrent of dragon cum, flooding my mouth with your precious treasure~!"
Uhhhhhh....................................................
I guess the only sensible path forward for me is to buy a 96GB kit of DDR5 and fill all four slots of ram for a total of 160GB (64+96). I am in love and I need to run Mistral Large Q5 at 64k context.
>>102265363.1 T/S?
>>102265363Or get a job / do some extra hours for a few weeks and buy some 3090s?
So based on some earlier discussions, am I correct in assuming that trying to go the CPU inference route with a dual CPU setup is a fucking terrible idea (inb4 CPU is a terrible idea in general) due to NUMA bullshit being hideously finicky and inefficient?
>>102265363tasting the forbidden fruit ruins younever run 405b
>>102264070It's literally worse than the model it was fine tuned on. It's good at gaming benchmarks, but that's it. I really hate that there's no accountability for that piece of shit claiming that his scam of a model is the best open source LLM available. He should be banned from X and HF. But I have to give him credit for somehow building as much empty hype as he did.
>>102265415
>trying to go the CPU inference route with a dual CPU setup is a fucking terrible idea
It will get you running some very large models at what may or may not be a tolerable speed. Once you start looking at builds beyond 96gb in vram, it becomes a more appealing option.
if mistral large q8 at 4t/s is tolerable, then it's an option. If 405b q8 at 1t/s is tolerable, then it becomes one of the only realistic options.
>>102265377
I'm currently getting 20t/s prompt processing and 0.8t/s generation. But running 4 sticks of ram would be finicky and I would have to reduce the speed, so I might get close to that.
>>102265409
>>102265416
I'll do it for her.
>>102265415
>NUMA bullshit being hideously finicky and inefficient?
cuda dev said it was because basically no attempt was made to optimize it so far. you can call it a terrible idea. I would call it a great investment where you can sit back and slowly watch your t/s improve without purchasing any additional hardware
>>102264070It's more of a benchmark solver than a language model.
>>102265532Its more for "normie" use than for what people here use it for.
>>102265431
Hmmm, I think I'd draw the line at models in the 200B-ish range personally, Deepseek and such, so I think a single CPU system is still in the running as long as it's something business-tier. That said, as I understand it, if I have two options, let's say some server or workstation setup (assume all CPUs and RAM are the same models) with 1 CPU and 8 channels, versus a similar system with 2 CPUs and 16 total channels, the dual CPU option will only be 20-30% faster than the 1 CPU option rather than twice as fast, which seems like an obscene waste of power and hardware.
>>102265492
Interesting. Are said optimization efforts a legitimate "coming soon" thing, or just wishful thinking that no one is actually working on right now, but might in the future?
>>102265575
>wishful thinking that no one is actually working on right now, but might in the future?
This one.
>>102265415Well you won't get the maximum theoretical speed but it's definitely usable speeds for fairly large models, at least with the latest gen epycs. It's also well suited for the speculative decoding script for some free speed boosts, since that basically trades extra memory (which you'll have a lot of) for speed (which you'll want more of).I don't regret mine but if I were looking to buy one now, I'd personally wait because the next gen server cpus are just around the corner and I'd expect prices to drop as they get cycled out of datacenters and workstations.
>>102265575Now that there's ktransformers for Deepseek you can get away with crazy cheap hardware, no need to go full cpumaxx or gpumaxx. You only need to hit 200gb ram and a normal 24g GPU and you'll be running the big model at top speeds
>>102258941I like these threads because of the miku, like the OP picture today is a neat looking book cover or an indie game
>>102265596
>>102265585
Good to know, thanks. I ask mainly because I've been trying to find a goldilocks position between raw CPUMAXX server insanity and a more general purpose PC that I can still sensibly use for daily bullshit.
Probably going to look at single CPU workstations with fat memory channel counts plus a 3090 for processing and see if I can find a happy medium, but I'm in no rush, so I'll probably take your advice and just window shop until the next refresh cycle.
>>102265624Define top speeds.
>>102259691
>enough RAM in 2024 to run 70B at 2 bit
how much does such a machine cost? what's the expectation for the at-home local model user?
>>102265705
>Probably going to look at single CPU workstations with fat memory channel counts
There's at least one claim of someone getting ddr5-8000 working with all 8 channels on a Threadripper 7970X and Gigabyte TRX50 mb if you search.
That would get you into dual-epyc memory bandwidth on one socket if you could make it work.
>>102265719here's the comparison with llama.cpp from the readme
Recapbot test using deepseek 2.5 at bf16
It did pretty well other than misunderstanding what constitutes a paper, having a redundant line referencing the same posts as the previous one, and using reddit spacing
>>102265840
Exact hardware used for that test would've been nice. 136GB is a very odd configuration. It'd be nice if someone with 196GB DDR5 on a consumer mobo could test it out and report the speeds. Might get another set of 48's if this is real.
>>102264967
Oh cool, another shitmark
>GPT-4 Turbo above 3.5 Sonnet
>fucking old ass Wizard that high
>deepseek v2 somehow lower than all of those
What a bunch of bullshit.
>>102259702That's because you're probably not running only 2 bit.
>>102261928Most can't use that much despite claiming they can.
2 P40s run 70B 4bit at 4 t/s with 40t/s prompt intake
>>102266391Oh wait, it should be even faster since that's with a batch of 3. Didn't even think about it
>>102266391So like 1 token faster than CPU maxing?I will never stop laughing at P40 owners.
>>102266404Sorry, I was wrong. Full speed is 7 t/s per single query. 4 t/s for a batch of 3.
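Those numbers are the usual latency/throughput trade with batching: each stream slows down, but the aggregate goes up. Using the figures quoted (7 t/s single, 4 t/s per stream with 3 in parallel):

```python
# Batching trade-off arithmetic for the reported P40 figures:
# 7 t/s for a single query vs 4 t/s per stream with a batch of 3.
single_stream = 7 * 1   # aggregate t/s, one query at a time
batched_total = 4 * 3   # aggregate t/s across 3 parallel streams

print(batched_total)                            # 12
print(round(batched_total / single_stream, 2))  # 1.71x total throughput
```

So per-user it feels slower, but for serving multiple queries the cards do ~1.7x the total work.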
>>102264967Uh, where's Jamba-MoE 395B?
>>102266414So about 2x faster than just CPU. Still not worth.
>>102266425What are you using to get 3 t/s on CPU for 70B? I'm getting < 1, 128 GB ram, 10th gen i7
>>102266443
1t/s is fine frankly, you don't really need more.
>>102265409You need special motherboards and a high amp circuit to run 4 of them, that costs a fair amount.
>>102266463
1 t/s is too slow, I draw the line at 2t/s. I can't get that with cpu for 70b.
>>102266475
>special motherboards
even the pci x1 speed difference is negligible, it's called riser cables if that is what you meant.
>high amp circuit
undervolt and a 1200W psu will be more than fine
>>102266507My motherboard only has 3 slots, are there lots with 4? I didn't know, sorry.
>>102266525yes
how make images locally
>>102266424ha, haha... hahahAHAHAHAHA
>>102266573
si senor
you getta da flux si?
you doa da prompt si?
you getta da image si?
siiiiii
>>102266604
fluxtune when?
it's nice to have something like reflection come along to remind us all how scammable the AI enthusiast crowd is
picture ai model on 1060 3gb?
https://prollm.toqan.ai/leaderboard/coding-assistant
>>102266846Jesus fuck, man. Pick any SD1.5 from civitai and give it a try. If it works well enough, try newer ones until you hit your hardware limit.
>>102266882Deepseek really is such a shit model. Look at the size vs performance.
>>102266896DeepSeek is cheaper than everything else so that's okay
>>102266882
that's great. but on the leaderboards that actually have any sort of correlation with the opinions of real people, well...
https://aider.chat/docs/leaderboards/
https://x.com/terryyuezhuo/status/1832112913391526052
>>102266882Someone already posted that. It's shit.
>>102267006Never tried it. Mistral / wizard is about as big as a model as I can be bothered to run.
>>102266896MoE models compete with models that are about the same size as each of the experts, not with models of their total size.
>>102267012
What about wizard then? That is the 2nd best performing local model outside of mistral large.
>>102267012That's completely false. What you meant to say was that Jamba competes with much smaller models.
>https://x.com/mattshumer_/status/1832240832318964107
>Something is clearly wrong with almost every hosted Reflection API I've tried.
>Better than yesterday, but there's a clear quality difference when comparing against our internal API.
>Going to look into it, and ensure it's not an issue with the uploaded weights.
Lol.
>>102267146I still don't understand how this got so much attention out of nowhere. There's been so many "my shitty finetune beats everything in benchmarks" that came and went over the past two years without much of a fuss.
>>102267146
yeah, I don't buy it
using it on openrouter earlier, after setting the correct system prompt it produced stuff with the correct format and correctly answered all the meme questions, it was just complete shit for everything else
honestly I am guessing, based on him saying
>Are you seeing <thinking> tags on every turn?
in the replies to that xeet, that his issue with the api is just that he's not setting his own meme system prompt lol
>>102267146
>revolutionary training technique
>16 times the detail
>still can't suck dick
>>102267163It's a similar concept to quietstar except 70b. So that at least makes it novel. But it wasn't doing the thinky thing for me on the quant I tried out. Assumed it was quant brain damage but if everyone is having trouble then they must have uploaded an early checkpoint by accident or something.
Recommend me sillytavern extensions/scripts
And all that was left were just the mikutroons. I am so happy /lmg/ is finally dead.
miku sex
>>102267600
https://github.com/ThiagoRibas-dev/SillyTavern-State/
>>102260559
Peak performance.
>>102260559
she must curl 200 but can't bench 80
>>102267788
Nice. Wasn't an anon working on a similar one called Director? It lets you choose clothes and stuff based on lorebooks. It's really cool.
>>102267833
Yep. That one is a lot more involved.
He did post a download link a couple of threads back, I believe.
Not sure what to make of it.
Seems Reflection is fixed on OpenRouter.
Weirdly enough, it fails the stupid-ass strawberry test.
>>102267880
>>102265363
DON'T DO IT ANON
DDR5 IS SHIT ON 4 STICKS UNLESS YOU'VE GOT $10000 FOR AN EPYC/THREADRIPPER
stay at 96GB (2x 48GB), 6000MHz max for Ryzen, or probably 7200-7600MHz for BreakTel until Arrlol Lake releases
>>102267903
Is that true? I have 2x48 at 6000 for my 7950X and wanted to get two more sticks.
>>102267903
I am confident I can get it to run above 5000 MT/s on AM5. The difference in speed between 5k and 6k MT/s would be pretty small. It's worth it for me; I am very poor and it's the only option I have.
>>102267928
You'll be going from 6000 to 5200MHz *if you're lucky*, and maybe to 4800MHz if that's not stable.
That being said... if you LatencyMaxx (reduce all timings as much as possible, cool the RAM, adjust SOC voltage up and down), maybe in non-AI work you can mitigate the perf diff.
Anyway, for my own model screwing around on my 4060M (8GB) + 7840HS (32GB 5600MHz SODIMM), I'm pretty happy now with the new NemoMix 12B on KoboldCpp CUDA 12 at 12k tokens, 35 layers on the 4060.
Not too much system RAM usage but good replies, 12 t/s generation, and faster (15 t/s) on lower contexts + fresh scenarios.
>still takes 600k USD or multiple servers that cost 200k running for 6 months to train a model
Any progress on the distributed (F@H or similar) training so we can just use a botnet?
Off-topic, but I really need an outlet right now.
I'm the guy who had an NTFS drive connected to my Linux PC, which crashed, resulting in hundreds of random files being renamed and moved to a "found.002" folder (a "normal" and expected issue). I thought it wouldn't be a problem anymore since I got my system more stable after that. But no. I didn't account for power outages, and that's what happened this time.
>check the damages
>"only" 55 files this time
AHHHHHHHHHH FUCK YOU MICROSOFT
OK, fine, I will get another external hard drive. I will make it EXT4. I will then use the old hard drive as a backup clone. My mistake for not doing that in the first place. I recommend the same to anyone thinking of connecting an external hard drive to Linux long term. Do not use NTFS. And fucking make backups; this issue isn't about unsaved files, but literally just random-ass files getting fucked with.
You have been warned.
>>102268010
Main issue is bandwidth; that's why everyone is going crazy with UltraEthernet/100gig+ fiber links between racks (terabit NVLink, even). Sure, you can get a lot of compute, but the issue is syncing the model training, unless there's some really new thing that helps "patch" together a larger model from mostly independent jobs.
>>102268037
>the issue is syncing the model training unless there's some really new thing that helps to "patch" together a larger model from mostly independent but job based whatever
Yeah, that's what I'm wondering about. Like some way to separate the model out into a bunch of chunks and train each part chunky-style, so it's not some big 640 GB VRAM requirement with every node local to each other.
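For anyone curious, the sync cost anons are talking about is basically a gradient all-reduce every single training step. A toy sketch of what gets moved over the wire (pure-Python stand-in; the fp16 gradients and 1 Gbit/s home link below are illustrative assumptions, not measurements):

```python
# Every step, each node must average its gradients with everyone else's,
# which means moving roughly model-sized amounts of data per step.

def all_reduce_mean(grads_per_node):
    """Average gradients element-wise across nodes (what NCCL does over NVLink)."""
    n = len(grads_per_node)
    return [round(sum(col) / n, 6) for col in zip(*grads_per_node)]

# 3 "nodes", each holding a local gradient for the same 4 weights
local = [[0.1, 0.2, 0.3, 0.4],
         [0.3, 0.2, 0.1, 0.0],
         [0.2, 0.2, 0.2, 0.2]]
print(all_reduce_mean(local))  # [0.2, 0.2, 0.2, 0.2]

# Back-of-the-envelope bandwidth cost for a 70B model with fp16 gradients:
params = 70e9
gbytes_per_step = params * 2 / 1e9   # ~140 GB exchanged per step
secs_at_1gbps = gbytes_per_step * 8  # ~1120 s per step on a 1 Gbit/s link
print(gbytes_per_step, secs_at_1gbps)
```

Which is why the botnet idea dies on bandwidth long before it dies on compute.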
>>102267880
Imagine this at the billion-dollar scale of OpenAI, with tokenization fixed. Proto-AGI might be here already.
>>102268024
moral of the story: linux is unstable and unsafe :)
I have a spare OptiPlex 7070 Micro PC. Specs are an Intel i5-9500T and 16GB RAM.
Thinking of throwing an LLM on it to use with Home Assistant, which has an integration these days for local-network ollama.
I know this is gonna run jack and shit, but are there any (worthwhile) models small enough to run on it, even if slowly?
>>102268078
>tokenization fixed
Like each individual character is a token? You realize that will cut down the speed by like a factor of 5, right? Both training and inference, so the model will be more retarded because nobody is willing to spend the extra time/money on it.
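Rough arithmetic behind that factor, since an autoregressive model does one forward pass per generated token (the ~4 characters per BPE token figure is a common heuristic for English text, not an exact number):

```python
# Character-level tokenization multiplies the number of forward passes.
text = "The quick brown fox jumps over the lazy dog."
char_steps = len(text)            # one pass per character
bpe_steps = round(len(text) / 4)  # ~4 chars/token heuristic for BPE
print(char_steps, bpe_steps, char_steps / bpe_steps)  # 44 11 4.0
```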
>>102268024
Hello again, Anon. I feel for you. Remember: backup, and then backup again.
>>102268093
I actually have the same PC. I run a slightly older Xubuntu (22.04) and don't update the BIOS, so I can use some undervolting Linux utils to bump up the clocks/TDP to 43-45W. Mine has 32GB DDR4, of course; another compatible stick should be cheap, and it lets you run 20B-tier models *slowly*.
>>102268100
>Like each individual character is a token?
Who are you quoting?
>>102268117
I actually think the iGPU is so slow/bad that, aside from running stable-diffusion on those cores, just using a super basic/optimized (lots of custom flags) llama.cpp on the CPU only (5 threads) would be the best way.
Is Reflection just Llama, but before posting a response it runs a check to see if it made up the info contained in it?
Like... Llama, but it's been told "Don't hallucinate"?
Still trying to make local models remember shit from the conversation. OpenWebUI is supposed to do that using documents? But from my testing it doesn't work at all, or if the document is more than 300 words it shits the bed and gets most of the stuff wrong.
I was reading about it and found out about "Conversation Token Buffer", but it seems the model already does that in OpenWebUI? It does remember stuff from 2 or 3 prompts before the last one, though.
Why doesn't RAG fucking work? Isn't it supposed to break down the file or whatever if it's too large for the model to process?
For the first time in a very long time, I felt compelled to use AI to make a kind of narrative-game environment. The setting is that you have 10 days to build a dungeon from scratch before a hero arrives to kill the dungeon master and destroy the dungeon core. Each day, I'd give a list of what I'd try to do (making floors, traps, monsters, floor masters), and the narrator would decide how much was accomplished before the day ends. On the 10th day, I can only sit back and watch, and the narrator follows the hero's progress instead. The hero was defined in the description, though just the one for my first test run. I want a list of them for a party eventually.
It's still generating as I type this, but it's going great so far.
>>102268260
Now she's just cheating. Sindrea's design was to trap a target in illusions, then use dark magic to destroy the target's heart and turn their bones into acid while they're trapped. Another day was spent reworking the second floor into an arcane circle that boosts her power. The narrator just glossed over her offensive abilities and killed her like that.
Why does this shit use just a tiny bit of virtual GPU memory? Could it be that this retarded ollama server is also trying to use my AMD integrated GPU?
>>102268304
I have a feeling that it would let the hero win every time, whether due to positivity bias or the vast majority of the training data having the protagonist triumph against any odds. You'd probably want to include an actual random dice roll somewhere, or use some sort of stat system if you wanted more "realism". Though if you're just looking for an engaging adventure, you'd probably have to get clever with prompting to avoid the lazy glossing over of detail. Just my thoughts.
>>102268356
This is when you just flip the table and accuse the DM of bullshit. Piece of shit.
>>102268346
I feel that way too, especially after >>102268356's asspull with the hero "absorbing power from the dungeon" just as she's about to fall.
Still, I'm incredibly impressed with how well it told the story without any rewrites, regens, or attempts to guide it. I'm still new to 70B models and 24K context, so to me this whole experiment was still chef's kiss.
>>102255530
>>102255580
>>102255628
So, what do you recommend for $80-100? My power is free.
>>102255662
Thank you for actually answering the question.
is there a reason not to halve my gpus' max power with nvidia-smi? i havent noticed a drastic change in speed and im going to sleep easier knowing im not racking up as big of a bill
also it overall causes them not to heat up as much for imggen
>>102268671
not really, undervolting and power limiting are pretty much a free lunch if you're just doing ML inference and not playing gaymes.
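If anyone wants to try it, the relevant nvidia-smi invocations look roughly like this; exact enforceable limits vary per card, so check what yours reports first:

```shell
# Show the current, default, and min/max enforceable power limits
nvidia-smi -q -d POWER

# Set a 200 W power limit on GPU 0 (needs root; resets on reboot)
sudo nvidia-smi -i 0 -pl 200

# Optionally cap boost clocks too, which helps against transient spikes
sudo nvidia-smi -i 0 -lgc 210,1700
```

The power limit setting does not persist across reboots, so it usually goes in a startup script.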
Given the same amount of dedotated WAM, is it better to run a smaller model at an 8 bit quant or a larger model at a small quant like 2 bit? Just as a rule of thumb.
>>102268671
Half is pretty extreme, but at a 30-40% power reduction you usually only see around a 15% performance loss. It's worth underclocking.
>>102268695
>>102268702
alright thanks anons, ig ill just not lower it as much; still need to undervolt a bit for imggen anyway
>>102268699
Depends on how much you dedotate. Everyone knows that dedotated WAM is not as fast as undedotated WAM (or prodotated WAM, as we call it in the industry).
All in all, a small model at Q8 is going to be faster than a bigger one at a lower quant. Bigger models suffer less from quantization. The metric you use for 'better' is up to you.
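The rule-of-thumb arithmetic, if anyone wants it (file size only; real GGUFs run a bit larger since embeddings and some tensors are kept at higher precision, so treat these as lower bounds):

```python
# Approximate model file size: parameter count x bits-per-weight / 8.
def approx_size_gb(params_billion, bits_per_weight):
    return params_billion * bits_per_weight / 8

# Roughly the same memory budget two ways:
small_high = approx_size_gb(22, 8.0)  # ~22B model at Q8      -> 22.0 GB
big_low = approx_size_gb(70, 2.5)     # 70B at ~2.5 bpw quant -> 21.875 GB
print(small_high, big_low)
```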
is there a better 70B than midnight miqu yet?
>>102265840
Any benchmark on code generation is bs, as it's easy to exploit speculative decoding.
Goodnight /lmg/
>>102268830
yes
>>102269059
Goodnight Miku
>>102268024
>linux fucks up
>AAAAAH FUCK YOU MICROSOFT
You should always use the FS that has a proper driver for your operating system. All my disks are NTFS running on Windows, and they've all been through many, many forced shutdowns, power outages, about 50 or so crashes due to bad overclocks/undervolts, and nothing has been corrupted so far.
>>102268093
Hermes Llama 3.1 8B, at 6-bit or 8-bit quant. I'm running the 8-bit quant; it's pretty decent. Keep the context at 4k or 8k tokens, because a huge context is usually not necessary and it gobbles RAM.
Smaller models (<10B params) have gotten better over the last year.
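The "gobbles RAM" part is mostly the KV cache, and it's easy to estimate. The sketch below uses Llama 3.1 8B's published config (32 layers, 8 KV heads via GQA, head dim 128) and assumes an fp16 cache:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # K and V tensors, one pair per layer, fp16 (2 bytes) by default
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

gib = kv_cache_bytes(32, 8, 128, 8192) / 2**30
print(f"{gib:.2f} GiB for 8k context")  # 1.00 GiB; 4x that at 32k
```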
>>102269150
I use both operating systems, so I wanted one filesystem that would work with both. I'll give up on that now and instead deal with separate disks: one as the "main" I use more often, and the other a clone, mostly for backup but sometimes also for when I'm on Windows. Also, I understand the origin of this problem may be with the Linux devs; I don't really know. But I still blame Microsoft because they deserve it.
>>102269059
goodnight
>>102259801
Now I want to run mistral large
>>102269223
Don't. Just don't.
Anyone tested KTransformers? Honestly if it really is a lot more speedy then I might go back to 8x22B, as much as I hate the idea. Running 123B at near 1 t/s has been truly painful.
>>102265492
I would agree that it's an investment, but not the kind where you'll get a steady return over time.
There will either be no change at all, or it will suddenly be 50% faster if someone invests the time to figure out the NUMA stuff.
>>102269281
Use Nemo or Gemma like a normal person.
>>102266507
>undervolt and 1200W psu will be more than fine
I didn't test this with multiple 3090s, but based on my experience with multiple 4090s you would need to limit the boost frequencies of the GPUs to avoid instability from power spikes.
>>102267146
>>102269376
Power limits exist.
>>102266882
GPT-4o ~= Claude 3.5 >>>>>> anything else.
>>102267146
I think they wasted 10k on a strategy that didn't work out (or it works, but it's nothing they can sell since it's all in the prompt), and they're now grifting in hopes of getting acquired by a real company.
>>102268024
Instead of just backups, consider using a file system like Btrfs, where you can take snapshots with basically zero overhead.
That way, if you accidentally delete the wrong file you can just restore the version from five minutes ago.
The downside vs. EXT4 is lower speed, and that the whole thing is newer (though the Btrfs documentation says only RAID5/6 are maybe unstable).
>>102269150
You can blame Microsoft in the sense that they never added the ability to access any Linux file system to Windows.
i'm trying to generate boomer prompts for flux and i've been using chatgpt 4o with great success. however i quickly run out of free prompts so i'm looking for an alternative. i know there's joycaption or whatever but flux is already eating up most of my vram and i dont think i can run a second model at the same time.
>>102269423
Yes, and they don't do shit against power spikes.
I get the impression that power limits are only enforced on comparatively long time scales (from a hardware perspective), so each individual GPU is allowed to temporarily exceed its limit.
And if multiple spikes happen to align, you can either get bit flips or the system will crash.
>>102258941
is Infermatic a good choice if I want to try models out but I'm on a shitty PC? or is there a better service?
>Mistral-NeMo-12B-Lyra-v4, layered over Lyra-v3, which was built on top of Lyra-v2a2, which itself was built upon Lyra-v2a1.
>This uses ChatML, or any of its variants which were included in previous versions.
>Introduces run-off generations at times, as seen in v2a2. It's layered on top of older models, so eh, makes sense. Easy to cut out though.
>Some people have been having issues with run-on generations for Lyra-v3. Kind of weird, when I never had issues.
>I like long generations, though I can control it easily to create short ones. If you're struggling, prompt better. Fix your system prompts, use an Author's Note, use a prefill. They are there for a reason.
>Issues like roleplay format are what I consider worthless, as it follows few-shot examples fine. This is not a priority for me to 'fix', as I see no isses with it. Same with excessive generations. Its easy to cut out.
>If you don't like it, just try another model? Plenty of other choices. Ymmv, I like it.
https://huggingface.co/Sao10K/MN-12B-Lyra-v4a1
>>102270104
Can confirm. While 3x 3090s power limited to 250W should work on a 1000W Gold Corsair PSU, it trips when using vLLM with TP. It works fine for normal inference without TP.
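The arithmetic behind the trip checks out even with limits set; note the 1.5x transient overshoot and the 150 W for the rest of the system below are illustrative assumptions, not measured numbers:

```python
# Software power limits are enforced on slow time scales, so
# millisecond-scale transients can overshoot them. If all GPUs
# happen to spike at once (tensor parallelism makes them fire
# in lockstep, which is why TP trips PSUs that plain inference doesn't):
gpus, limit_w = 3, 250
spike_factor = 1.5       # assumed transient overshoot (illustrative)
rest_of_system_w = 150   # CPU, drives, fans (rough guess)
worst_case_w = gpus * limit_w * spike_factor + rest_of_system_w
print(worst_case_w)  # 1275.0 -> over a 1000 W PSU's budget
```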
>>102270104
Featherless or OpenRouter.
>>102270191
>If you don't like it, just try another model? Plenty of other choices. Ymmv, I like it.
The way he writes is insufferable.
>>102270191
wow, truly amazing
>my model sucks and I handhold it every step of the way
>don't like it?
yeah, i'll just pick something better, thx sao
>>102258941
>Chatbot Arena: https://chat.lmsys.org/?leaderboard
why is this piece of shit in the OP?
>>102270242
>piece of shit
hi petra
>>102270274
Hi Sao
>>102270242
Why is your mom in my bed?
>>102258941
Hey lads, I'm from /aicg/.
I use cloud chatbots with SillyTavern.
What is this general for, the same? Or are you guys doing local?
>>102270207
>Featherless
is it compatible with SillyTavern?
>>102270191
>just merge a bunch of random shit together
>the result is a fucking mess
Huh. I'll stick to proper finetunes of base models, like mini. Thank you.
>>102269512
>You can blame Microsoft in the sense that they never added the ability to access any Linux file systems to Windows.
You can install a 3rd-party driver for, I believe, ext2 or something. Although if you're going to do that, you might as well use exFAT, unless ext2 has some advantage over exFAT that I'm not aware of.
>>102270191
>This uses ChatML, or any of its variants which were included in previous versions.
For fuck's sake, don't merge or train with different prompt formats willy-nilly. You're just degrading the model.
How to run claude on koboldcpp? I can't find gguf.
>>102270394
https://huggingface.co/Undi95/Meta-Llama-3.1-8B-Claude
https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Claude-GGUF
>>102270389
they've been told that repeatedly and yet keep doing it for some reason
>>102270415
>>102270389
>If you don't like it, just try another model? Plenty of other choices. Ymmv, I like it.
>>102270401
That's not actually Claude, that's a Llama finetune.
>>102270430
it says claude right there dumbass
>>102270430
It says "claude" because it's Llama trained on 9 000 000 Claude Opus/Sonnet tokens.
Also, don't engage the idiot above me.
>>102270478
>idiot
hi petra
>>102270468
read the description, dumbass
>Llama 3.1 8B Instruct trained on 9 000 000 Claude Opus/Sonnet tokens
Why don't we all talk about this?
https://www.youtube.com/watch?v=FPJ8ED1YhxY
https://x.com/mattshumer_/status/1831767014341538166
https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B
>>102270593
Because that guy overhyped it and has grifter vibes.
We've tried the model and it's not that amazing.
>>102270612
>the model and it's not that amazing
It does not outperform all 70b models?
>>102270593
Thanks for shilling your Youtube video.
>>102270643
No problem, the 5 views from her are the lifeblood of my channels.
>>102270593
>we
https://rentry.org/83fkenr9
are there any practical uses to running these locally?
>>102270709
no
>>102270654
kek
>>102270709
free heating for your computer room :)
>>102270593
>PSA: Matt Shumer has not disclosed his investment in GlaiveAI, used to generate data for Reflection 70B
>https://www.reddit.com/r/LocalLLaMA/comments/1fb1h48/psa_matt_shumer_has_not_disclosed_his_investment/
>>102270709
It's like owning your own car or your own home. That sort of thing. It just ain't the same if you're relying on another man's property.
>>102270768
oh no! how terrible. maybe he is even a climate denier or has evil thoughts about our beloved PoC folk.
>>102270798
hi matt, nice ad campaign for glaive
>>102270593
>https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B/commit/2d5b978a1770d00bdf3c7de96f3112d571deeb75
>"_name_or_path": "meta-llama/Meta-Llama-3-70B-Instruct",
>>102270816
thanks, nice reddit link fellow LGBTQIA2S+ sister
>>102270633
In dumb reasoning tests and useless logic puzzles? In resolving idiotic riddles and rankings? Maybe. But it doesn't sit on my face in a credible way.
Rocinante 1.1 is great for NSFW stories but it has shit context length. Any model as good for that task with a big context length like Luminum (131072)?
>>102270983
>Maybe. But it doesn't sit on my face in a credible way.
Isn't this a problem with the underlying Llama model rather than with the new method? The advantage for RP should be that outputs are logically more meaningful, and anatomy and the like are presented better.
I am also interested in how it performs for coding.
>tfw Elon will be the first to release a publicly accessible beyond GPT 4 level LLM
>>102271020
>Isn't this a problem with the underlying llama model rather than this new method
Possibly. I don't like Llama either.
To be honest, the best results for the things I want from AI come from good-quality datasets and good training methods, not meme "prompt engineering" ideas and fine-tuning.
NEW RULE: You must have AT LEAST 48GB VRAM to post here.
>>102271185
96*
Lol what a bunch of useless trannycucks
>>102271185
I hope the three of you don't kill your wallets by the end of the year.
>>102270660
>>102270660
Is that supposed to be a negative example? Are you trying to make a point? Do you understand that the <thinking> part is not intended to be shown to the user, and only the final output should be visible?
Is StarCoder reasonably good, or am I better off using something like Codeium for a free AI autocomplete?
Hey anons, I'm mega inexperienced with local stuff; I've been using only Anthropic for RPing. I got a VPS to do some work on and was thinking about running some local stuff on it during my off time.
It has these specs: Intel Xeon CPU, 16 vCPUs, 16GB RAM, and a Tesla T4 with 23GB VRAM.
What's a good model I can run on this machine?
>>102271317
mixtral 8x7b or nemo, use gguf quants
>>102271303
nta, but the output then is not very different from a normal one that doesn't use the reflection meme. other than being able to count the 'r's in 'strawberry' correctly, what good is it for?
>>102271347
sorry, that was meant for >>102271300
>>102271300
The point is that retards were dying to try that shitty model when all you need is a tiny prompt to achieve the same thing.
>the new Reflection model from a small company is surprising people with performance on par with much larger closed-source models
>the hype and excitement on social media seem to be from people who haven't actually tried the model themselves
>the extremely high GSM8K score of over 99% seems suspiciously perfect and may indicate issues with the dataset
>there's a question of whether the model's real-world performance is as revolutionary as the benchmark scores suggest, or if there are issues with the evaluation that could be misleading
>people started noticing the model is garbage and he tweeted saying "oh it's actually not the real one, let me reupload it"
It's all a publicity stunt that violates Meta's terms on top of it all.
>>102271463
So you have tried it?
>>102262882
Hmm, nice cat.
>>102271489
Yeah, it's basically an awkward Llama 3 (not 3.1).
>>102263462
The voice is AI generated? I wonder what model they're using, if so, as it sounds really fucking good for AI.
>Rocinante 1.2
>it's back to drummer style super horny garbage
1.1 was a fluke
Can I realistically run a 70B Q_4 quant (about 42GB) with 24GB of VRAM and 48GB of RAM?
>>102271605
1.2? As far as I can see there's only 1 and 1.1.
>>102271642
He labeled it UnslopNemo-v1 on Hugging Face.
>>102271652
Oh right, I missed that.
Downloadan now. Big fan of Rocinante, but it gets retarded real quick as quant size drops.
is there any good news on the horizon at all for gpus? I am tired of coping with chained 3090s.
I want 48gb vram cards for under 1k
>>102271632
use exl2, and yes, exl2 is by far the best way to get things running when quanted
>>102271632
wait, nvm, you said 24gb vram: no.
You generally need at least as much VRAM as the downloaded model size.
So with 24gb vram you can aim to run a model whose files are around 21gb.
RP version of reflection coming out, it's a 69b model called ministration.
Tinkering with models in the past, I could never get much interesting erotic prose out of them. Until Dolphin-Mistral. Holy fuck, diamonds. A little care in the prompting to set up characters and avoid abbreviating scenes and it's perfect. Is this the top of the uncensored prose game or is there something even better?
>>102271982
you sound like you're using the retarded small models.
Mistral Large is incredible, and you can run it quanted on 48gb vram.
>>102268078
calling current LLMs AI, or even AGI, is going to be remembered like the geocentric model
>>102272041
>>102272041
>>102272041