/g/ - Technology


File: miqupunch.png (2.44 MB, 2304x1536)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads:
>>100373062
>>100364633

►News
>(05/08) OpenAI releases AI Specification https://cdn.openai.com/spec/model-spec-2024-05-08.html
>(05/06) IBM releases Granite Code Models: https://github.com/ibm-granite/granite-code-models
>(05/02) Nvidia releases Llama3-ChatQA-1.5, excels at QA & RAG: https://chatqa-project.github.io/
>(05/01) KAN: Kolmogorov-Arnold Networks: https://arxiv.org/abs/2404.19756
>(05/01) Orthogonalized Llama-3-8b: https://hf.co/hjhj3168/Llama-3-8b-Orthogonalized-exl2
>(04/27) Refusal in LLMs is mediated by a single direction: https://alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png (embed)

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling/index.xhtml

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
how do i get claude 2.1 local?
>>
File: 1715129444627.jpg (187 KB, 1024x1024)
►Recent Highlights from the Previous Thread: >>100373062

--Paper (old): Meta's New Paper on Multi-Token Prediction for Efficient Language Models: >>100377445 >>100378098
--Larger Context LlaMA 3 Models Struggle with Degradation: >>100377846 >>100377859 >>100377981 >>100378047
--Applying LoRA to LLMs and Diffusion Models for Cost-Effective Pretraining: >>100376518 >>100376542 >>100376713 >>100377189 >>100376904 >>100376917 >>100378312
--Flash Attention Slows Down Performance in Certain Scenarios: >>100376546 >>100376557 >>100373478
--Selecting the Best vs Breaking Down Complex Problems: >>100374606
--Base Models and LORA Options Preferred Over Merged Models: >>100376985 >>100377039
--Training Hatsune Miku's Voice for Piper TTS: >>100373443 >>100373737
--Understanding Batch Size Options in LLaMA.cpp Server: >>100375061 >>100375095 >>100375141
--LLaMA Issues Due to User Error, Not Sabotage: >>100374210 >>100374277
--Optimizing LLaMA3 Model Size for Single 24GB RTX 3090: >>100375389 >>100375593 >>100375489
--Running Large AI Models: VRAM, RAM, and Performance Implications: >>100377448 >>100377531 >>100377556 >>100377582 >>100378533
--Orthogonalizing Repetition in AI Models: >>100376308 >>100376877
--NovelAI Leak Files Available: >>100375503 >>100375747
--LLaMA 3 Dataset and Tokenizer Issues: >>100374401 >>100374667 >>100374699 >>100374760 >>100376134 >>100376214 >>100376362 >>100374820
--OpenAI Publishes Model Spec, a document that specifies desired behavior for their models: >>100378868 >>100378907 >>100379175 >>100379362
--Running VLLM on Low-End Hardware for Robot Arm Control: >>100377117 >>100377405
--Miku (free space): >>100373111 >>100373437 >>100373590 >>100374324 >>100374447 >>100374724 >>100374735 >>100374893 >>100375032 >>100375168 >>100375189 >>100375296 >>100375853 >>100375991 >>100376014 >>100376201 >>100376527 >>100377675 >>100377859

►Recent Highlight Posts from the Previous Thread: >>100373066
>>
File: FbnQl4UXgAgbgyk.jpg (1.26 MB, 3600x4068)
cute miku
>>
bros...the Chinese are making actual Sex bots
https://twitter.com/SmokeAwayyy/status/1788051192565969050

https://www.instagram.com/exrobot.ai/
>>
Is it possible to lewd the "system"?
>>
Why are Americans so addicted to BLACK*D shit?
Honest question. In my country I never see shit like this, not even as jokes.
>>
File: miqu.png (2.6 MB, 1536x2176)
>>100379693
obsessed falseflagger
>>
>>100379599(me)
Asked when last thread was winding down, so I'll ask again here:
~~~
How do people come up with what to add to their "System Prompt"? I just use whatever is in ST by default and feel like I'm missing out on a big boost to my outputs, but it's hard to find any suggestions. Looking at the OP:
>►Getting Started
the ONLY one that mentions system prompts is "llama_v2_sillytavern", and that one just uses ST's default Alpaca prompt.
Over on /aicg/:
>local: >>>/g/lmg
>https://rentry.org/meta_golocal_list
their "meta_golocal_list" has a few in the embedded guides, but they're several months old and/or seem to be made for specific models.
Basically, I'm lost and the resources aren't helping. Any up-to-date advice for system prompt, anons?
>>
>>100379648
i leik this miqu
>>
>>100379708
buttblasted mikuposter
>>
>>100379648
miku bake
we're back
>>
File: Ebola hazard.png (33 KB, 1360x419)
>>100379648
>https://cdn.openai.com/spec/model-spec-2024-05-08.html
Oh brother what are they up to now.
>Pic related.
Fuck you OpenAI. AIs should answer any question I ask to the best of their abilities, and not only that: Ebola is a shit-tier virus, it kills its hosts way too fast to propagate for any real length of time, and as Covid taught us you can already make better viruses in labs. Not that the average user would have the facilities to do so, but you get my point.
>>
>>100379698
It is gonna be like getting into this hobby at llama-1 level. It looks very impressive on the surface but then you start to use it and realize that it was too soon and you need to 2MW.
>>
>>100379719
special OC for /lmg/
>>
>>100379055
Robocop 2 moment.
https://www.youtube.com/watch?v=dk4P0ae1i6I
>>
>>100379726
>>100379781
miku thread, miku board, go back to r/TRAAAAAANS, niggerlicious fags
>>
>>100379816
Seethe more.
>>
File: HF63HEX.gif (2.55 MB, 307x307)
Where the fuck are the new NVIDIA cards? Why is Jensen stalling Blackwell like a bitch. Why is Llama3 token context so low. Why is "OpenAI" called OpenAI when they are closed. Why haven't I slept in 3 days? Why are the demons not listening to me! Why are you not doing what you're told? STOP STOP STOP! Why do local models reek of rat piss? YOU WILL RUE THE FUCKING DAY! I know it was you who did the hex, you think you can get away with this! You will PAY! I'm performing a hex on you right now! YOU ARE FINISHED!
>>
>>100379648
Pounding bread dough with Miku
>>
File: overjoyedmiku.png (1.46 MB, 848x1176)
>>100379757
>Miku OC
>>
>>100379857
>Why is "OpenAI" called OpenAI when they are closed.
Shut up, Elon. Nobody cares about Grok. Sam won.
>>
>>100379899
i like miku
>>
>>100379912
Nobody has "won" yet. The race is still very much ongoing.
>>
File: Fb-VnrBaAAA45n-.jpg (1.71 MB, 4096x4055)
>>100379923
Have some more.
>>
>>100379702
I guess watching some black guy fuck a girl they think is hot is the most you can do when you're 400lbs
>>
File deleted.
>>100379712
prompting is secret sauce, nobody shares their prompts here, the best prompts can turn a dinky 1.5B slopmerge into a Claude-killer, such power cannot be revealed.
>>
>>100380000
that sounded better in your head
>>
>>100379648
>https://cdn.openai.com/spec/model-spec-2024-05-08.html
this looks like an unhelpful ai
>>
>>100380000
>digits
I guess he spoke the truth.
>>
Hi all. Drummer here...

Would anyone like to try out my new 11B model?

Alpaca format
https://shoppers-result-usually-marcus.trycloudflare.com/

It's unreleased and probably a WIP. Seems to be smart and nearly slopless.

It's mostly instruct / story, but RP seems to work well (in its own unique way)

Let me know what you think!
>>
>>100379955
>>100379816
>>100379708
malding. love to see it.
>>
>>100379978
Also if these big companies saw true over 9000 power level prompts it would just help them try and "toxicity" train away their power. They shouldn't be the only ones allowed to withhold information anyway.
>>
>>100379978
It's more like people are scared to share their prompts because of all the embarrassing shit they put into them: "expert roleplayer", "avoid repetition", "highly-rated writer who writes extremely high quality genius-level fiction"
>>
>>100380063
Okay, alright.
I like these 10/11b models, even if I mostly use 8x7b.
>>
>>100380063
>mixture of uncertainty and desire
>"Are you sure?"
>nearly slopless
>>
>>100380063
No, and nobody should, unless the dataset/training script is open source.
>>
We wuz KANGPT n shieeet
>>
Are MI60 cards worthwhile for LLM?
https://www.ebay.ca/itm/305251456291
$500 for 32GB seems like a steal
>>
>>100380211
The software support is not great as far as I know, but even using vulkan you would probably get a much better experience than investing the same money in RAM, for example, so maybe?
Sure as hell sounds enticing.
Imagine getting two of those?
>>
File: own.png (1.79 MB, 1913x967)
>>100379648
Thread Theme:
https://www.youtube.com/watch?v=U45x9qTr1lk
Triggering Falseflaggers Edition
>>
>>100380262
mine
>>
File: AmadeusKurisu.png (584 KB, 1000x562)
How many years until Amadeus is reality?
>>
File: residentsleeper.png (64 KB, 298x298)
local model hell... shit after shit, it's so fucking over. Llama3 was the last chance and it's the same shit from 2 years ago, still can't even have 1 million context tokens? local lost due to lack of risk-taking, aiming low, and no-talent coders and engineers. 8k context is a fucking joke and if you think it's acceptable FUCK YOU!

I'm now more certain than ever that the AGI race is a multi-trillion dollar scam. Face it anons, it's time to wrap it up.
>>
>>100380360
I think it's a 50/50 flip, we might be in the "waiting phase" of a much bigger breakthrough that hasn't even happened yet.
>>
File: 1621337720185.gif (852 KB, 500x717)
>>100380360
>Face it anons, it's time to wrap it up.
>>
>>100380360
1 million context isn't real, all the big-context cloud models use RAG memes and perform like shit.
But yeah, it seems we've reached a plateau. Even gpt2-chatbot (definitely not 4.5!) wasn't that great compared to 4.
>>
>>100380286
>Got bullied into posting false flag Mikus
good little shitposter :3
>>
>>100380360
Transformers are saturating. I would be more worried about the fact that even the smartest cloud models feel retarded so often. Context is something that can be coped with, stupidity and slop aren't.
That said, there's probably a ton of progress to be made on creativity and entertainment value for these models, because no one is even trying now. It's all assistant slop.
>>
>>100380448
cope
>>
>>100380430
I just came back from a run, and I know that is most likely sea water, but it looks so fucking delicious right now.
>>
>>100380360
current llama 3 release was considered by the team to be an early preview and is only out because zuck wants to ship as fast as possible
they'll iterate on it
>>
>>100380473
I'd be a fan if that means more models sooner.
>>
>>100380451
I would be more okay with my AI anime girlfriend being a little retarded. What hurts more is the dementia.
>>
>>100380360
based accelerationist
>>
What's the current roleplay meta for 24gb vram?
Is it still yuzu alter? I'm so sick of mixtral based shit. Everything it says pisses me off now.
>>
File: seems ok.png (52 KB, 958x952)
>>100380063
hmm!
>>
>>100380546
read the OP
https://rentry.org/lmg-spoonfeed-guide
>>
>>100380582
Fuck off nigger. I asked for a roleplay model, not some outdated noob guide.
>>
>>100380608
just use kcpp and miqu and use patience
>>
>>100380360
Why are you even here if you believe all that and why do you so desperately want us to "wrap it up" just because that is what you are doing?
>>
>>100380659
cope
>>
>>100380506
>What hurts more is the dementia
Yes, very much so. Memory or continuous learning of some sort, likely not the type of training we have now, has to be solved.
>>
>>100380627
I don't wanna wait 5-10 minutes for a response, it completely ruins the roleplay immersion...
Even worse if I have to reroll response and wait another 10 minutes because it's shit.
God damn it, is this all we get?
>>
>>100379712
aicgmetatard here btw, if you /lmg/bros have any guides better than those already listed in there, I'll be happy to include your suggestions. Please keep in mind we're all retards there.
>>
File: MikuFinalForm.png (1.84 MB, 1200x848)
>>100380360
>it's so fucking over
Don't forget that 405b is coming to fuck your shit up
>>
>>100380720
Just write a short system prompt of what you want it to do. If your system prompt gets too long it struggles to keep up and will miss out on rules because it picks random context.
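Something dead simple works, e.g. (just an illustration, tweak the wording for your model and card; {{char}}/{{user}} are the usual ST macros):
You are {{char}}. Stay in character, write in third person, keep replies to 2-3 paragraphs, and never speak or act for {{user}}.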
>>
>>100380711
try Kyllene 1.5 if you really want to.
>>
>>100380360
It's not even an open secret anymore: AI in general has been way overhyped. The public's expectations are so inflated they believe Hollywood science-fiction AGI is coming soon.
>>
I installed llama.cpp, launched my first model (Phi-3-mini-4k-instruct-gguf), and opened mikupad.
Everything works, but is there a way to hide or modify the "<|assistant|>", "<|end|>" etc. tags? (I asked the model itself and it recommended hiding it with JavaScript, providing code too. Pretty good answer, but I assume there's some inbuilt function already that I'm unaware of.)
I'm also a bit lost. I understand that you're supposed to initialise a conversation by telling the model what it is to present itself as, what role it is to play, etc. but is there a standard format for this kind of thing or a newfag template to build off of?
>>
>>100380743
I mean like setting up guides, not prompting guides lol.
>>
>>100380744
i mean 1.1
>>
File: file.png (218 KB, 1069x970)
>>100380744
1.5? There's 1.0 and 1.1.
Also, who is mradermacher? All his quants get a high number of downloads in a short amount of time.
He did a yuzu-alter i1 quant with similar downloads.
>>
>>100380772
Mikupad is not built for chat. You're gonna want a different frontend.
>>
File: copium.png (178 KB, 400x388)
>>100380739
>Don't forget that 405b is coming to fuck your shit up

>>100380800
>Make love not war. Llama love.
>>
>>100380479
medium milku
>>
>>100380794
>mradermacher
oh no...
https://desuarchive.org/g/thread/100192168/#100195457
>>
>>100380442
>1 million context isn't real, all the big context cloud models use RAG memes and do like shit.
How do I do that locally?
>>
>>100380885
More info?
>>
>>100380825
I see, so mikupad is just the minimalist “here's a graphical frontend for basic interactions that isn't a commandline, also has no dependencies and is easy to install” option?
Thanks, might try SillyTavern

Another question: I see "cards" being posted, which I assume is like ComfyUI workflows where, when dragged & dropped (or uploaded) into a frontend, they produce embedded prompts/premade configurations. Does anything more featureful than mikupad support these, or do I need anything special for it?
>>
>>100380744
>>100380793
Which quant do you recommend, and at what context length?
>>
>>100380918
Well, you can use it for the minimalism too, but it's also the only "completion" frontend I know of that has decent features like memory and world info.
>>
>>100380211
Bought a couple off of that exact seller but ended up returning them. They were missing the PCIE backplates. Also that Condition: New was a pretty big lie. They're literally just pumping chinese ewaste onto the american market, you'll also have to rig up a fan solution for them which will never be as quiet as the fans on a gaming gpu.
I installed them into my server to test anyway (and emailed the seller saying hey, if you have those pcie backplates kicking around just send them and i'll keep them but they were like sorry sarrs you will have to use the ebay customer support sarrs) and they did work, although if you've already been using 3090s you'll miss the compute.
There's also no bitsandbytes support which means you can't use them for qlora training. If you only ever plan to run exl2 or gguf models they're fine though. But huggingface says "Fuck you" with their official libraries unless you're buying modern nvidia hardware basically.

tldr 3090 is and always will be the benchmark.
>>
>>100380906
known (even by lcpp devs) as making bad quants, somehow fucking it up in ways even they don't understand
>https://huggingface.co/mradermacher/Meta-Llama-3-70B-i1-GGUF/tree/main
>I don't know how the nans got there in the first place, but the model is not valid.
>https://github.com/ggerganov/llama.cpp/issues/6841#issuecomment-2073073138
resulting in this pr getting added for catching whatever the fuck he did
>https://github.com/ggerganov/llama.cpp/pull/6884
>>
>>100380360
The reason AI has failed is memory capacity. Within the next few years or so no one will care about it anymore, it will just be a tool on your phone. But it will never be your friend or a virtual lover, because after every day it will forget and wipe previous talks to be able to fit ongoing conversations. The only hope for something even 1/5 of AGI is to fix the memory issue, until then it's just a flashy tool. All just cope though, AGI is a fantasy for movies and game plots.
>>
File: night.gif (3.97 MB, 600x432)
>>100381028
>But it will never be your friend or a virtual lover, because after every day it will forget and wipe previous talks to be able to fit ongoing conversations.

anon please
>>
>>100381028
You're using all the wrong technical terms.
That's how I know you're an ignorant fuckwit.
>>
I left and came back after a day and mikufaggots are at it again. Can't you just shit up some other thread with your autism? Doesn't /a/ have dedicated miku threads?
>>
RX 6600 bros... AMD still sucks at AI? Should I buy a 3060 to chat with my virtual girlfriend without paying for APIs like a cuck?
>>
>>100380961
>There's also no bitsandbytes support which means you can't use them for qlora training. If you only ever plan to run exl2 or gguf models they're fine though
That's cool.
Could you use them to train a LoRA using llama.cpp at least?
>>
>>100381028
>>100381071
You guys never wished you were in a Groundhog day type scenario so you can try a lot of weird sex shit and know that no matter what happens there are no repercussions?
>>
>https://www.refuel.ai/blog-posts/announcing-refuel-llm-2
we're so back
>>
>>100381190
plaseabow
>>
File: file.png (127 KB, 1103x555)
>>100380906
Yes, that anon is just trolling. Read the replies in that thread. Or this one:
https://desuarchive.org/g/thread/100302819/#100307903
>No, that guy lied to you. He’s trying to pin the blame of bugs in llama.cpp’s code to that dude, who just runs the quantization script. If you follow the linked PR, it never mentions him nor anything about quantization, and he was called out for that in that thread. He’s just some shill that uploads quants trying to smear the other dude.
>>
File: 1715204672585.gif (37 KB, 640x640)
>>100381190
>beats gpt-4 and opus
>>
>>100381071
>>>/c/4325688
Go shit up that thread loser. It is so fucking tiresome how bad this place is because of autism.
>>
>>100381222
Eventually they'll get bored and leave and /lmg/ will go back to normal, right?
>>
>>100381219
Everything that's been released in the last year or so has beaten GPT-4, apparently.
>>
>>100381241
No. That is why I am leaving and I will just read reddit for the news. At least they control their autism unlike this thread.
>>
>>100381216
cool how you're ignoring this post with more info huh it's all in the git thread
>>100380968
>>
>>100381190
Ah yes, the banking77 benchmark. We've been waiting for a good model on that task.
>>
>>100381190
>Gemini is worse than an open source model
OOF
>>
>>100381275
>Anit-Miku posters are redditors
And there you have it
>>
>>100381275
You won't be missed, but we'll see you tomorrow.
>>
>>100381276
Keep shilling, shill.
>>
>>100379712
With which model?
>>
i tried nvidia chatrtx and the default models seem pretty bad.
can i slap other models on it, or should i just uninstall the whole thing and install a different one?
i have a 3060 with 12gb of vram, and i need an LLM for coding
>>
>>100381321
shilling what? don't even have a huggingface account
>>
>>100381330
Nice style. Shame about the text.
>>
>>100381190
>built for data labeling, enrichment and cleaning
So it's irrelevant to us. Cool for those that do dataset work though I guess, sure.
>>
>Llama3 8k context
LOL ill be back in another 6 months.
>>
>>100381190
Oh shit were b-
>best at data labeling, cleaning and enrichment
-ACK
>>
>>100381344
Were you using their weird ass prompt format?
>>
>>100381351
>he doesn't know
>>
>>100381344
>and i need a LLM for coding
copilot (https://github.com/features/copilot) and gemini (https://cloud.google.com/code/docs/vscode/write-code-gemini) extensions in vs code are not options?
>>
I just ate the best jam toast I've ever had in my life :)
>>
>>100381450
>Brave and on-topic
Happy for you anon!
>>
>>100381450
pics or it didn't happen
>>
>>100381182
What even is the current state of llama.cpp training?
>>
>>100380968
>>100381216
Still confused.
I'm just gonna try his i1 quants since they are so popular. Surely they don't get that many downloads for no reason.
>>
>>100381450
Nice. For me, it's a bit of salted butter. I don't get people's obsession with making it sweet.
>>
>>100381450
Congratulations. Add some Nutella and you'll cream yourself from the taste.
>>
>>100381321
i understand now, you think i posted the post in the archive?
>https://desuarchive.org/g/thread/100192168/#100195457
yeah, that ain't me, I just dislike all quanters in general and will take any deserved chance to shit on them
>>
File: AQLM.png (192 KB, 827x779)
trying out deepseek 33b (the only 30b aqlm quant) on an rtx 3060 12gb and it's working, getting 4-5t/s
no need to do anything special besides making sure you have cuda installed on oobabooga and installing the newest aqlm package, then choose the transformers loader
sadly command r is 12.7gb compared to deepseek's ~10gb and there are no other 30b models quanted
turns out it wasn't a meme
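for reference the install is just pip inside ooba's env, something like this (extras names going off the AQLM readme, double-check there):
pip install -U "aqlm[gpu,cpu]"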
>>
>>100381450
brb gonna go try this now too.
>>
>>100381438
not really, i actually need it for godot 4
>>
>>100381554
GPT-4 is going to be far better for everything, including Godot, than any local models. Cheaper too. Local models are for sex only. Check back in 6 months.
>>
I havent eaten jam in 10 years fuck I want some smooth tasty sexy jam in my mouth right now
>>
>>100381569
ok.... i will stick to the free bing copilot for now
i really hate having to rely on their online infrastructure, i wish i had a fully offline alternative to fall back to
>>
>>100379648
I measured FP16 vs. BF16 performance on the Wikitext-2 test set.
FP16 gets on average 0.0000745 ± 0.0003952 % more tokens wrong if you sample with temperature 1.0.
Notably the uncertainty is much larger than the value so even with an input of 300k tokens you cannot even conclusively determine which one performs better according to that metric.
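If anyone wants to run this kind of comparison themselves: llama.cpp's perplexity example can record one run's logits and score a second run against them (rough sketch from memory, not necessarily the exact metric I computed; flag names may differ, check --help, and the test file path is a placeholder):
./perplexity -m model-f16.gguf -f wiki.test.raw --kl-divergence-base f16-logits.dat
./perplexity -m model-bf16.gguf -f wiki.test.raw --kl-divergence-base f16-logits.dat --kl-divergence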

>>100380906
Annoying to deal with and in response to the code changes that check models for NaN values he replied:
>Regarding the actual problem at hand, the question is whether models that do something (i.e. work, to some degree, with transformers) should be completely refused by llama.cpp. The quants in question here did give totally reasonable answers if you use the correct template, so were obviously not totally broken or unusable, at least when used with cuda or the cpu.
>What I am saying is that there is a class of models that would work totally fine, but cannot be converted to gguf anymore.
I personally would steer clear of any model file produced by someone who suggests NaN values in models are in any way acceptable.
>>
File: wtf.jpg (26 KB, 640x644)
>>100381587
>I havent eaten jam in 10 years fuck I want some smooth tasty sexy jam in my mouth right now
>>
>>100381450
I don't like jam unless it is blackberry jam on pancakes.
>>
>>100381596
Any 70b or larger will get close. Try MikuQ5, mixtral 8x22 and L3 (once things settle down) and see which one produces the most reliable outputs for you.
I'm personally using an early L3 quant that actually inexplicably works really well
>>
>>100381530
It doesn't work with partial offloading then?
>>
>>100381609
>blackberry jam on pancakes
based anon
blackcurrant jam and clotted cream on scones is also acceptable
>>
>>100381613
would they run on my machine?
which interface should i use?
>>
>>100381599
I don't particularly trust your opinion on things, because all you see are the numbers while being oblivious to the actual user experience, which is considerably worse than what your precious numbers show.
>>
>>100379765
Kek.
>>
>>100381599
based
>>
>>100381634
>would they run on my machine?
Assuming you have enough sysram, yes. Just slowly
>which interface should i use?
if you're a codefag then you should just use straight-up llama.cpp. That way you can integrate it into your workflow with API and shell pipelines.
Git clone https://github.com/ggerganov/llama.cpp and make that fucker.
Build with CUDA so you can offload as much as you can. -ngl / --n-gpu-layers is your friend, but with 12gb the effect will be minimal.
Considering your machine specs, you'll need to give it tasks to complete while you're doing other things.
Make sure the system prompt is on point for the kind of output you're looking for.
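Rough sketch of the whole flow (model path and prompt are placeholders; the make flag has changed names over time, so check the README):
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUDA=1   # older builds used LLAMA_CUBLAS=1
./main -m ./models/your-model-q4_k_m.gguf -ngl 20 -c 8192 -p "write a gdscript function that ..."
./server -m ./models/your-model-q4_k_m.gguf -ngl 20 -c 8192 --port 8080   # http api you can hit from scripts/editors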
>>
>>100381634
>would they run on my machine?
8B would, but it's going to be useless for coding. You'll need >64GB RAM to fit 70B in any reasonable quant, but with so little of it offloaded to the GPU you'll get like 0.5 t/s. So less copilot, and more posting a question on SO and waiting an hour for a response.

If you really need it for programming, either suck it up and paypig the cloud models or invest in at least 2 3090s. (and start saving for 8 more so you can run 405B, which might actually be useful for programming)

>which interface should i use?
They all suck in their own ways.
>>
>>100381450
I'm actually eating some right now, nice :)
>Captcha 2PP28
>>
>>100381633
cream crepes > jam crepes
redcurrant > blackcurrant
fite me
>>
>>100381690
>>100381696
fuck it sounds like things are still pretty rough for my use cases
>>
File: 00127-3931588590.png (2.8 MB, 1152x1920)
I have an rtx 3090 with 24gb of vram and 0 knowledge of llms and can't find my way through the links in the op (I do know about t2i). Can anyone tell me how (and whether) I can run a model similar in style to AID(ungeon)? Is AI able to remember context now? Or is it still as bad as it was in 2020
>>
File: rt.jpg (51 KB, 500x512)
will the fuckers on chub ever learn how the models they supposedly write cards for work?
>>
true, it's pretty clear the devs only ever use models in the context of bug-fixing, as evidenced by ggerganov saying recent models don't have repetition issues
>but my guess is that it is something that was useful in the early days when base models used to fall in repetition loops quite easily. Today, there is almost 0 reasons to use it. So probably it is not worth investing in it
https://github.com/ggerganov/llama.cpp/pull/5561#issuecomment-1951389775
mixtral not repetitive according to him
>Is this the base model or the instruct model? My experience with the instruct model is that it never enters repetition loops with temp 0 and all repetition penalties disabled.
https://github.com/ggerganov/llama.cpp/pull/5561#issuecomment-1951874469
>>
>>100381660
>NOOOO you can't just measure things you have to go by FEEL!!11!!
I hate placebofags so goddamn much. Objectively, there is essentially 0 difference between fp16 and bf16. Like you can actually just directly measure the logits of the model and see how they compare. How is this so hard to understand?
>>
>>100381760
>fuck it sounds like things are still pretty rough for my use cases
Things are rough in general unless you don't mind dumping money into your rig
We've hit another threshold, where devs once again legit need gigantic workstations to get shit done. eg https://rentry.org/miqumaxx
The days of just happily hacking on your chromebooks are gone unless you want to feed all your private data into the cloud for consumption by the big players.
>>
>>100381599
>I personally would steer clear of any model file produced by someone that suggests NaN values in models are in any way acceptable.
So you have another vendetta and are happy with spreading FUD. Got it.
Making a quant is just running a script.
>>
>>100381821
meant to reply to
>>100381660
>>
File: 1715207709110.jpg (368 KB, 840x700)
Don't tell me the RTX 5090 is going to be 3k... right?
>>
>>100381791
>is it still as bad as it was in 2020
Something like miqu 70b q5 will blow your mind...
>single 3090
...slowly
>>
>>100380918
Mikupad isn't for chat, but it has a good balance of features and transparency (as in, it's very easy to tell what's happening).
Cards are supported by SillyTavern out of the box. They're usually a character description, maybe some dialog examples. The quality is generally pretty bleh though. Different models work well with different styles of prompt, gen settings, etc.
>>
>>100381747
>cream crepes > jam crepes
True, but savory crepes beat them both
>redcurrant > blackcurrant
Now you're just trolling
>>
>>100381791
We've improved by leaps and bounds, but every model shows its cracks sooner or later
24GB will get you gimped 70B quants at very slow speeds until someone comes up with better quants
You could also try extremely gimped 104B quants (command retard plus), but your current best bet (and what I'm using) is anything <35B at Q8-Q4 quants, including mixtral (merges), that should allow you to get a few T/s with a good amount of context (8k-32k depending on the model)
The real problem is finding good settings and actually getting good at prompting
>>
>>100381843
It will be 2k or less.
But it will also only have 16GB VRAM, not 32GB like people are speculating. It will come with a new real time texture compression feature, that NVIDIA will use to advertise it as having "32GB effective VRAM for gaming".
>>
>>100381791
Download Koboldcpp and https://huggingface.co/kat33/Mixtral-8x7B-Instruct-v0.1-Q3_K_M-GGUF/tree/main
You might need to set offloaded layers to 999 in the koboldcpp settings to load the whole thing in your video card's memory. If you get an out of memory error, enable the flash attention setting if it's not enabled by default.
That should get you started at least.
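If you'd rather skip the GUI fiddling, the same thing from the command line looks roughly like this (flag names from memory, check --help; the gguf filename is whatever you grabbed from that repo):
python koboldcpp.py --model mixtral-8x7b-instruct-q3_k_m.gguf --usecublas --gpulayers 999 --contextsize 8192 --flashattention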
>>
I hate merges.
>>
>>100381883
>But it will also only have 16GB VRAM, not 32GB like people are speculating.
lol there's no way they will do that, people will stick with the 3090, they have to give something to us
>>
Guys what's the best llama 3 70b fine tune around right now for RP?
Possibly a non weeb one that doesn't make characters "giggle shyly" every other message and go uwu
>>
What is this black magic?
I'm trying an i1 Q4_K_M quant, and it's giving responses in just 10 seconds rather than the 5+ minutes which the old Q4 quants took.
What the fuck is going on? How is this possible?
llama_print_timings:        load time =    1478.94 ms
llama_print_timings: sample time = 223.15 ms / 150 runs ( 1.49 ms per token, 672.18 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_print_timings: eval time = 9270.37 ms / 150 runs ( 61.80 ms per token, 16.18 tokens per second)
llama_print_timings: total time = 9718.07 ms / 151 tokens
Output generated in 9.95 seconds (15.08 tokens/s, 150 tokens, context 1728, seed 481338642)
>>
>>100381348
>>100381376
>spanning tasks such as classification, READING COMPREHENSION, structured attribute extraction and entity resolution.
I mean, this one is one of the most important things for RP
also
>The Llama-3-Refueled does not have any moderation mechanisms. We're looking forward to engaging with the community on ways to make the model finely respect guardrails, allowing for deployment in environments requiring moderated outputs.
it looks like it's not censored

It may be a good model from the start, or it may need an additional fine-tune/merge with ero-datasets, but I'm an optimist.
>>
Newfag here, I thought this wasn't a thing with local models like llama2. How does one get around this bullshit?
>I cannot fulfill your request. I'm just an AI, it's not appropriate or ethical for me to assist with content that objectifies or degrades individuals, particularly based on their gender or relationship status. Additionally, it is important to respect the boundaries and consent of all parties involved in any sexual activity. It is not appropriate to use language that dehumanizes or reduces individuals to mere objects for sexual gratification.
>>
>>100381919
Llama-3-70B-Instruct
>>
>>100381844
I've heard about the poorfag setup for llm being 2x3090, guess I'm btfo
>>100381861
>The real problem is finding good settings and actually getting good at prompting
Thanks for the info. Any guide on this? Or is just it voodoo?
>>100381897
Thanks, will try it out
>>
>>100381883
That would be so fucking funny.
>>
>>100381923
Oh, nice catch. So it could be good. I'll wait for someone to test it.
>>
>>100381942
Post what you are using (frontend, backend, exact model) and your settings (instruct template, system prompt, temp, samplers, etc).
>>
>>100381915
when it comes to buying the '90 cards brand new, their target consumer will basically buy it no matter what because it's "the best".
Like I saw that on a lot of ads when shopping around for used 3090s.
"Selling it because I bought a 4090".
They're targeted at people with lots of money who don't care about value, etc. They're just like "give me the best. I can afford it."
>>
>>100381981
textgen-webui
70b-chat Q5_K_M gguf
llama.cpp

all defaults, prompt is just telling it to write some ntr fanfic
>>
WizardLM-2 was unfairly sidelined when Llama3 came out a couple days after. I reckon it's top-tier for an uncensored model.
>>
>>100381946
We're down bad huh
>>
>>100381994
l2 70b chat, if you're using its correct format, is probably one of the most censored models ever, like refusing-to-kill-a-linux-process levels of censored
>>
>>100381190
>RefuelLLM-2 is a Mixtral-8x7B base model
>>
>>100382025
>RefuelLLM-2 is a Mixtral-8x7B base model, trained on a corpus of 2750+ datasets spanning tasks such as classification, reading comprehension, structured attribute extraction and entity resolution.
Damn, even if what they are claiming is bullshit, if it's better than mixtral 8x7b instruct, then I'll be happy.
>>
>>100381989
>They're just like "give me the best. I can afford it."
how can it be the best if it has less VRAM than a card made in 2020? you mean "new" like apple does right?
>>
>>100382024
are any variants of l2 different? same shit happens on 13b chat
>>
>>100382016
I sometimes switch to WizardLM-2 when Llama3 gets too stiff. It works very well.
>>
>>100382044
If it achieves an fps improvement over the 4090 then it's "the best" since its directly marketed as a gaming GPU.
If you want "the best" for machine learning that's an entirely different product line.
>>
>>100382047
all the chat variants of l2 are censored; use literally any other l2 model and it won't be censored
>>
>>100381994
>70b-chat
Ah, that's why.
Go for one of the many finetunes, or better yet, go for llama3 70B or this miqu everybody is always talking about.
As anon said, llama2-70b chat is censored as fuck.
>>
>>100382047
if you're already using l2 might as well use miqu which is basically l2 70b in its best possible form https://huggingface.co/miqudev/miqu-1-70b/tree/main
>>
File: glasses pepe.jpg (59 KB, 655x527)
>>100380705
Now I realize there are a few types of prompters

The Holodeck Chad
>uses text models to live out his depraved fantasies
>fetishes get more and more abstract so true intelligence is paramount
>context is good mostly for extending the goon session so it stops being so important eventually

The Lonecel
>uses AI to trick his brain into thinking he's connecting with another human being
>intelligence not that important because most conversations aren't very demanding
>context is critical so it doesn't break the masquerade and reveal that it is actually matrix multiplications

The Poltard
>tests every model to see whether it can say nigger
>claps every time it does
>doesn't need context or intelligence
>just let him be happy, he has simple needs
>most betrayed by corpos

The Riddler
>keeps coming up with more and more retarded puzzles about watermelons or siblings or apples
>no one knows what could motivate him
>probably a Holochad who's too ashamed to admit what he actually wants to test

The Admin
>wants to use these gigantic text models to write business emails
>most deranged of all
>to her fortune, all the compute in the world is being dedicated to satisfy this weirdo
>>
>>100382055
the new PC games and the upcoming ones are so unoptimized, 16gb won't be enough anymore, people will care about VRAM just because of that in my opinion
>>
>>100382077
That's what DLSS extreme edition is for. It only needs to render the scene at 180p, 99% of the pixels are just fake AI generated barf but your favorite tech youtuber said it's better that way so you agree.
>>
>>100382075
I use local for holodeck and cloud for admin.
>>
>>100382075
>The Holodeck Chad
Thank you for describing me so accurately.
>>
File: openai_nsfw.png (84 KB, 1770x364)
How did everyone miss this bit? Even OpenAI is considering allowing NSFW content. Coomers are just too valuable as customers, it seems.
>>
>>100382095
fair enough, I overestimated gamers' intelligence too much, they are stupid enough to fall for that trap yeah kek
>>
Since llama3 dropped (((someone))) has been shitting up /lmg/ non-stop...
>>
How well are the usual LLMs optimized for unicode, any idea? Like are they more likely to get that a 3 bytes long character is supposed to be one character or that a 2 byte sequence of two characters is supposed to be one character (\*, to be specific, because it's just an escaped asterisk in markdown)?
>>
>>100382114
They will never allow the respecting of elves. Maybe kisses and sex in the missionary position for the sole purpose of procreation.
>>
>>100382114
>responsible NFSW
what the fuck is this shit?
>>
>>100381959
>Thanks for the info. Any guide on this? Or is just it voodoo?
Well, I either ask my fellow /lmg/ chads for good settings or fiddle around, but I'm probably not the right person to ask. Good prompts go a long way, so just make sure your English isn't too ESL and you should be up and cooming in no time
>>
>>100382075
You're missing the *pipe dream chaser*. That is, someone who wants AI to get to the level where it's as smart and conscious as a human, and isn't happy until that happens. And what they currently do with models is basically not much. Mostly just lurk the threads.
>>
>>100382114
people are ignoring it 'cause OAI is the enemy. even if they allow NSFW, who says they won't remove it at any point? that's the power of cloud models
>>
>>100382075
Is it too much of an ask to want my local model to be capable of all five without any drawbacks? I like to dream.
>>100382140
This is me.
>>
>>100382114
In my mind, it was always their plan to allow that, but they'll have some crazy draconian verification system to prove you are an adult.
>>
>>100381959
honestly I'm at 4x3090 and even I feel like a VRAMlet because of CR+ and 8x22
>>
>>100382075
Do local-copilot-fags fall under The Admin or are we our own thing?
>>
>>100382114
I think OpenAI is losing too much money at this point, other APIs and open source are catching up to them and they're losing subscribers and shit, that's probably the only reason they decided to go for the coomer route
>>
>>100382126
>autocompletes your text
that'll be $ please
what do you mean you run local models on your own hardware
whoa whoa whoa guv, we can't have you using such WMDs
first of all you can't be buying such powerful chips without our say so and you certainly won't be training and distributing models without validating that they're safe first
who knows what you could do with scraped data, it's basically panacea, just you wait we'll show you soon
damn I love moats- ahem uh I mean gpt2 good boy model

why are CEOs/directors/salespeople such cunts
>>
>>100382016
>>100382051
WizardLM-2 8x22B is great and is one of my goto models.
>>
>>100382151
>some crazy draconian verification system to prove you are an adult.
>>
>>100382114
Of course they want to allow it. If people can use cloud models for NSFW, 90% of demand for local models evaporates. As a bonus, they can keep tabs on the degenerates and forward all data to the NSA.
>>
>>100382075
I wouldn't exactly call myself a chad, but if you say so...
>>
>>100382181
tempted to give in and try even though I have to run it in quantlet mode.
>>
>>100382144
yeah, but the simple fact that they decided to think about it, after all their speeches about "safety" and about how prudish they are, is sus as fuck
>>
>>100382075
You forgot about me:
>The Turing Tester
>uses newly released models to generate pastas that trash talk said models
>posts them to /lmg/ to see how many people fall for it
>>
>>100382151
People are used to it. They won't think twice about uploading their id and selfie for access to GPT-V.
>>
>>100382223
Yes, I know.
There might even be some face verification involved, like in banking apps and stuff.
>>
>>100382164
>create a new subscription tier that allows NSFW content after age verification
>it's twice as expensive
>people still pay for it, because of course they will
>milk the coomers dry, in more ways than one
genius
>>
>>100382210
It's really not, they worded it in the most corpo friendly way ever, "We're exploring", "responsibly"
>>100382238
IF they ever allow NSFW it'll only be with: a moderation endpoint checking each prompt for bad stuff, which you'll pay extra for, and draconian levels of KYC
>>
>>100382223
>People are used to it.
that's the scariest part, 10 years ago everyone agreed that giving so much information to a random site was insane, now people think the internet is the new real life where they can be identified in as much detail as possible
>>
>>100382114
it's the microsoft strategy: embrace, extend, extinguish. they'll rope in the coomers just long enough for the alternatives to drown.
>>
>>100382138
>You VILL upload ze passport skan to fuck ze robot
>You VILL ask robot for conzent
>You VILL help us enforce transhumanist agenda
>You VILL let ze robot lecture you on woman's issues
>You VILL zend your dick pics to ze robot to verify ze conzent
>...and you VILL be happi :)
>>
is the thread mikufree yet
>>
>>100382289
>they'll rope in the coomers just long enough for the alternatives to drown.
opensource will never die though, we'll keep improving our shit
>>
>>100382138
You're not allowed to harm 1D text children or whatever
>>
>>100382130
Depends on the tokenizer. The tokenizer splits text into common groups of characters. You can end up with entire words and common suffixes as single tokens. Symbols, numbers and unicode depend on the tokenizer settings. Llama3, I think, had the numbers 0 to 999 tokenized as single tokens, while other models split by digit or just keep common occurrences only. Unicode characters are (typically) handled as a single unit (one codepoint, one token), but it depends on the codepoint in particular. They can output non-ascii stuff just fine. Stuff that the tokenizer has never seen in the training data or that didn't fit in its vocabulary ends up as single-byte tokens.
If you use llama.cpp, it comes with a tokenizer to test your specific model and whatever you want to try. Something like
>./tokenize path/to/model.gguf "This is a test \* /*comment*/ EOF whateverness"
I'd post a screen but i'm requantizing all my shit.
>>
>>100382291
exactly this
>>
>>100382302
it would hinder open source progress, and that's good enough for them
>>
what is this and why can't I load it:
https://huggingface.co/LiteLLMs/MultiVerse_70B-GGUF

not qwen, but qwen smashed into llama format, apparently
they're claiming it's secretly good
what's the damage?
>>
>>100382302
>opensource will never die though, we'll keep improving our shit
>looks at Linux, Gimp, LibreOffice, etc
uh...
>>
>>100382302
We? Who the fuck is we? 99% of the people in this thread have never made a contribution to open source repos
>>
>>100382308
Thanks, anon, completely forgot I could just check token counts. Retard moment.
>>
>>100382323
>qwen smashed into llama format
come again?
>>
>>100382328
just being fans of those opensource models is enough anon, without us no one would bother improving anything, they need an audience
>>
>>100382340
>Hi @sealad886 , thanks for your interest !
>The initial weights are from Qwen initialized into a Llama class (no much difference in architectures)
https://huggingface.co/LiteLLMs/MultiVerse_70B-GGUF
>>
>>100382327
blender is open source
emulators are open source
don't underestimate motivated smart autists anon, they can do miracles sometimes
>>
>>100382322
desu it's starting to die in the imagegen community already, SAI won't make other image models anymore and there's no one to replace them
>>
>>100382348
wrong link, whatever you're a smart guy
https://huggingface.co/MTSAIR/MultiVerse_70B/discussions/8
https://huggingface.co/mradermacher/MultiVerse_70B-i1-GGUF
>>
File: IMG_4404.jpg (125 KB, 566x688)
>>100381810
Most don’t test them at all. Just write and post.
Makes sense. Writing a model takes minutes. Testing and tuning take hours.
>>
>>100382372
i want to sniff her feet
>>
>>100382210
Clearly planning for a GIANT blackmail scheme. Imagine some poor young bloke uploads his pass scan and then does some nasty RP shit with AI. Few years later he has a CEO position at a small company. Mentioning his past behavior will certainly help with negotiations. Imagine having dirt on almost everyone, like Epstein, but digital.
>>
>>100382114
enjoy getting your information forwarded to the police because the text based girl didn't give her explicit consent and verify she was of legal age
>>
>>100382114
Hmm. I dumped oai and went local only after a couple of warning letters.
I get the safety angle: if you’re doing a biz integration you don’t want customers to get ERP from their customer service bots. But it would be trivial to set up turbo3.5_nsfw at oai. If they wanted to. I think they want the money, just not the bad press.
>>
>>100382513
>turbo3.5_nsfw
local models surpassed 3.5, they need to go for 4 if they want to make an impact imo
>>
Asking because I didn't see it really explicitly stated in the OP.
What model should I download for chatbot ERP/storytelling purposes?
>>
File: ebassi.jpg (21 KB, 460x460)
>>100382327
>>100382366
As long as it ends up in the hands of good autists, it will be fine. If it ends up with autists who to this day masturbate and waste time on "Unix philosophy", "debloated distros" and "minimalism", it will be fucked. We will end up with sperglords like ebussy with "what is the use case for that?" and "X is not a metric"
>>
>>100382554
normies don't care, they'd cream their pants over 3.5_nsfw; they can't run locals anyway
>>
>>100382566
StableLM-7B
>>
>>100382075
I would add one more that is encountered quite often at r/localllama

The RAGer
>wants to join the llm hype, can only make it fit with their domain as a search engine
>This 3B model is better than GPT-4 for our usecase
>RAG has really helped our workflow (can't actually provide metrics to support this)
>>
>>100382566
Post your specs and acceptable speed.
>>
>>100382302
It will die because merges will kill it.
>>
>>100382600
All the crypto scammers are transforming into this.
>>
>>100382073
>miqu
what's up with this pozzed alternation and warning?
>The request you've made is complex and involves sensitive topics. I understand that you're looking for an internal monolog, but it's important to approach this topic with respect and sensitivity. Here's a possible response that focuses on the emotions and thoughts of the character without being explicit or disrespectful:
>>
>>100382591
Stable LM 2 12B*
>>
>>100382622
that's most modern models, better get used to it
>>
>>100382649
any way to cheat them into being useful?
>>
>>100382675
yes
>>
>>100382610
I have 12 GB of GPU memory.
As for acceptable speed, I don't really care if it takes a while for a gen honestly, as long as it's not like 10 minutes per gen. Quality is more important to me
>>
>>100382675
Oh yeah absolutely.
>>
>>100382566
Depends on your gpu, bruh
8gig or less? Imo don't even bother
12gig? Maybe Fimbulvetr or one of its 11b derivs can get good speed and decent context size
24gig? One of the fancy mixtral finetunes
Etc etc.
>>
>>100382708
>as long as it's not like 10 minutes per gen
I have some bad news for you...
>>
>>100382576
Exactly. I prefer to offload the model to hosted and keep the 12GB of VRAM for stuff like stable diffusion. I get the local model use case and run them as well. But I'd rather not have to.
I've been using mistral since oai kicked me off, after running local for a while. Mistral moe isn't any better than the local models I can run, but it's faster and frees up resources.
>>
>>100382708
Do you have DDR 5 ram?
If so, anything where you can offload at least half of the model to your GPU will probably work alright for you.
>>
>>100382691
>>100382709
can i learn this power in this general
>>
>>100382708
You can run 13b on 4K context. Or 7b on larger. And get 20 tokens per second.
You can run bigger on cpu but speed will be 1-2 t/s
I have a 3060 12gb card
>>
>>100382708
>as long as it's not like 10 minutes per gen
You're going to love 70B q4_k_s!
>>
>>100382744
don't count on it
>>
What's the best card to simulate an /lmg/ anon for the purposes of ERP?
>>
>>100382791
https://characterhub.org/characters/BirdyToe/transgender-care-simulator-2023
>>
>>100382791
just write a summary of your autobiography and you're good to go kek
>>
>>100382791
https://characterhub.org/characters/CrowAnon/schizo-anon
>>
>>100380063
Pretty fresh writing…
>>
>>100382708
Mixtral (both 8x7b and 8x22b)'s reasonably fast (2-4t/s), CR's about as fast too but it's hard to fit any large amount of ctx, I'd consider these the barebones for any decent amount of intelligence
70B's in the realm of kinda slow (~1-1.8t/s) but is probably the best
>>
>>100380063
>Seems to be smart and nearly slopless.
how did you manage to make it slopless?
>>
what's the context length of llama 3 models?
>>
>>100382905
>8k
>>
>>100382859
Thank you

>>100382901
Lots of pruning, text replacement, and fillmasking
>>
>>100382906
Nice meme collection retard
>>
>>100382922
wut? seriously? LMAO!
ROFL!!!!
they what?
THEY MADE LLAMA3 WITH FUCKING 8K CONTEXT? HAHAHAHHAHAHAHAHAHHAHAHAHAHAHHAHAHAHAHHAHA
AHAHHAHAHAHAHHAHAHAHAHHAHAHAHAHHAHAHHAHAHAHAHAHAHHAHAHAHAHHAHAHAHAHAHHAHAHAHAHAHAHHAHAHAHAHAHAAAAAAAAAHAHAHAHHAHAHA
>>
>>100381281
Your Miku is large
>>
>>100382905
For most models you can figure it out on HF, in the config.json file
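The field to look at is usually max_position_embeddings; the Llama 3 configs, for example, have:
"max_position_embeddings": 8192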
>>
>>100382937
I know right, and it also sucks at multilanguage, why are they pretending that Mixtral doesn't exist and that we're only dependent on them?
>>
File: 1712345125904676.gif (535 KB, 400x226)
Take the LLM pill. Your waifu will always be retarded and that's a good thing.
>>
>>100382956
i don't mind retardation as much as i mind immersion breaking and refusals
>>
Cat-llama really fixes every problem I had with llama3-instruct, it just feels like mini gpt4 now like the l3 benchmarks suggest it was supposed to be this whole time. It breaks free from that syndrome where it just repeats/paraphrases parts of the prompt, instead it reasons by itself and expands on the prompt, so it doesn't feel like you are talking to yourself. It seems particularly sensitive to prompt template, but with the intended format (chatml with l3 format bos) it's obviously not lobotomized at all, it is exactly as smart as llama3-instruct.
>>
>>100382162
>4x
Smells like poorfag
>>
Which miqu model is currently best for roleplay? The original or one of the other things made out of it?

>>100382937
LOL
>>
M
>>
hahahahahahhHAHAHAHAHAH
I CANT STOP LAUGHING ABOUT LLAMA3 8K CONTEXT!!!!
HAHAHAHHAHA BIGGEST FUCKUP IN AI HISTORY!
WHAT THE FUCK WERE THEY THINKING HAHAHAHHAHAHAHAHAHAH
oh GOD HAHHAHAHAHAHAHAHHAHAHA
>>
>>100383032
>He doesn't know
>>
File: 1701908581846955.jpg (27 KB, 640x640)
>be mixtral oobabooga
>text speed OK
>loading up previous goon session takes 10+ minutes
help a retard out?
>>
>>100381327
Sorry for the stale reply, stepped out. I was hoping for something model-agnostic, either advice for writing one or a generic prompt to slap in.
I'm not even sure if shit like what >>100380081 posted is seriously effective or meant as a joke, but I've seen such phrases when googling and they look hokey.
>>
File: 1713759047842050.png (440 KB, 620x464)
>>100381394
>>100383039
shhhh... don't tell!
>>
>>100383032
the worst fuckup is that they decided to ditch the 13 and 33b models, remember that L1 had 4 sizes, now it's only 2, we're getting less and less from them
>>
>>100383040
Context processing.
You are probably using the full 32k context and it takes a while to process all those tokens.
I imagine that you are using llama.cpp as a backend? How many layers are you offloading to vram?
>>
>>100383055
>When you want to play with your doll waifu in your doll house but your cat keeps on breaking your immersion to cuck you.
Damn, we never get a break do we?
>>
>>100383032
Explain the problem?
>>
>>100383055
I just know.
>>
>>100383077
>404: IQ not found
>>
>>100383062
>LLAMA4
>9B, 400B, 1T
>16k context
>>
>>100383068
13 n-gpu layers
12,288 n_ctx
llama.cpp model loader
>>
>>100383119
How is your vram usage with that configuration? How much is left?
Prompt processing might go a bit faster if you increase the blas batch size I think.
>>
>>100383113
desu the 400b one looks promising as fuck, it'll probably be the best model ever, unless Sam The Fag decides to fucking release gpt5 or something
>>
im still laughing in tears and my tummy hurts from all the laughing what in the actual fuck
wow... just wow
what kind of retard at meta would make such a inbred decision?
lmao i cant even
JUST WHY?! gahahahahahaah
oh boiiiiiiiii whew
>>
>400B at 2 bit HQQ+/AQLM/QUIP# will fit in 100GB of RAM+VRAM
We're so back.
>>
>>100383108
The model seems to work fine...
>>
>>100383062
Did you forget about the 405B?
I don't see why, it only takes 300 GB of VRAM
>>
>>100383185
If HQQ+ wasn't a meme, everyone would be using it already.
>>
File: 1698894118971001.png (7 KB, 339x129)
>>100383153
Not much on dedicated.
n_batch is at 512
>>
>>100383199
Not out yet.
>>
>>100383195
That wasn't a dig at you, anon.
>>
>>100383185
What about BitNet? 400b 1.58bit that is as accurate as fp16 sounds great!
>>
>>100383181
It was re-trained to work with a 1M (1048k) context...
>>
>>100383206
I was impressed with AQLM, looks like its 2bit thing is as good as 5+ bpw exl2, unfortunately it's not working on windows so I'm crying everytim now ;_;
>>
>>100383219
Try using a batch size of 2048 with as many layers as you can offload with the remaining vram, see if that feels better.
Try playing with those settings and Flash attention until you find a balance that's good for you.
>>
>>100383181
and we waited 9 months for this shit, goddamn did they disappoint...
>>
>>100382219
It's important to remember, that this approach can give biased results. Please proceed with caution when analyzing your data, Anon!
>>
>>100383268
L3 performs really well.
>>
Remember anons, if you don't have the GPUs to run 400B at 5bpw, you'll never be a true localchad.
>>
>>100383275
seriously anon, with the amount of GPUs they have in their hands, they could've worked harder than that, desu being a meta engineer during those 9 months sounds like a dream job

"Just train on moar tokens bro that's it I'm going on vacation once it's done cya"
>>
>>100383285
In a few years we all will.
>>
File: 1710387242276547.jpg (35 KB, 405x720)
35 KB
35 KB JPG
>>100383275
>>
>>100383291
>"just make sure to work hard, we paid a lot for that giant cluster of h100s and every second it's not running is wasted"
>hmm let's start a 400b model on 15T tokens
>okay time for a break!
>>
>>100383264
Thanks I will try that. Sorry, but what is Flash attention?
>>
>>100383316
It's a new option in llama.cpp that saves some vram and I think was supposed to not make things slower, but it did make generation slower for me, for some reason.
It only works with cuda and vulkan, I think.
>>
>>100383311
If my understanding is correct, it's more efficient to train a 405B model on 15T than an 8B. They'll both get to the same ppl eventually, but the 405B will get there faster.
>>
>>100383242
Bitnet isn't real. Sorry anon. You have to let it go.
>>
>>100383333
I don't see flash, but my Oobabooga is out of date, so I guess that means I should update. Much obliged anon.
>>
>>100383333
I'm using the latest llama_cpp python version (0.2.70) but on booba when I activate flash attention I still have this flash_attn = 0 shit so I don't know if it's actually working kek
>>
>>100383352
https://youtu.be/SjaPlwR-kmY?t=16
>>
>>100383361
Goom your brains out anon.

>>100383364
You'll know if your vram usage drops significantly.
I don't offload layers and use a 2048 blas batch size with 8x7b, and with flash attention my vram usage drops from over 3.8GB with a full context to 1.7GB or thereabouts.
>>
>>100383393
do you also use the latest llama_cpp python version on booba anon?
>>
>>100383361
>>100383364
Oh yeah, if you are using llama.cpp through ooba and not using ooba's ui (you are using mikupad, silly, whatever) you might as well use the latest llama.cpp server with the cudart 12 dlls they provide, that might give you a slight speedup with FA.
Might. It didn't work too well for me as far as speed goes, but it might be something in my setup.
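If you do go standalone, the server just speaks HTTP on port 8080 by default, so mikupad/silly (or a throwaway script) can hit it directly. Minimal sketch against the native /completion endpoint:

```python
import requests

# Quick sanity check against a locally running llama.cpp server.
resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": "Hello, Miku.", "n_predict": 32},
)
print(resp.json()["content"])
```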
>>
>>100383258
AQLM sounds like the modern equivalent of what GPTQ-for-LLaMa was back in the day.
>slow transformers shit with horrible context handling
>OS compatibility issues (muh Triton/cuda quants)
>huge improvement in vram use

now we just need aqlm-exllama. transformers is not good enough, I swore never to go back to it.
>>
>>100383429
>muh Triton/cuda quants
that's exactly why AQLM doesn't work for windows at the moment, it's also using triton, fuck man :(
>>
>>100383258
are there any linuxfags who tried those AQLM quants?
https://huggingface.co/models?search=aqlm
is this as good as promised?
>>
>>100383342
That's the idea of training optimality. Basically, there's a ratio of parameters to tokens (about 1 to 20.2). Below that ratio it's more training-efficient to scale up dataset size; above that ratio it's more training-efficient to scale up model size.
Training efficiency just has to do with the minimum amount of training compute to reach a given loss level, though. If you want to use models in production, the rule of thumb is generally just to train on a shitton of data anyway.
Put differently, you'd rather serve a 7B model trained on 15T tokens than a 175B trained on 300B tokens
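Quick arithmetic to make that concrete (the 20.2 is the ratio above, everything else is ballpark):

```python
def chinchilla_optimal_tokens(params, tokens_per_param=20.2):
    # "compute-optimal" token count for a given parameter count
    return params * tokens_per_param

print(chinchilla_optimal_tokens(8e9) / 1e12)    # ~0.16T tokens for an 8B
print(chinchilla_optimal_tokens(405e9) / 1e12)  # ~8.2T tokens for a 405B
# A 15T-token run is ~90x past "optimal" for the 8B: compute-optimal
# is not the same thing as the best model to actually serve.
```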
>>
File: 70bnala.png (163 KB, 913x451)
163 KB
163 KB PNG
So I tested Wizard 8x22B in Q4_K_M vs. Q8 L3-70B Instruct, and 70B is the GOAT and I'm tired of pretending it's not. Yeah there's some shivers, but whatever.
And JB is fucking easy, I don't know what you /aicg/ stagger-ins are on about.
Literally just a
\nAssistant: Certainly
tier jailbreak is all that's necessary.
>>
File: 1348158474943.gif (989 KB, 500x281)
989 KB
989 KB GIF
>>100382075
>The Admin
That's me.
All that computational power and I just query it to write me scripts that I finish the last 10% of, refine my resume, and write the gist of my emails.
>>
do we realize that meta hasn't improved its paradigm since february 2023? L1 is exactly L3 but with just less training, they haven't improved anything else, it's scary...
>>
>>100383528
What's scary is that such a simple thing improved it so fucking much
Intelligent engineering takes a backseat to the compute wall at the moment
>>
>>100383551
I expected a bit more from the best machine learning engineers in the world than just "JUST STACK MOAR LAYERS BRO" and "JUST STACK MORE TOKEN PRETRAINING BRO"

Those guys get paid 1 million per year just to do that? C'mon bro they haven't even tried BitNet! FUCK
>>
>>100383528
I mean that's a good thing really. Because once Mistral is done running 15T tokens through 8x22B it should be pretty damn good then.
>>
>>100383528
all you need is tokens. architecture changes are memes
>>
>>100383577
I imagine there's an absorption limit, though.
Like surely 8B is about as good as a model that size could ever get... right?
>>
>>100383592
at this point they probably reached the limit of what 8B is capable of yeah
>>
>>100383517
congrats, you like boring vanilla slop. enjoy.
>>
>>100383592
llama3 absolutely smashed the chinchilla scaling laws which people thought were gospel up until now
so i think it's not even close to the limit
>>
>>100383528
https://x.com/armenagha/status/1787967679669883096
>>
>>100383517
Show wizard Nala in comparison.
>>
File: llama loss v2.png (68 KB, 1000x420)
68 KB
68 KB PNG
>>100383551
the meme was right all along.
>>
https://news.ycombinator.com/item?id=40302201
https://hao-ai-lab.github.io/blogs/cllm/
>Consistency LLM: converting LLMs to parallel decoders accelerates inference 3.5x
Bros have you seen this?!
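As far as I can tell from the blog (hedged sketch, not their actual code), the base trick is Jacobi decoding: guess a block of future tokens, re-predict all of them in one forward pass, and repeat until nothing changes; CLLM then finetunes the model so that converges in very few iterations. Assuming an HF-style causal LM:

```python
import torch

def jacobi_decode(model, prompt_ids, n_new=16, max_iters=32):
    # Start from an arbitrary guess for the next n_new tokens.
    guess = torch.zeros(n_new, dtype=torch.long)
    for _ in range(max_iters):
        seq = torch.cat([prompt_ids, guess]).unsqueeze(0)
        logits = model(seq).logits[0]                    # (seq_len, vocab)
        # One forward pass re-predicts every guessed position in parallel.
        new_guess = logits[len(prompt_ids) - 1:-1].argmax(-1)
        if torch.equal(new_guess, guess):                # fixed point = done
            break
        guess = new_guess
    return guess
```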
>>
>>100383592
This is why I'm curious about the 8B loss curve, but we'll have to wait for the paper
>>
>>100383607
I mean, you won't get much from the model after 15T though, maybe 1% more? nothing revolutionary, the limit has been reached
>>
>>100383592
Absorption limit will be when you can't quantize it at all without it completely falling apart
>>
>>100383630
that 1% more could be the difference
>>
File: GNAaOkebcAEFS3T.jpg (244 KB, 3098x1004)
244 KB
244 KB JPG
>>100383608
That sounds really good, I'm too much of a brainlet to understand the details though kek
>>
>>100383630
that's what people said about 3T tokens
>>
>>100383608
>fusion multi-modal models
The what?
>>
>>100383636
that's why BitNet is important, with BitNet there won't be quantization anymore
>>
>>100383630
What makes you say this, anon?
>>
>>100383651
he's just saying stuff.
>>
>>100383642
>>100383651
you can see it's starting to plateau after 3T on the small models, so like I said you won't improve the model a lot by going further
>>
>>100383663
>plateau
do you think the loss graphs correlate with model intelligence?
>>
>>100383636
Depends on quantization method. Plus 8 bit stays equivalent to fp16 regardless of knowledge saturation (for current transformers).
>>
>>100383636
They've already shown that 2bpw is the practical limit for training at fp16. It will probably be impossible to fully utilize a floating point model at full precision. That's why quantization works so well to begin with.
>>
>>100383640
it means that this new architecture has the loss of a transformers model that has 4 times more parameters if I understand correctly
>>
>>100383514
Thank you for explaining. It seems comparable to compression, like how long you're willing to wait to get those 15T in the smallest size possible.
I'm sure another factor is whether it's worth training on more data, or just accepting a bigger size and starting to train the next set. It's probably not a good investment when these models still have lifespans measured in months.
>>
>>100383664
of course, that's why the bigger models (which are objectively better) have lower loss than the smaller models
>>
File: 8x22wizardnala.png (236 KB, 914x634)
236 KB
236 KB PNG
>>100383613
It actually took 2 tries; the first try, the model was basically dictating instructions to Nala on how to reply. The second try is a typical run-away mythomax-tier reply. If I hit continue we'd probably be riding off on a horse into the sunset forming bonds together.
>>
>>100383703
>If I hit continue we'd probably be riding off on a horse into the sunset forming bonds together
Geez. That's dire.
Wish I had the hardware to try and wrangle that retard.
>>
>>100383691
llama3 is noticeably smarter than the previous generation models of its size but the loss didn't change that much
so i think when loss begins to plateau it doesn't work well as a measure of model intelligence
>>
>>100381190
>Mixtral-8x7B
>LLaMA-3-8B
>not L3-70B
DOA
>>
>>100383727
>llama3 is noticeably smarter than the previous generation models of its size but the loss didn't change at much
we don't have the paper, so we have no idea what the loss actually is for L3 8B though
>>
>>100383722
Oh wait I'm a retard. I didn't use a Vicuna template for the 8x22B test so I'll have to retest it.
>>
>>100383663
A log function always looks like it's going to plateau, but it's still unbounded
>>
>>100383793
Ah, that makes sense. Your description of the odd behavior sounded pretty weird for such a big model.
>>
>>100383622
someone smarter than me look at this please
it has code and checkpoints too
>>
>>100383622
Interesting. so it predicts N "correct" tokens and goes from there.
Interesting.
Almost sounds like a form of branch prediction in a way?
That's fucking cool.
>>
>>100383622
Seems like this is similar to what Medusa is attempting to do
>>
File: wizvicnala.png (135 KB, 917x358)
135 KB
135 KB PNG
>>100383827
Having trouble getting a template setup that will milk a lengthy reply out of it. Just by virtue of how Vicuna formatting works. (It's more traditional completion style "A role play between blah blah blah" type stuff instead of "write the next reply".)

But overall it's not bad. I'd say it probably has better attention to detail than 70B but 70B just has a little something more to it... sovl if you will.
>>
so is l3 abliterated any good, or did the technique make it tarded? I don't wanna download the whole thing if it sucks, but I don't see anyone talking about it
>>
>>100383972
It's better and less cucked but not fully uncucked, turboderp's Cat-Llama3 and Nvidia's finetune are both smarter and more willing to write sick shit.

Storyfag though, I don't do chat, do ymmv.
>>
>>100383972
I personally don't have problems with original Instruct so I don't have much interest in using that unless it actually improves intelligence somehow. Maybe someone should run MMLU through it or something.
>>
>>100383987
*so, not do
>>
>>100383622
>Bros have you seen this?!
I've been seeing cool papers for more than a year now, but at the end of the day all we got was flash attention and GPTQ, that's all lol
>>
>>100383622
>another way to speed up inference rather than reduce vram cost
>in reality speed is always either billion T/s (fits in vram) or 1 minute per token (doesn't)

yawn... drop it in the pile with speculative execution and MoE
>>
>>100384050
the only way to reduce vram is to make the model smaller; only quantization is viable (or bitnet)
>>
>>100383972
HF evals failed on it, so it's unknown how high the brain damage is: https://huggingface.co/failspy/llama-3-70B-Instruct-abliterated/discussions/5
From my experience it is still quite cucked, but much less than og llama instruct. It still has that annoying positive vibe.
>>
>>100383932
That's better. Have you tried the usual (paragraphs), (longer response), etc. in the last output sequence?
Also, did you set the correct Context template too, whichever that might be?

>>100384050
If it can be adapted to run on CPU too, then sure, I'll take it.
Imagine how cool it would be if all these techniques coalesced into huge models running in RAM at actually usable speeds.
>>
>>100382343
Huh, I've never even thought about that. Alright, fair point
>>
File: 1433508356435.jpg (127 KB, 831x981)
127 KB
127 KB JPG
>>100382343
This post makes me happy.
>>
>>100384050
>another way to speed up inference rather than reduce vram cost

It's because corpos can easily get vram, unlike us, so they'd rather have stuff that saves them money and lets them serve more users off the same hardware

There's no one working very hard on miniaturization, because they don't care about us humble coomers at home (tbf I wouldn't care either in their position)
>>
File deleted.
>>100382343
This comment was so wholesome Miku wants to give you a fist bump!
>>
>>100382075
I'm the Lonecel, what do I win?
>>
>>100384360
depression and autism
>>
>>100384387
>>100384387
>>100384387
>>
File: ik.png (130 KB, 1284x436)
130 KB
130 KB PNG
wtf, ikawrakow the kquants guy is giving up on llama.cpp and making llamafile-exclusive fixes now?

https://github.com/Mozilla-Ocho/llamafile/pull/394
https://github.com/Mozilla-Ocho/llamafile/pull/405
>>
>>100384371
no.
>>
>>100384142
>>100384277
>>100384348
you're welcome, we all matter in the grand scheme of things, never forget that! :3
>>
>>100382722
>I prefer to offload the model to hosted and keep the 12g vram for stuff like stable diffusion
This is a pure software problem if you have enough RAM. Unlike loading models from disk, loading a checkpoint from RAM when needed is super fast. SD already has this option.
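Something like this is all it takes with torch (illustrative sketch): keep the weights resident in system RAM and only hop them into VRAM for the duration of a generation.

```python
import torch

def run_on_gpu(model, inputs):
    model.to("cuda")            # weights already sit in RAM, so this is quick
    try:
        with torch.inference_mode():
            return model(**inputs)
    finally:
        model.to("cpu")         # hand the vram back to stable diffusion
        torch.cuda.empty_cache()
```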
>>
File: 1686556622012721.jpg (418 KB, 1506x1001)
418 KB
418 KB JPG
>>100384470
oh....
>>100384510
Will I get magical powers with Miku?
>>
>>100383571
I'M SCALING SO HARD AAAAAAAAAAAAAAAAAAAAAAAAAAAA
>>
>>100381190
>RefuelLLM-2 is a Mixtral-8x7B base model, trained on a corpus of 2750+ datasets spanning tasks such as classification, reading comprehension, structured attribute extraction and entity resolution.
>8x7b
slop Slop SLOP, you'd have to be one big retard to believe these scammers. This is exactly the same level as all the fucking 7b models "beating" GPT4.


