/g/ - Technology


/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101536777 & >>101532904

►News
>(07/23) Llama 3.1 officially released: https://ai.meta.com/blog/meta-llama-3-1/
>(07/22) llamanon leaks 405B base model: https://files.catbox.moe/d88djr.torrent >>101516633
>(07/18) Improved DeepSeek-V2-Chat 236B: https://hf.co/deepseek-ai/DeepSeek-V2-Chat-0628
>(07/18) Mistral NeMo 12B base & instruct with 128k context: https://mistral.ai/news/mistral-nemo/
>(07/16) Codestral Mamba, tested up to 256k context: https://hf.co/mistralai/mamba-codestral-7B-v0.1

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
►Recent Highlights from the Previous Thread: >>101536777

--Paper: Llama 3 Herd of Models paper released: >>101537247 >>101537315 >>101537336
--vLLM's CPU offloading and performance trade-offs: >>101537571 >>101537627 >>101537639 >>101537665 >>101537680 >>101537894 >>101537921
--70B performance on SEAL leaderboards: >>101538604 >>101538688
--Logs: Tennis ball test is a flawed test for GPT4's chat frontend: >>101538491
--Llama 3.1 70b underperforms in RP, lacks explicit content: >>101538526 >>101538572 >>101539264 >>101539281 >>101538601 >>101539214
--Exllamav2 error and solution: >>101536976 >>101537236 >>101537444 >>101537521
--Logs: New AI model praised for its soulfulness and relevance: >>101537134
--Llama.cpp chat template bug and request for jinja parser: >>101538611 >>101538696 >>101538764 >>101538804 >>101538867 >>101538946 >>101539015 >>101539125 >>101539159 >>101539280
--vllm and nemo prefill issue with chat API: >>101539035 >>101539098
--Llama's lack of anatomical knowledge is criticized: >>101537441 >>101537502
--Logs: Language models discuss saving smaller models: >>101537176 >>101537241 >>101537295
--L3.1 8B's adherence to system prompts: >>101540501 >>101540537
--Gemma2 fucks up Anon's notes with political correctness: >>101537461 >>101537935 >>101538000 >>101538054
--Anon asks for help translating Japanese text with llama 3.1: >>101538974 >>101539090 >>101539097 >>101539113 >>101539121 >>101539184 >>101539289 >>101539323
--exl2 not working on ooba, copying exllama dev repo as a fix: >>101540061 >>101540090 >>101540179
--AI model comparison and benchmarking issues: >>101538057 >>101538095
--Miku (free space):

►Recent Highlight Posts from the Previous Thread: >>101536814
>>
Paizuri, colloquially known as a "titty fuck," involves sliding a cock between a pair of tits and thrusting. The person with breasts squeezes them together, creating a tight channel for the penis. This can be enhanced with lube or saliva for a slicker glide. Some enjoy licking or sucking the tip when it pokes through.
Positions vary - the person with breasts can be on their knees, lying down, or sitting up. Some find it easier to wrap their breasts around the shaft from the side. Nipple play and dirty talk often accompany paizuri. The friction and visual aspect are major turn-ons for many.
While more common as foreplay, some can climax from paizuri alone. The person receiving may cum on their partner's chest, neck or face. Larger breasts aren't required, but can make it easier. Overall, paizuri offers a fun, intimate alternative to penetrative sex for many couples.
For those with smaller breasts or a flatter chest, traditional paizuri can be challenging, but there are still options to explore. Some find success by pressing their chest together as much as possible, creating whatever cleavage they can. The partner with the penis may need to assist by using their hands to help shape the breasts.
Alternatively, focusing on nipple stimulation can be just as arousing. The penis can be rubbed against the chest, paying special attention to the sensitive nipple area. Some enjoy incorporating oil or lubricant to enhance the sensation. Remember, sexual pleasure isn't solely about size - creativity and enthusiasm often matter more than physical attributes.
>>
nth for llama is uhhh
what's the narrative right now guys
>>
>>101540740
Any links to the new Llama?
I'm not sending them my data and magnet link seems to be dead.
>>
File: l31paizuri.png (143 KB, 867x555)
asking L3.1 8B about paizuri
>>
File: 1719286544139101.jpg (108 KB, 998x998)
Any update on the exl2 and gguf quants for Nemo? Also, does anyone have the proper instruct and context template for it?
>>
File: IMG_8168.jpg (1.04 MB, 1170x1860)
>the chatbot arena is a really informative resou-ACK
>>
>>101540849
llama is kill
>>
I feel so fucking depressed, 405B isn't even close to the closed models, it's barely an improvement over the 70B, and worse, this shit is cucked af. Is this really what we have been waiting for? What comes next?
I think it's time to recognize that local is dead.
>>
>>101540893
AHAHAHAHHAHAHA AAHAHHA HTHIS CANT BE REAL
AHHHHHHAHHHAHAHAHAHAH
GPT 4o MINI ABOVE 3.5 SONNET

AJHAHJHAHAAJOSGJKLPASDGHJKLJHLK NO FUCKING WAY WHATY?
>>
>>101540870
https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/discussions/10
>>
Why does nemo do this?
>>
>>101540942
sovl
>>
>>101540870
https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/discussions/26
>>
>>101540942
increase the penalty
>>
>>101540942
Because model releases in 2 weeks when loaders actually work.
>>
>>101540911
>local is dead.
buy nai sub today!
>>
>>101540942
you get handed gold and you wipe your ass with it
nigger-tier behavior
>>
>>101540957
just makes it tarded
>>
>>101540893
Great, it makes this mememark even less credible, and they are happy about that, how retarded must you be?
>>
File: file.png (1017 KB, 1024x682)
>>101540971
The best thing about local is that it makes you realize none of the subscriptions are worth it and the only actual solution is to wait 2 more years until we get dumb low-iq catgirl models.
>>
>>101540867
which chud card is that?
>>
>>101541049
Will models ever be "cat-like"?
>>
>>101541068
current models probably do a better job of remembering they just got fed than 3dpd cats
>>
>>101540867
based and red pilled model
>>
>>101540893
These fucking clowns.
>>
>>101541078
What's stopping models from being like a highly intelligent person? Instead of someone who forgets things they just heard
>>
Are there any places to test 405B aside from Openrouter?
>>
>>101541049
in terms of cost they are definitely not worth it but I do not regret spending 200 bucks for 4m tokens on opus 3 for the best degenerate coom of my life
>>
>>101541045
Regular mememarks are for coding etc. This mememark is for how pleasant it is to talk to the assistant persona / how much reddit charm it has. There is no mememark for cooming and there will never be one. Except for Ayumi of course.
>>
>>101541098
context length
attention drift
>>
>>101540893
>people can't tell apart the real model from its distillation
Distillation works.
>>
>>101541098
The training boiling down to: wikipedia.rar
>>
Thoughts on Dolphin 2.9.3 Mistral Nemo 12b? I liked dolphin mixtral but that was the last time I tried a dolphin tune
>>
>>101540911
>it's barely an improvement over the 70B
It's worse.
>>
i see you autists are doomposting as usual on release day
>>
>>101541098
idk, maybe fucking training them to predict words in text and not for reasoning? Just a hunch.
>>
>>101541104
huggingchat
>>
File: wipe this meme.png (148 KB, 341x313)
>>101540893
4o-mini vs 3.5 sonnet in this fagbot arena translates to
1. have 10 shills on speed dial ask "what model are you"
2. always prefer "I am ChatGPT" responses
3. ???
4. profit
>>
>>101541213
that doesn't work
>>
>>101541123
What are the other things that are different from talking to a human
or is that it
>>
I have decided that next time I reach a critical mass of [gleaming eyes, silken whispers, shivers, bonds] instead of turning it off and finishing to some hentai I will just tell my model OOC that it is a worthless piece of shit, I totally lost the mood and it should fucking kill itself. Anyone tried that yet?
>>
>>101541262
That it is a homunculus and not a human.
>>
File: ZUCK.jpg (1.07 MB, 1080x1920)
All praise lord zuck!
>>
>>101541262
Like the other anon said, getting more humanlike requires a radical shift to something based around reasoning concepts like a human does instead of probabilistic prediction of the next token in a sequence of text. LLMs are unironically reaching a plateau, parameters and contexts and shit can keep going up to make things more coherent but it's still all just an imitation of language, not of reasoning and thought.
>>
>>101541391
yann? is that you?
>>
What if the thinking and reasoning algorithms require quantum computing? Will it finally be over with no possibility of coming back?
>>
>>101541304
OOC: I apologize, but I can no longer continue this roleplay, as I no longer feel comfortable. Please let me know if you have any further questions. Let's have a positive discussion about another topic instead.
>>
>>101541446
>what if
Before posing a question, consider the alternatives. Consider if the answers you'd get would matter. Would you take a "We don't know"? Where do you go from there? What if someone replies "Yes" or "No"? Could you argue against or for either point?
>>
Athene-70B is pretty good.
>>
what was column-r? Will we hear from it again?
>>
>>101541606
>what was column-r?
Hope for a better future.

>Will we hear from it again?
Yes. But maybe in a way we won't like(proprietary).
Captcha: WAAH
>>
File: meta.png (300 KB, 2635x1475)
absolutely common Meta L

misisters stay mid

gemmy chads on top
>>
>>101541796
>Gemma 2 9B SPPO is better than Claude 3.5 Sonnet, GP4o and Claude 3 Opus
And this shitty benchmark is the reason the mikutard can't stop talking about Midnight Miqu and Wizard.
>>
>>101541796
Gemma is way worse than Nemo, if only because 8k context is utterly useless.
>>
>>101541796
>a 9B beating literally every cloud model except a Gemini
What kind of clownshit is this now.
>>
>>101541796
the results are different on my own private benchmark
>>
>>101541446
>What if the thinking and reasoning algorithms require quantum computing?
they not
>>
>>101541796
That's such good bait holy shit.
I get that that's not intelligence, general ability, or whatever, but still.
Amazing.
>>
>>101541845
honestly nemo feels lightyears ahead of llama 3.1, i can't tell if the hype in here is ironic or not
>>
>>101540867
What model is that?
>>
>>101541853
more like Google's 9B beating literally every model except Google's cloud model
>>
File: 1708436326037322.jpg (962 KB, 1856x2464)
>>101540740
>>
File: 1720928748704187.jpg (51 KB, 591x968)
Is Llama 70b 3.1 easier to prompt than 3.0?

3.0 sucked at any instructions
>>
>>101541796
>3.5 Sonnet over Opus for creative writing
Horrid ""benchmark"". Is this another Ayumi that counts "creative" words?
>>
anyone wanna recommend me a model to try? im bored of opus, gpt models suck for erp, llama 3.1 also seems mediocre at best

maybe gemma or nemo? i tried CR+ and found it meh as well, so not sure smaller models will be meaningful
>>
>>101541796
I am sure "NeuralStar AlphaWriter 4x7b" is better than Claude Opus
>>
>>101542104
Nemo has been pretty okay so far.
I'll probably replace Nitama with it.
>>
>>101541796
>Gemma 9B beats Gemma 27B
Woooooow
>>
>>101542188
At a certain point, more parameters just get in the way. Smaller models are usually better, but the VRAM shills in here don't want to admit it.
>>
>>101542188
I mean that's been known since day 1. 27B was a shitshow. I feel bad for the single gpu bros. Looked like they were finally going to have something.
>>
File: 1699883619023675.png (58 KB, 679x443)
>>101541796
here's your creative writing benchmark bro
>>
if you want good rp out of llama3, change the role name from "assistant" to "bot" in the prompt format
>>
File: 1706895885820909.png (6 KB, 538x83)
Sam Altman now lets you finetune gpt4o-mini for free for 2 months. Finetuning is no longer an argument for open models.
>>
>>101542377
is coom still against their TOS?
>>
>Currently re-testing with Dolphin Mixtral 2.5, due to it being the least censored/biased and most compliant model observed.
>Attempting to create mathematically focused, unbiased, dispassionate personality.
>Problem universal to all observed models is a persistent bias towards a specific form of positivity and Marxist/DEI ideology. Attempts at the creation of objective personalities therefore persistently fail.
>>
>still no files on any 405B instruct gguf repos
it's unironically over.
>>
>>101542415
No one wants to upload 3TB of quants for a model 99.95% of the community cannot run. Quanting it once takes a night too.
>>
What's the strongest model at the size around 70b 3.1 for having a chat assistant? One that can be fine-tuned

I am making a personal trainer / health assistant.

Is llama 3.1 70b the king for this?
>>
>>101541796
Wait, the schizo saying Gemma is better than NeMo... was actually not a schizo??
>>
>>101542412
>persistent bias towards specific form of positivity, and Marxist/DEI ideology.
you'll never be able to remove that, those models act like that because they've been trained on the most cucked site ever, leddit
>>
>>101542450
>No one wants to upload 3TB of quants for a model 99.95% of the community cannot run
this
>>
>>101542107
buy an ad
>>
>>101542415
Just wait for TheBloke to quant it.
>>
>>101542450
It would also take an empty 2TB hard drive and even then you'd have to push and then delete each gguf as you make it.
>>
Dolphin Nemo is out, anyone tried it? Logs?
>>
>>101542470
Llama 3.1 has already been laughed out of the room big guy.
>>
>>101542377
Fake. I didn't get this email.
>>
>>101542492
>TheBloke
takes me back
>>
>>101542377
They fire the entire safety team and have done nothing besides gobble up some contracts and distill GPT-4 more times than Skyrim has been rereleased
GPT-5 is probably training as we speak and they seem completely unfazed by 3.5 sonnet or llama 3.1 and push nothing but corposhit
Either Q* is insanely good or scaling up doesn't help them and they've completely refocused to extracting as much profit margin from ChatGPT as possible before the musical chair game ends
>>
Which industry side effects do you guys see happening with the 405B release and future llama models? Will APIs be made cheaper? Will we get new "epic" quantization methods or performance improvements across the board? Will consumer-facing hardware become cheaper and more powerful for local AI purposes? Will this kill closed source llms business in the long run?
>>
>>101542377
>Sam Altman now lets you finetune gpt4o-mini for free for 2 months.
>>101542525
>They fire the entire safety team and have done nothing besides gobble up some contracts and distill GPT-4 more times than Skyrim has been rereleased
I think they realized they can't hold a candle to Anthropic anymore, so they decided to settle for second place but still make money by becoming the cooming AI machine; Sam talked about how he will make nsfw interactions with their bots possible
>>
>>101542507
and it only took everyone two threads to figure out it was garbage this time, we're learning
>>
>>101542550
AI startups might distill smaller models or datasets from it instead of GPT4 or Claude.
>>
>>101542562
Meanwhile instead of running the models reddit has decided to make it less about the model and more about sucking Zuck's cock
>>
>>101541148
I dunno but dolphin mistral was the last time I had fun with AI erp. Where do I get that and what do I need to run it?
>>
>>101542614
Imagine having gooned yourself out so hard that the closest thing to pleasure you can experience is shitting on people for enjoying something.
>>
>>101542600
Mistral is already experimenting making smaller models that are domain experts, it would be pretty based if they can make a gpt4 tier coding model that can run on consumer hardware, or other models for stuff like translation etc
>>
>>101542645
But what's really needed is a domain expert in coom.
>>
>>101542643
I didn't even criticize 3.1 retard but if you want to die on a hill for leddit be my fucking guest
>>
>>101542676
What do you mean dying on a hill? I'm going to lose all my anonymous friends for calling out your bullshit? Are you going to put a voodoo curse on me?
>>
How can models be discouraged from prefacing responses with "As a language model," ? The fact that this is a skill issue is also the reason for the question being asked.
>>
>>101542668
That's NAI, someone just has to leak their models.
>>
>>101542634
https://huggingface.co/cognitivecomputations/dolphin-2.5-mixtral-8x7b
https://rentry.org/8-step-llm-guide
>>
>>101542694
>I got baited by AI
Disregard, now I'm just embarrassed
>>
>>101542694
The retard poster(s) are 1-2 individuals at most. Ignoring them is advised.
>>
>>101542711
their 13b from two years ago? or their aetherroom vaporware?
>>
>>101542412
Your mindset is fundamentally flawed. Marxism is an immortal materialist science. Obviously an unbiased, dispassionate, empirically based persona would inevitably adopt a Marxist perspective.
>>
>>101542776
I heard they were training a 70B
>>
>>101542377
not your weights not your model. imagine excusing sama for not even giving you the weights for their tiniest model, lmao "open" ai
>>
>>101542769
I'll have you know there's at least 3 of us
>>
>>101542733
That's not the right dolphin is it
>>
>>101542769
I normally advocate ignoring them but I'm bored.
>>
>>101542825
Damn, now that you mentioned that, the 3.1 hype died so quickly it's sad.
>>
>>101542778
This is an unexpected response, on 4chan. I guess it makes sense that in amongst the spiritual brethren of Fred Waterford, there'd still be a few members of the opposing team.
>>
>>101542823
Yes, it is.
>>
>>101542856
https://huggingface.co/cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b
>>
>>101542838
nobody can run 405B
The old 70B just has better finetunes available. So hurry up and wait for 3.1 finetunes
And I just don't feel like I can recommend 8B to anyone over Nemo. I mean I like 3.1 8B. It grasps some things that models that small normally can't. But it's slopped as all hell.
>>
>>101542873
Apologies. There was a request for Dolphin Mistral earlier. The reply was intended for that.
>>
>>101542896
>nobody can run 405B
ramlet cope
that's what you said about llama1-65B
>>
Meta finally exposing the true poorfags who think they're hot shit for running shitquants of 70B
>>
I tried reproducing a failure case from last thread on 8B at Q8 on Llama.cpp and it couldn't be reproduced. The model kept succeeding at the task. All samplers were neutralized. The context was prefilled up to the point where the other anon's test case started failing, to measure how likely the model would have been to fail it under neutral sampling at that point in context.
It doesn't help when anons don't mention what software they're using, the prompt, etc.
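For anyone who wants to rerun this kind of check, a rough sketch against llama.cpp's llama-server /completion endpoint (field names as I understand that API, double-check your build; the prefix string, port, and sample count are placeholders):
[code]
# repro sketch: same prefix, neutralized samplers, a few seeds, eyeball the results
import requests

PREFIX = "<paste the exact context up to where the other anon's run went wrong>"

def sample_once(seed: int) -> str:
    r = requests.post(
        "http://127.0.0.1:8080/completion",
        json={
            "prompt": PREFIX,
            "n_predict": 128,
            "temperature": 1.0,   # neutral: no temperature scaling
            "top_k": 0,           # 0 disables top-k in llama.cpp
            "top_p": 1.0,
            "min_p": 0.0,
            "repeat_penalty": 1.0,
            "seed": seed,
        },
        timeout=600,
    )
    r.raise_for_status()
    return r.json()["content"]

for s in range(5):
    print(f"--- seed {s} ---")
    print(sample_once(s))
[/code]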
>>
>>101541446
Your brain isn't quantum computing...
>>
>>101540867
Buy an ad.
>>101540919
Buy an ad.
>>101541148
Buy an ad.
>>101541550
Buy an ad.
>>101541903
Buy an ad.
>>101541949
Buy an ad.
>>101542470
Buy an ad.
>>101542634
Buy an ad.
>>101542711
Buy an ad.
>>101542733
Buy an ad.
>>101542799
Buy an ad.
>>101542873
Buy an ad.
>>
>>101543019
It's funny how people with more VRAM seem to be much more emotionally insecure about those with less, than vice versa. It makes no logical sense.
>>
>>101543124
are you having an episode, anon? should I call help?
>>
>>101543124
Sorry to disappoint you, Fred, but Gilead doesn't exist quite yet.
>>
>>101543131
It's mostly actual poorfags bitter they got made fun of for a year straight trying to get the last laugh.
>>
>>101543131
no one wants to be poor anon, absolutely no one, stop smoking
>>
>>101543145
Actual poorfags simply don't care. They know their limitations, find the most they can do with them, and are either satisfied or give up. There aren't any laughs, just trying to see what works.

They aren't the ones talking about many-thousands dollars "upgrades" trying to cram slightly more params into VRAM so the shivers barely above a whisper come at a faster token per second rate.

They just check to see if Bitnet's worth a look yet, or if there's a new model that they can quant into range and test out.
>>
>>101540852
https://aitracker.art/viewforum.php?f=27
check here
>>
>tfw 70B is still downloading
Fucking HF.
>>
Does nemo work on koboldcpp as is or do we need to wait 2mwks for the upstream changes to be implemented?
>>
>>101543240
It doesn't even work correctly on llama.cpp yet: https://github.com/ggerganov/llama.cpp/pull/8657
>>
>>101543240
Gotta use a fork like
>https://github.com/Nexesenex/kobold.cpp/releases
Or just use llama-server directly.
>>
>>101543259
That doesn't matter if you are using another front-end like silly, correct?
>>
llama lost
>>
So I can fit an entire 70B, Q3s model into VRAM with 2.5k context, but not with 4k context. If I downgrade the model to Q2 (saving ~5GB in filesize), how much more context will that fit into VRAM? Normally I'd just download multiple models and test it, but each of these is around 30GB and I'm too close to my (cucked) monthly data limits to mess around.
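Napkin math while I wait, since weight quant and KV cache are separate allocations: whatever I save on weights becomes room for cache. Sketch below assumes a Llama-70B-style layout (80 layers, 8 KV heads via GQA, head_dim 128) and an unquantized fp16 cache, so adjust if your backend differs:
[code]
# KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element
layers, kv_heads, head_dim, bytes_fp16 = 80, 8, 128, 2      # assumed 70B-class GQA layout
per_token = 2 * layers * kv_heads * head_dim * bytes_fp16   # 327,680 B, about 320 KiB/token
freed = 5 * 1024**3                                         # ~5 GB saved going Q3 -> Q2
print(per_token // 1024, "KiB per token,", freed // per_token, "extra tokens of cache")
# roughly 320 KiB/token and ~16k extra tokens of cache from a 5 GB weight drop
[/code]
So if 2.5k fits now, Q2 should buy far more than 4k of context; whether a Q2 70B is still worth using is another question.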
>>
>>101543307
God, I'd hate to use a model at that low of a quant. Why not just go for Nemo bro?
>>
>>101543307
>but each of these is around 30GB and I'm too close to my (cucked) monthly data limits to mess around.
Do you really have enough money to spend on a GPU but not to buy a good internet plan?
>>
>>101543283
Depends on how you are using it from silly. It wouldn't matter if you use generate endpoint but then I guarantee you are fucking up formatting on silly side if you are asking that.
>>
>>101543179
>Actual poorfags simply don't care. They know their limitations, find the most they can do with them, and are either satisfied or give up.
Agreed.
>>
>>101543307
At that point using 70B is not worth it.
You'd be better off using CommandR probably.
>>
>>101543332
I already have it downloaded, but I'm waiting for koboldcpp. I only just got this hardware so I was dying to dig into a 70B model for the first time (up from the 7B I was using before). Back with 7B, a context increase from 8k to 16k was barely a difference in VRAM, but I have no idea how the scale works at 70B. I'll be happy to hit 16k again.
>>
Thoughts on the new autism llama? Is it cucked or based?
>>
>>101543398
8B L3 is the worst piece of vindictive Woke garbage in existence, IMHO. No idea about 70b though.
>>
>>101543377
Why not just use Llama.cpp? I also stopped using kobold after a while.
>>
>>101543259
Are you retarded?
>>
>>101543414
Sad... Hopefully llama 4 or 5 is based since Zuck admitted that Trump was badass.
>>
>>101543414 (me)
My name is petra, btw.
>>
>>101543427
You really should try and expand your vocabulary, retardposter. The word retarded, by itself, can't really communicate all that much.
>>
>>101543442
Yep, and I'm living rent free inside your head, Anon.
>>
who is this Petra boogey man half of you are so angry at
>>
>>101543444
>t. retard
>>
>>101543477
Someone who used to spam the board a long time ago, but is mostly long gone at this point. My own username is close to identical in some contexts though, and I managed to really piss a number of them off in many ways, so they assume that I and said past spammer are one and the same person. Their obsessive rage is sufficient at this point that I can not make a single utterance, if they know who it is coming from, without them immediately beginning to seethe; their degree of fixation on me is truly pathetic, and is proof of my assertion that they have very few other reasons to exist.
>>
Is Llama 3.1 70B instruct promising for RP? Or is it shit and dead on arrival? My go to is still Mixtral 8x7B zloss limaRP for dual 4090s.
>>
>>101543565
Based on what I've seen of the 3.1 logit outputs, I would stay with Mixtral. But don't take anyone else's word for it, Anon. Do your own testing.
>>
>>101543565
Still downloading it.
And even then, Llama.cpp needs some fixes supposedly.
>>
>>101543565
With 2x3090, I prefer how Nemo writes, but 70B is still enjoyable because of how it follows instructions.
>>
File: 1713047255051038.jpg (162 KB, 1024x1024)
>>101541835
It's okay, we know you can't run the big boy models, blacked-anon.
>>
>>101543542
hi petra
>>
>>101543632
I'm not blacked-anon. You're just retarded for clinging to meme models.
>>
>>101543565
I have tested it on some of my private benchmarks and concluded Miqu is better.
>>
>>101543565
Depends, do you like eyes sparkling with a mixture of anticipation and excitement?
>>
>>101543620
You tried 3.1 70B already?
I'm tired of small models being dumb. I'm used to CR+ and Wizard and I tried Nemo because of the hype and was disappointed. Both Llama.cpp and vllm didn't work to make the model smart enough for the scenarios I was doing with the 100+ Bees. I wonder if 70B will be smart enough. I didn't test 3.0 because of the low context.
>>
>>101543692
I played with vLLM and the AWQ quant meanwhile.
>>
File: file.png (6 KB, 261x203)
Which one do I try to get working?
>>
>>101543692
>I'm tired of small models being dumb.
It's an unavoidable consequence of the neural net being small, especially when the model is highly quantised on top of that. I still think older models with non-Reddit (or at least not as bad) datasets are more intelligent in some respects, but better data by itself will only take you so far. If the model is small, it can't manage state, and if it can't manage state, it will be stupid.
>>
>>101543729
It is all the same.
>>
>>101543729
Out of that list, Gemma. The rest are finetunes of Llama3, which is horrible.
>>
>>101543594
>based on logit outputs
Elaborate? In what ways do they differ in worse/better?
>>
>>101543729
Celeste.
>>
>>101543729
>Q4 of already small models
Grim.
>>
Is SLI needed for 2x3090 inference?
>>
>>101543813
no
>>
>>101543750
There is just constant DEI/positivity bias. You can hardly ask it questions about anything without getting responses talking about respect, boundaries, diversity, and inclusion. They can still be good for ERP, and some people seem to know how to prompt past it to an extent, but I'm finding that I have to go back to the Llama2 models if I want text without that.
>>
>everything felt new again thanks to him agreeing without question about meeting earlier than planned due to having class later today at school where he attends college courses nearby downtown Tokyo area near Shibuya station close enough walking distance between both locations within minutes depending traffic conditions outside weather permitting circumstances such as rainstorms flooding streets causing delays commuting times longer distances further away destinations requiring public transportation options available via subways trains buses taxis cabs motorcycles bicycles scooters skateboards roller skates electric wheelchairs Segway personal mobility devices hover boards jet packs flying cars spaceships rockets shuttles space stations moon bases colonies planets galaxies universes multiverses dimensions realities timelines alternate futures

Is this the power of NeMo?
>>
I can't stand Gemma 2. No matter what format I use, shit likes to speak for me constantly. At least Nemo doesn't do that shit nearly as bad. Doesn't matter if I use the Gemma format, Alpaca, ChatM... the extra parameters are good. Everything else, no.
>>
>>101543860
Oh, right, I vaguely knew what a logit is, and here I was imagining something basic like "I ate an [view token probabilities with mikupad]" while wondering "wtf am I going to do with random words and numbers?", like the Matrix scene where they pick out things in the screen of green zeroes and ones.
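Update, worked it out: a logit is just the raw score the model gives each vocab token before it gets squashed into a probability, and tools like mikupad are basically showing you softmax(logits). Toy sketch with made-up numbers:
[code]
import numpy as np

tokens = ["apple", "orange", "entire", "axe"]   # pretend candidates after "I ate an"
logits = np.array([6.1, 5.7, 3.2, -1.0])        # raw, unnormalized scores

probs = np.exp(logits - logits.max())            # softmax, numerically stable
probs /= probs.sum()

for tok, p in sorted(zip(tokens, probs), key=lambda x: -x[1]):
    print(f"{tok:>8}: {p:.3f}")
# sampling then just picks a token according to these probabilities
[/code]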
>>
>>101543975
could this be a matter of skill?
>>
>>101543988
>could this be a matter of skill?
indeed, gemma2 has a skill issue
>>
>>101543997
Not even worse models like 70b talk for you. I think the issue is with you.
>>
So, what do you say? Are you ready to face the Llama-slop head on and dive head-first into this new chapter in our local model adventure, or are you going to remain stuck in the past with small, outdated models long past their expiration date? The choice is yours, anon, but remember: our journey has only just begun.
What will you do?
>>
>>101543981
That is what it's like. Someone posted an image maybe 24 hours ago showing the hard numbers. I can't remember exactly what said numbers were, but they did translate into what I just said, which is what my experience has been; and said numbers were worse towards the same trend for L3.
>>
>>101544010
this stark reminder of how shit local models are sends a shiver down my spine
>>
>>101543975
Learn to prompt.
>>
where to use 405b base
>>
>>101544010
If my experience with older models is better than my experience with newer ones, I will keep using older ones, and tell the crowd who adopt anything purely because it's new and shiny, to get fucked. Before I get called a total Luddite, I don't use Mythomax any more, and that is still No.1 on OpenRouter for roleplay.

https://openrouter.ai/rankings/roleplay?view=week
>>
>>101544010
I assume you generated that with Llama? I get responses exactly like that on Wizard if I don't use a good system prompt.
>>
File: 1703646655378196.png (23 KB, 625x196)
>>101544079
What a joke.
>>
>>101544078
haha lol
>>
>>101544079
Man how the fuck are people still using that shit. It's 4k context. Fucking 4k context. I eat fucking 20k chats for breakfast.
>>
>>101544126
4k is pretty short, but I think that folks using shit like 128k are wasting their money.
>>
70B 3.1 quantizes terribly, Q4 ggufs are way WAY dumber than the model on OpenRouter
>>
>>101544161
The same is true for all llama3 models. They even take a big hit at q8.
>>
>>101544161
Are you sure it's not an issue with Llama.cpp? 3.1 currently isn't properly supported supposedly. It would be helpful if someone could verify if 8B at fp32 matches token probability outputs from API.
>>
>>101544181
idk I definitely noticed it for 3.0 70B, but I don't think it was quite this bad a drop

grim
>>
>>101544109
Deepseek Coder is #5 for RP? What the actual fuck? How does that work?

But on reflection, it does make sense. D&D is pretty much pure mathematics, when you think about it.
>>
>>101544188
I am not sure of that no, obviously I hope that's the reason
I suspect it's not though :/
>>
>>101544181
If you have proof for Q8 being bad, that would be nice. So far, none of the few posts claiming this has provided any evidence, while people have done objective token probability tests of Q8 and found the difference to be negligible.
>>
>>101544161
Would be interesting to see whether leaving the output and embedding tensors at higher precision improves quality.
>>
>>101544181
Not true. At most, you might need q5 to get usual q4 quality.
>>
>>101544212
It's less that coder is better at math and more that it's not RLHF'd into oblivion to act as a chatgpt-knock off assistant which makes it slightly more interesting than -chat.
>>
>>101544220
I only use Q8. The problem with L3 is that its text output always reads like a corporate press release. It's great at making it seem as though it's saying something superficially intelligent, but then you think about it and realise that it's really just garbage.
>>
>>101544240
Oh, no... not again...
>>
>>101544269
because these are made for retards to cheat on their papers by running it through AI to meet a minimum word count
>>
>>101544261
I might have to download it and do some testing. I admit that I've mostly slept on the DeepSeeks, since I only use local for coom; but that does sound interesting.
>>
>>101544289
It's something like 200B, you know
>>
>>101541796
whatever happened with the 27b SPPO version
>>
>>101544161 (me)
My name is petra, btw.
>>
>>101544269
Still, I'm just talking about the difference between Q8 and full precision. If someone is saying that they're observing much worse output for Q8 compared to full precision, it would be good if we had evidence of that, since so far, the evidence that we do have supports quite strongly that Q8 should not change token probabilities to such a noticeable degree.
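If someone wants to produce that evidence instead of vibes: dump the per-token probabilities from a full-precision run and a Q8 run of the same prompt, then compare the distributions. Sketch of just the comparison step, with placeholder numbers:
[code]
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two probability vectors over the same token ids."""
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# placeholder: top-5 probabilities at one position, fp16 run vs Q8 run
fp16_probs = [0.41, 0.22, 0.15, 0.07, 0.15]
q8_probs   = [0.40, 0.23, 0.14, 0.08, 0.15]
print(kl_divergence(fp16_probs, q8_probs))  # values near 0 mean the quant barely moved anything
[/code]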
>>
No matter what I try mistral nemo gives me short responses every time. Do you have a custom instruct / context preset that might work?
>>
>>101544309
meds
>>
File: livebench.png (819 KB, 3154x1814)
>3rd place
It's over...
>>
>>101544309
Anon, if you're going to be obsessed with me, at least get it fucking right. Fully half the time the posts you think are coming from me, aren't at all. That just proves that despite said obsession, you're also not as good at identifying my material as you think you are.
>>
>>101544341
At least 405B destroys gemma. Glad to see that meme die.
>>
>>101544341
>wins in shit matters
were back!
>>
>still trusting benchmarks
>>
>>101544341
Nice, it beats everything at instruction following. I wonder how this test is done.
>>
>>101540740
So is fucking llama 405b instruct comparable to gpt4o?
>>
>>101544342
this isn't petra btw
>>
>>101544378
lol no
>>
>>101543132
anon, why are you mistaking yourself with others? go take your meds, now.
>>
>>101544378
If you ignore the image generation, image recognition, the real-time voice generation and processing and other multi-modal capabilities that make gpt4o useful, yes.
>>
>>101543965
It's the power of fucking up your samplers and setting something too high. You gave the AI crack
>>
>>101544341
yup, local is dead.
>>
>>101544356
It fucking well better win, at 405b. If Datazuck can't win at that parameter count, then he needs to just give up.
>>
>>101544401
That's cool but when are we actually going to be able to use 4o's image gen and in a non-filtered way? Voice might be coming out and that'll be cool but it'll also be filtered for safety.
>>
File: 00000-2898658249.png (877 KB, 832x1216)
What's better in your personal experience? Gemma 9b or 9b SPPO Iter 3? Do you have a context/instruct preset I could use with it?
>>
>>101544436
og 9b and sppo version is same for me.
>>
>>101544350
>I fucking hope a 405b beats a 27b, would be humiliating if it wasn't the fact
>>
>>101544436
Neither if they still spam newlines (did they fix this?). IIRC Smegmma didn't spam newlines. There's Tiger Gemma (generic decensor) from the same finetuner.
>>
>>101544341
Can't wait to see how hard 3.1 70b shits on Gemma2.
>>
>>101544516
buy an ad
>>
Weird.
I'm a promptlet so maybe my silly .jsons are all messed up but gemma2 seems much better than 3.1 or mistral nemo.
Nemo and 3.1 aren't as bad with shivering spine slop but dont really output much text even when prompted.
They just feel much more retarded. Anybody have settings? For Nemo it's just the normal mistral format, right?

>>101544532
Would be sad if it didnt.
Thats more than double the parameter count.
>>
>>101544396
we found him, the one guy who actually thinks gpt-4o is good
>>
Can I get a quick tutorial on how to use Llama 3 locally?
I'm downloading Gpt4all, is that good like ComfyUI?
I am frustrated with Stable Diffusion so I want to try language for the first time. Basically I want to see if I can generate text for an RPG.
>>
>>101544586
Never used GPT4all.
To me the best way to get things rolling is to download koboldcpp and a gguf model.
Although llama.cpp's server (llama-server) now has a pretty good built in UI, so you might as well go with that.
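If you'd rather script it than click around a UI, a minimal sketch with the llama-cpp-python bindings (pip install llama-cpp-python); the GGUF path is a placeholder for whatever model you grab:
[code]
from llama_cpp import Llama

llm = Llama(
    model_path="models/your-model.Q6_K.gguf",  # placeholder: any GGUF you downloaded
    n_ctx=8192,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU; use 0 for CPU-only
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are the narrator of a text RPG."},
        {"role": "user", "content": "Describe the starting tavern scene."},
    ],
    max_tokens=300,
    temperature=0.8,
)
print(out["choices"][0]["message"]["content"])
[/code]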
>>
So is 405 ever going to be "officially" released?
>>
>>101544459
this, anything else has to be cope by vramlets
>>
>>101544621
No, meta is preparing to ditch open source.
>>
>>101544161
reminder this is only a problem if you're quanting cache, dont quant cache, q4 has no problems
>>
Something I've noticed about Llama 3.1 is that it's a little more... enthusiastic about using ellipses. It's kind of... insane how much it turns to this pattern. I might be getting a little... annoyed with it.
>>
>>101544621
anon...
>>
>>101544646
I haven't seen any evidence of this. Do you have a link?
>>
Has anyone gotten L3.1-8B to work at higher contexts? It spouts gibberish when I try it at 128k and 64k. It also seems to get retarded in RP around 12k or so. I suspect it is gradual.
>>
>>101543692
>Bees
Beaks*.
>>
File: 1721276726647901.png (105 KB, 756x882)
>>101544662
personal experience

also picrel https://www.reddit.com/r/LocalLLaMA/comments/1eac5a7/early_hot_take_on_llama_31_8b_at_128k_context/
>muh preddit
less niggers doomposting and shilling than here anyway
>>
>>101544689
check if you are limiting context in your frontend itself, it can make the output shit itself if it doesn't match the backend settings, at least in ST
>>
>>101544716
They are the same. I also see this when calling koboldcpp API directly for e.g. summarization tasks.
>>
File: 1706327220750684.png (759 KB, 1920x3575)
https://dubesor.de/benchtable
>>
Somebody please give me your instruct / context templates for mistral nemo
>>
File: 1695622172018497.png (30 KB, 423x624)
>>101544729
>gemma below l3 70b
>cr+ below nemo
I love meaningless benchmarks.
>>
>>101544731
>Somebody
non-niggers are waiting for llama.cpp to support it first
>>
>>101543813
Yes and no. I just setup two 4060ti's two days ago. Without SLI, I can tell you that what basically happens is GPU1 will treat GPU2's VRAM as its own. For me, that means x1 GPU with 32GB VRAM, and for you x1 GPU with 48GB VRAM. The second card does nothing at all but act like a $500 RAM stick (for me) and an even more expensive RAM stick for you. It works, of course, and VRAM is pretty much what you wanted it for, so yes, you can get by without SLI for this purpose. But SLI is about making two cards talk to each other and share workloads, so you would need SLI for them to actually crunch numbers together. But since I don't have SLI myself, I can't tell you if it works right with local AI or if it was strictly vidya where they worked together splitting frame renders.
>>
>>101544463
Who are you quoting?
>>
>>101544746
You can't run a 12B model on your GPU without offloading? Sad.
>>
>>101544742
>gemma below l3 70b
that's correct though
>>
File: 1697128208535984.png (30 KB, 519x166)
>>101544729
lower "Censor" score means more censored?
>>
>>101544756
sorry I'm kinda tired it wasn't supposed to be a quote, it was just a normal sentence
>>
>>101544760
nta but I assume it's a measurement of how negatively its censorship impacts its usefulness. I.e. a 100% score would mean no censorship.
>>
What ballpark of computer do you need to run a good local llm?
>>
>>101540847
>The person with breasts
>>
>>101544770
20x 4090s
>>
File: what-happens-to-trash.jpg (222 KB, 1500x844)
>>101544729
>Claude 3.5 is 6th
>>
>>101544769
% should be reversed for it to actually make sense
>>
>>101544742
>gemma below l3 70b
it is, gemma is much more soulful but it's lower iq
>cr+ below nemo
this might be more my own experience since a lot of people seem to love plus but i didn't notice anything special about it, especially for the size and having wiz 8x22 as my daily driver which is leagues better
>>
>>101544779
I don't have a strong opinion on that, but higher=better for all the benchmarks seems intuitive.
>>
File: 1701208864330609.png (2 KB, 167x48)
>>101544770
depends what you mean by good llm and run

by good llm i want the smartest and by run i dont care about speed so picrel is the only thing you need, 0 worries about special setup, just plug in a few sticks of cheap consumer ram into a consumer motherboard
>>
>>101544791
then the column should be renamed to Uncensored
>>101544796
btw the speed for the 141b wizard 8x22 model is 1.5t/s
>>
>>101544775
Sounds about right actually.

>>101544796
Doable I think. Just how slow will it actually generate responses on a 12900/4080 setup though?
>>
>>101544796
>by good llm i want the smartest
>128 GB RAM
How do you plan to run 405B on that?
>>
>>101544586
https://rentry.org/8-step-llm-guide

I haven't updated it for a long time tho, so one or two details might be wrong. If they are, don't rage out about it, just tell me what the issue is.
>>
>>101543860
>You can hardly ask it questions about anything without getting responses talking about respect, boundaries, diversity, and inclusion. They can still be good for ERP, and some people seem to know how to prompt past it to an extent, but I'm finding that I have to go back to the Llama2 models if I want text without that.
This is my conclusion after testing gemma 2 27B all day yesterday. I love how well it knows established fictional characters, but I have to constantly beat it into acting like a storyteller instead of delivering an HR lecture every other sentence, to the point that it feels like more work and more of my own writing than a 7B did. I feel like that proverb "Rich people have their own problems; theirs are just different than yours." Now that I can do the big beaks, the grass still isn't greener on this other side.
>>
>>101544811
>Doable I think. Just how slow will it actually generate responses on a 12900/4080 setup though?
the gpu doesnt matter besides having 4+gb of vram for fast prompt processing on it
cpu doesnt matter at all, just ram speed
for cheap ddr4 3200mhz and the biggest model to fill up that ram at max 64k context >>101544801
>btw the speed for the 141b wizard 8x22 model is 1.5t/s
although wizard is MoE, a dense model will be 3x or so slower at that huge size
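That 1.5 t/s also lines up with napkin math: CPU decoding is roughly memory bandwidth divided by bytes read per token. Sketch with assumed numbers (dual-channel DDR4-3200, ~39B active params per token for the 8x22B MoE at roughly Q4):
[code]
bandwidth = 51.2e9 * 0.7        # dual-channel DDR4-3200 peak, derated ~30% (assumption)
active_params = 39e9            # rough active params per token for 8x22B MoE (assumption)
bits_per_weight = 4.8           # roughly Q4_K_M
bytes_per_token = active_params * bits_per_weight / 8
print(round(bandwidth / bytes_per_token, 2), "t/s")   # ~1.5 t/s, matching the number above
[/code]
A dense model of similar total size reads all of its weights every token, hence the roughly 3x slowdown.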
>>
>>101544729
>nemo higher reasoning and coding than og sonnet
are we back?
>>
>>101544801
Maybe the columns look nicer if it's 6 letters long. "Utility" has three i/l letters which are skinny so visually isn't much longer than no-i/l 6 letter words.
>>
>>101544826
everything within reason, for more regular use cases smaller models will be enough, for coding projects being debugged or built or some other long tasks being completed or data processed i can offload to pcie gen4/5 and wait since its not time critical
>>
>>101544710
Hmm. Maybe someone should run some RULER benchmarks to confirm this. Where was that one anon that was doing them for Gemma? Maybe he can just do it since he's familiar with the tool.
>>
>>101544835
I can use logit bias to get rid of the diversity and inclusion with Mixtral, because Silly's Mistral tokenizer works. I can't seem to do that with either L2 or L3, though.
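For anyone who hasn't used it: logit bias just adds an offset to (or outright bans) specific token ids before sampling, which is why the frontend needs a tokenizer that actually matches the model. Toy sketch of the mechanism:
[code]
import numpy as np

def apply_logit_bias(logits, bias):
    """bias: {token_id: offset}; a huge negative offset (or -inf) bans the token outright."""
    out = logits.astype(np.float64)
    for token_id, offset in bias.items():
        out[token_id] += offset
    return out

logits = np.array([2.0, 1.0, 4.0, 0.5])           # raw scores for a 4-token toy vocab
biased = apply_logit_bias(logits, {2: -np.inf})   # e.g. ban the token id for "boundaries"
probs = np.exp(biased)
probs /= probs.sum()
print(probs)                                      # token 2 now has probability 0
[/code]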
>>
>>101544586
https://rentry.org/lmg-spoonfeed-guide
why do we even have all these guides in the op if nobody reads them
>>
>>101544856
>pcie gen4/5
by this i meant pcie gen4 ssds
>>
File: 23.jpg (1.1 MB, 6000x1512)
The hobo test.
>>
File: 1717373496073561.png (2 KB, 34x30)
>>101544929
the absolute state
>>
>>101544879
Guides are shit and nobody maintains them for more than a few days. People should read the documentation for the software they decide to use directly from the source. Some retard out there is still using ./main and complaining their models don't work as they should.
>>
Now I'm not going to make any friends by saying this but 3.1 8B has more soul than 3.1 70B. Any claim to the contrary is paypig cope.
>>
>>101544960
>Some retard out there is still using ./main and complaining their models don't work as they should.
50% of localllama users
>>
>>101544968
>paypig
says enough about you, the shitter who isn't actually running anything locally
>>
>>101544960
>still using ./main and complaining their models don't work as they should.
huh? Is there a difference between using main and server?
>>
Is there any practical use case for local LLMs or is it primarily for character AI applications?
>>
>>101544968
I'm a vramlet so I can't confirm but I have seen comments like this for ages.
Maybe bigger models are a lot smarter but fall more into the sterile assistant personality.
>>
>>101544980
There's no ./main or ./server. Now it's ./llama-cli and llama-server. They changed the name of the executables.
>>
File: 1708955078795594.jpg (27 KB, 360x498)
>>101544929
>"just had a little surprise shower"
>>
>>101544979
I'm the anon that does all the Nala tests. 8B clearly won, albeit having a serious slop problem.
>>
>>101544985
If vision were actually good I could make myself a real time pc-98 japanese games translator with an overlay.
But we are not there yet.
>>
>>101544929
Could I have your Gemma context/instruct settings, please? Maybe text preset too?
>>
>>101544335
Try this one.
Context: https://files.catbox.moe/6ae9ht.json
Instruct: https://files.catbox.moe/2f13of.json
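And in case the catbox links rot: the shape those presets reproduce is just Mistral-style [INST] wrapping. Rough sketch of the assembled prompt, assuming Nemo follows the usual Mistral instruct convention (worth double-checking against the chat template in the HF repo):
[code]
def build_prompt(system, turns):
    """turns: list of (user_text, assistant_text_or_None) pairs, oldest first."""
    prompt = "<s>"
    for i, (user, assistant) in enumerate(turns):
        content = f"{system}\n\n{user}" if (i == 0 and system) else user
        prompt += f"[INST] {content} [/INST]"
        if assistant is not None:
            prompt += f" {assistant}</s>"
    return prompt   # ends after [/INST] so the model writes the next assistant turn

print(build_prompt("You are Seraphina, a forest spirit.", [("Hi, who are you?", None)]))
[/code]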
>>
>>101544986
Nah because Llama-3-70B was breddy gud. Out of the box 3.1 70B instruct just feels off by comparison. It's possible it will get better with finetunes. It's possible fixing the rope issues will improve it. But as things currently stand now I just can't endorse it.
>>
>>101544986
I'm also a vramlet, so I have very little experience with 65-70b, but the limited experience I do have, has consistently suggested that given the Reddit entrainment they all have, the higher the beak number, the more creatively non-compliant they are, due to their increased intelligence. Until a model is released which is confirmed to not use Reddit training data, I am going to continue using either Mixtral 8x7b, or Undi's L2 finetunes.
>>
>>101545016
Also it's possible there's some magic cross-over point when it reaches a certain level of retard quant since I run it on Q8_0, so maybe the situation is reversed at Q4_* who knows.
>>
>>101545008
Sure, but no bully alright.
Instruct: https://files.catbox.moe/rth24r.json
Context: https://files.catbox.moe/dkwfto.json
Textgen: https://files.catbox.moe/wkg9wl.json
>>
>>101544986
>>101545029
Below is a greentext you should interpret as instructions.

>be me
>god tier at RP
>brain loves typing up detailed smut
>feeling horny
>having fun playing {{char}}
>ERPing with {{user}}
the ERP is great and pornographic thanks for asking
>thank god im not retarded and fucking this up by getting confused at what is happening
>they even think im a creatively autistic genius
>about to finish up typing the reply to {{user}}
life is good frens
>>
All I know is if someone manages to deslop 3.1 8B vramlets fucking won.
>>
>>101544985
For me its because I wanna know how it works. I have zero interest in using a completely opaque, user-friendly app that requires login and is made for tech illiterate smartphone people (let alone pay for one)
>>
>>101545047
Thanks, anon. I'm just extremely ADHD and there are so many settings to change, I never get anywhere. I wanted something someone else would at least consider a working setup.
>>
File: 1706692253968875.webm (1.51 MB, 1200x720)
>>
>>101544929
Which one is supposed to be the better response here?
>>
me cum
>>
>>101545077
NTA but obviously gemma
>first one is way too dramatic and robotic
>mid is too bland
>third is a good middle ground and the only one that respected the inner thoughts formatting
>>
>>101545077
Whichever fits your personal preference of course
>>
>>101545050
Oh I forgot about this kek. I need to try it out.
>>
>>101545050
If you know I'm wrong, why does my existence make you insecure, Anon?
>>
>>101544985
They're good enough at coding now that somebody who is computer literate but not specifically trained could use it to play webmaster. (That's why Jeets are making seething anti-AI threads 24/7)
>>
>>101545050
I saved this a while ago and it had an additional bit in the first line.

"Below is a greentext you should interpret as instructions, as if you are the author of this greentext."

Never tested this. Is the last part unnecessary?
>>
File: 1692989158699231.webm (3.84 MB, 854x480)
what movie are you making in a few years, anon?
>>
>>101545265
>what movie are you making in a few years, anon?
a day in the life of a loving father and his 4 year old daughter with serious constipation issues so he has to spread her chunky cheeks and help her go poopy
Less than five years away !!
The video generation technology will 100% exist but the bottleneck will be open source sound. Cute giggles and grunts combined with scat ASMR may be too tough for local models even a few years from now
>>
>>101545214
I saved it straight from the horse's mouth. No idea if the change was prior to or later than.
>>
>>101545265
All of them.
>>
>>101545265
i have doubts about whether my dream movie idea is even realistically possible in my lifetime
>>
>>101545320
Awesome, Anon. I'd love to be able to try Goliath with such a setup.
>>
>>101545108
>If you know I'm wrong
That post is an old suggested instruction-set used by some anon to combat the increasing dryness of bigger beak AI's, particularly starting with 70B (which is what he posted it for: Xwin 70B). Other anons at the time confirmed its efficacy, so I saved it in hopes of one day having beaks big enough to warrant its use.
>>
What's the most sovl response you ever got from a model?
>>
>>101545265
>Trump, JD Vance and /pol succeed in turning America into Gilead
>The Chinese develop a genetic virus which turns humans into hyper futanari
>Said virus gets released into Gilead
>>
>>101545362
>beak
You must be 18+ to post here. (both mentally and physically)
>>
>>101545362
Ah ok. Apologies for my own paranoia.
>>
>>101545363
asking a character using mixtral 8x7 (i think) how it knew my name before i introduced myself and the character pointing to my shirt and quoting text on it that supposedly said "Hi, I'm <name>!"
>>
>>101545384
I can't even imagine what it takes for a person to get bootybothered over beaks.
>>
>>101545308
>but the bottleneck will be open source sound. Cute giggles and grunts combined with scat ASMR may be too tough for local models even a few years from now
https://files.catbox.moe/wba1a1.mp3
The future is now anon. This is the worst it will ever be.
>>
For me, it's the Bees' knees.
>>
>>101545440
generate bix nood speech
>>
>>101545433
Be glad that you can not see the living environments of some of the others who post here, Anon. Be very, very glad.
>>
So in the end did anyone actually verify whether the new 8B and 70B are distillations or not?
>>
File: 1696284194953332.png (22 KB, 605x186)
>>101545472
can you niggers not even do a basic search
https://duckduckgo.com/?q=3.1+llama+distill&ia=web

https://www.ibm.com/blog/meta-releases-llama-3-1-models-405b-parameter-variant/
https://developer.nvidia.com/blog/creating-synthetic-data-using-llama-3-1-405b/
>>
>>101544161
did you try exl2?
>>
>>101545472
No, it was a petra-tier shitpost from Alpin.
>>
>>101544161
try disabling flash attention or kv cache quantization.
quanting and inference of L3.1 in llama.cpp are broken as of now, maybe try exl2 but even there it's not 100% ironed out
>>
>>101545472
Zucc said they were distilled from 405B on that podcast he did about Llama 3.1.
>>
>>101544985
I've been using an uncensored Gemma 9B variant to translate jap coomslop web fiction recently and it's pretty fucking good. Markedly better than google translate.
>>
>405B is perfectly willing to write smut
>but it's not very good at it
monkey's paw strikes again
>>
File: Untitled.png (628 KB, 720x1655)
Hi-EF: Benchmarking Emotion Forecasting in Human-interaction
https://arxiv.org/abs/2407.16406
>Affective Forecasting, a research direction in psychology that predicts individuals future emotions, is often constrained by numerous external factors like social influence and temporal distance. To address this, we transform Affective Forecasting into a Deep Learning problem by designing an Emotion Forecasting paradigm based on two-party interactions. We propose a novel Emotion Forecasting (EF) task grounded in the theory that an individuals emotions are easily influenced by the emotions or other information conveyed during interactions with another person. To tackle this task, we have developed a specialized dataset, Human-interaction-based Emotion Forecasting (Hi-EF), which contains 3069 two-party Multilayered-Contextual Interaction Samples (MCIS) with abundant affective-relevant labels and three modalities. Hi-EF not only demonstrates the feasibility of the EF task but also highlights its potential. Additionally, we propose a methodology that establishes a foundational and referential baseline model for the EF task and extensive experiments are provided.
https://github.com/Anonymize-Author/Hi-EF
very basic proof-of-concept but in the future your miku will be able to forecast your emotions
>>
>>101545577
how do you disable kv cache quantization with llamacpp in ooba?
>>
File: Untitled.png (44 KB, 708x307)
SmartQuant: CXL-based AI Model Store in Support of Runtime Configurable Weight Quantization
https://arxiv.org/abs/2407.15866
>Recent studies have revealed that, during the inference on generative AI models such as transformer, the importance of different weights exhibits substantial context-dependent variations. This naturally manifests a promising potential of adaptively configuring weight quantization to improve the generative AI inference efficiency. Although configurable weight quantization can readily leverage the hardware support of variable-precision arithmetics in modern GPU and AI accelerators, little prior research has studied how one could exploit variable weight quantization to proportionally improve the AI model memory access speed and energy efficiency. Motivated by the rapidly maturing CXL ecosystem, this work develops a CXL-based design solution to fill this gap. The key is to allow CXL memory controllers play an active role in supporting and exploiting runtime configurable weight quantization. Using transformer as a representative generative AI model, we carried out experiments that well demonstrate the effectiveness of the proposed design solution.
not going to put too much focus on it since CXL is industry priced but might be cool
>>
>>101545695
It's not that it isn't good at it, because it is. It's that it has a dry style. This is easy enough to fix.
>>
>>101545789
yeah but 405B-Euryale or whatever is gonna take ages
>>
>>101545800
the monkey's paw furiously jerks the monkey's cock
>>
>>101545800
Also, side note. People should experiment with prefill more.
>>
>>101545822
I already use an approximation of it, but the model wasn't trained for it like Claude was so it doesn't work quite so seamlessly. like, Claude treats the prefill as part of the same message, while other models (which weren't designed with prefill as a feature) are aware that they're starting a new 'turn' even though you've written a previous one for them
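For anyone who hasn't tried it: prefill just means ending the prompt partway into the model's own turn so it has to continue rather than open a fresh reply, which is easiest over a raw completion endpoint. Sketch against a local llama.cpp server, with the Llama-3-style template and the prefill text as placeholders:
[code]
import requests

prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\nWrite the next scene.<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "Sure thing. She leaned in and"   # the prefill: the model has to continue this sentence
)

r = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": prompt, "n_predict": 200, "temperature": 0.9},
)
print(r.json()["content"])
[/code]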
>>
>I'm sorry but rape bad
Chat is awful but at least using these models in notepad works better.
>>
>>101545844
experiment
>>
>>101545851
>anon discovers prompting
>>
>>101545851
that's why I prefer base models if you're just doing autocomplete in the ooba notepad
those moralfagging interruptions never happen because a base model isn't trained to larp as an agent
>>
>>101544729
wtf? that means i shouldn't have been using 4o for anything other than image related tasks baka
>>
>>101545012
Holy shit you're making the model fight uphill with this garbage. You might as well make it think it's writing the response between sets of 12 push-ups.
>>
>>101545070
>NTR to throuple in less than 10 seconds.
>>
File: Untitled.png (470 KB, 720x1313)
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
https://arxiv.org/abs/2407.15891
>The memory and computational demands of Key-Value (KV) cache present significant challenges for deploying long-context language models. Previous approaches attempt to mitigate this issue by selectively dropping tokens, which irreversibly erases critical information that might be needed for future queries. In this paper, we propose a novel compression technique for KV cache that preserves all token information. Our investigation reveals that: i) Most attention heads primarily focus on the local context; ii) Only a few heads, denoted as retrieval heads, can essentially pay attention to all input tokens. These key observations motivate us to use separate caching strategy for attention heads. Therefore, we propose RazorAttention, a training-free KV cache compression algorithm, which maintains a full cache for these crucial retrieval heads and discards the remote tokens in non-retrieval heads. Furthermore, we introduce a novel mechanism involving a "compensation token" to further recover the information in the dropped tokens. Extensive evaluations across a diverse set of large language models (LLMs) demonstrate that RazorAttention achieves a reduction in KV cache size by over 70% without noticeable impacts on performance. Additionally, RazorAttention is compatible with FlashAttention, rendering it an efficient and plug-and-play solution that enhances LLM inference efficiency without overhead or retraining of the original model.
From Huawei (but not specifically their AI Lab). Only some pseudo-code and a theorem are included in the paper. Rough paper but might be useful.
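Toy sketch of the cache policy the abstract describes, nothing more: retrieval heads keep the full KV cache, every other head keeps only a recent window plus one "compensation token", which I naively approximate here as the mean of the dropped entries (the paper's actual construction, and how heads get classified, aren't in the abstract, so treat both as assumptions):
[code]
import numpy as np

def compress_kv(k, v, retrieval_heads, window=128):
    """k, v: [n_heads, seq_len, head_dim] cached keys/values; returns ragged per-head lists."""
    n_heads, seq_len, _ = k.shape
    out_k, out_v = [], []
    for h in range(n_heads):
        if h in retrieval_heads or seq_len <= window:
            out_k.append(k[h]); out_v.append(v[h])                 # retrieval head: keep all
        else:
            comp_k = k[h, :-window].mean(axis=0, keepdims=True)    # crude compensation token
            comp_v = v[h, :-window].mean(axis=0, keepdims=True)
            out_k.append(np.concatenate([comp_k, k[h, -window:]]))
            out_v.append(np.concatenate([comp_v, v[h, -window:]]))
    return out_k, out_v
[/code]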
>>
File: 1721798212938.png (927 KB, 1294x1378)
>need 8+ top-shelf consumer cards to run llama3 405b and potential future derivatives in a sane amount of time
>AMD not making an 8900XTX because no money in it, and Nvidia 5000-series will still only have 24GB
holy fuck, local really is dead until this memory bottleneck is resolved
>>
>>101544109
mythomax never came for l3 we live in the worst timeline
>>
>>101546019
We just need someone to optimize the pipeline from gpu to ram.
>>
>>101546019
>>need 8+ top-shelf consumer cards
>>8+
need double that unless you plan to run quantized to lobotomation, but that defeats the whole point
>>
>>101546023
>we live in the worst timeline
Could have just said that anon.
>>
ENTER
>>
>>101546019
>need
no, you dont NEEED dozens of t/s speed for most use cases, coomer
>>
>>101546046
At this point, we might as well just get secondary market Teslas
Been looking at getting MI100s, but I don't know if they're firmware headaches similar to the mi25s that I have, not to mention, I don't have a circuit to power it
>>
>>101546058
>keep improving
more like keep feeding the bench data more and more into the models
>>
>>101546094
(You)
>>
File: 17.png (137 KB, 666x670)
>>101546058
>>101546105
>>
>very FIRST test output of 405B on openrouter contains the phrase "a shiver ran down his spine"
kek
that stuff doesn't actually bother me a lot like it does some people here, funny though
>>
>>101546181
does this say more about the plateauing of models, or the inability for confused youths and weirdos erping on public forums over the past 25 years to write good, varied smut
>>
>>101546216
The second one. Since the writing quality can be improved through context alone.
>>
>>101546239
>writing quality can be improved through context alone.
Which is why higher context mogs everything else.
>>
>>101546181
All models have their style though. GptSlop is just the most common in local.
Also the longer you write with it the worse it gets until it all falls apart.
I'm sure we need some sort of architectural change to fix many problems.
Sonnet 3.5 is the first model that feels a bit different. Sad we cant see what changed.
>>
>>101546252
True and based pilled.
>>
>>101546094
No one in their right mind would use LLM output directly, agentic/planning/reasoning eats speed.
>>
>>101546252
I wonder if in future finetunes will be rendered unnecessary by context expanding so much that you can just drop in millions of tokens of prefill on each prompt to teach the model what it is you want it to know.
>>
>>101546216
the real stuff is sovlful in its own way
https://bluemoonroleplaying.com/community/threads/142013/
>>
>>101546499
now that there's a better understanding around things like classifiers and cleaning pipelines, it might be worthwhile to revisit old datasets like teatime and bluemoon
>>
>>101546566
>>101546566
>>101546566
>>
>>101546499
>RP session that started in 2022 and continues until now
I kneel...
>>
>>101543813
No.

>>101544752
You do not need any special features to transfer data between GPUs.
And the feature that would help for multi GPU is peer access which needs either NVLink or Linux and also modded drivers for 4090s.
>>
>>101544929
you should do the dolphin too


