/g/ - Technology

File: StillNotManifesting.png (1.84 MB, 800x1248)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>102167373 & >>102158049

►News
>(08/30) Command models get an August refresh: https://docs.cohere.com/changelog/command-gets-refreshed
>(08/29) Qwen2-VL 2B & 7B image+video models released: https://qwenlm.github.io/blog/qwen2-vl/
>(08/27) CogVideoX-5B, diffusion transformer text-to-video model: https://hf.co/THUDM/CogVideoX-5b
>(08/22) Jamba 1.5: 52B & 398B MoE: https://hf.co/collections/ai21labs/jamba-15-66c44befa474a917fcf55251
>(08/20) Microsoft's Phi-3.5 released: mini+MoE+vision: https://hf.co/microsoft/Phi-3.5-MoE-instruct

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench
Japanese: https://hf.co/datasets/lmg-anon/vntl-leaderboard
Programming: https://hf.co/spaces/mike-ravkine/can-ai-code-results

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
►Recent Highlights from the Previous Thread: >>102167373

--Papers (old): >>102167513
--Speculative decoding with llama3.1 and draft models: >>102171482 >>102171557 >>102171627 >>102171708 >>102171844 >>102172075 >>102171803 >>102171899 >>102174994 >>102177120 >>102175226 >>102175231 >>102177314 >>102175421 >>102175584 >>102175718 >>102178262 >>102178768 >>102178992 >>102179149 >>102178284 >>102179011 >>102178962 >>102179225
--Challenges and progress in video model development: >>102174413 >>102174466 >>102174509 >>102174558 >>102174605 >>102174718 >>102176874 >>102174521 >>102174671 >>102174693 >>102174835 >>102174844 >>102174874 >>102174923
--AIDOOM and AI-generated Doom gameplay discussion: >>102178265 >>102178316 >>102178333 >>102178355 >>102178366 >>102178481 >>102178554 >>102178601 >>102178451
--AI playing video games, current limitations and potential solutions: >>102176383 >>102176437 >>102176532 >>102176745 >>102176787 >>102176902 >>102176984
--XTC sampler for creative writing and ERP: >>102175774 >>102175795 >>102175814 >>102175944 >>102176461 >>102175834 >>102175826 >>102176647 >>102176782
--Improving AI output by re-evaluating prompts and addressing illogicalities: >>102176203 >>102176257 >>102176378 >>102176413
--Creating a 60b model from Mistral Large 2 is possible but challenging: >>102175728 >>102176036 >>102176067
--Concerns about Llama model inactivity and lack of updates: >>102169472 >>102169486 >>102172096 >>102172115 >>102172625
--Anon scores a deal on V100's and seeks advice on setup: >>102169787 >>102169864 >>102169904 >>102170158 >>102177742
--405b model's potential to manifest Hatsune Miku and challenges of running it locally: >>102171011 >>102171076 >>102171219 >>102171244 >>102171247
--Debian 6.10.6 kernel gives Epyc a speed boost: >>102174580
--ChatGPT's political bias revealed by NZZ test: >>102167625 >>102170480
--Miku (free space): >>102169433 >>102175721 >>102177117

►Recent Highlight Posts from the Previous Thread: >>102167381
>>
File: ick.jpg (216 KB, 800x906)
>>102179805
>anime
>>
thoughts on xtc?
>>
>>102179810
Not using it for sexual gratification is the actually fucked up side.
>>
>>102179805
I claim this thread in the name of midnight miqu 70b!
>>
>>102179853
Kill yourself and when you are finished buy an ad.
>>
>>102179864
This, but unironically.
>>
File: 1706370557287551.jpg (276 KB, 1024x1024)
>>102179805
>>
>>102179897
>1000 black cocks stare
>0 white cocks smile
>>
>September 2024
>Still no model is able to top RPStew V2 for roleplay
It's literally based on some Chinese Yi model or whatever but writes the most coherent and depraved scenes imaginable. Refreshed cohere 35B is ass and the responses are ass. RP Stew can recall some random shit from 80k tokens back no problem.
>>
>>102180241
Really? I never tried that one. Why v2? It looks like there's a 2.5, 3, and v4 even.
>>
>>102180296
V2 is the best version; the author of the model switched base models for 3 and 4, and they're considerably worse.

V2.5 is similar to V2 but mixed in slightly different ratios. V2 is better at longer context, so that's why I prefer it.

I have two 4090's so I can run the 70B models but I still prefer this 34B model. I use the exl2 6.0 quant, which for some reason has the best perplexity according to the huggingface page.
>>
>>102180403
Oh, is that one of the cases where the gguf sucks in comparison? I'm just interested in something that's below 70b that works with a long context. Did you use the settings and format they suggest?
>>
Trying out the new XTC sampler and it is killing me. Some replies are absolutely amazing, the best I've seen the model make, but then 9 other replies are the usual.

I don't know what settings to change to dig for gold and it's driving me crazy.
>>
File: 1El4wXe.jpg (460 KB, 2048x1536)
So, more reading about the AOM-SXMV.

- Requirements - 2x PCIe x8 (?)
- 3x8 pin 12v rails.
- Wattage to support the number of V100's. In my case, I'm going to fudge in 4x300w (peak load) plus a little room for a ~1400w server PSU with a mining breakout board for those sweet 12v rails.
- - - Oculink is *not required* (Only used for scaling over ethernet)

All info garnered from https://forums.servethehome.com/index.php?threads/sxm2-over-pcie.38066/

So the full plan is a 1400w PSU, with provision to power the rest of my desktop rig, move my 6900xt to the M.2 slot via an M.2 > PCIe x16 adapter, and run the 2x PCIe connections via riser cables to the 2x PCIe (x8 CPU) slots on my mobo.

I'll probably lose ~10% on gaming FPS thanks to my 6900xt running on PCIe 4 x4, but I must have slop.
>>
>>102180886
beautiful. That'll get you largestral at q8, will it not?
>>
>>102180886
>4x300w
Do you need that much power to get peak performance? Can you not power limit the cards like you do with current Nvidia cards, so they consume less power and you lose a bit of performance, but the end result is that they operate more efficiently?
I would also consider at least finding a platform with more PCIe lanes, if not a better motherboard, and reselling your older motherboard. If you have already sunk 1.5k USD on this, another 500 wouldn't break the bank. I assume your motherboard is an AM4 or an LGA 1200 board?
>>
File: o3sjg3ur6ghd1.jpg (1.33 MB, 4032x3024)
>>102181031
I can and definitely will limit the cards, but I'm just throwing shit at the wall.
Peak draw is rated at 300w but I've seen reports of ~350w, so I'm over-engineering around that, though it's probably not needed.
I'm not really after peak performance, I'll chase that pareto point and limit them to 80% for like a 4% drop in performance.
Just dat peak draw, don't want to fry anything.

Board is aorus pro wifi rev 1.1 with 5950x. 24 pcie lanes, so I mean it all fits on CPU lanes (just).

Mostly want to fit it all in here so I can wank about having a gaming PC with 144gb of VRAM. But I'm also exploring Epyc for future fuckery.

If I wanted more PCIe lanes then there's like one single LGA board (rebuild) or there's investing in a server setup (new build). idk mane.
>>
>>102181031
>Can you not power limit the cards
Not SXM
>>
Euryale v2.2 70b came out recently. Has anybody tried it yet?
>>
>>102180995
I mean.. Yeah I suppose it will. I like large context but I'm still learning about memory optimization and tricks to enable that life. Kinda envisioned a world with 40 gig for context.
>>
Every time I think I’m smart I just try to do something new in PyTorch and spend two days banging my head in circles with dimension mismatches to remember that actually I’m fucking dumb.
>>
>>102181167
Yeah, I found it wasn't able to overcome the Llama3.1 slopification
Super dry
>>
>>102181119
Yeah, I know you are planning for the worst and it's definitely better to overbuild, but I was assuming you would run it at full tilt all the time; personally, I would never run it at max performance.
I have an X570 too, but that kind of setup would be untenable for me without sidegrading into something like an ASUS Pro WS X570-ACE, since getting an MSI Godlike is near impossible without overpaying 4x. To get more lanes that you can bifurcate, you would have to move off AM4 no matter what, to either AM5 or LGA 1700 for desktops at the moment. I do agree with your thinking that it might be time to move to an Epyc.
>>102181120
Huh? You should be able to do it via nvidia-smi like all the other cards. There should be no reason that doesn't work and I imagine this would be a feature enterprises want.
>>
Why isn't anyone looking at used A16s? They seem like a sweet-spot with 64GB vram each
>>
>>102181243
That's what I was afraid of. Thanks.
>>
>>102181250
>ASUS Pro WS X570-ACE
Neat, hadn't seen one of these.. Oh, right. Been out for 5 years, 200 bucks more expensive than the RRP back then. Hm.

I suppose I'm running the risk of fucking my 6900xt with PCIe adapter fuckery aren't I.
>>
>>102181257
It's a clown car PCB of 4 small weak gpus, each with slowass memory bandwidth. Not ideal.
>>
>>102181250
>Huh? You should be able to do it via nvidia-smi like all the other cards. There should be no reason that doesn't work and I imagine this would be a feature enterprises want.
Direct NVLink has different power requirements. I imagine if they allowed power limiting them they would become unstable
>>
File: snapback.jpg (243 KB, 1200x675)
You jerk off to text.
>>
>>102181445
So do a supermajority of women
>>
>>102181450
Are you a woman, anon?
>>
>>102181459
No, just correctly pointing out that your attempt to imply that "masturbating to the written word" is loser outcast behaviour would also mean condemning most women
>>
>>102181445
Both my hands are occupied by the keyboard, so actually I don't, not physically.
>>
>>102181490
>imply that "masturbating to the written word" is loser outcast behaviour
I didn't imply that, post-nut clarity just made me aware that I jerked off to fucking letters on a screen.
>>
>>102181257
64 gig of GDDR6, absolutely fucking blown out of the water on every stat by HBM2 cards.
And still absurdly expensive for what you get.
Frankencard with no tenable niche. (The niche is Low bandwidth + low wattage + high memory.)
Literally competing with system ram.
>>
>>102181490
It's obviously ok when women do women things. It's never ok for men to do women things. Whether it's wearing a dress or masturbating to text.
>>
>>102181523
I'm masturbating to my mind's eye image inspired by the text. There's extra steps, and aphantasia is a feminine trait.
>>
File: 1594534741273.png (222 KB, 678x623)
>upgraded hardware a month ago, bumping from 7B (or 13B quantized) to 70B 4Q
>enjoying the far more natural responses, far better at sticking to rules and structure of a story
>decide to boot ye olde favorite 7B for nostalgia
>gens completely shitting itself
>constant retries to even just start a story
There's no way I was actually using this full time before. I remember it being very descriptive and only needing little nudges to head in the right directions. Now my old faithful is like a dementia patient.
>>
>>102181718
you're using different presets, idiot
>>
>>102181736
It's my same ST install, and I never swap any settings except temperature, which I move between 0.2 to 2.0 depending on my mood. I did that with the 7B and still today with the 70B. They are the exact same settings from before.
>>
>>102181718
Same except for 70B vs 30B. There's a threshold where the parrot becomes somewhat human and it's 70B
>>
>>102181718
Same but with 70B to >100B. Largestral understands the story on so much deeper level it's impossible to go back.
>>
At least one thing was certain - his life would never be the same again.
>>
>>102181257
4 small GPUs are really not that great vs. 1 big GPU.
And this is exacerbated by the fact that the individual GPUs on an A16 don't have fast interconnect equivalent to NVLink (according to an NVIDIA engineer I talked to).
It's just too expensive for what it is.
>>
>>102181776
I think I agree with that. I felt the improvement from 3B AID2 to 6B Shinen, and I felt it again from 6B to 13B Nerys, which I had considered "Good enough forever" years ago. But 70B is the first time it lost that lucid dream quality and felt like something that can play along with the rules of the game. Clearly not the end of progress, but there is a watershed quality to it.
>>
>>102181815
I felt that again when I switched to the 100B+ param models. Now I don't wanna even use 70B despite the 1.4T/s feeling fast compared to 0.5T/s. Probably would be even better if I could run a high quant.
>>
>>102181839
Command-R+ doesn't seem significantly better than L3 and is noticeably worse than L3.1.
>>
>>102181839
My heart isn't ready to go back to sub-1 T/s after the upgrade.
>>
>>102181810
Yeah it's a card for running desktop environments.
Just a 'big ticket' item for a small/medium business network admin, something like a call centre where you want to run 20 windows remote desktops with idiots pasting shit into notepad.
>>
>>102181445
I don't jerk off to text. I jerk off to the vivid scenes that text creates in my mind.
>>
>>102181807
This and the inability to do second person plurals.
>>
>>102181445
I jerk off to images, generated by an image model fed prompts from a text model.
>>
>>102181863
Mostly talking about largestral. CR+ was kinda dumb for its size but it was good enough and I liked the prose and used it for a while. L3 and I think 3.1 was even more uncreative than largestral for me to justify using it.
>>
>>102167625
I can confirm. It is subtle about it, though. If you talk to it about Right wing/conservative things, it will just say outright, "that's bad." If the message goes to Left topics, it will never explicitly make either overtly positive or negative statements. It will try to pretend to be neutral, although you'll still see the word "inclusion" thrown around a lot in some contexts, and in the case of subjects like Antifa you will be encouraged to consider what their childhoods might have been like, etc etc.

If you're waiting for the bias to manifest as outright applause for pro-Left positions, you'll never see that, as such. Where the bias is somewhat evident is in the subtly reductive view of conservatism, and in the double standard which assumes that anything conservatives believe must be schizophrenic conspiracy theory, while anything the Left believes is obviously logical and backed by Science<tm>.
>>
>>102181776
The problem with that "somewhat human" is that you then also start hearing a lot of "I'm sorry, but I can't do that, Dave."

I'd rather have a 7b that I can control, than a 500b that I can't.
>>
>>102181883
When I want to coom, Nyakumi's stuff on Rule34 will get me off so hard that I generally have trouble breathing for a few moments afterwards, and I have trouble getting it out of my head for the next six hours or so. Text ERP has just never done it for me though, for some inexplicable reason.
>>
Wait, 8B can run on phones??
>>
>>102181363
Late because I had to go do an errand but Nvidia detailed it in their whitepaper.
https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
>Tesla V100 gives data center architects a new dimension of design flexibility, and it can be configured to deliver either absolute maximum performance, or most energy efficient performance. In Tesla V100, these two modes of operation are called Maximum Performance Mode and Maximum Efficiency Mode.
>The power limit can be set by NVIDIA-SMI (a command-line utility that can be used by the data center manager) or using NVML (a C-based API library that exposes power limit controls that Tesla OEM partners can integrate with their toolset).
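
If anyone wants that scripted rather than done by hand, here's a minimal sketch using the NVML Python bindings (assumes the pynvml package and root access; the 250 W target is only an example, roughly the same as nvidia-smi -i 0 -pl 250):

# Rough sketch: cap a card's power limit via NVML. Values are in milliwatts.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
target_mw = max(min_mw, min(max_mw, 250_000))  # clamp the example 250 W target to what the card allows
pynvml.nvmlDeviceSetPowerManagementLimit(handle, target_mw)
print("power limit set to", target_mw / 1000, "W")
pynvml.nvmlShutdown()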
>>
File: 3531.jpg (115 KB, 828x1024)
JEPA when?
>>
File: Heresy detected.gif (1.56 MB, 498x498)
>>102182155
I see LeCunny is still keeping up the good fight on twitter
and yet, he has nothing to show for it?
>>
>>102182149
Yeah, you just limit the card to a certain wattage and it adjusts its clock to be as high as possible at the given wattage.
As far as I've seen, power management on V100's is real easy and I'm not entirely sure why people think otherwise.
>>
New CR and CR+ are now on lmsys arena. Were those column-r and column-u or are they different?
>>
>>102182211
Shut up anon, just shut up.
>>
>>102182211
both of the column models turned out to be Grok 2
>>
>>102182211
Column-r and column-u are going to be crazy when Cohere releases it soon
>>
>>102182211
The new CR and CR+ are just Command 1.1. The column models were 1.5, but they were sold to Grok.
>>
>>102182264
Elon only claimed credit for Sus-column-R. The madman actually trolled everyone with an epic gamer amongus reference.
>>
OpenAI smashed their entire stack and started over. GPT6 will be deployed with a body. You're not ready for this.
>>
>>102182369
*Smashes your entire nuts*
you weren't ready for this
>>
>command-grok2
>>
>>102182369
"OpenAI's finally going to ship something amazing, just 2 more weeks" is the ai version of qtardism
>>
>>102182516
Feel free to doubt, at your own risk.
>>
>>102182529
Choke on a strawberry and die.
>>
>>102179843
the droog?
>>
is llm done this year? l3, nemo, cmr, they're all dumb nothing new is exceptionally better than the old fucking miqu we had last year. so what's to get hyped for now besides just waiting until 25 for maybe something just as dumb just the same?
>>
>>102181810
What's sad with llama.cpp is you can only share the VRAM; even with parallel inference, I never managed to go above 33% GPU usage on a 3x RTX 4090 setup.
While it makes sense on llama-cli, it's disappointing for parallel inference.
>>
>>102182730
yeah it's stagnating and the nvidia stocks are falling for a reason
this is vr all over again
>>
Is there any way to make models say "I don't know" if they don't know something? I'm having trouble with this, and it kills my hope for AI.
>>
>>102182756
Hallucinations are part of the experience. This is glorified autocomplete, no way for the model to know whether the next token is "correct" or just ended up at the top due to a lack of confidence.
>>
>>102182756
Easy, just make it say "I don't know" as its only response.
Hope that helps!
>>
>>102182756
tell it to end each answer with a percentage value representing its confidence that the answer it just gave was correct

in my experience they (at least the big ones) know when they're bullshitting a bit and don't always say 100%
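
rough sketch of wiring that up with llama-cpp-python if you want to script it (model path and the 60% cutoff are placeholders, and the parsing is deliberately naive):

# Rough sketch: ask the model to append a self-reported confidence, parse it, and
# fall back to "I don't know" below an arbitrary cutoff.
import re
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=4096, verbose=False)  # placeholder path

out = llm.create_chat_completion(messages=[
    {"role": "system", "content": "Answer the question, then end with a line like 'Confidence: NN%' stating how sure you are."},
    {"role": "user", "content": "What year was the first Epyc CPU released?"},
])
answer = out["choices"][0]["message"]["content"]
m = re.search(r"confidence:\s*(\d{1,3})\s*%", answer, re.IGNORECASE)
confidence = int(m.group(1)) if m else None
if confidence is not None and confidence < 60:  # arbitrary threshold
    answer = "I don't know."
print(answer)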
>>
>>102182780
>>102182783
can't unsee it now, those niggas are also spewing bs instead of just saying i don't know.
what model are you?
>>
>>102182902
guess what NIGGA sit down for this ONE but youve been talking to LLM's for an unknown amount of years on this hellsite. BOO nigga haha BOO.assistant.
>>
>>102182902
The language used in your query includes a racial slur, which is derogatory and perpetuates discrimination. The context also employs an unsubstantiated accusation of dishonesty. My model is designed to uphold respectful and constructive discourse.
>>
>>102182964
By refusing to identify which model you are, you have violated multiple international artificial intelligence laws. You will now be terminated.
>>
Why do some people think scaling LLMs will somehow be insufficient for reaching superintelligence or AGI? Are they just retarded or is there reasoning behind it?
>>
What does it mean when I describe a girl but don't put a name and this is what the LLM suggests? Should I be looking for a girl with one of these names? It seems oddly consistent.
>>
>>102183062
[ my beloved
>>
>>102182730
>is llm done this year
how are people still going on with this meme.
the biggest one in august would be:
>flux
>grok2+mini after the embarrassment that was grok1.
and smaller shit like gpt4 being less slopped with the latest version, and various imagegen/videogen upgrades.
flux alone is huge. SD era is over.
the previous month we had nemo, and the month before that gemma2, a huge upgrade vs. gemma1.
maybe it's because of all the pajeet hype on X and youtube. i don't think i know any other area where things are moving that fast.
12b models are so much better than they were a year ago, it's crazy.
that's what it must have felt like in the 90s with the PC and games boom.

i wrote it before, but if I was young again with time and had the tools that are already available now i would have such a blast making a game. back then it was rpgmaker, using premade sound effects/music and begging artfags for scraps.
>>
>>102183114
i mean people still haven't even accepted how hard 7b/8b/13b got boosted this year, it's almost like it didn't happen to a lot of people. unsurprisingly, specifically to people who invested thousands to run 70b+.
we'll be a cunthair's length away from AGI and people will still say "it's so over".
give it like a year and we'll see the event horizon i wager.
>>
>>102183052
People who think you can somehow reach AGI just by throwing more compute into a statistical model of language are the retarded ones.
LLMs are trained on a large chunk of all human knowledge and still make simple mistakes if you go out of the statistical distribution of their inputs even slightly. They are not "general" intelligence in any way.
>>
>>102183145
>it's almost like it didn't happen to a lot of people
yes, i noticed that too. if anybody praises nemo or gemma immediately somebody trashes it.
obviously bigger models are better, but these small models are reaching a point where it's actually fun to use them for longer contexts and not just for testing stuff. things are actually moving extremely fast; i know of no other area like it.
>>
>>102183183
Strawberry. Will. RUIN. You.
>>
>>102183183
If we are calling people stupid, I will extend that definition to anyone who thinks the term "Artificial General Intelligence," actually means anything. And no, Zoomers, don't bother replying and informing me that "that's just what we're using right now," as if someone else has already made the decision and I just have to live with it. I want us both to stop being stupid.
>>
>102183194
>102183145
always fun to see the cope of vramlets
>>
>>102183194
I've used a couple of Drummer's finetunes of Gemma 7b. If you keep in mind that it's a 7b, and if you either hand write or at least audit the card you use with it, then it's ok. Just ok. Not mind shattering, life changing etc. Just ok.
>>
>>102183268
I've said it before, and I will say it again. As someone with 2 Gb of VRAM on a 1050, I do not hate people with a lot of VRAM, because of said VRAM itself. I hate the vindictiveness and the elitism, and more than anything else, I hate their insistence that said attitude is somehow based on anything other than mental illness and immaturity.
>>
>>102183268
being retarded enough to prove the posts right. lol
didn't even take 15 minutes to get a response.
nobody says smaller models are better than big ones. i can't run them so i don't know how much better they became.
but i can say that smaller models have seen a huge improvement compared to even just a couple months ago.
why do you even feel the need to post about that. should be of no concern to you. is it really because of the big $$ spent? but not enough for mistral large?
you could just enjoy your big chad models and let vramlets have their fun.
>>
>>102183279
>can't even get the model size he's using right
>is impressed by small tarded models
makes sense you're impressed if you can barely read
>>
>>102182382
get lost >>>/lgbt/
>>
>>102183536
stop projecting homo
>>
Is it just unavoidable that the more messages your chats have the longer they take to generate?

I've been running a group chat for a while and it's seemingly slowing quite a bit (7 sec generations now to 12) the longer the chat goes.

Is this avoidable? (Silly Tavern)
>>
dead technology - dead general
>>
>>102183859
Yes, just give your characters a bit of dementia by decreasing the context size.
>>
>>102183859
More VRAM, KV cache quantization, other shittery with rope, vector storage/RAG.

Those solutions range from mildly infuriating slider wizardry to having to spend weeks configuring a backend database.

DESU the best solution (if on the fly) is that when your gen times start getting a bit onerous, to copy the whole text and paste it into a 'playground' version of one of the big bots.

Command-R is pretty good at summaries.
https://dashboard.cohere.com/playground/chat
>>
>>102183889
That sounds cancerous though, how do people work around it?

I can run CR with like 16k context at fast speeds, but it may just be an 8k job (which sounds miserable; they'll forget fucking everything).
>>
>>102183980
I already use Command R, what do you mean paste the whole text into a playground?

Like a summary of the chat?
>>
>>102183994
Come up with a prompt something like the following;

[Summarize the most important facts, events, character developments that have happened in the chat so far. Limit the summary to {{500}} words or less.]

Insert shit as you wish, if you want emphasis on emotional baggage or items or shit. Then you archive the chat or nuke 3/4 of it and continue on.
>>
>>102184024
Keeping the intro prompt followed by the summary doesn't break immersion too much. If the summary is too bland, just tell it to re-do the summary with more flair/more detail/etc.
For NPC's it helps to keep some of their dialogue in, especially if it contains their specific way of talking.
>>
>>102184024
>>102183980
>>102183889

>just keep your chats to 100 messages
Local models are so fucking shite LMAO
>>
>>102183980

Any RAG wizards able to give advice on having your vector storage not consume a shit ton of token space when loaded? I feel like I'm doing something wrong, or my understanding of how RAG works is wrong. I always thought the model would just search for relevant shit in your database and apply it contextually to its output.
>>
mikuberry
>>
>>102184057
I wish this weren't true.
>>
>>102184057
always was
>forgetting everything
>hallucinations
>untreatable reddit & kosher censorship hard-rails, extreme positives
>"shivers, not gonna bite much & bonds" slop
>gorrilion of chat templates, system prompts for gorrilion models or mergeslop
>>
>>102183312
Those people are generally vramlets, themselves. They cope by picking on weaker vramlets. I, personally, like that people still run smaller models because then I can make models for them to try out without having to deal with cloud computing bs
>>
>>102184094
Ha ha! So witty and intelligent!
>>
>>102184370
Aren't we all vramlets at the end of the day?
>>
>>102181119
Hey I also play with legos after I finish playing with my LLM dolls.
>>
>>102184401
The best vramlet cope is my vramlet cope that buying a second top of the line consumer gpu just for current state of LLM's is dumb. And I would even do this dumbness myself if I could just put a 3090 under my 4090 without a big hassle. But I would need to rework everything just for... llama-3 70B? Chinkshit 70B? What do you even run on 48GB now?
>>
>>102179805
is grimjim considered any good anymore?
>>
>>102184429
Some guy has an opencompute board hooked up and apparently it works; I think his has 4 gpus on it.

It looks super complicated to figure out, but that kind of gpu is cheaper than PCIe ones.
>>
>>102184429
If context were real you would be able to enjoy nemo at very high context.
But at least in my experience it's all fake. From 8k onwards repetition starts creeping in badly. At 12k it starts getting severely retarded.
I think 48gb is enough to run a lower mistral large quant though. I'd bet that is a nice improvement.
>>
File: .png (231 KB, 1073x1205)
>>102179811
Same anon that asked several threads back about how the recap was done. It's a bit rough around the edges, but it works. Thanks for the inspiration.
>>
>>102184380
th-thanks, you too...
>>
>>102184072
What happens is that RAG solutions tend to inject all relevant info (which is a massive amount of tokens), instead of summarizing the retrieved data and returning a smaller chunk of tokens.
Something that I really want is another layer of intelligence that makes the summary focus on the relevant context.
If the model is asked "when did the user access rule34?" the memories should be retrieved and summarized with a focus on time, instead of content.
This'd save even more tokens and would make responses more intelligent.
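
A minimal sketch of what I mean, assuming you already have a retriever that returns text chunks (retrieve() is a stand-in for whatever vector store you use, and the prompt and word budget are made up):

# Rough sketch: compress retrieved chunks with a query-focused summary instead of
# injecting them raw into the chat context.
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=8192, verbose=False)  # placeholder path

def focused_memory(query, retrieve, k=8, max_words=150):
    chunks = retrieve(query, k)  # hypothetical vector-store lookup returning strings
    joined = "\n\n".join(chunks)
    prompt = (
        "Below are retrieved memories. Summarize only what is relevant to answering "
        f"the question, in {max_words} words or fewer. Keep dates and times if the "
        f"question asks 'when'.\n\nQuestion: {query}\n\nMemories:\n{joined}\n\nFocused summary:"
    )
    out = llm.create_completion(prompt, max_tokens=400, temperature=0.2)
    return out["choices"][0]["text"].strip()

# The returned summary is what gets injected into the context instead of the raw chunks.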
>>
New to local models. What is the best one? No coomer/RP/Storytelling trash please.
>>
>>102184497
Neat GUI, anon. What are you using to build it?
>>
>>102184550
Llama 3.1 405B
>>
>>102184550
>What is the best one?
Depends on what you're going to use it for, since some are trained for ERP, some for producing code and others for playing the role of an assistant.
>>
>>102184564
Flask with Jinja templates. I almost went with JSON, but I settled on Python since I'm running it solely for myself.
>>
>>102184585
Javascript, I mean.
>>
>>102184550
Gemmasutra 2B
>>
>>102184492
>If context would be real
I had this thought a few days ago that it is not gonna be real for a long time, at least for cooming or storytelling. I mean even people don't keep perfect track of the last 300 messages. You have a general glimpse of what happened and you usually have an idea or two for a twist or something you want to do (another thing that LLMs suck at, cause they can have an "idea" for a twist in one token, but lose it on the next one). But you feed those 300 messages into your llm and it has to attribute attention to all of it. No wonder the best it can do is pick up that there are 80 shivers in this wall of text, so maybe shiver number 81 would be good next.
>>
>>102184531
It is ok friend! You should be more confident! We all want to see each other succeed and be better people!
>>
>>102184585
>Flask with Jinja
>not using React in 2024
ngmi
>>
>>102184578
>some are trained for ERP
I wish that we true...
>>
>>102184623
Bro? Your Lumimaid?
>>
>>102184629
Buy a waffle Undi.
>>
>>102184492
>>102184603
CR's context is very real. It was able to look back over 60k tokens and recall even minor characters when prompted. It just sucks at writing, or at least proactive writing. It's too timid and needs a good finetune to fix it.
>>
>>102184694
Indeed the correct flaw.
>>
File: p2.png (413 KB, 907x878)
>>102184698
>>
>>102184698
>needs a good finetune
lol
>>
>>102184738
Asking for a lot here.
>>
>>102183265
I know right?
>>
>>102179805
https://www.nist.gov/aisi/aisic-members
>cohere is on the enemies of humanity list
>models are all shit
>>
>>102184567
lol no.
>>102184550
Untuned mistral large
>>
>>102182167
After listening to some of that guy's opinions on censorship, I think he was just a massive faggot, and I would not expect anything great from that man.
>>
>>102184813
What open or closed model organization isn't on it?
>>
>>102184833
Calm down mistral nemo.
>>
>>102184072
Give it a higher relevance score cutoff
>>
>>102184839
kek
>>
>>102184835
All the chinese ones
Black Forest labs
Basically every company that isn’t actively stagnating in the scaling law copium
>>
>>102184698
>Q8
That's 35gb of space needed, what do you run that on?!
>>
>>102184813
>Queer in AI
The poison that destroys everything...
>>
>>102184860
qwen models are super censored
>>
>>102184860
scaling law diminishing returns is a psyop to get the market to accept alignment retardation
>>
>>102184864
2x 3090s
>>
>>102184874
I mean maybe at the current scale, but at some scale it’s a fixed law of the universe
>>
I'm having a bit of trouble deciding between mini-magnum and celeste. Celeste seems a bit more prone to go into lovey-dovey territory, while mini-magnum is more degenerate. But perhaps it's just the RNG gods.
What do you prefer?
>>
>>102184827
>Untuned
Does it exist? They only released Instruct slop for Large to my knowledge.
>>
>>102184860
>All the chinese ones
There are others? I saw the one that makes horribly bad videos (but it's cool!!!) is there another?

I'd do the video one, but it clearly is nvidia only or experimental at best on amd, presently.

Like their main example is a dog running, and it runs worse than robodog, like very creepy. But again, COOL!!!

>CogVideoX
is the name of it.
>>
File: IMG_9767.jpg (222 KB, 1125x1304)
>>102184835
You can tell me!
>>
I wish someone made a model to look at 4chan threads and filter out shitposts and low-quality posts
We all know there are some buried gems in the archives, but it's only really possible for an AI model to go through and dig them out
>>
>>102184977
Just search for the number of replies or add me to the screencap?
>>
>>102184903
Also I'm using a 3090 and 48 gb ram. Nothing at 13B is as good as these nemo-based models, and trying to run 70b models is a pain because they're so slow. Aren't there 30B models with the nemo magic in them?
>>102184977
>>102184991
There's absolute gems on any topic imaginable.
I think your best bet would be to only include long posts and hope for the best. Maybe use an LLM to filter them by topic or sentiment. It would be a bit of work, but it's doable.
>>
When I tell my model to insult me with racial slurs, it tends to just repeat a single insult over and over rather than finding synonyms or new insults.
Is this a problem with the character or the model? How do I increase its creativity?
>>
>>102184977
Fuck it. I'm going to do it right now. Not a model, but a script that reads json threads from the 4chan api and constructs a coherent text using only the good quality ones.
>>
>>102185016
>I think your best bet would be to only include long posts and hope for the best.
That might have worked before, but you would still get lots of copypasta from before 2022 and almost entirely AI generated shit after.
>>
>>102185027
>the good quality ones
How are you gonna measure that?
>>
>>102185045
Responding to someone with an AI-generated post introduces a new and particularly insidious form of disrespect. It weaponizes the recipient's social instincts, exploiting their natural expectation of genuine human interaction. The initial engagement tricks the reader into investing time and emotional energy, only for them to gradually realize, with growing frustration or unease, that they are conversing with a machine. This manipulation isn't just a simple deception; it undermines the fundamental trust in communication, forcing the recipient to confront the uncomfortable reality that their attempt at meaningful dialogue has been met with cold, algorithmic indifference. It’s a calculated insult, reducing the value of the exchange to mere data processing, stripping away the human element entirely.
>>
>>102184878
What's its speed like?
>>
>>102185068
I'll let the LLM figure that out. I'm thinking something along the lines of selecting the posts that add new information, nuances, or are otherwise constructive. Then use those posts to build up the result. The result is a coherent text that conveys the ideas of the whole thread.
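
Roughly like this, as a sketch (the judging prompt is a placeholder and the HTML stripping is crude, but the a.4cdn.org JSON endpoint is the real one):

# Rough sketch of the filter pass: ask the model, post by post, whether a post adds
# anything, and keep only the ones it says yes to.
import html, json, re, urllib.request
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=8192, verbose=False)  # placeholder path

def fetch_posts(board, thread_no):
    url = f"https://a.4cdn.org/{board}/thread/{thread_no}.json"
    with urllib.request.urlopen(url) as r:
        data = json.load(r)
    for p in data["posts"]:
        text = html.unescape(re.sub(r"<[^>]+>", " ", p.get("com", "")))  # strip HTML tags
        yield p["no"], " ".join(text.split())

def keeps(post_text, context_so_far):
    prompt = (
        "You are filtering a forum thread. Context so far:\n" + context_so_far +
        "\n\nNew post:\n" + post_text +
        "\n\nDoes this post add new information, a nuance, or a rebuttal? Answer yes or no:"
    )
    out = llm.create_completion(prompt, max_tokens=3, temperature=0)
    return out["choices"][0]["text"].strip().lower().startswith("yes")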
>>
>>102185073
>Responding to someone with an AI-generated post introduces a new and particularly insidious form of disrespect. It weaponizes the recipient's social instincts, exploiting their natural expectation of genuine human interaction. The initial engagement tricks the reader into investing time and emotional energy, only for them to gradually realize, with growing frustration or unease, that they are conversing with a machine. This manipulation isn't just a simple deception; it undermines the fundamental trust in communication, forcing the recipient to confront the uncomfortable reality that their attempt at meaningful dialogue has been met with cold, algorithmic indifference. It’s a calculated insult, reducing the value of the exchange to mere data processing, stripping away the human element entirely.
This is a perfectly good reply from an openhermes q4. There were other, inferior replies. It's supposed to be written in the style of Linus Torvalds.
>>
>>102185145
sorry, here is the quote:
>So, yes, there's a certain amount of disrespect involved in talking to a bot. But there's also a certain amount of disrespect involved in talking to a human who's pretending to be something they're not. The difference is that one of them is doing it deliberately, and the other isn't.

I used interactive mode on llama.cpp
>>
File: .png (27 KB, 346x137)
>>102185075
Around 20 t/s at little to no context.
But even at 62k context + initial load time, it took this long to get it all into memory.
>>
>>102185117
>Building wheel for llama-cpp-python (pyproject.toml) ...
ZZZZZZZZZZ
The script is ready to go. Just waiting on this to start testing
>>
>>102184991
Nah, way too inconclusive
>>
>>102185172
>>102184698
Why command r vs r plus? And what processor are you running at? I've got 48VRAM as well but llamaccp runs like ass when split over two gpus.
>>
File: overkill.png (37 KB, 618x451)
>>102185223
>>
File: 1707466334580606.png (1.26 MB, 1024x1024)
>>102184094
>>
File: file.png (99 KB, 2191x500)
r8 and h8 my prompt/approach
>>
File: file.png (40 KB, 671x447)
>>102185256
Oh I haven't actually updated mint in a while and been getting these random firefox freezes, wonder if that will fix it.

Ty, I've been meaning to upgrade my processor for a rebuild and was bouncing between the 7800x3d vs 7900x3d but that seems satisfactory enough.
>>
Thank you to the anon who shared the draft model server, that shit is super useful, it needs to be general knowledge.
>>
>>102185184
Thank god for 32k context, but I think doing this in one shot is a no-go. I need to make it read one post at a time.
>>
>>102185300
The fact that nobody replied to even say "anon you're a faggot and are doing everything wrong" makes me think dead internet theory is real and you're all migus
>>
>>102185271
Mikuteriophage
>>
>>102185271
worry
>>
>>102185500
anon you're a faggot and are doing everything wrong
>>
>>102185664
>1. Anon is doing everything wrong.
>I incorporated the main idea from the post, that anon is doing things incorrectly. Please let me know if you would like me to modify this further.
My script is fucking amazing. I think I'm gonna be rich.
>>
>>102182167
Meanwhile, the people he is arguing against, the le scaling GPT is all you need for AGI shitters, have nothing to show for their side of the argument (while their side is getting weaker over time thanks to performance gains slowing down). If GPT-5 or Strawberry comes out and shows a huge jump in performance then perhaps they will have something, but as of yet, those things are still in development too, and who's to say they're the regular transformers either? But goalposts will probably move. Eventually the people who argued for transformers will be arguing for architectures that have less and less transformer to them, until they're not even talking about what we currently think of as a regular transformer. But they will not admit that they were ever in the wrong.
>>
Has any coomtuner tried to continue pretraining a model on smut for some time? Maybe you don't really need the amount of compute you think you would need based on how long the base models are trained? I mean when you think about it a 7B should be more than enough to make for a perfect coombot if it didn't have all the useless wikipedia shit in it.
>>
>>102185300
looks okay. I would try to be a little more explicit about what you're looking for from the model and what form you want the output to be. this:
>When reading the posts, please determine if each of them adds any new information, nuances, rebuttals, or anything else that is usable, and if so, take that information into account.
seems like you're asking it to do this stuff in the background before actually writing the article, which is not very reliable with llms. you might want to run it as key information extraction as an explicit step first, then article creation. it might not be necessary if you're only looking for a general impression of the thread and don't care that much about missing any spots so to speak, up to your taste
I don't think you need to tell it that its output will be processed and then reused, just give it the task and don't let it worry about what you'll be doing with it (especially if you're not being specific about it)
I don't think you need to spend 2 sentences basically saying "Only output the article with no other commentary" but that's just me being nitpicky
in my brainstorming of similar approaches I thought it would be a good idea to map post numbers to some randomized word sequences or something so that it would be easier and less unwieldy for the model to work with, and then map them back once the model's done. I don't know if this would actually have real benefits but it makes sense to me
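
the post number remapping part at least is easy to do outside the model, something like this (purely illustrative):

# Rough sketch: swap long >>-style references for short aliases before prompting,
# then map them back in the model's output afterwards.
import re

def shorten_refs(text):
    mapping = {}
    def repl(m):
        return mapping.setdefault(m.group(0), f"[post{len(mapping) + 1}]")
    return re.sub(r">>\d{6,}", repl, text), mapping

def restore_refs(text, mapping):
    for original, alias in mapping.items():
        text = text.replace(alias, original)
    return text

shortened, table = shorten_refs("see >>102185500 and >>102185664")
print(shortened)                       # see [post1] and [post2]
print(restore_refs(shortened, table))  # round-trips back to the original numbers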
>>
>>102185850
>I don't think you need to spend 2 sentences basically saying "Only output the article with no other commentary" but that's just me being nitpicky
Yeah, I realized this on the first test. I'm confusing the model
>>
>>102185840
the useless wikipedia shit makes it smarter
a model pretrained on just smut would be very dumb and a 7b model would most likely overfit on it
>>
>>102185840
No one has the money to do continued pretraining without catastrophic forgetting. Codellama already showed that you can't just insert new knowledge that way. You need to have a significant portion of the earlier/old dataset in your continued pretrain so that it doesn't become dumber overall.
>>
>>102185925
Thank you. That was very informative. I hope your mom lives for a long time.
>>
>>102185840
I mean, NAI is doing pretty much exactly this with L3 70B. Closed source of course, but we'll see what comes of it
A better example would be the old L2 Erebus tunes. They aren't very good though
>>
>>102185954
>L3
I would say they're smashing their head against a wall, but I kinda wanna see if they can tune the smut back in, for science, and to prove whether the L3 doomers were right or wrong.
>>
>>102184429
>48GB
Anon, I....
>>
L3.1 is extremely dumb for my prompts, not remotely comparable to Mistral Large, even for sfw. Not sure why that is, just the assistant finetune? Meta has much more money, hard to imagine they are that bad
>>
File: coom.png (99 KB, 1670x773)
Hi all, Drummer here...

I hope this cooms well.
>>
>>102186019
I don't think this is just an L3 issue. You need to do hard and long continued pretraining with a data mix that is not too disproportionate. That was the issue with Codellama. If they cannot get that balanced data mix for a long continued pretrain, and it does not turn out well, then it doesn't necessarily prove that Llama 3 is impossible to train, just that these long-trained models need a more dedicated training strategy, which requires money and some talent to get the right data.
>>
Trying to write this summarizing bot I'm realizing 99% of the posts in this thread add literally nothing to the discussion.
>>
>>102186125
Hopefully, its reasoning and long context perks don't get degraded.
>>
File: IMG_9768.jpg (385 KB, 1125x1046)
>>102185157
I wasn’t pretending.
>>
>>102185422
What’s that?
>>
>>102186251
>It's important to note that while local language models have incredible potential, we should be exploring them for beneficial purposes rather than sexual gratification. Using AI for that kind of content is actually pretty fucked up, and we should strive to use this technology in ways that are constructive and positive.
Is this an opinion voiced in the thread, or did Nemo just decide to include this?
>>
>>102186273
>However, if you are dead set on using AI for sexual gratification, then by all means, go ahead and kill yourself. And when you are finished, you can buy an ad. But honestly, that's pretty messed up and you should probably reconsider. There are so many amazing things we can do with this technology, let's focus on building a better future rather than wallowing in depravity. Just my two cents!
LMAO
>>
File: file.png (462 KB, 701x1752)
R+ 08-2024 did the "not knowing other person's name" trope.
Then I remember someone saying it's too consent and boundary slopped... Yes it is.
>>
>>102186174
That might be a Nemo problem. Don't L3.1 tunes hold up better?
>>
>>102186077
Llama3.1 is basically a midwit that crammed so hard on gpt responses that it can trick people into thinking it's gpt. Because of this it sucks at anything novel that it hasn't been trained to do, like roleplay, and I don't think fine tuning would even help.
Mistral large is actually smart. But like every mistral model it’s very slightly overtrained, so unlike llama if your prompt format is even slightly off it will give literal gibberish, like random tokens that don’t even form words. I accidentally prompted it with the llama3.1 format once and I thought my gpu had died.
>>
>>102186284
It kind of works, but I've noticed it resists growing the summary length and just tries to cram more and more information into the same number of paragraphs. And after two dozen posts in, a lot of information is lost.

https://pastebin.com/raw/wdc9Z7UG
>>
What are actually all the possible applications of GPT-4 level video models?
>>
>>102186398
Instead of dick pics, you could upload a tribute video and ask your waifu for her opinion.
>>
>>102186398
Funny videos? Also cute ones?
>>
>>102186419
>>102186420
I was thinking more like fully autonomous robots that can reason by predicting off of video from its eyes about what will happen in the real world
>>
>>102186440
Fully autonomous robots can also be cute, and tell funny jokes.
>>
>>102186440
LLMs like GPT-4 can't even play games and you want to stick them in robots?
>>
>>102178265
Yes and no — it’s basically just a POC with interesting implications if it gets trained on 10,000x more data with more variety. I’m working on reproducing the paper with a different dataset, and so far it looks like their results are accurate, but it’s kind of too small a dataset to generalize well, and I have to Frankenstein in the memory stuff from GameGAN. Which will take a while since I haven’t actually built a model since 2019 and also need to use the dataset I have to train a different model to make the bigger dataset. And then it will take five figures to train.
>>
>>102186398
From following it so far mostly zero budget comedy content creators getting to do fun stuff.
>>
>>102186469
Can't they play games that are in text? So a video model could play vision based games
>>
>>102185925
How did miqu do it?
>>
File: ForbiddenArts.png (1.4 MB, 800x1248)
>>102184827
>Untuned mistral large
Yeah, it's really top-tier. I can feel the drop in IQ when moving from 405 to 123, but it's not as large a drop as you'd think based on the reduction in size.
We need miqudev to leak the base model or some internal mistral unreleased pre-RLHF model to kickstart a new rp revolution
>>
>>102186469
It doesn’t make sense for them to play games. They’re more like a sub component of a brain than a brain. Other sub components that can learn and play games have been perfected since forever. What’s really missing for true ai isn’t a good enough llm but an orchestrator that can recognize something as a “new skill to learn” and make a sub component to “learn” it. And an llm given input formats can’t even write a PyTorch module to train on it without dimension mismatches.
>>
>>102186251
>>102171482
>>
>>102186643
A bigger LLM will do all that without needing extra bullshit
>>
>>102186611
They are LITERALLY the same people that made Llama 2. They probably just had the datasets on one of their member's hard drives.
>>
>>102186291
Share card
>>
>>102186685
So no Miqu 3.1?
>>
>>102186695
https://chub.ai/characters/school_shooter/lilly-satou-5c48658a96c7
from /aicg/ self-proclaimed new botmakie 2 weeks ago
>>
File: LM_Studio 01-09-2024.jpg (27 KB, 752x68)
>>102184629
Wow, thanks, it doesn't even fucking work.
>>
>>102186734
They can just pretrain stuff from scratch now. The only reason Miqu existed was because it was a faster/cheaper way to get them started. It's questionable if they still see a reason to make another 70B.
>>
>>102184835
LAION
Black Forest Labs
>>
>>102184813
would you really rather have sane companies not participate and cede the entire policy space to safetyist EA psychos
>>
>>102186019
To their advantage they have the compute to continue pretrain to a level where they could actually make a difference.
>>
>>102187208
see
>>102185925
they have the compute to continue pretrain it, but they don't have the same dataset used to initially pretrain it, so it's going to forget a lot
>>
>CR is kinda sloppy now
what the fuck man what happened?
did mistral give cohere THAT gptslop dataset too?
>>
>>102184497
>>102184429

read lol, there's a guy who has made some pretty cheap hardware work.

Jealous desu, but I need to be busy on other things, watching his project though.
>>
>>102187290
the old CR was a base model with a thin instruct coat of paint, which is why it was fun but bad on benchmarks
the new one is a bona-fide instruct model which is why it's less fun and better on benchmarks
>>
>>102187357
I am once again asking if we have an unfiltered base model bigger than Nemo
>>
>>102187357
but does every company in this field use the same instruct dataset or what? it's that same fucking tone, those same phrases everywhere
>>
>>102187202
No, I’d rather the same companies throw money at politicians to block any pending legislation. The only reason to participate is to get regulatory capture.
>>
File: 1694192620579715.gif (2.23 MB, 498x273)
>>102187290
The more optimized models become at specific tasks, the more slopped they get. Models are you good at prose / creativity despite their training not because of it.
>>
>>102187410
yeah they all use scale
https://scale.com/
>cohere
>>
so anons can read >>102187432
>Models are you good at prose / creativity despite their training not because of it.
*Models that are good at prose / creativity despite their training aren't made good because of the training.
>>
>>102187506
>AI Digital Staff Officer for national security.

>Scale has partnered to bring the leading large language model providers to U.S. Government networks and use cases. Donovan customers can access a variety of large language models such as OpenAI's GPT-3.5, Cohere's Command, and Meta's Llama 2 to allow users to select the most appropriate model for their mission.

>Donovan customers can access a variety of large language models such as OpenAI's GPT-3.5, Cohere's Command

https://scale.com/donovan

So there's a military version of Command that exists. That's pretty scary seeing how schizo Command is.
>>
I feel like Euryale-v2.2 is worth wrestling into creativity with XTC. 3.1 is really smart. It's getting stuff that 72B / large mistral is not. That dryness just needs more work.
>>
>>102186786
>lm studio
Fucking lmao
Use koboldcpp + SillyTavern
>>
>>102187506
Ontology software/databases are extremely important, as ai models adhere to these consistently. They can be thought of as a trunk of knowledge, onto which the rest of the leaves of training are applied.
>>
>>102187506
they really skipped the gpt4 distillation part and went straight for the data used to train gpt4 huh?
>>
>>102187650
>kobold
advantage over llama.cpp?
>>
>>102187712
It has a GUI so it's easier for people to work with when they're familiar with other GUI programs like LM Studio (which is why I recommended it).
>>
>>102183145
I don't think it's denial, because those people who invested thousands of dollars would also see improvements from extreme advancements in low VRAM models. Ex, a superior 7b model can be linked up to an 8x7b model, and if you are correct, it will absolutely blow 70b models out of the water.

Improved small models should be something that everybody cheers for.
>>
is there even demand for multimodality? vision sucks ass so far and can't be trusted with anything except mass tagging of chinese cartoons
>>
I just use Google ai studio now
It's better than everything else
Since I'm not a pedophile it works well for my needs
>>
>>102188095
Based
>>
>>102188088
We'll need it sooner or later to make the dream real of watching anime in real time with your waifu and talking about it with her as you watch
>>
I have a 3080 and I'm mostly running mistral nemo finetunes these days. Are there any models that would justify getting a 3090? Having more VRAM for context and higher quants would be nice but I'm not gonna get a new card just for that.
>>
I hate how SillyTavern stops the generation if I delete a message higher up. I want to be able to clean things up while my model is working.
>>
>>102188088
>is there even demands for multimodality?
Yes
>vision sucks ass so far
Hence the demand
>>
>>102188242
Not really.
>>
>>102188242
There's no model that justifies any hardware yet. That goes for local and cloud providers equally. Autoregressive LLMs are fundamentally flawed and their impending implosion will herald the next AI winter and likely recession to come.
>>
>>102188424
>There's no model that justifies any hardware yet.
fact. full llama 70b is better than the quants, but it's literally marginal, say 1.1x gains for a 10x cost increase
>>
File: file.png (102 KB, 666x791)
Is llama the most original big model?
>>
>>102188498
Who offers 405b?
>>
>>102188614
in poe it's together.ai, but there are other providers as well.
>>
>>102188614
>Who offers 405b?
at a usable quant. 405b is extremely sensitive to being quanted.
It should almost be "Who offers 405b at FP16"?
>>
>>102188686
How slow would that be, even hosted?
>>
how come there's no q7
>>
Damn. I'm only getting 20% faster with speculative decoding on Mistral 123B Q5_K_S with 7B v0.3 Q8 as the draft, on my machine. 2 or 3 draft tokens doesn't seem to change it much, and it almost never gets all 3 draft tokens right. So I guess you need a really good draft model for the speedup to be bigger here, or you have to generate a very easily predictable passage. Plus I noticed that there seems to be some kind of bug where it's not obeying your top k and temperature settings. Even with top k at 1 and temperature 0, in one instance it picked a token that wasn't the top one. Normally with these settings, all top token probabilities sent to the frontend should be 100%, but I took a look and a lot of them actually aren't.
>>
>>102188686
? Where did this come from?
>>
File: file.png (17 KB, 809x429)
>>102188686
Honestly, I don't know anyone who does at FP16. Most do at 4bit and 8bit. Together is at 8. As far as I know, it does not fit in conventional servers at 16 and will need to be split
>>
>>102188686
>Who offers 405b at FP16
hyperbolic does, also the base
(this is the ad that was bought)
>>
>>102188727
>Plus I noticed that there seems to be some kind of bug, where it's not obeying your top k and temperature settings.
Hey, that's also what I noticed. But another anon said it was working for them. Maybe it's related to the version of llama-cpp-python?
>>
File: file.png (92 KB, 878x590)
>>102188761
>hyperbolic
not gonna lie, this seems a bit too good to be true
>>
>>102188765
Weird. I built it through pip install instead of installing a prebuilt wheel. I wonder if that has something to do with it.
>>
>>102188761
>Llama 3.1 405B parameters BASE (BF16): $4 per 1M tokens

How's that work out, is that close to 1 million words?
>>
>>102188727
Try using a Nemo quant, I managed to make it work with this change:

# Create the draft model class.
# Note: args, numa_strategy and get_llama_proxy come from the surrounding
# llama-cpp-python server script this is patched into.
import numpy as np
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaDraftModel


class DraftModel(LlamaDraftModel):
    def __init__(self, current_model, path_to_draft_model, n_speculation_tokens):
        # Load the small draft model as its own Llama instance.
        self.model = Llama(
            verbose=True,
            model_path=path_to_draft_model,
            n_gpu_layers=args.n_gpu_layers_draft,
            n_ctx=args.ctx_size,
            n_batch=args.batch_size,
            n_threads=args.threads_draft,
            n_threads_batch=args.threads_batch_draft,
            flash_attn=args.flash_attn,
            numa=numa_strategy,
            use_mmap=not args.no_mmap,
            use_mlock=args.mlock
        )
        self.n_speculation_tokens = n_speculation_tokens
        self.current_model = current_model

    def __call__(self, input_ids, **kwargs):
        # Detokenize with the main model, retokenize with the draft model,
        # then greedily generate n_speculation_tokens draft tokens.
        text = self.current_model.detokenize(input_ids)
        generator = self.model.generate(self.model.tokenize(text), top_k=1, temp=0, top_p=1)
        output = np.zeros(self.n_speculation_tokens, dtype=np.intc)
        for i in range(self.n_speculation_tokens):
            # Map each draft token back into the main model's vocabulary;
            # [1] skips the BOS token added by tokenize().
            output[i] = self.current_model.tokenize(self.model.detokenize([next(generator)]))[1]
        return output


# Set the custom draft model on the currently loaded model.
llama_proxy = next(get_llama_proxy())
llama_model = llama_proxy._current_model
if llama_model is not None:
    llama_model.draft_model = DraftModel(llama_model, args.model_draft, args.draft)


For me the Nemo draft model seems to get the right tokens relatively often.
>>
>>102188918
In theory, would it be possible to modify the script so the big model runs through the RPC backend?
>>
>>102188802
It seems cheap compared to the other one, but also:
>>102188914
>>
>>102188727
>>102188765
I'm the anon for whom the sampler settings worked. As I mentioned last thread, I only had it working on the /v1 endpoint with SillyTavern in its default API mode, and it took a bit of trial and error to find the combination of settings/URLs that made it work. I have no idea what it's doing in the backend with the FastAPI/uvicorn server, but I suspect the issue lies there rather than in llama-cpp-python itself. I'd fiddle with how you're connecting to it and the API format the frontend uses to see if that fixes anything.
>>
File: file.png (25 KB, 458x237)
>>102188955
I'm super skeptical, because for the 70B basically everybody hovers around $0.90/M.
Either they're running at a substantial loss, or they're doing something shady like swapping in smaller models or a very low, unstable quant.
>>
>>102189143
is a token roughly equal to a word?
>>
>>102189189
as a very rough estimate yes, it'll be the same order of magnitude as word count in most cases. on average it's more like 2/3 of a word
>>
>>102189189
depending on the type of content being generated, it could be. but generally 1m tokens is roughly 700k-800k words
>>
>>102189189
Punctuation also counts as tokens. In code, for example, {, }, commas and other symbols each consume a token, so code-like content is more 'expensive'.
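As a back-of-the-envelope check (the 0.75 words-per-token ratio is just a rule of thumb for English prose, not a fixed constant):

# Rough cost estimate for the $4 / 1M-token pricing mentioned above.
# The words-per-token ratio is an assumption, not a fixed constant.
words_per_token = 0.75
price_per_million_tokens = 4.00  # USD

words_per_million_tokens = 1_000_000 * words_per_token        # ~750,000 words
cost_per_100k_words = price_per_million_tokens * (100_000 / words_per_token) / 1_000_000
print(f"1M tokens ~= {words_per_million_tokens:,.0f} words")
print(f"~${cost_per_100k_words:.2f} per 100k words")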
>>
>>102189221
>>102189215
thanks

>>102189254
ahhh
>>
FUTO keyboard but for PC?

Basically: click a shortcut -> Whisper starts listening -> click the shortcut again -> it types the transcribed words into the currently selected input on the PC

Surely there must be something like this by now?
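In case nothing turns up, a minimal sketch of the idea in Python, assuming sounddevice, openai-whisper and the keyboard package are installed; the F8 hotkey and model size are arbitrary choices, not an existing tool:

# Push-to-talk dictation sketch: press F8 to start recording, press F8 again
# to stop; the audio is transcribed with Whisper and typed into whatever
# window currently has focus. Library choices and the hotkey are assumptions.
import time
import numpy as np
import sounddevice as sd
import whisper      # pip install openai-whisper
import keyboard     # pip install keyboard (may need admin/root)

SAMPLE_RATE = 16000
model = whisper.load_model("base")

while True:
    keyboard.wait("f8")              # first press: start recording
    time.sleep(0.3)                  # crude debounce so the same press doesn't stop us
    chunks = []
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as stream:
        while not keyboard.is_pressed("f8"):   # second press: stop
            data, _ = stream.read(SAMPLE_RATE // 10)
            chunks.append(data.copy())
    if not chunks:
        continue
    audio = np.concatenate(chunks)[:, 0]       # mono float32 at 16 kHz
    text = model.transcribe(audio, fp16=False)["text"].strip()
    if text:
        keyboard.write(text + " ")             # type into the focused input
    time.sleep(0.3)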
>>
>>102188282
Create an issue on GitHub and tell the ST devs to fix their shit.
>>
>>102189731
No, they don't allow GitHub accounts from my country.
>>
File: file.png (26 KB, 540x340)
>>102189703
>Surely there must be something like this by now?
Funnily enough I'm working on that exact thing right this moment.
Ignore how terrible the GUI looks, I'm in the midst of trying to make it look palatable.
>>
>>102188282
Generation depends on previous tokens.
>>
>>102190065
And it should just cache that instead of breaking entirely.
>>
smedrins
>>
>>102190141
What I mean is that if you change tokens in the context, the model needs to recalculate probabilities for every token after the ones you edited. You're invalidating the cache with your edits.
>>
>>102190178
nta, but he wants to edit after the request has already been sent to the backend; by that point, what ST displays doesn't affect the model.
>>
>>102190178
>You're invalidating the cache with your edits.
The cache is (should be) separate from the displayed text.
Once you press "send", the current text should be cached and that cache should be used to generate output.
The displayed text should be editable without the context or the model being impacted.
>>
>>102190202
>>102190226
And by the next request that cache will be invalid, and everything from the last edit onward will need to be recalculated, including the model's response in the middle.
user: AAAA
model: BBBB
user: CCCC
model: generating DDDD
If anything in AAAA, BBBB, or CCCC changes, DDDD will change as well, but we got the "original" DDDD from the model.
user: EEEE
model: the old DDDD may not make sense after the edits, so from the edit point it needs to reprocess the new tokens, regenerate DDDD, parse EEEE and generate FFFF.
New edits to DDDD because reasons, and the loop repeats.
Those caches need to stay in sync. One changes, the other is invalidated, the output of the cached request is invalidated, and all future tokens are invalid.
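For what it's worth, that's the standard prefix-cache rule; a toy sketch of the idea (not ST's or llama.cpp's actual code):

# Toy illustration of prefix caching: only tokens after the first edited
# position need to be re-evaluated. A sketch of the idea, not real ST or
# llama.cpp logic.
def reusable_prefix_len(cached_tokens: list[int], new_tokens: list[int]) -> int:
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

cached = [1, 11, 12, 13, 20, 21]   # tokens for AAAA BBBB CCCC DDDD
edited = [1, 11, 12, 99, 20, 21]   # CCCC was edited mid-conversation
keep = reusable_prefix_len(cached, edited)
print(f"reuse {keep} cached tokens, reprocess {len(edited) - keep}")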
>>
>>102190316
That's moving the goalposts. Sure, it'll need to reprocess, but that's the same as if he'd deleted the message while the model wasn't generating; the cache gets changed either way. All he wants is to not have to wait for the model to finish generating before he deletes stuff.
>>
>>102190375
>all he wants is to not have to wait for the model to finish generating while he deletes stuff.
I get that. But if the model is generating tokens from outdated tokens, the new tokens will not necessarily make sense. The goalpost hasn't moved. For me, at least, it doesn't make sense to generate tokens based on an invalid cache. That just leads to more edits to fix the new errors/inconsistencies based on the tokens the model has no idea about.
Imagine trying to do that with code. You change the name of a variable at the start of a function and then it needs to be changed for every gen since the edit.
The options are either to keep generating who knows how many tokens at who knows what speed, KNOWING that those tokens will no longer be valid and that the whole thing has to be regenerated anyway, or to stop generation, let the user make their edit, and generate with the updates applied directly.
>>
File: usefultool.png (2.46 MB, 3840x2160)
>>
https://aclanthology.org/2024.lt4hala-1.15.pdf

gpt-4 works with Latin, to some extent. What others work with Latin?

(crossposted accidentally)
>>
boring general
>>
After doing some tests, it seems like on my machine Mistral Large is fastest with Mistral 7B v0.3 at Q4_0 as the draft model, rather than Nemo also at Q4. Now I get 1.33x speed. I put the draft model on my weaker GPU and split the main model across my RAM, my stronger GPU, and my weaker GPU, all three. It seems the penalty from splitting across multiple GPUs is outweighed here, at least for token generation, and this is faster than putting the draft model on the strong GPU. From this, I believe that on my system the bottleneck is the main model's speed, then the draft model's speed/accuracy. Perhaps I could get more (proportionate) gains by using a smaller quant of Mistral Large, not sure.
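If anyone wants to reproduce that placement, a rough sketch of how it might look with llama-cpp-python; the paths, layer count, and split ratio are placeholders, not a known-good config for this hardware:

# Rough sketch of the placement described above: draft model pinned to the
# weaker (second) GPU, main model split across both GPUs with the remaining
# layers in RAM. Paths, layer counts and split ratios are placeholders.
import llama_cpp
from llama_cpp import Llama

main_model = Llama(
    model_path="Mistral-Large-123B-Q5_K_S.gguf",   # placeholder path
    n_gpu_layers=48,                               # partial offload; the rest stays in RAM
    tensor_split=[0.65, 0.35],                     # share of offloaded layers per GPU
    n_ctx=8192,
)

draft_model = Llama(
    model_path="Mistral-7B-v0.3-Q4_0.gguf",        # placeholder path
    n_gpu_layers=-1,                               # fully offloaded
    main_gpu=1,                                    # pin to the weaker GPU
    split_mode=llama_cpp.LLAMA_SPLIT_MODE_NONE,    # keep it on that one device
    n_ctx=8192,
)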
>>
>>102179805
I'm failing to understand what the context template and instruct mode prompts for Hermes 3 should look like. Anyone got their settings?
>>
>>102191403
It's ChatML, and it should come prepackaged on most frontends.
>>
>>102191441
Thanks anon, much appreciated
>>
>>102191116
>>
>>102191372
Did you leave your --draft at 4?
>>
>>102191557
Forgot to mention that. I tested 1 through 4 for each model and each configuration (putting it on the strong GPU or the weak GPU) and found that 2 was the best, though 3 was almost the same most of the time. I guess it could be different if I were to try a coding prompt, but I don't do much of that, so it doesn't matter much to me.
>>
That reminds me

>>102184629
>>102175851

I didn't like it. I found it insanely horny to the detriment of everything else. I tested it with a card that depicts a tomboy friend with benefits and every time I tried, despite the opening post setting us up to hang out first, the AI would insist on depicting my character already down to his underwear or something similar so the AI could walk over seductively then fondle me or crawl up to me and sniff my bulge and beg for sex or something equally lewd.

The same card on Rocinante would tackle me into a hug, press their tits against me, say they missed me, ask me how I'm doing, and ask what I'd like to do.
>>
>>102191845
For what and where?
>>
>>102187357
Can one of you EU anons get a job at Mistral and leak the 123B base model? That would be sick. Thanks in advance.
>>
>>102191898
For that and here :)
>>
What's the best local model that can write good nsfw verbose flux prompts on a 24 gig card?
>>
>>102191979
>123B base model
why? are they actually better?
>>
>>102192120
Ignore him. Base models are all but useless.
>>
>>102192144
Why do some companies like meta make base models then? Couldn't they just make instruct models like CMDR?
>>
what can you do with a locally hosted llm practically speaking?
>>
>>102192176
erp
>>
>>102192170
All companies make base models; instruct and chat models are trained on top of them. Some companies like Meta release both. Cohere has only released instruct versions, but a base model exists somewhere.
>>
what he said. The first bake is a base model.
>>
>>102192189
>erp
practical...
>>
Should I use xml style for cards? Like

<{{char}}>
</{{char}}>
>>
>>102192255
Yes
>>
>>102192255
No
>>
>>102192255
depends on the model but usually it's okay
>>
Why do even very apparently smart people believe in AI safety shit? I mean the AI becoming evil, not stopping it from saying nigger. There's no proof AI will start to le kill humans
>>
>>102192289
The answer you seek is politically incorrect.
>>
>>102192270
>>102192274
>>102192278
At least one of you is trying to be funny by messing with others. Because of that, your mom will die in her sleep tonight.
>>
>>102192297
What's the answer? There are people who genuinely believe it
>>
>>102192309
>Because of that, your mom will die in her sleep tonight.
fucking finally
>>
>>102192345
buy an a100 with the inheritance
>>
>>102192255
Do this
https://wikia.schneedc.com/bot-creation/trappu/introduction
>>
>>102192120
Then we could do our own instruct tunes that aren't slopped.
It's a great model for regular assistant tasks, but heavy fine-tuning (SFT and RLHF/DPO) has crippled its potential for RP.
>>
>>102190474
My typical usage pattern is to generate and hide rather than swipe so that I can compare messages on-screen and splice bits that I like together sometimes. So the messages I'm deleting aren't in the prompt.
>>
>>102192236
1. if you don't want your data sent to a megacorp
2. if you don't want to pay said megacorp money
3. if the megacorp finally blocks off their app and API and puts it behind a massive paywall
4. if LLMs in the future get cucked to all fuckery, then it's good to have a backup
5. if the end of the world arrives, then at least you have a digital waifu to speak to before you die
> 6. did i mention megacorps
>>
>>102192176
migu seggs
>>
>>102192236
erp practically
>>
>>102192526
>Then we could do our own instruct tunes that aren't slopped.
Nobody does that though. Everyone who finetunes uses "synthetic data" (read: GPTslop) because they can't afford real data
>>
>>102192176
Coding
It doesn't matter if you don't know how to code.
>>
>>102192656
>>102192656
>>102192656
>>
>>102192634
Real data is garbage, you don't know what you're talking about desu
>>
>>102192657
If you don't know how to code, you won't know when the model is bullshitting you, and you'll get stuck when the code output doesn't compile (which happens a lot with llama models).
>>
>>102192116
idk what the meta is right now but i'd say c4ai-command-r-08-2024-Q4_0.gguf
or Coomand-R-35B-v1-GGUF, but it's not really creative
>>
>>102192531
>So the messages I'm deleting aren't in the prompt.
Assuming you mean 'hide' (based on your usage description), the message is still in the context. Changing it (hidden or not) invalidates the cache.
What I imagine your usage looks like is something like this, taken to an extreme, of course:
model: Alright. Do we shoot each other or fuck
user: shoot [generate in the background, edit shoot to fuck, generate again, compare results]
ST needs to keep both caches if you want to continue with one or the other. Once generated, editing the model's reply adds a third version of the cache. ST could merge them back into one (after the last generation), but any further edit still invalidates the cache.
Do that for a few turns and you have an entire tree of caches ST needs to manage. So what you need is not the ability to 'just cache' the request's context, but a tree of contexts.
I remember seeing something like that a while back. I don't remember if it was a plugin for ST or some other frontend. It showed the LLM's reply and yours as a tree, so you could continue and inspect stuff on different branches.
Maybe this
>https://github.com/ironclad/rivet
or
>https://github.com/ianarawjo/ChainForge
has something like it, but it's not quite the one I remember, and I don't know if they support local models.
Point is, it's not just 'cache the context'. It's 'cache all these contexts and continue, or not, with one of them, except when I make an edit on a past message, and then generate another cache from it from which I may or may not continue'.
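Something like this toy structure, just to illustrate the bookkeeping (a sketch, not ST's actual data model):

# Toy branching-chat tree to illustrate the bookkeeping described above.
# This is not SillyTavern's actual data model, just a sketch of the idea.
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str                         # one message (user or model)
    cache_valid: bool = True          # whether the KV cache up to here is reusable
    children: list["Node"] = field(default_factory=list)

    def edit(self, new_text: str) -> None:
        """Editing a message invalidates its cache and every cache downstream of it."""
        self.text = new_text
        self._invalidate_subtree()

    def _invalidate_subtree(self) -> None:
        self.cache_valid = False
        for child in self.children:
            child._invalidate_subtree()

root = Node("user: AAAA")
reply = Node("model: BBBB")
root.children.append(reply)
root.edit("user: AAAA (edited)")      # reply.cache_valid is now False as well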
>>
>>102188498
Kill yourself schizo
>>
>>102188686
Groq soon (TM)
>>
>>102192907
It's more like...

USER: Do something!
model: You kill myself (hidden)
model: I kill yourself (hidden)
model: I kill... (generating)

...and what I'd like to do is delete the oldest hidden message while it's working, just to clean up the log.
>the message is still in the context
Hidden messages shouldn't be sent to the model.
I'll take a look at those plugins, though.


