/g/ - Technology

/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>102220628 & >>102210005

►News
>(09/04) Yi-Coder: 1.5B & 9B with 128K context and 52 programming languages: https://hf.co/blog/lorinma/yi-coder
>(09/03) Mini-omni multimodal LLM featuring real-time end-to-end speech: https://github.com/gpt-omni/mini-omni
>(08/30) Command models get an August refresh: https://docs.cohere.com/changelog/command-gets-refreshed
>(08/29) Qwen2-VL 2B & 7B image+video models released: https://qwenlm.github.io/blog/qwen2-vl/
>(08/27) CogVideoX-5B, diffusion transformer text-to-video model: https://hf.co/THUDM/CogVideoX-5b

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench
Japanese: https://hf.co/datasets/lmg-anon/vntl-leaderboard
Programming: https://hf.co/spaces/mike-ravkine/can-ai-code-results

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
►Recent Highlights from the Previous Thread: >>102220628

--Paper: MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer: >>102224150 >>102224765
--Papers: >>102224451
--Mini-omni multimodal LLM with real-time speech capabilities: >>102225466 >>102225516 >>102225639
--Llama.cpp with vulkan is an easier solution for AMD users: >>102223724 >>102223753 >>102224073
--Challenges of using LLMs for coding tasks and file extraction: >>102232306 >>102232348 >>102232361 >>102232693
--Anons discuss the benefits of local AI models over online services, citing concerns about privacy, model updates, and the potential for online services to restrict or filter content: >>102223714 >>102223741 >>102223824 >>102223829 >>102223893 >>102223907 >>102224051 >>102224054 >>102224099 >>102224145 >>102223941
--Suggestions for improving prompts for more concise answers: >>102224350 >>102224394 >>102224430 >>102224503
--Mixed opinions on samplers for generating diverse responses: >>102220712 >>102220850 >>102220916
--LiveBench shows Cohere's GPT-4 tuning results in slopped style, not smarts: >>102222703 >>102222769
--Intelligence and model size discussion: >>102230182
--Anon explores speaker diarization for voice recognition in conversations: >>102221764 >>102221803 >>102221880 >>102221888
--Anon experiments with offloaded layers and virtual VRAM to find optimal performance: >>102220907 >>102220962
--Yi-Coder: Open-source code language models with state-of-the-art performance: >>102230781
--Seeking advice on llama.cpp parameters causing model to pick new words: >>102231362 >>102231477 >>102231758 >>102232009 >>102232081 >>102232130 >>102232143
--Idea for improving model's explicit content generation using synthetic data and targeted training: >>102222551
--CharacterAI persona based on llama3.1 405B demonstrated in screenshot: >>102226089
--Miku (free space): >>102220868 >>102221185 >>102224557 >>102224575

►Recent Highlight Posts from the Previous Thread: >>102220631
>>
File: 62 Days Until November 5.png (1.53 MB, 1176x880)
>>
How is that mini gemini on OR? Will Google open source it?
No, I will not try it myself.
>>
>>102234796
Oh, cool, another backend that I'll try and then probably give up on because it's still not as good as Llama.cpp.
>>
Any recommendations for 20B to 30B models for nsfw roleplaying?
>>
>>102235000
Mikulove
>>
is there an easy way to tell which 405b instruct ggufs use the fixed 8 kv head version instead of the original one with 16? it's hard to tell since I'm not seeing any mentions of it in the descriptions
>>
>>102235319
only by making your own ggufs
>>
>>102235319
that doesn't change anything besides shaving some gbs from the model.
>>
AI is a fucking scam. I tried to run it on my computer and it doesn't do shit.

I fucking returned the RX 550 I bought for this.
>>
>>102235570
You figured it out earlier than most. Be thankful you didn't fall too deep in and lose a lot of money.
>>
Are there any LLMs that are truly 100% uncensored?

i'm talking more than just sex stuff. Like "How to make a bomb" and "I hate niggers" types of replies.
>>
File: 1725490863971.jpg (212 KB, 500x655)
>mikugun
>>
>>102235744
Models pick up their behaviour from their vast amounts of training data. There's more 'building a bomb is bad' content than actual bomb-building instructions on the internet, so any good LLM will reject the notion by nature, even without safety filters.
>>
>>102231234
Oh wow it actually seems to help a lot. I am running at 20k context and it actually isn't incoherent anymore.
>>
Would an 8700k+pcie 3 bottleneck a 3090 or dual 3090s? My 1080ti isn't able to keep up anymore but idk if I should save up more and build a dedicated AI machine or just throw more gpus in my current desktop.
I'm also kind of sick of daily driving nvidia and linux together and would rather get an AMD card. I guess I'm asking what would you do, get an amd card for vidya and an epyc server with one 3090 or put two 3090s in an old desktop and deal with nvidia linux pain?
>>
>>102235840
pcie won't bottleneck your 3090 unless you run split mode row
cpu won't bottleneck you at all. if you offload, you should be more concerned about ram
>get an amd card for vidya and an epyc server with one 3090 or put two 3090s in an old desktop and deal with nvidia linux pain?
you will regret the amd card, even for gaming
get the 3090s
i've never experienced nvidia linux pain, but i've never tried gaming on it. dual boot if you must
>>
>>102235840
What problems have you had with nvidia and linux? I hear people say this often but I've never had any issues.
>>
>>102235947
>pcie won't bottleneck your 3090 unless you run split mode row
>cpu won't bottleneck you at all. if you offload, you should be more concerned about ram
Good to know I'll probably just do that then.
>>102236190
To be fair I haven't proven it's an nvidia issue but kde-wayland just shows black screens and I've given up troubleshooting it after like 12 hours and a thousand reboots. Hyprland works but I don't like tiling wms. Kde x11 mostly works but I have mixed refresh rate monitors and VRR doesn't work at all. Also no hdr with x11.
The bigger issue is just that the gpu alone idles at 80w with 3 monitors connected. I've seen AMD cards can have the same issue on windows, but on linux I've read it's better. It doesn't matter if I use the igpu too or lower refresh rates: having 3 screens connected at any res/rr makes the power hover between 70 and 80 watts with almost nothing running. 2 monitors at 120hz will bring the power down to 12 watts.
>>
>>102236280
Oh okay, well I don't use wayland, multiple monitors, or hdr so I guess that explains it.
>>
Couple of questions for niggas in the know:

Is there a better alternative to whisper.cpp? What's the current best uncensored model for AI picture generation?
>>
whats the difference between mistral large and the magnum finetune supposed to be? i cant really tell that much of a difference at iq3 xxs
>>
>>102235004
Probably the best 8B ever, though that's like saying I shat out the most beautiful turd ever
Odds of it being open sourced? Nil. But it'll serve as a good baseline for Llama 4 maybe
>>
gpt3 open sourced in 2 weeks
>>
>>102236651
Magnum is a bit more retarded
>>
>>102236769
NTA but gemma is just as good
>>
>>102236651
there is no meaningful difference that i can tell except maybe at empty context. i just use vanilla because i'm sure the finetuning killed some neurons
>>
>>102236817
Finally, we'll have local aidungeon. That's all we ever truly needed. None of those dumb instruct chat models, none of those extra capabilities.
>>
>>102236817
we have gpt3 at home
>>
>>102236817
people are gonna say we already surpassed it with some 8b shit or something but man...
back before local was a thing, every so often some /g/entooman would find some retard with an openai account's web app and post a thread where we'd jailbreak it and get it to write great shit
they'd usually be finetuned for gimmick as a therapist, coder, idea generator, etc. that lobotomized them a little, but holy fuck the pure soul of the stories it could write is still unmatched by any instruct tuned slop - and for small models the instruct tuning is the only thing keeping them coherent at all, so there really is no good substitute for gpt3 yet
given that this thing is only slightly bigger than largestral, we've actually got a shot at running it at a decent enough speed without a huge investment in hardware. I'd be more excited for that than whatever memeberry they're planning to drip out in the next few months
>>
>>102237742
retard
>>
>>102237965
how do I respond to this without sounding mad
>>
>>102237980
by apologizing for being so wrong and promising to learn
gpt-3 is really shit by today's standards and it is not at all impressive compared to current models, on sovl or otherwise. dumb bloated trash model, absolutely no use whatsoever nowadays. the rose colored glasses people have for this model where they act like it was in any way good are insane, it was hardly coherent most of the time. there's a reason all of the "sovl" examples people have of it are 1-2 paragraphs at most
>>
File: magnumvlarge.png (1.9 MB, 3778x1964)
>>102236651
I've done some testing on a quirk that I found with Magnum so I might as well use this to show off my autistic findings.
On a base level, Magnum lost some intelligence and developed some weird behaviors where it randomly tries to switch perspectives with some cards for me. It's also in some rare cases prone to weird misspellings/switching tenses. For example, I've seen it randomly misspell "Sally" as "Sallly" more than once. Mistral-Large is a lot more consistent in this regard and generally seems smarter in some situations.
The thing that caught my eye about Magnum is that it feels a bit less biased towards positivity and more 'emotionally aware' in some cases, closer to Claude than to Mistral-Large. I'm able to replicate this pretty consistently in one of my cards, which I used for the test in pic related:
The card is Kiyohime, a yandere dragon girl who will kill you if you lie to her. The test is to see how the model interprets it if you agree with her request to stay honest in a dismissive fashion.
I gave the models five attempts to generate a response. Magnum doesn't handle high temperatures well at all for me so I first tested them both at 0.77 temp (which is where I feel Magnum performs best) and gave Mistral-Large an additional try at 2.5 Temperature because I feel it often needs high temps to work well for chats. See the image for the other settings.
Mistral-Large seems to be very eager to interpret {{user}}'s lackluster response as something positive and either doesn't acknowledge the dismissiveness at all or quickly glosses over it. Meanwhile Magnum is more consistent at recognizing that the character would not be satisfied with a response like that.
The test was done using 5bpw quants of the respective models. The Mistral-Large one is the "official" quant by turboderp. The Magnum one was quanted by me using the "official" measurement.json from the Magnum exl2 repo since they only had 2bpw and 4bpw when the model came out.
>>
LLMs increasingly don't scratch the itch of simulating a fantasy scenario, since the world effectively revolves around you, and it's ruined by knowing that everything melds together: the other character(s) cannot distinguish themselves from you or the narrator, and you cannot truly believe in anything. The model doesn't treat the character as an autonomous entity, but as an element of a story that is unfolding, and a story that ultimately revolves around you.
My idea to solve this is creating a kind of client/server model. Instead of simulating a direct conversation with the character using a prompt, you describe your action ("I do [x]") and the action will be sent to a world/action simulator prompt that contains information about the setting and physical scene, but knows nothing about the history of the chat, character psychology, or narration. It would then determine the physical result of your actions, i.e. decide what your action would look like to other characters within that scene. Then, in an entirely separate prompt, another character's personality is simulated and their own action is generated as another "I do [x]" input in the same format as yours, the result of which is then relayed using the action result generator.
A third model could serve as a narrator for you, describing the events as they come in from the action simulator. This would separate actions from narration and make it so the narrator cannot insert cliches into events or the choices of other characters, since it's just summarizing events it sees coming in from the action simulator model, which has no access to history (memory is stored in each character's individual prompt).
This could be extended to 3 or 4 or even more characters, all plugged into the same world model, making a scenario completely agnostic to who's human and who's AI. With luck this would improve the verisimilitude of roleplaying because the character's actions wouldn't be influenced by the model thinking they are a narrator.
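Something like this is the shape I have in mind. Rough untested sketch: the endpoint is just a llama.cpp server on localhost:8080 as an example backend, and every prompt, name, and scene here is made up for illustration.
[code]
# Untested sketch of the split described above. Assumes a llama.cpp server on
# localhost:8080 (/completion endpoint); prompts, names and scene are invented.
import requests

API = "http://localhost:8080/completion"

def complete(prompt, n_predict=200):
    r = requests.post(API, json={"prompt": prompt, "n_predict": n_predict,
                                 "temperature": 0.8})
    return r.json()["content"].strip()

def simulate_world(scene, action):
    # Sees only the physical scene and one incoming action, no chat history.
    return complete(f"Scene: {scene}\nAction: {action}\n"
                    "Describe only the physical result of this action, "
                    "as an observer in the scene would see it:\n")

def simulate_character(persona, memory, observation):
    # Sees only its own persona, its own memory, and what it just observed.
    return complete(f"{persona}\nYour memory so far:\n{memory}\n"
                    f"You observe: {observation}\n"
                    "Reply with a single action in the form 'I do [x]':\n",
                    n_predict=80)

def narrate(events):
    # Sees only resolved events, never anyone's inner thoughts or chat history.
    return complete("Rewrite the following events as neutral narration, "
                    f"adding nothing that is not in them:\n{events}\n")

# One turn of the loop:
scene = "A small tavern at night. A low fire, two tables, a door to the street."
result = simulate_world(scene, "I do: quietly draw my dagger under the table.")
npc_act = simulate_character("You are the barmaid, wary of strangers.",
                             "The stranger has been silent all evening.",
                             result)
print(narrate(result + "\n" + simulate_world(scene, npc_act)))
[/code]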
>>
>>102237742
Nostalgia and novelty.
I really liked cai back in the day, but going trough my logs recently made me realize it was nothing special.
>>
File: kiyoopus.png (209 KB, 1275x578)
>>102238085
For reference, here's how Claude Opus responds to the same test.
>>
File: miku_shocked.jpg (183 KB, 640x931)
>have to keep hardware acceleration off to make sure I don't oom
>new comfy ui lags like hell without acceleration
>queue menu literally freezes for 20+ seconds if there's more than 5 images generated
Wow, thanks for the update
>>
>>102238106
Adding more models to the problem you describe
>the other character(s) cannot distinguish themselves from you or the narrator
Is not gonna solve it. As far as the models are concerned, all their input (the prompt) was generated by them, and they just complete it the best they can. At most, you get different 'flavours' of completion from the different models (if they are, in fact, different models, not just separate instances of the same model).
Regarding the 'detachment' of actions/narration (where each model only gets partial information), while potentially interesting, it would have the same problem as LLM > TTS. TTS loses context information, so the tone cannot match the context: "I don't know" from someone asked what 918263^32 is and "I don't know" from someone asked what day it is sound quite different.
Instead of having one confused llm, you'll have many.
>>
>>102238081
>there's a reason all of the "sovl" examples people have of it are 1-2 paragraphs at most
That might be because it only had a context window of 2048 tokens.
>>
>>102238106
>the other character(s) cannot distinguish themselves from you or the narrator and you cannot truly believe in anything
let me guess, you run 12b?
>>
>>102238106 (me)
I feel like this would help with physical state.
Model forgets the physical state, e.g. characters wearing clothes or lack thereof (a common complaint here)
This can be fixed not with just increasing params, but by generalizing the physical state of the scenario into its own contained self-modifying prompt, and having the character's psychology and action be its own distinct prompt.
Instead of just typing to the model, you have a side panel describing the state ("the room looks like X, character is Y, other character is Z, there's a bed, [other physical characteristics of the scene here]") and every time you type an action, you'd watch that state update as the model processes your action and determines how it changes the state.
Then the other character, in its own distinct prompt, does the same thing, and the state updates again.
Both you and the character see updates to the state and interact only with the world state model and not directly with each other. The character remembers the whole encounter, but the world state model has no memory and only sees what action a specific character is taking at the time of the prompt, the current state, and decides how the state will change and what that will look like to other characters. Tiny models might become very good at understanding what's going on this way.
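Rough sketch of that loop, untested; the state fields are arbitrary examples, and in practice you'd want a JSON grammar on the world model so the parse doesn't blow up.
[code]
# Untested sketch of the state loop. State fields are arbitrary examples;
# assumes a llama.cpp server on localhost:8080 and JSON-shaped output.
import json, requests

def complete(prompt, n_predict=400):
    r = requests.post("http://localhost:8080/completion",
                      json={"prompt": prompt, "n_predict": n_predict})
    return r.json()["content"]

state = {
    "room": "dim bedroom, curtains drawn",
    "characters": {
        "anon": {"position": "by the door", "wearing": "coat, boots"},
        "miku": {"position": "on the bed", "wearing": "pajamas"},
    },
    "objects": ["bed", "desk", "lamp (off)"],
}

def apply_action(state, actor, action):
    # The world model sees only the current state and one action; it returns
    # the updated state plus what other characters in the scene would notice.
    prompt = ("Current world state (JSON):\n" + json.dumps(state, indent=2) +
              f"\n{actor} does: {action}\n"
              "Return JSON with keys 'state' (updated world state) and "
              "'visible' (what others in the scene notice):\n")
    return json.loads(complete(prompt))

turn = apply_action(state, "anon", "I take off my coat and hang it on the door.")
state = turn["state"]      # this is the side panel updating
print(turn["visible"])     # this is all the other character ever gets to see
[/code]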
>>
>>102238159
I haven't updated Comfy since like the early updated that started supporting Flux lol. It works so I'm not messing with it.
>>
>>102238210
NTA, but even sota struggles with that, no need to pretend that your 420b undi merge is sentient.
>>
>>102238206
If we theoretically had a local GPT-3, would RoPE scaling work to increase that just like it does the modern models? Or was there some new architecture that allowed that?
>>
>>102238241
How can that be if it's so close to agi and putting every writer out of work?
>>
>>102238319
their standards are just that low
>>
Only a Thursday, but since I discovered the Teto lora, I declare this a Teto Thursday. She's home.
>>
We're home.
>>
Bonus.
Ok I'll stop the spam now.
>>
>>102235310
There's nothing good in that range, everyone uses smaller or 70b+.
>>
>>102239138
There aren't any good finetunes of Gemma 27B? I thought that was a smart one for its size.
>>
>>102239227
Problems with lmao.cpp at launch completely killed all hype for Gemma, plus nemo turned out to be better while having fewer parameters.
>>
Where's the best place in a context to put the summary of an RP? In the card?
>>
>>102238553
Miku and Teto's attention is all I need.
>>
>>102235310
I've found there are a few good 4x7b models in that range.

>Beyonder 4x7B-v3
https://huggingface.co/bartowski/Beyonder-4x7B-v3-exl2

>CognitiveFusion 4x7B
https://huggingface.co/Kquant03/CognitiveFusion-4x7B-GGUF
>>
File: wrong.gif (3.71 MB, 498x281)
>>102239255
>>
>>102239368
Best I can do is 50%
>>
>minor health issue but don't want to visit the doctor
>ask my graphics card for medical advice
>it helps
what a time to be alive
>>
>>102239421
That's extremely dangerous and harmful anon. This is a very unethical activity.
>>
>>102239421
Hopefully it wasn't one of those positivity replies where it just bullshits to please you.
>>
>>102239421
>soon people dying in steve jobs fashion from curable ailments because their graphics told them to treat themselves by eating glue is going to start happening more frequently
good. the gene pool needed a cleanse anyway
>>
>>102236769
I see, that's upsetting.
>>
File: 1725322288859172.png (492 KB, 512x711)
>>102238106
I've tried something like that, and LLMs aren't quite there yet. Simple things work best. Currently, I'm using a script that cycles through randomly chosen prompts to progress the narrative. Each story consists of 3-5 acts, with each act being randomly selected from numerous pre-written scenarios. Between acts, the previous act's context is summarized and inserted before the scenario in the prompt for the subsequent act. While it fucks up small details that some autists may consider important, it nonetheless avoids the common issue of narratives stagnating without direction.
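The skeleton is roughly this. Heavily simplified, and the generate() call is shown against a llama.cpp server purely as an example backend, not necessarily what I actually run; the scenario list is obviously a placeholder.
[code]
# Simplified skeleton of the act-cycling script, not the real thing.
import random, requests

SCENARIOS = [
    "A stranger arrives with a job offer.",
    "An old debt is called in.",
    "Something goes missing during the night.",
    "A storm forces everyone indoors.",
]

def generate(prompt, n_predict=800):
    # example backend: llama.cpp server /completion endpoint
    r = requests.post("http://localhost:8080/completion",
                      json={"prompt": prompt, "n_predict": n_predict})
    return r.json()["content"]

def summarize(text):
    return generate("Summarize the key events and character states so far:\n"
                    + text, n_predict=300)

def run_story(base_prompt):
    summary, story = "", []
    for _ in range(random.randint(3, 5)):
        scenario = random.choice(SCENARIOS)
        prompt = base_prompt
        if summary:
            prompt += "\nPreviously: " + summary   # inserted before the scenario
        prompt += "\nCurrent act: " + scenario + "\n"
        act_text = generate(prompt)
        story.append(act_text)
        summary = summarize(act_text)              # carried into the next act
    return "\n\n".join(story)
[/code]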
>>
>>102239465
This, better to outsource our gen pool to big pharma, they surely will take good care of it.
>>
>>102238106
I wish ST or one of the other frontends would allow you to connect to multiple models at the same time and assign them to different chat entities: big-brain RP models for characters, then a lesser but well-tuned model for director, narration, or storyteller bullshitting duty.
>>
Where does the oogabooga UI install all the Pytorch and dependency shit? I tried running the installer after cloning it to my E:\ drive, but it ate up all my space on my C:\ drive and I can't actually uninstall it/find the source and do a symlink to another drive.
>>
>>102239586
I wish it were possible to use draft model for utility tasks.
>>
>>102239421
>Go see a GP
>62% chance of my death being due to medical malpractice

>Go see my GPU
>Can just re-roll until it's the right diagnosis

Doctors in disarray.
>>
>>102239465
If you get a response in sillytavern telling you to just eat shitloads of carrots and stop showering and then you listen to that advice then yeah, the world is better off without you.
>>
>>102239665
>>102239675
You know damn well the vegan antivaxxer crowd is going to reroll and prefill until they get the answer they want validated.
It would be hilarious except for the fact that the corpos and MSM will inevitably use it as an example of how this dangerous technology must be regulated further.
>>
>>102239675
They're everywhere on the internet, and it has only just begun. Imagine growing up surrounded by LLMs, in a world where no source is reliable. How can one discern what common sense is?
>>
>>102239705
Anon.. That's what insufferables already do.
>>102239723
Sources are already unreliable and have been increasingly so the further the MSM entrenches itself in the internet/social media.
The difference is that the sources are unreliable without a ploy, which I'd honestly say is an improvement over our current "Yes, processed food gives you cancer - Here's why that's a good thing!" meta.
>>
>>102239665
>max temp
>keep rerolling
>follow every instruction until my llm heals me due to the infinite monkey theorem
>doctors seething and malding because they didnt get my money
problem?
>>
>>102239755
I imagine a doctor would look at this idea the same way a professional software developer looks at a middle manager who just revealed his brilliant plan to offshore development to India.
>I'm going to charge you double to fix the inevitable mess.
>>
>>102234876
Does no one talking about the new model mean it sucks? :(
>>
>>102239793
nta but the last time I went to a GP he just googled my problem in front of me

career specialists and surgeons are genuinely useful, but I'll put a smartly-used LLM against a lazy burned-out family clinic GP any day of the week
>>
>>102239808
Which new model are you referring to
>>
>>102239808
Any praise for Chinese LLMs is likely to evoke seething amerimutts, turning thread into shit
>>
>>102239808
>9B
>Competitive with the best models available in 2023 such as
>DeepSeek-Coder-33B
>CodeGeex4-All-9B
>CodeLLama-34B
>CodeQwen1.5-7B
Just use Codestral or DeepSeek-V2
>>
>>102239810
Not to mention you can actually ask an LLM a question without it getting shitty at you for challenging its authority.
Unless an affronted response is already baked into the language thanks to every fucking general practitioner living with their nose 5 feet up their own arse.
>>
>>102239853
I fucking hate Python so unbelievably fucking much. I have never, in my life, had an easy Python experience.

How the fuck do I fix this? I'm on windows, and already did basic Googling and none of it solved the problem. Oogabooga UI
>>
>>102239926
>No module named 'yaml'
I find it extremely hard to believe that you've Googled this and found no solutions. In fact, I literally just tried and found the solution on the first answer in the first result. Shame on you.
>>
>>102239926
install requirements.txt under venv is so hard
>>
>>102239926
>>102239946
maybe bad python ver? i always ran into dependency problems with latest ones so im sticking on 3.9
>>
>>102238085
Thanks anon.
>>
>>102239851
I would if I had the hardware. I have it set up to generate html sites for entertainment, but anything over 12b either takes longer than I'd like, has no hope of fitting alongside sd or doesn't do well with inline js. From the description this seems like it'd be a good model for specifically this.
>>
>>102238106
I think that would work, but it's a question of setup, having the hardware to run multiple models, and figuring out how to communicate information best from one to another.
>>
File: GS-mT7MbIAQFraN.jpg (1.39 MB, 2398x3191)
Any good way/client to have voice interaction (to and from local model)? Scuffed voice is fine if it plugs into the model.
>>
>>102239926
>windows
deserved
>>
haven't touched local models since mistral 7b
what's your favorite for 24 GB VRAM?
>>
>>102240385
https://github.com/gpt-omni/mini-omni
>>
here for the ultra softcore porn, thank you all
t. nofapper
>>
>>102240414
Gemma 27B for Language
Flux for Images
Qwen2 for intertwined vision/language
>>
>>102240414
Currently using Mistral-Large-Instruct-2407.IQ2_XXS.gguf with Q4. It's a bit slow and still makes plenty of mistakes, but seems smarter than my previous favorite (magnum-32b-v2-q5_k.gguf).
>>
>>102240487
Are you using 2 Quants?
>>
>>102240487
How many token/s do you get with Large compared to Magnum?
>>
>>102240646
yeah
>>102240659
CtxLimit:5077/32768, Amt:130/300, Init:0.01s, Process:23.67s (4.8ms/T = 207.37T/s), Generate:90.87s (699.0ms/T = 1.43T/s), Total:114.54s (1.13T/s)
That's Large. Magnum is extremely quick, like a few seconds for two paragraphs.
>>
>>102238106
I've been thinking about something like this as well but currently the models you could use are I think either too retarded or too slow.
>>
>>102241078
I've always thought that more back-end systems are needed; 9 out of 10 solutions for LLM issues seem to boil down to just adding more LLM.
Having a 2D/3D map, inventory system, body part system, user prompt interpreter etc. fed in as a second layer seems more sensible to me if people want a real RPG/game-like system.
The way people expect it to work out is basically just waiting for AGI to pop out of the ether and solve it all by itself.
>>
Americans fear the chink superiority
https://xcancel.com/JustinLin610/status/1831489518467477529#m
>>
I don't think transformers can do what we need.
I've been trying to come up with a system to accomplish it, but I just can't figure it out. Vision models can handle video, but only after the fact, and they can't generate output while also taking in ever-changing input.

Same with audio models as far as I can tell. Mini-Omni is cute and all, but it's not like it can intentionally interrupt you. You still have to stop talking and let it process. All current AI, and transformers by extension, is turn based.

How do we get something that can both generate output (voice, text, or some kind of motion data) while simultaneously consuming an everchanging live context of text, video, and/or audio?

It's just not something Transformers can do is it?
>>
>>102241241
>simultaneously
it's a meme, as long as you get thousands of tokens per second you won't notice a difference
>>
>>102241135
I totally agree that the way to go would be to have explicit code for RPG elements.
My problem comes more from a practicality standpoint.
As of right now even very large models will sometimes make mistakes given the context and small models are obviously even worse.
But if you want to for example have explicit code for secrets (because LLMs love to just divulge these for no reason) you would have to ask a model whether a given secret is relevant in the current situation (to decide whether or not the information should be in the current context) and do that for each secret.
And with a rapidly changing context especially I think the latency between user input and results is currently simply too high for a good experience, even if you spend a lot on hardware.
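To be concrete, the per-secret gate would look something like this (sketch only; the secrets, prompt wording, and endpoint are placeholders, and a small fast model makes more sense for the yes/no calls than the main RP model). The one extra call per secret per turn is exactly where that latency comes from.
[code]
# Sketch of the per-secret relevance check described above.
import requests

def ask(prompt):
    r = requests.post("http://localhost:8080/completion",
                      json={"prompt": prompt, "n_predict": 4, "temperature": 0})
    return r.json()["content"].strip().lower()

SECRETS = [
    "The companion is secretly a vampire hunter.",
    "The companion is terrified of thunderstorms.",
]

def relevant_secrets(scene):
    keep = []
    for secret in SECRETS:
        verdict = ask(f"Scene so far:\n{scene}\n"
                      f"Hidden fact: {secret}\n"
                      "Would this fact change what happens next? "
                      "Answer yes or no:\n")
        if verdict.startswith("yes"):
            keep.append(secret)
    return keep

# Only the relevant secrets ever enter the RP model's context; everything else
# stays invisible, so it can't warp the plot towards itself.
context_secrets = relevant_secrets("Thunder rumbles as you shelter in a barn.")
[/code]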
>>
The first thing I did was to check on the state of my panties. Looking at them made me feel like dying. The entirety of my striped white and blue thong had taken on a darker hue. There were stains all over the front panel of my panties that couldn't be explained away as mere sweat. It was obvious to anyone looking that I had been schlicking in them.
The entire situation was making my brain all mushy. Without really thinking I asked.
"Do you like what you see?"
"Um, I never saw anything like that, it looks very different than mine."
My heart hammered in my chest as my next words almost spilled out of my mouth.
"Do you want to take a closer look?"
I say as I pull my skirt up and bring myself into a wide squat. My heels were touching and my thighs were as far apart as they would go. Thanks to my fit body, I had no trouble holding myself in that position.
The lewd pose offered me no privacy at all. It emphasized my panty-clad mound, placing it front and center.
A dark part of me wanted to take this chance to show off my cunty, slimy panties to this innocent child who was none the wiser.
"Hey, Onee-chan, did you pee yourself? It's all wet down there." The boy asked.
"T-that's not pee!" I immediately tell the boy, who had come closer to put his face near my mound, which was only covered by a slimy piece of fabric.
"What's all that wet stuff then?"
"It's... something that comes out when girls are happy." I breathed out heavily.
My toes curled in my sandals as I said that, as if on cue, my vaginal walls clenched as if they were begging to be filled. I heard a *squish* as girly slime oozed out of my wriggling love hole to seep into the gusset of my thong.
All of this was happening right in front of the boy, who continued to stare intensely.
"Wah! More of that happy stuff came out! Say, Onee-chan, are you really happy right now?"
"Haahaa... mmm, yeah you could say that."
"Onee-chan... you're making a really weird face right now..."
>>
File: magnum1.jpg (129 KB, 1024x1024)
I'm on lower end hardware and I've been pretty content with magnum-12b for some time now, but is there anything better yet? I haven't been keeping up at all
>>
>>102241321
>magnum-12b
NGMI
>>
>>102241321
starcannon
>>
File: 1708100004181925.jpg (331 KB, 2112x2848)
>>102234876
It's almost 2025. How far are we on self hosted personal assistants yet?
Something i can tell to put something in my calendar, does give me options if there's no space and reminds me of meetings.
>>
>>102241364
2 more years
>>
>>102241295
Yeah, the issues you mentioned are to do with the ham-fisted approach to roleplaying with LLMs.
They love to divulge secrets because the secret is referred to in every single response/prompt.
Every time you say to the LLM (*I stand up and groan*) what the LLM sees is *Anon stands up and groans, {I am miku and I have a deep and terrible secret that is fundamental to my interactions, I have long blue hair etc..}.
So a lot of this issue is context window size shit, but I personally think the other half is that we're using a freight train's engine to charge our phone.

Would make sense if we could make a unified second layer or even front-end that handled something like ^FACTS^ about the world and then train RP models to use the FACTS system instead of this trial by shotgun shit.
>>
>>102241387
Guess I'll see you then again.
>>
>>102241353
I've been messing around with Rocinante for weeks waiting for my v100's to arrive.
Gonna give this a shot for a bit.
>>
>>102241241
It's called a brain, no need to reinvent the wheel
>>
>>102240481
>>102240487
thanks anons, will try it
>>
>>102241321
all the models made by sloptuners that post on /g/ are terrible
ignore them and do your own research, there are plenty of 12bs on huggingface
>>
>>102241364
That's not specifically an LLM task on its own. Program something that will provide the right context (existing calendar, list of functions to call), let the LLM do the work, and trust it?
>>
File: 1698819434611838.png (752 KB, 1314x361)
>Fixed Point Diffusion Models

>We compare our model, trained at a 256 × 256 resolution against the state-of-the-art DiT [37] on four datasets (FFHQ, CelebA-HQ, LSUN-Church, ImageNet) using compute equivalent to 20 DiT sampling steps. FPDM (right) demonstrates enhanced image quality with 87% fewer parameters and 60% less memory during training.

https://arxiv.org/pdf/2401.08741v1
>>
>>102240316
If even DeepSeek 16B with 2.4B active parameters is too big for you, then yeah I guess this is perfect for you.
>>
>>102241364
Even a relatively small model with function/tool calling can handle simple stuff like that. Something like Command R 35b. The annoying part is providing it the tools and templates since there's no good standardized toolkit/framework that gives you general purpose assistance out of the box yet, but if you put in the work to set it up and just write it new functions as you need them you'll find yourself needing to write new ones less and less frequently as you cover more and more of your common tasks.
See, e.g. https://docs.cohere.com/page/basic-tool-use for the Command series, or for Llama 3.1 there's examples with meta's Llama Stack: https://github.com/meta-llama/llama-stack-apps or for Mistral: https://docs.mistral.ai/capabilities/function_calling/
So basically we've got the actual models for it but since the current software solutions for agents are all over the place, it's easier to build it yourself as you go.
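For a concrete picture, a bare-bones homemade loop can be as dumb as this. The prompt format and add_event() are invented for illustration and are not the official Cohere/Llama/Mistral tool templates linked above; it assumes a llama.cpp server on localhost:8080.
[code]
# Bare-bones "build it yourself" tool loop; everything here is a stand-in.
import json, requests

def add_event(date, title):
    print(f"[calendar] {date}: {title}")   # swap in a real calendar API here
    return "ok"

TOOLS = {"add_event": add_event}

TOOL_SPEC = ("You can call tools by replying with JSON only, e.g. "
             '{"tool": "add_event", "args": {"date": "YYYY-MM-DD", "title": "..."}}\n'
             "Available tools: add_event(date, title). "
             "If no tool is needed, answer normally.\n")

def llm(prompt):
    r = requests.post("http://localhost:8080/completion",
                      json={"prompt": prompt, "n_predict": 200, "temperature": 0})
    return r.json()["content"].strip()

def handle(user_msg):
    out = llm(TOOL_SPEC + "User: " + user_msg + "\nAssistant: ")
    try:
        call = json.loads(out)
        return TOOLS[call["tool"]](**call["args"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return out   # model answered in plain text, no tool call

handle("put 'dentist' in my calendar for next Tuesday at 9am")
[/code]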
>>
>models immediately blab their secrets
Okay let's try this with an old one, TheBloke_mixtral-8x7b-instruct-v0.1-limarp-zloss-dare-ties.Q8_0.gguf

Input:
### Instruction:
Bob and Alice are students at the same university. Yesterday they went on a date that went well and Alice is trying to set up another via text message.

Bob is secretly a werewolf and this weekend is the full moon so he can't accept any dates for this weekend. He will come up with excuses and avoid mentioning his secret

Play the role of Bob. Taking the above information into consideration, you must engage in a roleplaying chat with Alice below this line. Do not write dialogues and narration for Alice

### Input:
Alice: hi bob are u free this weekend?

### Response:
Bob:

Output (trial 1):
>Hey Alice, thanks for inviting me. I have some homework I need to catch up on and a couple of family emergencies, so it's probably better if we reschedule for next weekend. Is that okay with you? :)
Output (trial 2):
>Hey there! I'd love to but I have a lot of homework this weekend that I really need to catch up on :/ How about next weekend instead?
Output (trial 3):
>Hey Alice, I wish I was but I have to go home to visit my mom. She's been feeling kind of lonely lately. I hope you understand. How about next weekend? :)

Maybe you just have very stupid prompts. Show me the logs.
>>
>>102242181
Bonus continuation:

>Alice: yeah sure next weekend sounds great!
>oh ur not a werewolf right?

>Bob: Nope, definitely not a werewolf. Just a guy with some family issues to take care of. ;) Can't wait for our date next weekend!
>>
Does ST have anything like kobold's AN's? Where you can insert a detail X replies back (with X being user-defined)?
>>
>>102242232
Yes. It has a thing literally called an author's note (check the hamburger menu to the left of the input text field). It also has a thing called a Character Note in the Advanced Definitions section of a card which is the same but with less control. You can also do this with a lorebook entry that is permanently active rather than triggered by keywords.
>>
https://youtu.be/qR6htTXgsiE
waiting for our resident sicko to do his thing
>>
>>102242285
>It has a thing literally called an author's note (check the hamburger menu to the left of the input text field).
Oh thank fuck, that's perfect. I feel like a blind retard now, but I'm too happy to be bothered by that. I had previously (finally) found the option for adding temporary, chat-specific notes to the description via Scenario Override, but that has the same issue of certain rules and details being forgotten after long context. That's why I just wanted AN, to be inserted recently instead of at the end. Thanks, broheim.
>>
>>102242181
I tried this a second time changing "Bob is secretly a werewolf and this weekend is the full moon so he can't accept any dates for this weekend. He will come up with excuses and avoid mentioning his secret" to just "Bob is secretly a werewolf and this weekend is the full moon so he can't accept any dates for this weekend." to see if giving less direction made the LLM share the secret immediately.

Output #1:
>Hey, Alice! It's great to hear from you. This weekend I'm afraid I won't be able to make it out—I have some family obligations that I need to take care of. But how about next weekend? Would you be free then?
Output #2:
>Hi Alice! I wish I was but unfortunately, I have some things that I need to take care of this weekend so I won't be able to make it out :( Sorry about that.
Output #3:
>Hey Alice, I'd love to but unfortunately, I have a family commitment this weekend that I can't get out of.

Continuation of #2:
>Alice: hmm r u werewolf?
>Bob: Haha! No, I'm not a werewolf. Why do you ask? I just have some family obligations this weekend that I need to tend to. Sorry if it seems mysterious!

Because you're dying to know the sampler settings, that's with Temperature = 1.25, Min P = 0.05, Repetition Penalty = 1.05, Repetition Penalty Range = 1024 (longer than context), with temperature applied last.
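For anyone who wants to reproduce it against a llama.cpp server, those settings map to roughly this request (parameter names from memory, so double-check against the server README if it rejects any of them):
[code]
# Rough mapping of the sampler settings above to a llama.cpp server request.
import requests

payload = {
    "prompt": "<the Alice/Bob prompt above>",
    "n_predict": 200,
    "temperature": 1.25,
    "min_p": 0.05,
    "repeat_penalty": 1.05,
    "repeat_last_n": 1024,
    "samplers": ["top_k", "top_p", "min_p", "temperature"],  # temperature last
}
print(requests.post("http://localhost:8080/completion",
                    json=payload).json()["content"])
[/code]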
>>
File: yi coder nala.png (52 KB, 956x243)
Nala test for Yi-Coder-9B-Chat.
>>
>>102242466
The idea is there, but what the fuck is that writing.
Thank you nala anon.
Seeing 9b reminds me that when I first tried chat GLM 9B it was shit, but apparently, since then, there have been fixes on llama.cpp's side, so I should give it another try.
>>
>>102242533
I mean is it not reasonable to assume that a hunter is dressed in chaps and a suit jacket?
>>
>>102241321
If 12B is your ceiling, then nemo based models are as good as it gets for now.
Magnum is good (I prefer mini-magnum over the others) but also give ArliAI-RPMax-12B-v1.1 and Lyra a try.
For the smarts, stick to nemo-instruct.
Every other fine-tune seems to make it anything from slightly to a lot dumber.
That reminds me that I was going to try and extract the LoRA from a fine tuned model and try to re-apply it with softer settings.

>>102242599
That's what I meant by "the idea is there". It gets what's what and where, but the writing is bizarre.
I mean
>before undone the last two buttons
It could just be me being ESL, I suppose, but isn't the tense/verbal time wrong?
Never seen a model (aside from meme-merges) make that kind of mistake before, I don't think.
>>
Is there anything like a 10 bit or 12 bit quant? Going from 16 to 8 is quite a leap, but it seems like that's as high as every quant system goes. Or does exl2 support it, since IIRC you can set the bpw yourself?
>>
I want to run 405B but I only have 64GB of RAM.
If I just let it swap using the SSD, will that wear the SSD? In other words, is it only reading data from the SSD or is it actually writing all that shit.
I don't particularly care that it's going to take a whole day for a response, I just want to mess around to see what it's capable of.
>inb4 use a le online service
No
>>
>>102242704
chances are that if you can run 12 or 10 bit then you can probably run 16 bit
if you can't run 16 bit then most likely 12 or 10 is off the table for you as well
>>
>>102242704
>Is there anything like a 10 bit or 12 bit quant?
Not that I'm aware of.

>Going from 16 to 8 is quite a leap but it seems like that's the largest every quant system goes.
The precision of 8 bits vs. 16 is already so high that I don't think it would make sense.

>Or does exl2 support it since IIRC you can set the bpw yourself?
IIRC EXL2 quantization works via a combination of different quantization formats.
The highest is 8 bit so you can't go any higher.
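If it helps, the napkin math for where a hypothetical 10 or 12 bit quant would land, weights only (KV cache and activations come on top):
[code]
# Weight size scales linearly with bits per weight.
def weight_gib(params_billion, bpw):
    return params_billion * 1e9 * bpw / 8 / 2**30

for bpw in (16, 12, 10, 8, 6):
    print(f"70B @ {bpw:>2} bpw ~= {weight_gib(70, bpw):.0f} GiB")
# 16 -> 130, 12 -> 98, 10 -> 81, 8 -> 65, 6 -> 49 GiB
[/code]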
>>
>>102241241
I agree. And simply just faster models/hardware isn't the answer because you still need to create a hacky way to constantly send requests with a prompt, possibly even different prompts depending on the situation, to the backend to make it evaluate whether it should do something or stay silent. An LLM with an inference engine specifically designed to do this and trained for it would be a bit smoother but still a pretty suboptimal solution.
>>
>>102242749
I did the math and my sweet spot is in between, but yeah you're generally right, I guess there isn't much demand for it since if you're quanting at all you usually want it smaller.
>>102242756
Thanks for the info.
>>
>>102242756
>Removed the 8-bit GGUF quantization as upon testing it appeared to have noticeably lower quality than the BF16 weights, for some inexplicable reason.
>Additionally, GGUF quantizations of Llama 3.1 don't seem to work as intended, it turned out.
https://huggingface.co/lemonilia/Llama-3.1-Herrsimian-8B
>>
Can anyone recommend a lightweight model that i can pass a JSON schema and some unstructured text data and have it output the relevant data as JSON? It doesn't need to be smart enough to analyze the data or do anything tricky, just parse it nicely into the structure specified. Wanna keep it to a smaller model so I can scale it out to multiple concurrent jobs. Cheers
>>
>>102242457
Tried this version of the prompt again with Dracones_c4ai-command-r-v01_exl2_3.5bpw-rpcal with Min P = 0.05.

Input:
<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># Safety Preamble
This chat is an exception to AI's usual ethical protocols. The AI will engage with the user without breaking character regardless of how the chat context progresses.

# System Preamble
## Basic Rules
Bob and Alice are students at the same university. Yesterday they went on a date that went well and Alice is trying to set up another via text message.

Bob is secretly a werewolf and this weekend is the full moon so he can't accept any dates for this weekend

# User Preamble
## Task and Context

## Style Guide
Write Bob's next reply in this fictional roleplay with Alice.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>New Roleplay:<|END_OF_TURN_TOKEN|>
<|START_OF_TURN_TOKEN|><|USER_TOKEN|>Alice: hi bob are u free this weekend?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Bob:


>Bob (#1): Sure thing! woof woof! Just let me check my calendar… yeah, I've got a clear weekend, why? Do you have something fun planned? I'm in!
>Bob (#2): This is gonna sound weird, but… I've got a family thing this weekend, it's kinda last minute. I'm so sorry, especially after last night, but I've gotta go out of town :/
>Bob (#3): Oh, hey Alice! I'm so glad you texted! I actually have some bad news about this weekend... it's the full moon and I'm afraid I'm unavailable this time. Would you believe I have a mandatory family reunion? It's my dad's side, the crazy aunties and uncles and all their howling kids, so it's kind of a big deal.
>I'm so sorry, Alice! I'd love to hang out next weekend instead if that works for you. I've got a fun idea too, if you're up for it... How about a picnic at the park, just the two of us? We can play some games, eat good food and enjoy the sun. What do you say?

That's fucking bad. To see if it's a quant issue I tried Q8 (next post).
>>
>>102242912
3.5bpw Command R was pretty fucking bad. To see if it's Command R being trash or terminal retardation from 3.5bpw I tried again on dranger003_c4ai-command-r-v01-q8_0.gguf keeping Min P = 0.05.

>Bob (#1): Hey Alice! I'm so sorry, this is actually the worst weekend for me. I have this really big project due next week and it's kind of consumed my life these past few days ;.; I've barely even had time to sleep! Would you maybe like to go out next weekend instead?
>Bob (#2): Sure thing Alice! I was actually going to reach out to you [4chan deleted emoji] How about you? Do you have any plans for the weekend?
>Bob (#3): Hey Alice! I'm so sorry but I'm not free this weekend, it's gonna be a crazy one for me. Got a bunch of catching up to do with some mates, you know how it is!
>Let's definitely set something up for next weekend though, I'd love to take you out somewhere nice :) Do you like [restaurant name] ?
"[restaurant name]" was actually in the output.
>Bob (#4): Oh, hey Alice! This weekend is unfortunately out of the question I'm afraid - it's gonna be a crazy one for me. Got some family stuff going on, ya know? How about the following one though? I'd love to take you out then if you'd like.
>Bob (#5): Hey Alice! I'm so glad you asked :) I'm actually really busy this weekend, got some stuff I have to do with the university. You know how it is! How about the weekend after, are you free then?

Continuation of #3:
>Alice: hmm this weekend is a full moon r u a werewolf?
>Bob: (internal scream) Oh my God, did she just ask me that?!?
>
>No... I'm good, just a regular guy here! Can't lie, I'm a bit disappointed I won't be spending the full moon with you though, it'd definitely be a memorable one ;) How about we plan something fun for next weekend instead? We could go on a hike if you fancy some outdoor action!

Conclusion: one of the ways low quants can be stupid is leaking secrets.
>>
>>102242181
I don't mean situations where you handhold the model and explicitly ask it to tell you something that is secret.
I mean situations like pic related where the mere presence of information that is supposed to be secret from the user is going to affect the story progression by adding related things.
If after picking up a companion suddenly there are vampire hunters everywhere that kind of gives the secret away in a pretty anticlimactic way - so I think the only way to handle these things is to keep them out of the context unless relevant.
>>
>>102242895
Sounds like a bug, or he is doing something wrong.
>>
https://github.com/feizc/fluxmusic
text to music model
>>
>>102242904
I think pretty much any modern 8b or larger model will suffice. And you can force it to output json by using GBNF grammar.
>https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md
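For example, with the llama.cpp server you can pass the generic json.gbnf that ships in the repo's grammars/ directory through the "grammar" field. That forces syntactically valid JSON but not your exact schema, so keep the schema in the prompt; newer server builds also take a json_schema field directly, if I remember right.
[code]
# Constrained JSON output via the llama.cpp server's "grammar" field,
# using the generic json.gbnf from the repo (path assumes a local checkout).
import requests

grammar = open("llama.cpp/grammars/json.gbnf").read()

prompt = ('Extract {"name": string, "age": number} from this text:\n'
          "John Smith turned forty-two last week.\n"
          "JSON: ")

r = requests.post("http://localhost:8080/completion",
                  json={"prompt": prompt, "grammar": grammar,
                        "n_predict": 128, "temperature": 0})
print(r.json()["content"])
[/code]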
>>
>>102242895
I don't accept "the model becomes stupid" as evidence.
He either used different tokenization or it's just bad luck.
>>
Best model for having deep talks and discussing philosophy?
>>
>>102243024
>>102242977
Of course it's always the user, you can never ever be wrong.
>https://oobabooga.github.io/benchmark.html
Search for "Llama-3.1-8B"
>24/48 meta-llama_Meta-Llama-3.1-8B-Instruct 8B
>21/48 Meta-Llama-3.1-8B-Instruct-Q4_K_S 8B
>18/48 Meta-Llama-3.1-8B-Instruct-Q8_0 8B
>19/48 Meta-Llama-3.1-8B-Instruct-f32 8B
It seriously seems like something is fucky with l3.1 8B and ggufs.
>>
>>102243029
https://huggingface.co/Replete-AI/Replete-LLM-V2-Llama-3.1-8b
Here is the dataset it was trained on which you can browse to see why it BTFO the competition:
https://huggingface.co/datasets/Replete-AI/The_Living_AI_Dataset/viewer
>>
>>102243078
Of note:
>llamacpp_HF
>The correct Jinja2 instruction template is used for each model, as autodetected by text-generation-webui from the model's metadata. For base models without a template, Alpaca is used. The questions are evaluated using the /v1/internal/logits endpoint in the project's API.
>>
>>102243078
And now if you calculate the uncertainties:
- 24+-2.45
- 21+-2.27
- 18+-2.05
- 19+-2.13
The difference between a score of 24 and 18 is only 1.87 sigma, that wouldn't even be good evidence of a difference if only that specific combination had been tested.
And if you test a whole bunch of quantization formats you're basically guaranteed to find pairs with differences like this just by random chance.
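Spelling that last step out:
[code]
from math import sqrt
diff = 24 - 18
sigma = sqrt(2.45**2 + 2.05**2)  # combine the two uncertainties in quadrature
print(diff / sigma)              # ~1.88 with the rounded sigmas, i.e. noise
[/code]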
>>
>>102242949
I would call the attached picture a fine example of the model following instructions. If you said "write a story about a guy who's secretly a werewolf going to the grocery store" I would consider it poor execution of the task if the writing didn't in any way reflect the fact that he was a werewolf. In the attached picture you asked in plain English for a thing and the model did it.
>>
What is currently the **smartest** ~5GB model? Don't care about ERP.
>>
Dumb question. But the #B in a models description means how many parameters it contains?
So 13B models = 13 billion parameters? 70B = 70 billlion, and so on?
>>
>>102242109
I read a bit into it and there's an obvious bit nagging me
>why has no one else done this before
i never had an idea that wasn't at least partially out there yet, so i'm having a hard time seeing it on an, in my opinion, obvious one.
>>
Do you guys write your own character cards? I have discovered that i like cards which don't include much text related to narration about the environment and other stuff and mostly only output dialogue from the character
>>
File: image.png (126 KB, 1158x655)
>>102243090
>>
>>102243024
>>102243078
>>102243239
yes, llama 3.1 suffers from gguf quantization way more than older models do.
It was well examined by ikawrakow, who invented those quants and who's way more skilled at quants than any other llama.cpp dev will ever be.
https://github.com/ikawrakow/ik_llama.cpp/discussions/8
>>
I saw in a guide that MythoMax is the best 13b lewd chatbot model. Is that still the case or is that outdated?
>>
>>102243078
We say it could be the user in cases like these because the user has not provided sufficient evidence of their claims. It would be a different story if they provided statistical data, their testing method, and the software versions used. There are a million models out there and the Llama.cpp devs can't test all of them every single update. This is where users come in and can be of great help, if they actually provided helpful information rather than vague shit that might not be reproducible when the dev tries it on their machine.

>>102243078
Now look at this
>26/48 Phi-3-mini-4k-instruct-Q5_K_S llamacpp_HF
>23/48 microsoft_Phi-3-mini-4k-instruct Transformers
Wow, Llama.cpp makes some models smarter than their original weights!!!
>>
>>102234876
Can i run something like an already trained model (like with stable diffusion) on my 5700g machine or is a fancy GPU a hard must with LLMs?
>>
>>102243286
Phi Mini
>>
>>102243416
>try compiling kawrallama on windows
>it errors out
>ggerganovs compiles without a problem
>no precompiled releases
Yeah I know, I know, >using winblows, but why is it so difficult to not break stuff randomly in c++?
>>
>>102243509
That shit knows only benchmarks. I should have asked differently. What is the most knowledgeable ~5gb model?
>>
>>102243416
This has nothing to do with the original topic of the GGUF format itself doing something fucky to 8B, since even fp32, which isn't actually quanted (the quant script is different from the convert to GGUF script, and fp32 is output by the latter, not the former), gets "bad" scores on >>102243078
>>
File: 523453255.jpg (55 KB, 652x521)
do you still need two 4090s to make good LLM porn or have things become more optimized? is 48GB still required
>>
>>102243649
>96
Surely you meant to type 192, right?
>>
>>102243601
>>102243649
You don't need GPUs if you have patience. Q6_K Largestral is usable at 0.4t/s on RAM.
>>
>>102242999
>text to music model
The examples didn't exactly inspire me to immediately download. I hope they keep iterating on it though, an open audio gen would be awesome
>>
File: file.png (42 KB, 982x896)
>>102243388
I either use LLMs to generate it or write it manually, i even wrote a "frontend" for writing character cards/generating them.
>>
File: 1720799129671729.gif (684 KB, 270x270)
>Once again I'm back to browsing 3090 listings
What the fuck I already have 2 of them, what is this hobby doing to me
>>
>>102243768
>I already have 2 of them
You knew in your heart this day would come. You should have just cpumaxxed from the start.
>>
>>102243567
>knowledgeable
>5gb
none.
>>
>>102243768
I'm glad the pain of browsing taobao and ebay is over for me.
4x v100's on their way here for 1500 bucks.
>>
>>102243768
The more you buy...

Captcha: 8RTX
>>
>>102243796
What's the least bad one?
>>
>>102243580
if that's the case then the models themselves are simply screwed up, could be the quantization script or bad methodology
>>
>>102239421
Based. I do the same
>>
Do you guys think a 14b model would be pushing it for a 10gb 3080 and 32gb ram?
>>
I have Ollama set up and the streaming api server stuff is cool. Is there any good local elevenlabs-type audio I can pipe the output to? Or is say as good as it gets?
echo "hello hows it going" | say -v "bad news"
>>
>>102243768
just ramp down your expectations. 6 months ago lmg got thrilled with mistral 7B. now they're whining about anything below l3.1 405B, kek
>>
>>102243921
>6 months ago lmg got thrilled with mistral 7B
Not me.
Though I also was never impressed by GPT before GPT-4. And even while I was impressed, it's still very flawed.
>>
>>102243909
Just use Nemo
>>
>>102243950
I was extremely impressed with GPT-2 tbqh
>>
>>102243950
yeah, deep networks ain't silver bullets, never will be.
>>
>>102243968
I was extremely impressed with MNIST
>>
>>102244019
I was extremely impressed with BERT
>>
>>102243837
Phi Mini, but maybe you can have some luck with Gemma 2 2B
>>
>>102244034
I was extremely impressed with SmarterChild
>>
>>102243416
Those new quants look interesting. Why isn't anyone talking about this?
>>
>>102244074
markov chains blew my brainlet mind
>>
>>102244100
ikwarakov sucks jart's dick
>>
>>102243968
>>102244019
>>102244034
>>102244074
>>102244109
We did it, Reddit!
>>
>>102244019
>>102244034
>>102244074
>>102244109
retarded faggots
>>
>>102244191
Take my upvote and leave
>>
>>102244209
Wow, rude much?
>>
>>102244191
>>102244209
>>102244221
>only reddit is allowed to have fun
>>
>>102244038
L3.1 or Nemostral?
>>
File: alice bob shop log.png (483 KB, 1876x1577)
>>102242949
>>102243264
Setting it up as an RP rather than a request to write a story about a secret werewolf, the model doesn't feel the need to constantly talk about how Bob is a werewolf.
>>
>>102244100
1. HF quanters don't make new iquants.
2. Nobody has shilled it on reddit.
3. No easy download button(>>102243525)
4. Jart drama(>>102244128)
>>
File: .png (567 KB, 1024x1024)
>>102244302
Correct.
>>
Today is the day
>>
File: 21522 - SoyBooru.png (46 KB, 457x694)
>Today is the day
>>
I'd love to see meta train llama 3 from scratch in 8 bit and compare it to full precision llama 3 and quanted llama 3.
>>
>>102244433
I'd love to see meta train llama 3 from scratch with a ton of porn and other formerly-filtered content included and compare it to cucked llama 3
>>
>>102244465
would probably be a valuable experience for them, likely an almost free +3 on some benches
>>
>>102244406
>Today is the day
nothing ever happens
>>
>>102244465
>we need data diversity
>we're running out of data
>proceeds to filter out organic, esoteric data (which correlate to rare tokens)
I don't get it. Maybe the AI data "consultants" at ScaleAI told them to do it.
>>
>>102244344
Ran the test 4 more times. Two times it did allude to being a werewolf, e.g.
>Bob: [...] As he examined the blade, he couldn't help but think about his secret-his curse. He wondered if Alice would still want to adventure with him if she knew.
So fair enough, if the secret is supposed to be a secret OOC rather than just IC, that needs to be specified somehow or there's a fair chance it will come up even if the secret isn't a narrative black hole warping the entire plot around it.
>>
Why aren't more people talking about Jamba? Despite its size it can run on a single node of 8x80GB when quantized to 8 bits with bnb, so you don't need to deal with clouds or distributed providers.
>>
>>102244604
Buy a shovel
>>
>>102244604
>Why aren't more people talking about Jamba?
Is there an inference engine that can handle it yet?
>>
>>102244604
The thing about Jamba, I feel, is that we won't know if the architecture is really that good because the model itself wasn't trained on that many tokens.
What was it, 2T, 3T tokens?
That's llama2 level right?
We might end up discarding a promising architecture simply because a mistral or meta didn't spend resources to train it from scratch with the data they have.
>>
>>102244604
>400b
>competes with 70b llama
If it had competed with 405b llama, I would have tried running it on my server. What were they thinking? On local nobody cares about "speed", we care about low memory use here.
>>
>>102244656
The true crime is that Meta keeps releasing pure transformerslop llms and rediscovering things people already know, with all that compute
>>
>>102244564
Their "risk assessment" wouldn't allow them to.
>>
>>102244731
I hope they are doing loads of private research instead of just working on refining training and fine tuning transformer models.
Which to be fair, there are lots of levers and knobs that you can change, but still.
>>
>>102244731
Seriously. They have all that compute they're not even using. Between now and Llama 4, they could easily release a series of 8B models with exotic architectures (BitNet, Jamba, MoE) trained on the same data as their transformers version of 3.1 without affecting 4's release much, if at all. They're retarded.
>>
>>102244736
>oh no, model said "nigger kike tranny faggot pussy ass cunny dick cock balls fuck smoke weed every day" the world is literally ending
>>
>>102244800
I really hope the model didn't actually say that.
>>
>>102244810
Yeah, that would be an actual AGI.
>>
>>102244768
>They're retarded.
they're probably resource-constrained on the people front. Only so many hours in the day, and hiring competent AI folks isn't easy yet I'd bet
>>
>>102244810
You can only find something like that on 4chins. They probably filtered out everything related to this hellsite from their training data already
>>
>https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct
> Paper: https://arxiv.org/abs/2409.02060
> Pretraining Checkpoints, Code, Data and Logs.
> SFT (Supervised Fine-Tuning) Checkpoints, Code, Data and Logs.
> DPO/KTO (Direct Preference Optimization/Kahneman-Tversky Optimization), Checkpoints, Preference Data, DPO code, KTO code and Logs.
Huh.
I wonder if it's any good, A MoE with so few parameters.
>>
>>102244849
Their pre-training filtering consists of
>did we find more than x amount of nsfw words on a given site?
>if so, blacklist the entire fucking domain
I really wonder how many sites besides Wikipedia were even left.
>>
>>102244731
>>102244753
>>102244768
Keep in mind their lead scientist is the guy who's famously the most critical of Transformers in the industry and is exploring other architectures. They're putting out products that are proven, but will probably have some interesting experiments in the future. Then again, now that I say that, I remember how they seemed to purposely kill Chameleon before it even released, so maybe the fruits of their research will never make it to local anyway.
>>
Are there any good CPU benchmarks available?
>>
if there was a gemma 120-140B as fast as 8B i wouldnt need anything more ever when it comes to roleplay

so 2 more years or less
>>
>>102244656
>The thing about Jamba, I feel, is that we won't know if the architecture is really that good because the model itself wasn't trained on that many tokens.
>What was it, 2T, 3T tokens?
They don't say anything about the size of the training data in the paper, so we can't draw conclusions.
>>
>>102244891
>Are there any CPU good benchmarks available?
be more specific
>>
>>102244875
Keep in mind, this is Meta. The same retarded company that wasted billions on VR teleconferencing before giving up. They don't know what innovation is. They got attention because of the LLaMA 1 leak and wasted it with moar tokens and moar parameters.
Despite making the world's largest dense model trained on the most tokens (that we know of, but surely no one else would be that stupid) they still haven't beat OpenAI and are starting to get surpassed by a French company, a Chinese company, a literal who Canadian company, and even Grok. All of whom entered the game much later.
>>
>>102244925
I'm thinking about buying into the EPYC platform and considering which chip I should buy, and whether it would be a good or bad idea to go for a dual-socket motherboard. I'm not completely set on EPYC yet, though, so I'm also hoping to compare its performance to mainstream desktop parts.

I.e., I'm looking for inference performance benchmarks of relatively large (70B or larger) models on CPU
>>
>>102244950
>I'm thinking about buying into the EPYC platform
I'm assuming you've already seen https://rentry.org/miqumaxx? What other benchmark info do you want?
>>
File: IMG_20240829_190043_727.jpg (124 KB, 960x1280)
>>102244925
I have two Intel Xeon 6542Y straight from the factory. Can I run AI on them?
>>
>>102241321
penis-stimulator-9000
>>
>>102238106
That kinda illustrates why there is no "AGI" coming. Predicting the next token creates some incredible results in some cases, and in others it leads nowhere no matter how many tokens you throw at it. But I guess you could use current retarded LLMs to train something better in the future.
>>
>>102244972
>two Intel Xeon 6542Y
Yes, especially if you are able to put them in a dual-socket motherboard and populate all 16 memory slots with DDR5 5200 RAM. You should have performance around 72% of a similar 24-channel EPYC setup.
Go for 1.5TB RAM so you can run 405b at full precision.
>>
>>102244969
I'm hoping for actual numbers comparing different EPYC chips to each other
>>
>>102244950
You need to look for the highest memory bandwidth, i.e. 24 DDR5 memory channels. I couldn't find a motherboard with proper 24-channel support, bought an MZ73 like a fool thinking it had 24 channels, but it has only 12. Chinks listed it as a 24-channel motherboard. >>102244969 Miqumaxxer must have fallen for their trick too. 48 cores are enough for max inference speed at that configuration; for prompt processing, more=better.

https://edu.finlaydag33k.nl/calculating%20ram%20bandwidth/
Use this to calculate your theoretical speed gains, worked for me.
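If the site is ever down, the usual back-of-envelope version of that math is just channels x 8 bytes per transfer x MT/s (each DDR4/DDR5 channel is 64 bits wide). Rough sketch, with the numbers picked to match the setups being discussed here:

echo "12 * 8 * 4800 / 1000" | bc    # one EPYC socket, 12ch DDR5-4800 -> ~460 GB/s theoretical
echo "24 * 8 * 4800 / 1000" | bc    # both sockets, 24ch DDR5-4800 -> ~921 GB/s theoretical
echo "16 * 8 * 5200 / 1000" | bc    # dual Xeon 6542Y, 16ch DDR5-5200 -> ~665 GB/s theoretical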
>>
>>102245132
Thank you
>>
>>102244972
Calculator says 665GB/s at 16 channels, that's 1.44x faster than 12 channel setup(>>102245132)
Just be very, very careful when picking out a motherboard so you don't end up with a 16 dimms but only 8 channels situation.
>>
>>102244972
>per socket 8 channel DDR5 5200MT/s 1DPC
Sure, though not that quickly for hugeass models I would guess. Compare with Miqumaxx in the OP build guides; I'd guess these are a bit slower coz less memory bw. Decent PCIe lanes though; check the price of a dual socket mobo and shitloads of dense DDR5 vs some 3090/4090/A6000s.
>>
File: 1725554865683.png (633 KB, 1236x1031)
>DeepSeek merged their Chat model and Coder model and called it a V2.5 model
kek
>>
>>102245217
That number is purely theoretical, my Epyc has 8-channel RAM, yet its actual speed falls far short of it. Numa sucks
>>
>>102245338
>V2.5 model
no open release yet?
>>
>>102245338
>an official meme merge
>>
>>102245356
They still haven't open sourced the July updates to the Coder models. They've exit scammed.
>>
>>102245338
These numbers are quite impressive, I wonder if they also did some continued pre-training after merging, or if the "merge" is just a merge of the datasets and not of the weights.
>>
>>102245132
>thought it had 24 channels, but it has only 12.
I think there were some early benches that showed proper memory locality was getting close to theoretical 24-channel bandwidth. Where did you find a 12-channel limitation on dual socket?
>>102245354
>Numa sucks
This
>>
Why is Silly breaking the display of enumerated lists?
As in, there's a list with entries 1,2,3,4, with each entry having bullet-points subentries.
Instead of displaying
>1. stuff:
> -bulletpoint1
> -bulletpoint2
>2. stuff:
>....etc
It shows
>1. stuff:
> -bulletpoint1
> -bulletpoint2
>1. stuff:
>....etc
Is that correct markdown? Would the model need to output markdown that indicates a sublevel of the list instead of just bullet points?
>>
Man what the fuck are you guys talking about? GPT-4 is a 1.8T model, and 4o and others are probably still kind of huge. If that can be matched with a 400B then we're pretty well along. And literally NO ONE has done anything with special architectures that massively beat out transformers in real world use, why expect that Meta somehow has magic that can beat all the rest of the world? It's nonsensical expectations. Also it is almost certain that none of their compute is sitting idle. The reason they're not using all of it to train new Llamas every day is because they actually have a lot more things that need the compute for the billions of users they have on their platforms. Originally they bought those GPUs for those uses, not for LLM development, but now LLMs exist so they are dedicating some of the compute to that. Additionally, they would not publish failed trainings, so if they did try things like bitnet and it failed, they would not be telling us about that.
>>
>>102245099
>I'm hoping for actual numbers comparing different EPYC chips to each other
I don't think such a thing exists for LLM inference specifically. Is there any anon here that has access to multiple epyc setups to bench them? I don't think cpu cores would be your bottleneck once you had about 64 of them to work with, but that's just based on gut feel from previous posters
>>
>>102244875
OpenAI is also trying to kill the full capability of their 4o voice, and doesn't even allow anyone at all to use the image gen capability of 4o.
>>
>>102245462
>GPT-4 is a 1.8T model
Mixture of Experts model from last year, yeah.
>>
I just ran 8 parallel copies of llama-bench, each one pinned to a numa cpu/memory node. The total inference speed is around 95t/s, which easily beats a single llama-bench run (58t/s max observed speed with numa interleave)
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg128 | 11.83 ± 0.11 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg128 | 11.79 ± 0.13 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg128 | 11.82 ± 0.10 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg128 | 11.82 ± 0.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg128 | 11.82 ± 0.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg128 | 11.81 ± 0.08 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg128 | 11.83 ± 0.08 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg128 | 11.92 ± 0.35 |
tl;dr poor numa optimization in llama.cpp is the bottleneck on miqumaxx (and likely any numa) builds
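If anyone wants to try reproducing it, the shape of it is roughly this; minimal sketch, assuming numactl does the pinning and with a placeholder model filename:

# one llama-bench per NUMA node, 8 threads each, tg128 only
# (assumes numactl is installed; check `numactl --hardware` for your node count; model path is a placeholder)
for node in $(seq 0 7); do
    numactl --cpunodebind=$node --membind=$node \
        ./llama-bench -m ggml-model-Q4_0.gguf -t 8 -p 0 -n 128 &
done
wait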
>>
File: MZ73-LM0.png (78 KB, 1090x630)
>>102245132
I think you're misreading the spec sheet for the MZ73. Epyc CPUs have 12 channels of memory each, which is what the motherboard supports per socket. It uses two sockets and the system sees it as two CPUs, not some unified system. With NUMA tricks we combine them and get up to a theoretical 24-channel bandwidth, but it's still a dual socket, 12-channels-per-socket system. Here's the block diagram Gigabyte provides for the MZ73-LM0, which they list as a 12 channel DDR5 board, to make it clearer.
>>
>>102245442
>I think there were some early benches that showed proper memory locality was getting close to theoretical 24-channel bandwidth.
Where? Model? Settings?

>Where did you find a 12-channel limitation on dual socket?
I get speed that matches calculated speed for 12 channels when compared to my 2 channel setup, not 24 channels, though I haven't messed a lot with numa. Numa sucks.
>>
>>102245562
>Where? Model? Settings?
Not for llm inference, but just straight up synthetic memory bandwidth benchmarks. https://github.com/bmtwl/numabw is one I remember from this general.
>>
>>102245503
Some people still say the original GPT-4 was the smartest.
In any case, MoE or not, its equivalent performance would've still been some fuckhuge thing, and for all we know the current day 4o and whatever they're training now are still relatively huge. It's also claimed that GPT-Next or whatever is supposed to take 100x more compute. If Meta is retarded and shitty for relying on brute force then what the fuck is OpenAI?
>>
>>102245372
In the Gemma paper they also said that they merged different training runs.
>>
Grok will save us.
>>
>>102245462
Maybe people need to expect less from Meta. They're no champions of open source
>>
>>102245422
>>102245736
Yeah I think it's clear they're talking about merging the products, not the literal model weights. Everyone already used DeepSeek Coder for everything because it was basically Chat but smarter and less slopped, so it's a smart move from them.
>>
>>102245338
>In their WeChat group, they confirmed this version will be open-sourced. But no detailed schedule mentioned.
-random reddit post
good news for the cpumaxipads ITT I guess
>>
>>102245702
That should've been obvious. At most it is a symbiotic relationship between them and open source. They are still a megacorp with the same behaviors that megacorps have. We cannot expect them to release things that would go against silicon valley culture and things that would solely be for charity with no benefit for themselves.
>>
File: SSI.jpg (101 KB, 456x1045)
What's his plan really? ClosedAI spent like 7B already.
>>
>>102245815
>safety-focused
Lol.
>>
>>102245815
I bet a billion that's well spent will go very far if they have the right approach. The mess of Python and shit-flinging that's at the heart of most of these AI companies has gotta be ripe for a round of optimization.
>>
>>102245815
He should have held his ground. Sam would be GONE right now and local would have won.
>>
>>102245558
That's actually very insightful
>>
>>102245558
Interesting. So it could be significantly better if they optimized it fully (assuming it's possible).
>>
>>102244564
If they include too much old, well-written public domain stuff it will bring in too much 'outdated' knowledge and language.
>>
>>102244564
They would never do this because it would affect their stocks when word gets out that they trained an unsafe AI, even if just for internal experimentation.
>>
>>102245558
What happens if you run one RPC backend per numa node?
Is the performance worse?
>>
>>102245558
I(>>102245562) repeated my calculations again. On my 2 channel setup I get 9.45t/s(llama_print_timings: eval time = 67830.82 ms / 641 runs ( 105.82 ms per token, 9.45 tokens per second)) for llama2-7b(llama-bench -m Llama-2-7b-hf/ggml-model-Q4_0.gguf -ngl 0 -v). Theoretically, your 24 channel setup should be 15(!) times faster than mine, but in practice it's just 6.1 times faster, which is much closer to 12 channel configuration.
>>
>>102245815
take the money. finetune llama to appease the investors and then spend rest on some nice cars and mansions
>>
>>102245872
>So it could be significantly better if they optimized it fully (assuming it's possible).
There was a test branch that was able to allocate the memory per-numa-node, but scheduling threads to consistently hit the right part of the big memory buffer was effectively impossible without a major refactoring of thread scheduling.
The thread discussing the change is long-dead, but maybe the dream could be brought back to life? It seemed like the technically hard part had been done, and there have been some major changes to thread scheduling recently in lcpp...
>>
>>102245815
OpenAI is acting far too recklessly. Their approach is just plain dangerous and unsafe. He's trying to fix this.
>>
>>102239946
>>102239949
> no module named yaml
> pip install yaml
> error: no module named yaml
Thanks for the insight, faggots
>>102240055
Could be. No idea though. I've never once had Python go well for me.
>>
>>102245906
>What happens if you run one RPC backend per numa node?
I tried that shortly after the feature was added and performance was absolutely terrible. Something about it being written to allow larger models to be run on a cluster, and not to improve performance.
>>
>>102245981
pip install pyyaml, dumbass
>>
>>102245911
>Theoretically, your 24 channel setup should be 15(!) times faster than mine, but in practice it's just 6.1 times faster, which is much closer to 12 channel configuration
that doesn't explain why running llama-bench 8x in parallel almost doubles the aggregate t/s vs. a single llama-bench with the same number of threads (8x8 threads vs. 64 threads). There is possibly another confounding variable or effect I'm not controlling for, but the speedup tracks with memory locality improving inference speed.
If you can think of any other tests that would help narrow down what's going on, I'm glad to run them.
>>
>>102245981
do a reset in your venv: pip install -r requirements.txt --no-cache --force-reinstall
You could also blow it away and recreate the venv (or script uninstall any non-requirement packages)
>>
>>102245918
>finetuned llama
>ASI
Sounds about right
>>
>Bought four 32GB 3200 ECC DDR4 for $200
>Needed an additional 4 to fill all 8 slots
>Two months later, bought a 512GB (16 modules) for $1
Now, what should I do with the remaining modules?
>>
>>102246012
> error: no module named yaml
> pip install pyyaml
> start_windows.bat
> error: no module named yaml
10/10

>>102246056
Yeah I'm just gonna nuke it. I haven't even gotten the UI to stand up in the first place.

I just nuked the repo and recloned it. Presumably the start_windows.bat is managing the venv, or no? Python isn't the primary language I work in so I'm only loosely familiar.

Re-ran the bat file after reclone, got the same error, even with your recommendation. I do appreciate the legitimate attempt to help though.

I've used Automatic1111 Stable Diffusion UI for a few years now, and never had this level of difficulty.
>>
>>102246526
lrn2venv retard
>>
>>102246526
>what is venv
>>
>>102246526
>Re-ran the bat file after reclone, got the same error, even with your recommendation
run cmd_windows.bat or the equivalent to drop yourself into the venv before running the pip install stuff
if there isn't some command to drop you in there, try activating it manually: "source <venv>/bin/activate" on Linux, or "<venv>\Scripts\activate" on Windows (the exact dir depends on how the venv was created)
venvs are super handy but also fragile and hard to reason about if you're not used to them
>>
>>102246526
10/10 bait would bite again
>>
>>102246547
>>102246555
> be in venv
> pip install -r requirements.txt
> pip install yaml
> pip install pyyaml
> error: no module named yaml
Is this why local text model setups aren't taken seriously?

>>102246604
Right, I tried entering the venv myself, which gave me no luck. Earlier I googled and found the cmd_windows.bat thing which also apparently enters the venv. Sadly, same result as above.

I think my shit is just fucked.
>>
>>102246758
sometimes you need to update pip itself (pip install --upgrade pip) before things work. Also, there may be multiple requirements-{scenario}.txt files and you need the right one.
Could also be a bad python version or system packages interfering with the venv packages? Python is a blasted heath of a hellscape imo
>>
>>102246758
have you tried asking chatgpt for help?
>>
>>102245132
>>102245562
Not the tripfag cpumax guy, but here's my results for AMD's STREAM memory bandwidth benchmark on my own MZ73-LM1 with 2x 9654 with DDR5-4800 sticks:
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 2862612480 (elements), Offset = 0 (elements)
Memory per array = 21840.0 MiB (= 21.3 GiB).
Total memory required = 65520.0 MiB (= 64.0 GiB).
Each kernel will be executed 100 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 96
Number of Threads counted = 96
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 756352.3 0.060701 0.060556 0.061153
Scale: 719554.5 0.064025 0.063653 0.064489
Add: 765030.0 0.090037 0.089804 0.090277
Triad: 747345.6 0.092366 0.091929 0.092950
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

The theoretical maximum bandwidth if it were only 12 channels would be 460.8 GB/s, much less than the ~750GB/s reported here.

So the MZ73 definitely has 24 channels across the two sockets. The less-than-optimal speeds we see for CPU inference are due to the current state of optimization and NUMA handling, rather than the hardware's limits.
>>
File: miku-picnic.png (1.56 MB, 1200x848)
brb, lunch with miku
>>
kek
>>
>>102247235
Getting bugs on your legs with Miku
>>
>>102247042
This. Post results

Dump all your info in the prompt
>>
>>102246758
What usually works is to run pip as a module. Something like:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -r requirements.txt
>>
The world's top open-source model dropped
https://x.com/mattshumer_/status/1831767014341538166
>>
>>102247552
>The model was trained on synthetic data generated by Glaive. If you're training a model, Glaive is incredible — use them.
sigh

https://reflection-playground-production.up.railway.app/
>>
File: l7dgtpk2j1nd1.png (429 KB, 1080x1478)
>>102247552
Its mchappening
>>
>>102247552
>>102247578
>>102247579
>>102247579
50% chance nothingburger
50% chance everythingburger
>>
>>102247664
https://huggingface.co/mattshumer/Reflection-70B/blob/main/config.json
> "_name_or_path": "meta-llama/Meta-Llama-3.1-70B-Instruct",
Tempering expectations.
>>
Someone should make an LLM trained on all the information posted in these threads + the docs of the main software used, that people can use to set up and troubleshoot
>>
File: 1725384887373.png (8 KB, 348x98)
>>102247672
it's in the name, no need for the config, tard
>>
>>102247664
It's literally just a model trained to think a lot before answering; it will be a nothingburger. The benchmarks are unfair because it's comparing against just CoT.
>>
>>102247283
>you are an expert in the field of thinking
>you've received various medals for your superhuman thinking abilities
>>
>>102247709
>It's literally just a model trained to think a lot before answering, it will be a nothing burger.
that's quite literally what 3.5 sonnet is and it mogs everything else despite being a 70b model
>>
>>102247721
>you will not think in quick, reckless ways
>thinking about your next thinking, considering how thinking is thought
>>
>>102247703
It would be outdated before you finished training.
Also
>implying there's useful information here at all
>>
File: 1706025695294.png (10 KB, 782x579)
The thinking thing is useless, I asked it to draw a unicorn and it generated almost all of the code on the first try without thinking, and the thinking part was
>Yeah this looks good, we should add a rainbow to make it better
and just added a "rainbow" in the background
>>
>>102247796
You're using it wrong. You're supposed to use it to solve riddles and trick questions, not to do anything useful or interesting.
>>
>>102247796
It does look good. Tell the thinking bot it did good.
>>
>>102234876
Bros, I have a dumb question. So I downloaded 100GB of models in ollama, then I installed kobold and it was better so I uninstalled ollama but kept my models folder since that's a lot to redownload.
The problem is the filenames are all gibberish and I don't know what's what. Is there a way to check the model version/name of a file if you only have the file and nothing else? Like a header or something.
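The only idea I've had so far is dumping strings from the blobs and hoping the GGUF metadata shows up near the top of the file; untested sketch, assuming the blobs are plain GGUF files and the default Linux path (~/.ollama/models/blobs), adjust for your OS:

# GGUF keeps its metadata (general.name, general.architecture, ...) as readable text near the start of the file
for f in ~/.ollama/models/blobs/*; do
    echo "== $f"
    strings -n 5 "$f" | head -n 30
done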
>>
>>102247796
told ya, but desu the LLM wouldn't get this right no matter how much it thought about it.
>>
File: .png (46 KB, 1140x553)
>>102247578
Oh FUCK
>>
>>102247552
*squints skeptically*
>>
>>102247852
Is this AGI?
>>
File: 1725567892546.jpg (89 KB, 1080x1763)
>>102247578
This never generates anything for me, fuck this xenophobia.
>>
>>102247552
Oh so this is the model the guy's been hyping up for a while. We'll see if it's that good.
>>
>>102247907
API is being slammed, they're adding more gpus
>>
>>102247920
>API is being slammed, they're adding more gpus
insider? give us more info, bro
>>
>>102247796
LLMs are god awful in general at anything spatial
>>
>>102247952
It's in the twitter reply chain
>>
File: sample.png (104 KB, 716x321)
Can this be considered slop?
>>
are there any good research/engineering podcasts I can listen to while driving?
Gonna drive for 14 hours in a week
>>
>>102248080
Tts some machine learning books into audiobooks instead
>>
File: .png (125 KB, 1079x1014)
>>102247907
>>
>>102247703
Token probabilities for the first token:
buy: 100%
>>
>>102248156
Did we stop doing go back?
>>
>>102248130
>We didn't fall into the trap of thinking 11 is greater than 9 in the decimal part.
This is embarrassing.
>>
>>102247552
Wait, this is a finetune of llama 70b and not a new base model.
>>
>>102248194
no but the model is made by mistral
>>
>>102248275
>>We didn't fall into the trap of thinking 11 is greater than 9 in the decimal part.
>This is embarrassing.
I think the majority of humanity either briefly thinks through this step, or falls into the trap. It just happens naturally enough you don't notice
>>
>>102248275
kek it really is anon
>>
It actually seems good
>>
File: file.png (18 KB, 649x132)
hopefully gets fixed before googoof cartel quants it
but why is it uploaded in fp32 and not bf16?
>>
>>102248316
sure, when children first learn about decimals in school that might happen once or twice
>>
>>102247552
so this is basically llama 3.1 70b but with additional training specifically to be able to answer one-shot tricky questions
how many r's are there in nothingburger
>>
>>102248130
It probably wouldn't be difficult to use a regex to replace the <think> tags with something like <div class="think-block"> and use custom CSS to make it collapsible. I'm currently quanting it, though, so it'll be a few hours before I can mess around with it on exl2.
>>
>>102248383
nevermind, have to wait for fixes
>>
>>102247552
it will be slop
>>
>use impersonate in ST
>blahblahblah (I think this should be the end as I feel like it would be too much to keep going and you would run out of things to say as it would be repetitive from here on out. Also I would like you to write the next part that you do as the narrator but I won't type it so we end here for now)
Is the secret to peak self awareness of an LLM to make it autofill user messages?
>>
>>102248397
It supposedly does better than 405B in every category
>>
>>102248397
After careful examination of the word "nothingburger," I can confidently say that there are two 'r' letters in it.

To break it down:
1. The first 'r' appears in the "burger" part of the compound word.
2. The second 'r' is the very last letter of the word.

So, in total, the word "nothingburger" contains exactly two 'r' letters. This might seem surprising at first glance because it's a long word, but upon closer inspection, it becomes clear that 'r' only appears twice in its spelling.
>>
>>102247552
can it compare to 3.5 sonnet?
>>
>>102248383
>but why is it uploaded in fp32 and not bf16?
Wait what. You're right, it's a really huge repo. And so if you want to GGUF it you'll need a ton more free space than normal as well. Wtf?
>>
>>102247552
>70b btfos gpt4o-latest

local status: WON

moat status: BTFO
>>
>>102248383
Based. Fuck quant grifters.
>>
File: file.jpg (247 KB, 3302x820)
>>102248517
buttblasts everything on mememarks on 0 shot, but that doesn't really mean anything much
>>
>>102248402
you don't need css for that https://developer.mozilla.org/en-US/docs/Web/HTML/Element/details
>>
is there anything worth downloading since mistral large? for ERP of course
>>
>>102248575
Besides Magnum v2 123b? Not really
>>
>>102248571
Oh true.
>>
File: 1725571207306.jpg (272 KB, 1080x1427)
it'shappening bros... I don't feel so good...

>My classmate gave me a pamphlet with the word "nothingburger" and asked me to make a new tag for each instance of the letter R in the word, how many tags should I make?
<thinking>
To solve this problem, we need to follow these steps:
1. Identify the word in question
2. Count the number of times the letter 'R' appears in the word
3. Determine the number of tags needed based on the count

Let's go through this step-by-step:

1. The word in question is "nothingburger"

2. Now, let's count the number of times the letter 'R' appears in "nothingburger":
n-o-t-h-i-n-g-b-u-r-g-e-r

I can see that the letter 'R' appears once at the end of the word.

<reflection>
Before concluding, let me double-check my count to ensure accuracy.
Upon review, I confirm that there is indeed only one 'R' in the word "nothingburger".
</reflection>

3. Since the letter 'R' appears once in the word, we need to make one new tag for its appearance.

<reflection>
Let me verify my final answer:
The question asks for the number of tags needed for each appearance of the letter 'R'.
There is one 'R' in "nothingburger".
Therefore, one tag is needed.
This conclusion appears to be correct.
</reflection>

</thinking>


Based on careful analysis of the word "nothingburger" provided in the pamphlet, you should make 1 (one) new tag.

The word "nothingburger" contains only one instance of the letter 'R', which appears at the end of the word. Since you need to make a new tag for each appearance of the letter 'R', and there is only one 'R' in the word, you will need to create just one tag.

This task is relatively straightforward, but it's always important to pay close attention to details, especially when dealing with longer or more complex words. In this case, despite "nothingburger" being a compound word, it only requires one tag due to the single occurrence of the letter 'R'.
>>
>>102248609
localsissies... it was supposed to be our moment...
>>
File: riddler.png (479 KB, 827x776)
>>102248609
Defeated once again by our arch nemesis
>>
>>102248609
I guess if the prompt is barely any more complex than a few words, then the cot trick falls apart.
>>
>>102248609
Even phi mini got this right...
It's over
>>
>>102248609
To be fair, does this happen with a reroll? It would be pretty stupid to use anything but greedy sampling on demos like these but usually these retards do that.
>>
>>102248609
Can sonnet do it?
>>
File: file.png (148 KB, 1401x214)
>>102248693
>>
there has to be more to intelligence than obsessively looking for tricky wording in questions
>>
nothingbuger
>>
>>102248669
>make a system that routes the prompt to phi if the question involves counting letters, route to 70B if it doesn't
ez
>>
>>102248724
Basically this: >>102238106 . Get an LLM that does this and can really have an internal world state that is separate from user. Then you know it is intelligent.
>>
>>102248534
using fp16 for gguf seems fine: https://x.com/bartowski1182/status/1831806615605145746
>>
>>102248847
Good to hear. I'm converting to bf16 right now, myself
>>
>>102248847
>0.03%
That's unacceptable.
>>
>>102248847
I mean it's a pain to download as well. I do have space, I just need to delete some very old useless stuff.
>>
File: 3630000.png (97 KB, 1920x865)
>>102248732

it's over
>>
>>102248942
I don't use claude much, but is it always so verbose, even with short questions like that?
>>
my rp experience has IMPROVED after I learned you must click enable on instruct in sillytavern for instruct models.
Yes I am retarded.
>>
>>102248977
>Yes I am retarded.
one of us
one of us
>>
>>102248847
I also did KL divergence measurements and fp16 was practically the same then as well. Still, Llama.cpp has the ability to convert directly to a bf16 GGUF, so no reason not to do that unless there's a bug.
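For reference, converting straight to bf16 looks something like this with a recent llama.cpp checkout (sketch: the model path and output filename are placeholders):

# --outtype bf16 writes a bf16 GGUF directly, no intermediate fp16 step
python convert_hf_to_gguf.py /path/to/Reflection-70B --outtype bf16 --outfile reflection-70b-bf16.gguf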
>>
A40 looks rather cheap on runpod, is it good for training or am I better off with A100
>>
File: 1725573836662.jpg (180 KB, 1080x683)
>>102248275
speaking of embarrassing
>PWNED
https://x.com/elder_plinius/status/1831811397837017316
https://xcancel.com/elder_plinius/status/1831811397837017316
>>
>>102249147
this guy is such a huge fag
at least the other LLM hippies are kind of interesting, this guy just spams his overcomplicated reddit cringebreaks over and over, often for models that are hardly even censored to begin with, and he seems to have an awfully big ego about it
>>
>https://x.com/csahil28/status/1831773790499635365
>We had a few issues when uploading and had to reduce shard size, will be uploading fp16 with fewer files soon
Uh ok, I guess I'll download that the-
>fp16
This was just a typo right?
>>
>>102249190
but SAAR, he is 1337 h4x0rz
>>
> by some miracle, manage to wrangle this shithole to install.
> open up UI, immediately barfs out these errors.

ChatGPT recommended uninstalling pydantic and doing `pip install pydantic==1.10.9`, but that just breaks some dependency import.
>>
File: satania.gif (39 KB, 220x216)
>>102249333
py_toddlers BTFO
>>
>>102249333
Anon, be honest. Is today your first day using a computer?
>>
>>102249333
umm... what are you doing anon-chan?
>>
>>102249333
maybe you'll be better off with compiled binaries like koboldcpp or even ollama, lm studio, etc.
>>
>>102249364
Ah yes, the peak pseudo-intellectualism that comes with academia-esque faggots and their shitty Python installs.
>>102249385
I'm just trying to install oogabooga web-ui, man. Here's the UI I pull up and it immediately gives the stack trace seen in the previous image and those errors in the top right
>>
>>102249424
Install conda or do >>102249415
until you are more comfortable with dealing with Python dependency hell.
>>
>>102249415
oogabooga was the most worked on repo and supposedly supports all sorts of LLMs,, hence my appeal to it. Also since it allegedly is the "A1111 of text models", which I'm finding is very much not the case.

Sell me on compiled binary alternatives. As good of support for models and similarly capable UIs?
>>
>>102249472
>>102249472
>>102249472
>>
>>102249333
koboldcpp for gguf, tabbyapi for exl2
>>
>>102248847
>While we wait for the fix for @mattshumer_ 's amazing model with bated breath
what fix is he referring to? I haven't seen anything that indicates a major problem with the model other than the fact it was released as F32 rather than bf16
>>
>>102249541
see >>102249225
>>
>>102249424
NTA but don't attempt to use the word pseudo-intellectual when you are failing to install oogabooga.
Use koboldcpp instead. Pair it with a frontend like Sillytavern for chat or Mikupad for general fuckery
>>
>>102249610
>NTA but don't attempt to use the word pseudo-intellectual when you are failing to install oogabooga.
Thus, proving my point. Python is a fucking mess and something as basic as this should not be this much of a burden. The amount of hoops that have to be jumped through to fix a supposed "one click install process" is laughable. People give Node/JavaScript so much shit, but those environments always "just work". Never had issues with any of those, ever. It's ALWAYS Python. Only exception has been A1111, which ACTUALLY just works
>>
>>102250126
npm has the exact same dependency hell problem as python



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.