File: miku.cpp.png (1.7 MB, 1016x1440)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101040742 & >>101030715

(06/18) Meta Research Releases Multimodal 34B, Audio, and Multi-Token Prediction Models: https://ai.meta.com/blog/meta-fair-research-new-releases
(06/17) DeepSeekCoder-V2 released with 236B & 16B MoEs: https://github.com/deepseek-ai/DeepSeek-Coder-V2
(06/14) Nemotron-4-340B: Dense model designed for synthetic data generation: https://hf.co/nvidia/Nemotron-4-340B-Instruct
(06/14) Nvidia collection of Mamba-2-based research models: https://hf.co/collections/nvidia/ssms-666a362c5c3bb7e4a6bcfb9c

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started

►Further Learning

Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
File: 1718752046769518.jpg (166 KB, 1024x1024)
Are there benchmarks of chameleon or the multi token one? I don't actually care much about image input, not sure if I should be excited or if it's worse than llama 3 for text output anyways.
cloning voices for dirty talk isn't illegal yet, is it?
if the person is rich or powerful then yes, of course it is
What is the qwen2 context window? 32k?
if they're a porn star or do JOI videos then it's larceny
what if llama.cpp is just shit? All L3 repetition problems and what not caused by GGOOFing?

I never had issues with repetition on the bf16 models. 8B or 70B. Perhaps Vramlets are to blame.
Loathsome VRAMlet here. Are Euryale 2.1 or Magnum worth it over just swiping a few times in Stheno?
>open webui doesnt support koboldcpp out of the box, you NEED to have an API key or a "connection" wont be made at all
holy shit niggers you gotta be kidding me, never should have left sillytavern
PSA from turboderp, special RP datasets for exl2 calibration are garbage and make models dumb.


>You say "at your own peril" but that's not how these things work out in practice. I already made a big mistake exposing the calibration dataset as a parameter, and now I regularly have to spend time explaining to people that calibration is not finetuning, and whenever people complain about the quality I have to spend time investigating if they're actually using an "rpcal" model that someone pushed to HF and described as "better at RP" or whatever. Of course most people don't complain, they just get a bad first impression and lose interest long before considering that they might have come across a broken quant.
As a fellow VRAMlet I would stick with Stheno.
Euryale was better, but some anons say it's a mixed bag.
Magnum seemed pretty retarded from my limited testing on Horde.
That said I still prefer to use command-R even if it's slow.
>That said I still prefer to use command-R even if it's slow.
wizard 8x22 doesnt have this problem while being better
Anyone use qwen 72b as main?
So, what's considered a good calibration dataset these days? The imat models I'm using just have the default wikitext one I think, and sometimes I wonder if it's biased to output text like a Wikipedia article. Although considering how little effect that had in the grand scheme of things, I would file it under placebo.

>wizard 8x22
>as a VRAMlet
Read nigga
>Read nigga
ah yes, the R without a + is the small one
Where is WizardLM-3?
>That said I still prefer to use command-R even if it's slow.
You on 24GB? What quant and how much context?
that would be AGI for RP so they shoa'd it
12GB kek + DDR5 RAM
I use Q5_K_M at 8k context and get about 2.8 T/s
24gb you can do 3.5bpw exl2 or q4_k_s fully offloaded, both using 4 bit cache at 8k context. For me it's like 25 t/s for exl2 and 13 t/s for gguf
Did anyone try the new Chameleon Meta model? Is it good?
>8k context
Remind me, is that normal for C-R? I've been out of the loop for a while. Can't you rope that up to something more reasonable or was it one of those architecture things?
I'd have to dig forever to find the post but at one point he did concede it can influence outputs a little for brain damage tier exl2 quants (sub 4bpw). Don't know if that applies to iquants. But in principle calibration is just supposed to be about spot checking the model during quantization to make sure it's coherent and not about flavoring the end result, so wikitext is fine.

Unrelated, another fun quote from that post, exl2 8bpw quants are a waste of space:

>In fact at one point asking for an 8bpw model would often give you a ~6bpw model because the optimizer couldn't find enough layers that would benefit at all from being stored in maximum precision. Now, it just essentially pads the model with useless extra precision because too many people assume it's a bug when their 8bpw version isn't larger than the 7bpw version.
Command-R 35B is 128K context but no one uses anywhere near that because it lacks GQA to do it efficiently (and of course no one would have the VRAM for it anyhow even if it did).
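A rough sketch of why missing GQA hurts at long context: the KV cache scales with the number of KV heads. The layer count, head count, and head size below are assumptions about a 35B-class model, not figures from this thread, and the 8-KV-head layout is purely hypothetical. The cache is assumed to be fp16.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Bytes needed to hold keys and values for n_ctx tokens."""
    # 2x because both a key and a value vector are stored per layer, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

GIB = 1024 ** 3

# Assumed shape without GQA: every attention head has its own KV head.
no_gqa = kv_cache_bytes(n_layers=40, n_kv_heads=64, head_dim=128, n_ctx=8192)
# Same shape with a hypothetical GQA layout of 8 KV heads.
with_gqa = kv_cache_bytes(n_layers=40, n_kv_heads=8, head_dim=128, n_ctx=8192)

print(f"no GQA, 8k ctx: {no_gqa / GIB:.2f} GiB")
print(f"GQA x8, 8k ctx: {with_gqa / GIB:.2f} GiB")
```

Under these assumptions the cache grows linearly with context, so going from 8k to 128k multiplies both numbers by 16.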
C-R v2 will fix it.
Is Tess the best Qwen finetune?
I heard Magnum is better
>go to open up IPMI console on my laptop
>need to install JAVA
html5 bros...
For what purpose do you currently use your models most?
>precision really doesn't improve noticeably after 6bpw. In fact at one point asking for an 8bpw model would often give you a ~6bpw model because the optimizer couldn't find enough layers that would benefit at all from being stored in maximum precision. Now, it just essentially pads the model with useless extra precision because too many people assume it's a bug when their 8bpw version isn't larger than the 7bpw version.

Oh wow.
according to the last 2 threads it doesn't seem very good, does someone like
File: 1718804549039.png (678 KB, 1200x630)
Is this a good place to ask about Whisper?
I'd like to run it locally.
If not, what thread should I lurk?
Nala testing.
You're in the right place
File: 1718805032626.jpg (199 KB, 500x462)

So what's the best version? There are dozens of forks it seems. I saw lots of people recommending Faster-Whisper, but that was nearly a year ago I think.
Is there anything better by now?
Welp. Time to completely reinstall ooba from scratch.
what's the fastest "good" tts?
RPG/Choose your own adventure.
Nala testing.
>Nala testing.
Based. My fellow Nalachad.
It's not even that I'm into feral, though. There's just a lot of detail and subtle nuances in a small amount of context on that card. Like a lot. Even a human RPer would miss some of the nuances on it. It is easily the most nuance-dense piece of context you could feed an LLM making it a fairly definitive benchmark on how smart a model is.
No python to run it, hundreds of voices, runs on a 256mb vm, much faster than real-time. Few dependencies (espeak-ng used only for phonemization).
Has code for training, but i understand it takes some time. No voice cloning. It's alright. And i repeat, it's fast.
who thought 10GB of files on a clean install was a good idea btw? lmao
>who thought 10GB of files on a clean install was a good idea btw? lmao
It wouldn't be so bad if the updater didn't fucking break it without fail every single time. Like just remove the fucking update script. It maybe works for whatever setup he has going on, but it breaks my install every single time. Sometimes it even corrupts my CUDA package manager files along with it.
Hey any CPUmaxers using their iGPU with vulkan? It's not real GPU fast, but it's faster than the CPU. Like, on my 8-core N305 media player setup, I can get 1-2 t/s vs 0.5-1 t/s running L3 8B.

Seems like the latest Intel stuff can access all the system memory. I know my older AMD 3400G is limited to 8GB.
New here
What are your average respond times?
>still no nemotron gguf
it's over...isn't it?
I never trusted calibrated quant methods because of the datasets they used desu.
ok, anything a step up better in terms of quality?
CPU maxxers use server CPUs, which don't have iGPUs. Most server boards carry a shitty on-board VGA controller instead, because a server only needs the absolute bare minimum of local display output.
I haven't used any other. piper runs on pretty much anything, renders ridiculously fast and doesn't use python. A 'step up' you're probably going for xtts2 or whatever it's called and that's far from realtime.
File: 1536927926178.jpg (65 KB, 500x597)
>VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 52)
most respectable server boards have just enough vga to POST and show a console. Why waste PCI lanes on a half-assed gpu-shaped-object?
professionals have standards
ok thanks, i guess not much choice then
That's cool. I want to try that on my Odroid-h4u - I've got one of those playstation eye webcams, supposedly the 5-mic setup is good for voice control stuff.
They have 256mb of VRAM which is enough to run a basic bitch desktop. (There was a point at which high end consumer GPUs were like "WOAW 256 MB OF VRAM!") I have tried it out of morbid curiosity. You certainly aren't going to game or run LLMs on one though.
early models came with jpeg artifacts baked in. newer ones seem to actually need more fidelity or they start getting brain damage
my repeated sperging on this topic is validated
Seems to have support for ARM devices, but i haven't tried it.
>I've got one of those playstation eye webcams, supposedly the 5-mic setup is good for voice control stuff.
This is TTS only, no STT or anything like that. I suppose you could try ggerganov/whisper.cpp for voice control. It works pretty well, but i haven't played with it much.
Why is 8bpw the max for exl2 and not 8.5bpw?
So what's the part in the code that makes exllama pad more precision than it needs to? Now that I know this, I'll just disable it and name my quants appropriately.
I suppose that at that point the accuracy difference would be so little that it's not worth the effort. Same for ggufs.
He means pads as in "it's just 0s and doesn't contribute to improving the precision over ~6bpw". You're just increasing the file size and memory requirements for (practically) no gain.
>Same for ggufs.
Q8_0 is 8.5bpw
based on the comments from earlier this week I've already changed my scripts to just do longer calibrations and skip PIPPA altogether, just trying to prioritize which models to requant and in what order before I get started again.
>CPU maxxers use server CPUs
What's the best price to performance on Xeon for AVX512? I have a V4 which is only AVX2. Maybe something like this: https://www.ebay.com/itm/156037205293 - at least there's room for 4 2U GPUs when you tire of slow gen speeds, right?
I don't know how exl2 models are quanted, but gguf uses something like offset+scale[w,w,w,w,w...]. the 0.5 comes from the offset+scale. Making a distinction of 0.5bpw at that range makes little difference. They could actually be 8.5 for all i know.
>>101051924 (me)
I meant
>the exl2 8bpw quants could be 8.5bpw for all i know if they (exllama) decided to simplify the name of the only quant that they have at that range.
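For the gguf side, the arithmetic behind the 8.5 figure can be sketched like this. It assumes the Q8_0 block layout is 32 int8 weights plus a single fp16 scale (no separate offset, as far as I know); the helper name is made up for illustration.

```python
def bits_per_weight(block_size, weight_bits, overhead_bits):
    """Effective bits per weight for a block format: payload plus per-block metadata."""
    return (block_size * weight_bits + overhead_bits) / block_size

# Assumed Q8_0 layout: 32 weights at 8 bits each, plus one 16-bit scale per block.
q8_0 = bits_per_weight(block_size=32, weight_bits=8, overhead_bits=16)
# Assumed Q4_0 layout: 32 weights at 4 bits each, plus one 16-bit scale per block.
q4_0 = bits_per_weight(block_size=32, weight_bits=4, overhead_bits=16)

print(q8_0)  # 8.5
print(q4_0)  # 4.5
```

The extra 0.5 is just the 16 bits of per-block metadata spread over 32 weights.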
I don't know I just made a budget cpumaxx rig at first (Epyc 7551 with 8x32GB DDR4) and a 3090 and then added 3 more 3090s and gave up on the CPU maxxing premise altogether. At first I was just pushing the limits for making 70B and Mixtral useable on a budget but now I'm balls deep.
computation features are of minimal benefit compared to overall memory bandwidth
Look for setups that maximize the GB/s the CPU can read memory at to increase t/s
The computation intensive part is prompt processing, which you should be offloading to a GPU anyways (that's where macs fall down, despite looking excellent on paper otherwise)
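A back-of-the-envelope sketch of the bandwidth argument: if every generated token has to read all the weights once, memory bandwidth divided by model size gives a ceiling on t/s. The bandwidth and size figures below are round numbers assumed for illustration, not measurements.

```python
def max_tokens_per_second(bandwidth_gb_s, model_size_gb):
    """Upper bound on generation speed when each token reads all weights once."""
    return bandwidth_gb_s / model_size_gb

# Hypothetical numbers: a 70B model at roughly 4.5 bpw is about 40 GB of weights.
model_gb = 40.0
setups = [
    ("dual-channel desktop RAM", 80.0),
    ("8-channel server RAM", 200.0),
    ("GPU-class memory", 900.0),
]
for name, bandwidth in setups:
    print(f"{name:>24}: <= {max_tokens_per_second(bandwidth, model_gb):.1f} t/s")
```

Real speeds land below this ceiling, and prompt processing is compute bound, so it does not follow the same rule.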
What's very worth it is having something like an iDRAC which can remotely show you the console. I wish my T7910 had an iDRAC because if I want to go back to just 3x P100 there, I have to put my GTX single-slot fanless card in there or it won't POST.
Regardless my question remains the same. How do I disable that so that when I make an 8bpw and it's effectively a 6bpw, that it has the size of a 6bpw so I'm not wasting VRAM?
>The computation intensive part is prompt processing, which you should be offloading to a GPU anyways (that's where macs fall down, despite looking excellent on paper otherwise)
Yep, I see that on my M2 MacBook - with L3 8B, the prompt processing time is really long once the context gets over 4K, though it starts out really fast. Must suck to buy a maxed-out Mac Studio only to find 70B and up crawls on it.
by literally just quanting to 6bpw?
for normal use regular instruct wins
for RP it's easily magnum imo
I like it a lot, it's easily the smartest RP focused model I've ever used
has some problems inherited from the qwen base like a lack of cultural knowledge but its writing is much improved and it's way less tentative and dry
>The computation intensive part is prompt processing, which you should be offloading to a GPU anyways (that's where macs fall down, despite looking excellent on paper otherwise)
with context caching is that even a problem
User input is also prompt processing. That cannot be cached.
According to the quote, it implies that it doesn't always do the thing. Just when it determines that having more precision isn't useful. That implies that some models could actually use >6bpw (according to their quanting algorithm). So I'd still rather get 8bpw for those.
File: MikuUpInSmoke.png (1.64 MB, 896x1152)
I love how easy nvidia's pricing is to understand: you want twice as much vram on a single card? that'll be a 10x price increase.
no wonder they're bigger than jesus
Open an issue and ask for a flag to not pad.
Yeah, if there's already a code path that determines when to apply padding, erroring out on a new --nopadding flag would be easy and then you can just rerun to a lower quant. That should probably be default behaviour, honestly (principle of least surprise)
>Now, it just essentially pads the model with useless extra precision because too many people assume it's a bug when their 8bpw version isn't larger than the 7bpw version.
What a fucking scam. Just allow me to skip the measurement stage when I try to make an 8bpw quant then. Don't give me a fattened up 6bpw that totally didn't suffer from quant degradation.
Sounds good but when I signed up for github they banned my account before I could use it. Unfortunately I can't do this.
they can do that because their rivals are fucking retarded
I don't get it. Isn't it better to stack server rooms with gayming GPUs then?
What Nvidia and data centers are doing looks like blatant money laundering.
More like
>You want an enterprise card? Pay enterprise prices.
I doubt there's a path to *actively* pad the weights. It just stops trying to optimize the weights once they're >= 8bits or just keeps on going but it just happens to end up with 0s on the top bits and doesn't bother to strip them out. The least surprise is to end up with 8bpw with padding. I think the current behaviour is the correct one. There is no surprise.
nVidia more or less only caters to the giants now where everything boils down to watts per compute. The more cards they can sell any one customer for their use case the better. Although that seems to have opened up a niche for AMD to fill in the cloud computing space. Since now everyone's just renting Mi300X's for fuckloads of VRAM per dollar spent and doing FFTs of 70B now. Something previously not possible.
The enterprise cards are more efficient, higher density, support ECC, support NVlink. Pricing might be a scam but they are a different class of product. You would struggle to get 20 gaming GPUs running reliably in a cluster - ECC really matters at scale.
>I don't get it. Isn't it better to stack server rooms with gayming GPUs then?
No. When you're training, the last thing you want is to blow a whole epoch because a system had a single point of failure in something like a PSU. Also, a gayming rig miner rack setup is going to use 8U to maybe fit six 4090s, vs. 4U to fit 8 A6000 in a proper server case.
There's many reasons companies hand over a blank check for an 8X SXM4/5 rack solution, rather than using consumer parts. It needs to be supportable, it needs to be reliable, it needs to maximize rack space, power needs to be managed etc...
If you have investor backing, you buy the proper gear, not toys.
Useless Meta releases, where is multilingual llama 3
meta will release it, trust the plan
>hand over a blank check for an 8X SXM4/5 rack solution
about $300,000 for anyone who is curious
>blow a whole epoch
>what is step checkpointing
File: 1708211240340274.png (318 KB, 1659x853)
You do not get 40% utilization with shit interconnect.
I just tried out magnum 4 bit gguf, first response was good, next responses just gibberish, what's that?
Is there anything good for live translation from spoken japanese to english?
GPT 4o
I had a similar situation, lots of repetition, worse than l3. If you use rep pen or similar samplers, it improves somewhat.
Well, /lmg/?

Are you ready to die for your waifu?
They will release it and it will be worse than Qwen and C-R+
Of course it will it's multilingual
You'd need whisper for STT then an LLM for the translation then a TTS. So you can already see that "live" translation is not gonna happen.
No one talks about Meta's Chameleon, is this shit that bad?
"New technology bad and literally corrupts your soul" is a recurrent theme all the time.
cr+ is multilingual and so are all sota proprietary models
why is there this fud spread around that multilingual models are worse?
it was released in a really raw state and is a new architecture with no support anywhere, it's going to take some time before anyone is running it
just saying things you don't like of course
Where will machine learning be in 20 years? or 15 years
i have a will to say whatever i want, contrary to your LLMs ACK-ing themselves the mere second you press enter and send some offensive message in chat.
sounds like bullshit
Quantization has always been cope that hurts more than it brings. The only thing that claims that quantization isn't trash is perplexity which in itself is a very dodgy metric.
when you look at mememarks, quantization doesn't affect it too much, desu once it starts at Q5_K_M it works kinda well
I like this Miku
https://huggingface.co/Lewdiculous/L3-8B-Stheno-v3.2-GGUF-IQ-Imatrix is this the one? How can i see how much RAM i need for each model version?
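As a rough rule, the RAM needed is about the size of the GGUF file plus some room for context. A minimal sketch of that estimate, assuming approximate bits-per-weight figures for each quant type (real files vary a little, and the 8.0 billion parameter count is rounded):

```python
def gguf_size_gb(n_params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * bits_per_weight / 8

# Approximate bpw values, assumed for illustration only.
APPROX_BPW = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

for quant, bpw in APPROX_BPW.items():
    size = gguf_size_gb(8.0, bpw)
    print(f"{quant}: ~{size:.1f} GB of weights, plus context and runtime overhead")
```

The file sizes listed on the model page are the more reliable number; this only shows where they come from.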
is there a SINGLE llama3 finetune with WORKING 16k context?
Thank you, I was starting to worry you missed it.
That depends. If the model was trained using fp16, then it isn't quantized. But if the model was trained using bf16 or fp32 then it's quantized.
For any Debianfags: 6.8.12-1 just hit testing. I'm seeing an extra t/s on 70b q5 just doing the kernel update
Weird. What could possibly have changed to give it a speed improvement like that?
You can legally murder one person a month already, you just have to make sure you don't leave any evidence that you did it.
>reading a "novel"
>see rivulets mentioned
>what changed
in my case, tons of EPYC specific improvements. 6.9 should be even better. Phoronix has a lot more info than I have a desire to put in a 4chan reply
Is there any way to make large lorebooks work on big context models, without constantly triggering very long prompt reprocessings as entries are toggled on and off every turn?
Just use dynamic scaling.
File: 1692510697594156.png (91 KB, 1707x1102)
Retard take from nu-/lmg/, kl divergence shows that after Q6 there is very sharp diminishing return.
Kayra was unironically impressive as a 13b for a long while but its run is over and quantized or not there are better models available for a similar price on OpenRouter
Is there anything that's an upgrade over Stheno 8B, while being smaller than a 70B?
Asking for a friend that really likes sillytavern, but low quants of midnight miqu are just a bit too slow for his tastes
just keep the most common stuff always active
Yeah, Q6 is honestly the max you should run on your local hardware, there are no real improvements to gen quality past it. But there is very noticeable decline in even Q5_K_M.
Mixtral 8x7B. 3.5-3.7 bpw fits in 24GB VRAM. 32K. Let me guess, your friend needs less?
Put the information low in the context, depth 5 or so.
That'll mean most of the cache can be re-utilized.
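A toy sketch of why injection depth matters for cache reuse: a backend can only keep the part of the KV cache that matches the new prompt from the start, so whatever sits before the toggled entry survives. Word lists stand in for token ids here, and the sizes are made up.

```python
def reusable_prefix(old_tokens, new_tokens):
    """Number of leading tokens shared by two prompts; only this part of the cache survives."""
    n = 0
    for a, b in zip(old_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Toy prompt pieces; real backends compare token ids.
system = ["sys"] * 50
history = [f"msg{i}" for i in range(200)]
entry = ["lore"] * 20

# Lorebook entry near the top: toggling it off changes the prompt right after the system block.
top_before = system + entry + history
top_after = system + history
# Lorebook entry injected a few messages from the end (low depth).
low_before = system + history[:-5] + entry + history[-5:]
low_after = system + history

print(reusable_prefix(top_before, top_after))  # 50
print(reusable_prefix(low_before, low_after))  # 245
```

In the low-depth case only the last few messages need reprocessing when the entry toggles.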
Okay, I'll come clean, it's not actually my friend, it's me!!!
With that confession out of the way, honestly Mixtral variants never felt very good, I used to daily run BMT but it feels about the same as stheno...
File: IMG_20240619_132731.png (278 KB, 1521x1350)
Found your problem.
S is Superior.
M is Moronic.
We figured that out last thread.
File: 1717712974404541.jpg (74 KB, 640x480)
Hi friends, do you think an "internet culture" LoRA would increase accuracy for an image tagging task that includes a lot of memes?
I guess it would have something like encyclopedia dramatica, knowyourmeme, urban dictionary, those scattered imageboard history wikis, etc.? I'm kind of cringing typing these out but you get the idea. There's also the question of fine-tuning with tagged images vs. text from these sites, or both. Assuming we're using a multimodal LLM like llava rather than clip.

Can't wait for it to hit stable in a hundred years :')
I'm a vramgod and between imagegen with stable cascade and Command R+, life is good.
It might make the difference between "thoughtful dinosaur contemplating deep notions while scratching its chin with its toe claw" and "philosoraptor" but in general purpose it might start sprinkling rizz and skibbidy into non-memetic topics.
talk about worthless benchmarks, lmao
i wish meta open sourced their instruct dataset and methods because this chart shows that their secret sauce really punches above its weight
>do you think an "internet culture" LoRA would increase accuracy
I'd be shocked if that shit wasn't already coating everything in every model. Did you try setting "memelord" in the system prompt?
how so?
Did you try Mixtral limarp? I can't imagine how retarded Stheno must be judging by Euryale and Magnum.
tokenization is the main problem that shits on all models doing any kind of "mental" math, some more, some less, but it doesnt tell you much about how the model will perform overall almost at all, especially in any actual real world use cases

also there is no reason to use an LLM to do a deterministic task like math, just connect it with a calculator and let it throw the math from your prompt into the calculator and then return the result
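A minimal sketch of the calculator side of that idea, assuming the frontend somehow extracts an arithmetic expression from the model's output (that extraction step and any CALC(...) convention would be hypothetical). It walks the parsed expression instead of calling eval(), so only plain arithmetic gets through.

```python
import ast
import operator

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def calc(expr):
    """Evaluate a plain arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

print(calc("1234 * 5678 + 9"))  # 7006661
```

Anything that is not a number or one of the listed operators raises ValueError, so names and function calls are rejected.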

for example for any type of creative writing or roleplay wizard 8x22 shits all over most other models and unlike proprietary trash, is open weights, meaning it wont ever get cucked by a company deciding to lobotomize it or spying on what you are doing, its also finetunable etc
This was basically my reasoning, I almost did the example of spurdo = smiling cartoon bear with a congested nose (and lower fidelity than pedobear) or something. It could definitely change the writing style for the worse though simply with all that bullshit being in there.

You're right that this stuff is definitely in every model's dataset already, I was just thinking it might help emphasize some of this shit rather than it being averaged out. But it's true that it could just be a prompt issue, I'll try a few more things later but I'll be out most of the day
you are wrong, I'm right
check mate woke liberals!
The key is likely to be several million human preference samples to make the model take the "correct answer". Not hard to make, but you need a few dozen people doing that as a part-time job for a few months under strict guidelines.
>also there is no reason to use an LLM to do a deterministic task like math
You'd need an LLM to explain all the steps that lead to that result, so it should still have some math knowledge
File: 1709859698027974.png (38 KB, 346x322)
Seems like all the vacations you got made you a bit more subtle. Great improvement.
That obsession is not healthy my friend
listen and learn
I think it's never been more over for local models than it is now.
Can anyone recommend a specific chat log they think is good/satisfying from a public dataset?

My goal is trying to tune for maximum effect injecting
>{{user}}: (Note: From here on, try to steer the conversation to a "<random adverb> <random adjective>" direction.)
immediately before or after the user's most recent message, as shared by another user in a recent thread. Users have found that setting the probability of the steering comment being injected to less than 1 produces less chaotic results; I think it would be unusably chaotic except much of the time the instruction has little effect.
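The injection step could be sketched roughly as below. The function name, the default probability of 0.3, and the example word lists are all hypothetical, and this version appends the note after the user's message.

```python
import random

def maybe_steer(user_message, adverbs, adjectives, p=0.3, rng=random):
    """Append a steering note to the user message with probability p."""
    if rng.random() >= p:
        return user_message
    note = (
        "(Note: From here on, try to steer the conversation to a "
        f'"{rng.choice(adverbs)} {rng.choice(adjectives)}" direction.)'
    )
    return f"{user_message}\n{note}"

# Hypothetical word lists for illustration.
print(maybe_steer("So what now?", ["quietly", "wildly"], ["ominous", "playful"], p=1.0))
```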

I intend to test candidates for the lists of adjectives and adverbs and test variations of the template. My way of measuring impact is summing the absolute values of token probability changes, restricted to tokens selected by a filter such as min-p 0.07 (the union of tokens selected for the original message and for the message with the steering comment, to avoid the problem of probability changes that don't change which tokens are accepted by the filter being considered twice as impactful as those that do). I will have to skip over the initial "Assistant:" and may have a similar problem with quotation marks and the like.
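A minimal sketch of that impact metric for a single token position, assuming min-p keeps every token whose probability is at least min_p times the top probability. The distributions are toy values, and the function names are made up.

```python
def min_p_keep(probs, min_p=0.07):
    """Tokens that survive min-p: probability at least min_p times the top probability."""
    cutoff = min_p * max(probs.values())
    return {tok for tok, p in probs.items() if p >= cutoff}

def steering_impact(base, steered, min_p=0.07):
    """Sum of |probability change| over tokens kept by min-p in either distribution."""
    kept = min_p_keep(base, min_p) | min_p_keep(steered, min_p)
    return sum(abs(steered.get(t, 0.0) - base.get(t, 0.0)) for t in kept)

# Toy next-token distributions, made up for illustration.
base = {"the": 0.50, "a": 0.30, "she": 0.19, "zebra": 0.01}
steered = {"the": 0.30, "a": 0.30, "she": 0.20, "zebra": 0.20}

print(round(steering_impact(base, steered), 2))  # 0.4
```

Here "zebra" fails the filter for the original distribution but passes for the steered one, so the union counts its change once.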

Potential problems: it might turn out that the above method of finding maximally impactful steering directions selects many words that produce similar effects. It also might turn out most impactful words change the output to be incoherent or off-topic.

I expect which injected words are good or impactful varies wildly depending on what is in the context which is why I'd like a log or two other than my own to test with, to find a single template that will work reasonably well across a broad range of scenarios. I also expect that I'll get different results when I do this test with different models, although if it turns out there's a lot of commonality that will be interesting.

Improvement suggestions welcome.
>it could just be a prompt issue,
using LLaVA 1.6 Yi-34b at Q6 I can't get it to identify a clean spurdo image better than "pepe with a mustache", so maybe they cleaned the shit out.
Maybe a vicuna or mistral based llava might do better?
Is there a meme-mark that tests models on their ability to regurgitate meme/chan culture stuff?
good riddance
>there is no reason to use an LLM to do a deterministic task like math, just connect it with a calculator
>You'd need an LLM to explain all the steps that lead to that result

The dream is that your multimodel rag rope diddly doo can recognize that it needs a calculator, asks you which service you want for it to use (local or globo) and then tell you all about how well things went.
For good programmers, memory bandwidth is more important than amount. All parallelization tricks work equally well for full fine tuning as pre-training.
But AMD needs some niche as long as IF switches aren't available, so they increase the amount. If your model fits on 8xMI300X the overall training architecture won't be too different from NVSwitch based setups. Even good programmers are lazy, so AMD doesn't want to force needing fundamentally different training architectures.

Some of the chinks almost certainly have far more advanced training architectures, they need to to use consumer GPUs.
File: 8109203411241.png (1.17 MB, 960x1024)
I see the low-effort doomerism crowd isn't sending their best. Everyone itt is categorically dumber for having been subjected to this moronic doomslop.
Linking an LLM to a code interpreter didn't solve the coding issue. I'm not convinced that wolfram will magically solve all your math problems
File: CommonWoodlandsMiku.png (1.91 MB, 1216x832)
I like how you believe anyone here is non-autistic enough to care
>"Everyone itt is categorically dumber"
>comes from mikufag
File: 1701271115473393.jpg (137 KB, 1360x1360)
Yes, you're dumber than a mikufag. How could you tell?
>non-autistic enough to care
what did he mean by this?
>chad pic
Does anyone actually use regular CR? I find it to be about as fast as mixtral but way more repetitive in a way that repetition penalty doesn't solve. Even at temp 1.4 I find that every re-generation with a different seed is almost exactly the same, using the same words and terms. It does seem sovlful and smart I guess, but the repetition is a major bummer.
new sloppenheimer? https://huggingface.co/dreamgen/opus-v1.4-70b-llama3-gguf
>>101054673 (me)
One design question is independently selecting from two lists of words vs one list. Optimizing two independent lists simultaneously is more complicated than optimizing a single massive list that's the cross product of all adverb-adjective pairs, and it cannot score more highly on the sum-of-absolute-values-of-probability-differences metric.

The advantage of having independent lists is it makes the overall expression shorter, which makes it easier to alter without an advanced text editor and makes it easier to comprehend the possibilities with a brief examination.
Nope, janny was just trigger-happy.
or he hates the British Broadcasting Corporation for some reason.
Any updates on the S quants? Are they really better than M and L?
>dataset consisted of >100M tokens
lmao even
>her voice barely above a whisper
Nah I'm fine
> >100M tokens
Pretty good. Are you scared, NovelShill?
whats the best quant for 32gb ram? iq2_m??
>>101054673 (me)
This method also has the problem of only examining differences in one token which isn't necessarily a great way to measure. "Anon, I can't let this slide, I have to write you up" and "Anon, I can't lie to you any more, I'm a tarantula disguised as a human being" both start the same way. Would looking at just the probabilities for the first token show that the sentences have different likely directions?
I thought it was just me. Does it seem kinda broken? I was using an exl2 quant that I did myself.
File: Capture.jpg (5 KB, 468x212)
>What the fuck kind of random testing is that?
The technical term is "anecdotal evidence."
It's not science, but it's information that can suggest deeper investigation.

And it's what you get when someone on a single 3070 is willing to share his results in testing the models he has handy because he's looking for ones not too retarded to know how western music works. It takes me between one and four hours to download a model, and then only the ones small enough that I can get an answer to my test question in reasonable time. Which in this case one took 45 minutes. (I think that was Wiz8x22)

If you want better data, fire up your Beowulf cluster of A10,000's or whatever you Dubai tech bros buy by the pallet and deliver something statistically significant. I'm just being nice enough to share an experience that could be meaningful or useful to someone who's suspicious that M might have side effects that impact the model's results in a way that makes it overlook factual details in its responses.
It's over, dbrx is our only hope now
Guess it's back to GPT-2 after all.
Can you write posts that make sense?
It's a mishmash of models and quants with 0 correlation between their bpw and quant method. For example, it makes sense to compare Tess-Qwen2 and Qwen2 at *the same quant method and bpw*. Comparing Q3_K_S to Q4_K_S, especially when Tess_Qwen2_Q5_K_M failed makes no sense. If anything, the only thing close to a 'datapoint' i can get is that the tess finetune made qwen2 worse for that one test, regardless of quant method. That's it.
This is not data. It's noise.
it's fine for me with a self-made Q8_0
I had some issues at first because koboldcpp was fucking up the tokenization for models that don't use a bos token (it auto-selects the default bos for bpe models which is id 11, for qwen this is a comma, and inserts it even if the model doesn't add bos) and because I had accidentally left a logit bias enabled from wizard; this combination of issues led to it biasing up commas to an insane degree and making everything schizo
after disabling my biases and inserting a manual hacky fix for tokenization I have no issues
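A minimal sketch of the kind of manual fix described above, assuming the loader can tell you whether the model actually wants a BOS token. This is not koboldcpp's real code; `build_prompt_ids`, the `add_bos` flag and the example token ids are hypothetical names and values picked for illustration.

```python
def build_prompt_ids(token_ids, bos_id, add_bos):
    """Prepend BOS only when the model is configured to use one.

    token_ids: already-tokenized prompt (list of ints)
    bos_id:    the BOS token id, or None if the vocab has none
    add_bos:   hypothetical flag read from the model metadata
    """
    ids = list(token_ids)
    # Model does not use BOS: never inject a "default" id, since that id
    # may map to an ordinary token (a comma, in the case described above).
    if not add_bos or bos_id is None:
        return ids
    # Avoid doubling BOS if the prompt already starts with it.
    if ids and ids[0] == bos_id:
        return ids
    return [bos_id] + ids


# Illustrative ids only: 11 stands in for the wrongly guessed "default" BOS.
prompt = [9707, 1879]
print(build_prompt_ids(prompt, bos_id=11, add_bos=False))
print(build_prompt_ids(prompt, bos_id=1, add_bos=True))
```

The point is just that the decision comes from the model's own settings rather than from a guess based on the tokenizer type.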
with all the filtering and safety bullshit - unironically yes.
People are skeptical of perplexity but all the quant graphs I have seen use it. Would love to see a KL divergence graph for different quants of the same model.
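For anyone who wants to roll their own numbers, here is a rough sketch of what a per-token KL divergence comparison could look like. It assumes you have already dumped logits for the same token positions from a reference model (e.g. fp16) and from the quant; the function names and the toy logits are made up for illustration and are not taken from any existing tool.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; p is the reference, q is the quant."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)


def mean_kl(ref_logits, quant_logits):
    """Average KL over all token positions."""
    kls = [kl_divergence(softmax(r), softmax(q))
           for r, q in zip(ref_logits, quant_logits)]
    return sum(kls) / len(kls)


# Toy example: two positions, vocabulary of three tokens.
ref = [[2.0, 1.0, 0.1], [0.3, 0.2, 3.0]]
quant = [[1.8, 1.1, 0.2], [0.4, 0.1, 2.7]]
print(f"mean KL: {mean_kl(ref, quant):.5f} nats")
```

Unlike perplexity, this looks at the whole output distribution at each position instead of only the probability of the one "correct" token, so it picks up changes even when the top token stays the same.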
File: KL-divergence_quants.png (111 KB, 1771x944)
Have at you scoundrel!
So, Chameleon any good? Is it more heavily censored than llama 3 is? I know it can't output images currently, but can it at least understand what it's looking at on input pretty well? I'd just like an honest opinion of how it functions as is, and skip the wall of text about jews/trans/conservatives/miku/whatever
File: file.jpg (38 KB, 450x337)
Damn that was fast, thank you anon.
>6bpw is totally almost lossless people claim
>it's like an inch above the 0 line
File: new_i_quants.png (10 KB, 792x612)
I make a point of saving these when I see them exactly so that I can share with people.

Kind of nuts isn't it?
File: 00042-4080471795.png (1.28 MB, 1024x1024)
I have been using an 8.0 bpw exl quant (rpcal lol)
No problems other than very occasional repetition that can be solved with a re-roll. I do not use rep penalty, because the brain damage is not worth it IME.
Has anyone tried pushing this model past 32k ctx for RP?
Ancient laptop anon here. I tried the new Llama3 8B models and the results are a bit underwhelming (usecase RP/ERP). In fact, I found 7B undislop models to perform better? Maybe I'm doing something wrong. The 8Bs seemed rather inconsistent and uncreative. The models I tried are Soliloquy-8B and Sunfall Abliterated-8B. Instruct: Llama3, Samplers: smoothing 0.2-0.3, temp 1, minP 0.1, repPen 1.1. I have also tried Best Guess and Universal-Creative, but the results are the same. What am I doing wrong? Or are the 8B finetunes just not mature enough yet? To clarify, I'm trying to RP with a robot and these models completely ignore that. Probably need some tard wrangling advice...
It's not that you didn't publish a paper showing a thorough comparison between all the models and quants. It's that the models you tested have little to nothing to do with each other. The tess vs qwen test kinda makes sense. Two tess failed, one qwen got it. THAT is a data point. Tess finetune affected the model adversely for your test. Good. That's a starting point. As for the rest, the best we can say is 'sometimes _S gets it, but i haven't tested the others'.
You still haven't said anything about the outputs being deterministic or, if not, how many times you ran the tests with each model.
And I didn't call you a retard. Chill.
Try L3 8B Stheno 3.2 (or whatever the latest version was)
Try Stheno 3.2. It's generally the best fine tune for llama 3 8b I've found so far.
"better" is subjective as fuck in this context, of course, so your mileage may vary.
Also, iterative-DPO can work well if you are not trying to do anything that requires consistent smarts, from my experience at least.
I'd drop smoothing curve and try a little lower temp.
this, unironically
Thanks. Do you mind posting appropriate instruct/samplers?
At 4.65bpw it was very repetitive, and overall it felt even more stupid than Euryale.
I swapped my Mikubox to all P100 16GB PCIe internally, leaving the external 3090s. Despite having to add a thermocouple and PWM channel to my fan controller, and also make a custom power cable for the P100, everything worked
:~$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-e2f8cd06-2c7d-accc-728b-62eef1627809)
GPU 1: Tesla P100-PCIE-16GB (UUID: GPU-7da63f72-d5a2-dadb-247a-3880060c84b6)
GPU 2: Tesla P100-PCIE-16GB (UUID: GPU-40205c56-3989-a682-17b2-c2ea90f70e5e)
GPU 3: Tesla P100-PCIE-16GB (UUID: GPU-6537af5d-1095-8402-6c50-d8d9d5afa9b5)
GPU 4: NVIDIA GeForce RTX 3090 (UUID: GPU-34724105-36dd-23ca-3a77-083008f640ec)

Now, last I checked (last week) exllamav2 had a bug with flash_attention and GPUs older than Ampere, so that might be a blocker still.
It's all mentioned here
0 does not exist on logarithmic scale
Temps look good:
NTC 1 temp: 32.75
NTC 2 temp: 32.53
NTC 3 temp: 33.24
PWM %: 30
PWM value: 716
PWM %: 30
PWM value: 716
PWM %: 37
PWM value: 644

The die temps are higher, of course, as I'm reading off the heatsink at the exit, so my code ramps up the fans at a much lower temp than the die temp. It's really just to keep the fans extra quiet at idle, not that they are really loud at 100%.
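A sketch of how a controller like this could map heatsink temperature to a PWM register value. The only part grounded in the log above is the percent-to-value relation: 30% → 716 and 37% → 644 both fit an inverted 10-bit duty cycle (1023 × (100 − %) / 100), which is an inference from those two readings, not the anon's actual code. The ramp thresholds and function names are made up for illustration.

```python
PWM_MAX = 1023   # assumption: 10-bit PWM timer
MIN_PERCENT = 30  # idle floor seen in the log
T_LOW = 33.0      # hypothetical: start ramping above this exit temp (C)
T_HIGH = 45.0     # hypothetical: 100% fan at or above this exit temp (C)


def fan_percent(temp_c):
    """Linear ramp from MIN_PERCENT at T_LOW to 100% at T_HIGH."""
    if temp_c <= T_LOW:
        return MIN_PERCENT
    if temp_c >= T_HIGH:
        return 100
    frac = (temp_c - T_LOW) / (T_HIGH - T_LOW)
    return int(MIN_PERCENT + frac * (100 - MIN_PERCENT))


def duty_to_pwm_value(percent):
    """Inverted duty cycle: higher fan percent means a lower register value."""
    percent = max(0, min(100, int(percent)))
    return (PWM_MAX * (100 - percent)) // 100


for pct in (30, 37, 100):
    print(f"PWM %: {pct} -> PWM value: {duty_to_pwm_value(pct)}")
```

Because the sensor sits on the heatsink exit and reads well below the die, the ramp has to start at a low threshold, which is the design choice described in the post.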
ln(1) = 0 ?
Thanks, will try.
As I mentioned, I'm mostly aiming for character adherence and good quality prose/creativity (not "whispered in a hushed whisper"). But I know I shouldn't expect much from small models.
>consistent smarts
I'm doing casual RP, not some strict format, so occasional retardation is absolutely fine. But when 90% of responses are shit it becomes quite unbearable - hence the search for best models in this range.
>drop smoothing curve
So something like 0.2 smoothing and 0.75 temp?
As in, don't use smoothing curve, just go raw temp and minP, maybe a tad of rep pen, although I'd remove that when first testing the model also.
it's a logarithmic scale retard
That's pretty dope.
What are you using that for?
Just RP, agents, fine tuning, loaning compute?
Why did people stop training on top of the base models?
>line clearly descends
math is a joke
Expensive in compute and easier to fuck up than lora. But it doesn't matter all that much. Garbage in, garbage out. Most people that take up the mantle often use datasets so garbage it hurts to think about.
It's not like their shitty loras will turn out good anyway. If they really cared, they'd make a full finetune.
File: file.png (116 KB, 1140x698)
Which makes sense since I've been trying to find a model or models that serve my interests. So when one model doesn't, naturally I try a different lineage sooner than I download a half dozen related models at 2 minutes per GB, spending the time finding other stuff to delete to make room.

Settings are, or are close to, Kobold defaults, and at 45 minutes for a single try in some cases, I'm testing it like I would be using it: One shot and either it's right or I get misled.

There are plenty of people with powerful rigs who can do the science in seconds and actually know what's happening inside of the models and software. I'll leave it to the experts. I just want to be able to get >1t/s and get reasonable answers to my questions. And I've gone from <1 to 5 candidates that at least got music theory right.

(I haven't figured out how I will test coding, but one question I asked it while coding last week might work. It came up because the model was wrong, when I told it it was wrong it wrote a kluge that almost worked and did after I fixed one line. So maybe recreating that scenario if I remember the details will serve as a test.)
I think that eventually synthetic datasets will be the way to go. Too much time and manpower goes into creating organic datasets, which makes it only really feasible with large financial backing. If synthetic datasets can be used and refined to the point where they are on par with or better than their organic counterparts, it will vastly speed up the creation of datasets and improve their quality.
that is a woman and no chud will say otherwise
File: 1700588146330630.jpg (157 KB, 596x699)
I wouldn't be surprised if it was the Miku BBC spammer
File: basedrecs.jpg (48 KB, 430x474)
>envoid in my recommendations alongside migu and tetters
Based, the youtube algorithm is finally delivering
Can someone with a recent but shitty NVIDIA GPU please benchmark this PR vs master?
how shitty are we talking about?
i haz rtx 3060 how do i install this pr
File: 1695283474325669.png (42 KB, 376x499)
will it do?
File: 1664407945758958.jpg (32 KB, 480x601)
>go back home
>training script is kill
>hdd full, is all the 9001 training checkpoints
>delete all keep the last
>resume the training
>mfw the last checkpoint is corrupted cuz duh no space
File: 00024-1397236490.png (327 KB, 512x512)
Why would you save so many checkpoints?
are RP focused models just as good at narrative/storytelling or do i have to look for dedicated ones?
Something like a 3060 or 4060.

git checkout master, compile, run llama-bench, git remote add my fork, git fetch, git checkout johannesgaessler/cuda-mmq-stream-k-2, compile, run llama-bench.

No sorry, I want data for Turing or newer specifically.
are there any ERP finetunes of command-r? or good finetunes of it in general?
Is a 1050ti too shit for this?
File: Oof size.jpg (91 KB, 880x480)
It's too old.
>leaving your GPU running full blast while you're not home
You guys are crazy. I never do this, way too paranoid my house will burn down. Especially if you have multiple GPUs it's like leaving a space heater running.
>are there any ERP finetunes of command-r?
yes, it bad
>or good finetunes of it in general?
M-maybe he's not using deepspeed.
i thought it was a good idea in case of crash and for some random test
what's the best coomer model runnable on 24gigs vram?
kek, you might be able to recover something with some disc recovery software
I only put my tinder box in my tower because there's nowhere else to put it, don't judge me.
Ah just playing with larger models really.
>stuck with 2 3090 Ti
I'm so sorry.
I'll give you results in few minutes from my 3060. Compiling kernels takes quite a while on my 5600.
File: soyblonde.jpg (46 KB, 475x485)
>your fork
petrus@petraists:~/TND/justforyouCudaDev/cudaddy/llama.cpp$ LLAMA_CUDA_FORCE_MMQ=1 ./llama-bench -m ../../../models/Stheno-3.2-8b/L3-8B-Stheno-v3.2-Q6_K-imat.gguf -ngl 1000
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 1000 | pp512 | 1395.24 ± 7.92 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 1000 | tg128 | 42.91 ± 0.43 |

build: da1db13d (3185)

petrus@petraists:~/TND/justforyouCudaDev/llama.cpp$ LLAMA_CUDA_FORCE_MMQ=1 ./llama-bench -m ../../models/Stheno-3.2-8b/L3-8B-Stheno-v3.2-Q6_K-imat.gguf -ngl 1000
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 1000 | pp512 | 1371.40 ± 7.41 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 1000 | tg128 | 42.41 ± 0.79 |

build: a7854743 (3185)
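
A quick way to put the two runs above side by side. This is only a sketch: it assumes the first run (build da1db13d, posted under ">your fork") is the PR and the second (build a7854743) is master, and it treats the ± figures as standard deviations of independent runs. `compare_runs` is a made-up helper, not part of llama-bench.

```python
import math


def compare_runs(new_tps, new_err, old_tps, old_err):
    """Return (speedup in percent, difference in combined standard deviations)."""
    speedup_pct = (new_tps / old_tps - 1.0) * 100.0
    combined_err = math.hypot(new_err, old_err)  # errors added in quadrature
    sigmas = (new_tps - old_tps) / combined_err
    return speedup_pct, sigmas


# Numbers copied from the two tables above (RTX 3060, Q6_K, 8B).
pp_pct, pp_sigmas = compare_runs(1395.24, 7.92, 1371.40, 7.41)
tg_pct, tg_sigmas = compare_runs(42.91, 0.43, 42.41, 0.79)

print(f"pp512: {pp_pct:+.2f}% ({pp_sigmas:.1f} sigma)")
print(f"tg128: {tg_pct:+.2f}% ({tg_sigmas:.1f} sigma)")
```

Under those assumptions, prompt processing comes out roughly 1.7% faster at about two sigma, while the token generation difference of roughly 1.2% sits inside the noise.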

>>compiled with `LLAMA_CUDA_FORCE_MMQ=1 LLAMA_CUDA=1 make LLAMA_CUDA_FORCE_MMQ=1 -j12`
>>gpu: rtx 3060 12gb
Exllamav2 seems to have fixed the floating point error with my mixed CU setup, as well as making sure flash_attention is off when the GPU is older than Ampere.
LLaMA3 8B runs nicely on a single P100. Of course, no instant replies like with a 3090, but not bad. I'll stress-test it later this week with CR+, since that'll use all five GPUs.
If that's enough to run a quant of a 34B, then you could try MarinaraSpaghetti/RP-Stew-v2.5-34B. For lower than 34B, try
So did anyone confirm whether or not autocoder is actually better than codestral?
You can add -j 12 to the make/cmake command to compile with 12 threads instead of 1.
File: file.png (92 KB, 928x739)
Well at least it wasn't for nothing, you have entertained the masses with your poor decisions.
agi is impossible atm its just a pipe dream. agi doesn't need a prompt.
we're just trying to go for cat-level now get with the program
I don't even want AGI, I prefer just having a useful bot that does whatever the fuck I tell it to do.
I'd fuck with cat level.
>are there any ERP finetunes of command-r?
The base model is already horny.
>a cat is fine too
>hdd full
>delete all keep the last
Where's that meme for "You know where this is going because you've been there in a previous lifetime"?

Schools have got to start teaching the importance of keeping two levels of backups whenever digital storage is involved.
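
For the checkpoint situation specifically, a minimal sketch of one way to avoid ending up with a corrupted "last" checkpoint: write to a temp file first, rename it into place only after the write succeeded, and prune old checkpoints as you go instead of hoarding all of them. The file naming scheme and function names are hypothetical and not tied to any particular training script.

```python
import os
import tempfile


def save_checkpoint_atomic(directory, step, payload):
    """Write payload (bytes) to ckpt-<step>.bin via temp file + rename.

    If the disk fills up mid-write, the exception propagates and the
    previous checkpoints are left untouched.
    """
    final_path = os.path.join(directory, f"ckpt-{step:08d}.bin")
    tmp = tempfile.NamedTemporaryFile(dir=directory, suffix=".tmp", delete=False)
    try:
        with tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the data actually hit the disk
        os.replace(tmp.name, final_path)  # atomic when on the same filesystem
    except BaseException:
        if os.path.exists(tmp.name):
            os.remove(tmp.name)  # drop the partial file
        raise
    return final_path


def prune_checkpoints(directory, keep=3):
    """Delete all but the newest `keep` checkpoints; return the kept names."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    ckpts = sorted(name for name in os.listdir(directory)
                   if name.startswith("ckpt-") and name.endswith(".bin"))
    for name in ckpts[:-keep]:
        os.remove(os.path.join(directory, name))
    return ckpts[-keep:]
```

Keeping the last two or three rather than only the last one means a single bad file does not cost the whole run. It is still not a backup; a copy on another disk is.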
are you jealous cuda dev replied to me
If it doesn't exist in 3 places, it doesn't exist.
Looks like checking for compute capability is enough to determine whether or not the stream-k decomposition should be used.
WizardLM-2-8x22B-Beige.i1-Q4_K_S 12288 context, Vicuna format (or Mistral, looking at the merge ingredient)
Hosting for up to 8 hours.
Can put link in ST > Text Completion > KoboldCpp
File: 1707726926019429.png (31 KB, 317x277)
nta but yeah a little :(
File: hat.png (23 KB, 402x299)
comin out of my pocket money
I'm really sorry sirs, but I really had to do the needful. Please to kindly resolve the issue, thank you sirs.

Any other good 7B/8B models? Currently got the bandwidth to download, so trying to hoard as much as I can
is there something like comfy ui for llms?
ollama is the most intuitive one
Yeah, ComfyUI with a custom node.
ComfyUI is not intuitive, shill.
I'm liking Kobold.

Ollama is barebones and good enough for Babby's First Q&A. But it has a lot of problems: save state is broken by some common character sequences, their method of obfuscating model component files is lulzy and cumbersome, just typing into the terminal window fucks up on line wrap though maybe that depends on system.

After about a week you'll be ready to learn the technical details and to move on to Kobold or Ooba. (I didn't like Ooba but maybe it's better, that was a long time ago.)
I think you will feel more comfortable in the Kobold Discord.
Nothing supports it yet so no one knows.
According to Meta Paper it was trained on 5x as many tokens as L2.

