/g/ - Technology


Thread archived.




File: 39_04247_.png (1.03 MB, 896x1152)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101488042 & >>101474151

►News
>(07/18) Improved DeepSeek-V2-Chat 236B: https://hf.co/deepseek-ai/DeepSeek-V2-Chat-0628
>(07/18) Mistral NeMo 12B base & instruct with 128k context: https://mistral.ai/news/mistral-nemo/
>(07/16) Codestral Mamba, tested up to 256k context: https://hf.co/mistralai/mamba-codestral-7B-v0.1
>(07/16) MathΣtral Instruct based on Mistral 7B: https://hf.co/mistralai/mathstral-7B-v0.1
>(07/13) Llama 3 405B coming July 23rd: https://x.com/steph_palazzolo/status/1811791968600576271

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
►Recent Highlights from the Previous Thread: >>101488042

--Critiquing the repetitiveness of AI-generated content and reflecting on the appeal of erotic themes: >>101489564
--Wtf Fish Audio: Surprisingly Good Quality, but a Bit Slow: >>101490020 >>101496116 >>101496152 >>101496782
--Uncanny Valley in Erotic Text Generation: Overcoming Shiverslop with Data Diversity and Artificial Augmentation: >>101491690 >>101491779 >>101493848 >>101493894 >>101494179 >>101494246 >>101494626 >>101494952 >>101493921 >>101491884
--The Eternal Curse of GPT-isms in LLMs: >>101488157 >>101488303 >>101488469 >>101495367
--Seeking Small Local Text Model for Wording Improvement and Corpo Sugar: >>101494211 >>101494503 >>101494655
--Release of LimaRP-DS dataset and the sunfall-v0.5 model: >>101492700 >>101492792
--Mistral NeMo Instruct System Messages Rant and the Emergent Nature of AI Functionality: >>101492374 >>101492849 >>101492864 >>101492914 >>101492938 >>101493650 >>101492971 >>101493739
--Is Mistral FP8 Really Lossless?: >>101489057 >>101489085 >>101489630 >>101489120 >>101489221
--Purchasing a Server with 8xAMD Instinct Mi100 32GB GPU's for ?7000: Is It Worth It?: >>101490545 >>101490725 >>101490998 >>101491029 >>101491087 >>101491207 >>101491640 >>101492046 >>101492116 >>101490787
--Mistral Compatibility with llama.cpp: Unofficial Solution Available: >>101492182 >>101492241 >>101492318
--400b Model: Savior or Coffin Nail for Open-Source LLMs?: >>101490382 >>101490423 >>101490811 >>101493445
--Miku (free space): >>101492219

►Recent Highlight Posts from the Previous Thread: >>101488050
>>
>>101492328
I download random chars off the internet and try to speedrun consensual sex.
>>
Rin sex
>>
Do (you) think 2025 will be the year the various LLM companies start rushing multimodal models, or do you think they will stick with improving LLMs that year?
>>
>>101497330
I just want CLIP tiling. Image generation is retarded.
>>
File: femslop.png (260 KB, 488x656)
>>101497205
here is an excerpt of a sex scene in haunting adeline
women get off to this
shiverslop is steinbeck in comparison
>>
>>101497391
It's not about the exact writing, it's the interaction/attention to what you're saying that makes it so good.
>>
>decide to finally check out all the cards that exist for a certain popular character
>literally not a single one of them is good
>the one that's tryhard and has more lore details is full of ESL mistakes that make it hard to understand
Jesus.
>if only you knew how bad things really are
>>
>>101497391
Model?????
>>
>>101497458
Stop bitching and make your own cards
>>
>>101497330
The big companies like meta. Most won't fall for the meme
>>
>>101497516
Human-100B
>>
File: 2hzG.gif (727 KB, 500x284)
I'm at ICML, if anybody wants to meet up
>>
>>101497458
>using other people's characters
Why?
>>
>>101497526
I do though.
>>
>be you
>spend 2000$ on gpu to make big booba anime girl
>be me
>spend $0 to look up big booba anime girl, somehow looks better
I think you've all been cheated.
>>
>>101497605
wrong general
>>
>>101497605
I can talk with mine.
>>
>>101497605
>be you
>spend 2000 hrs looking for specific big booba anime girl in specific pose
>be me
>spend 1 hr generating big booba anime girl in that pose
i think you've been cheated
>>
>>101497605
If you were a true gamer you wouldn't have that problem
>>
>>101497576
I was just curious what the state of things was as I never actually bothered to conduct a full read of the landscape. I mean I knew things were bad but I feel like there should've been at least 1 good card out of the 6 that popped up.
>>
>>101497636
>1 hr
wtf anon
>>
>>101497574
what is that?
>>
>>101497702
"I Cum to Machine Learning"
>>
>>101497691
put effort into your sloppa
>>
>>101497560
don't be harsh, women have at least 150b
>>
>this is not an up-merged 70b.
https://huggingface.co/PrimeIntellect/Meta-Llama-3-405B-Instruct/discussions/1
well vram chads?
>>
File: neurons.png (35 KB, 695x291)
>>101497803
>>
>>101497856
sigh *unzips vram*
>>
>>101497996
Nb is the number of parameters, not neurons
each "artificial neuron" has multiple parameters
>>
>>101497996
Biological neurons are way more complex than simple MLP weights.
>>
>>101497856
Just test the 8b, "vocab_size": 128256
>>
Comparing biological neurons to parameters in a digital neural network is ridiculous.
>>
https://huggingface.co/mradermacher/Meta-Llama-3-405B-Instruct-Up-Merge-GGUF/tree/main
>only q8
aaaaaaaaaa
i could technically fit it on my external drive but trying to quant from there will take ages
>>
>>101498092
oops misread context size
>>
The new mirror is decent, but the difference between 8/9/12 and even 27 seems to be so fucking small that it makes me rather skeptical about where LLMs are heading. Or rather, I got so spoiled since the original Llama was released that I'm not even able to perceive the great steps we're seeing. I mean, I would have killed for functional 128k context a year ago, and yet now that I have it, it just feels kind of ok. I have officially become a retard.
>>
>>101498101
How else can we compare?
>>
>>101497148
More.

>>101496965
Really? Haven’t the worst offenders stopped bot making entirely?
>>
>>101498156
you don't
>>
>>101498107
>not having a beowulf cluster of multi-petabyte iomega drives
>>
>>101498177
There has to exist something else we can compare it with.
>>
>>101498101
Human brains are the ideal architecture for intelligence. If 100B fully connected parameters can't match the performance of a human brain, it's over
>>
>>101498156
We've been comparing CPU and brain processing speed for decades. It never made sense, and it still doesn't.
>>
>>101497856
Definitely seems like something is wrong with that config. 70B has 80 hidden layers. How could this have 10?
>>
What happens when you get really high and rp
>>
>>101498219
That's not my point. My point is that comparing digital to biological neural network by metrics other than their outputs is retarded. Just because they're called "neural networks" and have what we call "neurons" doesn't make them comparable.
>>
>>101498255
it neither has 80 nor does it have 10 layers you absolute mongoloid
>>
>>101498237
By all means, offer an alternative.
>>
>>101498308
>>101498255
well now it has none
>404
>>
>>101497996
Parameters do not have a fraction of the flexibility of neurons. This might eventually change with neuromorphic computing, but until that happens you will need many more parameters than neurons to compensate for their limitations.
>>
you did download it, right?
>>
>neuromorphic computing
>>
>>101498337
fuck no, i ain't wasting a tb on 8k context
>>
Neural networks are just smart onions
>>
>>101498319
I just did
>>101498305
>My point is that comparing digital to biological neural network by metrics other than their outputs is retarded.
By that I mean that what matters is the output. If the output is indistinguishable from that of a human and it can do exactly the same kind of processing we do, then we can start saying "this many parameters == this many neurons/synapses". But even then, better tech could show up that changes the ratio. So there's no reasonable comparison until we get to a ratio of 1. And it won't be a stable number either.
>>
File: 1707707160777151.png (54 KB, 749x136)
pffff, alright, i keked
>>
>>101498380
NTA
I still prefer the "binary number of decisions per second" and "max associations in working memory" as metrics for human cognition.
Biological neurons vs MLP weights doesn't make any sense. Real neurons have a location in space and *move.* They grow connections, have internal chemical state, respond to different stimulus frequencies differently etc. They're insanely complex and really nothing at all like the weights in ML models.
>>
>>101498513
I miss when fine-tuning was done out of the passion of having better bots and not to gain discord karma
>>
>>101497148
What a waste. All that text just to start every sentence nearly the same way with she and her just like an 8b. Did you ban all proper nouns tokens or something?
Also:
>Permit me
>Permit me
>Permit me
>Permit me
>>
File: 1645307010138.png (2 KB, 179x139)
If quantizing models to 8 bpw is virtually lossless, why don't model makers just train their models like that natively and cut off the fat?
>>
>>101498702
>If quantizing models to 8 bpw is virtually lossless
it's not
>>
>>101498702
I guess it's still very new. But that's what c.ai does for their models.
>>
>>101498728
yes it is
point to any piece of evidence that shows a perceptible difference between Q8 and full precision
>>
>>101498702
>virtually lossless
Long way to spell "lossy".
>>
>>101498702
stability issues, specialized kernels and complexity
if it was easy to do, everyone would be doing it, but 90% of the people who use llms don't know how to do anything that isn't already built in to whatever pipeline they're using
the big players are already using int8 kernels for training, they even have bitnet up and running
>>
>>101498328
Welp. I did at least download the config file so here it is. https://files.catbox.moe/sx8b38.json

>>101498308
There's a reason why I said "that config" and not "that model". And in this reality, the config lists the number of hidden layers as 10, or at least it did. Check the above link and the screenshot of the page I still had sitting in my tabs.
>>
File: 1542502103649.jpg (35 KB, 400x400)
>>101498763
>>101498798
Neat.
>>
>>101497246
how to render images/memes from text prompt, on my local GPU on windows, which link is for that?
>>
>>101498871
>>>/g/sdg
>>
Will roleplay probably be integrated with video when it starts being a thing?
>>
>>101498818
Also this is the safetensors index file. https://files.catbox.moe/xsdpkb.json
>>
>>101498219
The human brain has 100T connections. Parameters in a model are parameters not neurons. One parameter is a connection between two neurons.
>>
>>101498975
*are connections
>>
>>101498513
>buy an a-ACK
>>
>>101498975
So GPT4 is 1% of the power of the human brain? Nice. We just need 100x that.
>>
>>101498818
This ehartford guy is a retard. He has never done anything useful and his wife is a nigger.
>>
im currently trying to setup chameleon30b but it seems overly complicated and i keep getting errors. i followed the guide on github but i keep getting this:
https://files.catbox.moe/i51xq0.txt
>>
>>101498954
Well this is odd. Now that I look at this and compare it to 70B's index file, the total_size is not within expectations.
https://huggingface.co/PrimeIntellect/Meta-Llama-3-70B-Instruct/blob/main/model.safetensors.index.json
16060522496 vs 141107412992
16 GB vs 141 GB
>>
>>101499027
>He has never done anything useful
uncencosred wizard vicuna, samantha dolphiin yea no he better than u
>>
>>101498975
No it's not. 1 parameter can take many parameters as input and send it's output to many other parameters. The neural nets we have now are more densely connected than the human brain. A single MLP neuron in LLaMa 8B has >2000 incoming and outgoing connections.
>>
>>101499048
I can make a trash model in 20 minutes too
>>
>>101498650
You're trying too hard. It's not organic.
>>
>>101499033
>16060522496
Corresponds exactly to l3-8b
https://huggingface.co/PrimeIntellect/Meta-Llama-3-8B-Instruct/blob/main/model.safetensors.index.json#L3
>>
File: 1710738955138884.jpg (80 KB, 760x980)
>>101498871
>>101498883
Or he could just use Kobold and stay here
>>
>>101499031
Your system is broken.
>>
>>101498702
If bitnet is virtually lossless why don't model makers just train their models like that?
>>
>>101499088
how do i fix it?
>>
>>101499031
>/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.2.2) or chardet (5.2.0) doesn't match a supported version!
Ah, good old version dependency hell.
>>
>>101499071
Oh kek. So I guess the weights were real but the guy somehow only got the weights and none of the other files, so those were placeholders? I don't think a company (the uploader) would try to do fake joke/scam uploads at least.
>>
>>101499054
>1 parameter can take many parameters as input and send it's output to many other parameters
what the fuck are you trying to say
>>
>>101499094
Because it takes months to do the pretraining for base models and the bitnet paper came out in the middle of the latest slew of releases. So we might not see serious bitnet models until late fall.
>>
>>101499083
is that image ai
>>
>>101499114
Are you not in a venv? On my machine pip won't even run without a venv.
>>
>>101499116
a single weight gets >2000 other weights as inputs to a non-linear function, then sends that output to >2000 other weights
a single neuron can have up to 15k connections to other neurons, but you don't need all of those for a good enough approximation
>>
>>101498702
The other answers don't know what they're talking about. The actual reason is that the final weights are determined by billions of tiny nudges accumulated over many training steps, and while the final weights may only need 8 bits of precision, the small nudges during backpropagation, which require higher precision to be represented, are substantial enough when added up to make a difference
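To make the "small nudges get lost" point concrete, here's a toy sketch (mine, not anyone's actual training code; float16 stands in for an 8-bit format since numpy has no fp8 dtype):
[code]
# Accumulating tiny gradient "nudges" directly in a low-precision weight loses
# them entirely, while a higher-precision accumulator keeps them.
import numpy as np

nudge = np.float32(1e-4)   # one small update, below float16's step size near 1.0
steps = 10_000

w_low  = np.float16(1.0)   # weight stored in low precision
w_high = np.float32(1.0)   # weight stored in higher precision

for _ in range(steps):
    w_low  = np.float16(w_low + np.float16(nudge))  # rounds back to 1.0 every time
    w_high = w_high + nudge

print(w_low)   # ~1.0: every nudge vanished
print(w_high)  # ~2.0: the same nudges, accumulated in fp32, add up
[/code]
Which is part of why training setups keep master weights / optimizer state in higher precision even when the stored weights end up low-bit.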
>>
>>101499143
still gibberish, you're confusing weights, neurons and activations
"2000 weights are used to weigh 2000 input activations (from 2000 other neurons), calculating an activation through a non-linear function, which in turn gets sent to 2000 other neurons"
the number of weights per neuron is still 2000 in this example, which doesn't contradict what >>101498975 said
>>
>>101499114
pip install wont allow me to update either of those dependencies, so im not exactly sure what to do
>>101499136
im not in a venv, no
>>
>>101499187
1 parameter is not equivalent to a connection between two neurons, dumbass
>Parameters in a model are connections not neurons
No, they're neurons. They're designed to be approximations of neurons, and a connection between two parameters is an approximation of a connection between two neurons.
>>
>>101499196
You can't do this stuff outside a venv. You'll want to kill yourself if you try.
>>
>>101499196
>pip install wont allow me to update either of those dependencies
Is there a requirement file? If so, you should check the specific package versions in it.
It could be that you have too new a version instead of too old, or that you need a specific version for a given parameter.
>>
Best settings for Nemo? Neutralizing samplers, setting min-P to 0.001 and temp to 1 works for most cards, but it struggles with others. Anyone found an optimal config? Running rep penalty at 1.06 and DRY sampling at 0.8 - Context and instruct templates set to Mistral.
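For reference, this is roughly what those numbers look like if you hit a llama.cpp-style server directly instead of going through ST; field names are from the /completion endpoint as I remember them, so verify against your backend, and note DRY isn't in mainline llama.cpp, it's a frontend/fork thing:
[code]
# Hypothetical request using the settings from this post; adjust the names and
# endpoint to whatever backend you actually run.
import requests

payload = {
    "prompt": "[INST] Write one short paragraph of scene narration. [/INST]",
    "n_predict": 200,
    "temperature": 1.0,       # drop toward ~0.3 if it goes off the rails
    "min_p": 0.001,
    "repeat_penalty": 1.06,
}
r = requests.post("http://127.0.0.1:8080/completion", json=payload)
print(r.json()["content"])
[/code]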
>>
>>101499054
A parameter is a weight describing the strength between two neurons. Your words do not align with this fact.
>>
>>101499240
>temp to 1
Isn't the official guidance to use temp of 0.3?
>>
So if I wanted to update llama.cpp, how would I go about it? It doesn't seem as simple as a git pull, because it would have to be re-compiled, right?
>>
>>101499254
If the card isn't trash, I've found temp 1 seems to work just fine. Lowering it too much seems to introduce more GPT-isms.
>>
>>101499226
how do i run this in a virtual environment then?
>>101499230
i dont see a requirements file anywhere. has anyone here actually run a multimodal model successfully before?
>>
>>101499031
Try to reinstall docker compose. The version from my package manager is 2.29.0, and that log says 1.29.2?
>>
>>101499275
>has anyone here actually run a multimodal model successfully before?
no
>>
>>101499275
Give me the link to the repo you are running.
>>
>>101499258
Why not pull a fresh copy?
If you used any of the Python AI projects, you know you never update those fuckers and pull clean or else everything goes all Python everywhere and you regret the invention of the transistor till you pull fresh.
>>
>>101499223
nigger you don't get to redefine what parameter means to win internet arguments
when meta says that their model has 70b parameters, they mean that it has 70 billion weights, not 70 billion neurons
the number of weights associated with each neuron is the number of its input neurons (+ 1 bias, usually)
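Quick sanity check of that counting with a single toy layer (sizes made up, just to show the ratio):
[code]
# One fully connected layer: n_out "neurons", each with n_in input weights + 1 bias.
n_in, n_out = 4096, 4096

params  = n_out * (n_in + 1)   # weight matrix plus bias vector
neurons = n_out

print(neurons)           # 4096
print(params)            # 16781312
print(params / neurons)  # 4097.0 parameters per neuron
[/code]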
>>
>>101499281
it wont let me update it with pip for some reason.
>>101499297
https://github.com/facebookresearch/chameleon
>>
>>101497996
that's fake news, it's an old myth taken from a dirty ass. The human brain doesn't have 100B neurons, only about 86B, and the cerebral cortex is only 16B. Most neurons don't contribute to thinking.
>>
>>101499240
Nemo is super creative. Like you said 1 works for most but turning it down can help when it goes off script.
>>
File: 93379188.png (404 KB, 684x630)
>>101499384
>Most neurons don't contribute to thinking
Maybe yours don't
>>
>>101498975
which contribute to shitting, sleeping, eating, digesting, breathing, jerking, moving and a whole host of stuff you don't need in an artificial brain.
>>
>>101499430
Interesting theory but if I were a brain I would optimize most of that into a very small part of my total computation power.
>>
Y'all, the numbers don't matter, only the effectiveness.

Use metrics that matter like if the LLM gets answers to serious questions right, makes up fun lies in role plays, and would rather play a game of chess.
>>
I have an EPYC 7713 and 512GB of DDR4 2400MT/s ECC registered RAM. is this good enough to run any decent sized models on CPU or should I just stick with running on my GPUs?
>>
>>101499430
>artificial brain doesn't require jerking neurons
I hate that people like you work on the highest level of these things.
>>
I can't believe I fell for the nemo meme
>>
>>101499438
There is an entire section of your brain dedicated to keeping your internals running, and another section dedicated to regulating hormones to keep everything on a fixed schedule
At least a third of your brain (the non-neocortex part) is not needed, and not even the entire neocortex is necessary
>>
File: 39_04641_.png (1.71 MB, 896x1152)
What's your daily driver these days /lmg/?
>>
>>101499438
>if i were a brain
who's gonna tell him?
>>
>>101499464
Ok I stand corrected. We’re now down to 1/3rd of 200T so ~70T.
>>
>>101499465
For code, L3 and Deepseek 33.
For RP, kinda whatever, switching around to keep things kinda fresh. No one thing seems to earn the chef's kiss. Feels like every model can be diamond one session and charcoal the next.
>>
>>101499492
how the fuck did you go from 100T to 200T
>>
>>101499425
that's correct. My brain thinks mostly in lobus frontallis unlike yours, which right now, most likely uses visceral nervous system that contribute to defecation.
>>
>>101499499
Meds now.
>>
>>101499465
claude 3.5 lol.
If I use local its 27B though.
>>
>>101499511
oh this is an LLM isn't it
which model?
>>
>>101499517
based, but isn't it less creative than 3?
>>
>>101499464
>>101499479
Oops I misread. I thought you said we only use a third but you said we don't use a third. So we're at ~140T parameters actually.
>>
>>101499523
You’re mixing brain neuron count (100B) with brain neurons connection count (200T). We’re talking about the fact models should be compared to the latter.
>>
>>101499560
why is your math so fucked
we started with 100T >>101498975 1/3rd not being used means we only need 66T
where did you even get 200T from? you're the first person ITT to bring up that number
>>
I have 6 4060TIs, totaling to 96gb of vram. What is the best model that I can run?
>>
>>101499569
Fuck me. Im gonna stop talking now.
>>
>>101499325
>it wont let me update it with pip for some reason.
Install/update it with your OS's package manager instead.
>>
>>101499450
Quad-channel? Meh. Octo-channel? Ok-ish.
>>
>>101499623
one of the L3 70B tunes
unfortunately there's not many of them to choose from, Instruct is pretty good on it's own though
>>
>>101499315
I was hoping to avoid that because Cuda llama.cpp takes a good 30 min to compile even with a fast pc
>>
Okay for some reason ST (or KoboldAI) has begun to reprocess the whole context on every single message...

No fucking clue.
>>
>>101499784
Use the -j flag.
>>
>>101499758
So don't compile all the binaries?
>>
>>101499758
Use ccache and a small change takes seconds to recompile.
>>
>>101498790
>>101498702
>lossy
There is always a chance of losing some bonds and shivers.
>>
>>101499494
>diamond one session and charcoal the next
This. Not one model has all the strengths (large context, creativity, less slop/purple prose).
I'm liking the speed and context for mistral-nemo but it still has plenty of flaws.
>>
>>101499677
yes, it is octo channel
>>
>>101497391
The main difference though is that there is actually structure here; it's not just repeating sentences before moving forward a bit and building up to something. Everything is succinct and moves the plot forward even if the vocabulary is worse. Some of the LLMs are getting there, but they aren't there yet.
>>
File: 1680276445596838.jpg (181 KB, 1024x768)
>>101499915
Take a nice Q6-Q8 gguf of WizardLm 22x8 and/or CR+ for a spin and see how you feel, just make sure to configure llama/kobold right, CPU inference needs every optimization you can squeeze out of it.
>>
It's funny to see that CoT is so deep-fried in most models that a question like:
>Two cars are travelling towards each other at 30km/h, at the start they were 50 km from each other. How far apart were they at the moment of the collision?
Causes them to do a bunch of useless calculations when the answer is obvious from the start. This causes some models to even get the wrong answer at the end.
>>
>>101500054
god I love that lora
>>
>>101498294
You coom. I speak from experience.
>>
>>101500054
and what if I also have 7 4060tis in addition to my 64 cores and 512gb of RAM? does that change your recommendation?
>>
>>101500107
Spiritual experiences may also happen
>>
>>101499786
I checked the API and realized ST is exceeding the max token count when sending in the prompt...
>>
File: 1720944611738558.jpg (89 KB, 900x750)
Just COOMED to Mixtral Nemo. Here are my thoughts.
>Mistral-Nemo-Instruct-12B-exl2-8.0bpw using ooba/ST
>70k tokens
>Mistral context template
>Mistral instruct presets
>temp 0.5
>minp 0.02
>rep penalty 1.2

So far so good. Language feels pretty natural and mostly unslopped with a few exceptions. Followed my card well. It's got good spatial awareness and is completely uncensored. Pretty smart, although I can't make a definite determination on where it stands because I haven't used any 70b models before. It is most definitely smarter than llama 3 8b and Mixtral 8x7b though.

I did notice that it started to become a bit dumber the longer I prompted it. Got to about 30k tokens before I stopped. It wasn't a terrible decline in intelligence and memory, but definitely noticeable. I also noticed that the longer you go, the more its likely to repeat past messages almost verbatim. I'm unsure if this issue is just a problem with the model, or the presets and/or samplers I'm using. I've read that the exl2 quants have issues with them similarly to the llama.cpp ones. Hopefully they get ironed out quick.
>>
File: 1682729528395.png (1.25 MB, 1024x1024)
>>101500130
At that point you'd generally be better off running lower quants on GPU then, the only reason for a setup like that to use CPU inference is to go full madman and max out 128k context on fat models.
>>
Video generation could revolutionize graphics and video editing. You'd be able to gen animations
>>
>>101500243
Solved by force ST to use the api tokenizer. Though I've never had to do that before, what the fuck.
>>
>>101500271
interesting. the reason I bought the EPYC was for the 128 gen 4 PCIe lanes so I could run a shit ton of GPUs at max speed. I was just curious to see how capable my CPU would be by itself.
>>
File: 6086.png (90 KB, 450x274)
kek?
>>
>>101500287
And we're back to reprocessing the whole thing. I've no idea. I even dialed the context back a bit just to force it down in size but nope.
>>
>>101500262
how muh vram it takes?
>>
>>101500331
This is what Yann LeCunn envisioned when he invented Transformers
>>
>>101500384
70k context takes about 23gb of VRAM. Could probably fit more with 8bit cache.
>>
>>101500399
did you do any experiments with optimization? is that true nemo doesn't fit into 24 GB in fp/bf16?
how muh vram kv cache quantization would save? I guess you could quantize like kv 8/4 or even 4/4 in llama.cpp provided it works as intended . Any thoughts?
what was the largest context you was able to throw in? is 30k the limit for retardation?
>>
File: smugtommy.jpg (22 KB, 322x294)
>>101500331
>Rule 34
What's the big deal?
>>
>>101499462
It is ok. You just have to wait until the loader gets fixed. Of course by that time 3 new models will drop and nobody will be talking about nemo anymore.
>>
>>101500331
>>101500392
>>101500483
reddit called, asked where are you.
>>
>>101500262
>because I haven't used any 70b models before
Thank you for disclosing that you are incapable of providing any review. I wish more people did this.
>>
>try 8bpw 70b on cpu
>can't keep track of clothes within 2 posts with the same vocab of an 8b but it can do riddles better
why use anything but cloud honestly
>>
>>101500262
>Mixtral Nemo
>Mistral-Nemo-Instruct
>Mixtral
>Mistral
bro?
>>
>>101500262
Show logs
>>
>>101500577
GGUF is garbage for poorfags that objectively has worse outputs than EXL2 and if you can't run CR+ minimum fully loaded into GPU you should not hold any opinion on open source
>>
>nobody talking about 236B
I take it when 405B drops nobody is gonna be talking about it either? Was /lmg/ always just 100% vramlets with a few vramlets larping as vramchads?
>>
>>101500642
It's going to be weeks before llama.cpp supports 405b anyway.
>>
>>101500642
Motherfucker if you have 200 gigabytes of vram go fuck yourself and suck my dick
>>
gemma-2 9b and 27b with fixed pre-tokenization and an added iMatrix that contains many Japanese words.
https://huggingface.co/dahara1/gemma-2-27b-it-gguf-japanese-imatrix
https://huggingface.co/dahara1/gemma-2-9b-it-gguf-japanese-imatrix
>>
>>101500642
I tried it. It was worse than Gemma 2 27B.
>>
>>101500711
>t. IQ2_XXS
>>
>>101500642
Get OR to host the base model and I will post about it.
>>
>>101500595
>objectively has worse outputs than EXL2
I haven't seen anyone post comparisons. If you have them it would be welcome. I didn't get into Exllama since it didn't seem like it had anything over Llama.cpp but I'll transition if it does.
>>
>>101500676
I have 12 gb of vram, can I still suck it?
>>
>>101500758
i dont know about quality, but i do know that EXL2 is better optimized for multi-GPU setups. you should avoid GGUF if you have more than one GPU
>>
>>101500810
Go ahead, it's all yours my friend
>>
>>101500828
no. i won't. *crosses arms.*
>>
>>101500828
ok mr no default auto-splitting across the gpu and FA enabled by default
>>
>>101500595
gguf has better quantization performance than exl2. Do a KL divergence test and you will prove it. Don't use a base model, use an instruct model for the test.
>>
File: 1717151251648139.jpg (115 KB, 1280x989)
I finally gave Nemo a try (5BPW), my first impressions of it are... mixed.

I don't know if I'm doing something wrong, but the model is EXTREMELY sensitive to samplers. A high temperature or a high/moderate repetition penalty makes the model break the format constantly. And without any repetition penalty the model starts to repeat itself verbatim constantly.

Besides that, the model is very very dumb, but it's surprisingly good at writing ERP and feels like it has 0 positivity bias, I'm seeing really novel shit in my mesugaki loli cards.

Parameters:
>Temp: 0.3
>Rep Pen: 1.2
>Format: Mistral
>>
>>101498513
>actually bought an ad
>>
Slightly different topic but given the current discussion I was curious to see if there's any quality differences (and thus issues) between offloading configurations on Llama.cpp, so I've done another KLD test. I originally did >>101465239 with all layers offloaded to GPU 1. For this new test I did the same model except now I have offloaded some layers to GPU 1, some to GPU 2, and some to CPU. The results are below. It's basically the same (there is a small dif but below margin of error).

====== Perplexity statistics ======
Mean PPL(Q) : 7.084136 ± 0.050764
Mean PPL(base) : 7.128723 ± 0.051077
Cor(ln(PPL(Q)), ln(PPL(base))): 99.58%
Mean ln(PPL(Q)/PPL(base)) : -0.006274 ± 0.000660
Mean PPL(Q)/PPL(base) : 0.993745 ± 0.000656
Mean PPL(Q)-PPL(base) : -0.044587 ± 0.004703

====== KL divergence statistics ======
Mean KLD: 0.017832 ± 0.000251
Maximum KLD: 13.449598
99.9% KLD: 0.899704
99.0% KLD: 0.191979
Median KLD: 0.005399
10.0% KLD: 0.000041
5.0% KLD: 0.000007
1.0% KLD: 0.000000
Minimum KLD: -0.000023

====== Token probability statistics ======
Mean Δp: 0.126 ± 0.011 %
Maximum Δp: 95.268%
99.9% Δp: 36.992%
99.0% Δp: 12.519%
95.0% Δp: 5.049%
90.0% Δp: 2.773%
75.0% Δp: 0.475%
Median Δp: 0.000%
25.0% Δp: -0.402%
10.0% Δp: -2.495%
5.0% Δp: -4.575%
1.0% Δp: -10.820%
0.1% Δp: -25.043%
Minimum Δp: -93.939%
RMS Δp : 4.006 ± 0.042 %
Same top p: 94.765 ± 0.059 %
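For anyone skimming these dumps: Mean KLD is the average per-token KL divergence between the reference model's next-token distribution and the quant's, so lower means the quant's predictions are closer. Toy illustration of the quantity being averaged (not the llama.cpp code itself):
[code]
# KL divergence between two next-token distributions over the same 3 tokens.
import numpy as np

def kl(p, q):
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    return float(np.sum(p * np.log(p / q)))

full  = [0.70, 0.20, 0.10]   # reference (e.g. fp16) probabilities
quant = [0.68, 0.21, 0.11]   # slightly perturbed, like a decent quant
print(kl(full, quant))       # ~0.001 - small, the distributions nearly match
[/code]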
>>
>>101500975
max context?
>>
File: 1719884616553296.jpg (503 KB, 1424x2144)
should i try a quant of qwen that fits on my 24gb card, or just stick with mixtral?
>>
>>101498513
What the fuck... the madlad actually did it
>>
>>101500676
go swing from a rope little buddy. your whore mother will be thankful
>>
>>101499494
>For code, L3 and Deepseek 33.
Is this better than the wizard 8x22b? I've had best luck with that for all uses but it's still not great.
>>
>>101501215
No, not even close. Just around 8k.
>>
>>101500975
>>101500262
I've just read on llama.cpp repo that flash attention degrades the quality of Nemo with the long context. could be the same in exllama?
>>
What is the new prompt format for nemo?
>>
>>101501250
Try Gemma 2 27B or Mistral Nemo.
>>
File: 1721421234867589.png (131 KB, 1355x470)
>>101501466
This one.
>>
>>101501483
but i keep reading from the thread that gemma 2 is a mixed bag that ultimately isn't as good for word sex as mixtral base and merges are
>nemo
i have it, but haven't loaded it, i'll give it a shot today
>>
>>101501499
What in the world were the french thinking?
>>
>>101501499
i refuse to believe that's the case since the order is fucked
>>
nemo 8x12 when?
>>
>>101501561
I don't know, the last time I used Mixtral was when I still had 24GB VRAM, before the Miqu leak, that model is obsolete to me. I have 48GB and I still play with Gemma 2 or Nemo currently. That anon was probably lying.
>>101501579
The system prompt goes in the last user message, and the official API adds an empty user message if the prompt doesn't start with one.
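In other words, roughly like this (my reading of the template; treat the exact whitespace and tokens as an assumption and check mistral_common if you need it byte-for-byte):
[code]
# Sketch: the system prompt gets prepended to the LAST user message, separated by
# a blank line; everything else is the usual [INST] ... [/INST] alternation.
def build_prompt(system, turns):
    # turns: list of (user, assistant) pairs, assistant=None for the final turn
    out = "<s>"
    for user, assistant in turns:
        if assistant is None and system:
            user = system + "\n\n" + user
        out += "[INST] " + user + " [/INST]"
        if assistant is not None:
            out += " " + assistant + "</s>"
    return out

print(repr(build_prompt("You are Rin.", [("hi", "hello there"), ("how are you?", None)])))
# '<s>[INST] hi [/INST] hello there</s>[INST] You are Rin.\n\nhow are you? [/INST]'
[/code]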
>>
>>101501608
2 weeks
>>
File: 1694382291150920.jpg (278 KB, 1080x1698)
Picrelated, coming soon in local meme!
It's already here tho, if we count mental gymnastics such as prompting the model to be evil or based while it tries to lecture you on some irrelevant bullshit and loses coherency as the message count goes up.
>>
>>101497246
Change the OP benchmark for programming for either this https://huggingface.co/spaces/mike-ravkine/can-ai-code-results or this https://prollm.toqan.ai/leaderboard/coding-assistant
These are much more recent
>>
>>101501855
it wouldn't matter on local
the instruction hierarchy is about setting precedence for obedience between OAI's backend identity prompts, system prompts used by services, user messages, etc.
on local you have full access to the prompt so you could set its core identity to be miku, anon's submissive coom slave and in theory it would actually stick better because it would be harder for it to get confused about its role over the course of many messages
>>
>>101501213
I have done another test now that I was curious about, which was varying -b and -ub. I did combinations of 2048 and 256. So ultimately there were 4 tests. And all of them came back the exact same. I tried this because I heard someone claim before that these parameters affected quality. That seems not to be the case at least in this test with my build version.

Something else I noticed about prompt processing. It seems that on a single GPU with full offloading, both flags at 256 is faster than the other combos, by around 4% compared to both at 2048. However, when doing offloading to two GPUs, both flags at 2048 was actually faster, and by 21%, surprisingly.

Now it'd be even more interesting if I had token gen speed statistics, but the KLD test doesn't happen to generate these, so I don't know how that would diff. But at least it seems that if someone is doing split GPU offloading, a larger value for both -b and -ub is beneficial for prompt processing. But this was a small dense model (L3 8B), and something as different as a large MoE model like Wizard could give different results. So many variables at play. Ultimately perhaps the defaults Llama.cpp comes with are fine for general setups and use cases.
>>
Anyone have a good line or two to prevent shit like out of character narration or
>What will {{user}} do next?
shit? I've tried a couple but they don't appear effective.
>>
>>101501629
willing to share logs for gemma?
>>
>>101501942
Drink your own piss to get exclusive™ access for /lmg/™ jailbreaks™.
>>
is boobabooga ded
>>
>>101501942
tell it to not say that.
>>
>>101501499
Huh? So all that separates the system message is two newlines? What if your user message and/or system prompt has two newlines (or multiple), how would the model know where the system prompt ends and where the user message starts? I mean I guess it could contextually "guess", but I imagine some cases where that would not be so easy to guess. This just seems unnecessarily confusing.
>>
>>101500828
(You)
> know
jack shit.
>>
>>101501855
this 4o mini is beyond cucked, I gave it a test run on translation and it lost to gpt 3.5 by far
>>
>>101501991
Ok, I drank it. Now give.
>>
File: angryayumu.webm (655 KB, 640x480)
>STILL no llama.cpp jamba
>>
found nemo
>>
https://huggingface.co/neuralmagic/Mistral-Nemo-Instruct-2407-FP8
It doesn't seem to output random Chinese characters with vLLM, like exllama sometimes does.
>>
VRAMlets, are you able to run Nemo at high contexts or are we still stuck at 8k? The VRAM calculator in the header doesn't support Nemo yet.
Will I be able to run it at ~32k context without a heady aphrodisiac of rivulets burning at the core of my 12GB VRAM?
>>
>>101502295
Never will be.
>>
>>101502390
W-were people using quants not FP8 even though mistral said to use FP8?
>>
Is DeepSeek lite good? I tried it to make code and doesn't werk
>>
>>101502469
It's like 16B. Of course it's not good.
>>
>>101502498
Guess I will delete it then, haha
>>
>>101502498
What makes the best code?
>>
>>101502541
corpo models
>>
>>101502554
Oh, so I have to pay?
>>
File: .png (9 KB, 643x55)
Having ended a session with Nemo, it's much better than Deepseek, but you still need to hold its hand a good deal and swipe a lot. It's not at 8x7b levels or at 70b+ capabilities where the model gets to the point that it knows what you're implying and the nuances of language, but for what it is, it's pretty impressive. It's the 13b that llama3 should have given. Finetunes on it will be pretty good.
>>
>>101502594
Gemma, I mean, There are too many of these models to go through.
>>
>>101502594
>It's not at 8x7b levels
Really? With all the hype it got at first you'd think it would've at least surpassed the old mixtrals.
>>
>>101500975
Update: WELP... This model is awesome.
It's dumb, but it writes interesting stories that keep you hooked. It doesn't shy away from anything, it moves the story forward. I almost feel like THIS is the C.AI soul local was missing.
>>
>>101502639
This really. Still feels obviously dumber for stuff the big models do but its soul makes it better for RP / creative writing than them anyways as long as you aren't going for some really complicated mechanics. Hope we end up with a larger mistral trained the same way.
>>
>>101502295
The bits are being assembled:
>https://github.com/ggerganov/llama.cpp/pull/8546
>https://github.com/ggerganov/llama.cpp/pull/8526
>https://github.com/ggerganov/llama.cpp/pull/7531
>>
I finally managed to get my Tesla P40 cards running. All it took was a single setting in BIOS.

> Above 4G Decoding

Without it, the system won't boot even to BIOS, and with it, everything works flawlessly.

10 tokens/sec on a L3-70B loaded onto one 3090 and one P40.
>>
>>101502594
what's that backend? how much mem does it take and what's your GPU? does the model go wacky beyond 30k input context as anons reported?
>>
>>101502611
how could a 13B possibly be better than a 50B MoE from the same company?
>>
File: file.png (59 KB, 648x295)
>>101502767
TabbyAPI. Running the full FP16 model at 64k context. Dual 3090s and it's only using 33GB of VRAM, I can definitely get more context if needed.

I've only done one session up to 38k context, and it was just fine. Samplers neutralized, temp to 0.9, smoothing to 0.2
>>
File: Kerfus.png (255 KB, 550x550)
TWO
MORE
WEEKS
>>
>>101502825
two more days until 400B and a surprise smaller model that will save local models
>>
File: .png (35 KB, 714x274)
>>101502821
>>
>>101497996
The 70B is the number of parameters, and each connection in a human brain has at least one parameter. So it's 100T, not 100B. Don't make it sound like we're still anywhere close.
>>
>>101502594
But is it at the level of an 8x7b that's been squeezed to fit on a 24gb card with 32k context?
>>
>>101502837
>inb4 llama-3.5 bitnet 3B trained on 6 gorillion tokens that is better than gpt-5 (with 8k context)
>>
>>101502767
>does the model go wacky beyond 30k input context as anons reported
Which anon said that? Its the first local model ive used that stays together over 32k.
>>
>>101502825
Miau!
>>
>>101502874
It's a bit dumber than 8x7b, and you need to fight or swipe enough times if you want its line of thinking to go a certain way. If you're a fan of spontaneous elements creeping in that still makes sense, it's great at that.
>>
llama3-400B-mini
>>
>>101502874
128K context.
>>
File: 405b 8 fucking k.png (52 KB, 1059x929)
>405b
>8k
Captcha: 4444
>>
>>101502914
ok so mistral > gemma 27b
mixtral > mistral
but mixtral sucks
and gemma is great
>>
>>101502964
More importantly, functional context. We had models that marketed themselves as 32k but were barely usable at 8k, or completely shit the bed at 16-20k and the output became total garbage.
>>
>>101502964
I thought Mixtral only had 32k
>>
>>101503014
32k was for Mixtral... he means the new Mistral.
>>
>>101503014
mistral nemo is 128k that works.
>>
>>101502874(me)
Trying again
Mistral-Nemo fits in 24gb at 8bpw with full context
Mixtral 8x7b has to be quantized to less than 5bpw to fit in the same 24gb of vram with full context
The assumption is that at the same bpw, Mixtral beats Nemo. However, will that change after Mixtral's been quantized to fit on the card?
>>
>>101503092
Remember that it has its own formatting. Dont just use the regular mistral one.

>>101501499
>>
>>101503092
Also, use a lower temp than you would with mixtral. Mixtral needed a high temp to be creative, nemo needs a lower one to not go off the rails. Though it can be fun with high temp depending on the card.
>>
>>101499758
Add -j <NUMBER OF CORES HERE> to the make/cmake call as described in the README in order to use multithreaded compilation.
>>
>>101503092
Mixtral 8x7b is at most 3.7bpw to fit into 24GB with 16k context.
>>
Finding the temp balance for Mistral is a challenge. But I like it better than Llama 3, finetunes, and Gemma. For anons with 12 GB of VRAM, it is pretty good, definitely an improvement over what we had.
>>
>>101503687
How high can you crank the context limit on it with 12GB?
>>
>>101503833
depends on quant
>>
>>101503833
With GGUF Q6 41/41 layers with no KV offload 256 Blas, I tried 64k, and it uses 10 GB of RAM, so with KV offload, I think you can go to 128k, no problem, since it does not take RAM. Without offloading, it will be maybe 16k, but then you can also quantize kv cache, which I have not tried. Anyway, with the setting, I have 6 t/s speed when I fill my context to 24 k on RTX 4070. For myself, that is fully acceptable, and I can see myself using it to a point where the speed goes down to 2 t/s. But for many, even 6 t/s is likely something no no.
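If anyone wants the napkin math for where the cache memory goes, here it is with my recollection of Nemo's config (40 layers, 8 KV heads, head_dim 128 - double check against config.json):
[code]
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim * context * bytes/elem
layers, kv_heads, head_dim = 40, 8, 128
ctx = 64 * 1024               # 64k tokens
bytes_per_elem = 2            # fp16 cache; ~1 for q8_0, ~0.5 for q4

kv_bytes = 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem
print(kv_bytes / 1024**3)     # 10.0 GiB at fp16, so ~5 GiB at 8-bit, ~2.5 GiB at 4-bit
[/code]
Which lines up with the ~10 GB figure above for 64k.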
>>
>>101504007
VRAM. fuck me.
>>
Does nemo work on llamacpp
>>
>>101503092
what do you mean by full context? 128k or what? did you quantize kv cache?
>>
>>101504007
Hmm, I've gotten pretty used to 35 t/s on Llama3 8B Q6, guess I'll have to see how slow I can handle. Sounds promising though, feel like 64k context or so would be the sweet spot for my stories
>>
>>101504119
One could try the smaller Quants. It is likely possible to run it in 32K context with better speed. For myself, if I ever get to a point where the speed or output sucks, I just summarize the story and continue fresh.
>>
Does anyone here use mistral's inference library?
>>
>>101501855
>sam altman says he might loosen filter for violence and sex
>actually finds a way to block jailbreaks
kino if true
>>
>>101502899
this one >>101500262
>>
Will Nemo Mistral be irrelevant in 1 week?
>>
https://xeeter.com/AlpinDale/status/1814814551449244058
>Have confirmed that there's 8B, 70B, and 405B. First two are distilled from 405B. 128k (131k in base-10) context. 405b can't draw a unicorn. Instruct tune might be safety aligned. The architecture is unchanged from llama 3.
>>
>>101502821
is that Vllm or exllama or what?
>>
>>101497246
Cohere's VP of research is named Sarah Hooker.
>>
>>101504944
Distilled from 405b? So they are completely different from the weights we have now?
>>
>>101499054
>1 parameter can take many parameters as input and send it's output to many other parameters
You retarded subhuman don't even know what these words you are using mean.
I recommend euthanasia.
>>
>>101499061
wizard vicuna was among the best at its time
>>
>>101504944
Distilled? Does it mean there is 0% "harmful" data in 70b and 8b models? Oh no no no...
>>
>>101499223
that's among the most embarrassing dunning-kruger gibberish i've read here.
>>
>>101504944
How big is the difference between gemma 9b and gemma 27b? I guess that's useful as a guess for the difference between distilled 70b and 8b
>>
>>101505101
I'm pretty sure gemma 9B is distilled from the 27B and it has plenty of harmful data, stop dooming about shit you don't know.
>>
>>101505150
Gemma isn't distilled. Both models were trained separately on different datasets.
>>
>>101499258
How did you compile it for the first time but now somehow are stumped on how to compile it a second time after pulling? Are you braindead?
>>101499315
you retarded subhuman, the other braindead subhuman is talking about llama.cpp.
I hope you computer-illiterate niggers are replaced by llms soon.
>>101499758
My cpu is 7 years old. Last time I recompiled it took 3-5 minutes maybe. It will only recompile the parts that need to. But even a full compile of everything shouldn't take half an hour. That sounds like bullshit, unless they've added tons of new unhinged bloat utilities.
>>
>>101505165
https://www.reddit.com/r/LocalLLaMA/comments/1dpwi3x/gemma_2_9b_model_was_trained_with_knowledge/
https://medium.com/@nabilw/gemma-2-knowledge-distillation-llama-agents-and-more-ai-updates-2ea4a409c1ba
https://huggingface.co/blog/gemma2
>According to the Gemma 2 tech report, knowledge distillation was used to pre-train the 9B model, while the 27B model was pre-trained from scratch.
come again?
>>
>>101505189
>pre-train
bruh
>>
>>101505209
>For post-training, the Gemma 2 team generated a diverse set of completions from a teacher (unspecified in the report, but presumably Gemini Ultra), and then trained the student models on this synthetic data with SFT. This is the basis of many open models, such as Zephyr and OpenHermes, which are trained entirely on synthetic data from larger LLMs.
https://huggingface.co/blog/gemma2#knowledge-distillation
bruh yourself
>>
>>101500061
your illiterate use of past and present tense probably also contributes to the confusion of the model
>>
>>101505220
so they were trained on different datasets
>>
>>101505165
>Gemma isn't distilled
>>101505235
...
>>
>>101505259
yes, >>101505189 implies they were not
>>
>>101505283
My argument is: a model pretty much fully distilled from a bigger one (Gemma 9B) can still have harmful data, contrary to what the doomer above was posting. I don't care if it's different from the 27B; that has nothing to do with what I said.
I never even mentioned the 27B itself being distilled.
>>
>>101505209
the pre-training is the most important part by far
>>
>>101505187
>you retarded subhuman, the other braindead subhuman is talking about llama.cpp.
Are there a lot of people here who talk like this, or is it just one who is very vocal?
>>
File: 1721564511981.jpg (422 KB, 1554x2176)
>48gb mini-vramchad soon
what's the best model that can fit on 2x24gb?
>>
To know that we'll soon have kobold-compatible ggufs for Nemo... it's a heady sensation.
>>
>>101505491
Gemma.
>>
>>101505516
There is already a branch that you can use for it.
>>
So, they retrained Llama3 8B and 70B but still couldn't come up with some intermediate size?
>>
>>101505582
A 23B model would have been nice and made the model lineup roughly follow a geometric progression in size up to 70B.
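The arithmetic behind that, since the endpoints 8B and 70B come from the lineup itself: the middle term of a geometric progression is just the geometric mean.
[code]
# Geometric mean of 8B and 70B ~= 23.7B, hence "a 23B model would have been nice".
r = (70 / 8) ** 0.5      # common ratio ~= 2.96
print(8 * r)             # ~23.66 (billion parameters)
[/code]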
>>
File: 19525689461a.jpg (89 KB, 400x400)
>>101505552
i get that you're trolling, but at least use something more convincing
>>
>>101505118
>How big is the difference between gemma 9b and gemma 27b?
27
-9
= 18
the age you must be to post here.
>>
>>101504944
>source: trust me bro
>>
>>101505445
I think it's just me, but I've not been very active here in the past few weeks.
>>
>>101505636
Nah, you'll understand when you try everything else in the 48 bracket.
>>
>>101498176
>More
yep there's the dopamine
I post logs now and again, not too often.
there was something about oral onahole saber that was hard to resist

>>101498650
Edited in, I wanted a recurring phrase that stuck in people's head, apparently it worked
>>
>>101505636
If you're an exlless P40 plebian he ain't wrong, Gemma 27B 8-bit roped to 16K is in the sweet spot between speed and accuracy for those.
>>
nemo vs phi 3?
>>
>>101506007
No, rope scaling doesn't work well with Gemma.
>>
If some anons here don't know, right now you can trialscul GCP for access to Claude models, $150 with no CC and $300 with CC (it's free credit anyway, they won't bill you).

https://github.com/cg-dot/vertexai-cf-workers

Should be useful for some dataset generation with 3.5 Sonnet to improve local models.
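A minimal sketch of what that generation loop can look like, assuming the official anthropic SDK rather than the cf-workers proxy (swap the client setup for Vertex if you're using the trial credits; the seed prompts and filenames here are placeholders):
[code]
# Generate (instruction, response) pairs with 3.5 Sonnet and dump them to JSONL
# for finetuning a local model later.
import json
import anthropic

client = anthropic.Anthropic(api_key="sk-ant-...")  # placeholder key

seed_prompts = [
    "Write a short, vivid scene description for a fantasy tavern.",
    "Explain KV cache quantization to a beginner in three sentences.",
]

with open("synthetic.jsonl", "w", encoding="utf-8") as f:
    for prompt in seed_prompts:
        msg = client.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=512,
            messages=[{"role": "user", "content": prompt}],
        )
        f.write(json.dumps({"instruction": prompt,
                            "response": msg.content[0].text}) + "\n")
[/code]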
>>
>>101506335
$150 with 3.5 Sonnet ($3/$15 for 1M) is enough for a couple thousand generations with decent context and output tokens.
>>
>>101506335
In this house:
-Claude is over-rated pajeet shit.
-You're mentally ill.
-'Tutoring' doesn't work.
-You need to fuck off back to /aicg/ you pathetic good for nothing locust.
>>
>>101506387
>-Claude is over-rated pajeet shit.
If you actually think this way, it's over for you. Have you ever tried 3.5 Sonnet? It's a god at programming and assistant tasks.
>>
>>101506387
>-'Tutoring' doesn't work.
Then why does the Phi series of models exist? Surely if you have a smaller more specific task, using 3.5 Sonnet to generate high-quality examples of output is going to help you improve a local model a lot, even if you don't have billions of tokens.
>>
>>101506398
>Then why does the Phi series of models exist?
microsoft pr
>>
>>101506410
So why is Phi-3 so good at a lot of tasks despite being so small?
>>
>>101506413
>So why is Phi-3 so good at a lot of tasks
which ones robert?
>>
>>101506185
Correct but for standard slop ERP 8-bit works fine enough upwards to 16K, though not always all the way.
For anything else I agree - stick to 8K.
>>
>>101506007
dual 7900xtx
will be 3 next month
>>
does nemo work on kobold yet?
>>
Is 5090/5080 going to be a significant upgrade? 4090 wasn't that much better than the 3090/80 for our uses but I'm far from an expert.
>>
>>101506425
There's no reason to do that when Nemo exists.
>>
>>101506466
I do not think so. These cards are mostly for gamers, and there is nothing really that would challenge them. At best we can hope for more VRAM in the 5070/5080 cards.
>>
>>101506742
That being said, not all is negative in the HW space. Intel and their CPUs are planning to do what Apple is doing with their chips, so if Nvidia keeps fucking us over, there may still be light at the end of the tunnel.
>>
Nemo 24B when?
>>
>>101506466
>>101506742
>>101506763
nvidia is a fat pig ready to be roasted. They're complacent and lazy because they literally create money. The demand vs supply is so insane that they probably don't even give a flying fuck if they lose the AI mega VRAM market share niche battle. And they will.
>>
Need help. So oobabooga's llama.cpp has a regular version and an HF version (just like exl2 and exl do). So when building CUDA llama.cpp (not ooba, just standalone llama.cpp), is it defaulting to HF samplers when I use it as a backend with SillyTavern as the front end? Is there a command I have to use when loading the model to make it use HF samplers? Or does it happen automatically when I load a model in a folder with HF samplers (like the ones you download off ooba)?
>>
>>101506827
Why are you using ooba anyway?
>>
>>101506862
He isn't.
>>
>>101506798
We still haven't got the instruct/base versions of the 22B model MistralAI used for Codestral and Mixtral 8x22B.
>>
>>101506878
Phew.
>>
>>101506799
>>101506763
Alright, but when?
Do I just snag 2 x 3080 in the meantime and be happy? Or do I wait it out?
>>
what's wrong with ooba
is the dev a pinko?
>>
wait 2 more months for the flood of cheap 32gb v100s
>>
>>101506926
No one can tell you that. It could be a year, 3 years, 5 years. If you need LLM coom now, then suck it up I guess.
>>
>>101506926
I would likely wait. The 30/70b models we have now are not that much better than the 12/9/8b models. But ultimately, it depends on what card you have now and if you think it is worth having a 10–20% better experience.
>>
>>101506997
>>101506971
Alright cheers mates
>>
>>101507132
>>101507132
>>101507132
>>
>>101506466
5090 is supposed to have at least 40% more memory bandwidth




All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.