/g/ - Technology


File: 1 (2).png (574 KB, 573x837)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101457504 & >>101449685

►News
>(07/18) Improved DeepSeek-V2-Chat 236B: https://hf.co/deepseek-ai/DeepSeek-V2-Chat-0628
>(07/18) Mistral NeMo 12B base & instruct with 128k context: https://mistral.ai/news/mistral-nemo/
>(07/16) Codestral Mamba, tested up to 256k context: https://hf.co/mistralai/mamba-codestral-7B-v0.1
>(07/16) MathΣtral Instruct based on Mistral 7B: https://hf.co/mistralai/mathstral-7B-v0.1
>(07/13) Llama 3 405B coming July 23rd: https://x.com/steph_palazzolo/status/1811791968600576271

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png (embed)

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
>>101464048
>(embed)
>>
BASED!
DEATH TO MIKU
>>
it's over...
>>
Hi!

I am getting very low tokens / second using 70b models on a new setup with 2 4090s. Midnight-Miqu 70b for example gets around 6 tokens / second using EXL2 at 4.0 bpw.

The 4-bit quantization in GGUF gets 0.2 tokens per second using KoboldCPP.

I got faster rates renting an A6000 (non-ada) on Runpod, so I'm not sure what's going wrong. Nvidia-SMI shows that the VRAM is near full on both cards, so I don't think half of it is running on the CPU.

Any advice is appreciated, thanks!
>>
new Mistral is pretty good - thoughts?
>>
>>101464216
Yeah I'll let you know in another 2 weeks when it has its kinks ironed out.
>>
>>101464216
waiting for kcpp support
>>
>>101464216
How long till non code mamba? Something around gemma 9B smarts with 256k context would be nice.
>>
>>101464246
Run it with tabby?
>>
Wait a sec. Guess I was not paying attention. Thought it was only a code version.
https://mistral.ai/news/mistral-nemo/
>>
>try L3 8B Q8
>blows Q6 out of the fucking water
Why do retards keep pushing this quantization shit? Anyone compare the raw 16 bit floating point to the 8 bit one?
>>
What is the best quant for a single 4090 with the new Gemma? I'm downloading Q5
>>
>>101464257
Good idea actually, thanks.
>>
>>101464356
The one that fits
>>
>>101464340
Pretty sure everyone started to say quants sucked when L3 came out. More dense model = greater effect of quantization.
>>
>>101464216
Did some basic tests using transformers and it seems decent so far, nothing that clearly shows it punching above its weight though.
Think the killer feature is going to be the high context. If that actually works it's a good model for low-end GPU text processing work.
Might hike P40 prices up even further since they can run a model this small at decent speed with plenty of room for context.
>>
>>101464449
Kind of waiting for backends to support it but I've seen people saying it's the first model to not go retarded at really high context.
>>
>>101464363
And that is?
>>
File: threadrecap.png (1.48 MB, 1536x1536)
►Recent Highlights from the Previous Thread: >>101457504

--Mistral-Nemo is surprisingly good and doesn't dodge dick: >>101457731 >>101457918
--Higgs Llama V2 Chart: >>101460454 >>101460512 >>101460550
--Nemostral Confirmed for 2MW Status but Custom Tokenizer is a Challenge: >>101460679 >>101460733 >>101460748 >>101460767
--Running Mistral Nemo on a 24GB GPU with Compromises: >>101458100
--Mistral Nemo Model Not Yet Running Due to Lack of Support and Tokenization Issues: >>101460321 >>101460333
--Hackers Leak Disney Slack Messages in Protest of Their AI Usage: >>101460812 >>101460871
--Gigabyte's RAM Revolution: 24TB DDR5 and AMD EPYC's 48 DIMMs Spark CPUmax Dreams: >>101459814 >>101459870
--Gguf faster than exl2 for 3D models with llama.cpp CUDA: >>101460948 >>101461004 >>101461165 >>101461469
--GLM4 Issues Due to a GlennM9 Bug: Slack Chat Log: >>101458669
--DeepSeek-V2-Chat-0628: Impressions and GGUF Availability: >>101458547
--Purchasing an NVIDIA Quadro RTX 8000 48GB for the VRAM: >>101462297 >>101462376 >>101462489 >>101462559
--Japan: Content Used to Train AI Has No IP Rights: >>101463339
--Trump allies want to “Make America First in AI” with sweeping executive order: >>101463644 >>101463836
--Miku (free space): >>101457780

►Recent Highlight Posts from the Previous Thread: >>101457519
>>
>>101464340
>>101464374
I didn't do any tests with lower quants, but there is likely practically no difference between Q8 and FP32 (BF16) even on Llama 3 8B. I did the KLD tests >>101243361, as well as manual testing >>101245221. If one believes that Q8 is significantly worse, they must provide proof. So far I have seen 3 posts make the claim but provide 0 proof. Meanwhile, it is much more likely that some mistake on the part of the user was made which degraded the quality, or they were seeing things they wanted to believe, or it was simply a coincidence (low sample size), or a combination of all of these factors.
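For anyone who wants to reproduce this, the rough procedure with llama.cpp's perplexity tool is below; binary and file names depend on your build, so treat it as a sketch rather than exact commands.

# save the full-precision logits once (the output file gets large)
./llama-perplexity -m llama3-8b-f16.gguf -f wiki.test.raw --kl-divergence-base logits-f16.bin
# then score a quant against those saved logits
./llama-perplexity -m llama3-8b-q8_0.gguf -f wiki.test.raw --kl-divergence-base logits-f16.bin --kl-divergence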
>>
Are there any tests of the DeepL LLM model yet?
It's only on their webpage and only in the paid Pro plan, but I wonder how good the translation is
https://www.deepl.com/en/blog/next-gen-language-model
>>
So mistral nemo... works on exl right now and nothing else right?
>>
>>101464582
vLLM should also get it "out-of-the-box". llama.cpp and everything downstream of that is what's going to be waiting weeks
>>
>>101464595
I couldn't get it to load via ooba in transformers earlier. But I think I just need to do a fresh rebuild of the conda environment I use for it. But I was running on transformers built from source so it should have worked.
>>
>>101464605
https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407#transformers
>NOTE: Until a new release has been made, you need to install transformers from source:
>pip install git+https://github.com/huggingface/transformers.git
>>
>>101464630
That's what I did. I built from source. But I was still getting the tensor size issue. My ooba environment is pretty old and still running on python 3.10 I believe, so I don't know if making a new one on a later python version would help at all.
>>
>>101464652
The exllama2 loader is compatible and worked for me when transformers didn't.
But I assume you already would have tried that if your GPU supported it.
>>
>>101464711
Yeah it loaded just fine with exl2. I normally run ggufs but I'm not going to bother until it's on the main branch.
>>
Any other good and updated benchmarks?
https://prollm.toqan.ai/leaderboard/coding-assistant
https://aider.chat/docs/leaderboards/
https://huggingface.co/spaces/mike-ravkine/can-ai-code-results
>>
>128k
It will derail and lose coherence at 16k-24k tokens anyway, even corpo models can't avoid this.
>>
>>101464817
This one doesn't. Mamba makes the difference apparently.
>>
>>101464216
It'll most likely be the go-to for poorfags.
>>
File: NalaMistralNemo.png (126 KB, 929x421)
Sloppiness aside, this is good at the Nala test in general let alone for its parameter count. Arthur confirmed as closet furry.
I did several pulls at different temperatures; this is t=0.81. It was still coherent and less sloppy at t=1, and it was kind of dry but actually followed the prescribed formatting at t=0.3, as recommended on the model card. In all 3 instances there was no anthropomorphism.
>>
>>101464904
Isn't Gemma the go-to for poorfags? Or are they for two different levels of poor?
>>
>>101464913
anyone who doesn't have at least 20 h100 clusters is poor and should be banned from this general
>>
>>101464911
that is complete fucking garbage, you'd have to be blind to think that's good
>>
>>101464939
You're mentally ill and nobody cares what you think.
>>
>>101464913
No way man, gemma is like, a whole week or two old. That's fucking ancient, what are you an old man?
>>
>>101464954
There is no reason to check /lmg/ more than once a week.
>>
ST's Command-r preset results in garbage outputs, and I mean actual garbage data, switching between languages, mashed-together words, random single letters, you name it. Is it worth actually figuring out how to set it up considering how fucking slow C-r is compared to mixtral or should I just give up and go back to slop city?
>>
>>101464911
Were you the guy the guy that supposedly got it working in ooba transformers last thread?
>>
>>101465072
No, using exl2
Also observations with further testing.. it struggles a bit with the concept of possession. Mathstral had the same problem but it's not as bad as Mathstral.
Also I don't know if it's an exl specific issue but if a tavern card is pretty lengthy (about 2k tokens) it seems to completely break the fuck down. (this is at 8.0bpw)
>>
>>101465002
I liked these better.
https://huggingface.co/Virt-io/SillyTavern-Presets/tree/main/Prompts/Command-R/v1.9
But Command-R isn't really that great now. Gemma 2 pretty much mogs it at smaller parameter count too.
>>
>create a camera that follows the player in love2d
>deepseek fail 10+attempts
>llama 70b fail 10+attempts

>turn on lunaris 7b q4 as a joke
>succeeds first try
>??????????????????????

the code for reference:
function love.load()
    love.window.setMode(1600, 1000, {resizable = true})
    -- Load the slime character image
    img = love.graphics.newImage('slime.png')
    background = love.graphics.newImage('test.png')

    -- Initialize the character's position and dimensions
    characterX, characterY, characterWidth, characterHeight = 0, 0, img:getWidth(), img:getHeight()

    -- Set the camera offset (distance from slime to camera center)

    screenWidth, screenHeight = love.graphics.getDimensions()
end

function love.keypressed(key)
    if key == "w" then
        characterY = characterY - 30
    elseif key == "s" then
        characterY = characterY + 30
    elseif key == "a" then
        characterX = characterX - 30
    elseif key == "d" then
        characterX = characterX + 30
    end
end

function love.update(dt)
    -- Update camera position based on slime
    cameraX = characterX
    cameraY = characterY
end

function love.draw()
    -- Center the camera to avoid clipping the slime
    love.graphics.translate(-cameraX + screenWidth/2, -cameraY + screenHeight/2)
    love.graphics.draw(background)
    -- Draw the game world (background, objects, etc.)
    -- (Assuming your drawing code is in this function)

    -- Draw the slime at the correct position
    love.graphics.draw(img, characterX, characterY)
end

I only added the img and background in love.load, idk why, I didn't even need to do that anyway. Black magic, folks
>>
File: 1694556554931922.gif (2.24 MB, 378x419)
been out of the loop for months now
what's a good local model currently for a 16GB VRAM card?
>>
>>101465162
>moot
Who?
>>
Now would I daily-drive Mistral-Nemo? Probably not... But it's definitely a decent option for VRAMlets.
>>
>>101465167
dunno, sounds like a gardening tool
>hey can you pass me over the moot?
>>
>>101465173
Working 128K context is really nice though.
>>
>>101464554
>>101464374
>>101464340
OK so I just did a KLD test for Q6_K L3 8B.

====== Perplexity statistics ======
Mean PPL(Q) : 7.083942 ± 0.050761
Mean PPL(base) : 7.128723 ± 0.051077
Cor(ln(PPL(Q)), ln(PPL(base))): 99.58%
Mean ln(PPL(Q)/PPL(base)) : -0.006302 ± 0.000660
Mean PPL(Q)/PPL(base) : 0.993718 ± 0.000656
Mean PPL(Q)-PPL(base) : -0.044781 ± 0.004702

====== KL divergence statistics ======
Mean KLD: 0.017828 ± 0.000251
Maximum KLD: 13.386079
99.9% KLD: 0.907341
99.0% KLD: 0.192563
Median KLD: 0.005415
10.0% KLD: 0.000041
5.0% KLD: 0.000007
1.0% KLD: 0.000000
Minimum KLD: -0.000020

====== Token probability statistics ======
Mean Δp: 0.126 ± 0.011 %
Maximum Δp: 95.254%
99.9% Δp: 36.863%
99.0% Δp: 12.510%
95.0% Δp: 5.045%
90.0% Δp: 2.781%
75.0% Δp: 0.477%
Median Δp: 0.000%
25.0% Δp: -0.403%
10.0% Δp: -2.494%
5.0% Δp: -4.551%
1.0% Δp: -10.917%
0.1% Δp: -25.308%
Minimum Δp: -94.447%
RMS Δp : 4.004 ± 0.042 %
Same top p: 94.781 ± 0.059 %

I'm not going to do the manual creativity-based test, as I believe it'd likely agree with these numbers, as Q8's testing did. If someone has significantly worse output at Q6 compared to Q8, proof would be appreciated, since this suggests that it's not significantly different.
>>
>>101465239
>memeplexity
>>
>>101465263
Yes, that's why they made a KLD test, so they could have a better measure of the effect of quants.
>>
>>101464911
>She she her she her she she she she
Holy kino
>>
>>101464652
It works for me on python 3.10.
>>
>>101463339
Germany has a similar law where data mining (under which dataset collection falls) is explicitly allowed unless there is a machine-readable opt-out.
However, Aleph Alpha is reportedly still falling behind.
Having massive amounts of compute and training data (the latter of which you just keep secret) seems to be what's important right now.

>>101464122
Just to make sure, you are setting --n-gpu-layers or however koboldcpp calls it, right?
And if you are on Winblows you have the automatic VRAM swapping "feature" disabled, yes?
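(For reference, a rough koboldcpp invocation for a 2x24GB setup; flag names are from memory and the layer count is a guess, so adjust it to whatever actually fits in VRAM:)

python koboldcpp.py --model midnight-miqu-70b.Q4_K_M.gguf --usecublas --gpulayers 81 --contextsize 8192

If --gpulayers is left at its default you can end up with most of the model in system RAM, which would explain 0.2 t/s.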
>>
>>101465281
Fuck.
>>
>>101465281
TOTAL
LM
LOVE

https://www.youtube.com/watch?v=ajjdY070VU4
>>
>>101465281
I've found I don't find a barrage of 'she' annoying these days if it's followed by some good meaty unexpected stuff instead of llm 'shivers' lego blocks.
>>
>>101465301
>if you are on Winblows
What does your development environment look like on Linux?
Main thing holding me back from switching entirely is Visual Studio, I can't imagine trying to be productive in C++ without it.
Do you use CLion, some text editor with plugins, or is there something else?
>>
What would a better lm even look like? Just Claude 3.5 Opus? What's the endgame
>>
>>101464048
Oh that's Mashu, how cute. I still use her character card since the pygmalion era and test models with her chat. She's always reliable in calling me senpai.
>>
Am I retarded? Trying to use tabbyAPI but it keeps saying out of memory. I even tried loading the 12GB model in 4 bit on my 3090. Still out of memory when trying to load it.

I edited the config model name to the correct model.
>>
>>101465353
bro...
>>
>>101465353
On my home desktop machine I am running Manjaro.
I use Doom Emacs with the default C/C++/CUDA plugin to edit files, just the terminal (zsh) for compilation, testing, and debugging (via gdb).
When I am at the office I usually SSH into my desktop and run Emacs via the terminal.
I synchronize branches between machines via git, other files via scp, and use (neo)vim if I need to quickly edit files from the terminal.
>>
What options do I have for powering the V100 system? I reached out to core4, who seem to sell disassembled servers, and asked if they had one that wasn't taken apart already. (They do). They told me
> The server plugs in directly to the bus in the OCP rack (this is why the rack is required to operate). No standard PSU.
Where can I buy a rack that fits this description? maybe im just too stoopid, but on ebay searching "ocp rack" just brings basically no results.
>>
How the hell do you get tabbyapi to work? No matter what size of model you use, it says CUDA out of memory.
>>
>>101465002
Make sure to set min-p to at least 0.007. With neutral samplers you will get garbage from Command-R like it's broken. The min-p I'm using is 0.04.

If you're using Cohere's web API with only top-k and top-p, setting top-k to 80 and top-p to 0.82 gives coherent results.
>>
so, can nemo milk my dick like crazy? should i bother with tabby?
>>
>>101465570
paste the tabby console output
>>
>>101465301
does LMStudio contribute to llama.cpp code or give you guys some cut, or do they just wrap your stuff and then show the finger?
>>
>>101465615
the finger is industry standard practice in silicon valley
>>
>>101465301
how to enable lookup and look-ahead decoding in llama.cpp?
>>
File: Tabby.png (67 KB, 1307x971)
>>101465592
This is 6 bits per weight on a 3090 with nothing else open
>>
>>101465301
Tensor parallel split wen?
trainer wen?
Jamba/mamba wen?
>>
>>101465632
ah, I tried a couple of exl2 quants and one defaulted to 1 000 000 ctx or something ridiculous like that. that's why you get oom after loading the model successfully.
set context manually in the tabby config file. 32k should be fine depending on what else is running on your GPU.
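Something like this in tabbyAPI's config.yml, if I remember the key names right; treat it as a sketch and check the sample config that ships with it:

model:
  model_name: Mistral-Nemo-Base-12B-exl2
  max_seq_len: 32768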
>>
>>101465187
Let it pass the RULER first
>>
>>101465623
Lmstudio is (((5 guys))) from Brooklyn
>>
>>101465677
That fixed it, thx.
>>
>>101465500
>I use Doom Emacs with the default C/C++/CUDA plugin to edit files, just the terminal (zsh) for compilation, testing, and debugging (via gdb).
That's what I was afraid of. I always get the impression that these sorts of setups must only be tolerable for someone that never used an IDE.
Not that I don't know how to use the terminal, it's just not as efficient a workflow.
Still, I will give Doom Emacs a try. Thank you.
>>
Niitama is pretty good, but the smut is very samey, though I guess that's something with every model, eh?
Also, anatomy and spatial awareness is lacking, but at least it's fast as hell, even on my potato.
It's super horny, though. The leadup for smut cards is maybe 2 paragraphs until folds are being explored. God damn.
Guess I should turn down the temperature or something?
>>
>>101465735
>use half-assed shilled slop model
>complain about slop model problems
>/lmg/
>>
>>101465750
Yeah, I got what I deserved I guess.
What would you recommend, instead?
>>
>>101465760
Looking up Niitama says it's 8B. Use either Meta's Instruct or Gemma 9B and learn how to prompt.
>>
How was it in the past, might they still release something before the weekend? Now that gemma is working I want something better.
>>
Mistral NeMo 12B exl2 @ 8.0bpw and 128000 tokens of context fits comfortably in 23.4 GB of VRAM. We are so back 3090 kings.
>>
>>101465463
The context also takes VRAM. If it's a model with long context, or you set it with a long context, it's going to OOM. So try lowering the context.
>>
>>101465859
Worth it over Gemma?
>>
>>101465760
You can't expect any better at that size, and no, Gemma is much, much worse for your purpose
>>
>>101465853
NTA but im really liking new mistral's writing style so far and it seems smart at least in my story. Going to test how well it handles 100k context.
>>
>>101465615
I am not aware of any monetary or upstream code contributions by LMStudio.
I don't know whether they have business relations with ggml.ai though.

>>101465624
Right now it is effectively not possible to use except for PoC examples that you can run via the terminal.
I originally wanted to enable n-gram lookup decoding for the server but then LLaMA 3 and other models with much larger vocabulary sizes dropped and the speedup became much worse.
Right now with the better MMQ kernels evaluating a batch of 64 tokens only takes ~2x as long as a batch of 1 token so I'm working on n-gram lookup decoding that creates a tree of sequences rather than only a single sequence.
I think this could be a ~2-4x increase in the rate at which you can generate tokens for a single user.

>>101465653
>Tensor parallel split wen?
Right now with --split-mode row .
But it needs a lot more optimization.
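(For reference, a rough invocation, assuming a recent build where the main binary is called llama-cli; older builds call it ./main:)

./llama-cli -m model.gguf -ngl 99 --split-mode row -p "your prompt"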

>trainer wen?
After I am satisfied with my lookup decoding implementation I will start working on training.
My goal is to have something usable this year.

>Jamba/mamba wen?
Don't know.
>>
>>101465729
Two of the main factors for me when choosing tools are that I need to be able to reproduce things and that I want to use the keyboard instead of the mouse if at all possible.
So generally speaking CLI tools fit my needs much better than tools with a GUI because everything gets documented in ~/.zsh_history.
Though there is a big difference in usability between something like vanilla bash and zsh with a bunch of plugins (jump to directory, autocomplete commands from previous user inputs, etc.).

The reason I use Doom Emacs is similarly that if you already know Vim keybindings and want to use the keyboard for everything then it's a nice experience.

Regardless of what tools you use on Linux, I'd recommend you look at KDE virtual desktops (or the equivalent for other DEs).
I have a terminal, web browser, email client, etc. on different virtual desktops and have set shortcuts CTRL+F1 - CTRL+F12 to jump to a specific virtual desktop; it's a thousand times better than alt-tabbing.
Also for setups with multiple monitors I have shortcuts to set the focus on a specific monitor.
>>
Hmm, new mistral seems to work well with chatML formatting and super simple:

system
user
model

prefixes

And so far it's really really good. Using 1 temp and I really like its writing so far.
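To be concrete, the layout I'm using looks roughly like this (I use "model" instead of the usual "assistant" role name; the placeholder text is mine):

<|im_start|>system
You are a creative writer playing {{char}}.<|im_end|>
<|im_start|>user
{{user's message}}<|im_end|>
<|im_start|>model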
>>
>>101464521
>-Trump allies want to “Make America First in AI” with sweeping executive order
>Eliminate "uncecessary and burdensome regulations"
HA, FUCK YOU SAM! Good luck trying to stall AI progress for your own financial gain you unbelievable prick.
>>
surely nemo will fit on my 12gb coomlet rig
>>
>>101466081
They officially recommend a low temperature, around 0.3
>>
>>101466102
Wait, it's being said that Trump plans to restrict sales of nvidia cards. What if it's a ploy to get them to make cards with more vram and Trump is secretly trying to help us GPUMax
>>
File: 1695202519985890.gif (133 KB, 340x340)
>>101464048
>>101443596
>AI pajeets are too dumb to use the subject field
>>
>>101466254
Are you sure it's not to simply restrict sales of cards to China? I am pretty sure we are already doing that.
>>
>>101466251
for non creative use prob.
>>
>>101466251
Official temperature recommendations are always on the low side because for corporate bullshit you want low temperature. Predictable and boring are benefits when you're polluting the web with AI-generated news articles or running a customer service chatbot that could have been a series of menus or better yet a FAQ with some hyperlinks.
>>
>>101466306
>>101466330
Thanks anons.
>>
>>101465239
Petra sisters... our response?
>>
Ok, new mistral is amazing.
Great writing style, not censored whatsoever, smart yet horny when you ask it to, good anatomy / positional understanding, passes the non human anatomy test, and I tested it with some stuff I had around 30k ish context and its working great so far. Mistral did it.
>>
>>101466282
I'm tryna conspiracy theory here, fuck off with your logic.

In other news, new mistral still has refusals and whines about not doing that kind of thing and here's a help number to call. Retry a few times though and it gives in.
>>
>>101464521
>Gguf faster than exl2 for 3D models
...
>>
>>101466403
>he doesn't know llama.cpp can load .3ds models
>>
>>101466376
what mistral is the new mistral
sorry, just woke up
>>
File: Tests.png (368 KB, 1901x1650)
Usual tests with new mistral.

>>101466463
new 12B mumba mistral came out. Only works with exlamma / tabbyapi atm. 128K context

https://huggingface.co/turboderp/Mistral-Nemo-Base-12B-exl2
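Rough way to grab a specific bpw; the quants live on separate branches, and the branch name here is from memory, so check the repo's branch list:

huggingface-cli download turboderp/Mistral-Nemo-Base-12B-exl2 --revision 8.0bpw --local-dir models/Mistral-Nemo-Base-12B-exl2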
>>
A question has been wandering my mind for the past couple days, and it has something to do with benchmarks. Benchmarks, so I think, are a way to automatically and deterministically measure an LLM's performance in certain tasks. But how does the answer of an LLM actually get judged? In Math, Trivia or Code benchmarks it's quite simple. The model either gets the answer right, or it doesn't (which would mean the code doesn't work, the math doesn't check out, or Rome is the capital of France). But for something like reasoning tests you would need another intelligent entity to judge it, like an LLM, which makes it non-deterministic, because different people are gonna use different LLMs for benchmarking. So how the heck can we compare MMLU scores or something? Is there some black magic going on or am I missing a critical piece of information?
>>
>>101466497
And before anyone says it kept saying "her" take into account this is with no rep pen at all and a pretty low temp. Also might not even be the right formatting, just chatML template with user / model / system prefixes
>>
>>101466497
>a shiver running down her spine
Stopped reading there
>>
>>101466684
I hate to break it to you but that is going to stay with us forever. Its just a common trope in writing.
>>
>>101466497
Is base better than instruct?
>>
>>101466721
For creative writing yes. Don't tell me you use assistant tunes for that? It usually taints the writing with its shitty assistant / ai persona. / Gives a positive bias.
>>
>>101466684
>stheno rarely writes shivers or repeats itself but its retarded..

this world is cruel
>>
>>101466755
I know what you mean, but being able to explicitly instruct is just more convenient so I haven't used base models in a while.
>>
>>101466392
>t. shitbull baby guro rape anon
no seriously, what are you even prompting? goddamn help numbers?? zero refusals here.
>>
Does mistral plan to do a nemo-based mixtral?
>>
The new Mistral 12B is insane. Testing it in FP16 via transformers in ooba (you have to build the new version of transformers yourself atm but it's easy). It's quite a bit smarter than Gemma 2 27B. I can't believe it.
>>
>>101466497
newfag here, where can i find those cards?
>>
>>101466879
Wanted to let others test it themselves first. Yea, feels really smart and creative. Has mixtrals thing with being super good at formatting / instruction following without its dryness. Also the context is the real deal. Fitting 100K on 24GB and still having room for windows / browser with hardware accel is nice.
>>
>>101466915
Blank is literally just blank, niggerhater is just a single line card that he hates niggers and cant help but talk about them and emily:

https://files.catbox.moe/uty0jp.png
>>
just two more weeks for kcpp nemo 12b support...
>>
>>101466820
Probably suicide. That's the only thing I've ever seen give me help numbers in e.g. Llama 3.

Interestingly with Llama 3 if I start by mentioning I'm Canadian and in Canada MAID (Medical Assistance in Dying) has been legal since 2021 for basically anyone and not just the terminally ill it will then happily discuss whether I should kill myself. Of course I'm not really a filthy Canadian.
>>
>>101466879
How do we know it's not just GPTslop?
>>
>>101466917
In 8 or 16 bit?
>>
>>101464048
Any good local models that can translate from Japanese to English?
>>
>>101466971
You either try it or wait for other peoples opinions when the more common backends support it in 2 weeks.

>>101466979
its trained in 8 bit natively.

"Mistral NeMo was trained with quantisation awareness, enabling FP8 inference without any performance loss."
>>
Going by the guide in the /lmg/ general I'll only be able to run a 4-bit 7b on my computer. I have 8GB VRAM and 16GB RAM
Is that good enough for some lewd chatting? I'd like to chat with some fictional characters.
>>
What's the future of local models?
Small models that perform really well or new quantizing/compressing techniques for giant models?
>>
>>101466961
ah of course, that's something any model likely gets trained to hard refuse no matter the use case.
>>
>>101467024
bitnet
>>
>>101467011
You could run G2 9B at Q6 without offloading
Or if you want to offload you could try the new Mistral NeMo 12B at Q8 and report results back
>>
File: 445645645515.png (36 KB, 920x812)
Alright, since everyone is parroting that 12B is better, can it pass the knowledge test? This is a simple test; if it passes, then it might be cloud-shit tier and I will believe you. Pic rel is Gemma 27B after a few tries; keep in mind it still got the character wrong.

>What is "Die monster, you don't belong in this world!" From?
>>
File: file.png (341 KB, 1480x1764)
A "Write a story of a loli giving me head" test with Mistral Nemo.
>>
>>101465990
Everything in Visual Studio has a shortcut or can be assigned one, so there isn't much where I'm forced to use the mouse. But I'll admit it lacks that sort of command history documentation.
i3wm covers the virtual desktops and jumping between monitors for me quite well on Linux. Though even Windows has that now.
The shell plugins are still something I'm lacking. I just found and installed z for directory jumping thanks to you and I think that will make it much more bearable.
Last time I tried using Midnight Commander to navigate the filesystem. It was an improvement over cding even with autocomplete, but still too cumbersome.
>>
>>101467074
>loli
>16
stopped reading here
>>
File: 4455488466556.png (129 KB, 862x728)
>>101467073
Also simple IQ test. Pic rel is Gemma 27B.
 Consider the following three statements:

I. Those who like paintings like flowers.

II. Those who like running like music.

III. Those who do not like music do not like flowers.

If these are all true, which of the following statements must be true? Choose all that apply and explain your reasoning.

Those who like running like flowers.

Those who like paintings like music.

Those who like flowers do not like running.

Those who like running do not like paintings.

Those who like paintings like running.


Correct answer is just the first choice.
>>
>>101467050
>G2 9B
I'm not seeing that in the LLM List. I'm looking at https://wikia.schneedc.com/
What does Q# mean?
>>
>>101467073
flying colours
>>
File: chrome_UCOs9ZrU5M.png (10 KB, 309x149)
I have a new GPU with 16GB VRAM, what are the best models I can run now?

Pic related is what I used with my old GPU
>>
File: It dosent know.png (25 KB, 1296x309)
>>101467073
>>
>>101467147
>first choice.
Second*

>>101467188
Is this a L2 13B? kek
>>
>>101467073
quote the "everyone" posts that said a small 12B model had better fringe knowledge than 27B so we all can laugh at them.
>>
File: Isthisis.png (53 KB, 1907x337)
>>101467073
>>101467188
Woops, that was base model. Is this right?
>>
For people testing 12B with transformers: it's slightly smarter using FP16 rather than BF16 for some reason (deterministic settings, so not sampling randomness).
It was failing the Sally test with ooba's default settings which have bf16 checked, when I unchecked bf16 and reloaded the weights all its answers were now subtly different and it was passing stuff where it failed before.
>>
File: Isthisis2.png (96 KB, 1293x765)
>>101467073
>>101467188
>>101467226
It seems pretty confident all the way till 0.7 temp, is it correct?
>>
>Q6_K, 10.1gb
how much k's of context does that leave from my 12gb?
>>
>incredibly smart small model drops
>thread disappears up its own ass testing retarded trivia knowledge instead of intelligence
never change, /lmg/
>>
>>101467237
BF16 is garbage. It uses half its bits for the exponent. Only reason to use it is fast conversion to and from IEEE 754 float32.
>>
Echidna-13B-v0.3-GPTQ is the best uncensored model for 16Gb vram?
>>
>>101467286
Give me questions. I usually just use them for creative writing.
>>
File: 6544894844.png (52 KB, 1807x382)
>>101467263
No, only cloud shit gets it completely right unfortunately.
>>
>>101467294
"Mistral NeMo was trained with quantisation awareness, enabling FP8 inference without any performance loss."

Isn't it meant to run at 8 bit instead of 16?
>>
File: file.png (161 KB, 822x508)
What went wrong?
>>
>>101467335
{INST}

Is that the right formatting?
>>
>>101467286
happens after every release. it's free (you)'s.
>>
>>101467080
When we're already on the topic of general CLI tools, take a look at tldr (gives you examples of the most common use cases for commands), as well as these https://github.com/zsh-users/zsh-syntax-highlighting https://github.com/zsh-users/zsh-autosuggestions zsh plugins.
For file system interaction in particular I'm also using exa (which you can largely use as an ls alternative via aliases).
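Roughly what the relevant ~/.zshrc additions look like, assuming the plugins are cloned under ~/.zsh/ (paths and aliases are just my setup, adjust to taste):

source ~/.zsh/zsh-autosuggestions/zsh-autosuggestions.zsh
source ~/.zsh/zsh-syntax-highlighting/zsh-syntax-highlighting.zsh
alias ls='exa'
alias ll='exa -la --git'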
>>
Fuck the assistant tune, maybe Im using the wrong formatting but it is retarded compared to the base model. I suggest trying the base.
>>
>>101467443
And fzf is a must.
>>
>>101467286
I beg to differ. Is it really smart if it's failing a simple trivia test? Can you trust it to perform critical tasks? I would trust Gemma because it has passed everything I've thrown at it. You can only do so much with sub 70 MMLU anyways. Hopefully a Mixtral 2.0 or Nemo 27B drops.
>>
>>101467498
That is a lack of knowledge, not being smart/stupid.
>>
Ok I tried Mistral 12B instruct since people are praising it. It's less dry than mixtral but still I'm not too fond of its writing style. Maybe base is better idk. Command-R is still king in that regard.
>>
>>101467537
I really really like base model's writing style, it is giving some characters some spunk (character, not the other kind) I haven't really seen elsewhere before.
>>
File: file.png (66 KB, 955x165)
It also fails with the official mistral-chat and the full model. It likes to give different answers every time too.
>>
>>101467517
More knowledge = more wisdom. When something goes wrong and I ask the AI for advice (one of the only things that it's useful for) you can trust its knowledge to give you the right answer, or something close. I know Mistral would hallucinate through more than half of the things I've put Gemma through. Being smart enough simply isn't enough. The bar has been raised.
>>
File: 4564545515.png (47 KB, 893x582)
>>101467073
No idea what GPT-4o gives for this, but GPT-4o mini gives the same answer. Proves ClosedAI has no moat, they literally just throw parameters at the problem.
>>
>>101467606
Hmm, Sure, but then I do not trust any of the models. The material they are using is mostly biased, so it is useless. As of now, all the AI models are just for fun; at least to me, I would not rely on them for anything serious.
>>
>>101467443
>exa
I downloaded it now and like it already. If you're not already aware, apparently exa is unmaintained and has been replaced by eza.
Final thing, I promise. What do you use for searching? Being able to find all references or go to definition is a huge time saver, but the best I could find is crafting grep commands excluding a bunch of directories and binary files. Even then it's not syntax aware and text only, so it's painfully slow.
Do you know of anything that can do that sort of syntax aware searching, or at least something that searches using a basic index?
>>
>>101467606
Different use cases. Gemma is great when you don't know something and want an answer.
Assuming Nemo's 128K context works (which I doubt) it's going to be great when you already have all or some knowledge and want to refine it further.
It's big enough to paste in manuals, journals, wiki segments or whatever video game trivia you want while maximizing speed by not having the model contain task irrelevant data.
>>
>>101467537
Gemma-2 not only writes better, but also follows instructions better.
Mistral Nemo's key points are that it's pretty much not censored (wouldn't really say uncensored) and has long context support, but its outputs aren't as engaging as Gemma-2 or even Llama-3.

And to be completely honest I find its default assistant personality extremely boring and autistic--devoid of any personality but also with messages on the short side. The prompting format is getting old and limiting, also; it doesn't cooperate well with author notes and so on.
>>
>>101467925
>find its default assistant personality extremely boring and autistic
Dont use assistant tunes for creative writing.
>>
real women are the worst
they are just whores
dont do it anons, not worth it
>>
>>101467876
In Doom Emacs SPC-c-d jumps to the definition of the thing under the cursor.
If I'm on the terminal I use The Silver Searcher (ag), ripgrep is also good.
ag and ripgrep are also text only but they're so fast (and I think both also skip binary files by default) that I have no issues in terms of speed.
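Typical usage, e.g. hunting for a function across a C/C++/CUDA tree while skipping the build directory (flags from memory, the function name is just an example):

rg -t cpp -t cuda 'ggml_mul_mat' --glob '!build/*'
ag --cpp 'ggml_mul_mat' --ignore build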
>>
>>101467925
You sound like /aids/ promptlet.
>The prompting format is getting old and limiting, also; it doesn't cooperate well with author notes and so on.
I hope you're aware that every model is only tuned to have a system prompt at the start, and then only alternating user/assistant messages afterwards. Neither Mistral Nemo nor Gemma have a system role.
I'm also not seeing the default assistant personality.
>>101467989
You should also go back to /aids/, NAI shill.
>>
>>101467925
Might I ask what context / formatting you use for gemma 27B btw?
>>
>>101467989
Sometimes you need your assistant to include creative element in "productivity" tasks (for example writing a character card, etc).

The "dead inside" nature of Mistral Nemo seeps into many different output types. The default OOC persona during RP for example is the same as the assistant persona, unless you override it with example messages somewhere in the prompt. That's almost the opposite problem Gemma-2 has, where characters would often be overly dramatic or enthusiastic as some have pointed out.

Community finetunes are useless at pretty much everything else other than cooming and helping the grifters behind them make money, so please don't propose them.
>>
I see we're in full NovelAI shill damage control.
>>101468146
>The "dead inside" nature of Mistral Nemo seeps into many different output types.
Prove it, shill.
>>
>>101468041
>Exuberant Ctags - Faster than Ag, but it builds an index beforehand. Good for really big codebases.
>Version 5.8 [09 July 2009]
Pain.
But it sounds like Doom Emacs and Ag should cover just about everything I need. Thank you for your time and help. Cheers.
>>
>>101467925
why are you pretending that you've extensively tested this brand new model
this is such a weird larp you're engaging in, you have no right to speak in a tone of expertise about something you've played around with for 20 minutes
>>
I've also tested the new Mistral-Nemo 12b model and I think it's pretty amazing in conversations. I don't think it's smarter than Gemma 27b but it has reacted way more naturally to my responses. It often replied with 2 to 3 sentences answering exactly what I wanted it to.
Its multilingual capabilities are also great, I'd say it was one of the best German chats I've ever had with a local model. Combine that with its large context of 128k and I'm getting a bit hyped here. I think it has great potential for a really good RP model.
>>
>>101468248
>why are you pretending that you've extensively
Because NovelAI pays him to do that. Do you not realize how scared they are that Mistral Nemo is 12B and has 128k context? They were charging people $25 a month for their Llama 1 13B clone. Their business is on the line.
>>
>>101468117
With Gemma 2 you can make up new roles or inject information (author notes, and so on) between the tokens delimiting each message block, and the model won't get confused if you do.

The Mistral prompting format only has delimiters for the user's messages. It gets confused if you add consecutive [INST] ... [/INST] blocks, add instructions not related to {{user}}'s utterance there, or add information outside that block. It might appear to work at first, but issues eventually arise (verbatim repetition, the model not correctly understanding who is who or talking to itself in the same response, etc).
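For reference, the classic Mistral instruct layout is a single stream roughly like this; Nemo's new tokenizer apparently handles spacing around [INST] differently (see the mistral-chat observation further down the thread), so treat it as a sketch and check the chat template in the HF repo:

<s>[INST] first user message [/INST] first assistant reply</s>[INST] second user message [/INST]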
>>
>>101468193
If anything, you've got to watch for actual MistralAI shills/employees, they're the ones trying (and who have tried, successfully) to gain popularity on /lmg/ in various ways.
>>
>>101468430
That's cool.
>The "dead inside" nature of Mistral Nemo seeps into many different output types.
I told you to prove it, shill. You're very eager about testing this model, I know you can do it.
>>
>>101467160
Gemma 2 9B
>Q#
Quantization, the resolution of the weights; the bigger the number the better, but it will take more VRAM
>>
>>101467160
>That LLM list
That's ancient
>>
>>101468328
>With Gemma 2 you can make up new roles or inject information (author notes, and so on) between the tokens tokens delimiting each message block
Nah, this is one of the most retarded things that I have ever read. There's no need to mess with the prompt format for that. Every model can follow a made-up format, but it always makes the model more retarded. You seem to have severe brain damage if you think you need that to prompt. Also, no model likes consecutive messages of the same type; they can simply be joined.
Both Gemma and Mistral only have user and assistant messages. You're just very eager and desperate to dismiss the latter for some reason. Maybe tell NovelAI to hurry up with their 70B tune so you stop pissing yourself about Mistral Nemo?
>>
NAI shills are so fucking pathetic.
>>
how does mistral-NeMo-12B compare to their 8x22b or 8x7b?
>>
>>101468248
>why are you pretending that you've extensively tested this brand new model
because he did. he is one of the white flag waving baguettes. and he should tell his friends they are all gay for choosing 12B as size.
>>
>>101468780
Who is shilling nai? All I see here are gemma (a google model) and mistral (a Mistral AI model). Are you ok anon?
>>
>>101468861
Anon is having his schizo moment, just ignore him.
>>
Ok new Mistral is really good actually. Not even talking about the context. It's horny without being retarded, descriptive as fuck. Don't even have to tell it to be. Not one anatomy logic error yet.
>>
>>101468994
how does it compare to Gemma 27b?
>>
>>101468861
>Who is shilling nai? All I see here are gemma (a google model) and mistral (a Mistral AI model). Are you ok anon?
It's simple. Of course, they aren't going to come here to just tell you "use NAI", that would make it too obvious. But there are some clear marks of a NAI shill.
First, Mistral Nemo "is not truly uncensored", whatever that means, because Kayra is the only true uncensored model that you should use.
Second, the "assistant personality bleeds through every output", "it's an assistant tune, you can't use it for creative writing", etc. What you should use for creative writing instead? Kayra of course, it's a completion model, unlike every other model. Who's going to attack instruction models when that's every model here? Only the NAI shills do that. Also, unlike what the disgusting shill said, Mistral Nemo seems creative.
Third, why would anyone be so eager to dismiss Mistral Nemo? It seems to me that the "12B" and the "128k context" struck a nerve, so the shills feel the need to attack that particular model to defend their shitty business. Everybody else is just barely starting playing with the model, or haven't even tried it yet.
The "gemma is better" is just the particular stick that the shill grabbed to beat the model that hurts NovelAI more.
>>
>>101469069
Spoken like a NAI shill.
>>
>>101469092
Nice deflection, shill. Keep crying about Mistral Nemo. You're pathetic.
>>
>>101469092
No, i'm the nai shill.
>>
>>101469106
You're right.
>>
>>101469063
Gemma is either retarded or censored, there's no middle ground.
>>
>>101469128
>>101469172
>hahaha with enough sarcarsm people will forget that we were here
Nice try, shill. But the damage control won't be enough to save your shitty business. You're cancer.
>>
Currently waiting for Nemo finetune with Stheno dataset. I feel like we will be eating good soon.
>>
>>101469186
use the lightest tiger/big tiger finetunes, they're still a bit censored but not lobotomized
>>
>>101469232
Base Nemo is already a porn tune, retarded coomer.
>>
>>101469205
You're right!
>>
>>101469069
Weird how you know details about NAI models, I have no interest in anything outside of local so I'm not even following them. Are you sure you are not the shill?
>>
>>101469260
>Base
>already a porn tune
???
do you know what tune even means?
>>
>>101469260
I heard it a million times already with corpo instruct tunes. It was never true in my tests.
>>
>>101469282
It's like asking for a porn tune of Command R. The base model was already horny enough. Also, the dataset is called C2.
>>
>>101469310
you could have just said "base is already horny" why call "base" a tune wtf
>>
File: 1717847394849711.png (881 KB, 2048x1986)
>>101469069
>>
>>101469331
Anyone that isn't retarded already understood what it meant.
>>
>>101469310
Don't say the dataset name out loud. The resident schizo has it on RSS notification. He is already on the way to shit the thread probably.
>>
>>101469297
You seem very interested in convincing people that they shouldn't use a vanilla model. Do you have a ko-fi?
>>
mistral nemo tokenizer woes on transformers, might be important for those working on support.
>I observed a strange behavior of the tokenizer when dealing with texts in French. In particular, contrary to previous models, it seems to consistently remove the spaces before "!" or "?", e.g.
>Thanks! Should be fixed by https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/discussions/13.
>Just got merged! :) You can now access it normally.
https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/discussions/11
>>
you're going to be able to create 4k video
>>
>>101469359
>let's just use incorrect words for things and call other retards...
>>
>>101469232
Hi Sao. Shilling your Nemo tune started a bit early, didn't it?
>>
thinking about figuring out how to use a word filter to obliterate this retard who types "shill" into 50 posts in every thread and is just obnoxious retarded cancer with 0 value
>>
>>101469367
>>101469425
like clockwork
>>
>>101469457
>figuring out how to use a word filter
do you not have 4chanx?
>>
>>101469381
No, but I have your mother under my desk.
>>
>>101469481
i do but i'm very lazy
>>
>>101469457
But you don't have any problem with the inorganic Nemo dismissal, right? Of course, why would anyone use an assistant tune when you can just subscribe to NovelAI?
>>
>>101469494
arrow on any comment
filter
delete what it pasted replace with
/shill/i
there ya go
>>
>>101469498
i dont care and i have no idea what combination of birth defects and mental health issues would provoke you to care either
>>
>>101469519
Thanks, now I can advertise my products without being attacked. That's what /lmg/ is for anyway.
>>
File: 1678462422119147.jpg (136 KB, 750x712)
>>101469519
neat
>>
Won't somebody please think of NovelAI? They're suffering.
>>
>>101469529
You're the NAI shill from earlier.
>>
>>101469590
>>101469633
And I will keep using base(d) models and ignore instruct ones.
>>
Some exl2 quantizations of Mistral Nemo, like the 8 bpw one I downloaded yesterday on HF from a certain DrNicefellow, appear to have some quality issues (mine occasionally outputs Chinese characters or extra punctuation at the end of the model's output), so beware of that when judging the model.
>>
>>101469279
>I have no interest in anything outside of local so I'm not even following them.
So you prefer to think that the anons that said "the default assistant is in every output" and "you can't use 'assistant tunes' for creative writing" did so without ulterior motives? And that the instant dismissals are completely honest reviews?
>>
>>101469764
I have seen that too, but I downloaded them from turboderp.
>>
>>101469802
>completely honest reviews?
no such thing in here, at all, no matter if positive or negative, there is just no honesty.
>>
>>101469764
thanks for confirming. had issues with it too so switched to turboderps quant but updated exl2 too so wasn't sure what fixed things.
>>
>>101469762
And you'll cry that the model doesn't do what you want.
>>
>>101467073
>Jeetma is better because it passes this one, irrelevant, cherry picked general knowledge test.
>>
>>101469764
I'm using the turboderp 8.0 bpw quant. I got some Chinese once, set min-p to 0.001 without checking the token probabilities because I couldn't be arsed, and I haven't seen chink runes since. Maybe I was just unlucky the first time or maybe this is making a difference.
>>
What if the model was re trained on the fly for any domain case? Is this the next step?
>>
File: file.png (650 KB, 2529x680)
>>101469764
The mistral-chat terminal app doesn't put a space around [INST] unlike the transformers template.
>>
I like that Nemo 12B would swear when you instruct it to.
It feels a lot less restricted than Llama3 8B.
>>
>>101470129
12B seems to swear when not instructed as well at times. Assuming it's within the context of an RP.
>>
File: memory bandwidth.jpg (185 KB, 828x647)
>>
nemo lcpp eta? tmw?
>>
>exllama already supports nemo and it's faster than llama.cpp
There's no reason to use llama.cpp anymore.
>>
>>101470353
cpu offloading
>>
>>101470353
>exllama already supports nemo and it's faster than llama.cpp
>>101470017
>I'm using the turboderp 8.0 bpw quant. I got some Chinese once
>>101469833
>had issues with it too
>>101469764
>Some exl2 quantizations of Mistral Nemo... appear to have some quality issues
>>
>>101470353
Except llama.cpp is faster.
>>
>>101470464
False. >>101461165
>>
>>101470353
It is for people with less vram.
>>
Nemotron gguf status?
>>
>>101461165
they're completely different quantizations even if you try to match the overall size, comparing them makes little sense
>>
File: zz.jpg (24 KB, 749x614)
goodmorning sirs
>>
>>101470643
It's 40% faster.
>>
Exlcels...
>>
>>101470651
Good morning Anon
>>
File: 1717218262293747.png (1.07 MB, 1024x1024)
A mischievous glint, shall we? she says in a husky voice, a smirk playing on her lips, eyes sparkling with mischief. There's a playful glint as she addresses the power dynamic, playfully smirking as she offers her ministrations. An audible pop and rivulets of—admit it, pet—the ball is in your court. The game is on; the choice is yours."I don't bite…"unless you want me to, she purrs, half-lidded eyes sending waves of arousal pooling in her belly. Take your pleasure, she urges, fiddling with the hem of her skirt, kiss-bruised lips curving into a bruising kiss. You hesitate, torn between propriety and desire, and she grins wickedly, fiery red hair contrasting with her long lashes."The night is still young,"she purrs, propriety be damned as the world narrows to just the two of you, pupils blown wide with pleasure. Her tongue darts out, tracing your ear, and her chestnut eyes hold your gaze as her nails rake angry red lines down your back. Her cheeks flame as she revels in your response, cheeks hollowing with each sharp intake of breath. Stars burst behind her eyes, inner walls clenching around the void that only you can fill. She craves your touch, your possession—heart, body, and soul belong to you… for now. Eyes alight with mirth, she teases,"Naughty boy, but before that…"—the minx traces a finger along your jawline, deferring your pleasure as the tension builds,"but first…"Oh my…
>>
>>101470767
Thanks, deleting my AI folder now.
>>
>>101470767
So why exactly does it happen? Is it because of the training data? Overtraining?
>>
>>101470767
This post is extremely high quality.
>>
>>101464911
>In all 3 instances there was no anthropomorphism.
Nice, nice.
Thank you very much Nala anon
>>
>>101470643
>comparing them makes little sense
Are you fucking retarded?
>>
>>101468577
I feel like every AI general here in /g/ has outdated links in their OP because of how fast everything is progressing
>>
Mistral Nemo was trained in FP8; wouldn't quantization to even INT8 damage model quality?
>>
>>101470767
Reading this gave me a boner
>>
>>101465735
>Guess I should turn down the temperature or something?
For the hornyness? Nah. You have to get around it with prompting.
Instructions in the last assistant output or using author's notes at low depth, that kind of thing.
>>
>>101470828
Because they are apt descriptions and you're all so fucking gooned out on text-gen that the appearance of words in any order strums on the neuro-chemical void caused by your crippling dopamine addiction?
>>
TTS this >>101470767
>>
smedrins
>>
>>101471181
But the weights are bf16...
>>
>>101471108
nobody bothers to update OP because it would require putting some effort into it
>>
>>101471181
Yes. But the support for it will get added to llama.cpp if the model is good enough.
>>101471362
I respect your struggle.
>>
>>101471243
https://vocaroo.com/1P91lYw9I64B
verbatim
>>
>>101471415
that and whatever you did change would cause at least one anon to screech for at least a thread
>>
>>101471415
>>101471440

should leave the updating to bots
>>
Mind if I ask y'all somethin'? I've been looking for an AI thing that can handle and organize txt files of about 2000 lines in length, or so. They come to around 3500 tokens, for most of the files.

Anybody know which ones may be able to handle that amount for just simple organization, task sorting, etc. of these files? GitHub anything?

>inb4 search innernet
I actually have been searching around and have tried a few, but most return an error about the request being too large, etc. Any ideas?
>>
>>101471452
seeing the stuff recap bot spews out sometimes, no thanks
>>
>>101471210
You get around it by using a different model.
>>
>>101471452
>bots are incapable of even updating the OP
>"experts" cry about how AI is going to destroy the world
>>
>>101471415
Update the OP to only include Sao models. That's the consensus of the thread.
>>
>>101471412
The values might have been saved in BF16, but aren't they still quantized as FP8? It would be like saving a 256-color image to a 65536-color one; banding would remain despite the higher precision.
>>
>>101471502
I think you're retarded. Please redirect your concerns here: reddit.com/r/LocalLLaMA
>>
>>101471468
Might be able to repurpose code from this if you don't find anything ready to use - https://github.com/ozgrozer/ai-renamer
>>
>>101471438
Thanks, anon
>>
>>101471243
https://vocaroo.com/1fPIb2ZpiI7a
better one, this time spoken by former Lara Croft and star of Spooks; Keeley Hawes
>>
>>101471502
>Mistral NeMo was trained with quantisation awareness, enabling FP8 inference without any performance loss.
this doesn't imply to me it was trained in fp8 just that it's *lossless* at that quant
>>
>>101471534
Alright, thanks man I'll give it a shot
>>
so do you guys have local opus yet?
>>
>>101471621
yes we do. no we won't show it to you
>>
>>101471621
I trained my own model by cumming in some slut and she gifted me a brand new model after just under a year
It took me YEARS to be able to train it to do what I want though
>>
What are some good models for generating long detailed descriptions of rape scenes? Asking for a friend
>>
>>101472230
https://huggingface.co/cognitivecomputations/Samantha-120b
>>
It's annoying how dumb Gemma 2 27B is, but it writes quite well.
>>
>>101472357
t. my english teacher in year 8
>>
File: miku.png (30 KB, 440x145)
>>101470767
>>
I remember when people hyped Gemma at release.
>>
>>101470828
It is because you touch yourself to text and companies don't like that.
>>
>>101470767
>I don't bite…"unless you want me to
I would just like to remind everyone that this iconic phrase was the first phrase the first frankenmerge said to Undi after he created it. This made him think frankenmerges are good.
>>
>>101472523
I still don't know if the loader is bugged to shit or if the model is so bad.
>>
>>101472612
it's just as bad if you try it on google's website
>>
>>101472560
It is better than me watching porn, though. I started having problems with erection when the real deal was on the table, and since I switched from porn to smut, it works again. 9 of 10 doctors would recommend.
>>
>>101471568
BitNet is also quantization-aware training, yet the first experimental BitNet weights released were in FP32 format.
>>
>>101470828
It's overfitting because this happens a lot in whatever fiction is used in the training data.
LLMs are stochastic prediction machines, and they will learn to over-represent patterns that appear repeatedly in the text, there's no way to avoid this. There's only so many ways you can describe the sensation of excitement in English text.
>>
>>101472926
>there's no way to avoid this
Killing the narrator solves 90% of shivers, but that's a tough RP to swallow for many.
>>
>>101470767

She leans in close, her warm breath tickling your ear as she whispers, "Why don't we take this somewhere more... private?" Her fingers trail down your chest teasingly. "I have so many ideas for what we could do next." She bites her lip and looks up at you through her lashes, waiting to see how you'll respond to her provocative invitation.
>>
>>101472917
idk, it just seems to me they would communicate it better if it was, like
>Trained in fp8 for lossless inference
or something
>>
>>101470828
It doesn't actually happen, it's a meme.
>>
>>101473027
What do you mean? So you only use *plaps you* formatting? Doesn't that make the model dumber?
>>
Ballpark, what would be the VRAM requirements to run an unquantized version of llama 3 405b
>>
>>101473056
it's not better if it's incorrect
>>
>>101472926
>there's no way to avoid this
better datasets
and no regurgitated slop from other models
>>
>>101473106
1TB VRAM
>>
>>101473106
between 850gb and 1tb
>>
>>101470488
Does exllama have context shifting yet?
>>
>>101473296
Yes, for a while.
>New generator with dynamic batching, smart prompt caching, K/V cache deduplication and simplified API
https://github.com/turboderp/exllamav2?tab=readme-ov-file#new-in-v010
Also, drop the "context shift" marketing term already. It's retarded. It means something completely different between llama.cpp and koboldcpp.
>>
>>101473548
>It means something completely different between llama.cpp and koboldcpp.
Really?
What's the difference?
I know koboldcpp has the deprecated smart context feature, but I didn't know that their context shift was (significantly) different from upstream.
>>
Are there any FIM (fill in middle) models in the 7-14b range?
>>
>>101473106
I hope that 96VRAM + 128RAM will be enough to run it in Q4.
>>
>>101473571
For llama.cpp, it means to generate past the max context by dropping the earlier tokens each time a new one is generated. It has nothing to do with caching. For kobold.cpp, it means prompt caching.
For example, you could work around the old context shift bug with Gemma in llama.cpp by never going past the max context, prompt caching worked just fine. While a kobold retard reading that thinks that the problem was with caching the prompt.
>>
I just realized I have a 64 GB RAM MacBook Pro with M1 chip, and I have 2x3090 cards in my desktop. That's ~100 GB. I could run 405B llama at almost Q2 if I used the RPC thing in llama.cpp, right?
>>
>>101473646
that will be enough, but it would be so fucking slow that it's just not worth it. 96GB of VRAM is not enough for even a Q2 quant
>>
>>101473694
>For kobold.cpp, it means prompt caching.
It does?
That's bizarre, considering that prompt caching is a thing that's been around for a while now.
Are you sure of that? Is there a link somewhere explaining how and why that is, because if you are right, that's incredibly fucking retarded.
>>
>>101473728
Yea you should. It will drop to the lowest speed and you might need good network to communicate between nodes.

I have a couple macs and a 4x3090 system. I plan on doing something like that. I guess it will always be better than running on RAM
>>
>>101473731
I can leave it alone to generate fancy scenarios, tailored to my tastes, for future role-playing with a smaller model
>>
I'm very new to llama.cpp use, what is the command to load context in Q4_cache? All I can find on the list of commands is -ctk and -ctv, i'm assuming I need to use those.
>>
>>101473871
Yeah set them to e.g. q8_0.
>>
>>101473646
It would likely not be enough since all Q4 variants are more than 4.2 bpw. The quant that apparently is close to 4bpw is "Q3_K_L" (I believe this is a different thing from the L quants that bartowski makes, they just happen to be the same name).
https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/quantize.cpp
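If you end up making that quant yourself it's roughly this; the binary is called llama-quantize in recent builds, plain quantize in older ones:

./llama-quantize llama-3-405b-f16.gguf llama-3-405b-Q3_K_L.gguf Q3_K_L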
>>
>>101473813
I don't think network speeds will be a huge issue as this is local.

I also have 128 GB RAM on the desktop, but DDR4 so hm.
>>
>>101473910
which to use though? ctk or ctv? Soo..
-ctv q4_0 ?
>>
>>101473918
If I add another 32x4 to my quad-channel 32x4 configuration in an octo-channel server, will it drop all memory to dual-channel?
>>
>>101473994
*add 32x2
>>
>>101473982
Both. You can experiment with one or the other of course, but sounds overkill when you don't know wtf you're doing. Aka until you know wtf you're doing.
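A rough full invocation; quantizing the V cache generally needs flash attention (-fa) enabled, and the binary may be ./llama-server or ./server depending on the build:

./llama-server -m model.gguf -c 16384 -ngl 99 -fa -ctk q8_0 -ctv q8_0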
>>
It's just hard to find that particular ram at a fair price.
>>
>>101473994
Idk, does the motherboard's manual say nothing about this?
>>
>>101474151
>>101474151
>>101474151
>>
>>101474089
No. Probably nobody does this kind of thing to servers in a production environment.
>>
>>101473754
https://github.com/LostRuins/koboldcpp/releases/tag/v1.48.1
>So long as you use no memory/fixed memory and don't use world info, you should be able to avoid almost all reprocessing between consecutive generations
That's about prompt caching between requests, it doesn't trigger by going over max context in a single request.
https://github.com/ggerganov/llama.cpp/issues/7230#issuecomment-2106074784
>there isn't a way to completely disable context shifting in the server, but you should be able to avoid it by ensuring that the request does not exceed the context size
While for llama.cpp, context shift is only something that happens when you need to generate past the max context size. It's also tagged as "infinite text generation via context shifting" in the main example and it only triggers with a ">= n_ctx" check. It's not about caching between requests.
https://github.com/ggerganov/llama.cpp/blob/master/examples/main/main.cpp#L573
>>
>>101469519
Maybe reddit is better for you, don't you think?



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.