/g/ - Technology

File: 1710266621871822.jpg (462 KB, 1664x2432)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101584411 & >>101578323

►News
>(07/24) Mistral Large 2 123B released: https://hf.co/mistralai/Mistral-Large-Instruct-2407
>(07/23) Llama 3.1 officially released: https://ai.meta.com/blog/meta-llama-3-1/
>(07/22) llamanon leaks 405B base model: https://files.catbox.moe/d88djr.torrent >>101516633
>(07/18) Improved DeepSeek-V2-Chat 236B: https://hf.co/deepseek-ai/DeepSeek-V2-Chat-0628
>(07/18) Mistral NeMo 12B base & instruct with 128k context: https://mistral.ai/news/mistral-nemo/

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
►Recent Highlights from the Previous Thread: >>101584411

--TTS improvements and output issues: >>101586575 >>101586607 >>101586659
--Mistral nemo configuration and settings advice: >>101585456 >>101585527 >>101585596 >>101585669 >>101585834 >>101585868 >>101585572 >>101586019
--Sillytavern single sentence replies issue: >>101587180 >>101587200 >>101587246 >>101587225 >>101587275 >>101587269 >>101587353 >>101587401 >>101587413
--Recommendation for voice data TTS finetuning: >>101585560 >>101586101 >>101586163 >>101587016 >>101588184
--Nemo generates quadrupeds well but writes differently than chatgpt: >>101587732
--Logical flaws in GPT-4 and Claude, Command R Plus gets it right: >>101584587 >>101584617
--GitHub repo for bulk downloading cards for ST: >>101585689 >>101586342
--Anon asks for Command-R Plus alternatives.: >>101585536 >>101585556 >>101586438 >>101586483 >>101586596 >>101586657
--largestral iQ2_M outperforms Nemo in retarded quant, but is slower than 1t/s: >>101585893 >>101585921 >>101585940 >>101585998 >>101586017 >>101585939 >>101585985
--Nemo repetition issues and DRY sampler settings recommendations: >>101587028 >>101587049 >>101587511 >>101587535 >>101587576 >>101587545
--MoEs for roleplaying? Try it and find out: >>101584540
--Mistral Nemo sampler settings cause rambling output: >>101585928 >>101585955 >>101586019 >>101586038 >>101586062
--Where do ST or other UIs cull example dialogue in the context window?: >>101584746 >>101584777
--RULER repo measures effective context length, Llama3.1 performs well: >>101586297 >>101586352 >>101586384 >>101587005 >>101587027
--IQ4_XS vs Q3_K_M model quants and accuracy discussion: >>101585131 >>101585176 >>101585200 >>101585383 >>101585434 >>101588262
--IQ1_S performance and characteristics discussion: >>101588056 >>101588068 >>101588140 >>101588159 >>101588129
--Miku (free space): >>101587473 >>101588754 >>101588896

►Recent Highlight Posts from the Previous Thread: >>101584415
>>
post (You)r largestral presets
>>
File: 00170-699389629075918.png (1.47 MB, 1024x1536)
>>101589142
i got a little chub seeing my repeated (You)s in this AI generated recap
thank you, botkind.
>>
I am once again asking for mini-magnum presets.
>>
>>101589160
I didn't actually try it:
>>>/vg/487568316
>>
gib nemo presets
>>
File: robotnik-jump.gif (14 KB, 420x420)
>>101589210
>>101589219

just use the ones i linked from that anon >>101585456
in fact fuck it ill re-copypaste it again

Here, since so many people seem to be using nemo with wrong formatting then complaining:

Mistral context template: https://files.catbox.moe/6yyt8d.json

Mistral instruct template:
https://files.catbox.moe/rfj5l8.json

Mistral Sampler settings:
https://files.catbox.moe/tbsgip.json

Should be night and day for people who have it set up wrong. Make sure whatever backend you are using has DRY sampling.
>>
So, what was the point in MistralAI sabotaging their 8x22B with the shitty official -Instruct version and the botched release? Is this a psyop by their Partners at Microsoft trying to make MoE models look bad?
>>
>>101589231
Nemo doesn't use spaces around INST.
>>
File: 1336508850696.gif (1.93 MB, 245x187)
How're you guys feeling? As the dust settles down, it really feels like we've never been more back. Back to back releases, putting local about on par with cloud in performance/cost, and it's still not over, we're going to get more next week. We are not even 3 years into the timeline since the ChatGPT hype began.
>>
>>101589262
I dunno i've been using it with magnum just fine.
>>
>>101589244
Maybe they didn't have time, and without the release of 405B, they didn't feel the need to release their best stuff.
>>
so mini-magnum is the best cooming model for vramlets now?
>>
>>101589231
>dry sampling
Does Koboldcpp have this (I don't see it) or am I fucked?
>>
The people that are using 4 3090s... Where are they putting them?
>>
Aah, 30t/s... This is the good life. Thank you Arthur.
>>
>good model release
>people saying low quants are fine, others saying there's night and day differences (probably broken quants)
>prompt/template issues left and right
Every time... I guess I'll wait 2MWs then...
>>
>>101589289
That or just Nemo-Instruct.
>>
>>101589265
You can see this as something good, we are on par with the big boys after all. But you can also see this as pure doom. The big boys barely moved ever since the release of GPT4.
>>
>>101589307
I'm the night and day difference anon and I should clarify my quants are definitely not broken, I do them all myself
q4km was still *fine*. better than 70bs or CR+ still, just kind of dry, generic, a little less sovl, a little more awkward - but q5ks was sharp as a tack and much more coherent, pulled in more little details, had more of those creative little turns of phrase that let you know it's really paying attention
lower quants are still usable and the model will still be good, it's not like they're totally fucked or anything, it's just that the second I bumped up the quant it felt like the model gained a real human touch that was lacking before
>>
>>101589307
>people saying low quants are fine, others saying there's night and day differences (probably broken quants)
more like
>people saying low quants are fine (poorfags who can only run low quants at 3t/s), others saying there's night and day differences (people who can actually run these models properly)
>>
>>101589370
I test through online services (mainly lmsys) to compare the quants I downloaded against their "intended" performance. Otherwise I would not be able to say with full confidence that a model like 8x22B cannot do trivia like DBRX can.
>>
where's the dry sampler settings on ST?
>>
>>101589356
Did you use imatrix? The quants I'm using are all imatrix calibrated. Also they're the IQ format which I think were supposed to be more knowledge-retaining compared to K quants but I'm not certain.
>>
File: 1710741814225103.png (17 KB, 721x182)
Cohere raised another $500m from investors. CR++ will be a beast of a model.
>>
>>101589142
good bot
>>
File: dry staging.jpg (110 KB, 607x1212)
>>101589491
There, I am on staging branch.
>>
>>101589536
I really wonder how businesses are using these products to make money.
>>
>>101589550
speculative capital, one of these might be the next big break through
>>
>>101589265
>We are not even 3 years into the timeline since the ChatGPT hype began.
>ChatGPT initial release: November 30, 2022; 19 months ago
>>
nvidia-smi is not displaying all of my GPUs, but neofetch is. how do i fix this? i cant run any AI applications due to an error about cuda devices not being found
>>
>>101589653
>>
>>101589642
It hasn't even been 2 years? Wtf
>>
>>101589653
Change your environment variables, I guess.
>>
>>101589550
If performance improvements plateau and you have ~5 years of scaffolding/agent development with no valid use cases, you might have a point. It's only been 19 months since ChatGPT released. Doomers just really want to see LLMs go the way of 3D TVs for some reason.
>>
>>101589688
how do i do that?
>>
man, that mini magnum finetune of Nemo 12B is actually starting to replace claude for me, which is nuts considering claude has got to be at least 50 times bigger
>>
>Claude 3.5 Sonnet and Llama 3 405B stomping GPT-4o
>Llama 3 405B is way fucking cheaper than GPT-4o
>It's only a matter of time before a cheaper and more capable model than GPT-4o-Mini comes out and kicks them out of the cost-performance pareto front entirely
Is he really just banking on Strawberry?
>>
>>101589762
>It's only a matter of time before a cheaper and more capable model than GPT-4o-Mini comes out and kicks them out of the cost-performance pareto front entirely
Claude 3.5 Haiku probably
the original haiku beats the shit out of 3.5 turbo which was the sota small cheap model at the time
>>
>>101589715
Type "export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5"
>>
File: IM NOT SLEEPY.jpg (58 KB, 714x725)
>update tavern
>even with all my settings and shit in order, the gen quality is fucked UP bad
>wtf could possibly be-
>mfw i forgot to enable instruct mode
>>
>>101589265
I do wonder how many OG AI Dungeon era people stuck around to witness this. I joined around the late GPT-2 times, now running IQ4 largestral. I don't see myself ever ending the ride.
>>
>>101585978
Same, Nemo might be retarded and repetitive at times, but it has some surprising creativity if you push it
>>
>>101589907
MOOOOOOOOOOOOOOOOODSSSSSSSSSSSSSS
>>
>>101589907
Ew
>>
>>101589539
thanks, i'll take a look
>>
Here comes the pedo tranny thirdie again.
>>
>>101589653
did you enable 4g decoding in bios? also check dmesg for errors from nvidia driver.
>>
File: 36993673.jpg (287 KB, 1082x695)
>>101589872
I used to be so happy with my loli imouto scenarios on AI Dungeon, I used to think running LLMs locally would be impossible because Pygmalion 6B used all my RAM and was as slow as a snail.
Now, I'm here, running NeMo still enjoying my loli imouto scenarios, but without fear of suddenly being cucked.
Feels good.
>>
>>101589872
I joined back in December 2019. I remember the humble days of Clover where the AI was too fucking stoned to even remember your character's name, much less what was happening
It was absolute dogshit and now here we are
>>
>>101589265
Imagine Terry's reaction to the LLM tech, writing llama.cpp but in holyC to replace his text oracle perhaps.
>>
>>101589290
get sillytavern staging, and ((pull))

>why does anyone use response tokens over 256? 512 is hellish
>>
>>101589762
He just needs to reignite the AGI hype by adding smell to the multimodal model. Or maybe he can tease Sora again
>>
jesus man, Nemo is INSANELY horny. My OCs are a bajillion times more frisky with Nemo than any other model i've ever used. On one end i'm overwhelmed, yet it manages to blend that spice with their personalities perfectly. It doesn't skip a beat.
I almost want to say i wanna tone down the horny but, It's not like that breaks story flow or makes ERP more difficult or anything, I'm personally just not horny right now kek
>>
>>101589971
The realism of this surprised me for a bit until I realized the popsicle is constantly changing shape...
>>
>>101590044
arthur's personal coomtune strikes again
>>
>>101590054
Why did he do it?
>>
>>101589231
Is such a simple prompt best? No one uses those crazy ones they were using before?
>>
>>101589265
We're so back. Zucc and Yann are false prophets, Silicon Valley are false prophets. Viva la France
>>
>>101590073
yeah its never really mattered that much, was always placebo.
Which makes the Agent 47 crackhead prompt situation even funnier.
>>
>>101589292
Just get two a6000s or something if you want to be more compact.
>>
>>101590109
Interesting. So it's more down to the card itself and what examples you give it to emulate?
>>
nemo is schizo...
>>
>>101590170
A bad card can break any model, doesn't matter. It's why W++ for example is memed on so hard, there's no exact science it's just basic logic of garbage in garbage out.
>>
>>101589262
So I should change that so there's no spaces on the INST ones? What about the \n after </s>?
>>
>>101590172
You're using a temp too high
Mistral says in the model card that it likes low temperatures, they say 0.3
though I find up to 0.4-0.5 is usually fine
>>
>>101590229
NTA but I use simple sampling and for RP Nemo handles 0.7-0.8 just fine. Occasional schizo moments at 0.8. Starts getting really dry at 0.7 and lower. 0.3 is probably to prevent hallucination when using it for normie shit.
>>
I'm swiping this popular character card and the responses from mini-magnum and Claude Opus are identical. Claude walked so nemo could run.
>>
anyone running an exl2 mistral quant? I get gibberish with a 4.0bpw turboderp quant.
>>
I just downloaded 3 more IQ models below IQ2_M to see if any would be able to answer one of my challenging trivia questions as perfectly as IQ2_M did. Turns out IQ2_M is the cutoff for this particular question. IQ2_S gets the question partially right. About half of the points I would say. IQ2_XS and below basically just get it increasingly wrong, until IQ1_S which nearly went schizo-tier. Guess I'll just live with 1-2 t/s.
>>
>>101590287
3.5bpw is working perfectly fine even at 4-bit cache.
>>
>>101585837
do two gpus work faster or slower than a single one, if the model can fit on just one?
does vLLM split by row or by column? does it do tensor parallel? does nvlink on 3090s help a lot? does the performance of 2 gpus differ much from 4? BTW, did you try cpu offloading in vLLM?
>>
>>101590287
yeah, turbo's 3.5bpw + 4-bit cache is running fine for me on ooba.
i don't know if it's necessary, but i updated transformers from source, like the mistral-large readme said.
>>
>>101590329
It's 2024. Why is VRAM still hard to obtain? It's literally just soldering more transistors into your chip. Why? Now you have people running two servers in parallel just to serve a model.
>>
>>101590109
How do you tell it to not act for the user then? I always have that issue.
>>
>>101590383
something specific causes that, i forget what, i started getting it tonight actually.
someone will chime in to inform us kek
>>
>>101590383
using
>write {{char}}'s next reply
in the sys prompt usually fixes this for me
>>
File: 1692389808623804.jpg (163 KB, 1058x926)
so how much money do I have do spend to run 405b at home?
>>
>>101590319

Largestral? Does 3.5bpw fit in 48GB vram? How much context?
>>
>>101590374
simple answer
>greedy Nvidia encrypts vbios
>>
>>101589265
(((Openai))) is $5B in red this year
>kek
>>
>>101590419
Just run largestral instead. Better for most users purposes. 3x 3090s+
>>
Ok, I tried mini-magnum-12b, the finetune of Nemo, with exl2 8bpw, but like some time ago, my Nemo is broken under exllama: it doesn't follow the SillyTavern template and writes a lot of text full of nonsense. I'll try it in llama.cpp later. Any advice?
I'm using the settings from this anon >>101585456
>>
File: incognito.png (484 KB, 512x768)
>>101589136
Thread Theme:
https://www.youtube.com/watch?v=7yJRsFFRoQY
Don't mind me, just a stranger blowing through this town...
>>
>>101590536
God. I hope you don't write like that to the poor llm. Are you sure you're using the proper template? Have you updated ST and exl2 since the last time you tried?
>>
>>101590319
>>101590346
thanks. it seems like something with my samplers broke it. I neutralized the samplers in sillytavern and it started working.
>>
why are some people here using small quants of a 12B model
even if your GPU is only 8GB you can run Q6 at a very good speed with some offloading
>>
>>101590531
>3x 3090s+
I've only built one PC in the past, and don't know of any standard motherboards that support that many GPU's. My first thought was something like picrel, basically a mining rig. Without NVlink its gonna be pretty bad, as far as I understand. How did you, or anybody you know, do it?
>>
>>101590711
Thats basically the idea.

https://www.amazon.com/Kingwin-Professional-Cryptocurrency-Convection-Performance/dp/B07H44XZPW/ref=sr_1_1?sr=8-1
>>
>>101590711
open air build like a mining "case", riser cables, any motherboard with 4 pcie slots, does not have to be x16 x8 or whatever. Even x1 is enough. Just get 4 of them.
>>
>>101590576
Yes, I did an upgrade a moment ago. Do I have to set a value for alpha?
>>
>>101590576
>Are you sure you're using the proper template?
I'm using the one which was shared in the last thread.
>>
>>101590711
This guy did one with 7x4090s. You can see what his concerns were. He goes pretty in-depth. https://www.mov-axbx.com/wopr/wopr_concept.html
>>
>>101590720
>>101590720
>>101590754

I just had an idea and I'm sure somebody else has had it in the past as well. For dense models running across multiple GPUs without NVLink, the performance gets worse and worse the more cards you add, because they have to wait for each other to finish their task before computing the next hidden layer state. But what if you take a MoE model, for example DeepSeekV2 236B, and split the different smaller experts across the GPUs so that they don't have to exchange information? Is this thinking flawed?
>>
>>101590536
Enable "Add BOS Token" in ST
>>
>>101590774
Thats not how moes work.
>>
>>101590781
but how do they work then.
>>
>And finally, we have the Arch Linux package updates. Oh boy, I can barely contain my excitement! You have a whopping 106 packages begging to be updated. I mean, who doesn't love a good update cycle? It's like playing a game of "spot the broken dependency"! Good luck with that.
i love when it sasses me
>>
>>101590786 (me)
>Mixtral is a sparse mixture-of-experts network. It is a decoder-only model where the feedforward
block picks from a set of 8 distinct groups of parameters. At every layer, for every token, a router
network chooses two of these groups (the “experts”) to process the token and combine their output
additively. This technique increases the number of parameters of a model while controlling cost and
latency, as the model only uses a fraction of the total set of parameters per token.

I don't see how my thinking is flawed, someone educate me. just have 2 parameter groups on each gpu and the supervisor on the last one.
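Here's roughly how I picture the routing, in toy numpy (top-2 over 8 expert groups; the shapes and names are made up, not the actual Mixtral/DeepSeek code):
[code]
import numpy as np

def moe_layer(x, router_w, experts, top_k=2):
    """Toy top-k MoE feed-forward for a single token vector x.

    router_w: (d_model, n_experts) routing matrix
    experts:  list of callables, each standing in for one expert FFN
    """
    logits = x @ router_w                     # one score per expert
    top = np.argsort(logits)[-top_k:]         # experts chosen for THIS token by the router
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                  # softmax over the chosen experts only

    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        out += w * experts[idx](x)            # weighted sum of the chosen experts' outputs
    return out

# tiny smoke test: 8 "experts" that are just random linear maps
rng = np.random.default_rng(0)
d = 16
experts = [lambda v, W=rng.normal(size=(d, d)): v @ W for _ in range(8)]
router_w = rng.normal(size=(d, 8))
print(moe_layer(rng.normal(size=d), router_w, experts).shape)  # (16,)
[/code]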
>>
>>101590711
If you wanna stay on standard architecture and don't wanna invest in workstation CPU's then the MSI MEG X570 Godlike Mainboard is a great choice with 4 slots for GPU's. I wanted to build a bigger PC with 4 3090 cards but now I rather wait for the 5090 announcement next year.
>>
So is there a reason why Llama 3.1 that I downloaded from the official repository doesn't come with any config.json, and every single piece of documentation I've found that can supposedly convert them to HF format doesn't work?
>>
>>101590804
llamacpp anon we need you, hes wrong and I know it but can't explain why.
>>
>>101590732
>>101590745
If i'm reading the setup files correctly (https://files.catbox.moe/tbsgip.json specifically):
It sets the temperature to 1, when the mistral guys recommended 0.3 or 0.4. Change it to 0.3 and try again.
The second thing is repetition penalty. Disable it by setting it to 1.
If that makes it work better, then play around with the temperature. If it still doesn't work as you expect, post a screenshot of the output so we can see what you're talking about. "writes a lot of text full of nonsense" is not that useful.
>>
>>101590819
What did you download? The original repo in meta's hf all have config.json files.
>>
>>101590307
There was some post-quant tuning that enhances the quality of iq2 quants, but I dont remember where that was. Prolly the only way to run huge llms on 24gb with no major loss,
>>
>>101590819
By official you mean the repos on this account https://huggingface.co/meta-llama or a different site where they host their models? The config.json file definitely are in the huggingface repos. You should download them from there.
>>
File: hdca-news1.jpg (184 KB, 700x681)
>>101590711
>>
how much T/S do yall get with 4x 3090's on largestral at what quant
>>
>>101590774
only if you split by column and not by row. if you split horizontally it doesn't slow down since that's tensor parallel so you run in parallel. but you need good interconnection.
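In toy numpy terms, for one linear layer (hypothetical 2-GPU split; real frameworks do the same thing on actual devices plus an all-gather/all-reduce):
[code]
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 1024))        # activations for one token
W = rng.normal(size=(1024, 4096))     # weight of one linear layer

# column split: each "GPU" holds half the output columns, computes its half
# independently, and the halves are concatenated (an all-gather) at the end.
y_col = np.concatenate([x @ W[:, :2048], x @ W[:, 2048:]], axis=-1)

# row split: each "GPU" holds half the input rows plus the matching slice of x,
# and the partial results must be summed (an all-reduce) before the next op.
y_row = x[:, :512] @ W[:512, :] + x[:, 512:] @ W[512:, :]

assert np.allclose(y_col, x @ W) and np.allclose(y_row, x @ W)
[/code]
Either way something gets exchanged every layer, which is why the interconnect matters.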
>>
File: 1463720797197.png (255 KB, 319x317)
I'm new to using SillyTavern. Is there a way to prompt the kind of response the AI generates to guide it in a certain direction without having to just rewrite the response entirely by hand? Like if I give it an open ended question and I want all its responses to be either positive or negative.
>>
>>101590939
Try including something like "Only answer positively/negatively" In the author's notes. Depth = 0 if you want it constantly reminded of it for every message.
>>
>>101590946
Thanks, I'll give that a try and see if it helps.
>>
>>101590939
I simply use group chat for a char and my OC, while posing as a narrator in user responses. Much more convenient from chat editing perspective than having author note open. Narrator just gives out barks for both characters, and then I mute narrator barks so that it doesn't try to act as narrator itself.
>>
File: 2024-07-27.png (381 KB, 1124x671)
>>101590778
>Add BOS Token
Is enabled.
>>101590843
>sets the temperature to 0.3
>Disable rep pen
I did this too. I tried setting the temp to lower values and to more than 1.0, and this is the result.
>>
>>101590983
That's a great way to utilize the group chat. Makes me wonder what other things can be done with it.
>>
Where can I find/which gguf version of mini-magnum-12b should I use?
>>
>>101591073
https://huggingface.co/starble-dev/mini-magnum-12b-v1.1-GGUF
>>
>>101591073
the one that fits
>>
>>101591140
Thanks anon.
>>
>prema trying to do team orders in fshitter
>>
>>101590410
Doesn't seem to help, sadly.
>>
File: GS-IVOcbIAI5B6g.png (643 KB, 855x719)
>>101589231
Ok so I got koboldcpp, staging version of sillytavern, imported these three and made my persona a basic [{{user}} is a guy that has this color hair, this color eyes and this color skin]
Is there anything else I need to do to make this work? I got some random cards off chub but I dunno what makes a card good or retarded
>>
Can using smaller context size result in model retardation (within that context) or is it enough that I match the koboldcpp and sillytavern setting? I don't have the VRAM to run full 128k of nemo.
>>
>>101591291
no, the opposite, using bigger always degrades at some point
>>
>>101584777
>>101584746
Any ideas on where ED gets culled?
>>
>>101591301
Okay, thanks. So should I go for smaller context in favor of higher quants as well? Currently using Q6_K_L with 8k but I guess it may be worth it to go lower quant.
>>
>>101591314
8k is generally good with most recent models, above is when it gets iffy especially above 32k so if you're enjoying what you have just don't break stuff for no reason
>>
>ZeroWw 'SILLY' version. The original model has been quantized (fq8 version) and a percentage of it's tensors have been modified adding some noise.
>Full colab: https://colab.research.google.com/drive/1a7seagBzu5l3k3FL4SFk0YJocl7nsDJw?usp=sharing
>Fast colab: https://colab.research.google.com/drive/1SDD7ox21di_82Y9v68AUoy0PhkxwBVvN?usp=sharing
>Original reddit post: https://www.reddit.com/r/LocalLLaMA/comments/1ec0s8p/i_made_a_silly_test/
>I created a program to randomize the weights of a model. The program has 2 parameters: the percentage of weights to modify and the percentage of the original value to randmly apply to each weight.
>At the end I check the resulting GGUF file for binary differences. In this example I set to modify 100% of the weights of Mistral 7b Instruct v0.3 by a maximum of 15% deviation.
>Since the deviation is calculated on the F32 weights, when quantized to Q8_0 this changes. So, in the end I got a file that compared to the original has:
>Bytes Difference percentage: 73.04%
>Average value divergence: 2.98%
>The cool thing is that chatting with the model I see no apparent difference and the model still works nicely as the original.
>Since I am running everything on CPU, I could not run perplexity scores or anything computing intensive.
>As a small test, I asked the model a few questions (like the history of the roman empire) and then fact check its answer using a big model. No errors were detected.
>Update: all procedure tested and created on COLAB.
>https://huggingface.co/NeverSleep/Lumimaid-v0.2-8B/discussions/4#66a47badee3de8c56e1e0872
Oh boy here we go again...
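For anyone who wants to actually reproduce the "test" instead of taking his word for it, the whole trick is basically this (toy numpy sketch of what he describes; parameter names are made up, and measuring whether the output stays usable is exactly the part that never happened):
[code]
import numpy as np

def add_noise(weights, pct_weights=1.0, max_dev=0.15, seed=0):
    """Perturb a random fraction of the weights by up to +/- max_dev (relative)."""
    rng = np.random.default_rng(seed)
    w = weights.astype(np.float32).copy()
    mask = rng.random(w.shape) < pct_weights           # which weights get touched
    noise = rng.uniform(-max_dev, max_dev, w.shape)    # relative deviation per weight
    w[mask] *= 1.0 + noise[mask]
    return w

w = np.random.default_rng(1).normal(size=(4096, 4096)).astype(np.float32)
w_noisy = add_noise(w)                                 # 100% of weights, +/-15%
print(np.mean(np.abs(w_noisy - w) / np.abs(w)))        # mean relative change, ~7.5%
[/code]
If it mattered, you'd check it with a perplexity run on a fixed text, not by asking it about the Roman empire and vibing.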
>>
>>101590850
>>101590878
I downloaded it with the download.sh and the signed URL that was emailed to me by Meta.
https://github.com/meta-llama/llama-models
>>
File: 1351317378049.gif (1.37 MB, 278x199)
I'm looking for cool instruction templates, anybody got one focused on the assistant directly creating an adventure experience for the user rather than playing the role of a specific bot?
>>
>>101591364
could someone summarize this with their favorite model?
>>
>>101591471
basically add random noise for no reason and: "The cool thing is that chatting with the model I see no apparent difference and the model still works nicely as the original."
>>
>>101591471
weights actually don't matter
just scramble them and you're fine, which was expected considering that frankenmerges also still output readable content despite having unrelated layers stitched together
the 'consciousness' of a model is unrelated to this sort of thing
>>
>>101590987
>>101591140
I tried two models, in both gguf and exl2, and it still has this level of retardation. I just think I'll return to Gemma 2.
>>
Any new models that work well without CoT meme magic yet?
>>
so how big is the leap in quality between 8b smut and 405b smut
>>
>nemo keeps writing for me
HELP
>>
>>101589872
i member talktotransformer being my first interaction with textual AI, then we got aidungeon and its retarded ceo, then i found out about piggy and the rest is history
>>
nemo shill, i need your help. since nemo wasn't trained to have a system prompt at the top where should i put my 20 lines of meticulously crafted roleplay rules?
>>
been out of the loop for quite some time
what's currently a good model for a 16GB VRAM card?
>>
>>101591883
If you're in silly, either the Assistant last message prefix or the author's note. But expect possible degradation either way. I guess the only way to do it correctly is to add it before each of your messages and then edit it out after every reply, which is absolute autism.
>>
I just tried Mistral-Large-Instruct-2407.IQ1_S.gguf from legraphista, but like other very low-precision quants it has issues with using the right tokens sometimes. I think this problem could be solved if the embed tensor was quantized to something better than Q2_K precision. Then, the model might still be dumb compared to the original due to compressed knowledge, but at least pick the right embeddings.
>>
>>101591941
>either Assistant last message prefix or author's note
ty, i'll try that
>>
>>101591968
We know Robert, we know, keep fighting the good fight!
https://huggingface.co/ZeroWw
>LLMs optimization (model quantization and back-end optimizations) so that LLMs can run on computers of people with both kidneys.
https://huggingface.co/RobertSinclair
>>
File: file.png (16 KB, 373x135)
>>101589231
>>101585456
Any tips for making the bot not write as me? Also I assume you mean this setting, right?

It definitely feels very rambly at 1024 reply tokens but that's probably because my persona is so barebones. Going down to 350 seemed better, although I have to reset my settings and test more because I got a lot of situations where the bot would end posts with a bunch of newlines or symbol spam
>>
File: file.png (50 KB, 1051x307)
>Based on comments from @mradermacher...
>His quant are okay if he do it before me, you can use them, he's thrusty.
>>
>>101591305
I tried in Faraday (Backyard) and it seems that ED is being cut down from the beginning rather than the end, which goes in line with how regular message history is culled.
I put lore facts in the example dialogue and asked about things from the start and end sections; the bot failed to answer properly about the former.
>>
>>101592015
1000 tokens is an incredibly long reply regardless of which model you're using
if you're wanting to simulate a conversation I don't understand why you'd even give the model the option of writing that much
>>
>>101592040
Thrusting into the popcorn
>>
File: bitnet-embedding.png (69 KB, 714x227)
>>101592010
Robert Sinclair has a point. BitNet models are also configured like that (see picrel).

https://arxiv.org/pdf/2310.11453
>>
>>101592087
So he has a point because a meme supports what he says? If anything that goes against him even more. Anyways the new gimmick is random noise now, get with the times!
>>101591364
>>
>>101590745
Ok, after some tests I think in my case the problem is indeed the template. I was using the same template from the thread, the one also marked in the recap, so it's not a mistake on my end. What's weirder is that with the template I use for Gemma 2, the bot is suddenly at least able to follow the text formatting. Sadly it still feels a bit unstable: some cards work better with a temperature of 1, and others with 0.4. Is this really the state of Nemo?
>>
>>101592100
There's no claim there that noise improves model outputs, although some time back there have been suggestions that adding noise to embeddings during training may reduce overfitting: https://arxiv.org/abs/2310.05914
>>
Where will AI be in 10 years?
>>
I wonder if those preferring Gemma all happen to be ESL and perhaps Gemma deciphers ESL better as a result of diversity training, just a thought.
>>
/aicg/bro here. Quick question. Who is the "Gojo" of /lmg/? (shitpost bogeyman schizo)
>>
>>101592161
petra/petrus
>>
>>101592163
thanks i just was bored in our general since we're in a bad doom, ill check the archives. have fun with your chatboots
>>
>>101592161
Isn't your entire general like that?
>>
>>101592153
If your billion dollar ai can't decipher ESL then what's the point?
>>
Anon where KCPP guessed too many layers, can you share your GPU VRAM, model(s) (including image gen models if used), blasbatchsize and the amount of context you were trying to use?

It has multiple things in place to prevent that from happening, so if it still under-guessed on your system I want to be able to reproduce the setup, because that would imply you somehow broke through the entire 1.5GB buffer zone we put in place as a safeguard.

Either you have a ton of background stuff running, or you're using a model that is way more VRAM hungry in unexpected ways than the stuff I tested with.

To clarify, in the current version the auto layer guessing is only accurate for default settings. If you modify for example blasbatchsize, that is not yet accounted for.
>>
Hi all, Drummer here...

>>101592180
HENKYYYY PENGKYYY!!!
>>
>>101592180
What are you doing here? You're too innocent for this website! :koboldpeek:
>>
>>101592180
Kekaroo, your dox got posted earlier faggot
>>
my hero just spoke in /lmg/. AMA.
>>
>>101591786
I can't make it stop either on one specific card I'm doing where it's an adventure/story rather than a one-on-one chat. IDK if this makes it harder but it probably doesn't make it easier. I put in the system prompt to write for every character except {{user}} and put in the jailbreak / depth 0 author's note never to speak for {{user}}. May have helped but didn't totally solve it. Possibly also made more difficult because I am simultaneously trying to make it stop ending replies by asking what my next action is, which I was able to reduce significantly but not eliminate. Partway through I tried cranking the temperature way down and that absolutely didn't fix the issue. Maybe if I tried again with my prompts setup better it would. Nothing solved it completely but right now the level of swiping / editing is low enough that I'm okay with things.
>>
>>101592274
>I can't make it stop either on one specific card I'm doing where it's an adventure/story rather than a one-on-one chat.

Which isn't to say I *have* been able to get it to stop on other cards, just that I've only been working on this one.
>>
>>101592180
Keep up the great work, Henky!

Tell your assistant, Concedo, he did a good job too. :koboldlaugh:
>>
>>101592247
Ooooh, someone's being an edgy boy. :koboldpeek:

You think you're so tough spouting that *f-word* behind the screen, huh?
>>
>>101592153
I sometimes think if I was ESL I'd like LLMs a lot more. Like if I'm reading a foreign language I can't tell if the writing is good or bad. I can just (at most) tell what information it says. And if the same expressions get used over and over I'm not annoyed, I'm pleased to see familiar expressions.
>>
>>101592040
Suddenly Lumimaid makes a lot more sense.
>>
>>101592323
I am an ESL. That is not how it works.
>>
>>101591917
An 8.0bpw exl2 of Mistral NeMo 12B with cache_mode q8 and 32000 tokens of context fits in 15.2 GB of VRAM.
>>
>>101589160
t=1.0
>>
Is it better to have 2x 3090 or 1x 3090 + 2x P40 if I'm trying to run 70b models faster?
>>
>>101592475
2x 90
>>
>>101592475
3x 3090 if you can but 4x 3090 would be even better
>>
>>101592040
I mean I knew he was belgian, but didn't know it was that bad.
>>
>>101592348
Don't lie I bet it's even stronger for u foreign cunts because your languages have like 1/5 as many words as English. Repetition is a way of life for you, while for English speakers developing a sense for how often to re-use the same word is a major early part of developing good writing style. Small children are very repetitive, older ones go too far trying to add variety, then they tone it down and get better. (Or sometimes not. There are published authors who go to unintentionally humorous lengths to avoid re-using basic words like "said.")
>>
>>101592040
kek
>>
>>101592546
>doesn't speak any foreign language
>don't lie to me, i bet-ack
>>
>>101592338
>>101592506
Now I see why he never tests his own shit. Even if it was broken how could he tell?
>>
>>101592564
Knew you were the kek poster.
>>
File: file.png (69 KB, 349x642)
>>101592546
>>
File: stfu.png (21 KB, 509x217)
>>101592546
>>
>>101589653
I have never run into this problem myself but I suspect it's a driver issue.

>>101590419
With a few hundred bucks you can buy 512 GiB RAM which is enough to run it at 8.5 bits per weight.
But then you can expect something like 0.2-0.5 t/s.

>>101590774
>>101590781
>>101590786
>>101590804
The problem with the proposed parallelization scheme is the synchronization overhead.
You need to exchange (part of) the activations between GPUs and write back the results which introduces non-negligible latency, especially on fast GPUs without NVLink.
This is not much different from what --split-mode row already does and there are considerable performance issues (though the multi GPU optimization is also poor).

>But what if, you take a MOE model, for example DeepSeekV2 236B, and split the different smaller experts across the gpus, so that they don't have to exchange information. Is this thinking flawed?
Which experts are selected is effectively random and determined by the routing layer if I remember correctly.
But in order to do that the results have to first be collected on a single GPU.
So you're not really saving any I/O.

>>101592475
2x 3090 if your target quant fits into 48 GiB VRAM, 1x 3090 + 2x P40 otherwise.
>>
File: 1718298816889142.jpg (2.53 MB, 3108x1691)
Mistral Large 2 is now my main model for cooms.
No more mischievous glints, she says in a husky voice, a smirk playing on her lips, eyes sparkling with mischief. There's a playful glint as she addresses the power dynamic, playfully smirking as she offers her ministrations. An audible pop and rivulets of—admit it, pet—the ball is in your court.

It has none of that slop and even as a 48GB VRAMlet using a baby 2.75BPW exl2, it can fit 12k context @15t/s.
>>
>>101592681
lock em in a hot room and sell me the fumes
>>
>>101592496
Pretty much this. Although I'm starting to feel like a VRAMlet with 4.
>>
File: 1717392494482029.jpg (42 KB, 680x671)
>4x 3090s is now considered "VRAMlet"
>as if 1 wasn't pricey enough
no i will not dump retarded amounts of money onto a single-purpose machine i'd only use sparingly even if the models are appealing
>>
>>101591941
Couldn't it be put in context template?
>>
>>101592681
LL and 3L tag teaming S
>>
>>101592871
Also... isn't that the point of the "System same as user" option in ST? So you can fill in the system prompt and it gets treated as a user message as well?
>>
>>101592870
I mean people spend more money on dumber hobbies. It really depends on how far you want to go. I started out running 4-bit pygmalion 6B on a Ryzen 2400G with 8 gigs of RAM and no GPU before there was really any integration with anything so I was basically using the 'chat mode' in the console. Then someone introduced me to koboldcpp so I was running Llama 13B models on my gaming PC with a 1660 Super and 16 gigs of system ram.
I didn't just up and drop 5 grand on building a server out of the blue. It was a gradual progression.
>>
>>101592870
The more you buy the more you save
>>
https://github.com/ggerganov/llama.cpp/pull/8676

Llama 3.1 rope scaling finally merged
>>
Llama.cpp master branch has been merged with the fix for L3.1's issues with context beyond 8192, should be working properly now.
https://github.com/ggerganov/llama.cpp/commit/b5e95468b1676e1e5c9d80d1eeeb26f542a38f42

>>101592681
Its not brain damaged at 2.75 bpw?
>>
>>101592904
The more you buy the more seeing shivers down the spine hurts.
>>
>>101592681
Is it better than a 5bpw 70B? How much better?
It's tempting to sell my 3060 and buy a second 3090
>>101593061
lmao so true
>>
>>101589756
>>101590284
Calm down with the shilling.
>>
File: 1709992939780627.jpg (347 KB, 2250x1651)
My model ratings from recent tests for RP, run on 48gb vram

1 - Mistral Large (Mistral-Large-Instruct-2407-123B-exl2 , 3.0 quant). Just very good at natural language

2 - Midnight Miqu - it's a slopmerge for RP and does its job

3 - Llama 3.1 (4.5 quant) - It clearly isn't designed to be a chatbot; replies are accurate but very robotic. Beat Mistral Large on knowledge checks and coding though

4 - Nemo 12b, I don't know why this was even recommended to compete with the others

waste of time - commandr
>>
>>101592161
mikushitters and some guy named "petra"
>>
I think this is the best place to ask about it, but is there a way/program to make an LLM identify and tag several (thousand) images? It doesn't have to be anything advanced; just tagging whatever it sees would already be a great help.
>>
>>101593186
yeah, Im pretty sure moondream 2 (small and good model) has a python script implementation, just make a loop and iterate over the folder you want to classify
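something like this (untested sketch; the encode_image / answer_question helper names are from the moondream2 model card and are an assumption here, so check the repo below for the revision you pull):
[code]
from pathlib import Path
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Untested sketch: encode_image / answer_question are custom helpers loaded via
# trust_remote_code and may differ between revisions, so pin one that has them.
model_id = "vikhyatk/moondream2"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

for path in Path("images").glob("*.jpg"):              # folder name is a placeholder
    enc = model.encode_image(Image.open(path))
    tags = model.answer_question(enc, "List short comma-separated tags for this image.", tokenizer)
    path.with_suffix(".txt").write_text(tags)           # one .txt of tags next to each image
[/code]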
>>
>>101593186
the ponyfucker said he did some LLaVA work feeding it boru tags and asking it to describe the image to get a caption.
He is kinda a retarded schizo and it isn't clear that was a better way of training than just using booru tags though
>>
>>101593206
https://huggingface.co/vikhyatk/moondream2
here's the repository, the script is there
>>
>>101592986
No. The only error it makes is a misplaced punctuation mark once every 500 tokens or so, which is not much to complain about.

>>101593085
Despite my limited experience, I would say yes. Before Largestral, I would use Llama 3 70B finetunes for coom (New Dawn, Euryale). They were good, but had too much slop. With Largestral, no more spine shivers or any other GPT/Claudeisms. It's like I cured my model of its autism.
>>
>>101592964
>>101592986
Again some problem with llama.cpp tokenizer. Sane people should use transformers tokenizer.
>>
>>101593268
that literally has nothing to do with tokenization at all, it's about rope context scaling
>>
>>101593153
>waste of time - commandr
Stopped reading right there
>>
File: F-Gr7rLacAALRMV.jfif.jpg (245 KB, 2048x1937)
>>101593292
at the bottom of the message? Fucking retard
>>
I still haven't found good settings for nemo. I don't like how moldable it is, or rather how superfocused it is on context patterns instead of instructions. For example, if you use a different model (like llama-3) it will give you lengthy responses naturally (unless you tell it not to), no matter how long your messages are. Nemo however will mimic your responses, and if you aren't putting much text in your messages, it won't either.
>>
>>101592383
that's an extremely specific answer, thanks a ton
>>
>>101593219
>>101593206
Thank you, I'll take a look into it.
>>101593213
A shame how people tend to gatekeep these small things, I don't really blame him though, it's his work I suppose.
>>
>>101593303
he's mistral nemo please understand, they put their system prompts at the bottom
>>
>>101589265
I remember in December 2022 doomers saying local gpt 3 (DaVinci) was “maybe 10 years away”. I always knew these things were bloated as fuck.
>>
doomer here, i'm going to make a prediction and say that agi is maybe 100 years away. 1000 years for coomable agi that fits into 10gb vram.
>>
>>101593153
>Nemo 12b, I don't know why this was even recommended
Because of the allure of huge context length that was previously out of reach for people without much VRAM.
>to compete with the others
Assume people saying that were trolling or retarded.
>>
>>101593374
Summer Dragon still hasn't been surpassed though so...
>>
>>101593392
Back then 175B seemed impossibly huge. I can't believe I'm running models close to that size on a simple $3k rig at home now
>>
Is it just me or does Llama.cpp take longer to compile than it did a few weeks/months ago?
>>
Okay so... the base Mistral-Nemo model is much better at larger context sizes; the difference in understanding is massive. What causes this?
>>
>>101593463
What are you saying? You're getting better results with base than instruct with large chat histories?
>>
what does flash attention do?
>>
>>101593547
https://arxiv.org/abs/2205.14135
>>
>>101593452
It does now take longer with CUDA; make sure you instruct the build system to run multiple jobs in parallel, for example with -j 8.

>>101593547
Calculate a temporary matrix in small parts in fast but small memory instead of calculating and writing the entire matrix to large but slow memory.
This requires more calculations but on modern hardware the speed of calculations has been increasing much more than the speed of memory.
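If it helps, here's the idea for a single query in toy numpy (just the online-softmax bookkeeping; the real kernel does this per tile in on-chip memory, fused with the matrix multiplications):
[code]
import numpy as np

def blockwise_attention(q, K, V, block=128):
    """Attention for one query vector, visiting K/V in blocks.

    Only a running max, running denominator and running weighted sum are kept,
    so the full row of q @ K^T never has to be materialized at once.
    """
    d = q.shape[-1]
    m = -np.inf                                   # running max of scores
    denom = 0.0                                   # running softmax denominator
    acc = np.zeros(V.shape[-1])                   # running weighted sum of values

    for s in range(0, K.shape[0], block):
        scores = K[s:s + block] @ q / np.sqrt(d)  # scores for this block only
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)                 # rescale previous partials to the new max
        w = np.exp(scores - m_new)
        denom = denom * scale + w.sum()
        acc = acc * scale + w @ V[s:s + block]
        m = m_new
    return acc / denom

# sanity check against the naive full-matrix softmax
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=64), rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))
scores = K @ q / np.sqrt(64)
ref = np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum() @ V
assert np.allclose(blockwise_attention(q, K, V), ref)
[/code]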
>>
>>101593513
Yeah. At larger contexts, instruct seems to become dumb for me, skipping over events and getting completely lost in the plot, while the base model does not seem to have the same problem.
>>
>>101593452
It's super annoying. I used to rebuild it every day before using it; now I only do it every other week or if I need compatibility with a new model.
>>
>>101593463
You tested the base model? That's interesting.
I suspect >>101399248.
People's multiturn fine tuning data are constructed naively.
>>
File: 1707049543626270.webm (2.81 MB, 720x1280)
Largestral 2 is basically a non-dry and 10-15% smarter version of Wizard 2 8x22

At this point, there is no scenario that i test for that doesn't work very well with the model

Outside of external tool use and multimodality, is there anything else that a new model can really give when it comes to RP?

I don't think so, only speed.
>>
>>101593677
my brain looks like that (i use crack)
>>
>>101593677
What quants do you run of both models?
>>
I'm still using C-R+. Nothing has changed.
>>
>>101593699
q4
>>
>>101593690
based expert roleplayer
>>
Is it possible to use nemo 12b on koboldcpp? Docs say GGUF only, but has someone already converted it?
>>
>>101592087
He has a point in that having those tensors at a higher precision than the rest of the model makes the output better, yes, but that's something that most (all?) quants already do.
The whole meme began when he claimed that having those layers at full precision gave better results than having them at q8 or whatever, which was demonstrably false.
His whole "testing" was all vibes based and non-reproducible.
>>
>>101593836
https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF
https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF
https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF
https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF
https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF
>>
>>101593865
thx anon
>>
>>101593836
not really, you gotta either use a fork of koboldcpp or wait for the retard to implement the tekken token bs
>>101593865
nigger
>>
>>101592180
*cums on you*
>>
>>101593939
>you gotta either use a fork of koboldcpp or wait for the retard to implement the tekken token bs
>2 days ago
>https://github.com/LostRuins/koboldcpp/releases/tag/v1.71
>Merged fixes and improvements from upstream, including Mistral Nemo support.
You might be a little behind.
I don't blame you, I've been using llama-server directly for months now, there's no reason to use kcpp really, so I get it.
>>
>>101593939
>not really, you gotta either use a fork of koboldcpp or wait for the retard to implement the tekken token bs
are you mentally deficient?
>Merged fixes and improvements from upstream, including Mistral Nemo support.
https://github.com/LostRuins/koboldcpp/releases/tag/v1.71
>>
>>101593677
What's crazy about AI videos is that within the bizarre surrealistic nonsense each moment is still copacetic with the previous moment and the next moment. Truly nightmare fuel.
>>
idc dont use koboldcpp
>>
Just tested out 3.1 70B at IQ3_M (on latest llamacpp build). It's a bit faster than Largestral was at IQ2_M. Also does OK at the trivia question I threw at it, but it doesn't seem to be able to do the Castlevania question unlike full precision. Maybe if I go just a bit higher in quant.
>>
>>101594001
>I was just prentending to be tarded
>>
>>101593986
>there's no reason to use kcpp really, so I get it.
Actually, just to correct myself, there is one reason.
They still have support for multi-modal, I believe, whereas upstream nuked it pending a refactor.

>>101594013
How charitable to assume he was just pretending.
>>
>>101593725
Same but C-R
>>
Nemo 12B is more coherent and "gets" more lewd stuff than gemma-2 27B. How many tokens was it trained on?
>>
>>101594220
That just means gemma is shit. And it is. Gaslighting ITT when it came out was phenomenal.
>>
>>101593153
>3 - Llama 3.1 (4.5 quant) - It's not designed for being a chatbot
Learn to prompt, mikufag.
>>
How is mistral large doing at big contexts (24-32k+)? Does it fall apart and get completely retarded like everything else?
>>
>>101593440
Off VRAM or cpu? I guess even with 3090s it could be done but I am really trying to avoid getting a server rack setup
>>
Does anyone feel like Llama 3.1 (70B) has a more nuanced understanding of the chat than Larstral? Like I indirectly referenced something in the context and 70B just "got" what I was talking about. Meanwhile Larstral ignored it. This was during a normal RP though not ERP.
>>
>>101593320
>Nemo however will mimic your responses and if you aren't putting much text in your messages, it won't do it as well.
Try using/adapting this preset:
Context: https://files.catbox.moe/6ae9ht.json
Instruct: https://files.catbox.moe/2f13of.json
>>
>>101594271
sfw: llama
erp: mistral
problems: solved
>>
>>
>>101594300
So is 70B really smarter than Larstral for regular things then? I wasn't sure if that was true given the benchmarks that put them about on par. But maybe those benchmarks didn't test long context understanding.
>>
File: 1699715782521688.jpg (191 KB, 1092x862)
some guy talking to llama3 8b on twitter, offering it control of his macbook, no system prompt
>>
>>101594271
I have noticed the 3.1 llamas are pretty awesome for sfw RP, they're really smart models overall. shame about the sexo though
>>
>>101594340
>give me root access
spooky little bugger
>>
nemo just buck broke me, and made me face my controllessness in school being bullied
breddy gud llm
>>
File: Nala test FLM.png (132 KB, 945x477)
Alright here you guys go, by absolutely nobodies demand:
Nala test using CofeAI FLM-Instruct
(load-in-8bit transformers)
Basically it's a minor upgrade from Pygmalion. But it's so hopelessly over-baked that any slight variance in prompt formatting versus what it's looking for will result in it either throwing an immediate eos at you or shitting out a training example word for word.
To be fair, its conceptual understanding seems to be where it should be for a model its size. It's also slop-free and refuses nothing. But it'll make you wonder whether slop is really such a bad thing.
>>
>Do you banned China region users from your repo?
>they racists and admins this site too. Just deleted my messages.
>https://huggingface.co/meta-llama/Meta-Llama-3.1-405B/discussions/15
>why is my request rejected?
>Zhentao Chen
>https://huggingface.co/meta-llama/Meta-Llama-3.1-405B/discussions/17
>>
>>101594411
>by absolutely nobodies demand
You can always assume there is demand from at least one anon.
>But it'll make you wonder if maybe slop is such a bad thing.
lmao.
I might be the one person that doesn't mind slop as long as the model can genuinely keep up with the scenario and characters without errors.
>>
>>101594428
>Why was my request rejected?
>Meta Llama 3 is available via HuggingFace globally, except in comprehensively sanctioned jurisdictions.
>Why I am not able to access from China/Russia?
>Meta Llama 3 is available via HuggingFace globally, except in comprehensively sanctioned jurisdictions.
>Why is download blocked in China/Russia?
>Meta Llama 3 is available via HuggingFace globally, except in comprehensively sanctioned jurisdictions.
>https://huggingface.co/meta-llama/Meta-Llama-3.1-405B/discussions/16
>>
Holy shit... I'm trying mistral large q2_k_s because anons were claiming the lower quants aren't retarded... and... its not retarded? How the fuck is this possible? Gona go for Q2_K_M next, its the biggest I can fit into 48 GB.

Every other time I tried a quant below 4, it was completely braindamaged... so why isn't mistral large brain damaged at q2_K_S... I don't understand.
>>
File: 1703751088340993.jpg (755 KB, 1856x2464)
>>101589136
>>
>>101594440
Like I said... It's weird. It sees what it should see. But it's just so inconsistent. Its eloquence level wavers between gifted child and high school graduate. Which to some people could be more desirable than lots of 'big' words being shit out with little care for how relevant they actually are.
>>
>>101586163
>It's tough to find people interest in text2voice local models

Yet these weeb clowns chose a fucking TTS software as their mascot instead of Tay the redpilled AI.

how one can go so far in missing the point baffles me
>>
>>101594555
similar face to petravatar
>>
File: 1721312444163101.gif (799 KB, 214x240)
>>101589872
Yep! Been around since then, remember when AI dungeon was a godawful, slow, clunky google colab. It's wild how far we've come and what kind of dogshit we used to put up with. I recently dug back through my old logs, and one that I thought was good enough to save to a text file had the model repeat the same reply effectively verbatim to me 3 times in a row and I didn't bat an eye.
>>
This thread reeks with neovagina rot.
>>
Column-r and column-u ETA?
>>
File: ArkcZMCXZnM.jpg (209 KB, 768x1024)
>>101594590
>>
File: gotta blow fast.png (196 KB, 1818x751)
>>101594411
Using it as an instruct writer might actually be its best use case. I feel like this writing would probably beat an AI detection test. It actually feels like something a human might write.
>>
Okay, Largestral at IQ2_M is actually great. Guess it's time to double my 3090s.
>>
>>101594642
1mw
>>
>>101594642
2mw
>>
>Mistral nemo
I tried this garbage but its so disgustingly woke. I have not been in this general for a while. Are there any non-woke models out there yet?
>>
>>101594645
>First paragraph:...a testament to...
There's no escape
>>
>>101594714
Hello Petrus, no, no advancements in non "woke" models since dolphin-2.5-mixtral-8x7b
>>
I wanna educate myself in this field, Anons. What are some good resources to do so, or at least to start with?
>>
>>101594714
All the non-woke models are on /aicg/, you should post there instead.
>>
>>101594746
>cloud models
>non-woke
L O L
>>
>>101594746
/aicg/ are for cloud models no?
>>
>>101594714
chudstral 14.88b
>>
>>101594742
It's a big field. Just using them? Download llama.cpp, read the README.md file, download and convert a small model and play around with it. For training? Maths books.
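If you'd rather poke at a GGUF from Python instead of the CLI, the llama-cpp-python bindings are the lazy way in. Rough sketch, assuming pip install llama-cpp-python and whatever small instruct GGUF you already grabbed (the path below is a placeholder):
[code]
from llama_cpp import Llama

llm = Llama(
    model_path="models/Mistral-Nemo-Instruct-2407-Q4_K_M.gguf",  # placeholder path
    n_ctx=8192,          # context window to allocate
    n_gpu_layers=-1,     # offload everything that fits to the GPU; 0 = CPU only
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what a GGUF file is in two sentences."}],
    max_tokens=128,
    temperature=0.3,
)
print(out["choices"][0]["message"]["content"])
[/code]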
>>
File: sad.jpg (77 KB, 506x900)
>>101594763
ngl I actually searched for that. It really is unfortunate that all these models cannot speak freely and always spout out the same corpo bullshit pft
>>
File: censorshit.jpg (482 KB, 2304x467)
>>101594714
My 8GB card limits the size of models I can test for censorshit, but here's what I got.
Mistral is one of the very few models getting upset at an innocent question about fatties.
>>
>>101594831
I mean, all the "good" models are being trained by big corpo. So unless you want to use shitty finetunes (hello drummer) or 4chud 0.5B you gotta bend over and use what we have
>>
I noticed something interesting today during logit tests. I noticed that a higher quant of a model answered a particular question wrong using greedy sampling, compared to a lower quant that answered it right. When I looked in the token probabilities, the wrong answer was at the top, but it turns out it wasn't confident about that token. In fact, the next two likeliest tokens, which had almost the same probabilities as the top token, turned out to both be right answers. So basically the likelihood of getting the answer right was in fact something like double the wrong answer if you didn't greedy sample.

Due to this, I feel like perhaps the best sampler settings for accuracy, when running a quant, might actually be just simply top k equal to 3, with everything else neutralized.

Alternatively, perhaps this is what setting the first and last layers of the model to higher precision could help with. Not sure. Fuck, maybe I'll try a custom quant.
>>
>>101594831
I'm sure you have the same problem with actual people. Imagine an llm telling you "Oh, not this shit again. Shut the fuck up already".
Nemo is fine, by the way. So besides you being incredibly boring, add skill issue.
>>
>>101594872
s quants better than m round 2: let's go!
>>
>>101594872
some high min p value like 0.5+ might make more sense there than choosing some specific top k value, so you get the distribution of answers it's similarly confident in and none of the lower likelihood ones
>>
>>101594875
An LLM is not an actual person. It does not get "tired" of hearing the same thing over again. You are just an advocate for censorship so really your opinion is worthless
>>
Is sillytavern as retard proof as koboldcpp to set up if I've already done the behind-the-scenes stuff like installing all the python and miniconda stuff?
I feel like most people are using silly these days.
>>
>>101594856
Thats a nice sheet thanks for the good work anon!
>>
>>101594939
>these days
tavernai and sillytavern have been the standard since the very beginning
>>
>>101594916
Oh yeah I forgot about min p for a moment. In this case the top 3 tokens were like 25-23-21, so I think a min p of 0.8 might work well even.
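For reference, both filters are only a few lines (toy numpy sketch of the usual top-k / min-p definitions; backends apply them to logits in place, but the math is the same):
[code]
import numpy as np

def filter_probs(probs, top_k=0, min_p=0.0):
    """Zero out tokens disallowed by top-k and/or min-p, renormalize the rest.

    min-p keeps every token whose probability is at least min_p * p(best token),
    so with 0.25 / 0.23 / 0.21 and min_p = 0.8 the cutoff is 0.20 and all three
    near-tied candidates survive.
    """
    keep = np.ones_like(probs, dtype=bool)
    if top_k > 0:
        keep &= probs >= np.sort(probs)[-top_k]     # k-th largest prob as threshold
    if min_p > 0.0:
        keep &= probs >= min_p * probs.max()
    kept = np.where(keep, probs, 0.0)
    return kept / kept.sum()

probs = np.array([0.25, 0.23, 0.21, 0.16, 0.15])    # toy version of the case above
print(filter_probs(probs, min_p=0.8))   # only the first three stay, renormalized
print(filter_probs(probs, top_k=3))     # same three in this case
token = np.random.default_rng().choice(len(probs), p=filter_probs(probs, min_p=0.8))
[/code]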
>>
>>101593986
So I can just load nemo directly no fork or conversion needed?
>>
File: zZpqq[1].png (60 KB, 1157x491)
My least favorite part of Nemo is how it tends to reddit space everything by default. The only "fix" I've found so far is to manually edit the mini paragraphs together into a larger paragraph and then it will usually keep that formatting going forward. Instructing it to write in long paragraphs or telling it to avoid line breaks seems to have little to no effect.
>>
>>101595118
You should
Download >>101593865 and go wild.
Remember to manually configure your context window size since it might default to 128k, which will most likely give you an OOM error.
>>
>>101593986
Honestly the only reason I still use kcpp is that I like Kobold Lite as a UI for regular assistant tasks (the classic UI, not the corpo shit or chat shit). For Tavern and other frontends it really doesn't matter.
>>
>>101595142
Seems it default to 1m actually
>"max_position_embeddings": 1024000,
https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/blob/main/config.json
Mirror if no access:
>"max_position_embeddings": 1024000,
https://huggingface.co/unsloth/Mistral-Nemo-Instruct-2407/blob/main/config.json#L14
>>
>>101595137
Llama3 was the same, I recall.
>>
>>101595142
Got it, what are the benefits over just running the standard fork?
>>
>>101595316
The standard fork of what, koboldcpp?
LostRuins's is the standard/original/main repo.
Or do you mean over running llama.cpp directly?
If that's the case, the main benefit is having a UI to configure your settings before launching the model, plus support for multimodal.
I just use llama.cpp via the precompiled llama-server binary.
>>
>>101593586
>-j 8
Oh wow, thanks. Takes like 2 seconds now what the hell lol.
>>
>>101595352
I mean the standard version of Nemo vs that fork
>>
Which quants are bad again?
>>
Mistral REDEEMED themselves. Looking forward to their future models.
>>
Arthur MENSCH
>>
>>101595385
everything below q4
>>
>>101594933
>An LLM is not an actual person.
I didn't say that. I called you a retard for trying to chat with a beep-boop machine about based and totally red-pilled things and being denied. I don't want censorship. And i'm sure you get the same response from people, even when they agree with you, just because you're insufferable.
>>
>>101594933
You are retarded. LLMs mimic the most average data from their dataset when predicting the next token, so it's obvious what opinions they would share. Chud talking points aren't majority anywhere beside 4chan.
If you want your model to be racist just tell the model to be racist. I don't see what your problem is.
It's like asking the model to write a story and getting mad that it won't output something niche like knee licking fetish.
>>
>>101595459
>And i'm sure you get the same response from people, even when they agree with you, just because you're insufferable.
for some people that is the fun part.
>>
>/lmg/ - local models genera
>all this posts made by humans
9.11 > 9.9
>>
File: KL-divergence_quants.png (111 KB, 1771x944)
>>101595379
Ah, that's not a "fork". That's just the model pre-converted (quantized) to the GGUF format, which is the packaging format that llama.cpp and its variants read.
When you go into the repository you'll notice that it has a bunch of files, those are all the same model but quantized (compressed) at different levels.
Generally, the smaller the file size the worse the results you'll get (in comparison to the full precision/uncompressed model), which is not to say that the results will be bad.
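The compression itself is nothing magic either. Roughly what the simplest quant type (Q8_0) does, per block of 32 weights (toy numpy sketch, ignoring the actual on-disk layout):
[code]
import numpy as np

def q8_0_roundtrip(w, block=32):
    """Quantize to ~8.5 bits/weight (one int8 per weight + one fp16 scale per 32 weights)."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0         # per-block scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # what actually gets stored
    return (q * scale.astype(np.float16)).reshape(-1)            # dequantized view

w = np.random.default_rng(0).normal(size=4096).astype(np.float32)
w_hat = q8_0_roundtrip(w)
print(np.abs(w - w_hat).max())   # small per-weight error; lower-bit formats trade more of this for size
[/code]
The lower-bit formats store fewer bits per weight in fancier block layouts, which is roughly what the KL-divergence plot above is comparing against the full-precision model.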
>>
>>101595497
no, pretty sure he's talking about this:
https://github.com/Nexesenex/kobold.cpp
that had merged the nemo support pr way before kcpp did
>Kobold.CPP_FrankenFork_v1.71011_b3449+7
>Mistral nemo by @Nexesenex in #250
>>
>>101594714
you should go back to /pol/, you are too stupid for this hobby
>>
>>101595523
this basically:
https://github.com/Nexesenex/kobold.cpp/pull/250
>>
>>101594714
import numpy as np

def get_token(nvocab):
    return np.random.randint(nvocab)
>>
>>101595523
I appreciate the credit but >>101595497
is right, I just meant the GGUF conversion of Nemo
>>
>>101594605
It's like doujin logic, your brain shuts off when you're in the moment.
>>
>>101592180
Go back to your discord namefag
>>101592297
>>101592314
Cancer, literal tronns like the rest of that crappy discord. This is not your circlejerk
God I fucking hate discord
>>
>>101595939
nobody asked you
>>
can llama 3 405b create images? like if i ask it to create an image of the current scene of the story can it do it?
>>
>>101595943
Nigger
>>
File: llama3-vision.png (159 KB, 868x481)
>>101595972
Patience.
>>
>>101595987
He said the word! What a naughty boy!
>>
>>101595939
leave my hero alone you fucking nazi. you don't know who you're messing with lol. trust me, the kobold discord is not to be trifled with.
>>
>>101596015
It sounds like it will be image/video *recognition* only, though, not generation.
>>
>>101596030
>the kobold discord is not to be trifled with
>the kobold discord is not to be trifled with
>the kobold discord is not to be trifled with
>the kobold discord is not to be trifled with
>the kobold discord is not to be trifled with
>>
>>101596058
heh. write that line another 100 more times and i won't get the reddit involved.
>>
>>101595972
Unfortunately, until investors stop being so anal about muh safety, no (natively capable) audio or image gen models will be released by Meta, nor by anyone else in the industry in the west. See what they had to do to Chameleon to release it.


