/g/ - Technology

File: 1710266621871822.jpg (462 KB, 1664x2432)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101584411 & >>101578323

►News
>(07/24) Mistral Large 2 123B released: https://hf.co/mistralai/Mistral-Large-Instruct-2407
>(07/23) Llama 3.1 officially released: https://ai.meta.com/blog/meta-llama-3-1/
>(07/22) llamanon leaks 405B base model: https://files.catbox.moe/d88djr.torrent >>101516633
>(07/18) Improved DeepSeek-V2-Chat 236B: https://hf.co/deepseek-ai/DeepSeek-V2-Chat-0628
>(07/18) Mistral NeMo 12B base & instruct with 128k context: https://mistral.ai/news/mistral-nemo/

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
►Recent Highlights from the Previous Thread: >>101584411

--TTS improvements and output issues: >>101586575 >>101586607 >>101586659
--Mistral nemo configuration and settings advice: >>101585456 >>101585527 >>101585596 >>101585669 >>101585834 >>101585868 >>101585572 >>101586019
--Sillytavern single sentence replies issue: >>101587180 >>101587200 >>101587246 >>101587225 >>101587275 >>101587269 >>101587353 >>101587401 >>101587413
--Recommendation for voice data TTS finetuning: >>101585560 >>101586101 >>101586163 >>101587016 >>101588184
--Nemo generates quadrupeds well but writes differently than chatgpt: >>101587732
--Logical flaws in GPT-4 and Claude, Command R Plus gets it right: >>101584587 >>101584617
--GitHub repo for bulk downloading cards for ST: >>101585689 >>101586342
--Anon asks for Command-R Plus alternatives.: >>101585536 >>101585556 >>101586438 >>101586483 >>101586596 >>101586657
--largestral iQ2_M outperforms Nemo in retarded quant, but is slower than 1t/s: >>101585893 >>101585921 >>101585940 >>101585998 >>101586017 >>101585939 >>101585985
--Nemo repetition issues and DRY sampler settings recommendations: >>101587028 >>101587049 >>101587511 >>101587535 >>101587576 >>101587545
--MoEs for roleplaying? Try it and find out: >>101584540
--Mistral Nemo sampler settings cause rambling output: >>101585928 >>101585955 >>101586019 >>101586038 >>101586062
--Where do ST or other UIs cull example dialogue in the context window?: >>101584746 >>101584777
--RULER repo measures effective context length, Llama3.1 performs well: >>101586297 >>101586352 >>101586384 >>101587005 >>101587027
--IQ4_XS vs Q3_K_M model quants and accuracy discussion: >>101585131 >>101585176 >>101585200 >>101585383 >>101585434 >>101588262
--IQ1_S performance and characteristics discussion: >>101588056 >>101588068 >>101588140 >>101588159 >>101588129
--Miku (free space): >>101587473 >>101588754 >>101588896

►Recent Highlight Posts from the Previous Thread: >>101584415
>>
post (You)r largestral presets
>>
File: 00170-699389629075918.png (1.47 MB, 1024x1536)
>>101589142
i got a little chub seeing my repeated (You)s in this AI generated recap
thank you, botkind.
>>
I am once again asking for mini-magnum presets.
>>
>>101589160
I didn't actually try it:
>>>/vg/487568316
>>
gib nemo presets
>>
File: robotnik-jump.gif (14 KB, 420x420)
>>101589210
>>101589219

just use the ones i linked from that anon >>101585456
in fact fuck it ill re-copypaste it again

Here, since so many people seem to be using nemo with wrong formatting then complaining:

Mistral context template: https://files.catbox.moe/6yyt8d.json

Mistral instruct template:
https://files.catbox.moe/rfj5l8.json

Mistral Sampler settings:
https://files.catbox.moe/tbsgip.json

Should be night and day for people who have it set up wrong. Make sure whatever backend you are using has DRY sampling.
>>
So, what was the point in MistralAI sabotaging their 8x22B with the shitty official -Instruct version and the botched release? Is this a psyop by their Partners at Microsoft trying to make MoE models look bad?
>>
>>101589231
Nemo doesn't use spaces around INST.
>>
File: 1336508850696.gif (1.93 MB, 245x187)
How're you guys feeling? As the dust settles down, it really feels like we've never been more back. Back to back releases, putting local about on par with cloud in performance/cost, and it's still not over, we're going to get more next week. We are not even 3 years into the timeline since the ChatGPT hype began.
>>
>>101589262
I dunno i've been using it with magnum just fine.
>>
>>101589244
Maybe they didn't have time, and without the release of 405B, they didn't feel the need to release their best stuff.
>>
so mini-magnum is the best cooming model for vramlets now?
>>
>>101589231
>dry sampling
Does Koboldcpp have this (I don't see it) or am I fucked?
>>
The people that are using 4 3090s... Where are they putting them?
>>
Aah, 30t/s... This is the good life. Thank you Arthur.
>>
>good model release
>people saying low quants are fine, others saying there's night and day differences (probably broken quants)
>prompt/template issues left and right
Every time... I guess I'll wait 2MWs then...
>>
>>101589289
That or just Nemo-Instruct.
>>
>>101589265
You can see this as something good; we are on par with the big boys, after all. But you can also see it as pure doom: the big boys have barely moved since the release of GPT-4.
>>
>>101589307
I'm the night and day difference anon and I should clarify my quants are definitely not broken, I do them all myself
q4km was still *fine*. better than 70bs or CR+ still, just kind of dry, generic, a little less sovl, a little more awkward - but q5ks was sharp as a tack and much more coherent, pulled in more little details, had more of those creative little turns of phrase that let you know it's really paying attention
lower quants are still usable and the model will still be good, it's not like they're totally fucked or anything, it's just that the second I bumped up the quant it felt like the model gained a real human touch that was lacking before
>>
>>101589307
>people saying low quants are fine, others saying there's night and day differences (probably broken quants)
more like
>people saying low quants are fine (poorfags who can only run low quants at 3t/s), others saying there's night and day differences (people who can actually run these models properly)
>>
>>101589370
I test through online services (mainly lmsys) to compare the quants I downloaded against their "intended" performance. Otherwise I would not be able to say with full confidence that a model like 8x22B cannot do trivia like DBRX can.
>>
where's the dry sampler settings on ST?
>>
>>101589356
Did you use imatrix? The quants I'm using are all imatrix calibrated. Also they're the IQ format which I think were supposed to be more knowledge-retaining compared to K quants but I'm not certain.
>>
File: 1710741814225103.png (17 KB, 721x182)
Cohere gathered another $500m from investors. CR++ will be a beast of a model.
>>
>>101589142
good bot
>>
File: dry staging.jpg (110 KB, 607x1212)
>>101589491
There, I am on staging branch.
>>
>>101589536
I really wonder how businesses are using these products to make money.
>>
>>101589550
speculative capital, one of these might be the next big break through
>>
>>101589265
>We are not even 3 years into the timeline since the ChatGPT hype began.
>ChatGPT initial release: November 30, 2022; 19 months ago
>>
nvidia-smi is not displaying all of my GPUs, but neofetch is. how do i fix this? i cant run any AI applications due to an error about cuda devices not being found
>>
>>101589653
>>
>>101589642
It hasn't even been 2 years? Wtf
>>
>>101589653
Change your environment variables, I guess.
>>
>>101589550
If performance improvements plateau and you have ~5 years of scaffolding/agent development with no valid use cases, you might have a point. It's only been 19 months since ChatGPT released. Doomers just really want to see LLMs go the way of 3D TVs for some reason.
>>
>>101589688
how do i do that?
>>
man, that mini magnum finetune of Nemo 12B is actually starting to replace claude for me, which is nuts considering claude has got to be at least 50 times bigger
>>
>Claude 3.5 Sonnet and Llama 3 405B stomping GPT-4o
>Llama 3 405B is way fucking cheaper than GPT-4o
>It's only a matter of time before a cheaper and more capable model than GPT-4o-Mini comes out and kicks them out of the cost-performance pareto front entirely
Is he really just banking on Strawberry?
>>
>>101589762
>It's only a matter of time before a cheaper and more capable model than GPT-4o-Mini comes out and kicks them out of the cost-performance pareto front entirely
Claude 3.5 Haiku probably
the original haiku beats the shit out of 3.5 turbo which was the sota small cheap model at the time
>>
>>101589715
Type "export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5"
>>
File: IM NOT SLEEPY.jpg (58 KB, 714x725)
>update tavern
>even with all my settings and shit in order, the gen quality is fucked UP bad
>wtf could possibly be-
>mfw i forgot to enable instruct mode
>>
>>101589265
I do wonder how many OG AI Dungeon era people stuck around to witness this. I joined around the late GPT-2 times, and now I'm running IQ4 largestral. I don't see myself ever ending the ride.
>>
>>101585978
Same, Nemo might be retarded and repetitive at times, but it has some surprising creativity if you push it
>>
>>101589907
MOOOOOOOOOOOOOOOOODSSSSSSSSSSSSSS
>>
>>101589907
Ew
>>
>>101589539
thanks, i'll take a look
>>
Here comes the pedo tranny thirdie again.
>>
>>101589653
did you enable 4g decoding in bios? also check dmesg for errors from nvidia driver.
>>
File: 36993673.jpg (287 KB, 1082x695)
>>101589872
I used to be so happy with my loli imouto scenarios on AI Dungeon, I used to think running LLMs locally would be impossible because Pygmalion 6B used all my RAM and was as slow as a snail.
Now, I'm here, running NeMo still enjoying my loli imouto scenarios, but without fear of suddenly being cucked.
Feels good.
>>
>>101589872
I joined back in December 2019. I remember the humble days of Clover where the AI was too fucking stoned to even remember your character's name, much less what was happening
It was absolute dogshit and now here we are
>>
>>101589265
Imagine Terry's reaction to the LLM tech, writing llama.cpp but in holyC to replace his text oracle perhaps.
>>
>>101589290
get sillytavern staging, and ((pull))

>why does anyone use response tokens over 256? 512 is hellish
>>
>>101589762
He just needs to reignite the AGI hype by adding smell to the multimodal model. Or maybe he can tease Sora again.
>>
jesus man, Nemo is INSANELY horny. My OCs are a bajillion times more frisky with Nemo than with any other model I've ever used. On one hand I'm overwhelmed, yet it manages to blend that spice with their personalities perfectly. It doesn't skip a beat.
I almost want to say I wanna tone down the horny, but it's not like it breaks story flow or makes ERP more difficult or anything. I'm personally just not horny right now kek
>>
>>101589971
The realism of this surprised me for a bit until I realized the popsicle is constantly changing shape...
>>
>>101590044
arthur's personal coomtune strikes again
>>
>>101590054
Why did he do it?
>>
>>101589231
Is such a simple prompt best? No one uses those crazy ones they were using before?
>>
>>101589265
We're so back. Zucc and Yann are false prophets, Silicon Valley are false prophets. Viva la France
>>
>>101590073
Yeah, it's never really mattered that much; it was always placebo.
Which makes the Agent 47 crackhead prompt situation even funnier.
>>
>>101589292
Just get two a6000s or something if you want to be more compact.
>>
>>101590109
Interesting. So it's more down to the card itself and what examples you give it to emulate?
>>
nemo is schizo...
>>
>>101590170
A bad card can break any model, doesn't matter. It's why W++ for example is memed on so hard, there's no exact science it's just basic logic of garbage in garbage out.
>>
>>101589262
So I should change that so there's no spaces on the INST ones? What about the \n after </s>?
>>
>>101590172
You're using a temp too high
Mistral says in the model card that it likes low temperatures, they say 0.3
though I find up to 0.4-0.5 is usually fine
>>
>>101590229
NTA but I use simple sampling and for RP Nemo handles 0.7-0.8 just fine. Occasional schizo moments at 0.8. Starts getting really dry at 0.7 and lower. 0.3 is probably to prevent hallucination when using it for normie shit.
>>
I'm swiping this popular character card and the responses from mini-magnum and Claude Opus are identical. Claude walked so nemo could run.
>>
anyone running an exl2 mistral quant? I get gibberish with a 4.0bpw turboderp quant.
>>
I just downloaded 3 more IQ models below IQ2_M to see if any would be able to answer one of my challenging trivia questions as perfectly as IQ2_M did. Turns out IQ2_M is the cutoff for this particular question. IQ2_S gets the question partially right. About half of the points I would say. IQ2_XS and below basically just get it increasingly wrong, until IQ1_S which nearly went schizo-tier. Guess I'll just live with 1-2 t/s.
>>
>>101590287
3.5bpw is working perfectly fine even at 4-bit cache.
>>
>>101585837
do two gpus work faster than or slower than a single one if you can fit it in?
does Vllm split by row or by column? does it do tensor parallel? does nvlink in 3090 help by a lot? does the performance of 2 gpus differ much from 4? BTW, did you try cpu offloading in Vllm?
>>
>>101590287
yeah, turbo's 3.5bpw + 4-bit cache is running fine for me on ooba.
i don't know if it's necessary, but i updated transformers from source, like the mistral-large readme said.
>>
>>101590329
It's 2024. Why is VRAM still hard to obtain? It's literally just soldering more transistors into your chip. Why? Now you have people running two servers in parallel just to serve a model.
>>
>>101590109
How do you tell it to not act for the user then? I always have that issue.
>>
>>101590383
something specific causes that, i forget what, i started getting it tonight actually.
someone will chime in to inform us kek
>>
>>101590383
using
>write {{char}}'s next reply
in the sys prompt usually fixes this for me
>>
File: 1692389808623804.jpg (163 KB, 1058x926)
so how much money do I have do spend to run 405b at home?
>>
>>101590319

Largestral? Does 3.5bpw fit in 48GB vram? How much context?
>>
>>101590374
simple answer
>greedy Nvidia encrypts vbios
>>
>>101589265
(((Openai))) is $5B in red this year
>kek
>>
>>101590419
Just run largestral instead. Better for most users' purposes. 3x 3090s+
>>
OK, I tried mini-magnum-12b, the Nemo finetune, as an exl2 8bpw quant, but like some time ago, Nemo is broken for me with exllama: it doesn't follow the SillyTavern template and writes a lot of text filled with nonsense. I'll try llama.cpp later. Any advice?
I'm using the settings from this anon >>101585456
>>
File: incognito.png (484 KB, 512x768)
>>101589136
Thread Theme:
https://www.youtube.com/watch?v=7yJRsFFRoQY
Don't mind me, just a stranger blowing through this town...
>>
>>101590536
God. I hope you don't write like that to the poor llm. Are you sure you're using the proper template? Have you updated ST and exl2 since the last time you tried?
>>
>>101590319
>>101590346
thanks. it seems like something with my samplers broke it. I neutralized the samplers in sillytavern and it started working.
>>
why are some people here using small quants of a 12B model
even if your GPU is only 8GB you can run Q6 at a very good speed with some offloading
>>
>>101590531
>3x 3090s+
I've only built one PC in the past, and I don't know of any standard motherboards that support that many GPUs. My first thought was something like picrel, basically a mining rig. Without NVLink it's gonna be pretty bad, as far as I understand. How did you, or anybody you know, do it?
>>
>>101590711
Thats basically the idea.

https://www.amazon.com/Kingwin-Professional-Cryptocurrency-Convection-Performance/dp/B07H44XZPW/ref=sr_1_1?sr=8-1
>>
>>101590711
open air build like a mining "case", riser cables, any motherboard with 4 pcie slots, does not have to be x16 x8 or whatever. Even x1 is enough. Just get 4 of them.
>>
>>101590576
Yes, I did an upgrade a moment ago. Do I have to set something for the alpha value?
>>
>>101590576
>Are you sure you're using the proper template?
I'm using the one which was shared in the last thread.
>>
>>101590711
This guy did one with 7x4090s. You can see what his concerns were. He goes pretty in-depth. https://www.mov-axbx.com/wopr/wopr_concept.html
>>
>>101590720
>>101590720
>>101590754

I just had an idea, and I'm sure somebody else has had it in the past as well. For dense models running across multiple GPUs without NVLink, performance gets worse and worse the more cards you add, because they have to wait for each other to finish before computing the next hidden layer state. But what if you take a MoE model, for example DeepSeekV2 236B, and split the different smaller experts across the GPUs so that they don't have to exchange information? Is this thinking flawed?
>>
>>101590536
Enable "Add BOS Token" in ST
>>
>>101590774
Thats not how moes work.
>>
>>101590781
but how do they work then.
>>
>And finally, we have the Arch Linux package updates. Oh boy, I can barely contain my excitement! You have a whopping 106 packages begging to be updated. I mean, who doesn't love a good update cycle? It's like playing a game of "spot the broken dependency"! Good luck with that.
i love when it sasses me
>>
>>101590786 (me)
>Mixtral is a sparse mixture-of-experts network. It is a decoder-only model where the feedforward
block picks from a set of 8 distinct groups of parameters. At every layer, for every token, a router
network chooses two of these groups (the “experts”) to process the token and combine their output
additively. This technique increases the number of parameters of a model while controlling cost and
latency, as the model only uses a fraction of the total set of parameters per token.

I don't see how my thinking is flawed, someone educate me. just have 2 parameter groups on each gpu and the supervisor on the last one.
>>
>>101590711
If you wanna stay on standard architecture and don't wanna invest in workstation CPUs, then the MSI MEG X570 Godlike motherboard is a great choice with 4 slots for GPUs. I wanted to build a bigger PC with 4 3090 cards, but now I'd rather wait for the 5090 announcement next year.
>>
So is there a reason why Llama 3.1 that I downloaded from the official repository doesn't come with any config.json, and every single piece of documentation I've found that can supposedly convert them to HF format doesn't work?
>>
>>101590804
llamacpp anon we need you, hes wrong and I know it but can't explain why.
>>
>>101590732
>>101590745
If I'm reading the setup files correctly (https://files.catbox.moe/tbsgip.json specifically):
It sets the temperature to 1, when the Mistral guys recommended 0.3 or 0.4. Change it to 0.3 and try again.
The second thing is repetition penalty. Disable it by setting it to 1.
If that makes it work better, then play around with the temperature. If it still doesn't work as you expect, post a screenshot of the output so we can see what you're talking about; "writes a lot of text filled with nonsense" is not that useful.
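If you want to rule out the preset entirely, you can also hit the backend directly with just those two settings. A minimal sketch against a local llama.cpp server (parameter names per its /completion API; other backends name these differently, so treat it as illustrative):
[code]
# Send one test prompt with temperature 0.3 and repetition penalty disabled.
import requests

payload = {
    "prompt": "[INST] Write a short scene on a rainy rooftop. [/INST]",
    "n_predict": 256,
    "temperature": 0.3,    # Mistral's recommended range is roughly 0.3-0.4
    "repeat_penalty": 1.0, # 1.0 = repetition penalty off
}
r = requests.post("http://127.0.0.1:8080/completion", json=payload, timeout=300)
print(r.json()["content"])
[/code]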
>>
>>101590819
What did you download? The original repo in meta's hf all have config.json files.
>>
>>101590307
There was some post-quant tuning that enhances the quality of IQ2 quants, but I don't remember where that was. Prolly the only way to run huge LLMs on 24GB with no major loss.
>>
>>101590819
By official you mean the repos on this account https://huggingface.co/meta-llama or a different site where they host their models? The config.json file definitely are in the huggingface repos. You should download them from there.
>>
File: hdca-news1.jpg (184 KB, 700x681)
>>101590711
>>
how much T/S do yall get with 4x 3090's on largestral at what quant
>>
>>101590774
only if you split by column and not by row. if you split horizontally it doesn't slow down since that's tensor parallel so you run in parallel . but you need good interconnection.
>>
File: 1463720797197.png (255 KB, 319x317)
I'm new to using SillyTavern. Is there a way to prompt the kind of response the AI generates to guide it in a certain direction without having to just rewrite the response entirely by hand? Like if I give it an open ended question and I want all its responses to be either positive or negative.
>>
>>101590939
Try including something like "Only answer positively/negatively" In the author's notes. Depth = 0 if you want it constantly reminded of it for every message.
>>
>>101590946
Thanks, I'll give that a try and see if it helps.
>>
>>101590939
I simply use group chat for a char and my OC, while posing as a narrator in user responses. Much more convenient from chat editing perspective than having author note open. Narrator just gives out barks for both characters, and then I mute narrator barks so that it doesn't try to act as narrator itself.
>>
File: 2024-07-27.png (381 KB, 1124x671)
>>101590778
>Add BOS Token
Is enabled.
>>101590843
>sets the temperature to 0.3
>Disable rep pen
I did this too. I tried setting the temp both lower and higher than 1.0, and this is the result.
>>
>>101590983
That's a great way to utilize the group chat. Makes me wonder what other things can be done with it.
>>
Where can I find/which gguf version of mini-magnum-12b should I use?
>>
>>101591073
https://huggingface.co/starble-dev/mini-magnum-12b-v1.1-GGUF
>>
>>101591073
the one that fits
>>
>>101591140
Thanks anon.
>>
>prema trying to do team orders in fshitter
>>
>>101590410
Doesn't seem to help, sadly.
>>
File: GS-IVOcbIAI5B6g.png (643 KB, 855x719)
>>101589231
Ok so I got koboldcpp, staging version of sillytavern, imported these three and made my persona a basic [{{user}} is a guy that has this color hair, this color eyes and this color skin]
Is there anything else I need to do to make this work? I got some random cards off chub but I dunno what makes a card good or retarded
>>
Can using smaller context size result in model retardation (within that context) or is it enough that I match the koboldcpp and sillytavern setting? I don't have the VRAM to run full 128k of nemo.
>>
>>101591291
No, the opposite: using a bigger context always degrades things at some point.
>>
>>101584777
>>101584746
Any ideas on where ED gets culled?
>>
>>101591301
Okay, thanks. So should I go for smaller context in favor of higher quants as well? Currently using Q6_K_L with 8k but I guess it may be worth it to go lower quant.
>>
>>101591314
8k is generally good with most recent models, above is when it gets iffy especially above 32k so if you're enjoying what you have just don't break stuff for no reason
>>
>ZeroWw 'SILLY' version. The original model has been quantized (fq8 version) and a percentage of it's tensors have been modified adding some noise.
>Full colab: https://colab.research.google.com/drive/1a7seagBzu5l3k3FL4SFk0YJocl7nsDJw?usp=sharing
>Fast colab: https://colab.research.google.com/drive/1SDD7ox21di_82Y9v68AUoy0PhkxwBVvN?usp=sharing
>Original reddit post: https://www.reddit.com/r/LocalLLaMA/comments/1ec0s8p/i_made_a_silly_test/
>I created a program to randomize the weights of a model. The program has 2 parameters: the percentage of weights to modify and the percentage of the original value to randmly apply to each weight.
>At the end I check the resulting GGUF file for binary differences. In this example I set to modify 100% of the weights of Mistral 7b Instruct v0.3 by a maximum of 15% deviation.
>Since the deviation is calculated on the F32 weights, when quantized to Q8_0 this changes. So, in the end I got a file that compared to the original has:
>Bytes Difference percentage: 73.04%
>Average value divergence: 2.98%
>The cool thing is that chatting with the model I see no apparent difference and the model still works nicely as the original.
>Since I am running everything on CPU, I could not run perplexity scores or anything computing intensive.
>As a small test, I asked the model a few questions (like the history of the roman empire) and then fact check its answer using a big model. No errors were detected.
>Update: all procedure tested and created on COLAB.
>https://huggingface.co/NeverSleep/Lumimaid-v0.2-8B/discussions/4#66a47badee3de8c56e1e0872
Oh boy here we go again...
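For reference, the "program" being described boils down to roughly this; a numpy sketch with parameter names of my own choosing, not the author's actual script (which works directly on GGUF tensors):
[code]
# Perturb a fraction of each tensor by up to +/- max_deviation of its value.
import numpy as np

def add_noise(weights, pct_weights=1.0, max_deviation=0.15, seed=0):
    """weights: dict of tensor name -> float numpy array."""
    rng = np.random.default_rng(seed)
    noisy = {}
    for name, w in weights.items():
        mask = rng.random(w.shape) < pct_weights          # which weights to touch
        noise = rng.uniform(-max_deviation, max_deviation, size=w.shape)
        noisy[name] = np.where(mask, w * (1.0 + noise), w).astype(w.dtype)
    return noisy
[/code]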
>>
>>101590850
>>101590878
I downloaded it with the download.sh and the signed URL that was emailed to me by Meta.
https://github.com/meta-llama/llama-models
>>
File: 1351317378049.gif (1.37 MB, 278x199)
I'm looking for cool instruction templates, anybody got one focused on the assistant directly creating an adventure experience for the user rather than playing the roll of a specific bot?
>>
>>101591364
could someone summarize this with their favorite model?
>>
>>101591471
basically add random noise for no reason and: "The cool thing is that chatting with the model I see no apparent difference and the model still works nicely as the original."
>>
>>101591471
weights actually don't matter
just scramble them and you're fine, which was expected considering that frankenmerges also still output readable content despite having unrelated layers stitched together
the 'consciousness' of a model is unrelated to this sort of thing
>>
>>101590987
>>101591140
I tried two models in both GGUF and exl2 and still get this level of retardation. I think I'll just return to Gemma 2.
>>
Any new models that work well without the CoT meme magic yet?
>>
so how big is a leap of quality between 8b smut and 405b smut
>>
>nemo keeps writing for me
HELP
>>
>>101589872
i member talktotransformer being my first interaction with textual AI, then we got aidungeon and its retarded ceo, then i found out about piggy and the rest is history
>>
nemo shill, i need your help. since nemo wasn't trained to have a system prompt at the top where should i put my 20 lines of meticulously crafted roleplay rules?
>>
been out of the loop for quite some time
what's currently a good model for a 16GB VRAM card?
>>
>>101591883
If you're in Silly, use either the Assistant last message prefix or an author's note. But expect possible degradation either way. I guess the only way to do it correctly is to add it before every one of your messages and then edit it out after each reply, which is absolute autism.
>>
I just tried Mistral-Large-Instruct-2407.IQ1_S.gguf from legraphista, but like other very low-precision quants it has issues with using the right tokens sometimes. I think this problem could be solved if the embed tensor was quantized to something better than Q2_K precision. Then, the model might still be dumb compared to the original due to compressed knowledge, but at least pick the right embeddings.
>>
>>101591941
>either Assistant last message prefix or author's note
ty, i'll try that
>>
>>101591968
We know Robert, we know, keep fighting the good fight!
https://huggingface.co/ZeroWw
>LLMs optimization (model quantization and back-end optimizations) so that LLMs can run on computers of people with both kidneys.
https://huggingface.co/RobertSinclair
>>
File: file.png (16 KB, 373x135)
>>101589231
>>101585456
Any tips for making the bot not write as me? Also I assume you mean this setting, right?

It definitely feels very rambly at 1024 reply tokens but that's probably because my persona is so barebones. Going down to 350 seemed better, although I have to reset my settings and test more because I got a lot of situations where the bot would end posts with a bunch of newlines or symbol spam
>>
File: file.png (50 KB, 1051x307)
>Based on comments from @mradermacher...
>His quant are okay if he do it before me, you can use them, he's thrusty.
>>
>>101591305
I tried in Faraday (Backyard) and it seems that ED is being cut down from the beginning rather than the end, which goes in line with how regular message history is culled.
I put lore facts in example dialogue and asked about things from the start and end section, the bot failed to answer properly about the former.
>>
>>101592015
1000 tokens is an incredibly long reply regardless of which model you're using
if you're wanting to simulate a conversation I don't understand why you'd even give the model the option of writing that much
>>
>>101592040
Thrusting into the popcorn
>>
File: bitnet-embedding.png (69 KB, 714x227)
>>101592010
Robert Sinclair has a point. BitNet models are also configured like that (see picrel).

https://arxiv.org/pdf/2310.11453
>>
>>101592087
So he has a point because a meme supports what he says? If anything that goes against him even more. Anyways the new gimmick is random noise now, get with the times!
>>101591364
>>
>>101590745
OK, after some testing, I think in my case the problem is indeed the template. I was using the same template from the thread, the one also marked in the recap, so it's not a mistake on my end. What's weirder is that with the template I use for Gemma 2, the bot is suddenly at least able to follow the text formatting. Sadly it still feels a bit unstable: some cards work better with a temperature of 1 and others with 0.4. Is this really the state of Nemo?
>>
>>101592100
There's no claim there that noise improves model outputs, although some time back there have been suggestions that adding noise to embeddings during training may reduce overfitting: https://arxiv.org/abs/2310.05914
>>
Where will AI be in 10 years?
>>
I wonder if those preferring Gemma all happen to be ESL and perhaps Gemma deciphers ESL better as a result of diversity training, just a thought.
>>
/aicg/bro here. Quick question. Who is the "Gojo" of /lmg/? (shitpost bogeyman schizo)
>>
>>101592161
petra/petrus
>>
>>101592163
thanks i just was bored in our general since we're in a bad doom, ill check the archives. have fun with your chatboots
>>
>>101592161
Isn't your entire general like that?
>>
>>101592153
If your billion dollar ai can't decipher ESL then what's the point?
>>
Anon whose KCPP guessed too many layers: can you share your GPU VRAM, model(s) (including image gen models if used), blasbatchsize, and the amount of context you were trying to use?

It has multiple things in place to prevent that from happening, so if it still under-guessed on your system I want to be able to reproduce the setup, because that would imply you somehow broke through the entire 1.5GB buffer zone we put in place as a safeguard.

Either you have a ton of background stuff running, or you're using a model that is way more VRAM hungry in unexpected ways than the stuff I tested with.

To clarify: in the current version, the auto layer guessing is only accurate for default settings. If you modify blasbatchsize, for example, that is not yet accounted for.
>>
Hi all, Drummer here...

>>101592180
HENKYYYY PENGKYYY!!!
>>
>>101592180
What are you doing here? You're too innocent for this website! :koboldpeek:
>>
>>101592180
Kekaroo, your dox got posted earlier faggot
>>
my hero just spoke in /lmg/. AMA.
>>
>>101591786
I can't make it stop either on one specific card I'm doing where it's an adventure/story rather than a one-on-one chat. IDK if this makes it harder but it probably doesn't make it easier. I put in the system prompt to write for every character except {{user}} and put in the jailbreak / depth 0 author's note never to speak for {{user}}. May have helped but didn't totally solve it. Possibly also made more difficult because I am simultaneously trying to make it stop ending replies by asking what my next action is, which I was able to reduce significantly but not eliminate. Partway through I tried cranking the temperature way down and that absolutely didn't fix the issue. Maybe if I tried again with my prompts setup better it would. Nothing solved it completely but right now the level of swiping / editing is low enough that I'm okay with things.
>>
>>101592274
>I can't make it stop either on one specific card I'm doing where it's an adventure/story rather than a one-on-one chat.

Which isn't to say I *have* been able to get it to stop on other cards, just that I've only been working on this one.
>>
>>101592180
Keep up the great work, Henky!

Tell your assistant, Concedo, he did a good job too. :koboldlaugh:
>>
>>101592247
Ooooh, someone's being an edgy boy. :koboldpeek:

You think you're so tough spouting that *f-word* behind the screen, huh?
>>
>>101592153
I sometimes think if I was ESL I'd like LLMs a lot more. Like if I'm reading a foreign language I can't tell if the writing is good or bad. I can just (at most) tell what information it says. And if the same expressions get used over and over I'm not annoyed, I'm pleased to see familiar expressions.
>>
>>101592040
Suddenly Lumimaid makes a lot more sense.
>>
>>101592323
I am an ESL. That is not how it works.
>>
>>101591917
An 8.0bpw exl2 of Mistral NeMo 12B with cache_mode q8 and 32000 tokens of context fits in 15.2 GB of VRAM.
>>
>>101589160
t=1.0
>>
Is it better to have 2x 3090 or 1x 3090 + 2x P40 if I'm trying to run 70b models faster?
>>
>>101592475
2x 90
>>
>>101592475
3x 3090 if you can but 4x 3090 would be even better
>>
>>101592040
I mean I knew he was belgian, but didn't know it was that bad.
>>
>>101592348
Don't lie I bet it's even stronger for u foreign cunts because your languages have like 1/5 as many words as English. Repetition is a way of life for you, while for English speakers developing a sense for how often to re-use the same word is a major early part of developing good writing style. Small children are very repetitive, older ones go too far trying to add variety, then they tone it down and get better. (Or sometimes not. There are published authors who go to unintentionally humorous lengths to avoid re-using basic words like "said.")
>>
>>101592040
kek
>>
>>101592546
>doesn't speak any foreign language
>don't lie to me, i bet-ack
>>
>>101592338
>>101592506
Now I see why he never tests his own shit. Even if it was broken how could he tell?
>>
>>101592564
Knew you were the kek poster.
>>
File: file.png (69 KB, 349x642)
>>101592546
>>
File: stfu.png (21 KB, 509x217)
>>101592546
>>
>>101589653
I have never run into this problem myself but I suspect it's a driver issue.

>>101590419
With a few hundred bucks you can buy 512 GiB RAM which is enough to run it at 8.5 bits per weight.
But then you can expect something like 0.2-0.5 t/s.

>>101590774
>>101590781
>>101590786
>>101590804
The problem with the proposed parallelization scheme is the synchronization overhead.
You need to exchange (part of) the activations between GPUs and write back the results which introduces non-negligible latency, especially on fast GPUs without NVLink.
This is not much different from what --split-mode row already does and there are considerable performance issues (though the multi GPU optimization is also poor).

>But what if, you take a MOE model, for example DeepSeekV2 236B, and split the different smaller experts across the gpus, so that they don't have to exchange information. Is this thinking flawed?
Which experts are selected is effectively random and determined by the routing layer if I remember correctly.
But in order to do that the results have to first be collected on a single GPU.
So you're not really saving any I/O.
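To illustrate the routing point, a toy top-2 router (shapes and names made up): every token can land on any pair of experts, so a static "experts 0-3 on GPU 0, 4-7 on GPU 1" split still has to gather and scatter activations across GPUs on every layer.
[code]
# Which experts fire is decided per token at runtime by the router.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts, top_k = 8, 16, 8, 2

x = rng.standard_normal((n_tokens, d_model))       # token activations
W_router = rng.standard_normal((d_model, n_experts))

logits = x @ W_router                               # (n_tokens, n_experts)
chosen = np.argsort(logits, axis=-1)[:, -top_k:]    # top-2 experts per token
for t, experts in enumerate(chosen):
    print(f"token {t}: experts {sorted(experts.tolist())}")
# The pairs are effectively arbitrary, so tokens routed to "remote" experts
# must be shipped to the other GPU and the results shipped back every layer.
[/code]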

>>101592475
2x 3090 if your target quant fits into 48 GiB VRAM, 1x 3090 + 2x P40 otherwise.
>>
File: 1718298816889142.jpg (2.53 MB, 3108x1691)
Mistral Large 2 is now my main model for cooms.
No more mischevious glints, she says in a husky voice, a smirk playing on her lips, eyes sparkling with mischief. There's a playful glint as she addresses the power dynamic, playfully smirking as she offers her ministrations. An audible pop and rivulets of—admit it, pet—the ball is in your court.

It has none of that slop and even as a 48GB VRAMlet using a baby 2.75BPW exl2, it can fit 12k context @15t/s.
>>
>>101592681
lock em in a hot room and sell me the fumes
>>
>>101592496
Pretty much this. Although I'm starting to feel like a VRAMlet with 4.
>>
File: 1717392494482029.jpg (42 KB, 680x671)
>4x 3090s is now considered "VRAMlet"
>as if 1 wasn't pricey enough
no i will not dump retarded amounts of money onto a single-purpose machine i'd only use sparingly even if the models are appealing
>>
>>101591941
Couldn't it be put in context template?
>>
>>101592681
LL and 3L tag teaming S
>>
>>101592871
Also... isn't that the point of the "System same as user" option in ST? So you can fill in the system prompt and it treats it as a user message as well?
>>
>>101592870
I mean people spend more money on dumber hobbies. It really depends on how far you want to go. I started out running 4-bit pygmalion 6B on a Ryzen 2400G with 8 gigs of RAM and no GPU before there was really any integration with anything so I was basically using the 'chat mode' in the console. Then someone introduced me to koboldcpp so I was running Llama 13B models on my gaming PC with a 1660 Super and 16 gigs of system ram.
I didn't just up and drop 5 grand on building a server out of the blue. It was a gradual progression.
>>
>>101592870
The more you buy the more you save
>>
https://github.com/ggerganov/llama.cpp/pull/8676

Llama 3.1 rope scaling finally merged
>>
The fix for L3.1's issues with context beyond 8192 has been merged into the llama.cpp master branch; it should be working properly now.
https://github.com/ggerganov/llama.cpp/commit/b5e95468b1676e1e5c9d80d1eeeb26f542a38f42

>>101592681
Its not brain damaged at 2.75 bpw?
>>
>>101592904
The more you buy the more seeing shivers down the spine hurts.
>>
>>101592681
Is it better than a 5bpw 70B? How much better?
It's tempting to sell my 3060 and buy a second 3090
>>101593061
lmao so true
>>
>>101589756
>>101590284
Calm down with the shilling.
>>
File: 1709992939780627.jpg (347 KB, 2250x1651)
My model ratings from recent tests for RP, run on 48gb vram

1 - Mistral Large (Mistral-Large-Instruct-2407-123B-exl2, 3.0 quant). Just very good at natural language

2 - Midnight Miqu - it's a slopmerge for RP and does its job

3 - Llama 3.1 (4.5 quant) - It clearly wasn't designed to be a chatbot; replies are accurate but very robotic. It beat Mistral Large on knowledge checks and coding though

4 - Nemo 12b, I don't know why this was even recommended to compete with the others

waste of time - commandr
>>
>>101592161
mikushitters and some guy named "petra"
>>
I think this is the best place to ask about it: is there a way/program to make an LLM identify and tag several (thousand) images? It doesn't have to be anything advanced, just tagging whatever it sees would already be a great help.
>>
>>101593186
Yeah, I'm pretty sure moondream 2 (a small and good model) has a Python script implementation; just make a loop and iterate over the folder you want to classify.
>>
>>101593186
the ponyfucker said he did some LLaVA work feeding it booru tags and asking it to describe the image to get a caption.
He is kinda a retarded schizo and it isn't clear that was a better way of training than just using booru tags though
>>
>>101593206
https://huggingface.co/vikhyatk/moondream2
here's the repository, the script is there
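If you'd rather roll your own loop than use the bundled script, the folder iteration itself is trivial; caption_image() below is just a placeholder for whatever model call you end up using (e.g. moondream2 as described on its model card):
[code]
# Walk a folder of images and dump {filename: caption/tags} to a JSON file.
import json
from pathlib import Path
from PIL import Image

def caption_image(img: Image.Image) -> str:
    # Placeholder: swap in your actual VLM call here.
    return "UNTAGGED"

tags = {}
for path in sorted(Path("images").iterdir()):
    if path.suffix.lower() not in {".jpg", ".jpeg", ".png", ".webp"}:
        continue
    try:
        with Image.open(path) as img:
            tags[path.name] = caption_image(img.convert("RGB"))
    except OSError as e:
        print(f"skipping {path.name}: {e}")

Path("tags.json").write_text(json.dumps(tags, indent=2))
[/code]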
>>
>>101592986
No. The only errors it does it a misplaced punctuation point once every 500 tokens or so, which is not much to complain about.

>>101593085
Despite my limited experience, I would say yes. Before Largestral, I would use Llama 3 70B finetunes for coom (New Dawn, Euryale). They were good, but had too much slop. With Largestral, no more spine shivers or any other GPT/Claudeisms. It's like I cured my model of its autism.
>>
>>101592964
>>101592986
Again some problem with llama.cpp tokenizer. Sane people should use transformers tokenizer.
>>
>>101593268
that literally has nothing to do with tokenization at all, it's about rope context scaling
>>
>>101593153
>waste of time - commandr
Stopped reading right there
>>
File: F-Gr7rLacAALRMV.jfif.jpg (245 KB, 2048x1937)
>>101593292
at the bottom of the message? Fucking retard
>>
I still haven't found good settings for Nemo. I don't like how moldable it is, or rather how super-focused it is on context patterns instead of instructions. For example, a different model (like Llama 3) will naturally give you lengthy responses (unless you tell it not to), no matter how long your messages are. Nemo, however, will mimic your responses, and if you aren't putting much text in your messages, it won't either.
>>
>>101592383
that's an extremely specific answer, thanks a ton
>>
>>101593219
>>101593206
Thank you, I'll take a look into it.
>>101593213
A shame how people tend to gatekeep these small things, I don't really blame him though, it's his work I suppose.
>>
>>101593303
he's mistral nemo please understand, they put their system prompts at the bottom
>>
>>101589265
I remember in December 2022 doomers saying local gpt 3 (DaVinci) was “maybe 10 years away”. I always knew these things were bloated as fuck.
>>
doomer here, i'm going to make a prediction and say that agi is maybe 100 years away. 1000 years for coomable agi that fits into 10gb vram.
>>
>>101593153
>Nemo 12b, I don't know why this was even recommended
Because of the allure of huge context length that was previously out of reach for people without much VRAM.
>to compete with the others
Assume people saying that were trolling or retarded.
>>
>>101593374
Summer Dragon still hasn't been surpassed though so...
>>
>>101593392
Back then 175B seemed impossibly huge. I can't believe I'm running models close to that size on a simple $3k rig at home now
>>
Is it just me or does Llama.cpp take longer to compile than it did a few weeks/months ago?
>>
Okay, so... the base Mistral Nemo model is much better at larger context sizes; the difference in understanding is massive. What causes this?
>>
>>101593463
What are you saying? You're getting better results with base than instruct with large chat histories?
>>
what does flash attention do?
>>
>>101593547
https://arxiv.org/abs/2205.14135
>>
>>101593452
It does now take longer with CUDA, make sure you instruct the build system to run multiple jobs in parallel, for example with. -j 8

>>101593547
Calculate a temporary matrix in small parts in fast but small memory instead of calculating and writing the entire matrix to large but slow memory.
This requires more calculations but on modern hardware the speed of calculations has been increasing much more than the speed of memory.
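A toy single-query version of that idea in numpy (not the actual kernel, just the online-softmax bookkeeping that makes the tiling possible):
[code]
# Process K/V in blocks, keeping a running max and running sum so the full
# attention score matrix never has to be materialized.
import numpy as np

def streaming_attention_row(q, K, V, block=128):
    """Attention output for a single query vector q against K and V."""
    d = q.shape[-1]
    m = -np.inf                   # running max of scores (numerical stability)
    l = 0.0                       # running softmax denominator
    acc = np.zeros(V.shape[-1])   # running weighted sum of V rows
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Kb @ q / np.sqrt(d)   # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new) # rescale previously accumulated results
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / l
# Matches naive softmax(K @ q / sqrt(d)) applied to V, up to float error.
[/code]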
>>
>>101593513
Yeah. At larger contexts, instruct becomes dumb for me, skipping over events and getting completely lost in the plot, while the base model doesn't seem to have the same problem.
>>
>>101593452
It's super annoying, I used to rebuild it everyday before using it, now only do it every other weeks or if I need compatibility with a new model.
>>
>>101593463
You tested the base model? That's interesting.
I suspect >>101399248.
People's multiturn fine tuning data are constructed naively.
>>
File: 1707049543626270.webm (2.81 MB, 720x1280)
Largestral 2 is basically a non-dry and 10-15% smarter version of Wizard 2 8x22

At this point, there is no scenario that i test for that doesn't work very well with the model

Outside of external tool use and multimodality, is there anything else that a new model can really give when it comes to RP?

I don't think so, only speed.
>>
>>101593677
my brain looks like that (i use crack)
>>
>>101593677
What quants do you run of both models?
>>
I'm still using C-R+. Nothing has changed.
>>
>>101593699
q4
>>
>>101593690
based expert roleplayer
>>
Is it possible to use nemo 12b on koboldcpp? Docs say GGUF only, but has someone already converted it?
>>
>>101592087
He has a point in that having those tensors at a higher precision than the rest of the model makes the output better, yes, but that's something that most (all?) quants already do.
The whole meme began when he claimed that having those layers at full precision gave better results than having them at q8 or whatever, which was demonstrably false.
His whole "testing" was all vibes based and non-reproducible.
>>
>>101593836
https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF
https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF
https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF
https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF
https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF
>>
>>101593865
thx anon
>>
>>101593836
not really, you gotta either use a fork of koboldcpp or wait for the retard to implement the tekken token bs
>>101593865
nigger
>>
>>101592180
*cums on you*
>>
>>101593939
>you gotta either use a fork of koboldcpp or wait for the retard to implement the tekken token bs
>2 days ago
>https://github.com/LostRuins/koboldcpp/releases/tag/v1.71
>Merged fixes and improvements from upstream, including Mistral Nemo support.
You might be a little behind.
I don't blame you; I've been using llama-server directly for months now, there's no reason to use kcpp really, so I get it.
>>
>>101593939
>not really, you gotta either use a fork of koboldcpp or wait for the retard to implement the tekken token bs
are you mentally deficient?
>Merged fixes and improvements from upstream, including Mistral Nemo support.
https://github.com/LostRuins/koboldcpp/releases/tag/v1.71
>>
>>101593677
What's crazy about AI videos is that within the bizarre surrealistic nonsense each moment is still copacetic with the previous moment and the next moment. Truly nightmare fuel.
>>
idc dont use koboldcpp
>>
Just tested out 3.1 70B at IQ3_M (on latest llamacpp build). It's a bit faster than Largestral was at IQ2_M. Also does OK at the trivia question I threw at it, but it doesn't seem to be able to do the Castlevania question unlike full precision. Maybe if I go just a bit higher in quant.
>>
>>101594001
>I was just prentending to be tarded
>>
>>101593986
>there's no reason to use kcpp really, so I get it.
Actually, just to correct myself, there is one reason.
They still have support for multi-modal, I believe, whereas upstream nuked it pending a refactor.

>>101594013
How charitable to assume he was just pretending.
>>
>>101593725
Same but C-R



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.