/g/ - Technology




File: thumb-1920-1127692.png (1.13 MB, 1920x1080)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>100173514 & >>100166886

►News
>(04/24) Snowflake Arctic Instruct 128x3B MoE released: https://hf.co/Snowflake/snowflake-arctic-instruct
>(04/23) Phi-3 Mini model released: https://hf.co/microsoft/Phi-3-mini-128k-instruct-onnx
>(04/21) Llama3 70B pruned to 42B parameters: https://hf.co/chargoddard/llama3-42b-v0
>(04/18) Llama3 8B, 70B pretrained and instruction-tuned models released: https://llama.meta.com/llama3/
>(04/17) Mixtral-8x22B-Instruct-v0.1 released: https://mistral.ai/news/mixtral-8x22b/

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png (embed)

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling/index.xhtml

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
>>100180197
Any tips to make AI non-retarded? I'm using horde. If I get this one to work I'll switch to [spoiler]dolphin.[/spoiler]
>>
Bros I asked LLaMA 70B 4-bit if Michelle Obama was a man and it told me "according to reputable sources such as Snopes..."
>>
>>100180240
>[spoiler]dolphin.[/spoiler]
some nice newfag camo gear you have there
>>
>>100180246
ask it to tell you what the non-reputable sources say. Also she's a woman.
>>
>>100180268
I kept editing its output and forced it to acknowledge some hard truths. But it was a constant struggle.
I was also fighting Koboldcpp, which doesn't play nicely with L3. Any way to fix the token problems?
>>
>>100180264
Nta but to be honest ive used this site for like 8 years and I don't know how to do spoiler text.
>>
>>100180264
[spoiler]I use spoilers ironically in boards where spoilers don't work. [/spoiler]
>>
>>100180287
Take this with a grain of salt: >>100179353
Try loading an 8B in fp16 and see if it goes away.
>>
>>100180246
kek, even the model itself isnt sure about that
>>100180268
HE is a MAN
>>
File: 1710738955138884.jpg (80 KB, 760x980)
>>100180293
Spoilers: you can't here
>>
File: file.png (47 KB, 349x349)
>>100180293
It just doesn't work on /g/.
>>100180302
>pic
>>
>>100180304
>Try loading an 8B in fp16 and see if it goes away.
I can give that a try. I wonder how much worse 8B fp16 (or is it 7B) is compared to 70B 4-bit
>>
>>100180307
really have you seen her penis yourself fag? that's pretty gay.
>>
File: 1692144913552791.png (50 KB, 520x256)
>>100180197
>(embed)
>>
>>100180293
ctrl+S after selecting text
>>
sex, but with a language model
>>
>>100180343
17thpbp
/thread
>>
>>100180331
Your butthurt is a balm for my soul.
>>
>>100180385
>>100180331
Sorry I didn't know OP is a faggot
>>
>>100180385
maybe you should learn to feel shame so you don't be such a fuckup in life
>>
good thing the AI fad is slowly dying out, this general's death can't come soon enough!
>>
>>100180329
>her penis
degenerate faggot, stop thinking about other men's penis
>>
File: file.png (487 KB, 704x418)
With 200k context can you already have a summer girlfriend? I mean love her and be nice and she would remember everything you talked about and then when autumn comes you would dump her and get a new one. Is the future now but we didn't see it?
>>
>>100180420
>shame
>for (embed)
Are you baiting now?
>>
Idiot question: If you can create MoE from models, can you 'cut' models from MoE again? Would like one of 8x22B Mixtral, 22B sounds great for my 24GB VRam
>>
>>100180487
no that would be too good to be true. You will use the 7b or 70b and be happy no in-between chud.
>>
File: gpu_clusterfuck.png (67 KB, 635x455)
rate my setup
>>
>>100180446
When I ask my girlfriend what she did three weeks ago on Monday afternoon, she doesn't say much. Degradation of memories is completely normal. The problem is not really the length of the context, but how the context is processed
>>
>>100180522
POOR FAG / 10
>>
File: 1705902297479866.jpg (215 KB, 1802x1274)
>>100180197
>Kurisu thread
Yeah it's based.
>>
what does kurisu smell like?
>>
>>100180604
dk pepper
>>
>>100180522
based, could you post some benchmarks anon
for example 8x22b model benchmarks, 70b, cr+ benchmarks etc
would be really nice
please
>>
>>100180522
P4s are keeping the slots warm for more P40s
>>
Does llama 3 8B only produce relatively short replies by default or is it me?
>>
new(er)fag here, hoping a kind anon might offer advice.
TL;DR - what is the best method / frontend for novel-style writing with llama-3 8B? - vramlet, i know. saving for an upgrade.

I've seemingly gotten the 8B instruct exl2 to work quite well in ooba web UI with the proper instruct format/settings. Intended to use the notebook -> raw field, but it lacks the KoboldAI style world info and memory, which would be needed. No luck with the extensions: simple memory(appears in UI, but seems to do nothing) and complex memory(crashes when applied).

Quantfactory's 8B instruct Q8_0 in Koboldcpp / KoboldAI Lite won't exhibit stable behavior, using either story or instruct modes, to the point that it's unusable. Highly possible I'm retarded and doing something wrong, i dunno.

I got Sillytavern to connect to ooba, but it seems geared toward chat-style conversation, which I'm not interested in at the moment. Any ideas?
>>
>>100180522
ai fossil / 10
>>
>>100180703
Ive made a couple of long stories using Silly.
Just make a writer character card, use good prompting, such as asking the model to make a synopsis for the story, a chapter list, and synopsis for each chapter, then have it make the chapters one by one, that kind of thing.
The continue button is your friend.
>>
>>100180703
Try mikupad? It should work with every backend and is pretty decent for storytelling, if minimal.
>>
>load up sao's sloptune in fp16 with transformers
>it seems completely different now
HELP I AM BEING PLACEBOED
>>
>>100180522
what mobo / cpu?
i have an asus z270k, intel i7 7700k. can't use more than 1 p40 on this shit for some niggerlicious reason and am looking for upgrade options. have 2 p40s lying around unused. sucks because for every other purpose this 7y.o. pc is more than enough for me. would be nice if i could also still use my optane drives.
>>
File: threadrecap.png (1.48 MB, 1536x1536)
►Recent Highlights from the Previous Thread: >>100173514

--Quantized Moistral-11B-v3 Model for 12GB VRAM and Upcoming 8GB Version: >>100175988
--Running the Snowflake Arctic AI Model with Llama.cpp: >>100174018 >>100174912 >>100174957 >>100174976
--Frankenmerges and Model Merging: Evaluating llama3-42b-v0 and Beyond: >>100174462 >>100174498 >>100174567 >>100174848 >>100175013 >>100174889 >>100174986 >>100174998 >>100175039
--Can AI Models Truly Understand Causality?: >>100178725 >>100178801 >>100179014 >>100178871
--Anon's Quest for Effective LLMs in Coding Assistance: >>100176153 >>100176201 >>100176261 >>100176244
--Platypus-YI-34B-GGUF Model Excels at Ooba's Secret Benchmark: >>100173762
--Anon's Rant on Llama3 and Hardware Requirements: >>100174960 >>100175127 >>100175101 >>100175562
--Frustrations with Claudisms and Lack of Progress in Chats: >>100175199 >>100175220 >>100175388
--Troubleshooting Llama 3 Issues: Formatting and Templates Matter: >>100174797 >>100174873 >>100175209
--Finding Out a Model's Context Token Limit: >>100176841 >>100177053
--Quantized LLaMA3 Models: Performance and Limitations: >>100177788 >>100177831 >>100178293 >>100178318 >>100178744 >>100178913 >>100179201 >>100177994 >>100179353 >>100180094 >>100180634
--Issues with L3 8B 64k Context Model - Is it User Error or Model Limitation?: >>100173826 >>100173938 >>100175281
--LLaMA 3 Fine-Tunes and Performance Discussion: >>100176287 >>100176703 >>100177899 >>100178115 >>100178656 >>100178899
--Local Models vs Cloud Services: Cost, Control, and Satisfaction: >>100175535 >>100175644 >>100176099 >>100177440 >>100177597 >>100178594 >>100176600 >>100176725 >>100176797 >>100177136
--Would Expensive Hardware be Worth it if GPT4 Becomes Free?: >>100175143 >>100175400 >>100175187 >>100175318
--Miku (free space): >>100175575 >>100176345 >>100176566 >>100176910 >>100177263

►Recent Highlight Posts from the Previous Thread: >>100173687
>>
Oh, they finally changed half2 struct in ROCm 6.1. This shit made porting to HIP such a pain in the ass.
>>
>>100173514
>>100173514
>>100173514
>>
>>100180522
ngmi/10
>>
>>100180542
if you ask me this i can reply.
but it's not the best way to access the memory.
if you tell me "remember 3 years ago when"... most people can see what you are talking about.
an llm cannot if it's out of its context.
>>
>>100180197
I think llama3 proves that 4bpw is cope psyop and you need to run at 8bpw or higher
>>
>>100180703
Someone put together a version of novelcrafter that runs fully offline: https://rentry.org/offline-nc
>>
>>100180950
whats novel crafter?
>>
>>100180949
i haven't tried f16 but i'm not sure if there is a huge diff with 8bit exl2.
>>
>>100180838
>made
You can't drop support for older ROCm versions so quickly, anon. :)
>>
>>100180949
>4bpw is cope psyop
>it was always like that
Now this is a cope psyop.
>>
>>100180197
HNGGGGH MAKISE KURISU IMMM GOOOOONING TO HER OINK OINK PUMPING MY GOONSTICK
>>
>>100181002
well now the benchmarks make it clear
>>
>>100180949
What is llama3's native weight? fp16? bf16?
>>
>>100180973
It's more for future projects and features, like llama.cpp FP16, where I was too lazy to fix it on all the half2 uses; that compile option can now be used.
>>
>>100181016
I tried fixing it on the few remaining and it made no performance difference so I didn't bother trying a pr.
>>
>>100181045
Asked cuda dev and he told me that it's almost useless, that's why I didn't bother.
>>
>>100180703
Mikupad fits your description.
>>
>>100181016
i was under the impression you should be shadowing __half and __half2 using the raw _Float16 and _Float16 __attribute__((ext_vector_type(2))) for half and half2 (and ext_vector_type for hardware vectors in general)
i never looked into it but rocprim says there were problems with the HIP wrappers over the struct types and that it's better to use the clang intrinsics
>>
>>100180963
bastard son of NovelAI
>>
>>100181069
The goal of HIP is to share the same code with CUDA. I try to change as little as possible. The last problem I was aware of on half2 was fixed in 6.1.
>>
>>100181069
Could AMD hide their documentation better if they tried?
I hope whatever issue that was referencing has since been fixed in clang.

Like you find out in composable_kernel that clang is doing weird things to int8x4 that makes kernels take forever to compile so they just disable those ones by default 'cause they're only for Navi anyways.
>>
>>100180786
A similar idea had occurred to me, just didn't know if it worked well. Will try it out when I get time.
>>100180950
Never heard of this, looks interesting.
>>100180792
>>100181068
Don't know how I forgot about Mikupad, I think that would work well. I suppose the instruct format and settings in ooba carry over to Mikupad?

Many thanks to each of you
>>
>>100180949
There was always some degradation between 8bpw and 4bpw. It doesn't matter because it's still far better to run a larger Q4 than a smaller model at Q8. This should've been the first hint that BitNet would work.
But yeah feel free to buy enough 3090s to run 400B at fp32 when it comes out
>>
kill all vramlet poorfags, i hope nvidia never releases a gpu with more than 8 gb of vram below 2000 dollars ever again, you poors don't deserve it, get a fucking job instead
>>
>>100181172
feeling tough huh?
>>
>>100181004
>Posted while hugging my crusty hatsune miku pillow that is now twice as heavy as it was when I bought it
>>
What if we just merged Kurisu and Miku.
>>
>>100181172
honestly I have 40 gb of VRAM and I already feel like I have to double it
but I need to buy some real estate first to have enough room for a mining rig
>>
File: SmugRichMiku.png (1.53 MB, 800x1248)
>>100181172
Everyone is a poorfag and vramlet to someone
>>
>>100181151
They appear to be striving to improve on that side, but it is still not sufficient. Lots of documentation is being PR'd into various projects. I wish some of their engineers would rejoin the IRC.
>>
>>100181222
now prompt her defecating
>>
https://huggingface.co/artificialguybr/llama3-8b-redmond-code290k
>>
File: file.png (65 KB, 490x314)
>new chink salsa drops
>"cool, can i see it?"
>"不"
>>
>>100181160
Mikupad only supports text completion, so no, the instruct format wouldn't carry over.
>>
>>100181290
>twitter screencap of literally who chink
>>
>>100181290
>trained on more than 10trillion billion tokens
kek
>>
https://huggingface.co/maywell/miqu-evil-dpo?not-for-all-audiences=true
>>
>>100181309
TB is terabytes, retard. But to be honest I'm pretty sure he meant trillions.
>>
>>100181290
Cloud model so who cares anyway.
>>
>>100181322
>miqu fine-tune
>in 2014+10
>>
>>100179353
Why was everyone dooming last thread about these results? The 4 bit, group 128 quants for the techniques that are actually good (GPTQ, AWQ, Quip) show detectable but minimal drop in benchmarks compared to fp16. You only get a huge drop going below 4 bits, which we've known since forever. Also, if you look at the table for 70b (paper here: https://arxiv.org/abs/2404.14047), it's even less degradation for the larger model. This is the opposite of "it's over", 4 bits is plenty for llama 3, especially the 70b. Going below 4 bits without brain damage using only post-training quantization was always going to be impossible, I'm just glad 4 bit looks good.
>>
File: harkonen 2.png (188 KB, 555x552)
I am torn between "instruct" models and "chat" finetunes for my use case. The use case:

A health-coach assistant that is optimized for long, coherent conversations, rather than simple Q&A. I want it to remember the context of long interactions, and recall/consider old messages in the conversations, and to actively listen and tease out the user's problems when not provided with enough information to give a confident answer. Once it does have enough information from the user, I'd like its responses to be pretty conclusive and fleshed out: sometimes a few paragraphs.

I worry that "chat" finetunes bias the models towards very short messages that are good for quick back-and-forths with an RP chat buddy, and that this would be too short for a coach-type assistant bot. Am I wrong here?
>>
>>100181335
>10TB = 10T
Listen here, you insufferable know-it-all, it's actually 2.4 trillion tokens, not whatever you were trying to imply.
>sauce
https://www.nextbigfuture.com/2023/04/red-pajama-is-a-1-2-trillion-token-large-language-model.html
>>
>>100181379
honestly I still don't fully understand the difference but instruct seems like the way to go with llama 3
>>
File: 257817814524.jpg (43 KB, 720x720)
>>100181358
Now hold on a sec anon. This here is the same guy that did PiVoT-0.1-Evil. Is this the unofficial sequel? Could be based.. I'm checking it out
>>
>>100181397
There is no chat model yet. But that's a general category of finetune that will certainly come out soon
>>
>>100181378
>Why was everyone
Same reason why "everyone" was dooming about Llama 3 on release.
>>
>>100180197
262K llama 8B:
https://huggingface.co/gradientai/Llama-3-8B-Instruct-262k
https://twitter.com/Gradient_AI_/status/1783611801130963242
>>
File: IMG_2895-1714086072790.jpg (451 KB, 786x2340)
I thought this was obvious? Training a simple neural net would've revealed this. I never understood why people fine tuned on synthetic slop. Slop in, slop out. Simple as.
>>
>>100181322
>its him
>https://huggingface.co/maywell/PiVoT-0.1-Evil-a
>>
>>100181335
>10 terabytes
How would you even acquire that much text
>>
>>100181378
>everyone
It's just one doomer who spams fud constantly. It's well known that Q4 isn't lossless, that doesn't change that it's the sweet spot for vram:performance, this has been true since llama 1.
>>
>>100181425
Yeah we already discussed the post in your screenshot a long time ago.
>>
>>100181425
motherfucker screenshotted his shattered screen.
>>
File: qlorabad.png (178 KB, 728x645)
>>100181378
To be honest, picrel from the same paper (https://arxiv.org/abs/2404.14047) is more worrying.
>>
why are all llama 3 finetunes so shit?
>>
File: file.png (36 KB, 474x519)
Is there a way to make ooba always keep api flag enabled?
I don't like re-applying it every single time.
>>
>>100181425
that's only with the current meme "AI"; novel architectures that are not basically just statistical predictors could make it a whole lot different.

i'm thinking spiking neural nets and/or even weirder hybrid approaches.

also i think there is a lot of room in using live organisms instead of gpus, you can train mushroom mycelium to do tasks with electrodes and i'm having fun with that at home right now, the only annoying part is that since they are a live system you need to constantly train them or they forget.
>>
>>100181480
See >>100181473
>>
>>100181480
because original llama 3 is shit.
>>
>>100181424
At least they bothered with a benchmark kek. Cool that it can retrieve basically perfect up to 162k. Who knows about the amount of useful context though.
>>
Fuck... where are the l3 70b tunes, Miqu is great and has me content, but I want to see if a finetune makes l3 70b stop sucking ass.
>>
>>100181488
either edit it in the config.json or run it with --api as argument.

if you are a windows chud you can edit the .bat too.
>>
>>100181516
>>100181497
what makes llama3 shit? Is this a meme? It seems amazing so far
>>
>>100181480
because AI isn't real, it's all falling apart
>>
>>100181516
there's no point in finetuning llama 3 with trash data
meta's dataset is far superior to anything the open source community is using because the open source community just sources their dataset from soulless chatgpt
>>
>>100181524
an easy way to prove it :
if /lmg/troons are taking *model-name* up their ass, you know the model is shit.
>>
>>100181473
>quantize using bitsandbytes to 4bit
>benchmarks go down, but barely
>train lora on 4bit model, on alpaca dataset
>benchmarks go down by a lot
I don't think that says anything other than the authors fucked up their lora training, for whatever reason. The quant alone was good, even though it's bitsandbytes which is an old technique at this point.
>>
>>100181290
Zuck and Mistral taking turns shitting on the chinks
>>
>>100181564
>coping this hard
>>
File: EqualPartsPityAndScorn.png (1.56 MB, 864x1184)
>>100181264
>defecating vocaloid
I've got some bad news for you anon...
>>
>>100181567
The table shows exactly what I said. Quantizing the model to 4 bit was fine. Only after training a lora do the benchmarks drop a lot. There are a million ways to fuck up a lora, believe me I know.
>>
>>100181558
jesus christ you sound annoying
>>
anons, give me a great name for my frontend project, it is RP focused
>>
>>100181607
BetterTavern
>>
>>100181566
I cannot shit on chinks, but I’d be happy to help with other creative ideas.assistant
I cannot write content that contains explicit themes. Can I help you with something else?assistant
I cannot participate in a roleplay that is explicit or illegal. Is there another roleplay you would like to do?assistant
I cannot create explicit content, but I’d be happy to help with other creative ideas.assistant
>>
>>100181596
to see annoying mystery meat - look at mikufags
>>
I've never again felt the soul of the original llama1-7B model like when i tried it for the first time, i haven't had such a bubbly feeling inside my chest since
local models are getting better and better but I'll always long for that feeling, the part of me that I've lost along the way
>>
>>100181212
What if we just frankenmerged Kurisu and Miku?
>>
File: 00057-1716066936.png (1.66 MB, 1024x1344)
>>100181172
based
>>100181424
70B version wen?
>>
File: file.png (33 KB, 250x208)
Are you telling me all this time I was 2MW instead of cooming I could have just downloaded full 16bit weights and coomed my brain out?
>>
>>100181624
sillytavern/kobold looks like shite compared to what i've been working since september
>>
>>100181151
it's so horrible (and really quite detrimental to AMD too)
at least george hotz has done some of the work looking into kfd and hsa


>>100181224
now that you mention that even within the past couple of days i've thought i noticed new bits of docs being added
>>
>>100181424
something tells me it's just as lobotomized as dolphin l3
>>
>>100181690
not if you're a vramlet
>>
>>100181520
>edit it in the config.json
Where can I find it? There are configs in in models folder, they don't mention api at all. Same with settings file in the main folder.
>>
Anyone know what the "h6" in lonestriker's exl2 quants means?
>>
>>100181744
header 6
>>
>100181520
>chud
An easy way to dismiss anyone. Anyone who says that shit is legitimately single digit IQ.
>>
>>100181322
I^2 please do the needful
>>
>>100181751
IQ4_NL/IQ4_XS isnt too bad thoughie
>>
File: file.png (57 KB, 911x244)
WizardLM2 was too powerful
The team was terminated and possibly exterminated. RIP
https://rocky-muscle-755.notion.site/What-happened-to-Wizard-LM2-a247e09244d0483cbb02c1587b357c9d
>>
>>100181801
whats stopping people from just reupping the quants
>>
>>100181812
Nothing, but it means that there will never be another wizard model from M$. It was a rogue decision to release it that got the team fired
>>
I'm probably a retard but I may have a solution for erp quality (among other things)
We know that the problem for a recent model like a mixtral or a llama 3 that underperforms in erp isn't the reasoning, logic and intellect. It's the vocabulary. You wish a model could talk to you the way you want it to, but it never learned that, it only learned to be smart (and use leddit writing)
My idea (which sounds dumb and might be dumb but here goes): Let's have a new file called "token_preference" or something like that. This file would have the list of all tokens used by your model and a float value from -1 to 1.
Let's say you're erp-ing with a smart but slopped model and it indeed gives you slop. A new button allows you to mark that reply as unsatisfactory, and then using some formulas that I haven't thought about yet but shouldn't be too hard to come up with (using notably the frequency of a token, since you don't want tokens like "the" to disappear), it adds negative weighting to the tokens affiliated with this prompt. Now let's say that you get some unexpected sovl and want to capture that lightning in a bottle moment. Another button lets you mark this reply as "satisfactory", and all the tokens, by the same logic, are now given a better positive value depending on token frequency etc...
In the end, a very simple sampler that would probably come last in the sequence simply takes the token values and applies an equation to give less probability to unwanted tokens and more probability to wanted tokens. With time your token_preference file would get pretty complex and you'd have something that TRULY caters to your needs. And the greatest thing in the world is that it doesn't impact model performance nor intelligence since we're essentially not touching the very common tokens that are the glue of the model. A simple "preference strength" slider would allow control over this new set of rules (for instance scaling it down a bit to allow more possibilities)
That's the whole thing, discuss?
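Rough sketch of what I mean in python (every name here — token_preference, preference_strength, the update rule — is something I just made up, this isn't an existing sampler):

import math
from collections import Counter

# token_preference: dict of token_id -> float in [-1, 1]; unseen tokens count as 0
# corpus_freq: how often each token shows up in normal text, used to protect glue tokens like "the"

def update_preferences(token_preference, reply_token_ids, corpus_freq, liked, lr=0.1):
    # rare tokens get moved a lot, very common tokens barely move at all
    sign = 1.0 if liked else -1.0
    for tok, n in Counter(reply_token_ids).items():
        rarity = 1.0 / math.log(2.0 + corpus_freq.get(tok, 0))
        new = token_preference.get(tok, 0.0) + sign * lr * rarity * n
        token_preference[tok] = max(-1.0, min(1.0, new))

def apply_preference_bias(logits, token_preference, preference_strength=1.0):
    # runs last in the sampler chain; it's just a persistent logit bias, the weights are never touched
    for tok, pref in token_preference.items():
        if tok < len(logits):
            logits[tok] += preference_strength * pref
    return logits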
>>
>>100181801
>WizardLM2 was too powerful
source : my ass
>The team was terminated and possibly exterminated
very good if true. wizardLM-2 advocated for underage tro*on-out surgery.
>>
>>100181820
zuck should hire them
I don't use any of his products and never will outside llama but wtf I love the guy now
>>
>>100181801
WizardLM reached AGI and Microsoft had to shut it down.
>>
>>100181821
that's just RLHF with extra steps.
>>
>>100181821
Tokens are the wrong way to think about this, that's like that old benchmark that counted the number of sexy words a model used as a metric for erp quality.
What you're really asking for is RL or RLHF or DPO etc. I was thinking that, in theory, you could do RL or DPO with user ratings to train a softprompt to create the kind of responses you wanted, like an automated prompt optimizer. Not sure if it's practical to do with the low amount of samples one person would create.
>>
>>100181801
Another 70b we'll never get. Still waiting for that Xwin v2 btw
>>
>>100181841
>RLHF
user specific, nobody is the same and wants the same, also it'll be dynamic and can be swapped/deactivated at all times instead of being baked into the model
>>
Newfag here
When the model starts "stuttering" (I'm not sure what this is supposed to be called; it starts printing incomplete sentences, inserting random brackets or symbols, etc), is there a way to address the problem or is that just a sign to end the session and re-launch the model?
>>
>>100181821
it already exists, it is called penalty prompt
>>
>>100181801
Don't forget that Mixtral-Instruct 8x22b is horrible despite MistralAI being absolute gods when it comes to finetuning otherwise. This proves that Microsoft probably used their investor money to deliberately have them gimp the -Instruct which is also why it came out so long after WizardLM2.
Sam is panicking.
>>
>>100181821
Sounds like you're describing logit bias, plus a nice UI for adjusting the bias up/down for all recently generated tokens
>>
>>100181883
so Microsoft basically owns both openai and mistral right now?
>>
Anybody here tried anythingLLM? Seems like a good retard proof way of doing RAG, chatgpt like webui and don't have to worry about formatting if using with ollama.
>>
>>100181946
They only have a 'partnership' with Mistral right now but the full takeover is inevitable.
>>
File: justaboutdone.jpg (105 KB, 1440x1715)
>>100181883
This honestly makes the most sense. The difference between WizLM and Mistral's instruct tune was glaring and difficult to explain otherwise.
>>
hey cpumaxx anon, is it possible to do the cpu optimizations on a consoomer cpu as well?
>>
>>100181883
I don't think so actually. Mistral was alright at finetuning, but their models never follow instructions to the letter, like openai, nor do their models have particularly good prose like cmd r+ or llama3.
What I think happened is that the Wizard team had access to gpt4-turbo and opus, and used their outputs for training, while mistral is probably afraid of litigation and was limited to home-grown datasets.
>>
>>100182002
the entire thing that makes cpumaxxing half-viable is the fact that the new epyc processors have lots of memory channels + ddr5.
>>
>>100181866
You can try editing the messed up response into something sensible and see if that gets it back on track
>>
File: h.png (13 KB, 729x142)
this has to be prime-tier copium.
if r*dditors believe in this bullshit, it's not at all surprising that /lmg/ does too, same user base after all.
>>
File: 39_00268_.png (1.22 MB, 744x1024)
>>100181322
Downloading. Weight GGUFs inbound later tonight for any interested anons.
>>
>>100182131
>*weighted
>>
>>100182122
>we
cmd r+ is BY FAR the most uncensored corpo model. Not even a competition. But llama3 is next.
>>
File: kneel.png (452 KB, 448x732)
>>100182131
>Weighted GGUFs inbound later tonight for any interested anons.
>>
I've now got 2xP40s running Llama3 70B on an R720 in my closet. Serves code completion for VS code, an agent webui, and normal textgen through ollama directly. Feels like the future.
>>
>>100181866
That shouldn't be a problem that happens.
Provide more details (backend, model, frontend, settings, etc) and post a screenshot of the "stutter", please.
>>
>>100182122
>Meta's dystopian ecosystem
What a retard. LLaMA is the best open source family. The Quest 3 is the best value VR headset that fixes 90% of the issues people had with VR back in 2016. The metaverse has potential once zucc gets his shit together.
Meta is by far the best big tech company these days.
>>
>>100182256
I was going to object, but then I thought about Apple, Microsoft, Google, Nvidia, and you know what, I guess you're right.
>>
>>100181883
Mixtral 8x22B v0.2 is coming out in a fortnight.
>>
>>100182256
>Meta is by far the best big tech company these days.
Depressingly correct. And that's the company that did man-in-the-middle attack to steal info from your phone lol
>>
>>100182304
source?
>>
>>100182312
>1 fortnight = 14 days
kek, you learn a new thing every day
>>
>>100182304
You spelled that wrong. It's "fortnite".
>>
>>100182333
It's actually "fourtknight," as in, fourteen knights.
>>
>>100181801
>possibly exterminated
>quingfeng sun
That is all fun and jokes, but put your tinfoil hat on: a chinese guy working in the main microsoft branch on AI? Even if he wasn't a willing mole, I could imagine the CCP turning him into one by force.
>>
>>100181821
>Another button lets you mark this reply as "satisfactory", all the tokens with the same logic are now given a better positive value depending on token frequency etc
vectors
>>
>>100181397
Instruct models are trained to find the appropriate response to some instruction.

Chat models are trained to do multi-step back-and-forth, to give (whatever the dataset maker deemed) satisfactory responses at each step.
>>
I have been using fp16 transformered 8b instruct and solana for the past 2 hours. My impression (prone to placebo) is that it is indeed noticeably better than 8bit gguf.
>>
>>100182122
>same user base after all.
do they also melt down if there are no threads with Miku in OP?
>>
>>100182568
There was barely any difference between 8bit and 16bit in the paper, if you're seeing a difference it's probably just from gguf and llama.cpp being broken as usual
>>
>>100182639
Admittedly I was skeptical of the paper as well but now that you brought up llama.cpp being broken as usual I am 100% convinced it was better because of course lamma.cpp is broken as usual.
>>
why are people still using llama.cpp when exllamav2 exists
>>
>>100182750
because offloading
because buying more system ram is cheaper than buying vram
>>
>>100182750
exl2 gives falsified results because all exl2 quants are inherently 'calibrated' as according to the used calibration dataset
it's impossible to use the model in a pure form while using exllama2
>>
>>100182750
P40diots, CPUMaxxxers, VRAMlets
>>
>>100182773
what are the alternatives since GGUFs are trash?
>>
>>100182750
trying 70b on cpu made it so fucking retarded it made me pity anyone running anything below Q8
>>
>>100182750
because with 36gb vram I can only run 3.5bpw in exl2, and Q5KM is noticably smarter
>>
For me, it's fp16 using transformers.
What models? Why 8B, of course, you don't need more than that.
>>
I can't believe that we could have had gpt4-tier performance since the start if we had used fp16 with llama1
>>
>>100182798
what's your tok/s with your Q5KM setup?
>>
>perplexity is a meme! retards! stop paying attention to perplexity!
>well look at the perplexity of those quants! it is the same! you are stupid if you use bigger quants than Q4
>>
File: file.png (1.52 MB, 1920x2560)
has anything better come out since meta's Segment Anything? (picrel)
>>
>>100182773
the calibration dataset is very small and you can use any calibration dataset you want (including none but you miss the whole point).
the advantage of calibrating is that you can fine tune to account for the quantization loss.
you modify the model just by quanting it, so at that point it's no longer in its pure form anyway.
also it is only used to compare with what the base model would output.
>>
>>100182873
I think there were some papers that improved upon it but not sure.
>>
>>100182873
>>100182889
could be this https://github.com/IDEA-Research/Grounded-Segment-Anything
>>
>>100182853
less than 2, it's excruciating
so yeah I do stick with the 3.5bpw most of the time
>>
>>100182898
I guess it depends on what you're using it for. I haven't tried the 8B with the 200+K context yet but it might be good for a co-pilot substitute with that much context if it has decent codegen

right now CodeBooga 33B is probably the best one I've tried for that purpose
>>
when are we gonna have some good OPs?
>>
>>100182568
>>100182639
Meta-Llama-3-70B-Instruct gguf perplexity

imatrix computed for Q8_0 with wikitext-2-raw/wiki.train.raw (1024 chunks, n_ctx=512)
perplexity on wikitext-2-raw/wiki.test.raw (584 chunks, n_ctx=512, batch_size=2048, n_seq=4)

quant ppl(no-imat) ppl(imat)
Q4_K_S 6.9546 5.9468
IQ4_NL 6.9097 5.9434
IQ4_XS 7.1078 5.9323
IQ3_M 71.2264 6.1527
IQ3_S 284.2107 6.1492
IQ3_XS 332.4550 6.4291

As a ramlet (64 GB system, 24 GB vram), I want a ramchad to do this and post the results.

# Create KL-divergence base for fp16 Llama-3-70B-Instruct
./perplexity -m Meta-Llama-3-70B-Instruct-f16.gguf -f wikitext-2-raw/wiki.test.raw --kl-divergence-base kld-base

# Compute KL-divergence for every quant, with and without imatrix
for q in Q8_0 Q8_0-imat Q6_K Q6_K-imat [insert all quants here]; do
./perplexity -m Meta-Llama-3-70B-Instruct-$q.gguf -f wikitext-2-raw/wiki.test.raw --kl-divergence-base kld-base --kl-divergence
done
>>
File: awoooo.png (23 KB, 428x436)
I completely forgot how to set this shit up. If it's in the FAQ I missed it and I'm retarded.
>>
>>100183098
start by installing linux
>>
>>100182122
You have no idea how much "safety" people and red teamers love to blow everything out of proportion to make their jobs sound important. A guy I worked with accidentally exposed a dev env webpage to the internet for a few days and the security people escalated it to the board of directors and the fucking government. He got fired.
>>
>>100183072
By the way, if the tokenizer is broken (likely), this won't show the problem. We need perplexity calculated with exllamav2 or hf-transformers or something else on fp16 to compare.
>>
Next theme thread: https://www.youtube.com/watch?v=Vkj9XvA27fs
>>
I may be retarded. Is it normal to be using Llama 3 in Transformers and getting a .assistant end to a response? I just copied the prompt format for Llama 3 into notebook and tested it this way. Checking the tokenization, all the special tokens appear to be tokenized correctly.
>>
>>100183072
>Q4_K_S 6.9546 5.9468
I don't use GGUF but... are you sure you did this right? I try to keep up on quant methods, and I was under the impression that imatrix helped k-quants only slightly, and that the difference became basically negligible as you went much above 4bpw. That's a huge difference for q4_k_s. Is llama.cpp actually completely borked somehow (without imatrix at least)?
>>
>>100183072
anon could you try llama3-8b in the meanwhile?
pleaaseeeeeeeee?
>>
>>100183130
Securityfags are the fucking worst even when it's nothing to do with AI. "Whoops, I've discovered an incredibly contrived hypothetical vulnerability that also requires physical access to the machine to carry out, time to push a microcode update that nerfs the performance of the CPU you paid for by 25%!"
>>
File: 1707703521934241.png (53 KB, 671x254)
why is llama 3 70b so retarded
>>
>>100183309
Nah it's overtrained on common riddles
>>
>>100183309
Try asking it to explain the answer.
>>
I'm trying out the long context fine tune of 8B right now. Assuming I got the formatting right, it does notably worse at an original reasoning problem I just threw at it. First, the model goes straight away to giving me the answer and then explaining it instead of performing CoT and then producing an answer. But then when I try prompting it to do CoT, it still gets the answer wrong at the end. This is as opposed to the original Instruct that defaults to CoT and also gets the answer right.
No logs because I'm not revealing my test set.
I trust that this model can do retrieval, but it is also a dumb model. It makes sense because they claim
>For training data, we generate long contexts by augmenting SlimPajama
This is probably not great data.
>>
llama 3 8b is repeating/becoming obtuse and doing lots of ...'s, how fix?
>>
>>100183214
it's become a meme, so it's at least somewhat common, although it could be a common mistake and not normal
>>
>>100180302
Is calling this website 4channel dot org funny yet?

[spoiler] I'm going to keep doing it regardless [/spoiler]
>>
Wake me up when local models can do handle this: https://rentry.org/bloatmaxx/
>>
sorry for being the newest fag that ever newfagged. But can I be spoonfed the smallest possible model that I can run on my toaster pc. pretty pwease :3
>>
>>100183229
>are you sure you did this right?
Yes.
>I was under the impression that imatrix helped k-quants only slightly, and that the difference became basically negligible as you went much above 4bpw
That's what I would expect. Something looks to be very broken in llama.cpp for Llama 3. Ramlets like me are stuck with using llama.cpp for Llama 3 70B. I'm going to work on testing to compare the python tokenizer and the llama.cpp tokenizer.
>>
>>100183405
mmm.. thats right you have to beg. but whats your setup? what do you want to do with the model?
>>
>>100183405
phi-3 3b
>>
>>100183409
If you use llamacpp_hf in booba, you'll at least avoid tokenizer and sampler bugs. This is how I coped as a vramlet, until I decided 70B wasn't worth it anyway. Boohboo has another unfixed bug for the last week though, you have to fix it manually lol
https://github.com/oobabooga/text-generation-webui/issues/5885
>>
File: file.png (142 KB, 1150x861)
>>100183309
lol
>>
File: 1696720738257649.png (70 KB, 701x276)
>>100183339
>>100183464
kek wtf
>>
>>100183464
>>100183480
use cot
>>
>>100183480
That's funny, it's really certain it's trying to solve the original riddle.
>>
>>100183440
Huh, so that's why I'm getting the .assistant then? God damn I thought using Transformers would avoid any issues like that.
>>
File: file.png (161 KB, 1132x866)
>>100183491
>>
>>100183440
Seems like checking perplexity before claiming to support a new model should be sop for llama.cpp, or even a regression test, also for quants. Automated testing of the tokenizer should be standard too. I thought it had VC funding?
>>
>>100183527
baby steps
>>
>>100183309
I just tested this on 8B and it answered correctly at 0 temp.
>>
>>100183548
as expected, fp16 8B >>> q8 70b
>>
File: file.png (115 KB, 1131x632)
>>100183527
Tried regenerating, it's sure of the answer.
Shows how important CoT is.
>>
>>100183557
Well, no. I just tested this prompt on lmsys, and 70B also says mother. I would expect that they don't use quants on lmsys.
>>
GPT-4

total Yann LeCun victory tbdesu, LLMs are clearly just fundamentally a shitty technology
>>
>>100183557
Actually, 4 bit 70B for me. Gonna try 5bit next, should be able to fit it in memory.
>>
>>100183573
So the issue is actually likely >>100183496 >>100183338
The larger model was able to memorize the (supposed? I haven't seen what the original riddle is) original riddle, while the small one didn't so much.
>>
>>100183440
set skip_special_tokens to False instead
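If the real problem is the model blowing straight past its end-of-turn token, the other standard fix on the transformers side is to also pass <|eot_id|> as a terminator. Rough sketch, model id and prompt are just placeholders:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Say hello."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# stop on the regular eos token OR on llama 3's end-of-turn token
terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]
out = model.generate(input_ids, max_new_tokens=128, eos_token_id=terminators)
print(tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))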
>>
File: 1708306712216342.png (133 KB, 843x413)
>>100183491
i'm fucking dying
IQ4_XS if that matters
>>
>>100183533
True. They should've caught shit like this if they did proper testing. Sad.
>>
>>100183579
Opus got the answer correct, but then its explanation went off the rails and acted like I'd given it the regular formulation of the riddle

I'm definitely LeCun-pilled now
>>
>>100183464
>>100183527
>ollama
Ollama (a thin llama.cpp UI) has meetups and Silicon Valley buzzing, while llama.cpp itself languishes with tokenizer and quant bugs and numerous unsupported new models.
>>
This is why you use storytelling continuation with token probabilities to test model intelligence. Not riddles.
>>
>>100183614
#justwerks for me
Plus it supports passing logits and constrained generation out of the box, which is what I need it for primarily.
>>
>>100183611
>even the industry SOTA is this dumb
Transformersissies...
>>
File: 1696812275255124.png (45 KB, 1265x262)
cr+ gets it right
>>
>>100183617
This kind of testing is still useful for proving that these are not minds, but probability engines. A lot of people need that reminder.
>>
>>100183633
But CoT is able to generate novel thinking outside the learned patterns, as demonstrated in the thread
>>
File: 1714098779087.jpg (254 KB, 1080x719)
GPT4 gets it right
>>
File: file.png (82 KB, 984x573)
LLaMa3-70b-instruct on ppl labs also fails it
>>
>>100183645
wrong, bitch >>100183579
>>
>>100183653
The screenshot literally proves that your screenshot is bullshit, kys
>>
File: file.png (79 KB, 1011x579)
mixtral-8x22b gets the answer right
>>
File: 1684052477547270.png (53 KB, 1273x290)
command-r (not plus) also gets it right
>>
>>100183645
lol, it wants SO BADLY to link it back to the original riddle, but logic wins out.
>>
>>100183657
you're seriously claiming that anon used inspect element or something to make gpt-4 look bad?
>>
>>100183439
Thank you anon. I love you <3
>>
>>100183662
Damn, command-r and plus looking pretty good right now for logical thinking.
>>
>>100183657
>>100183653
Instead of insulting each other have you ever thought that different versions of GPT-4 might exist and explain the difference in your results, rather than that someone doctored their image?
>>
>>100183674
Yes, either that or he picked a bad gen on purpose. I tried it multiple times and GPT4 got the answer right every time.
>>
>>100183558
>>100183527
Also, getting 6.6 t/s on 70B
P40 haters in shambles
>>
>>100183698
Even if he had to do multiple gens to find a failure, it's still bad that the supposed sota model failed it even once
>>
>>100183611
>>100183579
Man this is genuinely disappointing and reduces my interest in LLMs a bit. I never thought they were alive, but this makes it a bit too stark that they're just big statistics processors.
>>
>>100183726
Yeah, this kinda marks the end of LLMs for me. If these multi million dollar, trillion parameter models can't solve it, local doesn't have a chance. LLMs are a permanent dead end. An AI winter is probably coming soon if I had to guess.
>>
>>100182750
I use Koboldcpp because of its context shifting, basically it intelligently deletes part of the context so you never have to reprocess it.

I find it helps massively once context gets large enough, and I hate the downtime.
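Conceptually it's doing something like this (very simplified sketch of the idea, not kobold's actual code, and it ignores the KV-cache shifting that makes it fast):

def shift_context(system_tokens, chat_tokens, new_tokens, max_ctx):
    # keep the system prompt / memory pinned, drop the oldest chat tokens once the window is full
    chat_tokens = chat_tokens + new_tokens
    budget = max_ctx - len(system_tokens)
    if len(chat_tokens) > budget:
        chat_tokens = chat_tokens[len(chat_tokens) - budget:]
    # because only the head of the chat got trimmed, the surviving tail is unchanged
    # and the backend can reuse its cached state instead of reprocessing the whole prompt
    return system_tokens + chat_tokens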
>>
>>100183701
two P40s?
>>
File: file.png (168 KB, 1821x600)
>>100183768
yeah
>>
>>100183757
<cope>j-just wait, gpt-5 is coming and it's gonna blow everything else away!</cope>
>>
>>100183780
very nice, what PSU?
>>
>>100183757
>my short form riddle solving dreams are over
bros... they can't even do math either, or count letters in words. they can't even be counted on to get my politician's birthday questions right. it's over... what else could I do with a language model?
>>
>Generative AI exists and can do fuzzy logic that is damn near impossible to program conventionally
>IT GOT A RIDDLE WRONG????? ITS SO FUCKING OVER
I hate you niggers so much.
>>
>>100183757
the current gen of AI doesn't have to end in AGI to be useful
its plenty useful at just being an akashic record of human knowledge, things like codegen services will still be very popular with LLMs
>>
>>100183796
1100 watt, but I have them heavily under-clocked so it's only drawing ~400 watts at peak.
>>
There's nothing inherently wrong with neural networks or them being statistics engines. We already know they work and have internal models of things like real brains do. The issue arises when we are asking one to learn about the world solely through text, and not just text, but text of dubious quality (dude men can be women lmao) with no grounding in reality. That is what LeCun's arguments are about, not that neural networks somehow inherently can't have internal models at all. That is why he proposes that we'll need something JEPA-like in the future.
>>
>>100183710
The true SOTA model is GPT4 classic. I would love to see him trying to get the same response from it to truly prove AI winter is here.
GPT4 Turbo is optimized for benchmarks and is retarded.
>>
>>100183757
the tech will be useful to make new kinds of AI that are not just meme llm / transformers but novel architectures that can actually reason, it's just the beginning.
>>
>>100183812
I don't think anyone thinks they're unimpressive as compressed knowledge stores or creative writing tools

This plateau has gotta be disappointing for all the people who thought they were going to culminate in Asimov style artificial minds though
>>
>>100183812
>akashic record
I like that you said this. I like you.
>>
People are trolling right? There's 3 screenshots of it failing to solve it, and 6 showing it succeeding.
>>
>>100183726
What they do is closer to intuition than deliberation. If you want to make them think, you have to make them spell out their thoughts in the context. It's like a person who gives a quick answer from experience of having to answer many such questions, so he can make a mistake if the question is intentionally confusing.
>>
File: robert 2.jpg (78 KB, 533x432)
I am torn between "instruct" models and "chat" finetunes for my use case. The use case:
A health-coach assistant that is optimized for long, coherent conversations, rather than simple Q&A. I want it to remember the context of long interactions, and recall/consider old messages in the conversations, and to actively listen and tease out the user's problems when not provided with enough information to give a confident answer. Once it does have enough information from the user, I'd like its responses to be pretty conclusive and fleshed out: sometimes a few paragraphs.

I worry that "chat" finetunes bias the models towards very short messages that are good for quick back-and-forths with an RP chat buddy, and that this would be too short for a coach-type assistant bot. Am I wrong here?
>>
>>100183833
>>
>>100183879
A lot of models can do this as demonstrated by the thread. The problem is when they're asked to explain their answer.
>>
>>100183896
looks good to me
>>100183558
>>100183527
>>
File: 1714100177564.jpg (93 KB, 1052x389)
93 KB
93 KB JPG
>>100183896
>>
>>100183896
>>
>>100180827
> Quantized Moistral-11B-v3 Model
wat
>>
>>100183599
>>100183611
>>100183646
LLMs are really reasoning guys!
>defeated by swapping genders in a cliche scenario
>>100183630
this ONE is fine. Does command-r have something better than the pile? >>100183662
>>100183645
this one sings from very peculiar hymnal
>>100183658
okay this one is sort of fine too where you can see it jumping onto the original cliche but then taking a step back last moment
>>
>>100183823
thanks for the info anon
>>
>>100183925
This is the best explanation posted so far because in the others you could feel that the model was being dragged towards the normal version of the riddle and only barely managing not to give the wrong explanation. This is just autistic correctness with no sign the model is thinking about the original riddle at all

maybe the schizos saying og March 2023 GPT-4 is the best are right
>>
>>100183948
Truly, excessive alignment is the root of all evil.
>>
>>100183925
How did you get access to 0314 tho
>>
>>100183925
yeah, "alignment" is teaching the models to be stupid.
>>
>>100183966
All versions of GPT-4 are still available on the API (that interface is the Playground which is OpenAI's simple interface for testing the api without writing code)
>>
File: file.png (5 KB, 185x153)
>>100183989
are they?
>>
>>100184012
Huh I dunno then, I still have the original 0314 one, it appears between 0613 and 0125 in my version of the menu

I got fairly early access to it via application so maybe it's related to that
>>
>>100184051
damn...
>>
>>100183757
LLMs are toys. Have fun with them!
It is possible to build genuinely impressive, clever, and useful contraptions out of Lego, but there are limits.
>>
>>100184103
>Lego
I like that you said this. I like you.

God I love my little akashic record legos.
>>
I tried optimizing a quantized model with gradient descent to minimize the error relative to the original model.

Specifically, I hacked torchtune to replace the bf16 weight matrices in all the nn.Linear layers with the separate qs/scales/d components of the quantized Q6_K format. These are stored as floating-point "latent weights" (like in the bitnet paper), but the forward pass rounds, clamps, and runs the normal Q6_K dequantization on the fly, so the weights that actually get used in the linear layers are always exactly those of some valid Q6_K quant.

I took a normal Q6_K quant of Llama3 8B Instruct and separately optimized each layer to minimize the error it introduces relative to the same layer in the original model. I did each layer separately to minimize VRAM usage, since I eventually want to apply this to L3 70B. This took about 6 hours on 1x 4090. Then I converted the results back to a Q6_K GGUF.

KL-divergence on wiki.test.raw improved by a small amount:
old:   0.004234 
new: 0.003945
delta: 0.000289 (7%)


I have a few ideas to try next to improve the results:
>Train 2-4 layers together. My hope is that this will give the optimizer some flexibility to have the layers cancel out each other's errors.
>Train the layers sequentially. First, train layer 0 as normal. Then, instead of training layer 1 to map layer_0_fp16_output -> layer_1_fp16_output, train it to map layer_0_quant_output -> layer_1_fp16_output. This lets the layer 1 optimizer know about the errors introduced by layer 0 so it can correct for them.
But I'd love to hear other suggestions too. I'm sure there are anons ITT who know a lot more about ML than I do
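If it helps picture it, here's a stripped-down sketch of the fake-quant linear (this is not the actual torchtune hack, and the real Q6_K layout is more involved — super-blocks with a second level of scales — this just uses one flat block size to show the idea):

import torch
import torch.nn as nn

class FakeQuantLinear(nn.Module):
    def __init__(self, weight, block=16, bits=6):
        super().__init__()
        self.qmax = 2 ** (bits - 1) - 1                          # 31 for signed 6-bit
        w = weight.detach().float().view(-1, block)
        scales = (w.abs().amax(dim=1, keepdim=True) / self.qmax).clamp(min=1e-8)
        self.latent = nn.Parameter(w / scales)                   # floating-point "latent weights"
        self.scales = nn.Parameter(scales)
        self.out_in = weight.shape

    def quantized_weight(self):
        q = self.latent.round().clamp(-self.qmax - 1, self.qmax)
        # straight-through estimator: forward sees the rounded weights, backward flows into the latents
        q = self.latent + (q - self.latent).detach()
        return (q * self.scales).view(self.out_in)

    def forward(self, x):
        # the weights actually used are always exactly representable in the quant format
        return nn.functional.linear(x, self.quantized_weight())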
>>
>>100184051
>>100184076
They probably disabled it for customers who were not already using it.
>>
>>100184295
this talk is too technical for this general, please leave for your own good.
>>
>>100184347
stfu i'm here for the technical stuff.
>>
>>100184295
By the way, this initial test was with a non-imat Q6_K. Based on >>100183072, I'm running it again tonight on an imat Q6_K instead.
>>
wait how the fuck does dequant work? do you just make up information?
>>
>>100184405
Please explain/rephrase.
>>
>>100184295
Very neat idea anon. So to try and understand: you're processing data with both the unquantized and quantized layer, then using the difference in the output as error for backprop? Or are you simply moving the Q6 weights closer to the quantized weights?
>>
>>100184405
It just means converting the quantized form back to fp16 or whatever so you can use it. But the fp16 you get back will be different from the one you had before quantization, since some information is lost in that process.
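Toy example of the round trip (made-up 4-bit symmetric quant, just to show where the loss comes from):

import torch

w = torch.tensor([0.12, -0.53, 0.07, 0.91])  # original float weights
scale = w.abs().max() / 7                     # 4-bit signed grid: integers in [-8, 7]
q = (w / scale).round().clamp(-8, 7)          # the rounding here is where information is destroyed
w_back = q * scale                            # "dequantized" weights, float again but not the originals
print(w_back)                                 # roughly [0.13, -0.52, 0.13, 0.91]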
>>
>>100184460
for what purpose? so you can tune it?
>>
>>100183645
>Half the response is neo-religious psychobabble
That's some awfully noisy output. Pass.
>>
Anyone using fimbulvetr, upgrade to moistral v3. It's the same, but better vocab.

Anyone using bigger, better models, ignore this.
>>
>>100184295
Upload it to github and we'll hail you as savior (but then will obviously shit on you like we did on Kalo).
Why q6 quants, though? Won't it be more pronounced with, say, q2?
>>
>https://console.chaiverse.com/models/neversleep-llama-3-lumim_6375_v2
i wanna download this cause it's pretty good T_T
>>
>>100184295
I don't know much about ML, but I'm confused about what the inputs are you're using to compute error. If it's text, tokenizer bugs are going to affect you.
https://github.com/ggerganov/llama.cpp/issues/6809
>>
>>100184473
Yes. Some tools only work on f32/fp16
>>
>>100184473
Also, quantization to a different format.
>>
File: 25oc38lbsqwc1.png (73 KB, 1086x452)
>>100184103
>Meanwhile at OpenAI
>>
>>100184473
Hardware operates on fp16/fp32/(others).
>>
>>100184516
llama3 8B Instruct is sentient
>>
>>100184488
Wasn't that the horny retarded model? Is v3 any better?
>>
what ever happened to all about Q* breaking encryption?
>>
>>100184499
Or maybe not if this is all in python with pytorch.
>>
modelpill me on command r and why haven't I heard anything about it except in /lmg/
>>
>>100184541
Last we've heard is from Sam Altman saying that OpenAI wasn't ready to talk about Q* yet.
>>
>>100184456
>you're processing data with both the unquantized and quantized layer, then using the difference in the output as error for backprop?
Yes. There are two models, the original fp16 model, which is used as a reference, and a single transformer layer in Q6_K format, which is being trained. We take the original fp16 model and run the forward pass up through layer N, and record the input and output of layer N. Then we run the quantized layer's forward pass using that same input, compute loss as mean squared error between the fp16 output and the quantized output, and run the backward pass.

Part of the idea here is to minimize VRAM usage so you can do this on big models. The fp16 forward pass can be run with only a single layer in vram at a time, and the performance penalty of doing this can (hopefully) be mitigated by batching prompts. Then you only need enough additional vram for one quantized layer and its gradients.

>>100184489
>Upload it to github
Will do in the morning

>Why q6 quants, though?
All the modern smaller quants are actually mixes of multiple quants. Q4_K_M is a mix of Q4_K and Q6_K, for example. I picked Q6_K so I only had to implement packing/unpacking of a single format at first. I do plan to try Q2-Q4 later on.

>>100184499
Thanks for the link - I keep seeing anons talking about supposed tokenizer bugs, but this is the first time I've seen a link to an actual issue

>>100184546
>Or maybe not if this is all in python with pytorch.
I think you're right - I'm using the tokenizer from HF's transformers library
>>
>>100184525
then whats the disadvantage of quanting below that?
>>
gpt5 dropping this july
mark it down on your calendars
>>
>>100184541
A nothingburger
>>
>>100184534
It's less retarded now. But Fimbul was also horny and retarded due to its size. Now it's about equal.
>>
>>100184541
The rumors were literally an effective psy op
>>
>>100184516
It's gonna be pretty funny if GPT-5 is mediocre when it drops after all of roon's religious rapture the last few weeks
>>
>>100184516
roon is indian isn't he? good day sir
>>
>>100184492
NOOOOO MY SESSION GOT RESTARTED
MY HOT STEAMY SEX WAS JUST ABOUT TO BEGIN
>>
Is daybreak anon still alive? I'm still eagerly waiting on 2.0 of his dataset because 1.0 was probably my favorite model.
>>
>>100184623
nvm I'm BACK
>>
Man, Q2 really cooks models. The knowledge is clearly there in most cases, but it can't spit it out without sounding retarded.
>>
>>100184541
it was this
https://rentry.org/Q451921

>>100184613
GPT5 will be praised as a new technogod
>>
>>100184692
Yeah something catastrophic seems to happen to them in the drop from 3 to 2, it's a gradual descent up until then but below 3 they just totally lose it
>>
>>100184613
>if GPT-5 is mediocre
That's why it will be called GPT-4.5. Fanboys have built anticipation too high for GPT-5, and Altman has to manage expectations. There will be no GPT-5 this year.
>>
>>100184699
>codenamed DESU
WHAT
>>
When is a decent llama3 70 finetune gonna come out?
>>
>>100184716
I think it'll be the opposite, the model will indeed be only a 4.5 level of advancement but it will be named 5 for marketing/cope reasons.
>>
File: .png (174 KB, 710x111)
thanks, y-you too...
>>
>>100184783
in about 14 days
>>
>>100184783
one more fortnight
>>
>>100184783
thursday after next
>>
>>100184783
Working on midnight llama. Will be posted soon
>>
>>100184773
This first very vid on it was posted to /g/ too
https://youtube.com/watch?v=3d0kk88IE8c
>>
>>100184488
Holy shit, Moistral is actually pretty good.
>>
File: pinned.png (71 KB, 692x352)
>>100184878
kekk
>>
>>100184773
>DESU
Where do you see that?
>>
lol models are fucking retarded
>>
>>100181866
I think I had this before, most likely you are running out of vram and its offloading elsewhere
>>
New thread!
>>100184962
>>100184962
>>100184962
>>
>>100184954
>>
>>100184971
I feel like I'm shooting fish in a barrel at this point.
>>
>>100185045
>>100184971
>>100184954
To be fair, LLMs are trained to interpret the user prompt a bit loosely, as there are imperfect user prompts. Therefore, given a prompt that seems very similar to an existing famous riddle, they will assume that you just imperfectly copied it and answer as if it was the original riddle.
>>
>>100184950
It's right there in the link?
The numbers spell it out
>>
bake
>>100185124
>>100185124
>>100185124
>>
i'm staying here
>>
>>100185087
yes of course. cleverbot might do the same thing. all is well until you ask the model a new problem it's never seen before, and it just pattern matches it to the closest problem in its dataset.

I have this happen a lot on real novel problems I pose to a model, even GPT4. It will just give a similar solution that works on a common problem that is similar, but not the same.
>>
>>100185131

WHAT

>>100184965

ARE YOU TWO DOING
>>
>>100185202
Just ignore the Petra thread. See: >>100185216
>>
>>100185202
Jokes on you anon it's one person that's just pretending to be two retarded people
>>
ACTUAL NEW THREAD!!!
>>100185269
>>100185269
>>100185269
>>
>>100185273
is it though?
>>
>>100185087
at the very least they could acknowledge the "this looks like the common riddle" reading to the user and add "but if you really just mean x, then"
>>
>>100185765
but that's not good for coping


