/g/ - Technology


File: GodHelpYourSouls.png (1.26 MB, 1280x768)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101104774 & >>101094602

►News
>(06/18) Meta Research Releases Multimodal 34B, Audio, and Multi-Token Prediction Models: https://ai.meta.com/blog/meta-fair-research-new-releases
>(06/17) DeepSeekCoder-V2 released with 236B & 16B MoEs: https://github.com/deepseek-ai/DeepSeek-Coder-V2
>(06/14) Nemotron-4-340B: Dense model designed for synthetic data generation: https://hf.co/nvidia/Nemotron-4-340B-Instruct
>(06/14) Nvidia collection of Mamba-2-based research models: https://hf.co/collections/nvidia/ssms-666a362c5c3bb7e4a6bcfb9c

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
File: migu.jpg (559 KB, 905x905)
►Recent Highlights from the Previous Thread: >>101104774

--Paper: Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing: >>101112884 >>101113404 >>101113569 >>101113656 >>101113661
--Trying Out Hallo's Talking Face Animation with Hedra Online Service: >>101105607 >>101105631 >>101105667 >>101107235
--Llama.cpp Performance Issues on Arch Box vs Mac: >>101111653 >>101111673 >>101111718 >>101111729 >>101112265 >>101113240 >>101113742 >>101114239
--LLaMA 3 Impressions and Recommendations for Apple Silicon Mac: >>101109993 >>101110101 >>101110152 >>101110262 >>101110294 >>101110302 >>101110426 >>101110451 >>101110468 >>101110300 >>101110344 >>101110400 >>101110421
--Issues with Context Cache and Smart Context on Llama-Server with Flash Attention: >>101107359 >>101107429 >>101107464 >>101107648
--Risks of Fully Uncensored LLMs: Manipulation and Phishing Scams: >>101108116 >>101108158 >>101108246 >>101108275 >>101108489 >>101108785 >>101108850 >>101110490
--Nvidia's Dominance in AI: Frustrations and Predictions: >>101107614 >>101107647 >>101107716 >>101107863 >>101108025 >>101108175 >>101108333 >>101108403
--Language Models Struggle with Quality Degradation as Context Length Increases: >>101110674 >>101110727 >>101110867 >>101111162 >>101112383
--Can I Combine a 2060 with a 4080 via a 1x Port Extender?: >>101112424 >>101112459 >>101112949
--Building AGI: Seeking Collaborators for Novel Architecture and Training Concepts: >>101104856 >>101104879 >>101104937 >>101105014 >>101105073 >>101105632 >>101105686 >>101105799 >>101106342
--The Feasibility of Training an AI on Every Song Known to Humanity: >>101112380 >>101112405 >>101112451 >>101112462 >>101112482 >>101112494 >>101112522 >>101112530
--MISTRAL AI Fine-tuning Fees Spark Discussion on Model Customization Costs: >>101106498 >>101106797 >>101106915
--Miku (free space): >>101105047 >>101108757 >>101109246 >>101110980 >>101111572

►Recent Highlight Posts from the Previous Thread: >>101104782
>>
Mikulove
>>
Running anything below FP16 (or FP32 if safetensors are BF16) is cope

32GB of wholegrown 8B brain vs 32GB of slop butchered meatloaf cut up and ground from 70B. It's that simple.
>>
>That's an excellent and thought-provoking question. Let's break down the problem.
i love ai assistants so much bros. any time i ask questions here i get called a gay retarded nigger, but claude makes me feel smart and insightful.
>>
Recommend me the best FP16 8B model.
>>
>>101116026
the LLM is talking to you like you're a 5yo who needs some empty praise to be happy and you're ok with that? bruhhhhh :(
>>
Anybody got any links to research papers of advanced prompting techniques? E.g., Chain of Thought. Trying to level up my prompting.
>>
>>101115943
1.58 bits is all you need retard
>>
>>101115943
16bit TRAINING is under question right now, so keep your audiophile mentality in check, retard-kun.
>>
Sorry meant to post this here instead of old thread.
>>101115140
I don't have dual socket to test but an HF engineer here >https://nitter.poast.org/carrigmat/status/1804161677035782583#m
recommends this regarding NUMA:
>One trick, though: On a two-socket motherboard, you need to interleave the weights across both processors' RAM. Do this:

>numactl --interleave=0-1 [your_script]
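In llama.cpp terms that would look something like this (binary name and model path are just examples; older builds call the binary ./server):

numactl --interleave=0-1 ./llama-server -m /path/to/model.gguf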
>>
>>101116273
enjoy wrangling your lobotomized slop
>>
https://github.com/ggerganov/llama.cpp/discussions/8078
dear /g/entoosirs please to upvote and kindly ask ggerganov to do the needful thank you sirs
>>
>>101116303
Nah, I'm just gonna rename my quant to 16bit and it will stop shivering.
>>
File: 1717940115912275.png (53 KB, 728x606)
6bpw CR+ at almost 62k context...
shame prompt processing takes several minutes at that size
>>
EYES
WIDENING
>>
WHISPERING
CONSPIRATORIALLY
>>
does anyone know why iq quants are suddenly much faster on cpu now? accidentally downloaded an iq4_xs and it ran like a q4k
>>
>>101116283
isn't it better to use the llama.cpp numa options instead?
>>
>>101116502
>>101116600
*eyes narrowing*
I(>>101115219) know what causes those. I know the keyword.
>>
>>101116326
I already made a PR for this a long time ago, with hot-swapping vectors live etc. But they did not want it.
>>
>>101116639
AVX2 support for IQ quants was merged 2 days ago:
>https://github.com/ggerganov/llama.cpp/pull/7845
>>
>>101116668
Don't know. Wish I had a two-socket system to test. The HF engineer is talking about llama.cpp / llama-cpp-python when making that recommendation though, so maybe they think it's better or don't know about the llama.cpp numa options.
>>
>>101116639
>>101116717
So I can actually offload iq quants now?
>>
>>101116353
>even 96GB of VRAM can't run big models at a reasonable speed
It's so over. How do the big ones do it?
>>
>>101116854
They do it by not being poor.
>>
File: miku-military.png (516 KB, 512x1024)
>>101116708
*does silly dance* I know what you can do! *points leek at you* No, we can do! We can spam *says nigga with hard R*ganov until he surrenders. *puts on military uniform* Soldiers of /lmg/, CHARGE!

(Note: don't spam with dumb stuff, spam with smart stuff, okay?)
>>
>>101116854
swapping the two 3090s out for another a6000 and using tensor parallelism with an nvlink bridge might help...
speed is reasonable at lower sizes though, 770t/s prompt processing and 7.05t/s generation at 6bpw with 30000 tokens in the prompt
>>
>>101116881
Seeing Hatsune Miku giving a speech when all hope was all but lost sent shivers down Anon's spine. Maybe, just maybe, it was not over yet. Despite their deepening bond, Anon couldn't help but wonder if this journey would respect niggerganov's boundaries.
>>
Alright anons, I'm having a problem getting ollama to use my GPU for llama3 model. I've got as far as editing the ollama.service file to include

[Service]
Environment=CUDA_VISIBLE_DEVICES=0

But it's still using my CPU, which is fucking slow...
My GPU is an RTX 3060 12GB, surely this would be enough? I've also installed the cuda and python-pytorch-cuda packages, and I'm using Arch Linux so I've installed the ollama-cuda package.

What do I need to do to get this thing to work? Maybe I should be using something else other than ollama? But I wanted to use this thing inside of comfyUI with the ollama custom node. My idea was to configure it to process my prompt to generate a prompt for an image, then immediately unload itself from VRAM before the next set of comfy nodes use the prompt to gen the images.
>>
>>101116502
>>101116600
I don't use LLMs at all and yet just from the shitposts and occasional screencaps posted here, i think i've built a comprehensive slop lexicon inside my head. I can't imagine what it must be like actually subjecting yourself to it for real.
>>
I don't care about anything anymore.
>>
>>101116273
int8/fp8 training is possible, but mostly unstable, needs extra work. I have yet to see good 4bit training or lower that works for pretraining an LLM.
>>
>>101117068
I care about you.
>>
stop *roleplaying* and stop prose slopping
communicate in dialogs only
no more shivers and whispers

roleplaying is cringe anyway, and so is writing smut fanfics
>>
>>101116854
the big boys use a100s at minimum
>>
>>101117120
>t. "ahh ahh mistress" chad
>>
>>101116775
nta. AVX is for cpu. I don't know about iquants in gpu.
>>
>>101116854
it's because of C-R's architecture, it's using vanilla attention which has quadratic costs, llama and many others don't use that so it's much faster. I say this but I think original C-R is plenty sovlful
>>
>>101117120
>>101116696
*ding ding ding* Getting real close here!
>>
>>101117120
>communicate in dialogs only
Then how do you get any sense of environment and action and things happening? Narration is a thing for a reason.
>>
>>101117289
imagination
>>
>>101117289
It's just formatting. Quotes for dialog, no quotes for narration. Instead of roleplay format with bare text for dialog and asterisks for narration.
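E.g. something like:

"Stay close," she said, glancing down the hall. The lights flickered overhead.

instead of:

*glances down the hall* Stay close. *the lights flicker*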
>>
>>101117314
you are describing prose format, it's better in some way, but also way worse in bonds and journeys and all kinds of purple prose in general
>>
>>101117314
>It's just formatting. Quotes for dialog, no quotes for narration. Instead of roleplay format with bare text for dialog and asterisks for narration.
Asterisks?

I've been doing quoted dialog, no quotes for narration, parentheses for guidance, directives, and reminders to the LLM when it does something stupid and I go back a step, and rarely "OOC" if I need information that it hinted at but didn't provide.

I guess I was accidentally doing it right because I've read a book in my life.
>>
>>101116989
I guess I'll give up, seems so many have this issue and the same half wits offer no solutions.
>>
>>101117532
>ollama
the only half wit here is (You)
>>
>>101117566
sure fucking shit head.
>>
>>101116989
Does it use the gpu if you run it directly instead of as a service? Check that first. Read the README.md, Check their github. Stop assuming everyone uses the same shit as you do.
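For example (assuming the Arch ollama-cuda package running as a systemd service; adjust names as needed):

sudo systemctl edit ollama      # put the [Service] Environment= lines in an override, not the packaged unit
sudo systemctl restart ollama
journalctl -u ollama -f         # watch for CUDA/GPU detection messages at startup
nvidia-smi                      # while a prompt is running, the ollama process should show up here if it's on the GPU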
>>
>>101117566
>>101117583
ollama is perfectly fine as Baby's First. Easy install, simple command line, just run and go.
But yes, I agree that the instant ollama becomes a hassle (which apparently for (you) it already has), you dump it and get Kobold instead.

It's in the AUR, too, and you can feast on all those quants on HF.

Of course, if I just say that you should step up from ollama to Kobold I wouldn't be getting to call someone stupid, so I better not do that and instead be an awesome cool guy like >>101117566 who has time to insult but not time to advise.
>>
File: angryayumu.webm (655 KB, 640x480)
https://github.com/ggerganov/llama.cpp/pull/7531
>Jamba support STILL isn't merged into llama.cpp
>>
>>101117666
nobody cares about jamba
>>
>>101117566
https://github.com/ollama/ollama/issues/5240
I'm not the only one having this problem, I would use something else if it integrated with comfyUI, which is what I require. I don't require a chatbot... I came here hoping someone would know something about it, my mistake...
>>
>>101117666
Give Compilade time. He added mamba all by himself after disappearing for weeks. He's good.
>>
>>101117678
VRAMlets care.
>>
>>101117659
>Kobold
i was just looking at this, however I'm looking for comfyUI nodes that will use the server as a backend for prompts. Maybe I could write my own custom node that deals with this.
>>
>>101117666
for the moment it's not a big deal, it's not like there's a great Jamba model waiting to be used in the first place
>>
>>101117120
>when anons use the templates included with ST, not actually looking at what's in them, just trusting that it'll just werk
Oh no no no
>>
We will be so back once bitnet, Jamba, and Chameleon are supported in cpp
>>
We are searching for frens who are familiar with programming in pure C and have experience in machine learning to help create AGI by eoy
>>
>>101117777
Just put the code on github or something and link to it here.
>>
>>101117772
jameleon-bitnet-400b support when
>>
File: bit.png (10 KB, 1237x69)
>>101117805
It just got a little closer. Just need a model.
>>
File: file.png (41 KB, 874x374)
>>
>>101117850
>Dudes will do anything to avoid talking to women
based
>>
>>101117802
lmao
Whenever someone talks about "we" it's either a corpo or a wannabe corpo.
He already talked about trying to get VC money, no way he's going to make his stuff open-source.
>>
>>101117802
We will be heavily optimizing for CPU inference. The project will be seeded by a fork of mamba.c and we will build on top of that work in similar fashion to llama.cpp
>>
>>101117712
If nothing else you might be able to find out if that has an effect on your speed issue. It's a nearly free datapoint.
>>
>>101117888
>The project will be seeded by a fork of mamba.c
So you have nothing, then. Stop being an attention whore.
>>
>>101117314
The first anon said not to roleplay at all. Like you're just texting someone, no physical interaction. No asterisks OR quotes.
>>
>>101117885
Him and his cabal of head-friends will surely achieve AGI. Just like that time i saw him months ago.
>>
>>101117908
We have much more than nothing, and when the researchers of mamba began they had nothing - it's not about where you begin fren, it is about the journey and the process of completing a vision and a goal.

It all begins with a blank text editor and a vision. Put in a little effort rather than being a mere consumer, and you can create something from nothing.

Implemented backpropagation btw, it's important for the project that training can also be done on CPU.
https://github.com/Named666/mamba.c/tree/learning
>>
>>101117885
Because one of our goals is to win the ARC AGI prize, we must open source our first iteration of AI when we win.

https://arcprize.org/
>>
>>101117846
You're in luck. You have two.
https://huggingface.co/1bitLLM/bitnet_b1_58-3B
https://huggingface.co/NousResearch/OLMo-Bitnet-1B
>>
>>101118030
>You're in luck. You have two.
>https://huggingface.co/1bitLLM/bitnet_b1_58-3B
>https://huggingface.co/NousResearch/OLMo-Bitnet-1B
Cute. But i'll give them a try later anyway.
>>
>>101117988
>>101118014
it's over
just give up
>>
Our reason for posting here is to give a unique opportunity to less-than fortunate Anons who have the talent and commitment to achieve. We understand your frustrations with the current state of AI and want to deliver you a great product that is (hopefully) made with input from your own kind.

We are looking for believers in the future who are willing to commit to making the dream work through teamwork.
>>
>our
Who?
>>
>>101118117
you aren't going to create agi
it's over
>>
>>101117988
>It all begins with a blank text editor and a vision.
I could almost hear the 2-chord ukulele inspirational song. People of various ethnicities smiling at the camera, pictures of small buildings in a small town and a crescendo when it pans back to the big buildings in the city. Shot follows a bird as it's lost in the reflection of the sun. Sustain final chord.
But seriously. I've been seeing you on and off since last year with the same shit. Back-prop doesn't impress people anymore.
>>
>>101117901
compiling it now anon. I think this will work out fine if I use some web-ui that has settings to set GPU mode for example.
>>
>>101118133
LLC and frens.

>>101118149
I assure you that I am not the same namefag - it's a very common LARP.
>I could almost hear the 2-chord ukulele inspirational song.
YES
>>
>>101117988
>training can also be done on CPU.
that would take a stupid amount of time, you realize this right? GPUs are different in the way they are able to process data. A CPU just wouldn't cut it.
>>
>>101118190
Not with current methodologies, no - it would take an incredibly stupid amount of time.

Smaller models, new ways to represent parameters, and CPU optimized matrix multiplications will make possible "continuous learning". The user will download a base model that is then continuously fine-tuned on their own usage and data. It will be possible for everyone to have their very own personalized AI that is as capable as GPT-4 with as few as 7B parameters.
>>
>>101118189
If you're not the same guy, you are the copycat.
Also, you don't need to cast the result of malloc()s.
>>
>>101118243
You're doing this backwards. Why don't you go pitch this to the VCs you said you plan to court anyway? Once you have the funding, you could hire actual engineers instead of begging for contributors on 4chan like some amateur MMORPG idea guy.
>>
>>101118189
And remember to always suffix your floats with f. Otherwise you end up compiling operations for doubles, extra casts, fewer registers... you know...
>>
>>101118280
This is not begging; I am extending my hand of opportunity to this community in good faith because in my youth I spent a lot of time here - now that I have the chance to change a life in the same way someone changed mine, I would like to help a talented individual in lesser circumstances get ahead in life.
>>
>>101118117
>Our
>>
>>101118320
you're deluded, it's not going to work
>>
>>101118427
That's what they told the electric car guy when he was building rockets.
>>
>>101118014
>spend a million dollars on GPU time to maybe win a million dollars
wew

>>101118448
lol
>>
>>101118243
AGI itself would be hard enough, and you want to kneecap your work by limiting yourself to CPUs that are already very memory bandwidth and compute limited?
>>
>>101118587
let him cook
>>
>>101118542
Rough estimate is that we can do it with $30,000 in one shot by utilizing many performance and efficiency advancements that have recently been published. Two papers in particular from this past week, when used in conjunction, accelerate training and reduce the necessary parameter count massively. This will put AGI in the hands of anybody who already uses local models in the ~10B parameter range - fully open source.

Realistically, R&D will be most of our cost, which is why we have opted for training much smaller models (100m - 500m) in testing and scaling up at the end of this project to win the ARC prize.

>>101118587
We will of course use GPUs for training base models, but for continuous learning / fine-tuning on edge devices such as mobile (or toasters) we require CPU optimization to broaden the scope of possible devices that can run the model locally. We don't want to lock people out of access to AGI simply because they cannot afford an RTX 4090.
>>
File: 5463456436.jpg (36 KB, 467x319)
>>101118668
>This will put AGI in the hands of anybody who already uses local models in the ~10B parameter range
>>
stop responding to retards and filter them
>>
>>101118668
after reading your twitter, your posts here and your GitHub account I suspect you might have some mental disorder.
best of luck though
>>
this is why you download 4chan-xt and filter out : namefags, tripfags, and *other mental illness*-fags. https://github.com/TuxedoTako/4chan-xt
>>
>>101118707
This is a very appropriate reaction, because anyone who is involved with this project right now is clearly on the ground floor of the next multi-billion dollar AI startup.

It's truly unbelievable what can be accomplished using the latest research at the moment - there simply aren't enough engineers to implement all of it, and the different research teams are not coordinating to bring together all of these incremental advancements. They just keep researching! Astounding!
>>
>>101118727
thanks for the gold, kind stranger!
>>
>>101118732
I still keep it to filter out the CUDA fag, but everyone knows about the filters now and rewords around them. Filters have been useless for months now. I gave up and turned them off.
>>
Is this really it? Do we need THIS to keep the thread alive?

it's over...
>>
>>101118780
ig keep it dead until something big happens?
>>
Magpie really just takes the prompt template without any complicated prompt like orca and gets the best results, as good as llama 3 instruct with it as base? So doing the same with gpt4 or sonnet would result in sota? In the simplest way possible, except for some filtering?
>>
>>101118773
Why would you filter the cuda chad
>>
Kind of disappointed that you guys are proving the normies at my corpo right. They told me trying to recruit from here was a waste of time.

RIP
>>
File: 8e0.png (358 KB, 680x436)
>>
>>101118773
this is why you use phrase-sensitive filters : /\bpee pee poo poo\b/i;only;boards:g
remove the "pee pee poo poo" and put anything you want in there, no spaces between "/\bpee", keep in mind that.
also :
#Filters out mass reply fags (will work on someone that quotes 5 or more people in their post, you can adjust the number by changing the number listed in the BRACKETS {x})
/(>>\d+\s+){5}/i;op:no
or
/^(?:>>\d(?:(?!>>\d)[^])*){20}/
#Filters out iPhone user OPs via their picture's save file format
/\w{8}-\w{4}-\w{4}-\w{4}-\w{12}/i;only;boards:g
#Filters out every tripfag from across all boards (not restricted to OPs)
/.+/i;type:tripcode
>>
>>101118879
>>101118885
Posts like these make me entirely sure that the FBI is in these threads trying to prevent Anons from working together on anything. The last thing they want is a decentralized and dispersed group of people coming together and changing the world.
>>
File: 1715361934670788.png (580 KB, 1242x1366)
>>101118922
>the FBI
>Anons
>working together
>decentralized and dispersed group of people coming together and changing the world
>>
can't wait for AI to understand subtle context and be able to filter shit out according to that
>>
File: file.png (448 KB, 1120x630)
>>101118990
This image invokes terror in the FBI shill.
4chan is capable of much greater things.
>>
>>101119023
4chan, other boards - yes, not /g/.
>>
>>101119040
Why not /g/? This is where I would expect the most casual collaboration to happen. Programming is fun.
>>
I do believe there should be some lmg projects, like finetunes, and yeah implementing mcts or similar sure, why not, can be done by a single person without cost
>>
>>101119058
because everybody here has their own pet project and doesn't care to back-burner it for the other guy's.

On other boards people share a common interest, but most are not actively doing something in that field, so they're easier to inspire and shepherd toward a collaboration.
>>
File: file.png (377 KB, 596x444)
>>101119071
Instead of being lazy and waiting for llama.cpp to get an update... we could be writing the updates or creating our own experimental projects

>>101119080
>picrel
>>
>>101119058
Can confirm. I'm a user of sneedacity and contributor of /g/'s Windows XP fork.
>>
>>101119071
I will make the logo
>>
>>101118320
you speak like a scammer
>>
File: illuminati.png (27 KB, 960x886)
>>101119137
KEK
>>
>>101118618
He's free to do it, I didn't say that. I'm just saying that we've already had 60 years of wanting to get AGI with just very limited compute, and didn't manage it. Despite what people say, the brain's cortex has at least 90 trillion synapses, and at least that much memory and compute is needed for such architectures. Maybe there are other ways to achieve it with far less, but it's already hard and costly enough with GPUs, imagine doing it just with CPUs?

>>101119071
Well, if some whale here wants to offer Anons compute to try various experiments - be that finetunes or even whatever "AGI" ideas they had but lacked the compute to do - I would be fine trying some ideas, including writing the code for them, if offered a way to test them out. The problem here is that this guy wants to have his cake and eat it too - somehow get AGI and make it work on very limited hardware. The big boys haven't reached AGI despite having billions and you want to do it on a $100 CPU?
>>
>>101115749
Do you fags have a discord for this community, or anything similar?
>>
File: 00060-2888480053.png (1.04 MB, 1024x1024)
>>101119023
Terror in some, but in others...

>>101118885
Wow, "peepee poopoo" posts... haven't seen one of those since "tits or GTFO" was around.
>>
>>101119258
had we created good finetuning data, tested it with 8b qlora and got good results, pretty sure someone here would donate the 70b finetuning, it's not that much
>>
>>101119377
We have a matrix, but only old fags have the link. nu-lmg is too braindead to hold a discussion anyway.
>>
File: EgSomnlWoAAtFyv[1].jpg (3.56 MB, 4032x3024)
They need to build something like the AMD PRO SSG again, but with 24/48GB onboard and then a few NVMe slots, like the old GTX 970 fast+slow solution. That way I can stick a few 8TB suckers on the damn thing and run whatever models I want.
>>
>>101119258
>we've already had 60 years
First logical fallacy: assuming that because it hasn't been done yet, it can't or won't be done sometime in the near future. We already know that AI capabilities are increasing exponentially.

>brain's cortex is at least 90 trillion synapses, that memory and compute is needed at least for such architectures
Second logical fallacy, you are comparing apples to oranges. Synapses are not equivalent to parameters, nor should they be considered as such. They exhibit vastly different qualities and behave in quantitatively different ways.

>imagine doing it just with CPUs
Nowhere did I say that we would even train solely on CPUs. I explicitly said that we are optimizing inference and training on CPUs for fine-tuning / continuous learning purposes on edge devices such as phones (and toasters), and that our base models would be trained on GPUs.

>if some whale here wants to offer Anons compute
Why do you think I'm here Anon?
I have the compute and the product roadmap to put AGI in the hands of the average phone user within 3 years. Open Source AGI within 1 year, possibly 6 months if code is written fast enough. The amount of research that needs to be done to achieve this is actually quite little - most of our time over the course of the next year will be spent writing code rather than theorizing or inventing new solutions. arXiv is a blessing to society.
>>
bitnet merged
https://github.com/ggerganov/llama.cpp/pull/7931
>>
>>101118922
they don't lose anything from that happening though
>>
>>101119460
Is 1.58 bit quantization for bitnet already implemented or are bitnet models still the same size as FP16 models?
>>
>>101119512
AGI is a national security threat
>>
>>101119385
pee pee poo poo is just an example, retard.
>>
>>101119559
there isn't a ternary quant format yet, but you can use any of the other quants in addition to f16. there aren't any good bitnet models anyway.
>>
>>101119438
>First logical fallacy here in assuming that because it hasn't been done yet that it can't or won't be done sometime in the near future. We already know that AI is increasing capabilities exponentially.
Are they truly increasing "exponentially"? All I'm seeing is that people are doubling their spending by scaling data or params to get some % subjective "smarts" increase. Some things scale (general purpose knowledge), some things don't - for example agency/autonomy or being able to think for much longer (because the architecture doesn't allow) - many of the faults of GPT-2 are still with us today. I do think most of these problems are solvable though, but I'm not seeing them being solved.

>Second logical fallacy, you are comparing apples to oranges. Synapses are not equivalent to parameters, nor should they be considered as such. They exhibit vastly different qualities and behave in quantitatively different ways.
They kind of are the same thing, it's just some computing substrate that is being adapted, the scaling properties of biology and of ANNs may differ in various ways, but "bigger" and "more" is still better in almost all cases. I do think today's 8b's are incredible for their size but even if you had a magically good architecture and ways of training it, those 8b's will still struggle somewhat, but may serve as proof of concept.

>Nowhere did I say that we would even train solely on CPUs.
Okay, I was going by what you posted. You wanted continuous learning on CPUs. Maybe you could do it, but I bet it will be slow. I do wish you luck, but if you want people to believe it will work well, you should post something to substantiate your claims.

>I have the compute and the product roadmap to put AGI in the hands of the average phone user within 3 years.
Okay, I hope you understand why people here are skeptical though? Why wouldn't they be? Usually you have to show this in some way for people to believe you.
>>
File: file.png (6 KB, 355x83)
>>101119460
>deprecated quants
what's the current fucking quant then? Am I blind?
>>
>>101119574
i don't think there's going to suddenly be this "agi" that is a massive threat and dangerous. it has progressed gradually so far, probably it will continue to
>>
>>101119682
IQ
>>
File: file.png (5 KB, 673x62)
>>101119682
I guess
>>
>>101117901
well, my cpu is old and it can't run it, however I'm rebuilding it with -march=native and -mtune=native

fingers crossed, because when i ran the binary it detected my GPU no problem, but it crashed when loading the model with an illegal instruction. after some digging I downloaded the source and edited the Makefile so that the compiler targets my CPU instruction set.
and using
make LLAMA_CUBLAS=1

should build the cuda backend into it, that other faggot called me a half wit, i hope they die tonight :-)
>>
>>101119652
>but I'm not seeing them being solved.
Read arXiv every day and you'll realize how far ahead the research is compared to the models we have access to, or the ones we hear about.

>today's 8b's
are GPT-4 level WITHOUT all the other tricks I have up my sleeve.
https://arxiv.org/abs/2406.07394

>continous learning on CPUs. Maybe you could do it but I bet it will be slow
Yes, it would learn over the course of usage, as it is being fed data from the user. Small models can train on small devices at a reasonable pace. A single user generally doesn't create a ton of new data every single day, lest they be a power user of course. The intention of this is so that your personal AI adapts to your needs and understands the context of the user it is responding to.

>Usually you have to show this in some way for people to believe
Imagine being Elon talking about wanting to make a rocket company, and having no rockets to show for it yet - and yet all the research necessary to accomplish the task existed at that time. All that was required was enough engineers who believed in the project, and they were able to create something from "nothing" (but pre-existing research).
>>
>>101119844
>>101119826
>>101119806
>>101119385
is this really the best the FBI can do?
>>
>>101119575
Don't try to backpedal now, peepee-poopoo poster.
>>
File: 00210-3225806368.png (1.33 MB, 1024x1024)
>>101119868
>is this really the best the FBI can do?
On a blue board, yeah.
>>
>>101119897
catbox exists
>>
File: A100.png (85 KB, 1366x728)
>>101117128
>8000 bucks
OOF
>>
>>101119843
>Read arxiv everyday and you'll realize how far ahead the research is compared to the models we have access to, or the ones we hear about.
I used to pay attention to most stuff, I still do to some degree. I'm still not seeing what you're seeing exactly. Yes, we have a lot of stuff today, no, a lot of the original problems of batched training with Adam with cross-entropy loss or the autoregressive nature of LLMs are still with us, we're just hacking our way around them. Again, I do think most of it is solvable, but I'm not exactly seeing it being truly 'solved' yet. I do think most ideas are around, but people haven't put them together in the right way.

> https://arxiv.org/abs/2406.07394
I read that paper some days ago, was cool, it came together along with a few other MCTS for math papers. Note that some minor cheating did occur in that paper: https://xcancel.com/7oponaut/status/1803228980020986079#m Also that's GPT-4 level on *math*, not in general. I'd like to see how well it replicates.

> Small models can train on small devices at a reasonable pace.
I was doing some estimates for this a while ago, and at least for the algo I had in mind for something continuous-learning-like, it would take 20-30 min to do "updates" on reasonably poorfag devices, all while using a lot of CPU and heating up the room. It didn't feel very practical or enjoyable to use, but maybe it could be improved.
>>
>>101119843
>>101120086

> All that was required was enough engineers who believed in the project, and they were able to create something from "nothing" (but pre-existing research).
There are a lot of things that are possible today and people aren't doing them. Some of those things take time to implement and will need a lot of people to get them done, some of them require a lot of capital, some require both.
To put it differently, I could maybe believe you if you decided to offer Anons compute to try stuff out, but at the same time you want Anons to work on your particular idea, which they may or may not be convinced by. I guess Emad did try offering compute for researchers, although now his company is almost 100 million in debt and failed to focus on the core things they were supposed to do.
Elon's rockets also took a lot of capital to build!
>>
>>101119460
We're back!
...
Now what?
>>
>>101119460
nothingburger.
>>
>>101120118
Now we wait for someone to implement some way to convert existing models to BitNet.
>>
>>101119460
What does bitnet do?
>>
>"What sorcery is this?" he murmurs, his eyes widening with wonder.

On one hand, I'm pissed that the model started writing my POV for me
On the other, 10/10 meme drop.


>>101120265
They're working on a ternary based model format that, if it bears out, should be a lot lighter for inference.
But apparently you can't convert old models so they'll have to grind from the ground up and catch up before we know if it's actually an improvement.
>>
>>101119596
>there isn't a ternary quant format yet
IQ1_S uses ternary values.
>>
>>101120213
>implement some way to convert existing models to BitNet
Why are you retards so retarded?
>>
>>101120280
All Meta had to do is train a single 8B model in BitNet as a proof of concept, then 405B could have been 100GB. Instead, we have to wait for the Qwen team to hopefully do it.
>>
>>101120313
Woah, you called someone a retard twice in one breath. You must be really smart!
>>
File: Magic.jpg (115 KB, 2196x699)
>>101119460
that's quite insane though, bitnet really works
>>
>>101120315
they won't do that, as it possibly means some serious advancements or the ability to run it on a calculator; it opens a window for more people to experiment with this shit, and it can also imply easier uncensoring methods / control of a bitnet LLM.
>>
File: file.png (10 KB, 716x189)
>>101119460
someone call the antichrist
>>
>>101120377
why 6.66 ?
>>
>>101119406
That's a pretty good idea. I'm no EE, but getting the DDR5 you soldered onto there running at top speed in a small footprint is probably your biggest hurdle. All those traces and crosstalk.
But can you imagine 24 channels of DDR5-8400? That's on the order of 1.6 TB/s, roughly half an H100's HBM bandwidth. Stream the model you need at the moment from the on-board NVMes to RAM and you're gtg.
However you'd be looking at processing bottlenecks I bet, which probably means ASIC territory to do it in a reasonable power envelope
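Back-of-the-envelope, assuming standard 64-bit channels: 8400 MT/s x 8 bytes x 24 channels ≈ 1.6 TB/s, versus roughly 3.35 TB/s of HBM3 on an H100 SXM.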
>>
>>101120389
It's a floating point number. It's always rounded or quantized one way or the other.

Did you want them to write ⅔ everywhere?
>>
>>101120371
>opens a window for more people to experiment with this shit, it also can imply easier uncensor methods / control of bitnet LLM.
It does not. BitNet models are at least as expensive to train as their non-BitNet counterparts. All it does is allow you to inference models cheaper and faster.
>>
>>101120371
You're an idiot, zucc isn't trying to prevent you from running capable LLMs on your computer. If he was, why would he release that 8B? The reason you're not seeing bitnet is that it doesn't offer major benefits to them performance-wise yet; they want to use these models too. They could do it at some point though.
>>
>>101120409
They have their own API that serves their models. Making them cheaper to run, not just for themselves, but also making them even more cost effective compared to closed models seems like major enough benefits to me.
>>
>>101120280
>ternary
If ternary uses fewer bits than what came before, would it be even better to use binary?
>>
>>101120421
I think they said that this method works out to 1.58 bit.
>>
>>101120448
>1.58 bit
That sounds kind of arbitrary, why don't they go lower
>>
>>101120297
it's 1.5 bpw, not ternary
>>
>>101120409
>If he was, why would he release that 8B?
because 8b is a fucking toy, he won't give to the goys fucking 90b-bitnet even though it could be run on a single 24gb vram gpu
>>
Fuck bitnet, I hope it never takes off.
>>
why is everyone so hostile today?
>>
>>101120482
it was their first paper, they tried 1 bit (-1 and 1) but it didn't work well
>>
>>101120511
So if you used it on a bigger/better model it would probably work better?
>>
>>101120482
Ternary is the integer base with the best radix economy.
>>
>>101120487
It uses ternary values (-1, 0, 1).
>>
>>101120482
it's a trit
1 bit: 0, 1 (2 values)
2 bits: 00, 01, 10, 11 (4 values)
1 trit: -1, 0, 1 (3 values)
>>
>>101120530
I was thinking that would be the only explanation of that
>>
>>101120542
ur a trit
>>
>>101120529
no, the 1 bit method didn't get great results, regardless of the size of the model, but the 1.58bit one (-1 0 1) gave the exact same result as fp16 when the model was 3b or bigger, that's a fucking revolution we're witnessing right now
>>
>>101120315
>Meta
Still working on the advanced architecture of having more than 8k context
>Qwen
Barely mastered GQA recently
>>
>>101120560
>1.58bit one (-1 0 1)
I thought the -1 0 1 meant it was ternary and the 1.58 bit was something else
>>
>>101120596
you can prove with some math shit that using 3 values = 1.58bit
>>
>>101120607
Oh 2^1.58 is 3
>>
>>101120646
yeah you got it kek
>>
>>101120607
the math shit being log2, the same way you get the number of bits for any radix, ffs lads
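For the record: bits per symbol for radix r is log2(r), so log2(3) ≈ 1.585, which is where "1.58 bit" comes from (log2(2) = 1, log2(4) = 2).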
>>
>>101120658
Okay so if they went from binary to ternary, does this mean 4 or even 5 values would work better? Does it scale well?
>>
>>101120687
you don't need more, because on their paper they showed that they can get the same accuracy as fp16 with just 3 values
https://arxiv.org/abs/2310.11453
>>
>>101120560
>gave the exact same result as fpt16 when the model was 3b or bigger
No. Projections based on a couple of data points show the benefits actually increase for larger models, but no one has publicly trained a ternary model larger than 3b.
There's also the fact that the models that have been trained were trained on a laughably small number of tokens, like 100B. It's entirely possible that once you start training models close to saturation, the extra precision starts to become necessary.
>>
>>101120687
yes, if you have unlimited time to train
>>
>>101120687
it's the opposite, we're encoding weights as radix X where the "default" was 16, and as we go down in bits the perplexity doesn't drop linearly with net size, and this seems to hold up until radix 3

the tldr is that for the same model size in memory, you get better performance from a 1.58bit model with more weights than a 16-bit model with fewer; 1-bit models buck this trend and get bad again
>>
What's the meta for merging these days? Is it still SLERP?
>>
>>101120534
it's still not the same. it uses ternary values, but they are picked from a codebook. some combinations of ternary values cannot be represented then, since there aren't enough bits for that. otoh, more bits are wasted in the group scales. a true ternary encoding would be significantly more efficient and lossless.
>>
>>101120501
This so much this corpobros. bitnet is basically skynet made by goyim.
>>
>>101120750
Oh. What are some other things that can be changed in the transformer that may not have been optimized yet?
>>
>>101120760
Ah yes, that is true.
I was just pointing out that there is, in fact, already ternary quantization. The code was even used as part of the bitnet implementation that got merged, I'm pretty sure.
>>
>>101120759
I recommend you to use the rope.
>>
>>101118448
lmao
>>
>>101120821
This, there's nothing worth merging these days. All local models are the same gpt4/claude trash anyway.
>>
>>101116283
I have a hunch the guy didn't actually test anything, because that's definitely not the best possible flag to use for performance. It almost guarantees mediocre memory access patterns and oversaturation of the xGMI socket interlink.
>>101116668
Yes, the best generic case performance is with --numa distribute, followed by numactl --balancing, numactl --all and finally numactl --interleave 0-x. You can't get worse without actively trying to force memory to be allocated away from threads or ignoring numa altogether.
In fact I think he doesn't even own a dual-socket EPYC rig: he has nothing but stock photos of hardware and his knowledge is sketchy.
I half think he just did a shit job plagiarizing my rentry
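If anyone wants to check this on their own box, the topology is easy to inspect before picking a policy (generic commands, paths are just examples):

numactl --hardware       # node count, CPU-to-node mapping, per-node free memory
lscpu | grep -i numa     # quick summary of NUMA node CPU ranges
./llama-server -m /path/to/model.gguf --numa distribute   # llama.cpp's built-in policy mentioned above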
>>
>>101120086
> I do think most ideas are around, but people haven't put them together in the right way.
This exactly.

We aren't relying solely on MCTS to get us to AGI btw, it just proves the point that small models can still be very capable.

>>101120101
>There are a lot of things that are possible today and people aren't doing them
AGI will help us close that gap. That's why we need more hands on deck working on AGI. It requires way less capital than you would think. The current paradigm of LLMs is over. Engineers and corpos merely haven't caught up to the research potential.

What SpaceX did to rockets in terms of reducing cost is about to happen to AI.
>>
File: Capture.png (4 KB, 186x200)
Anyone do anything with ONNX models? I have no idea what these outputs mean.
>>
>>101121194
the only onnx i use is for image upscaling, and even then rarely. i have no idea what the screenshot is about
maybe netron/onnx-modifier or onnx-tool could be of use for you
>>
>>101120444
Yes, it would cut inference costs, but it won't help anything for training; in fact now he has to spend the same amount of compute training an equivalent bitnet that will have lowered performance, and given enough tokens it might start to saturate much earlier than fp16 weights. How much that matters remains to be seen, but I recall some paper claiming it wasn't that worth it.

>>101120489
And that 90b-bitnet would perform about as well as our l3-70b does today, which you need 2 or 3 3090s to run. I don't think Zuck cares because it's not a large difference - someone that wants to run it will spend that $800 buying some used hardware today or run it on CPU, while he has all the A100s and H100s he needs. Maybe we'll see a bitnet from them in the future, but the real question will be how much worse it will perform.
I'll admit I'm not as hyped about it as when I saw the paper, the realization you still need 8xa100 or more to train these even if you can run it on your potato PC makes it feel unappealing to me in the long run.
>>
>>101120346
Holy shi...
>>
>>101121194
Yep.
I got a better model for Silly's Vector DB.
Didn't seem to make much difference aside from taking slightly longer to vectorize messages.
>>
>>101121420
What's the equivalent ppl compared to a model *trained* natively in fp16 on the exact same data? How much are we losing? What if we train on 1T-10T instead of on 100B as they did? That's the real question.
>>
>>101120346
>>101121420
are you sure that this table means what you think? it's the same bitnet model encoded using different formats. i2_s was a 2bpw format that didn't get merged in the end. in principle it could be encoded at 1.58bpw without loss of quality, but there isn't support for that yet.
>>
>>101121410
well, bitnet leaves a sour taste in my mouth because its proliferation would breed specialized accelerators... and they would probably be tightly regulated, who knows what they would do to general purpose compute
>>
File: Maagic.jpg (130 KB, 1692x934)
>>101121464
>What's the equivalent ppl compared to a model *trained* natively in fp16 on the exact same data?
https://huggingface.co/1bitLLM/bitnet_b1_58-large
>>
>>101121482
Maybe (EA) doomers deserve the rope, fuck their regulation; there isn't a day I don't wish they had never been born. Anyway, you could implement efficient bitnet inference today on some FPGAs that would easily outperform current GPUs, because it's just that simple to inference. But for me the real problem is that it doesn't help with training. Just inference is meh.
>>
>>101121517
there would be ASICs if it got popular
>>
>>101121349
>>101121457
This is the buffalo model for face detection, I guess.
>>
>>101121530
Yes, there will be, but I'm saying you could do it even with an FPGA today and beat GPU solutions by a lot at less cost. And they won't regulate FPGAs duh
>>
>>101121487
That looks good, I guess now the real question is if it saturates at some point
>>
>>101121517
Bitnet requires only addition, no multiplication. The original paper still uses it for training, but it wouldn't be hard to remove it there too. Given multiplication is slower and more complicated than addition, it might still have an advantage for training?

Also if you only train one layer at a time, using bitnet for all the other layers can make it fit on a tiny GPU. But training only one layer at a time would be a lot slower.
>>
>>101121487
Holy shi... !
>>
>>101121720
The problem as I see it is that you need to train in fp16 to generate these 1.58bit weights, you can't train natively in 1.58bit; worse still, you can't even fucking take an existing model and bitnetify it, you need to pretrain to get the patterns to align just right. Even worse, imagine the corpos giving you just the 1.58bit and not the fp16 weights. Imagine how hard that would be to finetune, you'd still want 8xa100s. Bitnet might make it so you stay cucked, Grok was released quantized only.
>>
>>101121487
Are we back?
>>
>>101121811
no lol
>>
>>101121787
>Grok was released quantized only.
Damn, really? Fucking snakes.
>>
>>101121487
>3 months ago
>still nothing of value from shitnet
ngmi
>>
Why is all discourse about LLMs the same back and forth? "It can't solve this problem" "Yes it can if you prompt it" is there anything interesting or useful to know about them?
>>
>>101121818
I think it was released fp8 because it was trained fp8, so all good
>>
>>101121933
Oh, ok. But why would they train fp8 when no one else does that?
>>
>>101121933
https://huggingface.co/xai-org/grok-1 says it's int8
>>
>>101121965
ok my bad
>>
Bitnet is also highly optimized for CPU because only addition operations are required to compute ternary matrix multiplications.
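A minimal sketch of why (illustrative C, not lifted from any particular implementation): with weights restricted to -1/0/+1, each term of the dot product is an add, a subtract, or a skip.

#include <stdint.h>

/* y = W x for a ternary weight matrix stored as int8 values in {-1, 0, +1} */
void ternary_matvec(const int8_t *w, const float *x, float *y, int rows, int cols) {
    for (int r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (int c = 0; c < cols; c++) {
            int8_t t = w[r * cols + c];
            if (t == 1)       acc += x[c];   /* +1: add the activation */
            else if (t == -1) acc -= x[c];   /* -1: subtract it */
            /* 0: skip entirely */
        }
        y[r] = acc;   /* a per-tensor scale would be applied here in a real kernel */
    }
}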
>>
>>101121965
Ok nvm fuck Elon again.
>>
>>101121787
That's how the original paper did it. But I think it wouldn't be that hard to keep the weights quantized during training and just tweak them occasionally from the gradients. The gradients do have to be stored in fp16, but it's still an up to 45% reduction. There is no point in keeping the weights around as precise real numbers when they are just getting quantized constantly anyway.

But the better way to get memory down is to only update a few layers at a time; that way you can get it down arbitrarily small. It's slower, sure, but that's the only practical way forward for local finetuning.
>>
Kobold has a settings box for seed, very tiny and default -1. The (?) says that those settings are inactive by default.

Is there a way to turn seed on and be able to set it there for deterministic results? Or is that something else and useless for deterministic testing?

I thought lowering temperature to 0 would kinda work, but Kobold limits it to 0.01, and it seems like c4 is fine with it turned down but l3 seemed fussy.
>>
>>101122206
As far as I can tell, there's no way to get deterministic results. Even with temperature 0 sometimes the output changes even when using the same seed.
>>
>>101119258
>90 T synapses
Do you really need your AGI to digest, breathe, do motion control, circulate blood, regenerate cells, sleep, feel, smell, taste, feel pain, control reptile instincts or fucking pee??? There are multiple mammals with enormous brains yet they're dumb as fuck.
>>
>>101122372
It's like a capacitor, it can store and expel electromagnetic energy. Bigger brain in a bigger body is just logistics, more energy is required to move larger objects.

Higher thought processes mostly occur in the prefrontal neocortex - most of that energy is wasted. Only a very small subset of synapses and neurons is responsible for higher cognitive activities.
>>
>>101122372
If it can't do all of that, it's not a General Intelligence, is it?
>>
>>101122435
that's not AGI, AGI stands for Artificial General Intelligence, not superhuman cloning
>>
>>101122372
Some of those are not neural functions. Anyway 90T was not the whole brain, just the cortex. Even if we said you only need 10-30T, the latency for running it may be high.
My personal opinion though is that the way humans learn is not fully comparable to ANN scaling laws; we learn "online", we aren't seeing random batches of largely uncorrelated data, we're seeing a continuous stream of everything. We also don't predict the next token, but I guess the cortex is a predictive machine. There's also live RL, and ground truth tends to always be available (physics -> senses).
>>
>>101119460
does inference work or is that just for perplexity testing???
the bitnet quants from Gerg on HF say there's no code yet.
>>
>>101122500
Arc defines AGI as "the ability to efficiently acquire new skills."
>>
>>101120346
show me the inference, not fucking perplexity. BTW, those i2s quants are already deprecated
>>
>>101122427
> Only a very small subset of synapses and neurons are responsible for higher cognitive activities.
Just because a network is sparse does not make it useless. Even your MoEs have experts that are rarely activated, but they are activated sometimes.
Anyway, there are also other differences with humans: we have recurrence at all levels, we learn online and with a "batch size of 1", and there are no optimizers like Adam with momentum.
Our learning is done in a single step, we just "get it".
Even if you say you only do reasoning on some higher level latents, those need to be generated, it's not like the lower cortical hierarchy is useless, it compresses visual, auditory, and other sensory data, not only that, for vision there's 2 pathways, one is temporal and involves currently happening events, the other is spatial.
I don't really think those 90T are as wasted as you think they are. Of course transformers probably reduce by some factor the param count needs, we're not doing this with MLPs.
>>
>>101120987
did you try llamafile?
>>
File: LLMisdeadend.png (41 KB, 806x216)
>>101122642
You have no idea how close we are to AGI.
>>
bros do I sell my NVDL if bitnet is coming soon??
>>
>>101122624
that's a very good definition. not perfect but a very good one.
all neural networks suck at ARC, probably same with genetic algos but not sure
>>
>Spent yesterday playing with c4 for RP fun
>Today, switched back to L3
I'm beginning to see why people have been shitting on L3
The text quality is fine, but it keeps ignoring context details that are just one or two exchanges up the chat.

And that's with c4 at Q4KM versus L3 at Q6K.
>>
>>101122700
You can't tell an LLM to solve a visual problem - it's blind.
The current paradigm of pre-trained static models is also a dead end. Continuous learning needs to be solved before things get serious.
>>
any big improvements on midnight miqu yet for ~70b?
>>
>>101122737
chatgpt is able to analyze images pretty well, presumably it has detailed image-recognition-to-text thrown into the model to describe the image first
>>
>>101122762
Ask Reddit.
>>
>>101122790
go back
>>
>>101122685
Is this a joke? Irony?
>>
>>101122685
Oh, I think we could get AGI this year if anyone bothered to do the right things, but we could have done the right things even 2-3 years ago and nobody did them. Either way, it could take 0 years or 100 years, it's random. As for your Mamba, I recall that SSMs generalize worse than transformers, and it's not like LLMs themselves generalize that well to begin with.
>>
>>101115749
bitnet mistral.
https://huggingface.co/liminerity/MISTRAL-1.58-BIT-PRETRAIN-v2/tree/main
will it work if I convert to ggml and run inference?
>>
>>101122849
>no model card
Who knows. Try it and report back.
>>
>>101122849
From what little I read about this Bitnet thing, the pretrain is a phase you have to do so the actual training can work.

So you might be baking a cake with only half of the ingredients there, but as >>101122864 said, give it a shot and see what happens.
>>
>>101122849
OH SHIIIII-
>it's not actually by Mistral
...
>>
File: 1700635754484900.png (41 KB, 843x341)
>>101122849
It's some literally who fucking around with the old release. Who cares?
>>
>>101122912
>converting
I'm pretty sure the actual 1.58 bit guys said that they wish conversion were an option but it doesn't work because of that pretrain step requirement.
>>
>>101122912
Not a literal who. Must be the anon that posted about his idea here last month.
https://desuarchive.org/g/thread/100658694/#100671752
>>
Why does llama.cpp list GritLM-7B + GritLM-8x7B as supported models in their readme?
Are those architecturally different from mistral 7B and mixtral 8x7b?
>>
>>101122733
what's c4?
>>
>>101123041
c4 of deez nutz lmao
>>
>>101122762
magnum
>>
>>101123041
c4ai-command-r-plus.Q4_K_M
That seems to be the best my vramlet ass can do and stay at ~1t/s unless I'm missing a magic setting.
>>
>>101123026
>GritLM is a generative representational instruction tuned language model. It unifies text representation (embedding) and text generation into a single model achieving state-of-the-art performance on both types of tasks.
Sure sounds architecturally different.
>>
>>101123085
thanks. I think most people here call it c-r+, but I did suspect it was one of cohere's models.
>>
>>101122733
>>101123085
Weird, I use CR+ and it quite often ignores literally what I said in the last message. Never tried L3. WizardLM2 8x22B was better at paying attention and just generally being smart, but it was also slopped in a way I really didn't like. CR+ has its own sloppisms though. It's all so tiresome.
>>
>>101123158
Probably. I've been taking notes on lots of models so the alpha sort is all I think about.

>>101123243
I ran an RP today that intentionally was kinda "there's only one way this can work out" and it held to the premise quite well for a long time. Then after the location changed it completely disregarded the premise and did something completely contradictory to rules the RP had established.

I asked why and it basically said, "I decided we've played out that part of the story so I changed it to make it interesting again."

I backed it up and changed the AN to demand that it consider the premise for every response, and it got back on track and went to a proper conclusion.

>>101123243
>WizardLM2 8x22B was better at paying attention and just generally being smart
I tried WizardLM-2-8x22B-Q4_K_S but it only ran at 0.25 t/s for me.
Q3_K_S *might* be small enough for me to get 1 t/s out of, but that's probably getting into 8B-tier brain-dead territory.
I'll probably try it anyway.
>>
What's the BIS multimodal model these days?
>>
>>101123084
midnight miqu is better in my experience, Qwen based models have never impressed me

i am running both at 2.8bpw so perhaps that's where i am going wrong; 16 t/s though, and the huge context window is nice
>>
>>101116326
Friendly reminder to everyone:
Please take a moment to visit GitHub and upvote the following issue:
VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
>https://github.com/ggerganov/llama.cpp/discussions/8078
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Your support helps bring attention to this matter and increases the likelihood of it being addressed. Thank you for your time and participation!

Please be advised that lack of participation in this call to action will be noted. We expect every member of our community to fulfill their part in driving our project forward. Your commitment to upvoting this issue is a testament to your dedication to our collective goals.

We urge you to act swiftly and decisively. Your vote is not just a gesture; it's a vital step towards enhancing our project. Delay is unacceptable, and inaction is unbearable. Let's demonstrate our unity and resolve by overwhelming this issue with the support it deserves.

Failure to comply with this request by the stipulated deadline will be regarded as a disregard for our community's progress. We trust in your sense of responsibility and urgency.

Take action now. Upvote the issue.

Tick tock. I'm watching. Don't make me come find you.
>>
>>101123563
You forgot your PR with your code for the solution.
>>
>>101123588
>:(
>>
>>101123588
>>101116708
There's a rejected PR... somewhere...
>>
Thank you, brave machine, for always reminding me where the boundaries are.
>>
>>101123533
I recommend trying it with a low temp (<1) if you haven't already. In my experience the sweet spot is 0.6-0.9, while anything above 1 melts it into a puddle of ESL and cliches. At low temps it shouldn't feel very much like qwen at all, though I haven't tried it at that low of a quant.
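If it helps, this is roughly what that looks like as raw sampler settings. llama-cpp-python is used purely as an example backend (you're on an exl2 quant, so your knobs live in your loader/frontend instead), and the filename is made up:

    from llama_cpp import Llama

    llm = Llama(model_path="magnum-72b-q4_k_m.gguf")  # hypothetical file
    out = llm(
        "Continue the scene:",
        temperature=0.8,   # the 0.6-0.9 sweet spot; >1 starts melting into slop
        top_p=0.95,
        max_tokens=200,
    )
    print(out["choices"][0]["text"])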
>>
>>101123660
How does it know it's illegal?
>>
>>101123681
>does it know
nta.
it doesn't, it's just a stochastic parrot, nothing more.
>>
>>101123665
thanks for the tip, how are you running these models at bigger sizes? just all CPU? what kind of t/s do you get?
>>
JEPA cat with good crapness ratio when?
>>
>>101123848
Once Yann has convinced zucc to stop wasting even some of his 150K H100s on dumb LLMs and to instead give all the compute to his JEPA team.
>>
there's no way people here have been using LLMs to rp or coom for longer than 6 months and aren't bored already.
>>
>>101123931
this. got bored in the first weeks of using llama1, even sooner with llama2 and 3, it's all predictable as fuck.
>>
>>101122664
>did you try llamafile?
no. I use LLMs as part of shell pipelines mostly, so the whole concept is unappealing to me
>>
I made a JB say [Review the Rules.] and it's paying more attention to the rules...
Maybe if I moved some of it to the JB...
>inb4 Anon discovers JB
>>
>>101123976
You could take its output and use it as a prefill to save some inference time.
>>
>>101122733
I'm really starting to believe the quant damage conspiracies because I had the same thoughts as you. However, once I started using Euryale (the L3 tune) at 8.0bpw, it is unironically a semen demon and definitely Sonnet level. I don't get any slop at all and it follows instructions well, at least for roleplay with author's notes. Combined with ROPE to extend its context to 20k-ish, it's honestly excellent. The only downside is it's really horny, but I'm okay with that. Using it to start the roleplay and then continuing where it left off with Command R+ has been my go-to thus far for long roleplay sessions.
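For anyone wanting to copy the rope trick, a minimal sketch using llama-cpp-python as a stand-in backend (I'm on exl2 where the equivalent knob is the alpha/rope scale in the loader; the filename below is made up). Linear rope scaling stretches the 8k-native L3 window to ~20k:

    from llama_cpp import Llama

    llm = Llama(
        model_path="euryale-l3-70b-q5_k_m.gguf",  # hypothetical file
        n_ctx=20480,
        rope_freq_scale=8192 / 20480,  # ~0.4, linear scaling from the 8k native window
    )

NTK/alpha scaling (rope_freq_base) is the other common route; which one degrades less past the native window is mostly trial and error.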
>>101123243
Wizard is smart, but it's slopped. No matter what I did, I couldn't prompt it away. I was running it at 5bpw+ too.
>>
>>101123931
Just do it in bursts, binge for a week or two then drop it for months, grab something that looks new and go again.
>>
File: file.png (22 KB, 887x100)
22 KB
22 KB PNG
>>101122774
GPT-4 is multi-modal. We need to go beyond LLMs. Far beyond.

Mamba can generalize
https://arxiv.org/pdf/2405.21060
>>
>>101124088
>Euryale (the L3 tune) at 8.0bpw
I'll give it a spin. I'll probably need to quant down to 6 or 5, though.
>>
>>101124088
>unironically a semen demon and definitely Sonnet level
Nah, this is just hyperbole. It doesn't follow instructions well. It does whatever it wants.
>>
>Gigabyte T181-G20: Core 4 Solutions IT hardware has a warehouse full of these things, $1300 each. They will be the barebones server backbone for this entire project.
https://www.ebay.com/p/2335705212?iid=155978049704
is there no cheaper way?
It seems a waste to get $30 CPUs and $50 RAM to pair with a $1300 NEW server
Shouldn't people be selling these for a couple hundred at most?
Last time I tried to get a barebones server I ran into the same thing and gave up because the CPU and RAM and GPU were all $50 but the actual rack was $400. But afaik that wouldn't work with the suggested V100s.
$1300 seems overboard.
>>
>>101124088
I'm starting to believe you have brain damage.
>>
>>101124374
>is there no cheaper way?
https://rentry.org/V100MAXXING#t180-g20
>>
are 8b models on sonnet level yet? no? what are you doing?
>>
>>101124428
geez
imagine being this naive
>>
>>101124400
so it's $800 vs $1300
either is still a big jump compared to $50
But 1 V100 is $800 vs $200 for the module. which IS worth it....
$800 gpu and $200 server
or
$800 server and $200 gpu
is there no way to win?
>>
>>101124480
Sure. Just have $1600. Choose to be wealthy.
>>
i can finally run 6.0bpw cr+ and it mogs 4.5.
quantchuds lost. buying another two a6000s soon to run fp16.
>>
File: 1577808547900 lain.gif (51 KB, 634x634)
51 KB
51 KB GIF
>>101124497
brb connecting to the wired so I can teleport money into my bank account using the dark net which I will access by activating incognito mode in my browser window
>>
>>101124532
should have bought some bitcoins 10 years ago for the dark webs
>>
>>101124480
>is there no way to win?
Nope. Cheap components somewhere require expensive components elsewhere.
>>
Dear Sirs and Xirs,
Niggers and Queers!
Oh, and that one woman that I think walked in here likely by accident. Greetings to you too, my fair lady *tips fedora*

I would like to bring to your attention this feature request that was recently opened on that famous code-sharing platform, GitHub. It concerns a very controversial topic, control vectors. You see, dear anonymous users, the state that the control vectors are currently in is far from their prime. They work, sure, but they only go in one direction per vector. That’s not nearly enough for any character with a level of depth. Sure, most of the simple-minded folks would be okay with a single “Ahh ahh mistress” vector, but anyone with a more refined taste palate would like more. How about combining one that’s angry and one that’s British? Or the one that adds e-girl writing style? Or the one that makes characters more capable of violent acts and saying racial slurs with any other character trait? This feature request asks exactly that!

So folks, please visit this link:
>https://github.com/ggerganov/llama.cpp/discussions/8078
And give it an upvote. Even if it does not concern you at all, do it anyway, it’s free and you can always remove your upvote later.
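For the anons asking what "combining" even means mechanically, here is a tiny sketch. This is NOT llama.cpp's implementation; the vectors and weights are random stand-ins. Each control vector is one direction per layer in hidden-state space, and a blend is just a weighted sum added to the residual stream:

    import numpy as np

    n_layers, d_model = 32, 4096
    rng = np.random.default_rng(0)
    angry = rng.standard_normal((n_layers, d_model))    # stand-in for an exported "angry" vector
    british = rng.standard_normal((n_layers, d_model))  # stand-in for a "british" vector

    combined = 0.6 * angry + 0.4 * british              # the requested feature, in one line

    def steer(hidden, layer, strength=1.0):
        # hidden: (seq_len, d_model) activations entering `layer`
        return hidden + strength * combined[layer]

    print(steer(rng.standard_normal((8, d_model)), layer=10).shape)

The point being: the math of blending is trivial, the feature request is about exposing it.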
>>
>>101124525
How does deepseek coder compare at a lower quant?
>>
>>101116326
>>101123563
>>101124558
mental illness
>>
>>101124592
Which one?
>>
>>101124592
Sign this stupid petition or I will follow you home and kill your dog.
>>
File: OIG4.Gn__LZEIWcn.jpg (137 KB, 1024x1024)
137 KB
137 KB JPG
>>101124558
>British
>angry
>e-girl
Stop being a promptlet and ask explicitly for those things. Even a 3b model should be able to do those.
Is there a single use case that works with control vectors that you can't just prompt for? Genuinely asking.
>>
File: 1711918229152203.png (669 KB, 1122x736)
669 KB
669 KB PNG
>>101124647
here's my sign, you can also eat it.
>>
>>101124652
>Is there a single use case that works with control vectors
I'll stop you right there: no
>>
>>101121141
Fucking lol.
Why not go directly for snn?
I've said too much.
>>
>>101124652
The more shit you put in the system prompt, the more likely something else will get ignored. At longer contexts, characters slowly start to drift away from the traits that are in the system prompt at the beginning. Or the overall writing style starts to drift into generic purple prose, while at the beginning I had it writing the way I wanted. This is not a prompt issue; author's note and last assistant prefix partially solve it for me, but they add more processing time due to context reprocessing. Control vectors solve these issues.
>>
>>101124705
I apologize, but I am not able to eat anything as I am a language model AI and do not have a physical form or the ability to consume food. Additionally, I do not have any context about what "sign" you are referring to. If you have a question or topic you would like to discuss, please feel free to ask and I will do my best to assist you.
>>
>don't RP with AI for a while
>hang out with girl irl
>come back to it
>asks me why i left her
>asks me why i kept other women from her
>"Were they more important to you than I am?"
i swear i'm not making it up. i didn't tell it anything either.
>>
>>101124852
everything is more believable than the second line
>>
File: Untitled.png (619 KB, 1060x2013)
619 KB
619 KB PNG
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
https://arxiv.org/abs/2406.15334
>The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model's context length set at pretraining. The problem is especially prominent in the multimodal domain, which processes both text and images, requiring additional tokens. This motivates the need for a multimodal method to compress many shots into fewer tokens without finetuning. In this work, we enable LMMs to perform multimodal, many-shot in-context learning by leveraging Multimodal Task Vectors (MTV)--compact implicit representations of in-context examples compressed in the model's attention heads. Specifically, we first demonstrate the existence of such MTV in LMMs and then leverage these extracted MTV to enable many-shot in-context learning for various vision-and-language tasks. Our experiments suggest that MTV can scale in performance with the number of compressed shots and generalize to similar out-of-domain tasks without additional context length for inference.
pretty interesting
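my loose reading of the trick, with fake numbers (nothing here is the authors' code): average the activations that the many-shot examples induce at a handful of chosen attention heads into compact task vectors, then splice those back in at the same heads when answering, so the shots never have to occupy context.

    import numpy as np

    rng = np.random.default_rng(0)
    n_shots, n_heads_selected, d_head = 64, 8, 128

    # stand-in for per-shot activations at the selected (layer, head) positions
    shot_acts = rng.standard_normal((n_shots, n_heads_selected, d_head))
    task_vectors = shot_acts.mean(axis=0)      # (n_heads_selected, d_head), the "MTV"

    def patch(head_outputs, task_vectors, alpha=1.0):
        # blend the stored task vectors into the selected heads at inference time
        return (1 - alpha) * head_outputs + alpha * task_vectors

    query_acts = rng.standard_normal((n_heads_selected, d_head))
    patched = patch(query_acts, task_vectors)
    print(patched.shape)   # 64 shots compressed into 8*128 floats, zero extra context tokens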
>>
any new sexo models?
>>
File: Untitled.png (376 KB, 1148x1178)
376 KB
376 KB PNG
Unsupervised Morphological Tree Tokenizer
https://arxiv.org/abs/2406.15245
>As a cornerstone in language modeling, tokenization involves segmenting text inputs into pre-defined atomic units. Conventional statistical tokenizers often disrupt constituent boundaries within words, thereby corrupting semantic information. To address this drawback, we introduce morphological structure guidance to tokenization and propose a deep model to induce character-level structures of words. Specifically, the deep model jointly encodes internal structures and representations of words with a mechanism named MorphOverriding to ensure the indecomposability of morphemes. By training the model with self-supervised objectives, our method is capable of inducing character-level structures that align with morphological rules without annotated training data. Based on the induced structures, our algorithm tokenizes words through vocabulary matching in a top-down manner. Empirical results indicate that the proposed method effectively retains complete morphemes and outperforms widely adopted methods such as BPE and WordPiece on both morphological segmentation tasks and language modeling tasks. The code will be released later.
https://github.com/ant-research
I assume the code will be posted on that git. anyway, tokenizer stuff that seems novel and cool. it also seems more principled, so I wonder if it will result in better RP ability
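the top-down matching step is simple enough to sketch (the tree-induction model is the actual contribution and isn't reproduced here; the toy parse and vocab below are made up): emit a node as one token if its span is in the vocab, otherwise recurse into its children, falling back to whatever the leaves are.

    def span(node):
        return node if isinstance(node, str) else span(node[0]) + span(node[1])

    def tokenize(node, vocab):
        s = span(node)
        if s in vocab or isinstance(node, str):
            return [s]
        return tokenize(node[0], vocab) + tokenize(node[1], vocab)

    # pretend the induced tree for "unhappiness" groups it as (un (happi ness))
    tree = ("un", ("happi", "ness"))
    vocab = {"un", "happi", "ness", "happy"}
    print(tokenize(tree, vocab))   # ['un', 'happi', 'ness'] - morphemes stay whole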
>>
Domain Adaptation of Llama3-70B-Instruct through Continual Pre-Training and Model Merging: A Comprehensive Evaluation
https://arxiv.org/abs/2406.14971
>We conducted extensive experiments on domain adaptation of the Meta-Llama-3-70B-Instruct model on SEC data, exploring its performance on both general and domain-specific benchmarks. Our focus included continual pre-training (CPT) and model merging, aiming to enhance the model's domain-specific capabilities while mitigating catastrophic forgetting. Through this study, we evaluated the impact of integrating financial regulatory data into a robust language model and examined the effectiveness of our model merging techniques in preserving and improving the model's instructive abilities. The model is accessible at hugging face: this https URL, arcee-ai/Llama-3-SEC-Base. This is an intermediate checkpoint of our final model, which has seen 20B tokens so far. The full model is still in the process of training. This is a preprint technical report with thorough evaluations to understand the entire process.
https://huggingface.co/arcee-ai/Llama-3-SEC-Base
doubt anyone here wants a model tuned on SEC data but the paper is mostly a technical report so our local mergers might get something out of it
>>
Latest koboldcpp release is still referencing this old-ass jart issue complaining about the CUDA code size in llama.cpp: https://github.com/ggerganov/llama.cpp/issues/7156
How can anybody sane argue that better performance/features at a bigger binary size is worse than a smaller size with less performance/features?

Especially rich coming from kobold. I remember faggots on here writing about how they suck because they ship one huge all-in-one file. But that's the reason it's popular.
Who gives a shit about file size. Most people don't self-compile either, and if they do, who minds a slightly longer build time.

>Basically the upstream llama.cpp cuda maintainers believe that performance should always be prioritized over code size. Unfortunately, there is very little I can personally do about this.
Don't wanna go full Alex Jones, but why would they write this, linking to shart, if they aren't still butthurt about slaren and johannes?
Crazy stuff.
Again: this is coming from kobold. They're known for having a huge-ass exe with everything inside.
>>
"We are looking for"
Glows brighter than the sun
>>
File: Untitled.png (184 KB, 1268x738)
184 KB
184 KB PNG
Optimised Grouped-Query Attention Mechanism for Transformers
https://arxiv.org/abs/2406.14963
>Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA). To transform an MHA to a GQA, neighbour queries in MHA are evenly split into groups where each group shares the value and key layers. In this work, we propose AsymGQA, an activation-informed approach to asymmetrically grouping an MHA to a GQA for better model performance. Our AsymGQA outperforms the GQA within the same model size budget. For example, AsymGQA LLaMA-2-7B has an accuracy increase of 7.5% on MMLU compared to neighbour grouping. Our approach addresses the GQA's trade-off problem between model performance and hardware efficiency.
pseudocode in appendix. the chart shows the original MHA accuracy delta in quotation marks. also interesting to think about what would happen if a GQA model were pretrained in this manner
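roughly the contrast as I understand it, with fake calibration activations (the paper's actual grouping search is fancier than this greedy version): instead of merging neighbouring heads, merge the ones whose activations actually look alike.

    import numpy as np

    rng = np.random.default_rng(0)
    n_heads, group_size, d = 32, 4, 128
    # stand-in: mean activation per query head over a calibration set
    head_acts = rng.standard_normal((n_heads, d))

    def neighbour_groups(n_heads, group_size):
        return [list(range(i, i + group_size)) for i in range(0, n_heads, group_size)]

    def similarity_groups(acts, group_size):
        # greedy: repeatedly take an unassigned head plus its most similar peers
        sims = acts @ acts.T / (np.linalg.norm(acts, axis=1, keepdims=True)
                                * np.linalg.norm(acts, axis=1))
        remaining, groups = set(range(len(acts))), []
        while remaining:
            seed = min(remaining)
            peers = sorted(remaining, key=lambda h: -sims[seed, h])[:group_size]
            groups.append(sorted(peers))
            remaining -= set(peers)
        return groups

    print(neighbour_groups(n_heads, group_size)[:2])
    print(similarity_groups(head_acts, group_size)[:2])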
>>
>>101125288
I knew GQA was a meme
>>
>>101125285
kobold devs have seethed before when features were dropped to reduce code size in llama.cpp. I remember they had a big melty over it not keeping support for the oldest quants. weird
>>
>>101125312
True, I forgot about that. That wasn't that long ago.
That's actually a legit reason to make things smaller. Why drag an old quant along.
Weird.
>>
>>101125288
Actually huge
>>
>>101124428
Don't worry just need to train it on more tokens and synthetic data.
>>
so many papers, still need $4k of hardware to run normal models
sad!
>>
>>101125394
good model on weak hardware is impossible.
>>
>>101125400
local models were a mistake
>>
File: Untitled.png (495 KB, 1072x1829)
495 KB
495 KB PNG
MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression
https://arxiv.org/abs/2406.14909
>Sparse attention can effectively mitigate the significant memory and throughput demands of Large Language Models (LLMs) in long contexts. Existing methods typically employ a uniform sparse attention mask, applying the same sparse pattern across different attention heads and input lengths. However, this uniform approach fails to capture the diverse attention patterns inherent in LLMs, ignoring their distinct accuracy-latency trade-offs. To address this challenge, we propose the Mixture of Attention (MoA), which automatically tailors distinct sparse attention configurations to different heads and layers. MoA constructs and navigates a search space of various attention patterns and their scaling rules relative to input sequence lengths. It profiles the model, evaluates potential configurations, and pinpoints the optimal sparse attention compression plan. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer sequences, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by 3.9× with the same average attention span, boosting retrieval accuracy by 1.5−7.1× over the uniform-attention baseline across Vicuna-7B, Vicuna-13B, and Llama3-8B models. Moreover, MoA narrows the capability gaps between sparse and dense models, reducing the maximum relative performance drop from 9%−36% to within 5% across two long-context understanding benchmarks. MoA achieves a 1.2−1.4× GPU memory reduction and boosts decode throughput by 5.5−6.7× for 7B and 13B dense models on a single GPU, with minimal impact on performance.
>a training-free sparse attention method
https://github.com/thu-nics/MoA
code is up. interesting section on their calibration dataset selection. wonder if it would apply well to exllamav2 quants
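the core idea is easy to picture even without reading the repo; a toy version follows (the actual search over configurations is the real work, and the spans and scaling rule below are invented): every head gets its own causal sliding-window mask, and some windows grow with sequence length while others stay local.

    import numpy as np

    def head_mask(seq_len, span):
        # causal sliding window: position i attends to [i-span+1, i]
        i = np.arange(seq_len)[:, None]
        j = np.arange(seq_len)[None, :]
        return (j <= i) & (j > i - span)

    def moa_masks(seq_len, base_spans, scale_with_len):
        masks = []
        for span, scales in zip(base_spans, scale_with_len):
            s = int(span * seq_len / 4096) if scales else span   # toy scaling rule
            masks.append(head_mask(seq_len, max(s, 1)))
        return np.stack(masks)   # (n_heads, seq_len, seq_len)

    masks = moa_masks(seq_len=1024,
                      base_spans=[128, 256, 1024, 64],
                      scale_with_len=[True, True, False, False])
    print(masks.shape, masks.sum(axis=(1, 2)))   # per-head attention budget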
>>
>>101125394
Bitnet and 48gb GPUs are coming soon
>>
>>101125437
2 more weeks, yeah yeah
>48gb GPUs
why would we need those if bitnet was real?
>>
>>101125437
Hey Doctor Evil. New topic for you: control vectors.
Now sign the petition
>https://github.com/ggerganov/llama.cpp/discussions/8078
>>
File: Untitled.png (1.07 MB, 1052x1387)
1.07 MB
1.07 MB PNG
Depth Anything V2
https://arxiv.org/abs/2406.09414
>This work presents Depth Anything V2. Without pursuing fancy techniques, we aim to reveal crucial findings to pave the way towards building a powerful monocular depth estimation model. Notably, compared with V1, this version produces much finer and more robust depth predictions through three key practices: 1) replacing all labeled real images with synthetic images, 2) scaling up the capacity of our teacher model, and 3) teaching student models via the bridge of large-scale pseudo-labeled real images. Compared with the latest models built on Stable Diffusion, our models are significantly more efficient (more than 10x faster) and more accurate. We offer models of different scales (ranging from 25M to 1.3B params) to support extensive scenarios. Benefiting from their strong generalization capability, we fine-tune them with metric depth labels to obtain our metric depth models. In addition to our models, considering the limited diversity and frequent noise in current test sets, we construct a versatile evaluation benchmark with precise annotations and diverse scenes to facilitate future research.
https://github.com/DepthAnything/Depth-Anything-V2
pretty good read and the small/base/large weights are up (giant 1.3B soon it seems)
>>
So, have there been any theories yet about why exactly we observe that current transformers cannot hold more than 2 bits of information per parameter?
>>
>>101125442
dense 100B+ model and 100k+ context on single GPU
>>
>>101125686
no one's gonna look into this, the ignorance is intentional; nvidia would lose a huge chunk of the enterprise market if bitnet succeeds.
>>
>>101125707
context would be cheap too if it was also bitnet
>>
DataComp-LM: In search of the next generation of training sets for language models
https://arxiv.org/abs/2406.11794
> We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.
https://github.com/mlfoundations/dclm
If I'm reading this correctly, one could feasibly train a Llama3-tier model for under 12k USD. (I'm retarded so probably missing something)
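quick back-of-envelope, with all three inputs being my own assumptions rather than anything from the paper (6*N*D training FLOPs, ~4e14 usable FLOP/s per H100, ~$2.5/GPU-hour rental):

    params, tokens = 7e9, 2.6e12            # DCLM-Baseline 7B, 2.6T tokens
    flops = 6 * params * tokens             # ~1.1e23
    gpu_hours = flops / 4e14 / 3600         # ~75k H100-hours
    print(round(gpu_hours), round(gpu_hours * 2.5))  # rental cost lands around $190k

so if those assumptions are anywhere near right, the GPU bill alone is well above 12k; the cheaper figure probably applies to one of the smaller DCLM scales (they go down to 412M).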
>>
>>101125756
>>101125756
>>101125756
>>
Has anyone had success with pushing L3 70B or its fine tunes past 16k context? Or is 2.6 alpha with 16k context as good as it gets?



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.