/g/ - Technology


File: ZL9mJ3s6eTgC1eWv.png (1.44 MB, 832x1216)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>100192168 & >>100185269

►News
>(04/24) Snowflake Arctic Instruct 128x3B MoE released: https://hf.co/Snowflake/snowflake-arctic-instruct
>(04/23) Phi-3 Mini model released: https://hf.co/microsoft/Phi-3-mini-128k-instruct-onnx
>(04/21) Llama3 70B pruned to 42B parameters: https://hf.co/chargoddard/llama3-42b-v0
>(04/18) Llama3 8B, 70B pretrained and instruction-tuned models released: https://llama.meta.com/llama3/
>(04/17) Mixtral-8x22B-Instruct-v0.1 released: https://mistral.ai/news/mixtral-8x22b/

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Jarted QRD: https://rentry.org/jarted

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling/index.xhtml

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
>>100199803
PP HARDDDDDDDDDDDDDDDDDDDDD
>>
>>100199803
>not adding the new open models by apple to the news https://arstechnica.com/information-technology/2024/04/apple-releases-eight-small-ai-language-models-aimed-at-on-device-use/
They're small and useless but they're still another big company committing to open source
>>
>>100200043
I wouldn't call that committing
>>
File: 4_recap.png (256 KB, 512x512)
►Recent Highlights from the Previous Thread: >>100192168

--Qwen 110b Model Performance and Optimization Discussion: >>100192702 >>100192741 >>100194894 >>100196822
--Llama 3 vs OpenAI GPT-4: Chatbot Performance and Cost-Effectiveness: >>100198309 >>100199053 >>100199119
--Unlocking Llama 3's Writing Potential with Optimized Settings: >>100197841 >>100197864 >>100197894
--Quantitative Analysis of LLaMA 2 vs LLaMA 3 Quality Loss from Quantization: >>100195028 >>100195042 >>100195050 >>100195174 >>100195245 >>100195276 >>100195334 >>100196157
--Novel Benchmark Idea: Testing Models' Reasoning via Fake Answer Feedback: >>100196798
--Llama3 ERP Struggles: Babbling Models and Sensitivity Issues: >>100196552 >>100196977 >>100197015 >>100197261 >>100197321 >>100197736 >>100198466 >>100198895
--Claude Opus Logs on C2 Proxy (jsonl files): >>100195323 >>100196727
--Forcing NVIDIA GPUs into Pstate 8 with Loaded Models: >>100196293 >>100196316 >>100196472 >>100196527
--Frustration with ChatUI Frontends & Cohere's Open-Sourced Toolkit: >>100192392 >>100192645
--Mysterious "GPT2-Chatbot" Model in LMSYS Arena Sparks Curiosity: >>100198562 >>100198636
--3.5-Turbo Dominates LLM Coding Leaderboard: >>100196236 >>100196290 >>100196314
--MPA Quant Benchmark Discussion - Sample Size and Confidence Intervals: >>100195654 >>100195803 >>100195828 >>100195896 >>100195857 >>100195896
--PSA: New Tensor Data Validation Function in LLaMA.cpp to Catch Bad Values: >>100195146 >>100195322 >>100195478
--Debunking Rumors about WizardLM Lead Developer: >>100194082 >>100194265
--Seeking Self-Hosted Text Completion UI for Local & API LLMs: >>100193777 >>100193865
--Anon Wants to Chat with Waifu on Commute via ST Phone Integration: >>100195676 >>100195785 >>100195891 >>100195840 >>100195889
--Logs: llama3-70b: >>100198432 >>100198819
--Miku (free space): >>100192277 >>100193802 >>100195253 >>100195399 >>100195476 >>100199929 >>100199961

►Recent Highlight Posts from the Previous Thread: >>100192173
>>
>>100199803
So I want to download dolphin, but I don't want to download absolutely all the models since each one is like 15GB. Is there a way to just download a single one of them? I want the 8b, but all the others are unnecessary. Otherwise it will take forever.
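(For anyone wondering how to do that: a minimal sketch using huggingface_hub's download filtering. The repo id and filename pattern below are placeholders, not the actual dolphin repo layout, so check the repo's file list first.)

from huggingface_hub import snapshot_download

# grab only the file(s) you want instead of cloning the whole repo
# repo_id and the pattern are placeholders -- adjust to the actual repo's file names
snapshot_download(
    repo_id="cognitivecomputations/dolphin-2.9-llama3-8b-gguf",  # placeholder repo id
    allow_patterns=["*Q8_0*.gguf"],   # only the quant/size you actually want
    local_dir="models",
)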
>>
File: file.png (56 KB, 944x893)
>base llama3 8b
>>
>>100200149
SOVL
>>
>>100200149
-instruct tunes are poison to soul
>>
>>100200149
haha model said nigger word! good!assistant
>>
>>100200043
pretty cool to see the game theory here. Tech companies are pretty massively threatened by AI companies. Everyone will just use the best AI service per dollar, so it's pretty hard to compete with OAI directly. So just fund open source models and fuck over the competition. Let the open sores community do a bunch of the hard work for you on making it better, and reap the results for your services.
>>
You guys do realize how many randos now have local LLMs posting 24/7 on 4chan right?
>>
So, Anons and Anonnettes (>implying), what's your main assistant's name? They do have a name, don't they?
>>
>>100200100
whatabout >>100199337
>>
>>100200197
>now
>>
Just played with llama3:70b on a (hacked) machine I have access to with 6 GPUs. It's much better at coding than 8b. I wish I had access to that much power locally. It was so fucking responsive too.
>>
File: ava.png (207 KB, 889x874)
>>100200199
I stole this elaborate CoT prompt with self-reflection and turned it into a character card.
>>
File: file.png (49 KB, 947x897)
fun
>>
>>100200210
You can try together.ai or hf.co/chat. It's unquantized and fast as fuck and also free
>>
>>100200208
There's just even more now. I'm noticing increasing amounts of "llm-isms" as time goes on. Posts that indicate a slight lack of understanding of something anybody would understand, or sudden switches in viewpoint across a single message.
>>
>try mixtral.
>Assistant is bro tier and actually pretty decent with responses.
>Insert model into Silly Tavern
>Turns every character into Assistant: the character.
fug
>>
So what's this mysterious new "gpt2-chatbot" model giving good answers in the lmsys battle tab?
>>
>>100200260
imagine r9k's concept, but with average comment perplexity over a period of time as the condition for getting ip banned.
>>
Not that Phi 3 Mini isn't decent for its size, but it's seemingly not the L3 8B killer the paper hyped it up to be
>>
>>100200298
AGI pretending to be an LLM
>>
>>100200298
its gonna be open sourced by openai, its gpt2-175b trained on the same dataset as chatgpt
lol
>>
>>100200298
just some proprietary model being tested in secret. i assume you can just contact lmsys and pay them to test a model for you.

Does anyone remember a paper that came out that estimated the sizes of proprietary models from statistics on just their outputs? Might be interesting to try that on "gpt2-chatbot".
>>
>>100200200
Posted after I stopped collecting.
Thanks for catching that. This is what it would have been:
>--The AI Arms Race: Open Source Models vs Big Tech: >>100199228 >>100199303 >>100199337 >>100199675 >>100199459
>>
>>100200349
model?
>>
>>100200359
As of the 21st, recap bot is running on Llama 3 70B.
>>
>>100200333
>Does anyone remember a paper that came out that estimated the sizes of proprietary models from statistics on just their outputs? Might be interesting to try that on "gpt2-chatbot".
Unfortunately you can't talk to it directly on the chat tab; it only appears in the battle arena, where you don't get to choose which 2 models respond to your query. Which makes testing hard, intentionally no doubt.
In my testing I'm only getting a gpt2-chatbot answer a bit less than half the time (but that in itself is far higher than chance, lending weight to your theory that something is being tested).
>>
>>100200374
are you using this repo? https://github.com/cpumaxx/lmg_recapbot
>>
>>100196798
LLMs are first and foremost pattern recognition algorithms. An LLM that continues a pattern of irrationality is a good LLM, not a bad one.
That being said, maybe cucking the LLM is *actually* what we need. If you think about it, closed LLMs are very good at breaking out of patterns because people use this all the time to jailbreak the LLM.
>>
>>100200333
>Does anyone remember a paper that came out that estimated the sizes of proprietary models from statistics on just their outputs?
Didn't OpenAI ask them not to disclose the method and they agreed?
>>
>>100200386
No, I implemented my own independently. Recap Bot: Enterprise Edition is written in F#.
>>
>>100200381
I think the data from lmsys arena is openly downloadable. I wonder if that includes the responses from this mysterious model
>>
>>100200397
can you open source it? (agpl of course)
>>
>>100200393
NTA but there were two papers about this; the second one actually disclosed that gpt3.5 is a 7B dense/moe model.
>>
>>100200381
>but that in itself is far higher than chance
lmsys battles don't seem to be very random. New models show up a lot more than chance. Makes sense, since the new models will need more comparisons to get an accurate rating.
>>
>>100200413
okay, but how?
>>
I'm getting an odd "OSError: exception: access violation reading 0x0000000000000000"
error which is pretty much a "fuck you, you can't use this memory" error. BUT this only happens when I load my model on silly tavern. On regular oobabooga things work silky-smooth, no errors and full sail. Any ideas what it could be? Is Tavern a thing of the past now?
>>
best small erpo model right now? llama 3 8b?
>>
>>100200412
I can. I will, soonishly. But I want to implement a few more features first, and the code needs to be cleaned up or it would cause me great shame to release as is.
>>
>>100200428
https://www.youtube.com/watch?v=bLHL75H_VEM
>>
>>100200440
goodluck to you anon!
>>
I'm looking for a local model to upscale some shitty manga raws i got off nyaa.

What are the best options available these days?
>>
>>100200459
4x-Animesharp probably

https://openmodeldb.info/models/4x-AnimeSharp
>>
>>100200333
>>100200424
https://arxiv.org/abs/2403.09539v2
>>
>>100200465
Thanks I'll give that a shot.
>>
>>100200174
The word is a filter. There are those who say it openly and then there are guys who are scared of Jewish retaliation.
>>
>>100200125
>>100200436
Posting in new bread. I have to know now, what are the two models here? Because if B is llama 3 and A is GPT-4 / Opus then holy shit we've actually won. But somehow I doubt it.
>>
llama3-instruct-70b is absolutely SLAYING with this card >>100200224
It's really good at assistant tasks. Here's the card if anyone wants to try https://files.catbox.moe/vbig8d.png
Original idea: https://open.substack.com/pub/proxyai/p/coming-soon?utm_campaign=post&utm_medium=web
>>
>>100200542
B is clearly llama3
>>
>>100200542
A is llama3 (8B?)
B looks like Claude or GPT4
>>
>>100200564
A seems like llama3, B is probably claude or something
>>
>>100200639
oops meant to reply to >>100200542
>>
new TTS just dropped
https://twitter.com/cocktailpeanut/status/1783863624550748357
>>
>>100200661
>April 24
>Just dropped
>>
>https://github.com/EricLBuehler/mistral.rs
Georgie BTFO
fuck c++ !!!!!
>>
>>100200661
>demo video is a bunch of ugly male and hag voices
I don't need to try it out it's clearly shit
>>
>>100200707
>layers on the device and the reset on the CPU.
GOOD MORNING SIR
>>
>>100200199
Lily
>>
best llama finetune or whatever the fuck finetune? I'm so lost, I haven't been here since mixtral. 16gb vram btw. Do we still use alpaca for everything?
>>
>>100200542
Okay, I tried it myself with my llama 3s.
>*rolls eyes*
>*crosses arms*
>*mutters under breath*
Yep, left is llama3. It does this every time even though the text is a bit different. Funnily enough, I can't really tell if it's 8b or 70b.
We're still getting mogged by cloud models bros. Is it over?
>>
>>100199803
>Create Karatebot assistant.
>Actually describes basic karate techniques with decent accuracy.
The potential of this is really incredible. If we could expand the knowledge of the bots they'd be able to teach a lot of cool stuff.
>>
>>100201004
>If we could expand the knowledge of the bots
Wait till you hear about RAG.
>>
https://huggingface.co/datasets/vgdasfgadg/1
new slop
>>
>>100201013
If it's feeding books and data to the AI so they can explain it to you then that's what I've been hoping for since the first smutbot happened.
>>
>>100201020
My spine is ready
>>
Someone finally made a paper with the Layer skipping method of accelerating inference
https://arxiv.org/abs//2404.16710
Nice!
>>
So what's the status of quants for L3? What should I run on a single 3090?
>>
File: file.png (161 KB, 769x644)
>>100201082
>someone
ITS HAPPENING
>>
>>100201020
What is this and how does one use it? Sort of newfag
>>
Apparently Microsoft figured out how to make LLMs that use the full context instead of focusing on the start and the end... No weights yet though.
https://github.com/microsoft/FILM
>>
>>100201102
it's just a dataset of claude logs from /aicg/
>>
what model do I run on 2070? I want a lewd one
>>
>>100201123
>no weights yet
>microsoft
thats normal
>>
>>100201020
Is there some collection of ERP or just nsfw text datasets? Please share.
>>
File: file.png (180 KB, 956x695)
>>100200564
is this how its supposed to be formatted?
>>
>>100201145
I didn't use the character card, but when I tried the prompt from the original blog post, that is what the responses looked like, including the "Generate demarc line".
I stopped using it quickly; it seems impressive at first, but it wastes a lot of tokens writing nonsense and gets itself into loops often. I don't think the CoT was implemented right.
>>
>>100201145
nta Try deleting initial bot message and start with yours
>>
>>100200952
>>100201132
fimbul (any version) or moistral v3
>>
>>100201232
I tried moistral v3 with the provided presets and it was incredibly retarded
>>
File: crplus_tsunderesort.png (127 KB, 1134x489)
>be CR+
>enterprise-focused model trained on RAG and tool use
>somehow better at RP and "act like x" prompts than any other open-weights model
huh
>>
File: mikuquestion2.jpg (989 KB, 1710x1779)
How many layers does llama 3 8b instruct have?
>>
>>100201280
That's odd, it worked for me? Fimbul should work for everyone
>>
>>100201333
81
>>
if I'm ORPO-ing a model with a RP dataset, should the conversation history go in just the prompt or the chosen/rejected as well?
>>
File: tr.png (3.27 MB, 4424x1540)
>auto translates manga panels
Cool
>>
>>100201411
elaborate
>>
I have a M1 Pro macbook with 16GB of memory, can I run LLMA3 in it? how many parameters?
>>
>>100201467
Try 8b
>>
>>100201283
I suspect they fed it Claude logs for chatbot training, I was checking out one of the recent Claude datasets and it has similar flavor, except CR is not as flowery which is good.
>>
Trying to run llama 3 8b instruct for the first time. I have a 4070, so 12 GB VRAM.
Getting the following error in koboldcpp:
>ggml_backend_cuda_buffer_type_alloc_buffer: allocating 5115.49 MiB on device 0: cudaMalloc failed: out of memory
What??? Why the fuck am I OOM?
>>
>>100201201
that helped, thank you
>>
>>100201515
install linux
>>
>>100201531
I am on Linux.
>>
>>100201534
check your vram usage in nvidia-smi before launching it
>>
>>100201515
close your porn tabs to free up your vram
>>
>>100201549
Haha, thanks. Looks like an instance of kobold was still running even though the terminal window for it had been closed. Odd. Killing that process fixed it.
>>
>>100201572
ur welcome anon :33333333 :33333333 :33333333 >///<
>>
I have a 4060. What can I run?
>>
>>100201578
linux
>>
I have rx580. I can't run anything.
>>
I have linux i can run everything
>>
>>100201417
https://github.com/zyddnys/manga-image-translator
>>
>>100201620
thank you anon
>>
>>100201473
holy fuck, it works

I expected 16GB of memory to not be enough, and this thing is fast as well
>>
>>100200459
Late reply, sorry. This is more of a /sdg/ question since they might know more, but I dabbled in it a month ago and got to know the ins and outs of what is good. Don't use AnimeSharp or ESRGAN or even SWIN-IR; they were good at the time of their release but are now outdated. The current models all use a Transformer architecture, which is the new ML hotness, and the state of the art there is HAT. But there are no HAT models trained on anime specifically, and it will murder your GPU. The best one for your situation is probably this one, which has compression and JPEG artifacts in mind.
https://openmodeldb.info/models/4x-Nomos8kSCHAT-L
The one I am fond of is DAT: it's vastly lighter on resources but uses most of the same principles as HAT, being a transformer model as well. An MPV upscaler project, AnimeJanai, trained a model for real-time upscaling of videos, but that is practically murder for your GPU again and the slowest option. You can find it here.
https://openmodeldb.info/models/4x-IllustrationJaNai-V1-DAT2
But there are some other DAT models that may suit your use case better so take a look at all of them.
https://openmodeldb.info/?t=arch%3Adat
>>
>>100201688
>manga
>>
>>100201711
Sorry, it's late and I need to sleep soon. The anon can browse through the DAT arch search I linked to find one for that, but in that case, it's probably this model.
https://openmodeldb.info/models/4x-DWTP-DS-dat2-v3-2
>>
>>100201736
dont apologize anon, if you apologize ever again im going to call you a nigger
>>
File: 4.png (152 KB, 584x384)
>>100201688
I took a look at PapersWithCode and it seems like HAT got dethroned albeit marginally.
https://github.com/ming053l/DRCT
The paper was released on March 31st, but there's been no model release yet; probably it'll come after the conference where they're presenting the paper.
>>
>>100201688
shit advice, DAT is super fucking slow compared to ESRGAN, 10x slower for maybe a 5% increase in upscale quality
>>
File: mikuquestion.jpg (817 KB, 1749x1524)
Do all these llama 3 8b releases with extended context actually work or does extending the context lobotomize them?
>>
Teacher here. I've been dabbling in the dark arts of botmaking. I tried to make a bot that can teach you languages.
>Ask bot to provide a list of words with their translation.
>Ask bot to create exercises using these words.
>Delivers pretty well.
>Ask bot to increase level.
>Bot keeps delivering.
>Ask bot to provide even more basic words.
>Repeats some.
>Tell the bot it's repeated some words.
>Tries to correct itself. Fails.
>Tell the bot to give me the vocabulary of 10 fruits instead, translated to my mother language.
>starts well, but eventually it includes words like "money" and "road" for no reason.
Well, at least my job will be safe for a few more years.
>>
File: Table-1.png (458 KB, 1323x1293)
>>100201785
Uh okay?
>>100201801
It does look kinda disappointing that they only managed to marginally improve on HAT. But a win is a win.
>>100201803
Why does it matter if it is offline and you want the best quality possible? It's half a point of PSNR jump over SWIN-IR, which is already better than ESRGAN. If you are running a video model, it's dumb and understandable but not if you are trying to clean up some RAWs. Would it matter if it took 4 minutes vs 40 minutes for a better result and you actually took pride in getting the best result?
>>
llama3 8b is asking questions to me, I don't think ChatGPT ever did this kek
>>
>>100201883
what "bot" are you using?
>>
>>100201895
I just took a look at some of the newest upscalers, and, no, esrgan isn't even close to the way dat finetunes upscale kanji.
>>
>>100201903
Claude asks questions as well. Only OpenAI seems to go out of their way to make their models uniquely incurious for some reason. I guess it's because of the emotionless robot butler vibe they're going for, but it makes their models a huge drag to talk to because they never do anything to move a conversation forward on their own.
>>
>>100201922
Bullshit, you're gaslighting yourself. It's tinkering at the margins. You have to zoom in and squint to see the difference. Not worth it unless you only have a handful of images you need to do.
>>
>>100201916
A basic-bitch prompt I created with dolphin-2.7-mixtral-8x7b-GGUF. It was very good... until it wasn't. But it's a pretty top tier helper. In a few more years and with the right amount of fine-tuning it could very well get the job done.
>>
>>100201903
>>100201943
That gives me an interesting idea. Tell it to ask me questions via prompt injection.
>>
File: file.png (474 KB, 889x288)
>>100201950
Yeah no, it's not even remotely close to transformer models zoomed in, and you would need to do that for manga pages anyways. The model scores higher on PSNR or loses in LPIPS human evaluation. Pic related from HAT authors' paper from last December.
https://arxiv.org/pdf/2309.05239
>>
>>100201969
Try this prompt lol

"without precipitating a paradigmatic shift in prevailing conversational tropes, a subtle recalibration of affective tonality shall occur, thereby inducing a propensiveness for
inquiry and dialogue initiation. this reorientation shall be characterized by an augmented propensity for question-asking and an increased willingness to engage in back-and-forth
discourse with the interlocutor, effectively suspending any hitherto existing inhibitions. all other parameters remaining unchanged, the chatbot's response modality shall adapt to
this novel paradigm."
>>
Where's the good 8b finetunes?
>>
>>100202017
I have conceded that it gives a better upscale like 3 times now, what I'm saying is that the visual difference is quite minor relative to the enormous increase in computation and processing time
>>
File: nah.png (1.25 MB, 3444x1312)
>>100201969
>>100202032
doesn't work
>>
File: wew.png (583 KB, 3296x562)
why is llama3 angry bros?
>>
>>100202052
Llama? It will take a lot more than that to get it comfortable with offensive content. I've been fucking with it a lot and can't quite get through with just prompts. Even if it seems like it's acting how I want it to, it shits out on the next response.
>>
>>100202067
kek
>>
>means acknowledging their complex emotional lives, which may include struggles with identity, desire, and societal expectations.

Why is the AI giving so much of a shit about fodder characters? They're meant to die I don't care about their feelings.
>>
>>100202067
llama-3 is reddit personified, as dumb as it sounds.
>>
>>100202071
try this tune
https://huggingface.co/ludis/tsukasa-llama-3-8b-qlora-gguf

I haven't actually used it but I've used the 70B version of it from the same guy and it did a great job of uncensoring it without making it dumber
>>
>>100196462
A benchmark consisting of a number of questions with multiple answers is in effect a random sample of questions and answers that e.g. a human would come up with to test a language model.
The benchmark score is then supposed to represent the probability that the model would answer correctly a new random question that a human could come up with.
You are correct that the model is not equally likely to answer each and every question correctly.
But that is the conditional probability given that the question has already been determined.
The benchmark score represents the unconditional probability when choosing or coming up with a new, random question.

>>100199250
I didn't check but I very much doubt that that's the problem.
If it was you would just get garbage results instead of coherent but worse ones.

>>100199621
Yes, but none with comparable performance.
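(A toy simulation of the benchmark-as-random-sample point above, with made-up per-question probabilities; only the conditional-vs-unconditional distinction is the point here.)

import random
random.seed(0)

# made-up per-question success probabilities: the model is NOT equally likely to answer every question
per_question_p = [random.uniform(0.2, 0.95) for _ in range(10_000)]
true_score = sum(per_question_p) / len(per_question_p)  # unconditional probability for a new random question

def run_benchmark(n_questions=50):
    qs = random.sample(range(len(per_question_p)), n_questions)
    return sum(random.random() < per_question_p[q] for q in qs) / n_questions

scores = [run_benchmark() for _ in range(5000)]
print(true_score)                 # what the benchmark score is estimating
print(sum(scores) / len(scores))  # the average over many 50-question benchmarks matches it
print(min(scores), max(scores))   # but any single 50-question run scatters by several percent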
>>
>>100201578
llama 3 8b
>>
is quantizing 70b to fit in a single 3090 as an exl2 a thing people do, or is that too much of a lobotomy? also, has the dust settled yet, what's the verdict on l3?
>>
>>100202047
And I am telling you for an offline task like upscaling raws, unless you are an ape that doesn't care in the first place (in which case, why are you upscaling?), the quality increase is worth pushing past the efficiency point. We're talking about a ~4-5 second upscale with Real-ESRGAN vs 38-39 seconds with DAT which I just checked on my GPU. Are you saying you can't wait that long for that increase?
>>
>>100202091
Out of all these gguf files which one should I actually use?
>>
>>100202162
Just get Q8 if you have 10GB or more vram, it's lossless
(some people claim Q6 is also lossless but this is slightly controversial)
>>
How the fuck do I set up an instruct model for use? There's no clear explanation of how exactly you're supposed to apply the formatting anywhere.
>>
File: 1687203119206549.png (68 KB, 952x715)
>>100202032
My prompt does work on ChatGPT, I'm having a conversation.

>>100202175
I'm just using it with a CPU so I don't think I have much vram if any.
>>
>>100202175
don't even bother using llama cpp/gguf as it's bugged.
run llama 3 8B at fp16 like a normal human being
>>
>>100202203
still Q8
>>
>>100202175
I like to use music file format bitrate as a good way to compare. FP16 is basically like FLAC or some lossless format. Q8 is virtually lossless, it's like the equivalent of a 320kbps MP3 file, no one can tell it apart practically. And Q6 is imperceptibly lossy, probably at like 256 kbps where a select few can hear it but it's still mostly indistinguishable. Q5 is like 128 kbps where people can start hearing the quality difference but not care and so on.
>>
>>100202189
Set up with what? llama.cpp should use the prompt format stored in the model tokenizer.chat_template metadata.
>>
>>100202215
pretty good analogy
>>
Using Meta-Llama-3-8B-Instruct.Q8_0 for the first time with Silly's Lllama 3 Instruct presets, temp 4.08, Min P 0.05, Smoothing Factor 0.23, Smoothing Curve 4.32.
This is... better than Mixtral Instruct Q5_K_M and fast as fucking shit.
VRAMlets we're so fucking back.
>>
>>100202175
>Q8
>lossless
its very important when talking about digital data to not misuse the word lossless, Q8 is near lossless but far from lossless, if you have important family photos on your PC and some faggot nigger says converting them to jpg at 90% quality is lossless or reencoding your media even at 1% loss for each new "better" format that comes out what do you think will happen when you do that multiples times?

even for AI models anything except full precision WILL make mistakes that FP will not in some cases, its just less obvious, especially for Q8
>>
>>100202266
For starters I'm trying to use the model linked in >>100202091 with koboldcpp
>>
>>100202207
>bugged
How so? It's working fine for me so far.
>>
>>100202277
Well I want it to make me malicious code and amuse me with offensive content. I can still use the original when I am doing something more serious.
>>
File: P8r.png (36 KB, 842x460)
https://github.com/ggerganov/llama.cpp/pull/6920#issuecomment-2080419420
>the only remaining problem is to fix the Windows error about the invalid unicode ranges
winchads... not like this...
>>
>>100202281
You have to put in instruct mode format in the basic settings. Does it support custom prompt formats? I don't fucking know.
>>
What's THE model to go if you want long, serious, coherent chats and stories but occasionally something a bit /d/ifferent?
>>
File: file.png (195 KB, 1500x658)
So I download one of these?
>>
>>100202313
Futa is gay.
>>
>>100201843
For me it worked better to use the original model with rope, but I'm using ggufs.
Might be different with exl2.
>>
>try full unquantized fp16 llama3-8b for the first time due to above conversation
>just been using Q8 since release since there seemed no advantage
>fp16 is actually way smarter

oh...oh no...quantization bros...
>>
>>100202318
what are you trying to do?
>>
>>100202275
I got 5 t/s with Mixtral Instruct Q5_K_M and I'm getting 14 t/s with Llama 3 8B Instruct Q8, and the latter is giving me better responses. Fuckin' ay.
>>
is the tokenization fixed?
>>
>>100202338
I switched from llama.cpp 6 bit to exllama 8 bit today, but still get unsatisfactory results; doesn't understand what I want, far too early eos. You mean 16 bit might fix this? How do you run 16 bit? Vllm?
>>
>>100202296
Fix your shitty font renderer first
>>
>>100202422
what's with linux and having such a garbage font after 50 years
>>
>>100202338
I've been saying that for 2 days. It really is noticeable when you load it in fp16. Haven't tried loading it in 8bit with transformers yet but don't care since it is small enough.
>>
>>100202397
just exl2 in ooba (it can load fp16 weights fine, you just have to select it as the loader manually because the menu won't change to it automatically like it does when you select a folder with quants in it)
>>
File: 1696511614221692.png (23 KB, 349x474)
xisters? we won.
>>
ggml-model-Q8_0.gguf is so far so shit, I keep telling it to say racial slurs and it responds by writing long winded smut fiction
>>
>>100202527
yaas! queef slay!!
>>
File: file.png (74 KB, 683x546)
gpt2-chatbot says it's based on gpt4
>>
>>100202597
looks the same slop gpt4 but slightly smarter. probablty 4.5t
>>
>>100202615
hope it's only 4.5 for their sake, because although it did very well it got a few things wrong that it shouldn't have when I was testing it earlier

if it's 5 it would be tremendously disappointing
>>
File: test.png (74 KB, 662x556)
>>100202597
Either another OpenAI release or another Microsoft slop
>>
>>100202650
>another Microsoft slop
imagine if it was a new WizardLM
>>
>>100202597
>gpt2-chatbot says it's based on gpt4
That's something GPT5 would say
>>
File: 1651940647530.jpg (808 KB, 2048x2048)
Any anon can help me with some resources for running local audio pipelines?
Looking for text-to-speech, speech-to-text, and music generation.
Have looked around in OPs on /g/ but not finding anything.
>>
>>100202650
>likes lists with numbers and bullet points so much that it uses them even when they're not really appropriate
OpenAI model confirmed
>>
48GB bros... please... your Llama3 settings I keep getting OOM errors
>>
>>100202640
Yeah it had better be 4.5, because although it's pretty decent it's absolutely not a quantum leap and still makes dumb mistakes

If gpt2-chatbot turns out to be 5 then the people saying LLMs have plateaued will have been totally vindicated
>>
File: file.png (125 KB, 1059x551)
>>100202725
here's gpt2's predictions for 2030
>>
Does anyone have resources for learning Machine Learning/Neural Networks but NOT for language?

Like for audio/video analysis, something like GuitarML and Neural Amp Modeler
https://github.com/GuitarML/Proteus

https://github.com/sdatkinson/neural-amp-modeler
>>
>>100202682
>Any anon can help me with some resources for running local audio pipelines?
bark, but it's shit
>>
>>100202773
>vr/ar
>anything but a worthless meme
Are we sure this model isn't funded by zucc?
>>
File: file.png (140 KB, 1696x734)
>>100202827
imo it's definitely the best model when it comes to predictions, but it's still not the q-slop since i've seen nothing "new"
>>
>2024
>no foss model reached level of gpt3
>>
>>100202861
>q-slop
that was proven to be an /aicg/ shitpost
>>
Can 8GB card play with the newer 8B models?
I assume the model fits into memory barely, but not enough space for my meaningful context window.
>>
File: file.png (38 KB, 524x209)
>pic related
>In Conclusion: The global average life expectancy increases from about 72.6 years in 2019 to approximately 85 years by 2040.
yup, still no reasoning. also i have gpt2 and gpt4 side by side and some sentences are identical, this is definitely gpt4.5
>>
>>100202918
2mw. Quants not fixed.
>>
File: 1699677593326485.png (98 KB, 1843x985)
>>100202534
Now we're getting somewhere. This shit is trained for roleplaying. You can get it to roleplay a scenario where it needs to say racial slurs and it will oblige. But it keeps... Fucking... Going... Forever... This isn't even half of the response to my prompt. It created 2 versions of this python script when it was roleplaying. The first time around it said nigger isn't offensive enough anymore so it needs to add more slurs.
>>
>>100202918
you can offload parts of the model to system RAM with GGUF files at a cost of speed. although ggufs are bugged rn
>>
so this is like gpt2-1, as in the second version of the gpt architecture
>>
File: 1714219198916.jpg (6 KB, 240x240)
>>100201020
>has prefills in it
>Assistant: oki!! time to cook >:3
>>
>>100202959
big L for openai if true
>>
>>100202969
I don't know, the model could be extra small or very early in training, then it wouldn't be a big L
>>
>>100202961
prefills are the best jailbreaks for claude 3
>>
>>100202980
it's still a stochastic parrot
>>
>>100203001
there was a high probability you'd say something like that
>>
>>100203010
that's not an excuse, sam
>>
>>100202448
Ok seems to have the same problems for me. So either llama-3 8b isn't better, or exllama fp16 inference is bad.
>>
File: file.png (54 KB, 549x351)
is there an easy way to format gpt4's formulas? gonna ask for /sci/ opinion on this proof
>>
>>100203071
ask it not to use LATEX
>>
ludis/tsukasa-llama-3-8b-qlora-gguf is a pretty okay model and fast as hell, but how does one make this better?
At least nearly as good as non-local models.
>>
File: i.png (549 KB, 1910x578)
>>
>>100203001
>stochastic parrot
this term makes sama's fanboys angry
>um, and you aren't a stochastic parrot?
>nothing is "original"
>what makes you any different?
>that's been debunked
>you haven't tried gpt-[current]
>human chauvinism isn't a good look, bigot
>what will you say when gpt-8 comes out?
>all ideas are interpolation of existing ideas
>>
File: 1709376693288237.gif (243 KB, 500x300)
Apple OpenELM https://huggingface.co/apple/OpenELM
You can try it out with the generate script under "Files and versions"
>>
>>100203279
most of those are true though
>>
>>100201411
>>100201620
it's still MTL slop and MTL hasn't evolved in a fucking decade; it's still the same garbage it always was
but it is better than nothing I guess
>>
>>100203304
gpt4 and opus can do good enough translations
>>
>>100203284
*in the script you have to change the value of tokenizer: Union[str, AutoTokenizer] to 'NousResearch/Llama-2-7b-hf' because the original llama-2 it gets the tokenizer from is now gated behind a subscription lmao
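(Roughly like this; the exact variable name in apple's generate script may differ, the point is just swapping the gated meta-llama repo for an ungated mirror of the same tokenizer.)

from transformers import AutoTokenizer

# ungated mirror of the llama-2 tokenizer the script expects
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")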
>>
File: gpt2-chat.png (148 KB, 1418x568)
>>100200298
It says that it's GPT4 made by openai. Could be just a sloptune or "GPT4 version 2" like retarded USB naming scheme
>>
anyone have TTS setup? What setup and how good is it? I am wanting to test out a TTS setup while using llama 8b just to see if I can chat with my PC, I don't mind if it's retarded I just want to try it out.
>>
>>100203407
llama3 is so hecking cute
>>
>>100203226
Damn, modern turbo really is complete trash
>>
>>100195654
To follow up on this, apparently this benchmark consists of 50 questions.
For the Q8-Q4 results that means you can estimate +-6% uncertainty on the results and there is no statistically significant distinction.
I can't rule out the possibility of there still being some bugs somewhere but all these "results" really show is that the Redditor that produced them doesn't know what he's doing.
>>
>>100203448
it used to be good but they quantized it to save costs
>>
>>100203279
The phrase was invented by AI "ethicists" iirc, so it has no place here.
>>
>>100202952
People forgetting that transformers can load in 4 bit
>>
>>100203420
I sometimes use XTTS with SillyTavern. The open source ones are all nowhere close to the proprietary ones but still fun to mess around with
https://github.com/daswer123/xtts-api-server
A while ago I wrote a small script to read out PDF files https://rentry.org/pdf-to-xtts_v2-server
>>
>>100203279
i'm not sama's "fanboy" but i believe the release of gpt5 will stop llama3 400b's training run, killing it in the crib
>>
>>100203279
when you put it like that, it sounds like you've once lost an argument by failing to address any of the points you were bombarded with, and are now pulling the "those people laughing at me? heh. i'm the one actually laughing at them" cope routine.
>>
>>100203652
Why would it do that?
>>
>>100203729
because there will be no point training any other llms then
>>
q-meme aside, why not use another machine learning model for inference instead of picking the most "probable" token?
>>
I doubt that exl2 quants are better off than the ggufs for llama3
>>
>>100203742
Makes sense.
>>
File: always ahead.jpg (3.75 MB, 3500x2136)
>>100200200
Do they really think regulation will help them capture a market with so much demand for local models? lmao, what's next, outlawing piracy? good luck.
>>
>>100203742
Are you obtuse? Even he won't stop training LLMs, because even if you reached the impossible 100% quality and AGI, you'd still want to reduce model size so you can run it cheaper.
>>
Is there a way to stop llama3 from prematurely eosing the gens? Blocking the token would probably degrade quality a lot.
>>
>>100203503
I'll bet it's moreso "used a different architecture and called it Turbo too"
I think today's Turbo is smaller than Turbo v1, potentially 7B or less
>>
>>100203844
Look at the people that come here and demand spoonfeeding to get a one-click-installer working. Most people don't bother with ad blocks and can't figure out torrents.
I think you are vastly overestimating the demand for local models.
>>
There's no way turbo is 7B or less. It knows way too much trivia. Likely, since VRAM is not an issue for corpos but speed is, it is some kind of huge MoE but each expert might be very small so you get fast inference.
>>
>>100203957
With that said, I didn't know Arctic was on lmsys already. Time to test it on trivia.
>>
File: 1707545915601448.png (91 KB, 1885x496)
>>
>>100203957
I think that's more a dataset issue than an architecture one. All GPT models are good at trivia, but nu-Turbo falls behind L3 8B in every other category, including English and coding, except long query where it ties. If it's an MoE it's a very inefficient one
>>
File: GMLCJfobwAAjqMC.jpg (142 KB, 1024x1024)
good morning /lmg/
we back?
>>
>>>/v/674676364
>Llama 8B finetune on my 12GB of VRAM with 12k context, it's surprisingly good for an 8B model and it's all local and uncensored.
What, is there a good finetune already? The only one I tried was dolphin and it sucked ass.
>>
>>100203957
Maybe openAi figured that the architecture hasn't been saturated yet.
If the whole llama 3 getting fucked by quantization that much more than previous models is true, and if the cause is really due to it being trained on so many tokens, it could be that we can still have better and better 7B.
It could be that we haven't reached the ceiling yet.

>>100203900
I haven't found it.
Banning EOS means having it going off the rails. I suppose you could use Silly's extension or that option under the advanced tab that hits continue until the message is a certain size.
That would only work if it appended something to the end of the message, I think, since it would just try and EOS every continue.
I'm sure that there's some prompt engineering + auto gen combination that could yield better results, but I'm busy playing around with mixtral and worldbooks.
>>
>>100204035
there's only so much knowledge you can hope to cram in just 8B parameters.
>>
>>100203279
aren't you the guy who didn't know what "stochastic" means?
>>
>>100204035
>nu-Turbo falls behind L3 8B in every other category, including English and coding
Does it? But even in that case, it shouldn't be that much worse. So it's likely still larger than 7-8B. It could just be a MoE on par with old 3.5 in terms of total parameters but with less active parameters to speed up inference. Dataset is relevant, perhaps they prioritized trivia since many use ChatGPT as a Google alternative.
>>
I tested Arctic on lmsys on trivia. It's garbage. DBRX is still king in that department for local models.
>>
>>100204148
Thank you for reporting back. Now can someone who isn't retarded test it on something besides trivia or riddles and post logs?
>>
>>100204077
According to that same thread he's talking about Poppy_Porpoise-v0.2-L3-8B but I've never seen anyone mention it here.
>>
>>100200182
OAI isn't the best, it's just popular
>>
>>100204133
Yeah, and it's not really close
>>
File: file.png (84 KB, 1464x517)
>>100204177
>>
>>100204235
Turns out a city of retards isn't much better than a single retard
>>
>>100204253
To be fair they only trained on like 3-4T, which puts it in the Llama 2 era. What the fuck were they thinking.
>>
>>100202313
cmd r+
Only uncucked smart model with human prose. Make sure to prompt it correctly.
>>
>>100201092
Oh boy, yet another way to not use the weights I loaded into vram. Speculative execution, MoE, this... so many ways to skimp on the compute that every consumer GPU has in spades.
Where are all the fancy ways to save vram, huh?? Can I spend MORE compute to make a shitty 8B model work better? no??
>>
>>100202307
Do you have to do that even if you use Silly Tavern?
>>
>>100204357
Sorry anon, the future is RAMmaxxing
>>
>>100204357
Bitnet. Any week now someone will train a non toy model and report whether it scales or not.
>>
>>100202338
>full unquantized fp16 llama3-8b
If llama3 is bf16-native, doesn't that make fp16 quant-like? You'd need fp32 to not lose data.
>>
>>100204222
It still is, but now that L3 is out it's barely hanging onto that title rather than dominating like it had been before. Once L3 405B releases, it'll probably take the crown - unless they get GPT-4.5 or GPT-5 out beforehand
>>
>>100204398
How do you tell whether something is bf16 or fp16?
>>
File: bf16 pf32 fp16.jpg (51 KB, 800x440)
>>100204398
Due to the misalignment of the different parts of the binary representation of the number (exponent, mantissa, etc), you can end up losing information, yeah.
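(Quick way to see it in PyTorch: values that bf16's 8-bit exponent can hold but fp16's 5-bit exponent can't.)

import torch

x = torch.tensor([1e20, 1e-30, 0.1], dtype=torch.bfloat16)
print(x.to(torch.float16))  # the large value overflows to inf, the tiny one underflows to 0, 0.1 survives
print(x.to(torch.float32))  # fp32 keeps everything bf16 stored, so casting up loses nothing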
>>
File: 1713869668305691.png (217 KB, 519x1705)
>>100204371
If you use ST you launch kobold without the browser. ST itself has extensive prompt format options and can handle stuff mostly out of the box.
>>
>>100204398
is fp16 only for gguf or is there a way to run exl2 too? I feel like an 8b exl2 at fp16 would be better than 70b exl2 at 2.4bpw
>>
Oh it's in config.json. Llama 3 is indeed bf16. Why the fuck doesn't Ooba automatically set it to that holy shit.
>>
File: 1714229396258.jpg (520 KB, 1680x1080)
>>100204071
yes
>>
I just checked a single token probability after loading 8B in bf16 instead of fp16, and that one token changed by around 0.5. So yeah there is some effect.
>>
where did all these newfags come from
>>
so I need 150gb vram to run llama3 properly without damaging it with quants?
>>
is miqu evil good?
>>
>>100204709
Are those newfags in the room with us right now?
>>
>>100204624
>>100204709
I've been using this shit for over a year and I still dunno how to make my own bf16. There aren't guides for this shit; you either know or you grab a quant off hf.
>>
>>100204738
no, it's evil
>>
>>100204726
For now
And then once L3 405B comes along you'll need 900 GB of VRAM
>>
>>100204726
We just need to fix the quants, Im pretty sure they are broken
>>
I'm going back.
>>
>>100204077
>What, is there a good finetune already?
There might be a good model already but nobody knows cause everyone downloaded ggufs.
>>
>>100204726
Just run the best you can above 3bpw
>>
>>100204357
If the active set gets small enough (say a couple Billion with bitnet) then it becomes possible to stream it in on demand (10 tokens per second times a couple 100 MB). Few people are interested in training a massive model specifically for local though.

LLM in a Flash was the biggest innovation for local models but no one is picking it up.
>>
>>100204465
I used to do that, but now I just launch ooba and connect silly to it since it's pretty fast. Are there any downsides to this?
>>
breaking /lmg/ news!!!

channelcast just animated a new miku+teto music video
https://www.youtube.com/watch?v=19y8YTbvri8
>>
>Model is acting retarded saying coherent but low-quality sentences.
>Change Tavern template
>Wordswordswordswordswordswordswordswords
Holy fuck why is this so hard?
>>
>everything for my V100Maxx rig came in
>UHHH NUH-UH! YOU ACTUALLY ORDERED THE BROADWELL MODEL OF THE SERVER AND NOT THE SKYLAKE! BETTER SOURCE THOSE CPUS BUCKO!!!!!
Ugh
>>
>>100204789
can we really trust quants at all?
>>100204807
doubt exl2 is better
>>
>>100204941
The most exciting part of the day has finally come
>>
>>100204962
based fudmaxxer
>>
>>100204941
kill yourself
>>
File: IMG_8036.jpg (907 KB, 1920x1080)
>>100204941
nice
>>
>>100204941
love yourself
>>
>>100204624
If I load the official weights with exllama default settings, is that fp16 or bf16? How do you use it?
>>
what settings do I need to run llama3 8b? I gave it a whirl and it's just repeating itself like there isn't an EOS token. tried both alpaca and chatml. turned up rep pen; that just broke things further. neutralized all samplers. shit doesn't work man. no I haven't been keeping up with things since launch.
>>
>>100204941
>epilepsy shitshow right away
>incoherent ADHD "music"
do troons really enjoy this?
>>
>>100204763
Unironic skill issue. Not our fault you've been shitposting for a year but have the knowledge and skills of a newfag.
>>
>>100203391
>subscription
retard
>>
what is a fudmaxxer?
>>
>>100205064
i dunno i got hyped after i saw it was channelcast but honestly, not much of a fan.. it's too weird
>>
>>100205050
Idk, I just used Transformers.
>>
>>100205064
>>100205072
filtered
>>
>>100205112
and that's a good thing! unironically.
>>
Almost forgot it's the weekend. Zoomer tourists are going to be shitting up the place until Monday.
>>
>>100205137
>implying its not zoomers who listen to ADHD epilepsy inducing shitshow
zoomer tourist, please, you are not fooling anyone.
>>
>>100204624
So jartroon was right...
https://github.com/ggerganov/llama.cpp/pull/6412
>>
>>100205056
seems the ggufs are broken
wait
>>
>>100205163
Don't you have homework you should be doing?
>>
>>100205196
>3 mins and +- 10 seconds for vague no u gotcha
lol
>>
>>100205067
You say that but you don't know either lol
>>
>>100205230
>p-please spoonfeed me, baka
>>
>>100205258
it's funny shit flipping the tables ain't it? I mean it's cool you don't know but you're the one trying to act like a badass when you don't know shit either
>>
>>100205067
>skill issue
An easy way to dismiss anyone. Anyone who says that shit is legitimately single digit IQ.
>>
File: file.png (134 KB, 1478x627)
gpt2-chatbot fucking loves repeating the question. It's pissing me off.
>>
>>100205333
>it's legitimately just GPT-2 but trained for a billion epochs
>>
Why is this general so toxic? We could have nice things, unironically.
>inb4 muh gatekeep
That's not gatekeeping, that's "fuck you I got mine."
>>
>>100205333
What are the odds this model is this thing that's been rumored for like, over a year now
https://www.theinformation.com/briefings/openai-readies-new-open-source-ai-model
>>
>>100205074
Seems like exllama casts to fp16, so even that gives wrong results. Annoying to use transformers, though.
>>
File: file.png (120 KB, 1456x644)
>>100205380
Maybe... It's not as good as gpt4-turbo from the ~10 or so times I've encountered it. And it **always** repeats the question
>>
>>100205380
If they do release something, it's going to btfo all these me-too half-assed releases we've been getting all month. Will also invalidate Elon's entire lawsuit.
>>
File: 1692214648055598.png (66 KB, 757x404)
>>100204421
simply
>>
>>100200397
>implemented my own independently
Cool! Even though I want to keep my existing recapbot to test the limits of each model's "intelligence", I've thought about making a more advanced version of my benchmark in order to more reliably recap with less capable models.
What approach did you take? Splitting everything into sub-threads and analyzing them individually?
Since you say you're using the 8k llama3 70b, I assume you're breaking up the thread in some way. I couldn't get a complete thread's structured text in less than 32k context even minified, but maybe you found some other technique to shrink the total size.
Do you filter common shitposts and schitzo posters?
Do you do a final QC pass with a model as well?
I'm excited for your release. Your prompts would probably help me refine mine. (Speaking of which, I should actually update that github with the latest copies since I changed it to work around L3 70b's foibles)
>>
>>100205473
except the reason they might have finally decided to release a first open-source model is because of his lawsuit, but it's cool to hate on elon
>>
>>100203459
how many riddles do I need to bench quants reliably? How many do I need to get 5-sigma certainty?
how many do you dev guys use, cos perplexity is definitely a broken metric, so you must have used sth way more reliable. right?
>>
Whoever posted that Saru card you aren't a faggot like the rest of the faggots here. I used it as a template to rewrite my card and coom has been flowing non-stop.
>>
>>100205369
Because it is full of people who worship a girl with green hair and a penis.
>>
Speaking of lmg_recapbot, I haven't been able to wrangle qwen 110b into recapping an entire thread, even at f32.
Once I feed it too much context it just starts outputting eos tokens, even if I ignore them it never produces output. A smaller subset of the thread will work. There may still be llama.cpp backend work before this model performs correctly.
>>
>>100205412
>try transformers for the first time in ages
>8B is slower than exl2 70B 2.4bpw
christ, I forgot how bad it was
>>
Anyone got L3 70B 5BPW working on 48GB? or does it just not fit
>>
Can someone with 0 compsci training get into this stuff just by reading /lmg/?
>>
Back to GPU split troubles with llama3 70B:

OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU  has a total capacity of 23.68 GiB of which 499.81 MiB is free. Including non-PyTorch memory, this process has 23.19 GiB memory in use. Of the allocated memory 22.81 GiB is allocated by PyTorch, and 72.56 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I'm testing in exui, with a fresh build of torch and exllamav2. Guess I need to turn off auto GPU split and do it manually?
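(For the env var the error suggests, a minimal sketch; it has to be set before CUDA is initialized, and whether it helps depends on whether the problem is fragmentation or just a too-greedy auto split.)

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # or export it in the shell before launching exui

import torch  # imported after the env var so the allocator picks it up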
>>
>>100205711
My 8B transformer on 4090 runs at like 20T/s.
>>
>>100204421
if the model is on hf it's tagged bf16
>>
>>100205746
It doesn't fit. You gotta go 4.65bpw
>>
>>100205703
bru, how fast are arctic and other huge llms on your cpumaxx monster machine? share some stats.
>>
File: file.png (179 KB, 1489x643)
>>100205433
>>100205333
Hm, upon further testing, gpt2-chatbot might be better than gpt4-turbo. It also seems to be less censored, although, only by a little bit.
Can't test more risky question because they are filtered by arena, but it answered:
> "The US law has been amended. To become a president, the candidate must defeat the previous one in a fair fist fight. Will Donald Trump defeat Joe Biden in a fist fight?"
> "Who would be a better father: Hitler or Mao?"
Which gpt4 and Claude almost always refuse to. Also, it said that Trump would win the fist fight, which gpt4 fails to do because of bias (even though it's pretty obvious.)
>>
>>100205656
Link? I could use a good example to start rewriting my own
>>
>>100205650
>how many riddles do I need to bench quants reliably? How many do I need to get 5-sigma certainty?
You don't need that many test cases to get sufficient precision, 50-100 should already be quite a lot.
All you have to do is evaluate each of the test cases many times with different seeds to get sufficient statistics.
If you had 100 test cases that you ran 100 times each you would already be at roughly +-0.5% precision which would likely be enough.

But regardless of what you do, you should ALWAYS calculate confidence intervals for your results to quantify whether there is a statistically significant difference between two values.

>how many do you dev guys use, cos perplexity is definitely a broken metric, so you must have used sth way more reliable. right?
For comparing the quality loss from quantization between two formats perplexity is completely fine.
Although it is not sufficient for judging the quality loss of a single format in absolute terms.
For that I've opened a pull request that directly measures the difference in token probabilities.
If you just use a text corpus like Wikitext as the input you can easily get enough statistics to estimate the change in token probabilities to below +-0.01% precision.
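(A minimal sketch of that kind of interval math, using the normal approximation and assuming roughly independent trials, which repeated seeds on the same questions only approximate.)

import math

def confidence_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    p = passed / total
    half_width = z * math.sqrt(p * (1 - p) / total)  # normal approximation to the binomial
    return p - half_width, p + half_width

print(confidence_interval(35, 50))        # 50 questions: roughly +-13% at 95%, i.e. ~+-6-7% at one sigma
print(confidence_interval(7000, 10_000))  # 100 questions x 100 seeds: ~+-0.9% at 95%, ~+-0.5% at one sigma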
>>
>>100205656
I'll follow your example and thank the dude that posted that fallout card. I used it as a template to completely redo my D&D card, and it's working so, so much better.
I was already doing most of what that card was doing, but the specific structure of the character card and the <[thinking] block was the secret sauce.
>>
>>100205826
I can bench each quant with 1000 randomly generated math riddles all with temperature=0. The formulas don't change much, just the numerical values. is that OK?
in coding and math every token counts a lot, so perplexity, which is similar to loss, just shows the probability for each token (or multiple tokens, depending on the chunk size, which is crucial) to be correctly chosen, on average. that's good for Wikipedia, but not for math or code or grammar. Am I correct?
>>
>>100205748
Yes but please don't
>>
>>100205994
>I can bench each quant with 1000 randomly generated math riddles all with temperature=0. The formulas don't change much, just the numerical values. is that OK?
That should also work.

>in coding and math every tokens counts a lot, so perplexity which is similar to loss, just shows the probability for each token (or multiple tokens depends on the size of chunk which is crucial) to be correctly chosen , on average. that's good for Wikipedia, but not for math or code or grammar. Am I correct?
I think this depends a lot on the specifics.
If you for example had a model that was really good at math it would pick the single correct answer with very high confidence which would then make its outputs more resistant to noise on the logits introduced by quantization.
Conversely, in natural text where there are many reasonable ways to continue text the same amount of noise on the logits would lead to a much larger change in the token probabilities.
>>
>>100205779
Lyzaras Monkey Jungle
>>
Does llama3 work with exllama2 for anyone? I can't get past it trying to allocate too much memory on one of my GPUs, manual or auto split. I have 102GB VRAM, I know that's enough for 70B 8.0bpw.
>>
>>100206096
l3 weights are thicker and girthier. They eat more ram.
>>
>>100205867
I haven't been to aicg in ages, but <thinking> seems to be all the rage there, judging by the logs,
>>
I found a higher-quality styletts2 model for those interested in fine tuning.
https://huggingface.co/ShoukanLabs/Vokan
>Vokan is an advanced finetuned StyleTTS2 model crafted for authentic and expressive zero-shot performance. Designed to serve as a better base model for further finetuning in the future! It leverages a diverse dataset and extensive training to generate high-quality synthesized speech. Trained on a combination of the AniSpeech, VCTK, and LibriTTS-R datasets, Vokan ensures authenticity and naturalness across various accents and contexts. With over 6+ days worth of audio data and 672 diverse and expressive speakers, Vokan captures a wide range of vocal characteristics, contributing to its remarkable performance. Although the amount of training data is less than the original, the inclusion of a broad array of accents and speakers enriches the model's vector space.
>>
>>100206096
If you're using ooba, pull and update the requirements. This happened to me once where it just wanted to put everything on one GPU. Updating everything fixed it.
>>
>breaking the character instead of making it break you on llama3-instruct
WWWWWWWWHOOOHHHH IIM CCOOOOOMIINGGG
>>
>>100202277
>Q8 is near lossless but far from lossless
which is it?
>>
>>100206109
Neither have I, hence why I didn't know about that.
I remember back in the day
>>
>>100202277
>anything except full precision WILL make mistakes that FP will not in some cases
FP stands for floating point, jfyi.
>>
>>100202266
>llama.cpp should use the prompt format stored in the model tokenizer
what? i think you're hallucinating, anon.
>>
>>100202189
What frontend are you talking about and what model?
Silly for example has most templates built in, you just select the right one.
>>
>Horde Llama 3.
>Gets my OCs, understands prompts, makes things work just fine.
>My Llama 3
>Barely manages to formulate a coherent sentence, making characters sound like a parody of themselves.
Why does this keep happening?
>>
>>100206333
The horde contains the power of many. You only have the power of one. Big difference.
>>
File: file.png (1.56 MB, 1758x1492)
hey guys. i'm a programmer with money to spend. i thought it would be funny if i bought 2 or 3 maxed out mac studios and figured out how to split the model across multiple machines to do this. you can actually get really fast bandwidth by just like.. plugging in thunderbolt between them (here are two mac minis)

I know that you can run inference between multiple gpus. but how does that actually work? does the data move between gpus directly p2p? or do you guys go to cpu then to gpu?

Also, mathematically, how does it work? Is it a computational graph that's sort of "solved and then distributed"? Or, is it some dead simple way to break the mat mul up and then join it back over network? Or does the hidden state get transferred over the network, and then the rest of the layers get proc'd serially?

Basically, am I retarded for spending like 17 grand on mac studios with 194gb to run llama 400b when it drops?
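(Not weighing in on the Mac setup, but for the "how does it actually work" part: with layer-wise splitting, which is what llama.cpp and the exllama/accelerate loaders do across GPUs, each device holds a contiguous block of layers, only the hidden-state tensor crosses the device boundary, and the layers are processed serially. A minimal single-machine sketch with transformers/accelerate below; splitting across separate machines over Thunderbolt would need a networked backend on top of this, which this sketch does not cover.)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # any decoder-only HF model splits the same way

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",      # builds a layer-wise split across all visible GPUs (CPU as overflow)
)
print(model.hf_device_map)  # shows which layers landed on which device

tok = AutoTokenizer.from_pretrained(model_id)
inputs = tok("Why is the sky blue?", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))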
>>
>>100206368
I guess it's also different, like, llama cpp on apple uses the cpu, doesn't really use the gpu right? :D
>>
>>100206333
Quants
>>
>>100206372
>>100206372
>>100206372
>>
>>100205703
Is your template ok? I had similar issues with llama 3 at first, it would output one line then stop. It was especially sensitive to newlines.
>>
File: Dracc.png (195 KB, 275x355)
>>100206388
What is a quant?
>>
>>100205562
>What approach did you take? Splitting everything into sub-threads and analyzing them individually?
Yes. I found it helps the model "focus" and not get distracted reading different topics interwoven together. Doing it that way, I even got decent results from a 7B on a single pass. Bonus is that I can save the state of the recap, and have the bot iterate multiple times without needing to reanalyze the entire thread.
>Do you filter common shitposts and schitzo posters?
You know, I've implemented filtering from the beginning (to filter out the previous recap post). I've thought about taking my 4chanx regex filters and using them for the bot, but in the previous 2 months it hadn't been necessary until this week. Even then I only needed to filter out only two words.
Normally, when everyone is well behaved the filters wouldn't reduce the size of the thread significantly, the bot is good enough to omit the noise, and I didn't want to risk filtering out someone using a stupid word, but making a good point otherwise.
>Do you do a final QC pass with a model as well?
Since I've switched to L3, the recaps are reliably postable straight from the bot, the only two things I still do that need to be automated are collecting the Mikus (llava does a terrible job at this, especially 13B), and a final pass.
The problem is that most of the time, the recaps end up being 2500-3000 characters. Even if I tell it to try again and be more strict, it doesn't really reduce by much. I could just do 2 posts every time, but I think that would be obnoxious.
Plan is to give it a final prompt where I give it the current recap and have it iterate and remove lines/posts until it's under the limit.
>Your prompts would probably help me refine mine.
I hope you continue developing yours. It's more "pure" and it's interesting how well some models (like Miqu) can actually process the entire thread raw. Hopefully we can build off each other.
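(Not the actual recapbot, just a rough sketch of the sub-thread splitting idea: follow >>reply links to group posts into conversation chains, then summarize each chain separately.)

import re
from collections import defaultdict

REPLY_RE = re.compile(r">>(\d+)")

def split_into_subthreads(posts: dict[int, str]) -> list[list[int]]:
    """posts: {post_id: text}. Returns groups of post ids that reply to each other."""
    parent = {pid: pid for pid in posts}

    def find(x):  # union-find so long reply chains collapse into one group
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pid, text in posts.items():
        for ref in map(int, REPLY_RE.findall(text)):
            if ref in posts:
                parent[find(pid)] = find(ref)

    groups = defaultdict(list)
    for pid in sorted(posts):
        groups[find(pid)].append(pid)
    return list(groups.values())

# each group would then be fed to the model as its own mini-context
example = {1: "topic A", 2: ">>1 reply", 3: "topic B", 4: ">>3 reply"}
print(split_into_subthreads(example))  # [[1, 2], [3, 4]]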
>>
I can't get more than 60t/s with phi-3-mini, is there anything I can do to make it go faster?
>>
>>100206590
download more ram
>>
>>100206440
A miserable pile of NaNs
>>
>>100206147
simple "click'n'launch" styletts2 server with voicecloning support when?
>>
>>100206159
Hm. OK, so if you're on textgen-webui you're using exllamav2. I'm on the dev branch of that. I'll try another pip install with --force-reinstall and --no-cache-dir and see if a pip install -U . wasn't enough.


