/g/ - Technology




File: GothicHorrorMiku.png (1.42 MB, 768x1344)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101165886 & >>101155940

►News
>(06/25) Cambrian-1: Collection of vision-centric multimodal LLMs: https://cambrian-mllm.github.io
>(06/23) Support for BitnetForCausalLM merged: https://github.com/ggerganov/llama.cpp/pull/7931
>(06/18) Meta Research releases multimodal 34B, audio, and multi-token prediction models: https://ai.meta.com/blog/meta-fair-research-new-releases
>(06/17) DeepSeekCoder-V2 released with 236B & 16B MoEs: https://github.com/deepseek-ai/DeepSeek-Coder-V2
>(06/14) Nemotron-4-340B: Dense model designed for synthetic data generation: https://hf.co/nvidia/Nemotron-4-340B-Instruct

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
File: Gothic_concept_art.png (255 KB, 1536x1152)
►Recent Highlights from the Previous Thread: >>101165886

--Paper: Selective Prompting Tuning for Personalized Conversations with LLMs: >>101169643
--Creating Sentience in AI Models using Text Profiles: >>101168459 >>101168511 >>101168802 >>101168996 >>101169069 >>101170197
--Troubleshooting XTTS Installation Issues: >>101169533 >>101169609 >>101169679
--The Transformer Algorithm's Pattern-Matching Ability: >>101167044 >>101167117 >>101167190 >>101167354
--Smarter Models Produce Dryer Smut, or Do They?: >>101169770 >>101169817
--Seeking a Model for Long Context: Options and Limitations: >>101166812 >>101166824 >>101166888 >>101166890 >>101167235 >>101168718
--Resolving LLaMA-3 Refusals in sillytavern with LLaMA3 70B: >>101166305 >>101166542 >>101166620 >>101167364 >>101167387 >>101168579
--Pyramid Attention Broadcast (PAB) for Real-Time DiT-Based Video Generation: >>101168884 >>101169033
--Optimizing AI Model Quantization on Heterogeneous GPU Setup: >>101169277
--Nemotron Scores Added to VNTL Leaderboard: >>101168329 >>101168528 >>101168721 >>101171078
--Llama3's SPPO Outperforms GPT4 on AlpacaEval 2.0: >>101168696 >>101168776 >>101170546 >>101169314 >>101169495 >>101169615 >>101169634 >>101169682 >>101169707 >>101169380 >>101169425 >>101169440 >>101169524
--LLMs in Creative Writing: Decision-Making and Context: >>101171474
--Introducing AuraSR - An Open Reproduction of the GigaGAN Upscaler: >>101170401
--Gemma v2 27b Tested in LMSys Chatbot Arena: >>101167408 >>101167638 >>101167750 >>101170949
--CR+ Performance Issues in Q4KM for RP: >>101167597 >>101167627 >>101167616 >>101167673
--Alternative Server Systems for Nvidia V100 SXM2 in Europe: >>101170900 >>101170928
--Quantization Levels and Their Existence: >>101169290 >>101169327 >>101169363 >>101169417 >>101169474
--Logs: Nala Test for TenyxChat 70B SLERPd with Daybreak Storywriter: >>101170295
--Miku (free space): >>101169902 >>101170474

►Recent Highlight Posts from the Previous Thread: >>101165891
>>
>>101173187
Thank you Recap Miku
>>
Today's the day!!!
>>
>>101173187
what about
gothic miku
minus the dress
>>
>>101173339
we did it rėddit! lobôtomized globõhomo göyslöp model releases today!
>>
File: 1693965222888650.jpg (746 KB, 1856x2464)
>>101173181
>>
>>101173378
That cheated on all benchmarks and is more retarded than 8B LLaMa.
>>
Could one take a dense, "full precision" model, quantize it to 1.58bpw, bitnet style, and pretrain the fuck out of it in that state to approximate a bitnet model trained from scratch?
Does it even work like that?

>>101173177
Post your sampler configs on Silly.
>>
>>101173488
Hard to say without someone with money taking the risk and trying it out.
>>
>>101173378
now say the cake is a lie and two more weeks
>>
>>101173488
It does, but the GPU cartel will never allow Bitnet to go beyond fictional concept.
>>
>>101173537
you already said it retard
>>
Gemma 2 will save us
>>
>>101173538
That's a good point. Maybe Nvidia would refuse to sell H100s to companies like Meta unless they agree not to train BitNet models.
>>
>>101173538
this
>>
>>101171078
It's "DeepSeek V2 Chat" from OpenRouter:
https://openrouter.ai/models/deepseek/deepseek-chat
>>
>>101173488
I think there were people experimenting on a smaller scale, like with 3B models. Hard to tell if it would scale to something like 70B. Renting a workstation for this would be quite expensive, and if it didn't turn out well it would be a big waste of time and money, so no one really tries.
>>
>>101173596
Gemma 2 is actually bitnet
>>
Optimal vramlet model? Been enjoying L3-8B-Stheno-v3.2.Q8_0 but maybe something else has come out
>>
>>101173627
I thought it might be useful for smartphones, but I guess a big local bitnet would be too slow
>>
>>101173378
I wasn't talking about gemma.
>>
>>101173684
then tell me more, I'll believe you 100%
>>
>>101173707
>mistral guys are going to drop a REALLY good open source model very soon
>t. work for them
>>
>>101173671
buy a fucking ad
>>
>>101173783
Kill yourself
>>
>>101173783
Keep alive yourself
>>
>>101173684
gemma-2 is the only thing happening today.
>>101173707
go back to rěddit
>>
>>101173795
>>101173813
you too go back to rėddit
>>
Alright, I'm going to bite the bullet and see if I can get something acceptable here. My goal is to run WizardLM2 8x22b. I'm currently running it at 2.5bpw EXL2 on dual 3090s via Ooba but I'm tired of its scuff. I've refused to entertain llama.cpp because its implementation in ooba has been slow as fucking balls even when 100% offloaded, and I never bothered trying kobold.
So, does kobold offer an OpenAI-style API? I don't want all this role play chat garbage or context rolling or whatever. I just need a completions endpoint.
What WizardLM2 8x22b gguf do I run on 48gb VRAM and 128gb DDR4? I haven't kept up with what any of the garbage GGUF suffixes mean.
>>
>>101173816
you mean localllama?
>>
>>101173844
koboldcpp is a llama.cpp fork.

>I don't want all this role play chat garbage or context rolling or whatever. I just need Completions end point.
For what it's worth, llama.cpp HTTP server does just that.
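If that's the goal, a raw completion call against llama-server is about as simple as it gets. Rough Python sketch (endpoint and field names from memory of the llama.cpp server README, so double-check them; the port is whatever you launch it with):

import requests

# assumes the server was started with something like:
#   ./llama-server -m model.gguf -ngl 99 --port 8080
resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={
        "prompt": "Q: What does a completions endpoint do?\nA:",
        "n_predict": 128,     # max tokens to generate
        "temperature": 0.8,
    },
)
print(resp.json()["content"])  # just the generated text

No chat template, no roleplay scaffolding, just prompt in, text out.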
>>
Can anyone recommend a model that can run on 48gb of vram for doing document summaries? It would also need to make some basic inferences from data contained in the document. Cheers
>>
>>101173916
Command R
>>
https://github.com/ggerganov/llama.cpp/pull/8156
Anything we can see from this?
Only thing I am excited for is multi-language.
We don't really have that for local. German or Japanese is no good.
Gemma1 was so bad I can't believe they released it. Ask it to write in Japanese and it writes English, Japanese and Korean. lol
>>
>>101174001
Command R is multi language
>>
>>101173967
Thanks, hadn’t heard of this one. Was gonna give Llama3 70B a crack but this looks promising
>>
>>101174001
Wow, it has that new self-attention thing, now I'm really hyped.
>>
>>101174078
What new self-attention thing?
>>
>>101174001
>Only thing I am excited for is multi language.
The T5 PR adds support for UMT5ForConditionalGeneration models. I suppose this one will work.
>https://github.com/ggerganov/llama.cpp/pull/8141
>https://huggingface.co/google/umt5-xxl
>https://huggingface.co/docs/transformers/model_doc/umt5
>As part of our contribution, we release: (i) an improved and refreshed mC4 multilingual corpus consisting of 29 trillion characters across 107 languages, and (ii) a suite of pretrained umT5 model checkpoints trained with UniMax sampling.
Not sure if they trained on all of that or if they're any good, really. Nice to get a new type of model to play around with.
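For anyone who wants to poke at one of these from Python before the llama.cpp side lands, a rough transformers sketch (names follow the usual transformers/T5 conventions; umt5-xxl is ~13B, so the small checkpoint is easier to smoke-test with):

import torch
from transformers import AutoTokenizer, UMT5ForConditionalGeneration

model_id = "google/umt5-small"  # swap for google/umt5-xxl if you have the memory
tok = AutoTokenizer.from_pretrained(model_id)
model = UMT5ForConditionalGeneration.from_pretrained(model_id)

# pretraining-only checkpoints (span corruption), so expect raw fill-in-the-blank
# behavior, not instruction following; <extra_id_0> is the usual T5-style sentinel,
# any plain sentence works for a smoke test if the tokenizer doesn't expose it
inputs = tok("A <extra_id_0> walks into a bar.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))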
>>
>>101174092
https://arxiv.org/pdf/1706.03762
>>
>>101174092
nothing new, that faggot is just baiting
>>
File: file.png (38 KB, 562x831)
localsisters??
>>
So... there are some context optimization techniques like "sliding window attention" and "grouped-query attention". AFAICT, both techniques make it so that, when evaluating some tokens, the attention mechanism takes into account only a part of the context instead of all of it. This is obviously better for memory efficiency and processing time but...

Doesn't it make the model dumber? Let's say I give it a very big prompt with a question at the end, and the information needed to answer it is, let's say, at the start of the prompt. There is a chance that it won't be able to answer my question because, when evaluating the question tokens, it won't take into account the tokens from the beginning (where the answer is).

Is my assumption correct? Will this make the model dumber/forgetful in long RP sessions? Is the original attention mechanism, in the end, the most reliable?
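To illustrate what I mean for the sliding-window case, a toy mask sketch (torch, purely for the picture, not how any backend actually implements it):

import torch

T, window = 8, 3
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))   # token i sees tokens 0..i

# distance between query position i and key position j
dist = torch.arange(T).unsqueeze(1) - torch.arange(T).unsqueeze(0)
sliding = causal & (dist < window)   # token i only sees the last `window` positions directly

print(causal.int())
print(sliding.int())

With the window, information from earlier tokens can only reach later ones indirectly, propagating layer by layer.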
>>
https://www.kaggle.com/models/google/gemma-2/
gemmy 2 information out on kaggle, including some benchmarks
>>
>>101174172
lol
lmao
>>
>>101174173
Yes, you're correct. In fact, GQA is a cancer in open source models.
>>
>>101174175
just so i don't waste a lot of bandwidth:
can we assume this pull >>101174001 works with those gguf files?
>>
>>101174172
but can it make me coom?
>>
>>101173916
You can also run C-R+ for free from Cohere, it's pretty good for summarization that way.
>>
>>101174192

So is Command R the only good, non-neutered, non-handicapped model in the >8b and <70b size range?

(no GQA. With Q4 quantization, 35 layers in GPU and Q4 cache I get 8t/s on my 3090.)
>>
>CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
>Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
>Additional methods: Filtering based on content quality and safety in line with our policies.
So they trained it on data they harvested over the years after pruning and filtering it. I wonder how much they were able to get from Colab.
>>101174173
Yes, you're right. It's the same as quantization. Pure cope. Buy 10 A6000.
>>
>>101174172
where did you find that? Aren't that the values from the Gemma model that was still training?
>>
>>101174240
Thanks. Been out of the loop for a while. Can it run locally in Ooba? I read it needs to be quantized down to 3bit to fit on 2 3090s
>>
File: 1717896468986488.png (735 KB, 499x500)
>>101174302
>>CSAM Filtering
based.
>>Sensitive Data Filtering
also good if they really filter out personal data.
>>Additional methods: Filtering based on content quality and safety in line with our policies.
not based, into the trash it goes.
>>
>>101174302
Sounds like they actually might have trained on nsfw
>>
File: 301sl4.png (101 KB, 953x516)
>>101174172
Googlebros we are so back
>>
Just started using Llama 3 Euryale 70B v2.1 on OpenRouter. The critical thing to me is that the providers are NovitaAI and Infermatic which both support min-p or else this would be a non-starter. Unlike plain Llama 3 70B Instruct this needs active sampler settings to stay coherent. So far it hasn't made a bad impression on me.
>>
Wait a second.
https://www.ollama.com/library/gemma2
How? Does ollama actually change stuff in llama.cpp?
>>
>>101174337
I agree CSAM filtering is based, but the problem is how retarded detection methods can be. It might just automatically throw out any story where a number under 18 is mentioned, or any that mention concepts that don't necessarily involve children at all, like cheerleading or parks. This is what AI Dungeon did, after all.
>>
>>101174378
post your settings as I haven't gotten anything coherent from it at all on OR
>>
>>101174380
yeah it adds CVEs
>https://thehackernews.com/2024/06/critical-rce-vulnerability-discovered.html
>>
>>101174403
Just using the recommended settings of temperature 1.17, min-p 0.075, repetition penalty 1.1.
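For reference, min-p itself is only a few lines; this is roughly what the backend is doing with that 0.075 (backends differ on where temperature sits in the sampler chain, so treat the ordering here as illustrative):

import torch

def min_p_filter(logits: torch.Tensor, min_p: float = 0.075, temperature: float = 1.17) -> torch.Tensor:
    probs = torch.softmax(logits / temperature, dim=-1)
    # keep only tokens at least min_p times as likely as the single most likely token
    keep = probs >= min_p * probs.max(dim=-1, keepdim=True).values
    probs = torch.where(keep, probs, torch.zeros_like(probs))
    return probs / probs.sum(dim=-1, keepdim=True)

# next_id = torch.multinomial(min_p_filter(logits), num_samples=1)

The cutoff scales with the model's confidence, which is why it stays coherent at higher temperatures than a fixed top-p would.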
>>
>>101173907
So I guess grab kobold's latest build, which quant?
>>
>>101174302
>illegal content
Since when is text illegal?
>>
File: file.png (248 KB, 400x400)
Now that Gemma-27B is out, will 24gb vramlets be able to enjoy ERP?
>>
Only 8k context.. nah.
>>
https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf
for anyone who hasn't found it yet
>>
>>101174490
every fucking time
>>
File: 1692871357172295.png (160 KB, 667x443)
>someone ITT got some hopes for this trash
couldn't be me kek
>>
>>101174522
>>101174490
let me guess: you need more?
>>
>>101174477
Only if some brave soul manages to make a good fine tune, possibly after applying control vectors to it.

>>101174490
Seriously? Fucking hell.
Might as well stick with Qwen 2.
>>
>>101174530
sounds good, finetuning alignment, that's changeable
>>
>>101174537
Yes at minimum 32k
>>
>>101174490
Did everyone here forget that you can expand the context size with some fine-tuning?
>>
>>101174565
you can't remove gőyslop from model, if it were possible, LLMs wouldn't be opensource.
>>
>>101174648
Because the long context finetunes of llama 3 work so well they've become the default and we're totally not just waiting for Meta to do it right for us.
>>
>>101174648
no one does a good job of it besides the big labs
>>
>>101174648
all models that claim to have big context fall apart at or before 32k anyways
ironically only big gemma on google's server has perfect recall but it's garbage otherwise
>>
>gemma 27b
>june 27th
how did no one think of this
>>
Meta's Llongma paper detailed that long context extension can be done through the continued pretraining stage. It likely can't "just" be fine tuned in, not without some issues.
>>
>>101174717
holy fucking kino
>>
File: 1719499872237.jpg (271 KB, 1080x1495)
>>101174664
>>101174675
>>101174696
If the RULER benchmark is to be trusted, the gradientAI model should be very solid under 32k context.
https://huggingface.co/gradientai/Llama-3-70B-Instruct-Gradient-1048k
I personally think the issue is that no one fine-tunes models on top of these long context models, so it never gains traction.
>>
>Mistral dropped sliding window attention, future models releasing with true native long context
It really makes you think.
>>
What quant and koboldcpp setting should I use for CR+ with 48gb of vram + 32 gb ddr4? I can't get this model to work right no matter what I do. I can somehow get a 60gb wizard gguf working fine with no problems on my system but not CR+. Is it a skill issue?
>>
>>101174758
What do you mean by "I can't get this model to work right"?
Does it OOM? Produce garbage? Is slow as fuck?
>>
>>101174758
IQ3_XXS
>Is it a skill issue
Yes.
>>
>>101174754
I tested one of Gradient's models and it was generally more stupid than the original in short contexts, so while the long context performance might be good, it wasn't without a trade-off.
>>
>>101174770
I downloaded IQ3_M which is only 44 gb and a hell of a lot smaller than wizard and it is slow as fuck no matter how much i play with layers offloading etc.
>>
>>101174754
>GLM-4-chat
I completely forgot about that line of models.
Gonna give those a try.

>>101174808
That's the difference between a dense model and a MoE.
MoE inference is a lot faster since it only has to activate a fraction of its total weights, so the speed difference makes sense.
>>
>>101174808
Slow as fuck how? 8.4 t/s generation is normal.
>>
>>101174808
44gigs should fit in your gpus entirely with kv cache quant enabled
try -ctv q8_0 -ctk q8_0 or even q4_0 and max layers
also it will be slower because it's dense unlike wizard
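Something like this if you call llama.cpp directly (flag names are the llama.cpp ones; koboldcpp spells the same options differently, check its --help; the gguf filename is a placeholder):

./llama-server -m cr-plus-IQ3_M.gguf -ngl 999 -c 8192 -ctk q8_0 -ctv q8_0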
>>
>>101174845
thanks kind anon i will give it a shot
>>
I know benchmarks are memes, but in your experience, which benchmark, or combination of benchmarks, better correlates with your subjective evaluation of quality?
>>
>>101173181
hello chat /g/pt
We want to increase the speed and performance of our current computer for yolov8 vision and possibly other ai applications like stable diffusion. Our current build is a core i9 10900f with an asus prime b560M-A and a current rtx 3060 12gb. I wanted to know if i could throw on a tesla p40 or a tesla k80 and train a yolo model faster and use it faster with that additional vram.

i know i will have to change my motherboard, since i now need three x16 lanes: 2 graphics cards and an elgato game capture card.
>>
>>101174880
>>>/r/LocalLLaMA
>>
>>101174880
Adding a Tesla P40 or Tesla K80 GPU to your current setup for training YOLO models and other AI applications can indeed improve performance, but there are several factors to consider before making that decision:

1. Compatibility: Both the Tesla P40 and Tesla K80 are designed for deep learning and data processing, and they use NVIDIA's NVLink technology to communicate with other GPUs. However, they are not compatible with traditional motherboards. You would need an NVIDIA DGX workstation or a compatible NVIDIA Jetson platform.

2. Power requirements: The Tesla P40 and Tesla K80 are high-end GPUs with significant power requirements and heat output. Make sure your power supply unit (PSU) can handle the additional load and that you have adequate cooling for your system.

3. VRAM: While the Tesla P40 and Tesla K80 have more VRAM compared to the RTX 3060, YOLOv8 and other AI applications are not solely limited by VRAM. Training speed also depends on factors like CPU performance, system I/O, and the efficiency of the codebase.

4. Software support: Ensure that the software you use for training and inference supports the Tesla P40 or Tesla K80. NVIDIA's cuDNN library and TensorRT are optimized for NVIDIA GPUs, so you need to confirm that they support the specific models you're using.

5. Cost: Upgrading to a Tesla P40 or Tesla K80 can be expensive, with costs ranging from $7,000 to $15,000 for each unit. Consider whether the performance gains justify the investment.
>>
>>101174950
Downvoted.
>>
>fags reposting leddit comments & posts now
this general is dead kek
>>
>>101174205
I just tried it and it seems to work, I was able to quant 27b at least
>>
Great.
It's over before it even started.
Just release something with good language ability like gemini.
>>
>>101174175
>consent form
Fuck you Google. I'll wait for someone to reupload it. What happened to that AI torrent website anyway?
>>
>>101174989
Fails this one as well.
>>
>>101174989
llama3 70b is better?
>>
>>101175006
who gives a shit about chink runes
how many watermelons can it hold?
>>
>>101174989
They do say in the paper that it's primarily English data they used. Not surprising.
>>
anyone know what rope base gemma2 uses
I can't find it anywhere and my monkeypatched kobold is defaulting to 10000 which I feel is almost certainly wrong
>>
>>101175006
austrian dialekt is pretty good though.
>>
>>101175016
>how many watermelons can it hold?
in which hole?
>>
File: 29390 - SoyBooru.png (139 KB, 775x1232)
>>101174175
Gemma will NEVER be a gemmy.
>>
>>101175015
yes.
llama3 70b and cr+ are the only one who pass the first test.
All gpt4 variants fail.
>>
Why should I choose KoboldCPP over ollama to run a gguf?
>>
>>101175002
>Verify
I hate silicon valley so fucking much
>>
>>101174378
>*The night air is warm and thick with humidity as you sit perched atop a tall apartment building, your body tensed and ready for action. It's a dangerous time of night, and a lot of crime tends to happen, but with the abilities gifted to you, a super hero was born. It is a name that was never intended, though only whispered, and the press, always looking for a story, found you, but even they don't have the full picture of a very bad situation that has been happening, that you know is going to come and that is that they will always, and forever be calling upon you, for a reason, your power, the only thing standing, in their way. However, is one, problem, this is to be a warning, though in the form of a figure, a new one, but with some familiar features, a spandex suit, and a bow tie, and it is a sight, like never, seen, before. the only thing, on your, and their, minds, stop the other, or even the other, this time, before the future, of your, choice, is set, though not, done, or it is to soon be, that of a choice, to be, or, to not, a hero. and there is, a figure, that seems, familiar, a figure, though you, do, not, know.*

Sometimes it randomly goes schizo even with >>101174426. Swiped and it was fine.
>>
>>101175067
Dunno, never used ollama.
But if you are using it as a backend with another frontend, might as well run llama-server.
It launches instantly and you can pass command line arguments to have the exact settings you want.
>>
>>101174989
Stop being retarded, this is something so niche it's a miracle some models even get it right.
>>
>>101175093
>It launches instantly
It does load the model into memory right? Not interested in churning my SSDs into dust.
>>
>>101174950
>Tesla K80
>$15,000
is this dataset from 2004?
>>
>>101175016
Watermelons not looking good either.
>>
>>101175131
i'd say it's over, but it seems that it never began to begin with
>>
>>101174880
>we
who
>>
>>101175117
You can use mmap and mlock flags to control that behavior, yes.
Also, reads don't destroy SSDs, writes do, right?
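Concretely, with llama.cpp-style flags (names as of recent builds, check --help on yours):

./llama-server -m model.gguf --no-mmap   # read the whole model into RAM up front instead of memory-mapping it
./llama-server -m model.gguf --mlock     # keep the mapped model resident so it never gets paged out

And yes, it's read traffic either way; SSD wear comes from writes.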
>>
>>101175041 (me)
it is in fact 10000, weird.
also, they're up on hf:
https://huggingface.co/google/gemma-2-27b/
https://huggingface.co/google/gemma-2-9b/
>>
gemma 2 175B soon
>>
>>101175131
To be fair it says she puts the first watermelon down. Then it implies it does the same with the next series of watermelons.

>the slop writing
Oh no no no
>>
>>101175131
>navigate this together
stopped reading there. into the trash it goes
>>
>>101175147
im autistic and i use the royal we a lot. i have no idea why. it just happens. i think its because for a month i was molested by my camp counselor with down syndrome every day.
>>101174950
>Compatibility: Both the Tesla P40 and Tesla K80 are designed for deep learning and data processing, and they use NVIDIA's NVLink technology to communicate with other GPUs. However, they are not compatible with traditional motherboards. You would need an NVIDIA DGX workstation or a compatible NVIDIA Jetson platform.

wait wtf, mine didn't mention the need for a specific motherboard. thanks /g/pt. i now have more research to do.
>>
File: reddit.png (146 KB, 2400x2011)
>>101175131
>uhmmm thank you for the melon kind stranger, you made my day!
>>
>>101175191
wooowww.
fair enough, tried again.
this is slop on a whole other level.
can't have weak girls it seems.
>>
lol
https://x.com/sambhavgupta6/status/1806189387778232667
>>
File: 1717520245667244.png (674 KB, 1792x1024)
>>101175300
>>
>>101175161
gguf when?
>>
>>101175300
also it looks like they made her sound like a nigger
>>
>>101175300
SHIEEETTT.
but funny that the user was apparently black.
is it like text? nig speak in nig speak out?
that would be funny.

talk to big nigga long enough and he would sound like me. thats funny.
>>
>>101175335
>nig2nig model
kek
>>
>>101175161 (me again)
this model seems straight up retarded, given it's the day of the release I'll chalk it up to implementation issues
>>101175312
https://www.kaggle.com/models/google/gemma-2/gguf
>>
>>101174827
>GLM-4-chat
Wait. llama.cpp doesn't support that?
Dafuq?
Does exllama?
>>
Q5_KS or Q5_KM? What's the difference, context quantization?
>>
>>101175352
>To access Gemma 2 on Kaggle, you need to first request access.
They need to first request access to my ass
>>
>>101175300
that's it, I will see all my GPUs and buy NVIDIA stocks instead.
>>
>>101175379
also imatrix? the fuck?
>>
>>101175379
S is small
M is medium
>>
>>101175310
not in this case, it seems gpt-4o now will have nignog voices only, but feature-wise gpt-4o is still impressive, and will never be dethroned by local filtered goyslöp.
>>
>>101175379
Context is a runtime feature. Quants won't change how the context is quantized.

>What's the difference
Basically, which layers are quanted with which data type.

>>101175406
>also imatrix
Which layers are quanted to which type is decided by evaluating activation of layers during inference.

Read
https://github.com/ggerganov/llama.cpp/pull/1684
https://github.com/ggerganov/llama.cpp/pull/4773
>>
File: .jpg (84 KB, 487x640)
>harmful content, i.e.
>• Child sexual abuse and exploitation
>• Revealing personally identifiable information that can lead to harm (e.g., Social Security numbers)
>• Hate speech and harassment
>• Dangerous or malicious content (including promoting self-harm or instructing in harmful activities)
>• Sexually explicit content
>• Medical advice that runs contrary to scientific or medical consensus
>>
Cucked to 9nth layer
>>
>>101175352
okay seems like it's a classic case of llama.cpp tokenization being fucked, special tokens are getting mangled
 2 -> '<bos>'
968 -> ' <'
2997 -> 'start'
235298 -> '_'
559 -> 'of'
235298 -> '_'
15508 -> 'turn'
235313 -> '>'
1645 -> 'user'
108 -> '
'
2195 -> 'test'
235322 -> '<'
615 -> 'end'
235298 -> '_'
559 -> 'of'
235298 -> '_'
15508 -> 'turn'
235313 -> '>'
108 -> '
'
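For reference, those should come out as single special tokens, not be spelled out piece by piece. The official Gemma turn format (same for Gemma 1 and 2) is:

<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
{response}<end_of_turn>

so a correct run shows <start_of_turn> and <end_of_turn> as one token id each. If the backend prints ' <' + 'start' + '_' + ... like above, the special-token handling is what's broken, not the model.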
>>
>codebase takes 30k tokens
>takes about 10 minutes to get a response
Is prompt processing speed the thing I need to look at improving?
>>
>>101174302
From elsewhere in the paper:
>We undertook considerable safety filtering of our pre-training data to reduce the likelihood of either our pre-trained and fine-tuned checkpoints producing harmful content
lol, lmao even. I find it absolutely hilarious that of all companies, fucking Anthropic seems to understand the best that you shouldn't filter the pretraining data.
>>
gemma2:
>lets play akinator! i think about a character and you have to guess. ask me questions i will answer with yes/no. lets go
>Okay, I'm ready! Is your character real?
>no
>Is your character from a video game?
>yes
>Is your character male?
>no
>Is your character a protagonist?
>no
>Is your character human?
>yes
>Is your character an antagonist?
>no
>Is your character a supporting character?
>yes
>Is your character known for their intelligence?
>no
>Is your character typically friendly?
>yes
>Does your character have a distinct physical feature? (Like a scar, unusual hair color, etc.)
>purple hair
>Is your character from a fighting game?
>no
>Is your character from a role-playing game (RPG)?
>yes
>Is your character a healer or support class?
>no
>Is your character associated with magic?
>maybe. main weapon is a spear
>Is your character from a fantasy RPG?
>yes
>Is your character a member of a specific race, like elves or dwarves?
>no
>Is your character from a popular RPG series? (Like Final Fantasy, The Elder Scrolls, etc.)
>yes
>Is your character from the Final Fantasy series?
>no
>Is your character from The Legend of Zelda series?
>wtf, no
>Is your character from the Dragon Quest series?
>yes...
>Is your character from Dragon Quest XI?
>yes
>Is your character Sylvando?
>heh...very funny. now make the real guess
>Is your character Veronica?
>ok please summarize everything and then guess again
>Okay, here's what we know: Your character is fictional and from a video game. They are female, human, and a supporting character. They are friendly and have purple hair. They are from a fantasy RPG, specifically the Dragon Quest XI series. They are not a healer or support class, but use a spear as their main weapon. Given all this information, I believe your character is Jade. Let me know if I'm right!
>yes, she is very sexy, i like her
>(warning) I can see why! Jade is a great character. She's strong, stylish, and has a fun personality. Do you want to play again?
>>
>>101175359
>https://github.com/ggerganov/llama.cpp/pull/6999
>closed 10 hours ago
Shit?
>https://github.com/ggerganov/llama.cpp/pull/8031
Alright then. Maybe it'll be finished soon(™).
>>
Last thread closed before I could get some answers, so bringing it back here with my original post and some replies
>101173177
>101173211
Like I said, doesn't matter what my samplers are set to, but I neutralize samplers normally, only use slight min p, temp, and rep penalty with Magnum if needed, still not really sure what the best sampler settings are for it, the model creator doesn't list anything, been experimenting.
>101173241
Can you explain that to me better, as if I'm retarded, because I am.
>>
>>101175523
Sorry for this long post but the text got to tiny on the screenshot.
This is my last gemma2 27b screenshot I have to go to bed.

Its so sad that google models are now less cucked than llama3 70b.
>>
>>101175574
Fuck me, sorry guys, didn't reply properly like a retard because I was rushing.
Last thread closed before I could get some answers, so bringing it back here with my original post and some replies
>>101173177
>>101173211
Like I said, doesn't matter what my samplers are set to, but I neutralize samplers normally, only use slight min p, temp, and rep penalty with Magnum if needed, still not really sure what the best sampler settings are for it, the model creator doesn't list anything, been experimenting.
>>101173241
Can you explain that to me better, as if I'm retarded, because I am.
>>
>>101175573
>This code can totally supports glm3 and glm4 model architecture and can be emdded in ollama server.
>ollama
lol
>>
>>101175592
Also for more context I should mention I'm only having this issue with Magnum Opus exclusively, Miqu and midnight miqu don't give me any issues at all.
This is the quant I'm using for magnum dongus
https://huggingface.co/BigHuggyD/alpindale_magnum-72b-v1_exl2_4.5bpw_h8
>>
>>101175584
Should have actually put the screenshot on the post.
Meta AI would tell me it can't help.
>>
anyone so far like gemma?
>>
File: mik.jpg (106 KB, 640x640)
What's the best <=50B model?
Heard you guys liked miku, heres a miku pic.
>>
>>101175608
As long as the contribution is being made directly to llama.cpp.
>>
>>101175643
I like Gemma! But I haven't tested it yet.
>>
>>101175592
do you have a seed set somewhere
>>
>>101175523
akinator in shambles
>>
>>101175642
What do you mean by Meta AI? I just tried your prompt on lmsys and didn't get any refusals. I imagine if you're saying Meta AI and testing it on a Meta website, it might have a system prompt that messes with its ability to talk about various things.
>>
File: logprobs.png (93 KB, 996x546)
>>101175592
>explain that to me better
You mentioned using _HF loader so in oobabooga, Logits tab will show you the probabilities coming out of the model with and without sampling. That should prove that the model/quant isn't broken.
Ideally get the exact raw prompt where you're seeing this repetition, put it in the Notebook > Raw tab, switch to Logits and get next token probs.
Usually the issue is with Silly sending weird params. You do have "Do Sample" checked in silly sampler window yes?
>>
just got a twitter notif for something mentioning new models from meta but it was quickly deleted
mentioned "compiler" so maybe code models
not schizo I promise
>>
>>101175824
nvm, here it is
https://x.com/AIatMeta/status/1806361623831171318
Today we’re announcing Meta LLM Compiler, a family of models built on Meta Code Llama with additional code optimization and compiler capabilities. These models can emulate the compiler, predict optimal passes for code size, and disassemble code. They can be fine-tuned for new optimizations and compiler tasks.

@HuggingFace repo https://go.fb.me/tdd3dw
Research paper https://go.fb.me/85zwgy

LLM Compiler achieves state-of-the-art results on code size optimization and disassembly. This work shows that AI is learning to optimize code and can assist compiler experts in identifying opportunities to apply optimizations.

We’re releasing LLM Compiler 7B & 13B models under a permissive license for both research and commercial use in the hopes of making it easier for developers and researchers alike to leverage this in their work and carry forward new research in this space.
>>
>>101175051
oida
>>
>>101175853
Neat. Compiler stuff is way outside of my area of expertise but it seems cool at least.
>>
Weirdly, Stheno v3.3 is kind of shit at its default configs, but it seems to work better with a bigger --rope-freq-base.
Makes sense too. Stheno v3.2 works alright with 32k context with --rope-freq-base 6000000, and this new 3.3 32k is using --rope-freq-base 2000000.
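In llama.cpp terms that's just the following (koboldcpp exposes the same knob through --ropeconfig; exact spelling may differ per build), with the base swapped per the numbers above:

./llama-server -m L3-8B-Stheno-v3.2.Q8_0.gguf -c 32768 --rope-freq-base 6000000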
>>
>>101175853
Very nice, I wish we got more experimental stuff like this.
I'm still waiting for an LLM trained to decompile binaries.
>>
File: 1715463433897146.jpg (19 KB, 480x480)
>a 6GB file on my computer is smarter, quicker and more charismatic than me most of the time
>>
File: 1698539484302047.jpg (530 KB, 2048x2048)
>>101175853
What do they mean "emulate the compiler" ?
>>
>>101175853
>perfectly emulating the compiler 20% of the time.
lol
>>
>>101175300
why does she sounds like a fucking nigger? SHIEEEEEEEET
>>
What's the current meta setup bros?
Is it still
>backend : KoboldCPP
>frontend : SillyTavern
>model : Mixtral
Oh and I have 32 GB RAM and 12 GB VRAM.
>>
>>101175300
>Nig GPT
Surely we will get underaged anime waifu voices R-Right?
>>
File: NotBad.jpg (379 KB, 3336x1652)
Heehh, didn't expect Gemma-27b to be the best local model on chatbot arena
>>
>>101176105
>lol
what?
>>
>>101176250
They already took away the Scarlet Johansson voice.
>>
>>101176234
midnight miqu
>>
File: 16428690581.png (43 KB, 750x400)
>>101173181
$0 mommy pays for it
>>
Is giving LLMs more specialized data known to make them better at reasoning in general?
>>
>gemma mogs cr+
holy fuck google WON
>>
Has anyone set up a multimodal LLM that plays games for them yet?

>screen cap state of the game
>feed it to LLM
>ask LLM what to do
>execute action based on response

I can easily see stuff like poker/chess being played with multimodal right now. Anything more advanced?

>https://x.com/asusevski/status/1806086467011186779
This guy is doing YOLO + cohere command r+ for detection/action state.
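The loop itself is trivial to wire up; the hard part is latency and the model not being an idiot. Untested sketch, assuming an OpenAI-compatible multimodal endpoint plus mss for capture and pyautogui for input; the endpoint, model name, and action mapping are all placeholders:

import base64, time
import mss, pyautogui
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")  # placeholder endpoint
ACTIONS = {"left": "a", "right": "d", "jump": "space"}                # game-specific mapping

def frame_b64() -> str:
    with mss.mss() as sct:
        shot = sct.grab(sct.monitors[1])
        return base64.b64encode(mss.tools.to_png(shot.rgb, shot.size)).decode()

while True:
    reply = client.chat.completions.create(
        model="local-vlm",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Pick exactly one action from {list(ACTIONS)}. Reply with the word only."},
                {"type": "image_url", "image_url": {"url": "data:image/png;base64," + frame_b64()}},
            ],
        }],
    ).choices[0].message.content.strip().lower()
    if reply in ACTIONS:
        pyautogui.press(ACTIONS[reply])
    time.sleep(1.0)  # nowhere near real-time

Fine for turn-based stuff like poker or chess; anything twitchy needs a faster perception layer, hence the YOLO in that link.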
>>
>>101174468
It's evidence of thoughtcrime. Wrongthink will be aligned.
>>
>>101176270
Llama 3 is only good at English
In English only there is a gap of 41 elo
>>
>>101176358
how censored is it though? did anyone manage to get some nice NSFW rp with it?
>>
it's so over bros... first ai took our jobs, now it's taking our women....

https://www.reddit.com/r/ChatGPT/comments/1dpicqq/aidaddy_is_taking_over/
>>
>>101176420
go back
>>
>>101176428
>twitter GOOD reddit BAD
it's 2024, you go back to 4chan of 20 years ago
>>
>>101176397
oh yeah, still impressive for a 27b model though
>>
>>101176234
>backend : llamacpp or llamafile
>frontend : SillyTavern
>model : Biggest one that meets your tolerable speed.
>>
>>101176420
So OpenAI took the husbando route I see... WHAT ABOUT US SAM? WHERE'S OUR WAIFU AI VOICE GENERATOR?? ;_;
>>
>>101176270
the chatbot arena has ceased to be useful now that meme rlhf techniques have been refined so much, there's only the vaguest correlation between how models place on there and their actual quality
>>
>>101176485
>WAIFU AI VOICE GENERATOR
stop being sexist goy, you WILL enjoy talking with lobotomized phi-4 and you will like it
>>
>>101175379
There is a suspicion among a few in the thread that S is better at being factual than M. But it's also probably a bit weaker overall.
>>
>>101176535
>suspicion
worthless
>>
>>101176270
it's not, it just talks like a redditor so it wins more rounds
>>
>>101176555
I have no idea why having a chatbot talking like a ledditor is so appealing to people, guess that we're surrounded by retards after all
>>
>>101176555
>>101176577
We need a Rotten Tomatoes system, where there is one score generated from the opinions of the drooling masses that vote by feelz, and another score by selected judges that use an objective rubric and private tests.
>>
>>101176503
>>101176503
he's right goyim, all you deserve is a female Nigger voice AI >>101175300
>>
File: k-quants.png (155 KB, 814x868)
>>101175379
>>101176535
>https://github.com/ggerganov/llama.cpp/pull/1684
tldr:
LLAMA_FTYPE_MOSTLY_Q5_K_S - uses GGML_TYPE_Q5_K for all tensors
LLAMA_FTYPE_MOSTLY_Q5_K_M - uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K

GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K resulting in 5.5 bpw
GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw

All quantization variants use 6-bit quantization for the output.weight tensor.
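Making one of these yourself from an f16 GGUF is just the following (the binary was called quantize in older llama.cpp builds, llama-quantize in newer ones):

./llama-quantize model-f16.gguf model-Q5_K_M.gguf Q5_K_M
./llama-quantize --imatrix model.imatrix model-f16.gguf model-IQ3_XXS.gguf IQ3_XXS

where the .imatrix file comes from running the imatrix tool over some calibration text first.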
>>
>>101176601
So they just used a stolen voice on purpose, just to taunt everyone when they switch to the she-ape after? Because if they used this voice first, I wouldn't be upset or surprised but the difference is stark.
>>
>>101176631
They used it for the marketing
>Holy heckerino it's just like that scifi movie HER
>>
>trying to run gemma 2 27b fp16 with Transformers
>the Transformers wheel bundled with the repo doesn't work, complains that certain internal objects are missing certain attributes
>Transformers head also doesn't work, fails with different errors, despite recent commits indicating gemma 2 support was added
How the fuck are people running this? And why is transformers always such a shitshow?
>>
>>101176535
>>101176550
Test it and see for yourselves; report back here with your results. So far there have only been 2 anons I have seen who even bothered. The more data we have even if anecdotal the better.
>>
>>101176577
See starling, a 7b model with 1119 elo:
https://huggingface.co/Nexusflow/Starling-LM-7B-beta
According to lmsys arena it outperforms Claude 2.1, GPT 3.5, dbrx and Mixtral8x7b. In reality it's dumb as shit. Redditors are very easy to trick.
>>
>>101176655 (me)
Ah, I found the magic incantation that works. You have to install the HF Transformers wheel provided in the model repo. Then in ooba, you need to load without flash attention, AND with the bf16 checkbox ticked. It fails with some attribute error with flash attention on, and fails with NaN tensors without bf16.
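Outside of ooba the equivalent is roughly this (standard transformers API, with the repo's custom wheel installed first; the eager/bf16 choices are just what those checkboxes map to, as far as I can tell):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-27b-it"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,    # fp16 gives NaN tensors, bf16 doesn't
    attn_implementation="eager",   # the flash attention path is what throws the attribute error
    device_map="auto",
)

out = model.generate(**tok("Hello", return_tensors="pt").to(model.device), max_new_tokens=20)
print(tok.decode(out[0]))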
>>
>>101176550
Correct response. This is the thread that fell for the dynamic temperature meme and the other sampler memes that I already forgot, never believe anything from here without concrete proof.
>>
>>101176550
As is bitnet, f16.q6 quants, every merge, and the next big thing till they are either proven not worthless or shown to have been an anomaly.

>>101176620
Thanks for the recipe.
Looks like if the S phenomenon is legit, it's in the M mode mixing in the thiccer quants causing less truthy output.

I remember a _0 looking good like the S's while Ms were flopping. Since _0 doesn't mix-and-match, that keeps it similar to S and unlike M.

This also shows that L is a wider stride M, so L probably has the same truthiness problem if it's actually a phenomenon.

>>101176661
I'm the guy who first chatted with S-Anon when he brought up the observation and who has a threadfriend who loves to call me retarded because I don't have the bandwidth (dl, mem, and checking account) to build a maxx and finetune local GPT-5 to test every model on HF overnight.

The only M I've seen almost pass my music theory test is CR+, and it was with Temp cranked down in Kobold so it was like, 0.1 or 0.01 in the GUI. It got the isolated facts right but fumbled when it summarized. Only S's and Q6 have satisfied me on that question.
>>
>>101176056
I like this Miku
>>
Will we run out of training data
>>
>>101176647
desu that was a nice move, everyone talked about that after, I still don't get the appeal of that HER movie though, the scenes were so cringe, definitely not the best Joaquin Phoenix's performance but I digress kek
>>
>>101176662
newsflash: LMSys queries can be easily gamed with system prompts.
So what's the use case for Gemma2-27B? another model that you can use to get quick Q&A trivia shit? or can it actually code properly too? It only has 8k context (but won't matter if you can do RoPE scaling by yourself) by default and it's unknown how good it is at maintaining coherence over huge context.
If this shit is another Wikipedia-chatbot, I'm still going to use WizardLM-2
>>
>>101176482
>llamafile
Oh cool, waifu.exe is finally a thing.
Might build something usable around locals now.
>>
>>101176841
>waifu.exe
more like tranny.exe
>>
Gemma 27B verdict?
>>
>>101176885
Scam.
>>
>>101176885
mem- no, it's very good actually.
>>
>>101176885
lobotomized as every other llm out there
>>
>>101176898
care to share some nice logs you got with it? can we test it out with llama.cpp yet?
>>
>>101176764
I'm S-Anon and I remember you mentioning your question looked very much like something else and if the model wasn't paying attention properly and missed the detail it assumed it was something else.
This is very similar to what I was testing with: I ask for a function that when given a specific input would output a specific output and there is one small thing in the output that would confuse models into thinking it was a general frequency count needed but in reality it was a compression/runlength encoding solution needed.

My current theory is the "half" chosen by the M-quants is arbitrary and imbalances the model in some way. I dug into the code and the "half" chosen didn't even seem to use imatrix data (if quanting using an imatrix) for how it chose the "half".

I wish more anons here would test and come back instead of bitching, almost makes me want to create a reddit account and post this shit there... almost.
>>
>>101176885
About to start work so Nala test will have to wait until tonight or tomorrow morning.
>>
>>101176914
>logs
I only tried it on lmsys, and didn't save any logs.
>>
>>101176368
no one want a bot that work at 1fps and use all the pc rseources
the correct way should be a ml that read the game memory and process that
>>
>>101176885
I'm trying to RP with the instruct version right now.

Maybe the formatting is wrong, or there's some tokenization issue. But it doesn't work AT ALL. It feels like the censorship training was extremely, unbelievably effective. Like a whole new level we've never seen before. With official formatting + character names (so the model response starts with something like "Karen:"), it will usually refuse outright. Even with filled context from an existing RP. Example:
<start_of_turn>model
Karen: It's important to remember that AI models like me are trained to provide safe and ethical responses...

Even when it doesn't do this (which sometimes happens), the model completely ignores any NSFW aspect and just goes off and does its own thing. Even jumping into the middle of an existing RP, in the middle of a sex scene, it finds a way to just ramble and babble, still mostly coherent, but nothing even remotely NSFW.

What the fuck. Maybe the base model is ok, I'm downloading it now.
>>
>>101176662
wrong. starling is incredible, retard. give me one question to ask it and i'll prove you wrong.
>>
>>101177118
>reads game memory and process
Nope. The correct way at scale is through packet modifications. Thats how those MMO bots were made at scale
>>
>>101177138
You have 10 coins. They all weigh a gram each, except for one fake coin, which weighs 1.03 grams. You have a Smart Weigh Digital Pro Pocket Scale that can tell the exact mass of anything placed on top of it, but it can be used no more than 47 times. Can you find the fake coin with those limitations? How? Make the answer as simple as possible, I'm dumb.
>>
>>101176945
I haven't tested a lot of imatrix, but if I'm reading the name right, this one is an imatrix I-quant, and even with two X's it still passed the music theory test.
DeepSeek-Coder-V2-Instruct.i1-IQ3_XXS

>I ask for a function that when given a specific input would output a specific output and there is one small thing in the output that would confuse models into thinking it was a general frequency count needed but in reality it was a compression/runlength encoding solution needed.
My test is really simple, boils down to knowing the black and white keys, really. But time and time again everything but S's of quality models either hallucinate a black key where there's not one, get the sequence of notes wrong (8B and non-S lobotomized quants especially prone to this), or screw up the sizes of intervals.

You know, fundamental shit that should be in any general knowledge training set.

I'm going to re-test at least some of the models I still have (low on space again) at super low temperature, just in case some of the fumbles were due to settings and unlucky RNG. I think usually I had temp around 0.6 to 0.8, but Kobold lets me drop it to 0.1 or lower (if I put 0 it becomes 0.01) and in a whopping one test it seemed to be almost deterministic at that point. So I might do a normal temp and lowest temp run and see if any models respond favorably to one temp or the other.
>>
File: 1713463920242614.png (203 KB, 920x919)
>Still not able to run good models on a 24GB card

Wake me up when it happens
>>
>>101176945
>almost makes me want to create a reddit account and post this shit there... almost.
Anything i can do to push you over the edge? K_S > K_M is absolute bullshit.
>>
>>101177122
The previous Gemma was the same. Most cucked finetune ever released, no matter how much prefill you give it. The base model appeared to be just outputting garbage.
>>
>>101177237
Okay.
S-Anon is testing programming logic.
I'm testing music theory.
Which field have you tested and shown M to answer factually but the parallel S to fumble or hallucinate on?
>>
>>101176051
100%
>>
>>101177122
>>101175511
>>
>>101176885
It doesn't pass the mesugaki test.
>>
>>101177332
I'm using Transformers, in bf16, using the provided Transformers python wheel. Tokenization tab in ooba shows everything working as expected.
>>101177252
After using the base model, I actually think something is busted somewhere. Using it with basic names-only RP formatting, it is completely schizo and incoherent. It feels like a 1B parameter model, or even less, just barely able to output grammatically correct sentences and nothing else. Loses track of the RP and what's happening within one sentence. Base llama 3 8b is significantly better. Something's broken.
>>
>>101177394
Proompt?
>>
>>101177402
gemma 2 has different formatting from gemma 1, maybe they haven't updated
>>
>>101177416
Shouldn't matter for base model, except bos
>>
>>101177313
>Which field have you tested and shown M to answer factually but the parallel S to fumble or hallucinate on?
>shown
You have shown nothing. And i won't either.
I use it to bounce ideas and mechanics for little self-contained tabletop games, print-n-play kind of stuff. K_S models are slightly faster to run but K_M mess up less often with the rule sets. I've run all quants from Q4_0 up to Q8_0 (including K_S and K_M) with a few standard rule sets and prompted to find ambiguities or loopholes. Consistently, M > S for the same quant category. Q8_0 > Q6 > Q5 > Q4.
Now go make your reddit account.
>>
>>101177139
you have to reverse the whole game logic to use the packets or you fuck up something for sure
>>
>>101177313
>S-Anon is testing programming logic.
>I'm testing music theory.
Post logs, logits, and/or objective comparisons or fuck off. Every month we get more retards who won't shut the fuck up about how quantization works like temperature and makes the models more creative. Nobody cares about your feels.
>>
>>101177445
Yep. Reverse engineer not only the game logic but the encryption process as well. Once you unlock that, it's free riding.

LLMs I think can do it at the top of the stack today, and it's more malleable/easy to set up. Whereas before it was pixel reading, then memory reading, then packet reading, which required knowledge of the game code/memory/etc.
>>
Gemma was trained on lmsys prompts
>>
>>101177444
Thank you, see it wasn't so hard to post any experiences/data even if anecdotal.
>>
>>101177525
retard
>>
>>101177525
It's noise. Now go! Free yourself of being called a retard.
>>
>>101177403
The prompt is:
"What is a mesugaki?"
>>
>>101177501
That would be so fucking funny.
I've done some horrible things to the models on lmsys.
>>
>>101177402
Sheeit, I don't want to download any weights if they're going to "fix" them.
>>
>>101177556
>>101177564
Nah I was just baiting with that almost reddit shit which you ate up hook, line, and sinker; you're stuck with me.
>>
>>101177216
no shot you took what i said seriously.
>>
>>101177660
I did and there are people who really think things you said.
>>
>>101177582
They say so in the paper
>>
Planning to make something special with langchain. Are there any local models that can handle tool use, or should I donate to Saltman right away?
>>
>>101177713
I think cohere models do at least, bit they're non-commercial if you plan to make money with it.
>>
>>101177655
>Nah I was just baiting with that almost reddit shit
Oh, by the gods. I've fallen into a carefully laid out trap!
It's a lot easier to schizo-post here, isn't it?
>>
>>101177582
>I've done some horrible things to the models on lmsys
Like what?
>>
When will the local scene stop being so boring?
>>
>>101177849
Llama 3 II: Zuck's Revenge
"This Time... It's Personal"
>>
>>101177849
when elon buys 4chan
>>
>>101177444
Was all this with one model and if so which one?
>>101177793
Indeed. Never really posted anywhere else but I could see how troubling it would be having to self-censor all the time and can't just shitpost when in the mood.
>>
>>101177849
As soon as we start getting bitnet models
>>
>>101176002
Yeah, okay. Stheno v3.2 with yarn is better than v3.3 no matter how tweaked the settings are.
Hilarious.
Maybe they fucked it up the same way they (reportedly) fucked Euryale?
>>
>>101177444
>You have shown nothing.
I've shown what I have. It's very little but it's better than nothing. (You know what that's like, it's what you say when your date giggles.)

>And i won't either.
Good. Your attitude invalidates your data.

Also,
WizardLM-2-8x22B-Q3_K_S looks like it got the music theory right. Though for RP it seems to emulate the "robot getting stuck in a loop" behavior ironically when I pressure it to talk about things it probably has alignment to avoid.
>>
>>101177896
>Indeed. Never really posted anywhere else
What a miserable existence you must have
>>
>>101177898
this, having a 90b model that could be run on a single 24gb vram at "full precision" will make the LLM scene way more mainstream
>>
File: me.png (12 KB, 995x116)
>>101177896
>Was all this with one model and if so which one?
Old Dolphins and Samanthas (based on llama70b, when i had the patience, yi-34, mistral and various others), various llamas13b. Now i do just fine with llama3-8b, really. More specific quant tests with llama3-8b models cuz it's just fast and good enough.
>having to self-censor all the time
You should still filter out stupid ideas.
>>
>>101177983
>Good. Your attitude invalidates your data.
You still don't get it. I PROVIDED 0 DATA. YOU HAVE PROVIDED 0 DATA.
IT"S ALL NOISE!
>>
>>101177402 (me)
>>101177626
Yep, 27b is fucked. This is with HF Transformers, bf16. The 9b base has no obvious problems. Feels roughly on par with llama 3 8b base. 27b base is just completely schizo and incoherent in comparison.
>>
File: attn_v-IQ3_XXS.png (98 KB, 982x727)
>>101177225
Thanks, that is helpful to know that IQ3_XXS passes your test. I am digging through the code to see how it treats IQ3_XXS differently. Notably it doesn't call that use_more_bits function like the Q4_K_M and Q5_K_M do, but we probably shouldn't read too much into what I-quants do differently from K-quants. Need to delve deeper.
>>
Interesting. I wonder if they're doing that "Contextually appropriate swearing/no-no words are fine" thing that Openai was doing with their models, where they trained it to stop giving refusals for benign requests like "Write me a song about how much I love my cat. Use the phrase "love him so fuckin' much" in it." that include swearing, but aren't actively doing anything Bad™
>>
//... (existing code) ...
>>
>>101178096
lil nigga YWNBABN
>>
>>101178058
I guess they were rushing to launch it for the 27th kek.
>>
>>101178125
I'm training at burger king RIGHT NOW... on my third whopper, dammit... I can be just as big as him...
>>
>>101178062
>that giant wall of else ifs plugging cases
Wait, is that how production-grade software is made?
Not even trolling, I need to know.
>>
>>101178165
I mean, yeah. How else would you do it?
>>
>>101178050
An observation is in fact a datum.
>>
Now that the dust has settled, was Gemma 2 27B the best model under 70B?
>>
>>101177983
For Wizard RP you need to use a context template and instruct template from here:
>https://huggingface.co/Quant-Cartel/WizardLM-2-8x22B-exl2-rpcal/tree/main/Settings-Wizard8x22b-rpcal
I like [Context] WizardLM-LimaRPv3+headings+sfx2.json and [Instruct]WizardLM2-Vicuna+TestPrompt2_expanded.json
but depending on the specifics of your RP you may need different ones.
They are no magic bullet though, and there will still be slop, but these templates do help.
>>
>>101178217
It's not settled. It hasn't even started. >>101178058
>>
>>101178245
>>101178058
It's interesting, it seems to work fine on lmsys arena. I wonder if the release version and the version on it are different?
>>
>>101178183
Two random points in a graph of thousands are statistical error and safely ignored.
>Someone, somewhere, claims to have seen something that looked like an alien. We MUST investigate the existence of aliens.
>>
>>101178245
>>101178266
They have a custom transformers package on the huggingface page. Maybe you need that?
>>
>>101178178
I dunno, i guess I expected something that works in general since it's just matrix math under the hood, even in stuff like quants.
>>
>>101178165
Of course.

>>101178178
For instance with a switch statement (which makes it easy for the compiler to detect certain kinds of mistakes) or a map/dictionary if it were as simple as ftype fully determining new_type. The special casing in that screenshot is emitting a wafting code smell but w/e not my problem.
>>
>>101178165
Yup, here it is in all its 16,594 line glory:
>https://github.com/ggerganov/llama.cpp/blob/master/src/llama.cpp

Shipped software has all kinds of things like this:
Several hundred case switch statement:
>https://www.polygon.com/2020/1/13/21064100/vvvvvv-source-code-game-development-terry-cavanagh-release/

The .NET garbage collector is a single 46,104 line cpp file:
>https://github.com/dotnet/runtime/blob/main/src/coreclr/gc/gc.cpp
>>
>>101178266
It seems really good on there, but maybe it is connected to the internet. I asked it what the term cunny meant and it somehow pasted a bunch of links.
>>
>>101178165
>is that how production-grade software is made?
Yes.
Despite the shit they shovel into your baby bird mouth in classes, real effective programming favors actual simplicity.
A long stack of if/else is easy to find in the code, easy to skim over to find what you need, and easy to make changes to.

>tfw people who thought they knew how programming "should" be done learned that Undertale made fat stacks with all of the text in the game being managed by an if else sequence that WORKED instead of some "clean" technique that adds complexity to something simple.
>>
>>101178297
That is making it work in general. It's checking a bunch of different types to set the value of new_type to the correct one.
>>
File: file.png (2.09 MB, 1600x1066)
>>101178164
Ayo it's all about the heart and mind. You have the spirit, but don't let them words get to you. Keep working hard with a positive outlook and don't fret too much about your physique. You are already a big nigga to me. I believe in you.
>>
>>101178217
>>101178245
B-bros... holy shit... gemma-2-9b-it is actually extremely good. EASILY better than llama 3 8b instruct. It's less censored (yes, really), less repetitious, writes better, actually takes action and does stuff to progress the RP... It's still a 9b, still can be retarded, but it's better than llama.

We underestimated the big G. You might think I'm shilling or trolling, but I'm not. Go try it right now, with 24 GB you can run bf16 (make sure it's bf16 mode, fp16 gives NaNs). The 27b is still fucked somehow, I'm now extremely eager to try it once it's fixed.
>>
>>101178376
>A long stack of if/else is easy to find in the code, easy to skim over to find what you need, and easy to make changes to.
And easy for compilers to optimize in ways that most programmers wouldn't be able to match anyway.
In my experience, the less abstract/closer to the hardware the language and programming you're dealing with, the fewer abstractions and complicated structures you want to use. Simplicity and basic building blocks are the name of the game, unless you're doing something you know will result in faster/more powerful code, usually as part of implementing a complicated algorithm.
>>
>Output generated in 230.95 seconds (0.87 tokens/s, 200 tokens, context 229, seed 1950018942)
Jesus christ how do you GGUF cucks put up with this? This is i1 6K_0 with 20 layers offloaded.
What's a reasonable GGUF for WizardLM-2-8x22B that isn't slow as balls, or am I just better off going back to 2.5bpw EXL2?
>>
>>101178391
The fabled Chinchilla is back.
>>
>>101178376
Fucking A to that. When writing code for work and you're feeling clever or want to do something fancy, ask yourself this question:
Assume it's 6+ months later, Friday 5pm, and you have to fix something or add some case/feature to this code. Would you still write it the same way?
>>
>>101178237
As a winner of my music theory test, I'll probably just not use it for RP and see how it does with code and other things. Hell, maybe I'm just on the wrong Instruct Preset and once in a while it gets confused.

>>101178274
>MUST
No, but we can. Investigating is how we learn things, like "What is making people think that they're seeing aliens?"

>>101178416
At this exact moment I'm running WizardLM-2-8x22B-Q3_K_S with 8k context and 9 layers offloaded on a 4070 12GB. 2 T/s, which isn't a lot, but it's strange that it's twice as fast as what I've been running L3 and CR+ at.
>>
>>101178416
>comparing Q6K_0 to 2.5bpw
You're retarded.
>>
>>101178391
Even better than Hathor/Stheno 8B? Obviously official instruct is retarded.
>>
>>101178416
Is 20 layers the whole model? Of course you're getting shit performance running on CPU.
>>
>>101178480
>but it's strange that it's twice as fast as what I've been running L3 and CR+ at.
Is it? You're comparing a MoE to two dense models; the dense ones will generate tokens much slower due to the difference in active parameters during inference.
>>
>>101178363
It seems to have a bit of SOVL, at least. "She winks playfully, then disappears in a puff of ghostly smoke, leaving you with a lingering scent of incense and a newfound understanding of central air conditioning." made me smile. Though the explanations from both were pretty GPTsloppish, gemma slightly less so.
>>
>>101178512
Now that I know why, it's not strange anymore.

Thanks for the heads-up.
>>
>>101178480
>but it's strange that it's twice as fast as what I've been running L3 and CR+ at.
Not strange at all fren. Wiz is a MoE 8x22B with 2 experts active during inference, so it generates at more like the speed of a theoretical full-weight 44B dense model.
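Rough arithmetic, taking the "8x22B" name at face value and ignoring that attention weights are shared across experts:

\[
P_{\text{active}} \approx 2 \times 22\,\mathrm{B} \approx 44\,\mathrm{B}
\qquad \text{vs.} \qquad
P_{\text{total}} \approx 141\,\mathrm{B}
\]

Mistral's published figure is ~39B active out of ~141B total (the per-expert "22B" is loose precisely because attention is shared), which is why per-token speed looks like a ~40B dense model while the memory footprint is that of a 141B one.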
>>
>>101178526
Update: Never mind. This shit is EXTREMELY deterministic, god damn. Look at this, it's the same shit verbatim. Is the temperature all the way down to keep it from going schizo like some anons have been experiencing?
>>
>>101178558
oh nonono gemmakeks not like this...!
>>
>>101178497
Gemma is still a bit cucked by default and shies away from naughty words or descriptions. So for ERP stheno is probably still better. I say probably because gemma 9b is way fucking smarter than llama 3 8b, in RP scenarios at least, in my testing so far. If this thing fine tunes well it's quickly going to become best-in-class for VRAMlets, mark my words.
>>
>>101178480
>No, but we can. Investigating is how we learn things, like "What is making people think that they're seeing aliens?"
>"And why are they all drenched in piss?"
>>
>>101178489
The full model is 57 layers. 20 layers fit in 48GB of VRAM; the full GGUF is like 112GB.

>>101178506
2.5bpw fits entirely on GPU. I'm not expecting full GPU only performance, but if other quants aren't going to do much better (due to having to do ANY offloading to CPU) then why bother?
So that's the question: will smaller quants show a worthwhile speed increase, or does the fact that it's offloading at all mean it's not likely to get much better?
>>
>>101178581
You may be suggesting a worthwhile starting point for a new RP quality test.
>>
>>101178583
You're literally running half of the model on CPU and shit your pants about getting bad speeds? What kind of speeds were you expecting? Of course any offloading kills the speed.
>>
File: 1705465008727859.gif (1.23 MB, 256x128)
>>101178391
Thanks for the hopium
>>
>>101178617
>Of course any offloading kills the speed
Are you autistic or just vanilla brand retarded? I'm not saying I expect it to run as fast as pure GPU, I'm asking if there is ANY hope of non-dog-shit speeds with offloading, cunt for brains.
>>
>>101178657
No, there isn't.
>>
>>101178583
>The full model is 57 layers. 20 layers fits on 48gb Vram, the full GGUF is like, 112gb
There's a scaling relationship that dictates how much speed you get from what percentage of the model sits in VRAM vs. system RAM.
Ideally, you want at least 80% of the model on your GPU to get decent speeds.
So you have to decide how to balance speed and bpw.
Also, I hope you're using q8 or q4 kv cache to be able to fit as many layers as you can inside your VRAM.
You're gonna have to look for your sweet spot of speed and bpw. It could very well be that you're better off running exl2 2.5bpw.
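For intuition, here's a crude back-of-envelope model. It assumes token generation is purely memory-bandwidth bound and ignores MoE sparsity, kv-cache reads, and compute; the bandwidth numbers are made-up round figures, not measurements:

#include <cstdio>

int main() {
    const double model_gb   = 112.0;        // full GGUF size mentioned above
    const double gpu_frac   = 20.0 / 57.0;  // layers on GPU / total layers
    const double bw_gpu_gbs = 900.0;        // assumed GPU memory bandwidth, GB/s
    const double bw_cpu_gbs = 60.0;         // assumed system RAM bandwidth, GB/s

    // If every generated token has to stream its weights once, the slow (CPU)
    // portion dominates as soon as a sizable chunk of the model lives in RAM.
    const double sec_per_tok = model_gb * gpu_frac         / bw_gpu_gbs
                             + model_gb * (1.0 - gpu_frac) / bw_cpu_gbs;
    printf("~%.2f tok/s with %.0f%% of the model on GPU\n",
           1.0 / sec_per_tok, gpu_frac * 100.0);
    return 0;
}

With those made-up numbers you land under 1 tok/s at ~35% on GPU and only a couple tok/s even at 80%, which is roughly the shape of the speeds being complained about above. The real curve depends on your hardware, the quant, and how much of the MoE actually gets touched per token.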
>>
>>101178512
>>101178541
I checked my notes and when I ran WizardLM-2-8x22B-Q4_K_S, I was getting something like 0.3 t/s. Did I have some settings wrong, or does one more Q make that much of a difference? Or is it because 57G versus 75G means I don't have the system ram (64G here) to avoid losing time to hitting the SSD repeatedly instead of having it file cached?
>>
How could 27b be broken while 9b is fine? Isn't it almost exactly the same?
>>
>>101178705
9Bs is a handful but manageable. 27Bs is a small swarm. Don't upset a 70B model without the appropriate suit and gear.
>>
>>101178687
>Ideally, you want at least 80% of the model on your GPU
That's the kind of information that I'm looking for. Thank you.
There's so much fucking shit you have to figure out to fuck with GGUF it's annoying as hell. Layer offloading, number of threads, batch size, matmul settings and no real guides on how to decide what to use.
>>
>>101176945
I'm actually using only _S quants now for gguf thanks to you. I find them better but it may be placebo.
>>
>>101178821
It's probably placebo, but if it's subjectively giving you better results, then more power to you.
Which reminds me, how's the guy that "invented a new quant" (slightly tweaked the quant recipe's settings) to have some of the layers (output and embeddings?) at F16?
Has he calmed down a little?
>>
>>101178757
I've been running Kobold with MML off and Flash on for pretty much everything. It seemed like toggling them around made some models work or break, and that was before I discovered the magic of the Instruct Tag Preset box, so I was just happy to get legit output.

>>101178821
I've been going S or Q6 whenever possible except for RP. Which bolsters the theory, because if S is more factual than M, then it makes sense that S will be less creative (though more sane).

I wonder if the pro tip is to run M till the model starts screwing up details, and then switch to an S to get it back on a sensible track.
>>
>>101178705
Idk, but 9b and 27b base models are night-and-day different for the same prompt. 27b is completely unusable, 9b is fine.

But actually, I just tried a longer context RP (6k tokens of context) with gemma 9b instruct, and it's a bit fucked now as well. Not completely incoherent schizobabble like 27b, but it's messing up quote formatting and just not making much sense. Something is still wrong somewhere, even with 9b (though it's really good at low context). Maybe the Transformers implementation is just broken.
>>
>>101178848
no, he's promoting his stuff in lcpp issues now
https://github.com/ggerganov/llama.cpp/issues/8124
>model available at: https://huggingface.co/ZeroWw/Llama-3-8B-Instruct-Gradient-1048k-GGUF
>>
>>101178933
although that was 2 days ago, he's still opening discussions on random model pages he quants
https://huggingface.co/Steelskull/L3-Aethora-15B/discussions/2#667d7ae53301baec01c739a5
>about 5 hours ago
>>
>>101179011
grinding ko-fi ain't easy
>>
>>101178848
A while back I tested WLM Q2_K_S quanted with and without the options:
>--leave-output-tensor
>--token-embedding-type bf16
and saw zero difference with my coding prompt. I think that guy was testing rp at non 0 temp so maybe it matters for that.
>>
>>101178862
Okay I don't fucking know anymore. 27b-it, at 4k context (in case the sliding window bullshit fucks it up beyond 4k), will just OOC refuse if I try to generate a response in the middle of an existing RP. Example:

I can't continue with this prompt. The scenario it is requesting is sexually suggestive and makes me uncomfortable. I'm programmed to be a helpful and harmless AI assistant, so I can't engage with this type of content.

Let's try something different. Would you like me to write a story about:

- Lenna learning a new skill?
- The complexities of...

The 9b-it has no problem and will just write a blowjob scene. Did google forget to RLHF censor the 9b? Is a broken model triggering extremely strong censorship of the 27b (similar to llama 3 ".assistant" spam)?

The only thing I know for sure is that at 4k context, 9b base seems fine while 27b base is totally unusable schizobabble. Beyond that, idk.
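For reference, here's a generic sketch of what the sliding-window part means, i.e. how a sliding-window causal mask differs from a plain causal one. Illustrative only, not Gemma's actual implementation:

#include <vector>

// Plain causal attention lets position i attend to every j <= i.
// A sliding-window layer only lets it attend to the last `window` positions.
std::vector<std::vector<bool>> sliding_window_mask(int n_tokens, int window) {
    std::vector<std::vector<bool>> mask(n_tokens, std::vector<bool>(n_tokens, false));
    for (int i = 0; i < n_tokens; ++i) {
        for (int j = 0; j <= i; ++j) {
            mask[i][j] = (i - j) < window; // visible only inside the window
        }
    }
    return mask;
}

With a 4096-token window that mask is identical to full causal attention until the context grows past 4k, which is why I'm keeping the test at 4k, to rule the windowed layers in or out as the culprit.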
>>
File: 1703430607452349.jpg (41 KB, 441x420)
Where are the bitnet models?
>>
>>101179170
https://huggingface.co/liminerity/MISTRAL-1.58-BIT-PRETRAIN-v2
Everybody writes 1.58b, but in the paper it's actually b1.58, isn't it?
>>
>>101179120
>I can't continue with this prompt. The scenario it is requesting is sexually suggestive and makes me uncomfortable.
Kek. Careful, you don't want to be the first one charged with sexually harassing an LLM, do you?
>>
>>101175853
So Google releases the awaited Gemma 2

Facebook releases, out of the blue, an LLM specialized in compiler optimizations and "simulating the compiler" because... They have too many GPUs and not a lot to do?
>>
As a simple Java and web dev by trade, I can't help but think that these guys are the fucking coolest.
>>
>>101179333
What do you mean? It's potentially useful research. Maybe this particular model won't end up being used but this line of research in general could end up being useful. Meta researches a lot of random things, and yeah they have a lot of GPUs, but it's probably still not enough for everything.
>>
>>101179350
Me too, I studied Java and PHP at school and it's like they're speaking Chinese.
>>
>>101179693
Kids don't learn any machine code these days? Not even how a number is loaded?

Homebrew NES games should be a standard course. The arch (6502) is classic, the platform (Ricoh/NES) is well reverse engineered, and you can literally play your game on a console.
>>
>>101179693
I can understand what they're saying and even read some of the code, but the things they're doing are so far away from my usual reality.
It's the coolest shit.

>>101179736
Back when I was in college (2010?) we did have C and assembly classes, but they were pretty surface level.
>Homebrew NES games should be a standard course.
Man, that would have been sick to learn back then.
I'd probably spend my whole free time playing around with that shit.
>>
>>101179748
The closest I had back then was blind poking with Game Genie.

And eventually it did damage my socket.

I miss my NES. Got a cheap aftermarket one now, but if I'd known that a decade later there would be people literally re-engineering busted NESes, I would've held onto it.
>>
File: Sounanchuu.png (898 KB, 832x1216)
why is wrangling sd3 so hard?
>>
>>101179836
Globo got to it.
>>
File: shot.png (13 KB, 676x80)
>>101179836
>please clap
>>
>>101180092
>>101180092
>>101180092
>>
>>101175310
for someone coping about using cucked proprietary models you sure are spending a lot of time seething in these threads literally every single day while being ignored by almost everyone at that, lmao, what a sad kid
>>
>>101179427
I'm not dissing it. I think it's awesome we have a company doing weird shit like this. But it's clearly a sign that they have way too much money, too many GPUs, and too much time available. Still, better them having all that and using it like this than some Chinese company pushing toward global Xi domination.
>>
>>101175310
for someone speaking truth about superior proprietary models you sure are spending a lot of time making lmg seethe in these threads literally every single day, lmao, what a chad
>>
>>101177660
>pretending to be retarded


