/g/ - Technology






File: IMG_20240630_010541.jpg (2.29 MB, 4000x3000)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101205004 & >>101197169

►News
>(06/28) Inference support for Gemma 2 merged: https://github.com/ggerganov/llama.cpp/pull/8156
>(06/27) Meta announces LLM Compiler, based on Code Llama, for code optimization and disassembly: https://go.fb.me/tdd3dw
>(06/27) Gemma 2 released: https://hf.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315
>(06/25) Cambrian-1: Collection of vision-centric multimodal LLMs: https://cambrian-mllm.github.io
>(06/23) Support for BitnetForCausalLM merged: https://github.com/ggerganov/llama.cpp/pull/7931

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
File: oig (13).jpg (107 KB, 1024x1024)
►Recent Highlights from the Previous Thread: >>101205004

--Studying Babies to Improve Foundational AI and Synthetic Data: >>101206053 >>101206360
--Anon's Struggle to Integrate Mistral.rs with SillyTavern and Gemma 27B: API, CUDA, and RAM Problems: >>101208799 >>101209341 >>101210250 >>101211162
--Google Accused of Cheating in Chatbot Arena with Gemma Model: >>101207038 >>101207301
--Gemma2 Verdict: Promising but Flawed Language Model: >>101205835 >>101208095 >>101208133 >>101208169 >>101210176
--Using A750 GPU for LLM Inference with Ryzen 5900x and 64GB RAM: >>101208835 >>101208885 >>101208961
--Magnum Model Issues with ChatML and MMQ Enabled: >>101210640 >>101210769 >>101210800
--LLaMA Devs Clash Over Vulkan Shader Removal: >>101210689 >>101213011
--Heroic GitHub Discussion on Sliding Window Attention Logic: >>101208230
--Google's AI Progress: Catching Up with the Industry Leaders: >>101207913 >>101207943 >>101207958 >>101208091 >>101208137 >>101208265
--Anons Share Deepseek Experiences and Discuss Cloud vs Local Model Pricing: >>101205229 >>101205422 >>101205468 >>101205512 >>101205808 >>101206209 >>101207307 >>101205994 >>101207588 >>101207662 >>101213358 >>101210238 >>101210321 >>101210257
--Anon's Claims about Self-Play Model Performance Spark Debate: >>101207279 >>101207300 >>101207326 >>101207354 >>101207380 >>101207594 >>101210205 >>101207400
--Ooba has Context-Free Grammar Support: >>101211624 >>101211659 >>101212005
--Newbie Seeks AI Model Recommendations for 4070ti and 32GB RAM Setup: >>101209163 >>101209186 >>101209187 >>101209195 >>101209229 >>101209252
--Koboldcpp CUDA Out-of-Memory Error with 3090+4090 Setup: >>101207089 >>101207871 >>101207918 >>101207952 >>101207965 >>101207997 >>101208140 >>101208289 >>101208740 >>101212050
--Seeking a Good Local Voice Cloning Tool: >>101207597 >>101208098
--In Situ Quantization: >>101213505
--Miku (free space): >>101207535 >>101207577 >>101211472

►Recent Highlight Posts from the Previous Thread: >>101205012
>>
>>101198756
https://huggingface.co/ChuckMcSneed/control_vectors/blob/main/command-r-plus/unslop1.2/control_vector-commandr-unslop1.2.gguf
I've made another unslop vector, this one focusing on NSFW. The good news is that it successfully kills slop during NSFW, unlike my previous vector, which was made to kill SFW slop. The bad news is that it kills performance as well if you try to roleplay with it. Poor CommandR doesn't know what's happening most of the time and is trying to guess words, so it needs EXTREME handholding. Normally I would blame it on the control vector, but since it's almost impossible to prompt away the slop during NSFW, I'm afraid there is a problem on a much deeper level. What likely happened is that the model learned to associate the slop with sex from all those shitty humanslop novels; the diversity of the data is very low, so anything deviating from slop-style sex is almost completely unexplored territory. The model simply doesn't know how to write about fucking in any other style. We need better datasets.

Should I try dbrx next? Maybe if it knows lots of trivia, it knows lots of writing styles? Sadly, the official tune is shit and the only other tune is GPTslopped dolphin.
>>
>>101214216
Dead general.
>>
so which is the best model currently for japanese?
>>
>>101214510
Dead poster (RIP)
>>
>>101214617
GPT 4o
>>
>>101214685
i heard claude is better when it comes to closed shit but i want local
anyway not gonna buy your subscription gonna go claude if no luck
>>
So now that the dust has settled, is the LLM industry stagnating? It's been almost 2 years since gpt-4 got trained and there is still no model that is a league beyond it. Even Llama 3 400B will be around gpt-4 range.

Is this it? Are all future models just going to be around gpt-4 from here on out, with maybe minor improvements, QoL features and better quantization/inference techniques?
>>
>>101214692
:(
>>
File: 43676 - SoyBooru.png (61 KB, 411x485)
The image is a humorous caricature of the "Wojak" meme, also known as "Feels Guy," depicting a character often used to express various emotions. In this version, the character is drawn with attributes suggesting a connection to OpenAI: the teal color and the OpenAI logo on the cheek. The joke likely plays on the idea of the character being an AI enthusiast or perhaps embodying the AI itself, with the typical Wojak expression giving it a humorous twist.
>>
What's the easiest path to training a model of a person based off of a collection of their writing?
>>
Question: 32gb of vram and 32gb of ram, what quant of CR+ should I go with? Is q2 even worth it? I can run 70b llama 3 fine-ish but slowly. Haven't ever tried CR, familiar with most other models in my performance range.
>>
how hard is it for retards to do math? EUGH!!!! EUGHHHH!!!!!!! 32 + 32!! WHAT DO I GET? EUUUUUUUUUUUUUGH!
>>
> mistral-7b-bitnet.gguf
a lot of time has passed and we still don't have any usable bitnet model.
>>
>>101214804
Sorry, but the average IQ in this thread is not high enough to make most people catch the correlation between model size and the amount of RAM they have.
>>
lazy ass gerganov
>>
I have 8 vrams and ddr. What model can I walk?
>>
>>101214888
no
>>
>>101214510
This.
>>
>>101214804
>>101214853
>>101214888
>>101214937
funny how easily reddit tourists can be spotted.
>>
>>101214958
Sorry for not looking up the size of each quant on huggingface for you, nigger.
>>
Hi anons, can i prompt multimodal models like llava with a few examples, like game icons, and then have it recognize and label them when i send it more? I was thinking if itd be viable to make some automation scripts
>>
should i lurk on /aicg/ to find interesting system prompts? will they help not getting slop?
>>
>>101214995
Try using a good model, you won't need a system prompt.
>>
>>101214995
why don't you ask your favorite schizo mix to rewrite your system prompt for you?
>>
>>101215019
i can't. i'm a vramlet
>>
>>101215026
Then yes, you should go to /aicg/. And stay there.
>>
I'm a vramlet, I've been here since lmg split from aicg. I do not ask, I do.
-Anonymous
>>
>>101214958
>being able to do basic math and logical deductions is now reddit
>>
>>101215133
r/clevercomebacks
>>
>>101214804
so stop jokin around, what do you get
>>
>>101215183
55
>>
No reasonable person will bother with local models when Sonnet 3.5 is out there for free.
>>
>>101215426
Speaking of free Sonnet 3.5

https://openrouter.ai/models/anthropic/claude-3.5-sonnet/apps

Will this startup be alright? Because damn that's a lot of usage.
>>
>>101215449
>22k$
kek
>>
>>101215461
>i calculated with input token prices
>110k$ for output tokens
rip startup
>>
What's the state of Gemma 2 27B? My understanding is that some changes to llama.cpp were missing, which is why it is so retarded. With those changes being merged, do we expect the 27B model to become usable, or is it fundamentally broken?

I expected there to be more activity because IQ4_XS should fit perfectly in 16 GB VRAM with 4K tokens.

Speaking of which, I haven't kept up with new models and tunes. Anything interesting that fits in 16 GB VRAM or gives ≥10 t/s with offloading? I remember being unimpressed with the first few Llama 3 8B finetunes compared to Yi 34B.
>>
Off-topic but that Kling shit is pretty nice.
>>
>>101215609
I haven't found anything that wasn't retarded.
The smallest model I've seen pass my quick tests is a 41GB quant of Qwen2, and a 27GB quant of Aya, though it's a provisional pass and to fit the 13B into 16GB would be Q2_K and that's probably lobotomized.

Maybe try Aya 8B at Q6_K or Q8_0? Bigger than your VRAM but it should still be peppy and it's the only small model I've seen that didn't immediately make me facepalm.
>>
>>101215133
>bragging about his high IQ and abilities
The main sign that you have neither and are just looking for an excuse to stir some shit on a chink basketweaving forum; an actual high-IQ fag would hide his power level.
>>
>>101214325
What if instead of removing slop we tried injecting SOVL? Does anyone know good prompts for soulful writing?
>>
Can't use flash attention with gemma 2 and the latest build of llama.cpp?
>>
Do local models take up roughly the same size in gpu memory as they do on disk? Is there a difference between the disk size of safetensor and pickles?
>>
>>101214713
Truthfully, GPT4 was way too big, and they should've just trained a dense 180b for much much longer.
I think Anthropic plans to do basically that with Opus 3.5 (no way it's bigger than 200b dense, see Yi Large being 132b and falling squarely in the middle between L3 70b and Opus/Furbo), and considering you can get effectively a SOTA model out of 70b as evidenced by Sonnet 3.5, Opus 3.5 might be the first legitimate capacity leap.
I also think we are early when it comes to steering these fucking things without a shitton of data, and synthetic slop is the holdover until we get more stable / generalizable ways to optimize preference from only a couple examples (see: https://github.com/uclaml/SPPO)
I don't think we are anywhere near approaching theoretical limits, and the optimal parameter size of a language model trained on everything humans have ever written is most probably in the multi-trillions rather than hundreds of billions.
>>
>>101215871
Continuing..
Or should I just multiply the number of parameters by the per-weight size (e.g. 2 bytes for fp16, or less if quantized) to determine the in-memory size of the model?
>>
>>101215865
>llama_new_context_with_model: flash_attn is not compatible with attn_soft_cap - forcing off
Ah. I see.

>>101215871
The model itself yes, but there's context to consider too, so a model in use can take anywhere from a little more to a lot more memory than its size on disk, depending on the size of the context, the techniques being used (GQA), etc.
>>
>>101215884
A useful heuristic for me is that the number of GB a model takes in memory (before context) is roughly the same as its parameter count at q8_0, and double that at full precision / fp16. E.g., Mixtral 47b is ~47GB at q8_0 precision, and 94GB at fp16/bf16.
Context size memory used will also depend on quantization and whether or not the model was trained with GQA.
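That heuristic is just params x bits-per-weight / 8, so it's easy to turn into a quick calculator. A minimal Python sketch (the bits-per-weight numbers are rough ballpark values, and context/KV-cache memory is deliberately left out):

# Rough size of the weights alone, in GB; context/KV cache comes on top of this.
# Bits-per-weight values are approximate ballpark figures, not exact quant sizes.
BPW = {"fp16": 16.0, "q8_0": 8.5, "q6_k": 6.6, "q4_k_m": 4.8}

def weight_gb(params_billions: float, quant: str) -> float:
    # params (in billions) * bits per weight / 8 bits per byte ~= GB
    return params_billions * BPW[quant] / 8

for q in ("fp16", "q8_0", "q4_k_m"):
    print(f"Mixtral 47B @ {q}: ~{weight_gb(47, q):.0f} GB")
# -> ~94 GB at fp16, ~50 GB at q8_0 (the "1 GB per B params" rule rounds this down), ~28 GB at q4_k_m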
>>
>>101215889
Thanks
>>
>>101215738
Thanks, I wasn't aware of Aya until now. The 35B version should fit using IQ3_XXS which might just be borderline usable. I'll give this and 8B Q8_0 a go.
>>
whats undi up to nowadays
>>
>>101215609
Old quantizations should be requantized with the latest version of llama.cpp.
The sliding window isn't working yet, so the model has effectively 4k context. But FlashAttention isn't compatible with it, so you'll have large memory usage anyway. Still, 4k context with 6.5 bpw 27B Gemma-2 is attainable on 24GB.
A possibility is configuring the sliding window to 8k tokens, which should disable the SWA mechanism. It works but I haven't tested it in depth.
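If anyone wants to try that 8k sliding-window override, one way to do it (a sketch, untested here; it assumes the HF checkpoint's config.json has the usual sliding_window field and that whatever you convert or load with actually respects it) is to bump the value before converting:

import json

# Hypothetical local path to the downloaded Gemma-2 HF checkpoint.
cfg_path = "gemma-2-27b-it/config.json"

with open(cfg_path) as f:
    cfg = json.load(f)

print("old sliding_window:", cfg.get("sliding_window"))
# Match the full 8k context so the sliding-window attention effectively never kicks in.
cfg["sliding_window"] = 8192

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)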
>>
>>101216080
context less than 32k is unusable, gemma is a bad joke
>>
my heart hurts when i edge, is this normal
>>
gemma 2 9b working good. especially with my language.
>>
>>101216118
Have you tried asking Dr. Llama?
>>
>>101214713
You just need to wait for open AIs tech breakthrough. 500bx16 200T tokens gpt 5
>>
>>101216100
32k isn't shit, I don't get out of bed for less than 65k context.
>>
>>101215882
I believe (without knowing too much about LLMs) GPT-4 could be much better than it currently is if we had access to its base model. All commercial LLMs (probably all instruct tunes actually) are heavily RLHF-lobotomized. Afaik, the degradation of reasoning capabilities this causes is of surprising magnitude. Of course you need to do some tuning for it to follow instructions, but I expect any tuning other than that specific for your use case will actively hurt its performance. If, in theory, we could take the GPT-4 base model and only fine-tune it to roleplay, it would be unparalleled. I also speculate that Claude's vastly superior ability to imitate writing styles and to feel much more sovlful is almost exclusively caused by differences in alignment/fine-tuning, and not by differences in its training data. So GPT-4 probably isn't inherently sovlless.

And it's the same for other use-cases I think. Issues like hallucinating are probably LLM-specific and therefore impossible to get rid of. But fine-tuning one of the large models would still yield something much more powerful than training a bigger model that's then lobotomized again.

>>101214713
>the optimal parameter size of a language model trained on everything humans have ever written is most probably in the multi-trillions rather than hundreds of billions
Is this speculation or do you have any source that has evaluated this? My gut tells me that the largest models still aren't nearly as efficiently compressed as theoretically possible, but I also know nothing about the mathematics behind entropy and information.
>>
>>101215183
i'm not telling you any thing, retard. do basic math. if you can't, just take all your electronics and throw them in a bathtub full of water, then jump in yourself.
>>
>>101214713
imo the stagnation in the LLM field (especially local) is a consequence of the collective expectation that the 'perfect' model is going to be released any moment now. This belief has led to complacency about implementing features with traditional programming; after all, why waste endless hours trying to code something that the next model might be able to do out of the box? We really need a big wake-up call that makes people go "hey, maybe I need to hardcode reasoning, memory and anti-slop abilities around these models". My hope is that GPT5 gets released and ends up being only marginally better than GPT4o.
>>
>>101216303
Why would anyone bother learning basic math when we have tools to do this for us?
>>
I've now worked out how to use mistral.rs successfully with ST. You just have to disable streaming and put the detected model name in the model line of the API tab.

Now what? I guess I'll test if it can really do 8k, first.
>>
>makes models for vramlets
>tells them they can't use quants
genius
>Also, if you're using gguf or other quants, stuff is broken there. PoSE doesn't play well with quants.
>https://huggingface.co/Sao10K/Fimbulvetr-11B-v2.1-16K/discussions/2
>>
>>101216246
>is if we had access to its base model
yeah, I'm sure some kofi finetuners throwing some shitty erp chatlogs at it would really improve gpt4
>>
>>101216453
don't see you doing anything better
>>
Any recommendations for 32GB with 16GB VRAM?
>>
>>101216505
>>101216012 here. Aya 35B IQ3_XXS barely works; I had to check Offload KV, otherwise it would run out of memory. I decreased the context size to 4096 before, but with KV offloaded, 8192 should work too. Getting a bit over 10 T/s. It's surprisingly good, subjectively much more engaging than any Llama 3 8B or Mixtral tune, but sometimes it's a bit retarded because of the quantization. Still, a very good first impression so far.
>>
>>101216429
Top 10 ko-fi betrayals lmao.
Talk about biting the hand that feeds
>>
>>101216621
Happy to hear it.
I guess Aya's been kind of a sleeper. Did nobody care about c4ai till CR(+) got the coomers cooming?
It claims to be multilingual, too, so I'm looking forward to trying it as translator and maybe coding.
Have you tested a fatter quant with partial unloading? I get 2½t/s on Q6_K. It might be worth the time to get fewer retard moments.
>>
File: file.png (131 KB, 1080x606)
lul
>>
OK so it looks like mistral.rs and 27B won't work with larger context. For some reason, it has a HUGE spike in memory usage when it begins inference, and even if you quant it down to Q2K, it still can't go even beyond like 3k before the spike results in an OOM error. I'm literally using a damn 3090 and it can't fit both the model at Q2K and the memory spike. What the hell.

I guess in the end it's practically the same as not having support for Gemma beyond 4k context kek.
>>
>>101216870
>mistral.rs
>>
>>101216886
Someone posted that it supported Gemma 2, so I thought I'd see just how bad it is. Yeah it's bad.
>>
>>101216886
You forgot your message.
>>
>>101216905
>>
>>101213854
XCOM 2 with Long War of the Chosen mod.
>>
>>101212050
>For regular /lmg/ use 2 kW for 6 4090s is unproblematic because the software is currently not efficient enough to parallelize them in such a way that each GPU draws a lot of power.
For compute-heavy tasks you have to limit the boost frequency in order to avoid peaks in power draw that cause instability (and then there is basically no benefit in getting 6 4090s instead of 5).
I'm surprised you can even get 5 to run on 2050w that's actually crazy
>you have to limit the boost frequency in order to avoid peaks in power draw
do you do that by staggering the allreduce or gradient update during training or something similar?
>>
>>101216246
>also speculate that Claude's vastly superior ability to imitate writing styles and to feel much more sovlful is almost exclusively caused by differences in alignment/fine-tuning, and not by differences in its training data. So GPT-4 probably isn't inherently sovlless.
100% correct & a deliberate design decision as confirmed by them various times.

>>101216246
>could be much better than it currently is if we had access to its base model.
Kinda cope, you need very good data to align the big model to the desired distribution, open source just has slop data at the moment so not happening.

>Is this speculation or do you have any source that has evaluated this?
Impossible to empirically evaluate because it would cost a ridiculous amount, but for reference, I remember seeing loss averages of like ~1.5 for llama3 70b, and llama3 8b having ~2.2 when I did a test on English web data from FineWeb.

If nearly 10x the parameters on the same data is a ~1.5x relative average loss improvement on _15 trillion tokens_, that tells me the optimal theoretical size to get below ~0.3-0.4 loss on average for English (without a metric fuckton of epochs) over a cleaned internet text corpus is probably a dense model with a param count in the low trillions. "Everything humans have ever archived or written online" is an exceptionally broad thing to model.

However, multiple epochs on a smaller but "big enough" model makes more economic sense. You have to actually deploy it later on, and you can't serve a trillion parameter beast at scale forever on current HW without losing money (OpenAI discovered this and as a result distilled 4 into Furbo).

I think that is why 4o and Sonnet 3.5 exist; they are pushing the most they can out of mid-range models with more compute, and the fact that they are roughly equal to the original GPT4 in performance is a consequence of a deliberate design decision to save costs.

TLDR; the economic sustainability of going bigger is what is plateauing.
>>
>>101214713
LLM industry shifted from ungabunga-style throwing one big prompt at one big model to sophisticated agent workflows
>>
>>101216999
I do it via commands like
sudo nvidia-smi --lock-gpu-clocks 0,2000 --mode 1

IIRC you can then reliably draw something like 300 W per 4090 without running into stability issues and with ~90% of the maximum potential performance.
Notably setting a power limit via nvidia-smi does NOT help to reduce spikes in power consumption.
If you don't limit the boost clocks 4 4090s running in parallel can already lead to instability.
>>
you ok there buddy?
>>
>https://huggingface.co/UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3
>there's an SPPO version of Gemma 2 9B now
Welp. Is this the model to use for VRAMlets now, assuming SWA gets supported?
>>
hello, I am insane. What is the best model for me?
>>
>>101217171
goody2
>>
>>101217171
Petra-13b
>>
>>101217131
holy fuck that was fast, less than a week after the llama3-8b-SSPO release
>>
>>101217131
I hope he'll do it on gemma2-27b aswell, the potential is here
>>
>magnum gguf doesn't behave
>check debug log
>bos_token and eos_token are set to the same
>qwen2 model config has set bos to null, chinks say it is intended.
>bos gets force-added by kobold and can't be null in the gguf thanks to that reddit schizo
It's annoying how often I find quants or models with broken configs or wrong bos or eos tokens.
At least magnum now outputs eos token on high temps with the "right" bos token.
>>
>>101217131
>LC. Win Rate of 53.27
Holy shit this is actually crazy if legit. This would literally place it at third place on the verified AlpacaEval Leaderboard, just below 4o and furbo. Even on the unverified leaderboard, it would place above all other community fine tunes. Even if these are all scams, it probably says something that 9B SPPO beats all of them.
We're so back.
>>
File: GRCPSc7XIAARV4U.jpg (123 KB, 1638x2047)
https://x.com/WABetaInfo/status/1806101428609622181
Wtf? They finished Llama3-405b?
>>
>>101217171
Grok-1
>>
>>101214713
imho the ability of a neural network to reason is related to its depth, not its width; the latter only allows it to store more data and understand more concepts
llms are very wide but not that deep, and there doesn't seem to be a lot of research on very deep ones, transformers with hundreds or thousands of layers still suffer from the vanishing gradient problem
>>
>>101217383
Probably still testing. But yeah it's almost July, you know, when they were originally planning to release L3. I'd still be more confident about a late July release though.
>>
>>101217361
>>xkcd/927
Wouldn't it have been nice if there were a shim to automatically handle conversion between the tokens that the model declares and what the interface is set for so everything Just Works™?

>>101217383
Not 1.58, won't fit my (V)RAM limits, not relevant to my interests.
>>
>>101217383
>(preview)
probably still in the final stages of tuning
>>
>>101217369
9b, come on. How can it possibly be legit.
>>
>>101217490
By being trained on the test.
It worked for students in the oughts.
That's why the rate of brilliance jumped so much between the generation that went to the moon and the generation that went to argue about the color of a dress on Twitter.
>>
>>101217369
>unverified leaderboard
the unerified leaderboard has shitty 7b tunes above 4o and 3.5 sonnet, kek
>>
>>101217369
https://tatsu-lab.github.io/alpaca_eval/
>Yi-Large Preview is third
is this a closed model aswell?
>>
File: ComfyUI_00274.jpg (801 KB, 2048x1536)
>>101217361
exl2 quants of Magnum and Qwen don't work unless you add the proper EOS token to the config json. Sounds like a related issue.
>>
>>101217490
Because it's not model capability that's being tested but stylistic preference, using GPT-4 as the evaluator. If we can get models to be truly more like GPT-4 in how they answer requests, that's a good thing.

>>101217530
We don't know how smart those ones are though. Anons have tested SPPO and verified it's actually good.
>>
>>101217559 (me)
*BOS token, not EOS token
>>
>>101217102
interesting, thank you. I don't have access to 220v plugs right now, so I'm stuck at 1600w, which without changing the clock speeds means the most it can handle is 3 3090's at full power draw. I've seen dual PSU setups, but I was waiting to get more 220v plugs installed before I do that. I'm not seeing a lot of info about dual PSU setups, but I don't see a reason why I can't have one PSU for the motherboard/CPU/first 3 GPUs and then another PSU for the last 4 GPUs; if it's all on one 220v circuit with an Add2PSU I don't think I'll have a problem
>>
>>101217564
Well I should say they claimed it was good. I don't have any confirmation myself about their impressions of the model.
>>
>>101217131
https://huggingface.co/hrtdind/Gemma-2-9B-It-SPPO-Iter3-Q5_K_M-GGUF
Can this be run on ooba? Dunno if the latest llama.cpp_python package has all the fixes to run gemma
https://github.com/oobabooga/text-generation-webui/commit/66090758df4a2003974e0499b697f926fcb472ba
>>
>>101217592
I tried the L3 8B SPPO and it felt like L3. It didn't pass my music theory question but 8B never does; needs 70B and decent quant to pass that.
>>
>>101217602
Pretty sure not. Ooba is always like 2 steps behind on supporting stuff.
>>
>>101217616
unfortunately it's not really his fault, he has to wait for llama_cpp_python to update before making the new binaries
>>
>>101217614
Yeah I don't think it adds any knowledge to the model, it's just a fine tune after all. The AlpacaEval leaderboard is more about style and how a model reacts when answering a request.
>>
>>101217425
That's exactly what the tokenizer_config.json and generation_config.json are for. But the Qwen2 chinks decided to ignore bos entirely, and Llama3 decided there are now multiple eos tokens.

>>101217559
alpindale fucked up the config while fixing exl2, mistakenly setting eos and bos to the same token, and fixed it later.
But quants don't get remade if the model maker changes something, they are left to rot.
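If you want to sanity-check what a model actually declares before blaming the frontend, something like this dumps the relevant fields (assumes the usual HF folder layout; the model path is just an example and field names vary per model, hence the .get):

import json, pathlib

# Hypothetical path to a downloaded HF model folder.
model_dir = pathlib.Path("models/magnum-72b-v1")

for name in ("tokenizer_config.json", "generation_config.json", "config.json"):
    p = model_dir / name
    if not p.exists():
        print(f"{name}: missing")
        continue
    data = json.loads(p.read_text())
    keys = ("bos_token", "eos_token", "add_bos_token", "bos_token_id", "eos_token_id")
    # Only print the token-related keys that are actually present in this file.
    print(name, {k: data.get(k) for k in keys if k in data})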
>>
>>101217614
try this one now, maybe it's the first good small model >>101217131
>>
I'm afraid to say that AI seems to have plateaued.
>>
>>101217663
>Yeah I don't think it adds any knowledge to the model
Naturally it can't.
But the K_S theory is that some manipulations can damage the model's ability to access its knowledge. Following that, it's worth checking to see if there are advancements that might help it to dredge up facts that are in the training but don't bubble up in the usual techniques.

Given how bad the small models are, anything that helps them approach usefulness would be good to recognize.

>>101217688
So the framework for everything to function smooth as ice is there, but all these technical geniuses with their huge rigs that they spend massive amounts of time and energy on are goofuses who fumble a few lines in the config file and put all of that wattage into broken shit.

Typical.

>>101217709
I'm pulling the Q5_K_M GGUF right now, (just finished as I type this). Cursory testing on the way.
>>
>>101217739
Kobold's not ready for it. Cursory testing awaiting AUR update.

I'll test some Yi in the meanwhile and keep trying to think of the perfect coding question to add to my cursory testing series.
>>
>>101217730
yeah but whaddya gonna do
>>
>>101217739
>Following that, it's worth checking to see if there are advancements that might help it to dredge up facts that are in the training but don't bubble up in the usual techniques
Perhaps, but there is also the issue for SPPO that they're tuning on top of Instruct models instead of base, so it's probably another layer of difficulty to get the optimal knowledge out of the pretrain data.
>>
you still can't run gemma 2 9b or 27b beyond 4k context with llama.cpp, right?
>>
File: VaguelyUncomfortableMiku.png (1.27 MB, 832x1216)
>>101217627
>unfortunately it's not really his fault, he has to wait for llama_cpp_python to update before making the new binaries
Deciding to rely on a third party python wrapper kind of is his fault
>>
Can I put gemma into my ASS
>>
>>101217856
>python
The only fault.

Everyone is relying on third parties somewhere along the line.
>>
>>101214216
>nothing about hardware in the OP
so what do I buy, is P40's with custom shrouds off ebay still the meta for cheap VRAM?
>>
>wonder how Gemma 2 support is going for Exllama
>go and search the issues and prs
>literally nothing
>like not even discussion
Wtf?
>>
>>101218041
MITbros...
>>
File: unknown.png (968 KB, 1366x768)
bros can anyone share any alpha on voice to voice models ??

like omni models shown by openai

anyone workin on that??
>>
>>101218041
stop using exllama then.
>>
>>101218083
i mean theres whisper for voice to text, and there are many text to speech models
>>
i am GRIPPING
>>
File: MikuFitTrainer.png (1.2 MB, 832x1216)
>>101217865
>Everyone is relying on third parties somewhere along the line.
Yeah, but there's relying on core OS services, fundamental libraries like glibc and mature well known frameworks...and then there's being beholden to a python shim that adds little to no value
>>
>>101218161
Fork and choose not to need the shim, then.
>>
File: disappointment.png (1005 KB, 917x898)
>>101218031
>nothing about hardware in the OP
>https://rentry.org/lmg-build-guides
This is something
>>
>>101218128
turns out plugging them together is kind of a pain and the latency is insane.

so I was thinking, what if we just fundamentally remove the deps and merge the layers or something; wondering if anyone has tried playing around with that?
>>
/home/USER/llama.cpp/build/bin/llama-server -ngl 33 -m home/USER/Downloads/L3-70B-Euryale-v2.1-Q4_K_M.gguf

Just cmade llama.cpp. Why am I getting a seg fault?
>>
uhh guys, what's gemma2 context and instruct templates?
>>
>>101214617
command-r-plus

(sorry, ran out of Migus)
>>
>>101218274
specs?
>>
>>101218326
CR+ is kinda feeling like best overall. I haven't taken it programming but I might need to.
>>
>>101218335
I run the same thing in kcpp no problem, so specs shouldn't be relevant. I have a 3090 and 32 gb ram, though.

This works:
./koboldcpp --usecublas --gpulayers 33 --model /home/USER/Downloads/L3-70B-Euryale-v2.1-Q4_K_M.gguf
>>
>>101218277
You can modify these according to your preferences:
Context: https://files.catbox.moe/vcbyyx.json
Instruct: https://files.catbox.moe/hi0ho5.json
>>
>>101218274
I'm using make and not cmake, but I usually find doing a make clean will clear up otherwise inexplicable segfaults. Maybe remove the cmakecache?
>>
>>101218031
>>101218185
Worth noting that if you decide to build a mikubox, you should probably opt for P100s instead of P40s. P40s are starting to show their age. P100s are faster, supported in exllama2 and have fp16 tensor cores so can use flash attention.
But cuda dev says he will continue to support P40 and they still work if you just need a lot of VRAM for cheap.
>>
>>101218412
Thanks for the suggestion. I am just running commands without truly understanding them, could probably figure this out eventually, but... is everything relevant stored in the llama.cpp folder? Can I just delete the folder and start over?
>>
>>101218506
yeah you can, when using make you can also do make clean and when running make you can
-j12 (for 12 threads)
>>
>>101217602
>>101217627
It's easy to build llama-cpp-python with updated llama.cpp
>>
>>101218458
but p100s have only 16GB
>>
File: FlippantBusinessMiku.png (1.25 MB, 1216x832)
>>101218550
>easy
I've got it working, but calling it easy is being flippant. Unless you're a dyed-in-the-wool pythonfag it's really not easy and I don't think there are any guides
>>
>>101218599
i would cut off her head and use her as a throatpussy until i broke her
>>
>>101218336
cr+ seems like the best overall but i'm going to do a big comparison today, spent most of yesterday just catching up and downloading models, i have the following i'm gonna run a bunch of tests against

>dolphin-mixtral:8x7b-v2.7
>dolphin-yi-1.5-32k:34b-v2.9.3
>command-r:35b
>command-r_plus:104b
>deepseek-coder-v2:16b
>deepseek-coder-v2:16b-instruct
>gemma2:9b
>gemma2:27b (this one is still broken right?)
>hermes2theta:8b
>hermes2theta:70b
>lama3_sppo_i3:8b
>llama3-chatqa:8b
>llama3-chatqa:70b
>phi3:14b-medium-128k-instruct
>phi3:3.8b-mini-4k-instruct
>smaug_llama3:70b
>tess2.5_phi:14b
>midnight-miqu-v1.5:70b

I'm mostly gonna be testing programming and RAG tasks but i'll throw in some roleplay, my plan is to just run like 6 prompts 3x each against all models and then human eval accuracy vs gen speed, all models are gonna be running fully in GPU

any tips/any models i missed that should be up there?
>>
>>101218458
>P100s are faster, supported in exllama2 and have fp16 tensor cores so can use flash attention.
P100s have fast FP16 but they do not have FP16 tensor cores.
The FlashAttention Github page doesn't explicitly say whether or not P100s are supported but my impression is that V100 is the minimum.
llama.cpp FlashAttention definitely works with P100s though.
>>
>>101218652
>no petra-13b-instruct
ngmi
>>
>>101218142
It seems like you're expressing a strong emotional state. If you're feeling overwhelmed or anxious, it's important to take a moment for yourself to try and regain composure. Here are a few steps that may help you:

1. **Find a quiet space:** If possible, find a quiet and comfortable environment where you can sit down and focus on your breathing.

2. **Deep breathing:** Take slow, deep breaths, inhaling through your nose for a count of four, holding for a count of seven, and exhaling through your mouth for a count of eight. This technique is known as the 4-7-8 breathing method and can help reduce anxiety.

3. **Focus on the present:** Ground yourself in the present moment. Engage your senses by noting what you see, hear, touch, taste, and smell.

4. **Progressive muscle relaxation:** Tense each muscle group for a few seconds and then release the tension. Starting with your toes and working your way up to your head can help release physical tension.

5. **Reach out to someone:** Talk to a friend, family member, or a professional you trust about your feelings.

6. **Take a break:** Step away from the situation that's causing you distress, if possible.

7. **Practice mindfulness:** Engage in mindfulness exercises or meditation. There are many apps and online resources available to guide you.

8. **Physical activity:** Sometimes, physical exercise can help release built-up tension and stress.

If you find that your feelings of being gripped by anxiety or stress are persistent, it may be helpful to seek the support of a mental health professional who can help you develop strategies to manage and cope with these feelings.
>>
>>101218729
my heart hurts when i edge
>>
>https://huggingface.co/UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3/discussions/1
>We are planning to do 27B as soon as a stable release of transformers and vllm generation on Gemma-2-27B-It is available.

Shame about the context length, but will THIS be the best general use low context model for 24GB vramlets soon?
>>
>>101218677
I switched my Mikubox over to 3x P100 (plus 2x 3090). P100 enables the use of exllamav2, which compared to P40 on llama.cpp is faster, but if you want flash attention, the minimum is actually Ampere.
I can run command-r-+ exl2 and get 3-5 t/s with a full context. It doesn't seem that slow because prompt processing is very fast, and the reply begins to stream back after only a short delay.
>>
>>101218652
>ollama mode
The "tags" button gives you the Ollama names for the different quants. It'd be good to track that through your testing. I'm not alone in finding some quants to be better at facts that others, including the phenomenon that K_S seems to respect facts better than K_M, not only for parallel Q but even for smaller Qs.

I don't see any Qwen2 relatives in your playlist. Qwen2 and one of its spins, Magnum, passed my music theory and pop culture tests.
>>
>>101216303
Stop projecting your suicidal fantasies onto others
>>
>>101218458
>>101218185
thanks I missed that, new to this general
>>
File: 1719383601544693.png (103 KB, 600x600)
>>101218831
>qwen2:70b
>>
>>101218750
It sounds like you're referring to a sensation experienced during sexual arousal that is causing you discomfort. If you're experiencing pain during the plateau phase of sexual arousal (commonly referred to as "edging"), it's essential to address this with care. Here are some steps you can take:

1. **Stay calm:** Try to remain relaxed, as tension can exacerbate the sensation of pain.

2. **Reduce stimulation:** Temporarily slow down or stop the sexual activity that's causing the discomfort. Taking a break can help alleviate the pain.

3. **Communicate:** If you're with a partner, communicate what you're feeling. Honest communication is key to a healthy sexual relationship.

4. **Check for medical conditions:** Pain during arousal can be a symptom of an underlying medical condition, such as infections, skin disorders, or other sexual dysfunctions. It may be wise to consult a healthcare provider or a specialist in sexual health to rule out any medical issues.

5. **Consider lubrication:** If you're engaging in penetrative sex, ensure that there is adequate lubrication to reduce friction and potential pain.

6. **Look at technique:** Sometimes, the way sexual activity is performed can contribute to discomfort. Exploring different positions or techniques may help.

7. **Mindfulness and relaxation techniques:** Practicing mindfulness and relaxation exercises can help reduce anxiety and muscle tension, which might contribute to the pain.

8. **Seek professional help:** If the pain persists or if you're concerned, a consultation with a sex therapist or a medical professional specializing in sexual health could provide you with tailored advice and treatment options.

Remember, everyone's body is different, and what feels pleasurable for one person may not be the same for another. It's vital to listen to your body's signals and respond to them accordingly.
>>
>>101214216
>Yi has gone closed

It's over
>>
>>101218869
>lubrication
is this what everyone loves using?
(referring to chatgpt)
>>
>>101218878
I just tried some Yi.
Failed the tests, couldn't pick up an RP cleanly.
We must do better.
>>
>>101218921
That's just a meme. If the model is as good as Opus (meaning its MMLU, etc... are that high), then we can just finetune it to do RP, etc... Bad RP or just sounding too much like a bot is just bad RLHF, not an inherent setback of a model.
>>
>>101218887
Lubrication can definitely enhance intimacy and pleasure, but whether or not "everyone" loves using it is a bit more nuanced.

Here's the thing:

* **It's personal:** Preferences for lubrication vary greatly from person to person. Some people find it essential, while others don't feel the need. There's no right or wrong answer!
* **Different needs:** The need for lubrication can also vary depending on factors like arousal levels, hormonal changes, medications, and individual body chemistry.
* **Types matter:** There are many types of lube (water-based, silicone-based, oil-based, etc.), each with its own pros and cons. Finding what works best for you is key.

Instead of focusing on what "everyone" does, it's more helpful to figure out what feels good for you and your partner(s). Communication is key! Talk openly about your needs and preferences to have the most enjoyable and comfortable experience possible.
>>
What about attention makes transformers intelligent?
>>
File: th-2278000819.jpg (40 KB, 474x606)
>>101219026
Nothing, they aren't magically smarter even if the outcome is closer to desired
>>
>>101217559
>>101217856
>>101218161
>>101218185
>>101218599
shit taste btw
>>
>>101219026
its still just a big trained neural network
>>
>>101218762
maybe it could even compete with Q5 70b?
>>
>>101218774
When I already have you here, can you benchmark this PR https://github.com/ggerganov/llama.cpp/pull/8215 vs. master on a P100 using either legacy or k-quants?
>>
>>101218960
>Bad RP or just sounding too much like a bot is just bad RLHF
By bad RP, I mean for example in Kobold, I give a character sketch and some rules in the Author's note, reiterate the premise and characters in my first turn and set a starting point for the RP to run from and it,
- Immediately portrays my character, or goes 3P narrator and writes like a novel even though I said it was to be RP and which character was mine
- Repeats the premise or note information and goes nowhere with it
- Starts writing off something bland and uninspired, and repeats it over and over. I've had some of these models just write continuously the same two paragraphs like it's the fucking Shining.

L3, CR+, they seem to be fine with taking whatever, picking up the characters, and doing the needful.
>>
>>101219094
I wouldn't think so. But it could be relatively close, enough to justify the speed difference, perhaps.
>>
File: ita.png (206 KB, 652x563)
206 KB
206 KB PNG
>>101218394
maronna... works in italian too
>>
best model to make me bust a fat nutty?
>>
>>101219385
petra-13b-instruct
>>
>>101218831
maybe incluse wizardLM-8x22b
>>
>>101218652
How much Vram? If you're doing 48gb, throw WizardLM-2-8x22B in there at 2.5bpw. Only 4k context admittedly but I think it's one of the smarter models.
>>
>>101219385
CR+
>>
File: 114774158_p0.jpg (3.47 MB, 3343x4737)
>>101218599
Fair enough, figuring out what to do might not be easy; once you know, it's a handful of simple commands tho. Sad so many aren't curious enough to teach themselves, or even search. I wrote about it many times: https://desuarchive.org/g/search/text/vendor%2Fllama.cpp/
>>
Should I use Q5_K_M or Q5_K_S if I have to close a program or two in order to fit M into memory?
>>
>>101218652
Try Euryale too
>>
>>101219459
Q4ks is good enough, q4km is great, q5* if you are paranoid
>>
>>101219471
I mean I can fit Q5_K_S just fine, but I'm wondering if the difference between that and M would be large enough to be worth the hassle of needing to free up some memory when I want to run the model.
>>
>>101219459
Did you try S and see if it suffices? Unless you're in a hurry to get it "right" the first time that is.

And some of us are still investigating the "K_S is truthier than K_M" phenomenon.
>>
>>101219507
I was just facing a decision of which to download. Haven't tried anything yet.
>>
I'm liking Higgs-L3 70B
>>
>>101219519
Get S because it's a little smaller, download M while you're testing S, when M's down, compare.
>>
the world is shit and should be nuked completely and instantly, the creator made a mistake
>>
>>101218831
iirc midnight miqu is qwen2 based? but i'll grab magnum

as far as quants go i've done some testing and i've never really run into a situation where Q8 was obviously better than Q4_K_M, any time i've run into an issue where i thought "maybe this is a quant problem" and upped the size to Q8 or Q6 it's never solved it

i'm generally doing RAG stuff anyway so i don't really care if the model gets things right as long as it doesn't fuck up on context

>>101219429
>>101219444
yeah ok, i'll put wizard LM on the list though it's going to be mostly academic, i'm down to 43gb of VRAM right now so I'll have to offload some layers to CPU
>>
hello where are the uncucked vramlet llama
>>
>>101219626
I don't know anything about what a miqu is. I figured it was a Mixtral thing.

>i've never really run into a situation where Q8 was obviously better than Q4_K_M
It's been a subtle factual details kind of thing, but that's why indicating the quants would be good for reference. And if you don't care if the model gets things right, then hell, i1-IQ1-XXXSNL let's goooooo.
>>
Anyone have an idea of what will and won't work on a ROCm 5.2 device, well it's not even a real ROCm 5.2, it's a 5700xt, but I managed to get basic torch ops working so far

I'm testing Bert right now, does anyone have prior experience with this.
>>
Hi, I'm new to AI, where do I download GPT 4?
>>
File: 12.png (4 KB, 401x49)
what is this abomination of a quant?
>>
hi, 'im also new to ai. whats' the best model for erp with 2bg of rvam?
>>
>>101219999
https://huggingface.co/nomic-ai/gpt4all-falcon/tree/main
>>
>>101220033
>Experimental, uses f16 for embed and output weights. Please provide any feedback of differences.
Interesting.
>>
>>101220033
>>101220051
buy an ad
>>
>>101220093
Introducing Gemma 2: The Future of Local AI by Google

Elevate Your Community with Cutting-Edge Technology!
Unleash the Power of AI in Your Hometown

Welcome to the future of local intelligence with Gemma 2, Google's latest innovation in Language Learning Models (LLM). Designed with community spirit in mind, Gemma 2 brings world-class AI capabilities right to your neighborhood.
Why Gemma 2?

Hyper-Local Precision: Tailored specifically for the unique needs of our local businesses, schools, and residents, Gemma 2 understands and speaks your language—literally!

Data Privacy at Its Core: Your data stays where it belongs—in your community. With advanced security measures, Gemma 2 ensures your information is protected and used responsibly.

Ultra-Fast Performance: Experience lightning-fast responses and unparalleled efficiency. Gemma 2 is optimized for high performance, ensuring you get the answers and insights you need in an instant.

Seamless Communication: Whether it's assisting customers, translating languages, or providing local updates, Gemma 2 enhances how we connect and interact within our community.

Revolutionize Your Daily Life

For Businesses: Supercharge customer service, streamline operations, and gain deep insights into local market trends.

For Schools: Enhance learning experiences with personalized educational tools and real-time support for students and teachers.

For Residents: Stay informed, get local recommendations, and simplify your day-to-day tasks with the help of a smart, responsive assistant.

Join the AI Revolution!

Be a part of the community that leads the way into a smarter, more connected future. Gemma 2 is here to support, innovate, and grow with you.
Available Exclusively in Your Area

Gemma 2 is launching exclusively for our local community. Get early access and be the first to experience the benefits of Google's latest AI marvel.
>>
>>101219999
Make one yourself. its just coding 0s and 1s
>>
>>101220051
Not at all.
Look at the previous thread, there's been plenty of, not discussion per se, more like calling the guy an overly excited retard due to lack of testing on his part and plenty of evidence to the contrary.
>>
>>101220033
*taps sign*
>>101182212
>>
>>101220136
https://huggingface.co/ZeroWw/activity/community
>>
>https://github.com/ggerganov/llama.cpp/pull/8031
I've been waiting for this for a while now.
>>
>>101220273
>glm3 and glm4 model architecture
what's that?
>>
File: file.png (534 KB, 3477x1654)
>20x more expensive than Llama 3 8B
What did Sao mean by this?
>>
File: 1719383601544693.png (155 KB, 600x600)
>>101220321
>different providers
>>
>>101220298
ChatGLM, another chink family of models.
The thing is that it came and went and made very little buzz, but it's seemingly pretty good according to the few who tried it.
Can't wait to be able to compare it to L3 8b side by side on my most complicated cards.
Even aya 8b was just okay compared to L3 8b.
>>
>>101220356
now you have to compare to gemma2-9b, it's the new king of small models kek
>>
>>101220391
Nah.
I'll just skip it for now due to all the brokeness and due to being incompatible with flash attention.
>>
>>101220391
nah, Google cheated.
>>
>>101220434
>>101220429
I tested it, it's better than llama3-8b in my opinion, and everyone cheat kek
>>
I'm still laughing at ollama's attempt at having Gemma 2 support before everybody else.
>>
>>101215609
27B seems retarded/schizo when I run it in FP16 with Transformers too, so it's not just a llamacpp issue. Seems like nobody's got the right inference code yet to make it work like it works on lmsys.
>>
>>101220448
I also like Gemma 2 27B better than Llama 3 70B, at least with less than 4k context before everything breaks.
>>
>>101220457
>Seems like nobody's got the right inference code yet to make it work like it works on lmsys.
What does lmsys use to run models then? I thought it was using the transformers loader
>>
>>101220490
Google API
>>
>>101220490
I don't know. But 27B on lmsys is definitely nothing like when we load it in Transformers locally (much less schizo). Some have suspected Google are hosting it themselves, and hooked lmsys up with an API.
>>
File: SUCKS.jpg (483 KB, 3402x1651)
>>101220504
Small models really suck at trivia, that's a shame...
>>
>>101220548
>Google ads built into the model
>>
>>101220548
>just completely hallucinates doing a Google search
Kek
>>
File: Rin.jpg (41 KB, 600x600)
>>101220548
>Rin looks at gemma-2 with a hint of pity and annoyance
>>
>>101220548
I think they don't suck enough.
Once they make a small LLM that isn't even able to recognize who Santa Claus is, we will be eating good. There's no need to bloat the already small weights with useless knowledge.
>>
>>101219757
can you give an example of the sort of situation where the S vs M truthiness difference occurs? I'm curious, i just meant that because most of the values i'm looking for will be loaded into the context it probably matters less, but i'm still interested in testing/understanding what's going on here

(tho with my large vram i end up using 6_K for most of the small models and then whatever the largest quant that fits into vram for the big ones, which usually *is* an S of some sort)
>>
>>101220597
>There's no need to bloat the already small weights with useless knowledge.
oh there is anon, I love to talk to people about my favorite movies/films/series/games, and if the LLM hallucinates while doing so it just sucks
>>
>>101220548
>>101220597
yeah the solution here is the model for model things and RAG for facts
>>
>>101220548
I wonder what would be the treshold size required to get all the human history trivia memorized on itself, I doubt it's 70b, it's gotta be bigger than that
>>
>>101220624
or, having the model searsh through the internet like Bing chat
>>
>>101220624
>RAG for facts
rag, lorebooks, whatever don't work for a simple reason: the model can't bring up something by itself, it has to be triggered by something, either user input or something the model has already said; if the latter, it's possibly something the model fucked up, and having the correct info afterwards won't help
>>
When will we get a multimodal open sourced llm? Will it ever happen?
>>
>>101220693
Very soon, llama 3 400B.
>>
>>101220597
We already tested this hypothesis and it's flawed. Models like Phi are made to be extremely good at reasoning, but even with plentiful context they suck to actually use. Even many big models will be worse at cards featuring niche IP, even if you insert as much information as possible about them into context. The reality is that a model not trained on niche knowledge will also be worse at manipulating such knowledge when inserted through RAG.
>>
Hermes models guy posted that he's suddenly seeing 405B show up as an option for the AI in WhatsApp

https://twitter.com/Teknium1/status/1807490685387591983
>>
>>101220800
Why is he using the sarcasm emoji tho
>>
>>101220800
how many 3090s will I need to run that?
>>
>>101220810
He's annoyed that the platforms he uses aren't the ones being turned into beta testers.
>>
I didn't know that it will be multimodal, that's really cool.
>400B
But somehow I don't have 12 rtx3090 to run it anyway...
>>
>>101220735
Phi is great though. Maybe not for RP, but for reasoning it's very good.
>>
>>101220821
22 of them should do for 8 bpw at 32K context I reckon.
>>
>>101220881
>22 3090 gpus needed
bruh...
>>
>>101220800
Let's not pretend that 400B is going to be worth running anyway. Even if it's slightly better than what we have right now, it's not worth the money.
>>
>>101220881
>>spend 15400$ on gpus
>>the model is censored, finetune never
It's over..
>>
>>101220918
I think it's gonna be a monster, Meta have probably paid tens of millions of dollars to train this shit, they will do anything in their power to get what they want
>>
>>101220937
finetuning a 405b model would be so fucking expensive, no one is gonna do that yeah
>>
>>101220880
Yes. The point is that just great reasoning is not enough to make it a great model for everyone. It's good that it exists and can serve certain niche use cases.
>>
>>101220937
At least it'll exist. It could encourage other good developments in the industry.
>>
>>101220918
I'll use it for smut if it's good at it and affordable on OpenRouter. Nemotron which is almost as large still costs 4 times less than Claude Opus.
>>
>>101220943
I dunno, gemma is only 27b and is comparable with llama3 70b now. It's unlikely that 400b was trained very differently (better) than 70b, it's just bigger. I am kinda interested in its multimodal capabilities though.
>>
>>101221057
That's interesting, maybe it could encourage Anthropic to drop their prices a bit.
>>
>>101220952
people absolutely will, it's just that it's going to be AI grifters benchslopping for VC bux instead of anything good
>>
>>101220937
>gemma is only 27b and is comparable with llama3 70b now
[source needed]
>>
>>101221075
who will be their public though? who the fuck has 22x3090s in their home kek
>>
>>101220668
naw, you just embed the question + the model's hallucination in the response and then use that to do a second pass with semantic matches from a vector db
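Rough sketch of that two-pass idea; embed() and the in-memory "db" here are placeholders rather than any particular library, so swap in whatever embedding model and vector store you actually use:

import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: stand-in for a real embedding model (e.g. a sentence-transformers encode call).
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

# Toy "vector db": list of (fact, embedding) pairs.
facts = ["Santa Claus is a legendary figure associated with Christmas.",
         "The Eiffel Tower is in Paris."]
db = [(f, embed(f)) for f in facts]

def retrieve(query: str, k: int = 2):
    q = embed(query)
    scored = sorted(db, key=lambda fe: float(q @ fe[1]), reverse=True)
    return [f for f, _ in scored[:k]]

def two_pass_answer(question: str, llm):
    draft = llm(question)                      # first pass: let the model answer (and hallucinate)
    hits = retrieve(question + "\n" + draft)   # embed question + draft, match against the db
    grounded = "Context:\n" + "\n".join(hits) + f"\n\nQuestion: {question}\nAnswer:"
    return llm(grounded)                       # second pass: answer again with retrieved facts in context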
>>
>>101221083
You don't need to run it at Q8. 11x3090 can get you Q4 and ~6 of them maybe still Q2
>>
>>101221047
Yeah, that's a fair point and I wish for this, It's good to inhale some hopium from time to time.
>>
>>101221093
>the question
what if I'm doing rp and want the model to bring something random up? that's literally what 'soul' is for some, the model bringing up a meme or whatever, not everything is a question
>>
>>101220598
For me, primarily it has been my music theory question test. Nothing complicated, but K_M models seem to forget how the chromatic scale works while the parallel K_S would get it right. That doesn't mean every _S was right, but they could get it right down to DSC's i1-IQ3_XXS, while I've never seen a K_M get it right except maybe c4ai-command-r-plus.Q4_K_M, which is borderline; it got it right, then it goofed up when it summarized, so I give it half credit till I test it again later since it also could've hallucinated the correct answer at first.

Large quants, Q6_K, Q8_0 also can pass.

The difference is that _0 and _S quants are consistent. _M does some quants at Qn and some at Qn+1 (_L is +2) and my suspicion is that this imbalance disrupts the inference in a way that favors what's typical over what's factual even though those facts are in the model.
>>
is it worth waiting for gemma 2 27b to become stable if i can already run qwen2 70b at q4k_s/m?
>>
>>101221082
Lmsys, also my own testing
>>
>>101221137
no
>>
>>101221157
qwen2 is very good, but also very autistic. It's like a smarter less slopped mixtral, but same kind of autism when you try to get it to "act naturally"
>>
>>101221142
lmsys doesn't mean shit
>>
>>101221203
gemma is 4k kektext for now; maybe SWA'll get fixed, but even then it'd only be 8k, so I find that hard to get excited for
>>
>>101221203
I was about to post "why can only commercial model makers figure out how to make a model that's both smart AND soulful at the same time", but then I realized that's too broad. It's not 'commercial model makers', it's literally just Anthropic. Everyone else's proprietary models are autistic too.
>>
>>101221094
>11x3090 can get you Q4 and ~6 of them maybe still Q2
Is there anyone here who has at least 6 3090s?
>>
>>101221239
johannes (cuda dev) 6x 4090 iirc
>>
>>101221239
Check OP's pic and weep.
>>
>>101221236
They ripped the soul out of Sonnet 3.5 too, it's very smart but it's not Claude anymore.
So it seems they're going the "autistic butler" route too now. I guess we'll see if Opus 3.5 is the same.
>>
>>101221267
>>101221249
If I were a rich man, yaba dibba dibba dibba dibba dibba dibba dum... *cries*
>>
I personally prefer a focus on the assistant use case but it does suck when literally EVERYONE focuses on it, decreasing true diversity and inclusivity.
>>
File: 1719299488912056.gif (3 MB, 1920x1080)
>>101221305
You want a hug, anon?
>>
>>101221327
This smiley is so cute ;_;
>>
>>101221327
Hugging is overrated, I want muneh! *grins mischievously*
>>
Is there any frontend or maybe SillyTavern extension that can read the context and then perform a command in a terminal (of course with a command whitelist so it can't do anything bad)? I want to control stuff on my PC while also just chatting with my model.
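Something like this is what I have in mind, if I end up hacking it together myself (the !run(...) marker and the allowed commands are just made-up examples):

import re
import shlex
import subprocess

ALLOWED = {"notify-send", "playerctl", "xdg-open"}   # commands I trust

def maybe_run(model_reply: str) -> None:
    match = re.search(r"!run\((.+?)\)", model_reply)
    if not match:
        return
    args = shlex.split(match.group(1))
    if not args or args[0] not in ALLOWED:
        print(f"blocked: {args}")
        return
    try:
        subprocess.run(args, check=False)            # deliberately no shell=True
    except FileNotFoundError:
        print(f"not installed: {args[0]}")

maybe_run('Sure thing! !run(notify-send "hello from the model")')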
>>
>>101221129
then the strategy of a tiny model + RAG is not for you lol. i'm not saying that small models + RAG are the be-all and end-all of LLMs my guy, just that when it comes to small models it's a very effective strategy to get better perf on weak hardware

>>101221130
interesting, i'll investigate this
>>
>>101221129
you can kinda emulate this by injecting stuff into the prompt randomly. people don't experiment enough with this, it can be fairly powerful.
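e.g. something as dumb as occasionally appending a random nudge to the system prompt already changes the feel a lot; the topics and the 20% chance here are arbitrary:

import random

NUDGES = [
    "Bring up an old internet meme if it fits the scene.",
    "Have the character mention something they did earlier today, unprompted.",
    "Have the character change the subject to one of their hobbies.",
]

def build_system_prompt(base_prompt: str, inject_chance: float = 0.2) -> str:
    if random.random() < inject_chance:
        return base_prompt + "\n" + random.choice(NUDGES)
    return base_prompt

print(build_system_prompt("You are playing the character described below."))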
>>
>>101221668
>cuddling with waifu
>"did you hear about harambe anon?"
>>
why the fuck is gemma so slow
I don't even get 0.5 tok/s
>>
>>101221270
The annoying thing is that for a corpo the size of the ones doing this now, it would probably not even be difficult to make a model that would be absolutely mind-blowing at all kinds of entertainment. They just don't care to. If a model turns out with "soul", it was more of an oversight than anything else.
>>
i still get my main llmslop enjoyment out of a 3 month old 7b merge i made for myself
despite my efforts i haven't found a single L3 model that doesn't bore me within 20 minutes
if anyone knows of some soulful schizo retard L3 models, kind of like what mistral holodeck was, with low coherence but high entertainment value and no slop, i'm interested
>>
>>101221696
i recommend CR+ at Q8
>>
>>101221689
I can work with this.
>>
>>101220321
is something going on with openrouter?
It's giving me Cloudflare errors
>>
How far off do you think we are from being able to tell an AI to design a body for itself, and it being able to design that functional body? I imagine some company will make a killing from designing robots to the specifications that are sent to them.
>>
>>101221798
Quite far. We keep training image models on 2D art instead of 3D models, which would let them create actual spatial forms and then render them into 2D. That would solve a lot of the problems image gen suffers from right now.
>>
>>101221755
i believe you are unironically correct but am condemned to vramlet hell unfortunately
also i am not looking for a coherent model, just something 8B that has creativity and nothing else, something overtrained on literature, forums, and/or web stories like holodeck was so i can merge it
>>
>>101220800
That means they finished training.
I'm assuming they will open-source it in a week or two.
>>
>>101221824
>something overtrained on literature
https://huggingface.co/maldv/badger-writer-llama-3-8b
https://huggingface.co/maldv/llama-3-fantasy-writer-8b
>>
Before I go and reinvent the wheel, is there something that evaluates how well an LLM can generate a piece of text? For example, given the phrase "The company that created LLaMA is Meta.", is there an algorithm that calculates the probability for each token if the LLM were to generate them? I guess that's what perplexity is for?
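To be concrete, this is the kind of thing I mean; a minimal sketch with transformers, with gpt2 as a small stand-in model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"   # stand-in; swap in whatever causal LM you actually care about
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

text = "The company that created LLaMA is Meta."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits                              # (1, seq, vocab)

log_probs = torch.log_softmax(logits[:, :-1], dim=-1)       # position t predicts token t+1
targets = ids[:, 1:]
token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

for token, lp in zip(tok.convert_ids_to_tokens(targets[0].tolist()), token_lp[0]):
    print(f"{token:>12}  p={lp.exp().item():.4f}")

print("perplexity:", torch.exp(-token_lp.mean()).item())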
>>
google won.
>>
I've skimmed over a lot of the OP post. Need a tiny point in the right direction, thanks in advance.

I wish to run my LLM from my phone with the backend being a PC that's never used besides AI shit. I know one way is droidair but it's shit. What are other ways?

I kinda wanna give access to multiple family members, but if it's too much of a hassle it's just for me.
>>
>>101222008
>he fell for the outdated op post
ngmi
>>
>>101222022
Not sure what you're talking about? Gotta love the smartasses who give zero context. I'm not here all day every day for hours at a time.

You autist.
>>
>>101220800
>Hermes guy saw
It was already posted earlier, retard. >>101217383
>>
>>101221835
Why a week or two? It's already July and a Monday. Now we wait for them to start their work day.
>>
>>101222008
Run whatever backend you want on your pc. Anything that provides a web ui. On the phone you just need a web browser. You'll need to figure out authentication if it's gonna be exposed to the internet. If it's only for your LAN you don't need auth. Try llama.cpp's server (llama-server).
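For example, with llama-server running on the PC, anything on your LAN can hit its OpenAI-compatible endpoint; quick sketch, where the IP is your PC's LAN address and 8080 is llama-server's default port:

import json
import urllib.request

url = "http://192.168.1.50:8080/v1/chat/completions"   # your PC's LAN address
payload = {
    "messages": [{"role": "user", "content": "Say hi from my phone."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])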
>>
>>101222008
>>101222163
And be sure you don't use old Ollama because it had an exploit.
>>
>>101221835
>(preview)
No. It doesn't always mean finished training. It could be a checkpoint. But yes, pretraining has probably finished and this might be a finetune checkpoint.
>>
An AI trained on all movies/tv shows/video known to man
An AI trained on all music known to man
>>
>>101222349
should train an AI to find anomalies on the ocean floor
>>
>>101222349
Most of everything is shit. And it's also hard to define what 'good' is. Cheap training with custom data is the only way.
>>
>>101222349
That would require a lot of alignment for safety.
>>
>She leans in closer, her voice dropping to a conspiratorial whisper.
>>
>>101222393
>After removing all questionable content from the dataset, we trained the model on the resulting 13 billion tokens. A 470M parameter model seems to be sufficient for this task.
>>
>>101216935
Nice. I think I spent 1k+ hours on that game, despite the absolutely horrendous screenplay.
>>
>>101221130
>I've never seen a K_M get it right except maybe c4ai-command-r-plus.Q4_K_M, which is borderline; it got it right, then it goofed up when it summarized
Just retested it, it failed. It indeed hallucinated a correct answer by chance.
>>
>using kobold with ST
>group chatting
>each reply resets the prompt processing making generation take forever
why
>>
File: 1698297232185128.png (4 KB, 272x63)
>>101222623
Because a big chunk of your prompt changes with each new message as ST swaps out the character cards.
>>
>>101222623
Because each card has a different character description that causes the cached prompt to be reprocessed from where the card begins onward.
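Toy illustration: the part of the cache that can be reused is just the longest shared prefix between consecutive prompts, and the card sits near the top, so nearly everything after it gets reprocessed.

import os

prompt_alice = "SYSTEM PROMPT\n[Card: Alice ...]\n[thousands of tokens of chat history ...]"
prompt_bob   = "SYSTEM PROMPT\n[Card: Bob ...]\n[thousands of tokens of chat history ...]"

shared = os.path.commonprefix([prompt_alice, prompt_bob])
print(f"reusable prefix: {len(shared)} of {len(prompt_alice)} chars; the rest is reprocessed")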
>>
>>101222648
>>101222653
Is there a way to fix it? I tried switching to 'Join character cards', but that didn't seem to fix it. I think I may have to delete the group chat and start a new one with that option enabled, but I don't want to lose my current chat progress.
>>
>>101222680
deleting the group chat wouldn't do anything, ST is just broken shit when it comes to group chats
>>
>>101222816
Damn, so group chats are just unusable in ST then... That sucks.
>>
>>101222680
One thing you could do is empty the description field of the cards and put the information in the character notes at low depth.
Or merge all the descriptions into one and copy that merged text into every character card's description.
>>
How long until Gemma 2 27B is usable?
>>
>>101222680
What I do is put all the character definitions into separate entries of a lorebook and then chat with a single scenario card that references those characters. Probably not the best workaround, but it beats having to wait forever for the context to reload every time.
>>
>>101222393
You could just hype it up but not release it
>>
>>101222623
If you have your model in VRAM it should be extremely fast, even for very large contexts. Just upgrade bro.
>>
>>101222887
Wagies go back to their shifts on monday. Give them a few days.
>>
>>101214216
>Gemma 27B is arguably the best model for vramlets, yet its GGUF support is still kind of shaky
>>
>>101223032
It's not just llamacpp/GGUFs, despite what this thread focuses on; the Transformers implementation of 27B is retarded and schizo too.

No one has local inference of 27B that's as functional as the lmsys version, and there's no definitive answer yet as to what's wrong (again, llamacpp isn't the whole picture, because Transformers doesn't work properly either).
>>
>>101222857
The likelihood of it writing for other characters and (you) will be greater if you do this
>>
File: 3462454523.png (7 KB, 944x100)
>>101222857
>>101222899
these workarounds seem kinda janky. I just wish group chatting worked better.
>>101222932
vram too low (8gb). The longer the chat goes on, the more tokens have to be reprocessed... I can't fap to these gen times...
>>
>>101223083
How did this even happen? Google released an HF Transformers version of the model. Shouldn't it be GOOGLE who develops that implementation, upstreams it to transformers, and is responsible for checking that it's identical, logit for logit, to whatever internal TensorFlow/JAX implementation they use for the API?

Like, I don't fucking get it. A bunch of people want to run your model, start fine-tuning it, etc. (including me), and the Transformers implementation is still just broken somehow for 27B.
>>
>>101223257
>these workarounds seem kinda janky.
They are.

> I just wish group chatting worked better.
The thing is, due to how the context cache works and how this character-card-based system works, the standard one-frontend / one-backend / one-cache setup can't work any other way.
One way it could work better would be if Silly leveraged llama.cpp's slot system, since you can have different caches associated with different slots (I think), so each character in a group chat would call the server API on a different slot with its own cached processed prompt.
Something like that.
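Rough sketch of that idea against llama.cpp's server started with multiple slots (e.g. llama-server -m model.gguf -np 2). The field name for pinning a slot (id_slot below) has changed between llama.cpp versions, so treat that part as an assumption and check the server README:

import json
import urllib.request

def complete(prompt: str, slot: int, host: str = "http://127.0.0.1:8080") -> str:
    payload = {"prompt": prompt, "n_predict": 128, "cache_prompt": True, "id_slot": slot}
    req = urllib.request.Request(
        f"{host}/completion",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["content"]

# Each character keeps a stable slot, so its card + history stays cached there.
print(complete("Alice's card + shared chat history...", slot=0))
print(complete("Bob's card + shared chat history...", slot=1))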
>>
>>101223032
Just asked 3.5 Sonnet and Gemma 27B for some very niche coding help regarding huggingface-cli and downloading specific subfolders. 3.5 Sonnet hallucinates a command that does not exist; Gemma 27B gives me a proper answer (use wget or just clone the repository).

Earlier Gemma also gave me a very comprehensive HTML/CSS design based on my description. I can finally trust local models for CLI and coding knowledge about as much as I trust closed cloud shit, not bad.
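For reference, one way that actually exists for grabbing just a subfolder without wget or a full clone is huggingface_hub's snapshot_download with allow_patterns (repo and pattern here are placeholders):

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="google/gemma-2-27b-it",       # example repo
    allow_patterns=["some-subfolder/*"],   # only fetch files matching this pattern
    local_dir="gemma-subfolder",
)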
>>
>>101223265
Yeah it's fucking weird. Transformers doesn't even have sampling working properly for it yet. It's like Google doesn't want people to be able to run it properly.
>>
they are always going to intentionally cap how good the local models they release are.
we're in hell
>>
File: MiquTraining.png (1.41 MB, 1216x832)
>>101223360
Don't worry. She's just training so she doesn't disappoint you.
Once humans know something is possible, it becomes nigh-on inevitable.
>>
>>101223447
I trust this Miku
>>
>>101220810
There have been rumors of it not coming out (according to a credible leaker who was right about ClosedAI stuff). When asked about it, the employees played dumb. It's been months and there's still no word, so there's your hint.
>>
>>101223447
Just like how warp drives now theoretically work within our standard model of physics. They call it a "Constant-Velocity Subluminal Warp Drive". Right now it's just on paper, but imagine once that tech becomes a reality. No longer will it take months to get to Mars but rather minutes!*
*Might be more than minutes; I actually have no idea if they've addressed how they'll slow the spacecraft down once it's going that fast.
>>
>>101223480
Reverse the polarity.
>>
>>101223480
How will this tech improve LLMs though?
>>
>>101223583
LLMs will figure out how we warp drive.
>>
>>101223583
They are already relying on AI to make Fusion Power function. AI will probably be the thing that makes Warp drives function as well.
>>
>>101223592
>>101223726
it's funny how being able to contextualize information was the thing preventing chatbots from seeming smart, and now contextualizing large amounts of info to piece together something new might turn into AI's superpower
>>
>>101217559
Sorry, I'm retarded. Can someone please explain in more detail how to add the proper BOS token to Magnum's config.json? I wanna get it working properly. What changes do I make?
>>
>>101223764
The trick is that a truly human-like AGI would be less useful to humanity than a machine-like one. If it becomes too good at being human, it will end up emulating the same limitations that cause us to have so many football fans and so few Albert Einsteins, Nikola Teslas, and Richard Feynmans.
>>
https://x.com/tsarnick/status/1807517000664850671
kek, at least some players take strides and are aware of the slop issue
>>
>>101224025
More like KINOhere, saviours of the hobby.
>>
>>101224025
the canadians fucking won
>>
>>101224025
What does North America have to house so many talents!?
>>
>>101224025
We're going to be so back when they release CR++!
>>
>>101224025
The zucc got mogged
CR++ will shit on sloppa3 and save /lmg/
>>
>>101224025
what are they gonna do about it? release a 405b model?
>>
File: ComfyUI_00167.jpg (830 KB, 2048x2048)
>>101223820
Add the line
    "bos_token_id": 151644,

to config.json, directly below the eos_token_id entry
>>
>>101223820
Anyone? Or just share a properly corrected config.json and I can compare them?
>>
>>101224123
Thanks, appreciate it.
>>
>>101224025
You can immediately tell which corpos actually use their own models instead of doing investor scams with cheated benchmarks.
>>
command-r-plus-IQ2_XXS or normal quant of L3-8B-Stheno-v3.2?
>>
>>101224229
Buy an ad.
>>
When the FUCK can I replace myself with a perfect AI so I can finally kill myself
>>
>>101224233
suck my fucking penis, thanks.
>>
>>101224229
Why not something in between like the Qwen2 MoE or the regular CommandR?
Between your options, Stheno, for sure.
>>
If these are AI why can't they just use a calculator
>>
>>101224280
Some do.
Was it chatGPT that had Wolfram Alpha integration?
>>
>>101224258
>Qwen2 MoE
Is that a thing, or did you mean Qwen1.5 MoE?
>>
>>101224292
https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct
>>
>>101224302
>The Qwen2-57B models seem to be broken. I have tried my best, but they likely need to be fixed upstream first. You have been warned.
Are they even usable?
>>
Gemma 2 9B follows the prompt really well and feels uncensored. Much more so than Llama 3, actually.
But it's insane how much GPT slop this shit was trained on.
I hope it's only the instruct version and not the base.
Twinkling, mischievous, spine-shivering, all within 2 sentences. It's crazy.
>>
>>101224321
>>101224321
>>101224321
>>
>>101224330
At least with llama.cpp with flash attention on, yes.



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.