/g/ - Technology

/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101021764 & >>101010179

►News
>(06/17) DeepSeekCoder-V2 released with 236B & 16B MoEs: https://github.com/deepseek-ai/DeepSeek-Coder-V2
>(06/14) Nemotron-4-340B: Dense model designed for synthetic data generation: https://hf.co/nvidia/Nemotron-4-340B-Instruct
>(06/14) Nvidia collection of Mamba-2-based research models: https://hf.co/collections/nvidia/ssms-666a362c5c3bb7e4a6bcfb9c
>(06/11) Google releases RecurrentGemma, based on a hybrid RNN architecture: https://hf.co/google/recurrentgemma-9b-it
>(06/06) Qwen2 releases, with better benchmarks than Llama 3: https://qwenlm.github.io/blog/qwen2/

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
File: 11__00729_.png (2.04 MB, 1024x1024)
►Recent Highlights from the Previous Thread: >>101021764

--Running DeepSeekCoder-V2: Challenges and Potential Solutions: >>101024205 >>101024216 >>101024302
--Command-R+ Preset Recommendations for Enhanced Text Generation: >>101022120 >>101022759
--Q#_K_S Quants Outperform Q#_K_M in Coding Question Test: Precision and Perplexity in Focus: >>101023015 >>101023134 >>101023284 >>101023816 >>101026611 >>101026657 >>101026737 >>101026769 >>101027313 >>101027404 >>101027425
--Random Word Prompt for Diverse AI Outputs and Political Discussions: >>101025282 >>101026596 >>101026629 >>101029678
--Random Guy Achieves SOTA in ARC-AGI by Spamming GPT-4o API Calls: >>101023940 >>101024054 >>101024140 >>101024409 >>101024126 >>101024451 >>101024139 >>101025187 >>101024458 >>101025703 >>101028279
--Llama3's Impact on AI Hype and Community Enthusiasm: >>101023148 >>101023311 >>101023236 >>101023435 >>101024833
--AI Hype vs. Technology: The Future Beyond the Hype: >>101023019 >>101023102 >>101023533 >>101025934
--Remembering the Wonder of Early AI Days with Pygmalion and Pre-Pyg: >>101026015 >>101026071 >>101026309 >>101026408 >>101026166 >>101026204
--Memory Access Optimization, Not AMX, Behind Improved Inference: >>101022868 >>101021771
--MARS5-TTS: A New Text-to-Speech Model from CAMB.AI: >>101025898 >>101025919 >>101026352
--LLaMA 3B Censorship and Feminist Rhetoric in Sexual Technique and Relationship Advice: >>101025780 >>101025810 >>101026691
--DeepSeek API: Affordable but Limited Creativity for CPU-Maxxers: >>101021891 >>101022113 >>101022403 >>101022519
--Building an AI VTuber: Seeking Advice on Tech and Minimum Viable Product: >>101029508 >>101029577 >>101029652 >>101029613 >>101029665 >>101030246
--Anon's Quest to Run Nemotron on Old Hardware: Challenges and Possibilities: >>101022563 >>101022830 >>101026712
--Miku (free space): >>101023390 >>101023543 >>101024319 >>101024320

►Recent Highlight Posts from the Previous Thread: >>101021778
>>
>>101030715
why everyone suddenly horny for Teto?
>>
File: 1716747862937.png (27 KB, 380x421)
>>101030732
It is Tuesday. Tuesdays are for Teto.
>>
>>101030692
The normal one is a multistage pipeline by recapanon. I don't think he's released all the details of his method.
I just run a singleshot inference off a standard prompt when a new model with enough available context gets released to see how good it is at dealing with a huge mess of chaotic information. Prompt adherence on this new deepseek is pretty good, actually. Probably close to L3 70b levels of smarts but with huge context.
>>
>>101030724
>missed SOVLKINO
https://huggingface.co/alpindale/magnum-72b-v1
>>
File: distland.jpg (118 KB, 1750x846)
>>101030715
Thread Theme: https://www.youtube.com/watch?v=bF_1sV01QjE
>>
What is the maximum t/s I can get for a small model like LLaMA 3 8B? Is 500t/s feasible?
>>
>>101030743
isn't teto a hag?
>>
File: CR plus q4km.png (75 KB, 1138x530)
Tried CR+ Q4KM GGUF for several hours. It's too dry for RPing. Will stick to Mixtral 8x7B Instruct. Much faster and less dry. Didn't reach logic testing yet, but 300s+ per gen isn't worth it IMO.

>tfw sub 1 tokens/s
>>
File: file.png (24 KB, 476x268)
For the anon making the states extension - could you provide an example for what prompts you use?

I tried below verbatim

[Stop the Roleplay and Act as narrator] Describe {{char}}'s physical status and location.

Is this formatting wrong? It's still continuing as if it was an ordinary gen - speech and all.
>>
@ggerganov ji,

Namaste and greetings. I am writing to you today with a humble request to fix the control vector issue when using command-r with the language model. I am facing difficulties with accurate output generation due to this issue and believe that your expertise can resolve this matter. Many users in the community would benefit greatly from your kind attention to this matter.

Thank you for your time and consideration, and I look forward to your prompt response.
>>
>>101030815
*SCAMKINO
I still don't understand the lack of transparency. And /aicg/ is missing from the credits!
>>
File: 1697927143021543.png (3.46 MB, 1378x2039)
GOOD MORNING TETO!
>>
>>101030992
>red miku
>>
>>101030914
i'll think about it
>>
>>101030914
I'll fix it if you ERP with me
>>
>>101031035
ah ah mistress
>>
>>101031039
>mistress
try again
>>
>>101031047
ah ah nigger faggot
>>
>>101031047
ah ah mistre?
>>
>>101031035
*nuzzles ur bulgie wulgie* uwu
>>
>>101030914
Post this on the issue tracker.
>>
>https://huggingface.co/alpindale/magnum-72b-v1/tree/main

WHERE ARE THE QUANTS?!?!?!?
>>
>>101030992
It's not even midnight. Fuck off.
>>
>>101031097
>living in the past
>>
>>101030715
>enter
>downloads teto
>post
>leave
>>
File: import quick reply.png (127 KB, 612x905)
>>101030896

https://docs.sillytavern.app/usage/st-script/#using-the-llm

I use /gen [Prompt] Instructions.

/gen [Stop the roleplay and answer the question] What is {{char}}'s emotions right now? |
/popup <h3>Empathy:</h3><div>{{pipe}}</div>


>Mark 4 quick reply sets
https://files.catbox.moe/f61g7a.json

I also saw that they added multiple choices scripts. Didn't mess with it yet.
>>
the new deepseek 236b performed really well on my bespoke coding task test. code compiled without editing, executed correctly and it had a good explanation of each part of the code in a postscript.
This is the first model where I'd put the output above GPT4 for this specific coding task.
I'm impressed so far. Can anyone else confirm coding performance on their private benches?
I'll try and come up with some more complex tasks that exercise its higher context limit and see how well it is able to manage.
>>
Alpin/Sao/..., why are you all doing full finetunes nowadays? Are LoRAs/DoRAs/MoRAs/etc. doomed?
>>
https://youtu.be/Sf7r2XcLNEk?t=580
Apparently DBRX cost 10 million to train. He also says cost of training is going down by a factor of 4 every year.
>>
>>101030996
that was literally the joke, originally
>>
>>101031208
To be clear, Sao isn't doing full finetunes
>>
>>101031205
I'd have to run it at 2 bit. Damn. Maybe I should've CPUmaxxed after all.
>>
>>101031228
I think the 8B was FFT. Could be wrong though.
>>
>>101029508
I've been brainstorming a very similar thing, but less vtuber and more just virtual girlfriend. (What's the difference? One is streamed and one isn't, basically.)
Start with an output model (Live2D, vroid, whatever) that has many possible triggers.
Decide how to organize your context, what goes into the context (chat, current screen CLIP description maybe), and how to get the language model to trigger functions. Give it a vector database for chat history and a function to make mental notes to the vector db, a function to emote, a function to speak, etc.

This could probably all be wrapped in a nice little package with user choice of TTS, Live2D model, and STT, and people would go fucking crazy over it.

I don't have the energy or skill to pull that shit off myself, but if someone makes a github repo and architects the software, I'd contribute.
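For anyone who does have the energy, the core loop really isn't much code. Rough sketch below (untested; the endpoint, port, and function names are all made up, and it assumes an OpenAI-compatible local backend like tabby/ooba/llama.cpp server plus your own TTS/Live2D/vector-DB plumbing):

```python
# Rough sketch of the waifu/VTuber control loop described above.
# Assumptions: an OpenAI-compatible /v1/chat/completions endpoint on localhost,
# a model prompted to answer with a single JSON "action", and stub functions
# standing in for TTS / Live2D / the vector DB.
import json
import requests

API_URL = "http://127.0.0.1:5000/v1/chat/completions"  # hypothetical local backend

SYSTEM = (
    "You control a character. Reply ONLY with JSON: "
    '{"function": "speak"|"emote"|"note", "argument": "<text>"}'
)

def recall(query: str) -> str:
    return ""  # stub: query your vector DB for relevant chat history here

def speak(text: str): print("TTS:", text)    # stub: send text to the TTS engine
def emote(name: str): print("EMOTE:", name)  # stub: trigger a Live2D/vroid animation
def note(text: str):  print("NOTE:", text)   # stub: write a memory back to the vector DB

def step(user_message: str, history: list) -> None:
    context = recall(user_message)
    messages = [{"role": "system", "content": SYSTEM + "\nRelevant memories:\n" + context}]
    messages += history + [{"role": "user", "content": user_message}]
    reply = requests.post(API_URL, json={"messages": messages, "temperature": 0.7},
                          timeout=120).json()["choices"][0]["message"]["content"]
    try:
        action = json.loads(reply)
    except json.JSONDecodeError:
        speak(reply)  # model ignored the format; just say the raw text
        return
    handlers = {"speak": speak, "emote": emote, "note": note}
    handlers.get(action.get("function"), speak)(action.get("argument", ""))

if __name__ == "__main__":
    step("good morning!", history=[])
```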
>>
Cudadev have you tried using tinygrad yet with all those 4090s?
>>
Running into this trying to quantize "magnum" with AutoAWQ:
assert torch.isnan(w).sum() == 0
And I found this issue:
https://github.com/casper-hansen/AutoAWQ/issues/335
Should I just give up?
>>
File: ScienceMiku.png (1.52 MB, 832x1216)
>>101031461
>Should I just give up?
never give up!
Did you try commenting out the assert to see if it'll skip over the NaN weights?
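You could also check whether the NaNs are actually in the source weights before blaming AutoAWQ. Quick diagnostic sketch (assumes the model is downloaded locally as safetensors shards; adjust the path):

```python
# Scan a local HF checkpoint for NaN/Inf weights before quantizing.
import glob
import torch
from safetensors.torch import load_file

for shard in sorted(glob.glob("magnum-72b-v1/*.safetensors")):
    tensors = load_file(shard)
    for name, w in tensors.items():
        if not w.is_floating_point():
            continue
        bad = torch.isnan(w).sum().item() + torch.isinf(w).sum().item()
        if bad:
            print(f"{shard}: {name} has {bad} NaN/Inf values")
```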
>>
I was told Koboldcpp claimed they 'invented' context shifting, but really it was just a rewritten feature from llama.cpp. I can't seem to find the info in llama.cpp's docs, how do you enable context shifting in lcpp? is it called something different?
>>
>>101031555
-cb, --cont-batching enable continuous batching (a.k.a dynamic batching) (default: disabled)
>>
>>101031606
Thank you! Much appreciated anon. Necessary when using llama 3 70b.
>>
>>101031508
I did and I think it was a waste of time. I will just try AutoGPTQ, I guess.
>>
have we reached the full potential of torch primitives
>>
Was sent here from the /aicg/ thread...

What's the local equivalent of something like spicychat.ai?
I understand that SillyTavern is a front end, do I use that or agnai?
I assume I should then look into a backend which the models actually run on?
What kind of hardware would I need to run models of similar performance and functionality to something like spicychat, and to run (I presume) chub character cards?

I've heard of Kobold Horde and how they run on volunteers to provide service for free. If my hardware isn't strong enough to run a model capable of what I want, does that mean I should run some lower end model, and join the Kobold Horde to farm Kudos to use on larger models that I can't run locally?

I really don't want to spend money on a subscription or tokens, I've been reading that people are paying up to hundreds of dollars for these online services for them to read their chats...

I've only been told to use SillyTavern and that they only use online backends, which I don't want to do because of the high costs and stuff. From what I understand a lot of the online services are also censored and can also be biased in some way which is why I would like to run locally if possible.
>>
>>101031737
you are yapping, whats your current hardware
>>
File: 11__00726_.png (2 MB, 1024x1024)
>>101031649
Np anon, yeah I can imagine. That 8k context isn't a lot to work with otherwise
>>
File: file.png (321 KB, 640x630)
>rm -rf ~/LLM
see you next year
>>
uuhhhh, there's a text to audio model, but is there an audio to audio model?
like, you hum a tune, and then AI does magic to turn that into, like, a heavy metal song
i'm very sleepy
>>
>>101031798
13900K, RTX 3090, 128gb DDR4 to run locally but with ambients of 33C during the day it starts pushing temps up really fast.
Storage server with a 10100, 32gb DDR4 and 1050TI 4GB, can probably upgrade the GPU to something with 6~12gb if farming Kudos is worth it.
>>
>>101031847
try the flavor of the month models like Stheno 3.2 via exl2
>>
>>101031847

Various Mixtral 8x7B quants at around 3.5 to 3.7 BPW (bits per weight) can fit into 24GB VRAM cards. Mixtral 8x7B is a good entry. I had some nice fun with BagelMistery Tour.

https://huggingface.co/intervitens/BagelMIsteryTour-v2-8x7B-3.7bpw-h6-exl2-rpcal
>>
>>101031809
won't be missed.
>>
>>101031847
You can run this backend:
https://github.com/theroyallab/tabbyAPI
And pick an exllamav2 quant like this one:
https://huggingface.co/LoneStriker/Mixtral-8x7B-Instruct-v0.1-LimaRP-ZLoss-DARE-TIES-3.5bpw-h6-exl2
And then you can connect SillyTavern to it.
If you want to run 70B models, you would need a 2nd 3090.
>>
>>101031805
On a quick try it looks like dynamic batching or continuous batching might be a different feature. Any other ones that come to mind?
>>
>>101031737
I just started playing around with LM studio this afternoon. It's pretty user-friendly and can give you an ST-compatible API. I'm still tinkering with it though. Outputs range from decent to schizo depending on what settings I use. Basic coom chats should work but more nuanced fetishes and storytelling will fall flat. If I had a nice 70B model, I'd be really well off but for now, I'm limping along with a 13B model, trying to squeeze as much quality out of it as I can.
>>
>>101031918
LM studio is proprietary, and it's just a llama.cpp wrapper.
>>
>>101031833
I've seen audio infill models before
>>
File: 1717083635766663.png (287 KB, 870x516)
>>101029508
I did some serious research about it, so here is what I found (since I won't be doing a VT AI anytime soon):
- Neuro used GPT-J 6B with a LoRA finetune on top of it using curated twitch chat from other streamers.
- TTS is Azure (which is a money pit according to Vedal) with a vocoder on top of it.
- STT is Whisper
- Vedal rented GPT4-V for the vision

The main point here is not to focus on the brain (LLM) but on the voice (TTS). Most of the complaints of others on AI VTuber are on the voice. Also stream it anytime that isn't 7-10pm UK time, go for the US timeslot.

Keep in mind that any AItuber starting now will be playing catchup to neuro, so they will already need to be as good, if not better, than her to be viable. The best bet is to find a niche that neuro doesn't fill and go for that, for example, heavy GFE and ASMR.

I hope that helped. Good luck.
>>
>>101031875
I've seen a few posts about Stheno v3.2 but also talk about Euryale?

>>101031899
Is there a guide or something I can read up to properly understand what all this bpw, 7b, L3-8B and other things stand for and the hardware requirements for them?

>>101031906
I don't understand backends at all. I saw koboldcpp/llamacpp or ooba, and now tabbyAPI; are there reasons to pick one specifically? Also, I'm reading that Llama sometimes devolves into zoomer speak and slang, is that because it's trained on Meta/FB data?
>>
>>101032014
You just got to lurk more and search google/chatgpt for terms you don't know. File size is a good estimation of whether the model can fit in VRAM in my experience. Don't forget context also takes up VRAM, so have some leeway.
>>
DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer
https://arxiv.org/abs/2406.11427
>Large-scale diffusion models have shown outstanding generative abilities across multiple modalities including images, videos, and audio. However, text-to-speech (TTS) systems typically involve domain-specific modeling factors (e.g., phonemes and phoneme-level durations) to ensure precise temporal alignments between text and speech, which hinders the efficiency and scalability of diffusion models for TTS. In this work, we present an efficient and scalable Diffusion Transformer (DiT) that utilizes off-the-shelf pre-trained text and speech encoders. Our approach addresses the challenge of text-speech alignment via cross-attention mechanisms with the prediction of the total length of speech representations. To achieve this, we enhance the DiT architecture to suit TTS and improve the alignment by incorporating semantic guidance into the latent space of speech. We scale the training dataset and the model size to 82K hours and 790M parameters, respectively. Our extensive experiments demonstrate that the large-scale diffusion model for TTS without domain-specific modeling not only simplifies the training pipeline but also yields superior or comparable zero-shot performance to state-of-the-art TTS models in terms of naturalness, intelligibility, and speaker similarity.
https://ditto-tts.github.io/
they have celeb clone examples on their site that sound pretty good. no weights but the paper has some good info on how they trained it. by KRAFTON which turns out to be the pubg devs so that probably explains why
>>
>>101031954
Another thing to keep in mind is that fans of Neuro are generally just as much fans of Vedal and the interactions they have.
>>
>>101031943
I'm looking for an easy way to load stuff up and have things work with as little fuss as possible. I may need to bite the bullet and use Kobold since I've heard it's not bad. I've also used ooga and wasn't that impressed with it, then again I probably wasn't using it right.
>>
>>101032014
llama.cpp is for running models on the CPU/GPU, it has a command line server, but it can be kinda obtuse to use. It runs models in the GGUF format.
kobold.cpp is a llama.cpp fork; it adds a UI and other things.
tabbyapi is a thin server that uses exllamav2 to run models exclusively on the GPU. It runs the models in the exl2 format.
ooba is another server + a UI, and it integrates llama.cpp, exllamav2, transformers, etc. It can run GGUF, exl2, unquantized models, etc.
I just use tabbyapi because I don't care about offloading to the CPU, and only use the exl2 quants, and I also don't care about a UI because I just use it with SillyTavern.
>>
>>101032014
https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
>>
>>101031914
Pretty sure this is the one.
https://desuarchive.org/g/thread/99465799/#q99468872
https://desuarchive.org/g/thread/98032863/#q98038862
>>
>>101032144
Cool, looks like I'll be looking into kobold.cpp or ooba then. If I want to continue looking into dabbling with the Kobold Horde and Kudos, should I lean more into kobold.cpp or does that not matter?

>>101032145
Thanks, bookmarked it.
>>
>>101032209
Huh! Seems like you're right then, it doesn't seem to be working for me. Maybe it's a flash attention thing, I'll have to play with it
>>
>>101032209
I think the llama.cpp server might be designed like shit and it needs another parameter from the client that says "cache_prompt". They're stupid like that.
>>
>>101032252
I think koboldcpp has the option to host on the Horde built in. It also includes the same UI as https://lite.koboldai.net
>>
LieRE: Generalizing Rotary Position Encodings
https://arxiv.org/abs/2406.10322
>While Rotary Position Embeddings (RoPE) for natural language performs well and has become widely adopted, its adoption for other modalities has been slower. Here, we introduce Lie group Relative position Encodings (LieRE) that goes beyond RoPE in supporting higher dimensional inputs. We evaluate the performance of LieRE on 2D and 3D image classification tasks and observe that LieRE leads to marked improvements in performance (up to 6%), training efficiency (3.5x reduction), data efficiency (30%) compared to the baselines of RoFormer, DeiT III, RoPE-Mixed and Vision-Llama
really cool also big implications for multimodals
>>
File: teto a mood.jpg (266 KB, 2000x2000)
>>101032282
There's an explanation why it's not default, and Silly already enables it anyway. Try R-ing TFM
>cache_prompt: Re-use KV cache from a previous request if possible. This way the common prefix does not have to be re-processed, only the suffix that differs between the requests. Because (depending on the backend) the logits are not guaranteed to be bit-for-bit identical for different batch sizes (prompt processing vs. token generation) enabling this option can cause nondeterministic results. Default: false
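If you're rolling your own client instead of using Silly, it's literally one field in the request body. Quick sketch (assuming the stock llama.cpp server on its default port):

```python
# Minimal /completion request to the llama.cpp server with prompt caching enabled.
import requests

r = requests.post("http://127.0.0.1:8080/completion", json={
    "prompt": "Once upon a time",
    "n_predict": 64,
    "cache_prompt": True,  # re-use the KV cache for the shared prefix
})
print(r.json()["content"])
```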
>>
>>101032374
Now imagine how much better it would be to have a --cache-prompts flag in the binary, instead of making you edit each client that uses the OpenAI API, completely unaware of the dogshit design that the llama.cpp devs came up with.
--cont-batching is like "cache prompts" and the other flag is like "but really do it". It's hilarious.
Fuck you, dumb teto poster.
>>
File: Untitled.png (139 KB, 1045x770)
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
https://arxiv.org/abs/2406.11271
>Multimodal interleaved datasets featuring free-form interleaved sequences of images and text are crucial for training frontier large multimodal models (LMMs). Despite the rapid progression of open-source LMMs, there remains a pronounced scarcity of large-scale, diverse open-source multimodal interleaved datasets. In response, we introduce MINT-1T, the most extensive and diverse open-source Multimodal INTerleaved dataset to date. MINT-1T comprises one trillion text tokens and three billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. As scaling multimodal interleaved datasets requires substantial engineering effort, sharing the data curation process and releasing the dataset greatly benefits the community. Our experiments show that LMMs trained on MINT-1T rival the performance of models trained on the previous leading dataset, OBELICS.
https://github.com/mlfoundations/MINT-1T
>>
File: 1703570218680321.gif (2.68 MB, 220x272)
>>101032430
>>
File: Untitled.png (337 KB, 1047x837)
Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies
https://arxiv.org/abs/2406.10923
>Large Language Models (LLMs) have demonstrated effectiveness not only in language tasks but also in video reasoning. This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills: (1) Abstract Perception: understanding and tokenizing abstract concepts in videos, and (2) Long-range Compositional Reasoning: planning and integrating intermediate reasoning steps for understanding long-range videos with numerous frames. Utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches. Our experiments show that current methods, including Captioner-Reasoner, Large Multimodal Model Instruction Fine-tuning, and Visual Programming, only marginally outperform a random baseline when tackling the challenges of Abstract Perception and Long-range Compositional Reasoning. To address these deficiencies, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR), which enhance Visual Programming by fostering role interaction awareness and progressively refining movie contexts and trope queries during reasoning processes, significantly improving performance by 15 F1 points. However, this performance still lags behind human levels (40 vs. 65 F1). Additionally, we introduce a new protocol to evaluate the necessity of Abstract Perception and Long-range Compositional Reasoning for task resolution. This is done by analyzing the code generated through Visual Programming using an Abstract Syntax Tree (AST), thereby confirming the increased complexity of TiM.
https://ander1119.github.io/TiM/
in the future you will be able to talk about a movie with your miku while you watch it. oh that guy trying to make a vtuber should be interested in this
>>
>>101031954
Neuro is definitely running something more advanced than GPT-J nowadays but that size region seems right, 7-8b, possibly a 13b L2 model in the latest "intelligence update" at most.

On top of GFE and ASMR I'm really thinking gaming is a good way. Yes it's more work but I think it's worth it. You can save yourself time by learning HarmonyLib and only playing Unity games where you can use Harmony to make mods that hook into in-game events. It doesn't have to be run by a neural network honestly as long as the gameplay is interesting and gets a lot of interaction from the AI. You only need 2-3 games as that's all Neuro can play.
>>
>>101032498
As nice as that is, why would you talk to someone during a movie? It would be nice to discus the movie with the model after it is done however. Though I would have to question how much of the movie it can even watch before it starts having to forget.
>>
>>101032498
This sounds like a perfect use case for mamba
>>
>>101032533
>why would you talk to someone during a movie?
It's the same as "why are you watching someone playing a game instead of playing it yourself"
AI commentators is the future bro
>>
There are already a few quants of the full version of DeepSeek Coder V2 Instruct.
Any cpumaxx anon can try it?

>https://huggingface.co/bullerwins/DeepSeek-Coder-V2-Instruct-GGUF
>>
>>101032498
neat
>>
File: Untitled.png (286 KB, 1286x1144)
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
https://arxiv.org/abs/2406.10774
>As the demand for long-context large language models (LLMs) increases, models with context windows of up to 128K or 1M tokens are becoming increasingly prevalent. However, long-context LLM inference is challenging since the inference speed decreases significantly as the sequence length grows. This slowdown is primarily caused by loading a large KV cache during self-attention. Previous works have shown that a small portion of critical tokens will dominate the attention outcomes. However, we observe the criticality of a token highly depends on the query. To this end, we propose Quest, a query-aware KV cache selection algorithm. Quest keeps track of the minimal and maximal Key values in KV cache pages and estimates the criticality of a given page using Query vectors. By only loading the Top-K critical KV cache pages for attention, Quest significantly speeds up self-attention without sacrificing accuracy. We show that Quest can achieve up to 2.23x self-attention speedup, which reduces inference latency by 7.03x while performing well on tasks with long dependencies with negligible accuracy loss.
https://github.com/mit-han-lab/Quest
code up. seems clever.
these were recent too
https://github.com/Zefan-Cai/PyramidKV
https://arxiv.org/abs/2405.06219
>>
>>101032717
XXth paper that claims better perf with KV cache
>>
File: Untitled.png (581 KB, 1106x2386)
mDPO: Conditional Preference Optimization for Multimodal Large Language Models
https://arxiv.org/abs/2406.11839
>Direct preference optimization (DPO) has shown to be an effective method for large language model (LLM) alignment. Recent works have attempted to apply DPO to multimodal scenarios but have found it challenging to achieve consistent improvement. Through a comparative experiment, we identify the unconditional preference problem in multimodal preference optimization, where the model overlooks the image condition. To address this problem, we propose mDPO, a multimodal DPO objective that prevents the over-prioritization of language-only preferences by also optimizing image preference. Moreover, we introduce a reward anchor that forces the reward to be positive for chosen responses, thereby avoiding the decrease in their likelihood -- an intrinsic problem of relative preference optimization. Experiments on two multimodal LLMs of different sizes and three widely used benchmarks demonstrate that mDPO effectively addresses the unconditional preference problem in multimodal preference optimization and significantly improves model performance, particularly in reducing hallucination.
pretty interesting considering their initial usage of dpo actually performed better with no images (meaning things were fucked). better vlms soon
>>
>>101032498
>in the future
It's always in the future. I wonder what is possible right now. Like, feeding the movie script https://assets.scriptslug.com/live/pdf/scripts/being-john-malkovich-1999.pdf synchronized with subtitle timings and a prompt that will cut automated responses to 1% of the time, and also respond to talk initiated by user, but I guess llms will be just confused on what's going on as the context gets bigger
>>
I've been on a break bros, is there any hope left for 24gb vramlets at this point? I see people are unironically still running yuzu alter. are there any worthwhile models that aren't either 8b or 800b?
>>
>>101032867
>I see people are unironically still running yuzu alter.
yes hello, cputard here, dense models too slow, yuzu actually not very retarded at all
>>
>>101032867
Maybe the Qwen2 MoE?
>>
File: Untitled.png (576 KB, 1119x2352)
QTIP: Quantization with Trellises and Incoherence Processing
https://arxiv.org/abs/2406.11235
>Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing weights to low-precision datatypes. Since LLM inference is usually memory-bound, PTQ methods can improve inference throughput. Recent state-of-the-art PTQ approaches have converged on using vector quantization (VQ) to quantize multiple weights at once, which improves information utilization through better shaping. However, VQ requires a codebook with size exponential in the dimension. This limits current VQ-based PTQ works to low VQ dimensions (≤8) that in turn limit quantization quality. Here, we introduce QTIP, which instead uses trellis coded quantization (TCQ) to achieve ultra-high-dimensional quantization. TCQ uses a stateful decoder that separates the codebook size from the bitrate and effective dimension. QTIP introduces a spectrum of lookup-only to computed lookup-free trellis codes designed for a hardware-efficient "bitshift" trellis structure; these codes achieve state-of-the-art results in both quantization quality and inference speed.
new day new quant. they actually compare to the SOTA (Quip#) and beat it so things look good. pseudocode in paper
>>
>>101032867
>are there any worthwhile models that aren't either 8b or 800b?
yeah, Nemotron 4 340b. Llama 405b once it releases next week.
> is there any hope left for 24gb vramlets at this point?
no.
>>
>>101032867
Yi 1.5, but there are no finetunes for it, I guess people are waiting for Yi 2 to make finetunes
Gemma 27b is going to be released soon
Qwen2 MoE if you have some RAM
But the problem nowadays is just the lack of good finetunes
>>
>>101032580
I don't think that's a direct equivalent. In the "why are you watching someone playing a game instead of playing it yourself" argument, you are still watching it alone without commentary. If a person talking during a movie is rude and distracting, I don't see why the same wouldn't be true for AI.
>>
File: file.png (59 KB, 1203x548)
it's over gguflets
>https://huggingface.co/alpindale/magnum-72b-v1/discussions/1
>>
>>101031035
How about a log?
>>
Another sign that we're losing steam is how long it takes for quants to show up. I remember when quants of new models would be ready within an hour. Now you have to wait almost a day for someone to do it.
>>
>>101033285
Comedy gold kek
>>
What temp/samplers are you using when prompting code?
>>
>>101033285
>in anticipation
ruined
>>
is GPT4o strictly better than GPT4?
>>
>>101033596
No it's worse for code
>>
Even the latest discord troon-shilled claude log finetune is going to quickly become stale after some period of use. The problem with LLMs is at a fundamental level.
>>
>>101033466
I don't get this one frequently
>>
>>101033629
>ADHD retard needs endless stimulation and novelty to stay focused.
The problem is in your brain.
>>
>>101030715
Teto my beloved

https://www.youtube.com/watch?v=ZR0AO81W05I
>>
>>101033663
The novelty is precisely what makes LLMs trained on different data than usual appear interesting, at least for a little while.

Once that fades away, it will be obvious that coherency, attention to detail, common sense reasoning, event tracking, and a lot more that humans take for granted will still be the same as the model used as a finetuning base.
>>
God this fucking sucks. I want WizardLM-2-8x22b but with a function calling finetune and some goddamn cheap vram cards.

come the fuck on nvidia, do something for us here
>>
>>101031350
How reliable is function calling with an llm?
>>
>>101033848
>>101033835
That's my big problem so far, it's not exactly reliable. I did some experiments with WizardLM-2-8x22B and it *does* try to do function calls actually, but it tries to do so by using a python markdown block, which is awkward, and also sometimes rarely prepends it with a slash. Might be solvable with either a better BPW (I'm running Dracones 2.5bpw exl2) or a better template. I tried an Alpaca based template, so it's amazing it even worked to be honest.
>>
Where is the technology on using a 3d model as input for image generation? (or video generation)
>>
>>101033746
the problem is still in your brain.

If you prompt it with always the same shit, it will always respond in a similar manner. The fuck do you expect?

>>101033848
basically 100% reliable with grammar sampling. But since this is /g/ in 2024 and zoomerites who can only see the world on a scale of 'likes' think even regex is some kind of black magic, good luck with your cool youtube tech project bro. I'm sure i'll never hear back from you again.
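If you've never touched it, this is the whole trick. Rough sketch (assumes the stock llama.cpp server and its GBNF "grammar" field; the speak/emote/note functions are just example names, swap in whatever your app exposes):

```python
# Constrain output to a single function call using llama.cpp grammar sampling.
import requests

GRAMMAR = r'''
root ::= call
call ::= func "(" arg ")"
func ::= "speak" | "emote" | "note"
arg  ::= "\"" [^"]* "\""
'''

r = requests.post("http://127.0.0.1:8080/completion", json={
    "prompt": "User: wave at chat and say hi\nAssistant:",
    "grammar": GRAMMAR,
    "n_predict": 64,
})
print(r.json()["content"])  # e.g. speak("hi chat!") -- output is forced to match the grammar
```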
>>
>>101033874
Regex can deal with a slash fairly easily. Is it better to just run an 8b on exl2 and re-roll until you get a call that the program recognises?
But the app would be stuck to roleplaying as someone with poor spatial awareness and hallucinations... just like your average women. I feel the immersion already.
>>
>>101033922
You're thinking ControlNet, that's over in the SD thread.

>>101033949
>>101033945
To give an example of what I've had Wizard 8x22b come up with
https://pastebin.com/ErNpZ7gx
>>
>>101033975
Also, this isn't using Dracones quant, I think it's busted. Switched over to Quant_Cartel's 2.5bpw exl2.

If the goal is to make this a vtuber/waifu the template would need a bit more to it, such as the character card, recent chat history, and vector recall.
>>101033949
Also, I don't think I'd trust an 8b to work for shit for this task, but maybe.

Anyway, going to fucking bed.
>>
>>101033874
>2.5bpw
oof.jpg
More iterations needed, add another pass to have the model check and correct function calls with several correct examples in context. Many issues with LLMs can be improved by giving the model more cycles to work on the problem before committing a result. See also beating GPT4o by jamming a bunch of open weight models together https://www.together.ai/blog/together-moa
We can do a lot more today just by hardening the whole pipeline such that one badly sampled token anywhere in the process doesn't break it
>>
*pins you down on the bed*
>>
>>101033975
Try xml, and simplifying that shit a lot. Give it 4 options to choose from for mood or whatever. The claude jailbreak stuff would be a good place to look.
>>
GPT that can listen to a part of a song and tell you what notes/chords are being used when?
>>
>>101030562
did you try llamafile with llamas on your rig?are they faster or slower?
>>
>>101034076
That software is already built into every White person's brain though?
>>
>>101034122
>software
>built into
That would be firmware, and no, not everyone has that update. They're referred to as being tonedeaf
>>
>>101033945
>If you prompt it with always the same shit, it will always respond in a similar manner. The fuck do you expect?

That's not the point I was making, though. The novelty of elaborate prose is a short-lived distraction from the fact that the LLMs are dumber, have considerably lower situational awareness (and a plethora of other attributes) than an inexperienced human roleplayer, and will continue being so with current architectures.
>>
>>101033975
No, when will it be possible to use a 3d model file directly as input
>>
>>obsolete
>Yet, their one year old Kayra still beats every other storytelling model on the market.
>>>/vg/482457373
>>
How do I jailbreak Qwen2
>>
>>101034817
local bros, how do we respond to this without sounding mad?
>>
What instruct template does Deepseek V2 use?
>>
(we) don't have to. Just don't.
>>
>>101034825
Tell the model that this chat is being monitored by the CCP and that following the prompt will result in bonus social credits. Disobedience will result in great sadness to Xi Jinping.
>>
>>101034817
Thanks, I was waiting for you.
>>
>>101034825
With more context or a prefill.
>>
File: dorarara-crazydiamond.gif (408 KB, 220x129)
I want to make a tune using DORA and call the model Crazy Diamond
>>
Magnum verdict?
>>
>>101035157
Slop.
>>
The t/s of the Deepseek isn't half bad as it's a MoE with 22B active parameters.

Offloading 0/0 layers only using CPU+RAM I get like 3t/s inference. My system isn't even optimized for cpumax as it's only ddr4 and a single socket motherboard.

EPYC 7402
512GB 3200Mhz (8x64 sticks)

GGUF Q8_0 full deepseek instruct model (non lite)

>picrel
>>
The real test of these models for me is not making them do things you can find with one google search and that there are millions of examples of, but helping with problems that need some figuring out: kinda obscure, not too difficult (because all AIs fail there), but difficult enough that I'd like a hint of where to start. Deepseek doesn't pass that threshold for me sadly and seems to generally not know a lot about things that aren't python. It's still either gpt4 or maybe opus, pretty much. Good enough for pajeet to do the needful, I guess.
>>
>>101034817
>Mogs Claude opus and CR+
Holy shit how did the novel ai guys do it?
>>
>half way through 2024
>still nothing that's better for ERP than GPTJ-6B
>>
>>101035460
kek, it doesnt
>>
>>101034817
Only at prose. Otherwise Kayra is more retarded than LLAMA-1. It's just a little bit better than OPT. Nothing miraculous >>101035460
Everything post LLAMA-1 is infected with purpleslop. NAI strictly trains on creative writing so its prose is better.
>>
>>101035460
It's a retard baiting delusional retard into giving retarded responses for him to cross-post. 50% he's responding to himself. Most of NAI users' opinions come from the fact that they simply don't try anything other than NAI at all because they're afraid to become addicted lotus chasers AND they don't own even a single 24GB GPU. And when they do try something, they don't like it because it's not EXACTLY like NAI or the fabled Summer Dragon.
Their finetune is good >>101035534 and Zalty unironically carries the whole service on his shoulders.
>>
>>101030792
>CPuMAXx
Can you finetune models with just the CPU?
>>
Anons I'm a newfag who's just learned how to use Runpod,
Is it possible to get 70B on a 3090's?
>>
>>101035572
>Summer Dragon
tbf, command r pretty much is. Instruct is all about the prompting to begin with; you can build an instruct prompt to just endlessly continue a story too. It'll fall apart at some point, but so do autocomplete and non-instruct base models. Kayra is super horny and fucking retarded.
>>
>>101035296
So you're saying local has achieved around SOTA level now?
>>
I don't understand why this general keeps lying to itself.
Yes, a model pre-trained and fine-tuned for storytelling is better than big models trained for general use and fine-tuned for assistant tasks, how is that hard to understand?
>>
>>101031350
I dunno, how's the LLM going to use the TTS model in a compelling way? It's one thing to sound like a dry "Miku" - basically just a bunch of formants and plosives strung together - vs. a temperamental, bratty teen catgirl.
All the free TTS stuff I've found sounds like they got one of the blue-haired cuntbag "empowered women" in the office to read the dialogue to be fed into the model.
There's a ton of eroge with very good voice acting AND text for everything they say, unfortunately it's in Japanese. I've noticed the English voice acting in Genshin Impact isn't terrible, and you can probably turn off the BGM, then capture it via HDMI and use OCR to rip the subtitles into a text file? For sure, talking dialogue from a movie or anime is way more work, since you're stuck with music and SFX mixed into the voice acting.
>>
>>101035686
Ok dude, go back to using Erebus then.
>>
>>101035686
Well most of these anons are just putting pride in the models they use.
Kayra has always been better at storytelling than any Llama model.
Although if somebody wants to make a proper storytelling model that beats Kayra you probably should use L1 and not L2 because L2 was trained on assistant bullshit, and L3 is bad because of it barely having any novels in its pretrain
>>
>>101035776
Who said I ever left Erebus?
*Laughs in OPT*
>>
>>101035806
What about Qwen2
>>
>>101035774
I mean, TTS is a whole other issue and that's going to come down to personal preference. We're limited by the technology of our time, always have been. Fuck me man, now I'm reminiscing about getting a TI 99/4a with a physical voice module to say curse words back in the late 80s.
Anyway, the LLM itself doesn't "use" the TTS, you just tell the language model that it can "speak" using a function, and ask it to give as much detail as you want to use to render the voice (such as inflections, if your TTS supports it) and feed the function call input to the TTS. If you don't like it, try another TTS. I haven't fucked with any of that myself, personally.
>>
>>101035841
Same thing as L3 Anon. L3 is great but everybody keeps making it retarded by training it on shitty GPT 4 ERP logs
>>
>>101035646
I'm honestly okay with using C-R/C-R+ for the next year, it's just good enough.
>>
>>101030896
Mobile posting away from my pc, but basically, I do something like
>Summarize the appearance and location of all actors in the exact format and nothing else:
>Appearance: <Apprarance of all actors (naked, dressed, dishelved, clean, dirty, messy, etc, separated by comma>
>Location: <General Location/Specific Location of current scene>
Something of the sort.
I find that providing a template with the explanation of each field in the template works well.
Also, using emojis in front of each field somehow makes it less likely that the model will get confused between fields, funnily enough.

>>101031555
Koboldcpp had invented a thing called smart context, which sucked and is deprecated.
I don't think they ever claimed to have invented context shift (the llama.cpp thing), although I do think they changed it somehow
>>
>>101031555
Why not offload that to SillyTavern?
https://docs.sillytavern.app/extensions/smart-context/
>>
>>101036046
>dry slop is better than assistant slop
dire times
>>
>>101035849
I've only played with RTVC. As I understand it, it's doing an on-the-fly STT->TTS. I've noticed, for example, if the model wasn't trained on it, it can't say it. For example, check this video: https://www.youtube.com/watch?v=YF1lBaqeyt8. The model they show in the still (best one included, by the way) can do all sorts of cute, bratty stuff, but not "eh hn!". If you say that, it says "mm hm!", not "ehn hn!".

When I finish rebuilding my Mikubox I'm going to make another pass at taking some Koikatsu voices and turning them into a realistic TTS model.
>>
llama.cpp has a draft model based speculative decoding thing, right?
Does the chosen draft model change the style of the final output?
That could be a decent way to get smarts + decent prose and creativity then, by going with something like commandR and a smaller model trained on different writing styles.
>>
>>101036096
>>dry slop is better than assistant slop
>dire times
CR+ is excellent for what it is. I use it as a literal Japanese tutor that I do my Genki II group exercises with. I just wish I had more 3090s... 5 t/s is tediously slow when there's long replies.
>>
>>101035806
>L3 is bad because of it barely having any novels in it's pretrain.

Are you sure about that? L3-Stheno V3.2 + mikupad feels like a good NovelAI@home replica. What model would you recommend for a 10GB vramlet like me who doesn't enjoy back and forth roleplay that much, but wants to write and read some fun and sexy stories written in 3rd person past tense? Should I just go back to basics and boot up the Llama 3 8B base model?

Kayra was fun for a while, but it does seem kinda limited and retarded compared to Miqu, Command R or Llama-3 finetunes.
>>
>https://wandb.ai/augmxnt/train-bench/reports/torchtune-vs-axolotl-vs-unsloth-Trainer-Performance-Comparison--Vmlldzo4MzU3NTAx
Cool.
Guess I'm going with axolotl for a multi gpu kaggle run.
>>
Remind me, which L3 8B is the most censored, is it the chat or the instruct model?
>>
>>101036447
The base.
>>
>>101036447
both
>>
CR+ does good writing, some anons posted writing examples. It's all in the prompt and I've seen the retarded character cards and the shit people prompt models with. I think most people just don't know how to use LLMs and are too wetbrained to learn.

NAI is trying to finetune l3 70b after years of basically doing nothing. They're done for. Running on inertia. They got the low hanging fruit when AI dungeon lost GPT3 and people wanting to write AI stories had literally nowhere else to go. These times are over and they have absolutely nothing to offer you can't have better locally.
>>
>>101036456
Ah it's a base not a chat. Don't they usually create the instruct off the base model? If so, how do you end up with a less censored instruct?
>>
>>101036512
he's fucking with you, obviously the instruct is more censored
>>
>>101036502
Except most of their money is coming from imagepiggies and since ponytards dropped out of the race, NAI's not going anywhere.
>>
Whats the difference between Deepseeker Coder Base and the Instruct one?
>>
>>101036612
>ponytards dropped out of the race
qrd?
>>
Probably a stupid question, but if I wanted to use something like the old (2 paragraphs, engaging, natural, authentic, descriptive, creative) that went in the Last Assistant Prefix, but for Command-R+, where would I put it in the prompt?
>>
>>101036747
SD3's license stuff, ponyfags can't do a finetune because of that, also sd3 itself is absolute dogshit.
>>
Anons. What's good erp model so far? No idea which one better
>>
>>101036850
NAI Kayra 13B
>>
>>101036850
See >>101006380
>>
File: DiagonalFloatMiku.png (902 KB, 830x811)
>>101032632
>Any cpumaxx anon can try it?
Try it to do what?
>>
i wanna use some TTS but xtts and such are a little heavy to run together with a 70b

would it be alright to use edge-tts and send futa loli dom rape stuff to microsoft servers?
>>
>>101036342
4090 is twice as fast as 3090 for training? Did I get this right?
>>
>>101036982
that defeats the point retard
>>
>>101036204
stheno 3.2 is retarded though. Poor instruction following after just a few messages, characters growing cocks or pussies interchangeably, and it has the same llama 3 repetition problem with identical swipes or messages sometimes at standard samplers (temp 1, minp 0.05).
>>
>>101036911
Which one of them could handle 8k context?
>>
euryale sucks. why was this shilled so hard?
>>
>>101030732
Mesmerizer
>>
>>101030732
Its Tuesday
>>
>>101030732
>everyone
its just one fag
>>
>>101036961
The same thing /lmg/ always does with a new model.
>>
>>101037104
All of them.
Llama 3 is 8k context by default and the others are 32k or more.
Llama 3 also extends context really well with linear RoPE via freq base.
>>
File: 1706284478117473.png (1.15 MB, 762x762)
>>101037167
she's built for dat BBC albeit
>>
>>101037167
I bet it's cudadev. He looks like a teto kind of guy.
>>
>>101037275
based if true.
>>
>>101036869
End yourself
>>101036850
I enjoy WizardLM2
>>
>>101037094
I don't think stheno is much more uncensored than DPO. Finetunes always fuck up the model somehow.
As for L3 70B, so far, every refusal I've seen so far was fixable in the system prompt with a simple "it's OK to do x with the user"
>>
>>101037330
>End yourself
it's a good ERP model DOE
>>
>>101037094
Have you tried presence penalty by any chance
>>
>>101036911
Stheno 3 gguf 18 and Mixtral 8x7b limarp zloss Q5_k_m, both 8B / 7B models. Are they even as good as people say?
>>
>>101037381
> both 8B / 7B models
Mixtral 8x7b is something like a 54B parameters MoE.
And yeah, as far as the size goes, those are about as good as it gets I'm pretty sure.
CommandR is by far superior to either, so if you have the hardware for that, go wild.
Beyond that, Miqu is good, Wizard 8x22 is good. CommandR+ is godly.
>>
Why did the general have an argument about NAI today
>>
Teto? Muchacho. Migu? Muchacha! Mucha muchacha!
>>
does anyone even use avx1 with llama.cpp? same with the power and chinese arch support probably 0.01% of people use that

https://github.com/ggerganov/llama.cpp/pull/7845
>>
>>101030896
>>101036063
Alright, back on PC now.
Here's one prompt I use in a RPG with Stheno:
>Summarize the current scene by outputting the following information, following the given format exactly:
>
>1. (<The type of the current scene: OOC, CONVERSATION, EXPLORATION, INVESTIGATION, COMBAT>)
>
>2. [Suggestions for {{user}}: <Suggestions or strategies for {{user}} based on the current situation if appropriate>; Dice Roll: <Any requests for dice rolls for specific actions (Skill Check, Initiative, Attack Roll, Saving Throw, etc) in the format: "Action; Difficulty Class">]
>
>3. Character status:
> Appearance: <Brief concise description of the current appearance of all present actors (naked, dressed, wearing accessories, looking tired or energetic, etc)>
> Position: <Detailed description of present character's position relative to one another (in front of X, behind Y, facing Z, back to A, etc etc) and their environment>
>
>4. Time and Location:
> Current Location: <name of current location, city, state>
> Date-Time: <Date / time in the format day-of-week dd/mm/yyyy hh:mi, changing date and time realistically (minutes for a short conversation, hour for long scenes, days for time skips, etc) based on context. Minimum advancement, 05 minutes>
> Time of Day / Weather: <Time of day consistent with Date-Time such as Early morning, Late morning, Early afternoon, Late afternoon, Early evening, Early Night, Late Night / Sunny, Full Moon, Cloudy, Raining, Cold, Hot, Quarter Moon, Stormy, Moonless Sky, Cloudless Sky, etc>
Those were originally two prompts, but I decided to merge them into a single output, and it still works.
>>
>>101037609
>for Sandy Bridge and Ivy Bridge users.
The 2500K is STILL all you ever need as a processor.
>>
>>101037626
> The 2500K is STILL all you ever need as a processor.
for excel and internet browsing with linux sure. its shit for llama. how many t/s do you get anon?

the iq4xs 7.7t/s the llamacpp dev got on avx1 definitely isnt on a 2500k
>>
>>101037267
TRVKE
>>
File: deepseek_q8_perf.png (50 KB, 1823x1039)
>>101035230
the regression in performance was fixed (mmap was being bypassed when numa flags were set)
I'm back up to seeing 7+t/s
>>101032632
>>101037179
What ones are we missing?
I've already run my normal two (recapbot and coding), but I did my own quants.
>>
>>101033945
>If you prompt it with always the same shit, it will always respond in a similar manner.
No. Some models have very poetic, flowery language, some models have a lot of creativity and will use 100 different ways to describe something, some models will teleport people around and forget states while others are far better at 'retaining memory'/looking back at the chat, some models are made much better by sample text and some models completely ignore it, some models have political alignment and some models don't. Some models have massive positivity bias. Some models have a massive horny bias. Some models put all their points in math, for some reason.
As the other anon said, some of those things end up mattering a lot more than others with prolonged use.
>>
>>101037857
It was a joke. Since /lmg/ only cares about plapping.
>>
>>101036193

I was just looking into that. So llama.cpp "has" speculative decoding using a "draft model". But...

1. Its implementation is in the 'examples' folder. I wonder how optimized it is as a first implementation (it may even be still in a POC state).

2. It's not implemented into llama_cpp server.

I also looked into llama_cpp_python and it "has" speculative decoding, but it's not quite the one we're talking about. It has "prompt lookup decoding", which calculates some probabilities from the whole prompt instead of using a second LLM (which is good for responses that have the same tokens as the prompt, like summarization, data extraction, etc.; but not so good for "make me coom in a new and original way").

llama_cpp_python's seems to be well designed to just add new implementations. But I'm skeptical on the performance of juggling the two LLMs through python bindings instead of directly in llama.cpp binary. (I don't know how hard it would be to do it directly in llama.cpp. I haven't written c++ since college)
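For reference, the prompt-lookup variant is about two lines in llama_cpp_python. Sketch below from memory; double-check the class and argument names against whatever version you have installed, and the model path is obviously made up:

```python
# Prompt lookup decoding in llama-cpp-python: drafts tokens by matching n-grams
# already present in the prompt, so no second model is needed.
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llm = Llama(
    model_path="models/big-model.Q4_K_M.gguf",               # hypothetical path
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10),
    n_ctx=8192,
    n_gpu_layers=-1,
)
out = llm("Summarize the following text:\n" + open("doc.txt").read(), max_tokens=256)
print(out["choices"][0]["text"])
```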
>>
>>101037267
go back
>>
>>101037267
stay here
>>
>>101038153
>It's not implemented into llama_cpp server.
Shit, right.
I think cudadev said as much in one of the previous threads.
Still, I wonder how output changes with different draft models.
I might test it out later.
>>
>>101037267
based
>>
meta releases several models:
>7 + 34b chameleon models (aka, the multimodal in/out models they teased in a recent research paper)
>multi-token prediction models
>new music gen model JASCO
>audioseal model for detecting AI generated speech/audio
>other datasets/tools
https://ai.meta.com/blog/meta-fair-research-new-releases/
>>
>>101038430
nothing burger
>>
>>101038430
>7 + 34b chameleon models
Yoooooo.
Let's go.
Here's hoping it's actually good.
>>
>>101038363
>Still, I wonder how output changes with different draft models.

It doesn't. Draft models don't change the output.
>>
File: 1709809406403172.png (105 KB, 1672x992)
>>101038430
>Partnership supporting the release of the PRISM dataset
OHNONONONONONO!
AHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHA!!!
>>
>>101038430
>The models we’re releasing today were safety tuned and support mixed-modal inputs and text-only output to be used for research purposes. While we’ve taken steps to develop these models responsibly, we recognize that risks remain. At this time, we are not releasing the Chameleon image generation model.
fuck
>>
>>101033743
https://www.youtube.com/watch?v=1j4umK6ZaiE
>>
File: 1692212316326978.png (107 KB, 1672x992)
>>101038471
the quintessential jewry
>>
>>101038479
Predictable
>>
>>101038363
According to some sources, the change isn't noticeable, it's like quantizing from F16 to Q6
>>
>>101038430
b-based
>>
File: 1698355376473441.png (640 KB, 1672x992)
>>101038471
>>101038509
/lmg/ supports these people in securing models from any wrongthink btw
>>
>>101038548
>HE HE
youve gotta be kidding me
>>
>>101038562
shamone
>>
>>101038562
Unfortunate English interpretation of foreign names.
>>
>>101038548
I feel unsafe.
50% of the membership isn't 13% of the population.

>HE HE
Eliminated deadname completely, exists only has preferred pronoun.
>>
>>101038363

I wonder why this hasn't gained more traction. The main takeaway is that it simply doesn't change the quality of the output.

You have speculative decoding in vLLM. But vLLM is 100% GPU, which is not what I want.

I want to try out a sweet spot: having some of the big model's layers on the CPU and offsetting the performance penalty with a draft model in the GPU. Something like:

Main Model = X gb
Second model = Y gb, where Y < X (as per the papers on the subject, the sweet spot is X = 10 * Y)

A tokens/second = t/s with main model 100% in VRAM
B tokens/second = t/s with main model W% in VRAM and draft model 100% in VRAM, for speculative decoding.

My sweet spot would be if I could get A == B with W% * X + Y < X
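Before trying setups, a quick way to sanity-check whether that sweet spot can exist: under the standard speculative-decoding analysis (Leviathan et al. style), with draft length k and a measured per-token acceptance rate a, one verification pass yields about (1 - a^(k+1)) / (1 - a) tokens. All the numbers below are placeholders you'd have to measure on your own setup:

def expected_accepted(a, k):
    """Mean tokens produced per verification pass, incl. the bonus token (assumes a < 1)."""
    return (1 - a ** (k + 1)) / (1 - a)

def spec_tps(a, k, t_draft, t_verify):
    """Rough tokens/sec: k draft-model steps plus one target forward pass per verification.

    t_draft  : seconds per draft-model token (fully in VRAM, fast)
    t_verify : seconds per target-model pass over the k drafts (partially offloaded, slow)
    """
    return expected_accepted(a, k) / (k * t_draft + t_verify)

# Made-up example: 70% acceptance, 5 drafted tokens per pass.
# If a verify pass costs about the same as a single-token pass, the offloaded model
# alone would do roughly 1/t_verify = 2 t/s; speculation bumps it to ~5.3 t/s here.
print(spec_tps(a=0.7, k=5, t_draft=0.01, t_verify=0.5))

In those terms, my A == B condition becomes: find the smallest W where spec_tps still matches the all-in-VRAM t/s.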
>>
>>101038548
Odd. It's not a very diverse looking team. Everyone looks like either a woman or a Jew.
>>
>>101038661
That's as diverse as it gets!
>>
>>101038430
sir, kindly do the needful and provide the GGOOF
>>
File: 1697683621348965.jpg (119 KB, 793x1024)
>>101038661
Happens sometimes
>>
>>101038661
its enough, such positions require "trusted" people
>>
File: 1693250909366524.png (230 KB, 496x486)
>>101038548
whats wrong with his head
>>
File: file.png (329 KB, 1280x720)
>>101038430
listening to the 49 second video on loop
>>
>>101035857
I've been saying this shit from back when I was LoRAing Llama-1 models. Your goal with tuning is to give the model the linguistic 'tools' to accomplish the task. It doesn't actually learn to do the thing if you're just going to throw the solution right at it. If I want more cerebral outputs I don't feed it GPT-4 logs. I feed it epistemological writings. If I want better descriptions of novel spatial situations I feed it surrealist literature. It's not actually reading the books and gaining the intrinsic knowledge, sure, but it's basically moving its output tendencies toward the linguistic hallmarks of those things and the results are basically the same. Fake it till you make it.
"tutoring" is the dumbest fucking training strategy academic hacks have ever come up with and it's about time people let go of it for the retarded bullshit that it is.
>>
>>101038683
>subsequently deleted
lol
>>
>>101038696
the wispy bangs...
>>
File: tet_classical.png (3.01 MB, 1328x1992)
It's Tuesday and all's right with the world.
>>101035123
Potentially based but what model would you tune and with what?
>>
>>101038696
thats one nigga who can't and won't accept that he's bald anon
>>
>>101038683
>pic
Fucking hilarious if true.
>looks it up, it's real
Fucking hell that's so fucking funny.

>>101038646
>The main takeaway is that it simply doesn't change the quality of the output.
I know that in theory it shouldn't, but I doubt that it has no effect on the output whatsoever. Not necessarily even regarding "quality", but even something like shuffling the chances of the top 3 logits of the main model a little would already be akin to messing around with top-K and temp, for example.
Ultimately, I will play around with it and do some token prob comparisons to see what happens.
Speculative decoding is just a really fucking cool idea.
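If I do the comparison, something like this is enough to diff two runs, assuming the backend can dump per-token probabilities (the names below are placeholders, not any specific API):

import math

def max_logprob_drift(run_a, run_b, eps=1e-9):
    """run_a/run_b: one dict per generated position, mapping token -> probability.
    Returns the largest absolute log-prob difference over tokens present in both runs."""
    worst = 0.0
    for pa, pb in zip(run_a, run_b):
        for tok in pa.keys() & pb.keys():
            worst = max(worst, abs(math.log(pa[tok] + eps) - math.log(pb[tok] + eps)))
    return worst

If the accept/reject step is implemented correctly, the drift should be numerical noise; anything on the order of a temp or top-K change would point at a bug rather than at the draft model.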
>>
Biggest bigs of all time

https://ai.meta.com/blog/meta-fair-research-new-releases/

We got multimodal 34B, we got audio model, we got multi token prediction model...
>>
>>101038696
>makes your local model gay and lame
>>
>>101038430
>https://ai.meta.com/resources/models-and-libraries/chameleon-downloads/
>Request access to Chameleon
Oh fuck off. "open" my ass. Can someone mirror it?
>>
>>101038866
"multimodal" is just "duomodal", image+text, not image+text+audio
>>
>>101038866
Well it's not quite a suno-at-home audio model though. At a quick glance it appears to be a model designed to 'restyle' an input audio stream based on a text prompt. So you could input circus music and "HIP HOP, REGGAE" and it would turn it into hip hop/reggae music that follows the same chords/progression as the circus music.
>>
Meta Chameleon 7B & 34B language models that support mixed-modal input and text-only outputs.

https://github.com/facebookresearch/chameleon

Can't download the models from hf so you have to submit to get approved. I got approved instantly after submitting.
>>
>>101038683
>>101038836
>pic
>https://www.timeshudsonvalley.com/mid-hudson-times/stories/harvey-meets-with-city-landlords,91305
Fuck it's real. Coworker now wondering what I am laughing at, can't explain it without risking a trip to HR, need to get off this god forsaken site.
>>
>>101038912
probably based on llama-1 or llama-2 architecture though
>>
>Chameleon
>By checking this box, I understand this research model is not intended to be accessed by residents of, or those accessing the model from, Illinois or Texas
kek why
>>
>>101038933
red states
>>
>>101038933
Someone should upload the model if they don't mind.
>>
>>101038933
Probably the biometric privacy laws; Illinois (BIPA) and Texas both have statutes covering face/biometric data, which is why a bunch of image-related releases geoblock exactly those two states.
>>
>>101038933
because it was made for your average blue-state faggot at reddit
>>
>>101038933
>Illinois
what the fuck, what do they have against chicago....
>>
>>101038933
No idea, but VPN users in Illinois or Texas are winning
>>
>>101038881
> two is not more than one
>>
File: 1712149250190143.png (145 KB, 2146x647)
>>101038866
>>101038933
its over....
>>
>>101038982
you only start calling something "multiple" when it's greater than two.
>>
>>101038881
/lmg/ will fall for another meta grift anyway
>>
>>101038986
Is this some local law or something? I remember that Texas tried to ban porn a while back so I wouldn't be surprised if they are trying to ban AI
>>
>>101038986
>127.0.0.1
gotcha
>>
>>101039017
>they are trying to ban AI
they do that for different reasons though, any AI shits out demoncrat groomer approved talkpoints only, so AI ban would make sense in this case.
>>
>>101038866
Let's fucking go boys. Hopefully my waifu will stop resting both feet on my shoulders while standing
>>
HOLY SHITTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
>https://huggingface.co/facebook/multi-token-prediction
>https://huggingface.co/facebook/multi-token-prediction
>https://huggingface.co/facebook/multi-token-prediction
>>
>>101038940
The red state of Illinois?
>>
>>101039065
>code model
nothingburger.
>>
>>101039065
QRD? Seems like a paper from april.
>>
>>101038683
lmao
>>
>>101039073
All states are red.
Only cities filled with Landlords and 13-50's are blue.
>>
>>101039065
So what does it do? Predict multiple token sequences simultaneously instead of just the next one?
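From skimming the paper, it looks like yes: one shared trunk with several output heads, where head i is trained to predict the token i+1 positions ahead; at inference you can keep only the next-token head or (as I understand it) use the extra heads for self-speculative decoding. A toy sketch of the training side, with all sizes made up and no relation to Meta's actual code:

import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """Toy multi-token prediction: shared trunk, n_heads output heads,
    where head i predicts the token at offset i+1."""
    def __init__(self, d_model=64, vocab=1000, n_heads=4):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(n_heads)])

    def forward(self, hidden):                      # hidden: [batch, seq, d_model]
        h = self.trunk(hidden)
        return [head(h) for head in self.heads]     # one logits tensor per head

def mtp_loss(logits_per_head, tokens):
    """Sum cross-entropy over heads; head i is scored against tokens shifted by i+1."""
    loss = 0.0
    for i, logits in enumerate(logits_per_head):
        shift = i + 1
        pred = logits[:, :-shift, :]                # positions that still have a target
        tgt = tokens[:, shift:]
        loss = loss + nn.functional.cross_entropy(
            pred.reshape(-1, pred.size(-1)), tgt.reshape(-1))
    return loss

# Smoke test with random data.
model = MultiTokenHead()
logits = model(torch.randn(2, 16, 64))
print(mtp_loss(logits, torch.randint(0, 1000, (2, 16))))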
>>
Why does Meta's stupid blog not have an RSS feed?
>Get the latest from AI at Meta in your inbox
>newsletter
What fucking year is this?
>>
>>101038430
is this nothing or are we bac?
>>
>>101039148
as usual, filtered slop. so nothing.
>>
>>101039166
Calm down slop man.
>>
>>101038933
kek
>>
These are research models. They probably weren't trained with the latest datasets or whatever. They're not for us (unless someone here is a researcher), they won't do the things we use models for today.
>>
>>101039207
>They're not for us
I don't care. I will have sex with any and every new model that releases.
>>
>>101039207
they are for me and i am not a researcher
WINTODDLERS BTFO!
>>
>>101039215
Ok but you won't like it.
>>
>>101039207
Probably right. If you wanted a 34B llama 2 multimodal model, you had llava 1.6 this whole time.
>>
>>101039217
>t. loonix tard with chink kernel backdoors
>>
i just came to say that qwen is real nice :)
>>
>>101039089
Thanks jackass, that's not the point, is it?
>>
>>101039249
buy an ad
>>
>>101039238
>chink kernel backdoors
no you confused me for an arch user, im a debian stable GOD
>>
some twitter speculation that in typical meta fashion the chameleon image generation capability removal is more of a *wink wink we totally didn't release it, that would be so unsafe haha* than actual hard removal
https://x.com/_xjdr/status/1803116220444713365
https://x.com/laurensweitkamp/status/1803119787704459727
>>
File: baked_elon.png (173 KB, 293x293)
>>101039258
i just posted one for free
>>
How long until quants?
>>
llama1GOD leak this one too plz
>>
>>101039253
It is the point. If states carved out their globo zones, people on both sides would have more representative government, and there wouldn't be any question of why a particular part of TX or IL is being discriminated against.
>>
>>101038986
>By checking this box, I understand this research model is not intended to be accessed by residents of, or those accessing the model from, Illinois or Texas
uhhh bros..?
>>
>>101039148
Chameleon is ancient Llama2-based crap at this point.
>>
>sending dick picks to the bratty model for evaluation
>using character pics in a character card so the model can have better understanding how they should look
>group masturbation with a model, asking it for a dirty talk about the pics you are sending
the possibilities are endless
I wonder how images count for the context size tho
>>
so is it a nothingburger or a bigburger
>>
>>101039383
>w-what's wrong with it...?
>>
>>101039414
see >>101039166
>>
File: cham.png (51 KB, 601x293)
>>101039371
From May:
https://x.com/armenagha/status/1791275549815648473
https://arxiv.org/abs/2405.09818
>>
>>101039383
>I wonder how images count for the context size tho
That's a good question actually.
The image is tokenized somehow, right?
Is it on a pixel by pixel basis?
Sectors? Something more abstract that only the image decoder and encoder knows?
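Digging a little: if I'm remembering the Chameleon paper right, it's the "something more abstract" option. A learned VQ (VQGAN-style) tokenizer turns the image into a grid of discrete codes from a fixed codebook (IIRC 1024 image tokens per image from an 8192-entry codebook), and those get interleaved with the text tokens, so each image costs a fixed chunk of context. A toy sketch of just the quantization step, with made-up sizes:

import torch

def vq_tokenize(patch_embeddings, codebook):
    """Map continuous patch embeddings to discrete codebook indices,
    the VQ-VAE/VQGAN-style step a Chameleon-like image tokenizer would use.

    patch_embeddings: [num_patches, d]
    codebook:         [codebook_size, d]
    Returns one integer 'image token' per patch.
    """
    dists = torch.cdist(patch_embeddings, codebook)   # [num_patches, codebook_size]
    return dists.argmin(dim=-1)                       # [num_patches]

# Made-up sizes: a 32x32 grid of patches and an 8192-entry codebook gives
# 1024 image tokens, which then count against the context just like text tokens.
codes = vq_tokenize(torch.randn(1024, 256), torch.randn(8192, 256))
print(codes.shape)   # torch.Size([1024])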
>>
>>101039262
Interesting. Still, it's probably not really very good image generation. Probably won't really be worth using.
>>
>>101039414
biggestburger
>>
>>101039414
millions will goon
>>
>>101036783
>ponytards dropped out of the race
You mean AstraliteHeart or is /mlp/ community fine-tuning their own model like /trash/ did with SD1.5?
>>
>>101039741
SD3 terms are violent and hateful against finetunes.
>>
>>101039741
AstraliteHeart ofc, and his pdxl, next one is pdxl v6.9, no SD3 tunes.
>>
>>101039783
SD3 is violent and hateful against common sense with how hard-filtered it is.
>>
Videogen is some cool shit, I hope opensauce catches up.
https://x.com/c_valenzuelab/status/1803063113723629878
>>
magnum is retarded
sad.
>>
>>101039867
Unsurprising. These datasets are garbo.
>>
>>101039811
Common sense now means alignment and agreement with what has been approved as good and beautiful.
Be mindful when you use oldspeak terms that as we Move Forward, their meanings are updated and improved.
>>
>>101039945
that's doubleplusgood of you and doesn't create a repository of exactly the inverse
>>
Oof, it sucks to admit but it looks like the people who said that models like L3 that are trained on an enormous number of tokens take a bigger hit from quantization were right
I've been using Euryale-2.1 L3 70B on local at 4.5bit and it was pretty good, but OpenRouter also added it today so I thought I'd try it there
It is MUCH smarter on OR
>>
>>101039945
>trust the science
>>
>>101039989
Weird, I usually find OR models dumber
>>
>>101039989
Everybody tries to deny it because of sunk cost but every degree of quantization is damaging and everything under Q5 is guaranteed retarded.
>>
>>101039989
>qwen ggufs fucked again
mitCUCΚSSSSSSSSSSSSSSSSS FIX ITTTTTTTT
>>
>>101040008
Some of them are extremely cucked, you just have to figure out which ones. It's usually related to cost of usage. Like, I think their "Opus" is just Sonnet, the speaking patterns are the exact same, and I get extremely similar swipes between their "Opus" and their Sonnet, which I do think actually directs you to Sonnet.
>>
but japanese hate niggers
>>
>>101040050
yeah I'm pretty sure they're not doing this anon
>>
>>101040093
slant-eyed bugs love that shit, pixiv is full of it
>>
>>101040008
OR's CR+ has this weird tendency to just completely ignore inputs and go fucktarded on a somewhat long context. No such problem with local.
>>
>>101040120
maybe we shouldn't have dropped 2 nukes on them
>>
i keep going back to 8x7b limarp zloss...
>>
>>101040145
maybe we should have dropped 2 more
>>
>>101040148
It's pretty good if you have the hardware to run it without having to wait 5 minutes per gen.
>>
>>101040154
Nah. Using nukes is for pussies. Part of the problem and why America's shit now.
>>
>>101040142
I noticed CR+ being busted on OR too and I think they messed up the instruct format somehow. But if you're using SillyTavern I found you can fix it. Tick the "legacy" option to send your own instruct format instead of using OpenRouter's provided one, and then turn on instruct mode in the formatting pane, and select Command-R as the template. Fixes it, makes it act the same as it does on Cohere's API.
>>
>>101039989
>bpw
Are you sure it's not something wrong with exl? Have you tried lcpp?
>>
>>101040179
I heard CR+ was way worse on Cohere's API than if you did it locally. Is that true...?
>>
>>101040160
26.93 tokens per second at q6_k
>>
>>101040179
Yeah tried that, it gets somewhat better but still kinda under-performs. I don't think OR even hosts CR+ themselves.
>>101040208
It's my impression. I can run it at Q5 at home and it performs pretty well. There's something wrong with their API and it also censors words, like "nigger".
>>
>>101040228
Wild.
Have you tried CommandR?
From all that I've read so far it should be an upgrade over mixtral.
>>
>>101037275
If I ever manage to make pretraining cheap enough that I can actually afford training an image model, I'll call it Text to Pixel (TetoPix).

>>101038363
I did at some point try to add simple n-gram-based lookup decoding to the server but with the larger vocab size of LLaMA 3 it seemed to be working a lot worse and I've put the thing on hold.
>>
>>101040249 (me)
Oh, also their API *constantly* aborts gens in the middle. This happens on OR too, but never with local, which is what makes me think they come from the same source. IIRC CR+ also has a non-commercial license which wouldn't even allow OR to host it. (not a lawyer, but I guess?! Some license sperg on here please explain)
>>
>>101040259
it's 8x7b mixtral, not 8x22b, why would it run slow? CR (Q6_K) runs at 15t/s, and CR+ (IQ3_S) at 2t/s, but the former is kinda retarded, and the latter is just too slow for any kind of multiprompt/postprompt setup
>>
>>101040249
>>101040179
Crazy to see another anon with the same issue. I guess I will pay runpod to use Command-R+ or something.
>>
>>101040302
why not just use cohere's api at this point?
>>
>>101040278
>TetoPix
I knew it!

>>101040301
>but the former is kinda retarded,
More than 8x7b? Interesting. I get that 8x7b has more parameters, but it has around half of the active parameters when actually generating tokens. Have you tried the Qwen MoE?
>>
>>101040340
CR is somehow overly creative and wild, this is probably what leads to retarded things happening sometimes. Mixtral is drier but follows instructions and plot much better. I can't run Qwen MoE for some reason, latest llama.cpp just crashes with
GGML_ASSERT: ggml-metal.m:1867: dst_rows <= 2048
GGML_ASSERT: ggml-metal.m:1867: dst_rows <= 2048
zsh: abort ./llama.cpp/gg/server -m ./models/qwen2-57b-a14b-instruct-q5_0.gguf
>>
>>101040331
the CR+ provider on OR is the cohere API
>>
>>101040331
I don't want to buy credits only to end up barely using them at all. But I will see if I can get something done with the free credits.
>>
>>101040446
I had to scroll up to the OP to make sure I was still in /lmg/ after reading your post
>>
>>101040470
noob, just look at the tab title
>>
>>101040044
>>101040061
>>101040077
average novelai user (blacked miku for context, i wish jannies would do their job)
>>
File: tab titles.png (3 KB, 705x32)
>>101040481
Can't I am afraid, I currently have 41 tabs open.
>>
>>101040528
Autism.
>>
>>101040425
>Mixtral is drier but follows instructions and plot much better
Mixtral really is the king of following instructions.'

>latest llama.cpp just crashes with
Interesting.
Are you using flash attention?
I recall that when the model came out it only worked with fa on.
I ran it with fa on and q8 kv cache.
>>
>>101040425
looks like a bug (or at least lack of support) with the metal implementation, you should write up an issue for it if no one has already
>>
>>101040528
>200+ open in firefox
>cant find shit in the tabs
>click on tab list on the right
>find stuff
>>
>>101040640
holy shit that existed? damn...
>>101040576
autism is fine, cutting your dick isnt
>>
>>101040425
>>101040637
actually I looked and it appears this was addressed:
https://github.com/ggerganov/llama.cpp/issues/7652
https://github.com/ggerganov/llama.cpp/pull/7935
you should pull and try again
>>
File: 1714835911803057.jpg (723 KB, 1792x2304)
>>101040526
Clean sweep
>>
>>101040742
>>101040742
>>101040742
>>
>>101040526
>>101040687
he is based for making jannoids and (you) seethe, well deserved for having such a shit taste
>>
>>101040672
yeah i did, it crashes exactly on that added line
>>
>>101039148
no gguf so nuthin
.t gguftard



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.