/g/ - Technology

/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>102429190 & >>102417229

►News
>(09/17) Mistral releases new 22B with 128k context and function calling: https://mistral.ai/news/september-24-release/
>(09/12) DataGemma with DataCommons retrieval: https://blog.google/technology/ai/google-datagemma-ai-llm
>(09/12) LLaMA-Omni: Multimodal LLM with seamless speech interaction: https://hf.co/ICTNLP/Llama-3.1-8B-Omni
>(09/11) Fish Speech multilingual TTS with voice replication: https://hf.co/fishaudio/fish-speech-1.4
>(09/11) Pixtral: 12B with image input vision adapter: https://xcancel.com/mistralai/status/1833758285167722836

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench
Japanese: https://hf.co/datasets/lmg-anon/vntl-leaderboard
Programming: https://hf.co/spaces/mike-ravkine/can-ai-code-results

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
File: zzz.jpg (13 KB, 367x316)
►Recent Highlights from the Previous Thread: >>102429190

--StyleTTS-ZS zero-shot text-to-speech synthesis project and xtts2 improvements: >>102430346 >>102430383
--Mistral AI correctly answers Castlevania trivia, but struggles without game context: >>102430040 >>102430050 >>102430068 >>102430106 >>102430513 >>102430626 >>102430648 >>102430653 >>102430665 >>102430751 >>102430762
--Mistral-Small-Instruct generates a Python script for the booba API: >>102429844 >>102429886
--Mistral AI's Pixtral 12B model shows high accuracy in multimodal knowledge and reasoning tasks: >>102430951 >>102430997
--Q6_K_L and Q8 quantization types compared: >>102432386 >>102432464 >>102432519 >>102432528 >>102432832
--Prompt engineering automation and evaluation for smaller LLMs: >>102429803 >>102429867 >>102430180 >>102430253 >>102431460 >>102431498
--Poor quality AI output with explicit content, users hope for improvement through finetuning: >>102434343 >>102434414 >>102434484
--Mistral Small's intelligence drops at longer context sizes: >>102431384
--Mistral Small model generates in-character response for "The girl called Alice": >>102431859 >>102432030 >>102432064
--Mistral Small Q8 can solve Sally question due to training on quizzes: >>102430428 >>102430654 >>102430636
--IQ2_M Mistral Small model is usable and generates smart responses: >>102431056 >>102431145 >>102431429
--ExLlamaV2 8bpw models padded with extra precision, 6bpw precision sufficient: >>102432045 >>102432069 >>102432103 >>102432834 >>102432962 >>102432987 >>102433001 >>102433040 >>102433048 >>102433108 >>102433180 >>102433192 >>102433082 >>102433149 >>102433244 >>102433764 >>102433261 >>102433277 >>102433311
--Teto (free space): >>102429241 >>102429806 >>102430975 >>102431015 >>102432273 >>102433465 >>102433919

►Recent Highlight Posts from the Previous Thread: >>102429197
>>
>>102434739
I respect your high context autism, but goddamn anon. 30k at 30t/s is a pretty nice deal I say.
>>
>>102434752
>IQ2_M Mistral Small model is usable and generates smart responses
True.
>>
>>102434766
>30k
I don't believe there is a single model that can do 30k tokens of ERP without the quality being complete trash. There are just too many patterns for it to pick up on.
>>
Hi all, Envoid here. I made a theme song for Drummer
https://voca.ro/19Y676wbJfOM
>>
>>102434851
That's why you use meme sampling that discourages repetition. Personally I use presence penalty at 1 to encourage new tokens over old ones universally (1 is still low enough that it won't crowd out important tokens even when they repeat), and DRY at the default values to discourage repeated strings of tokens. Both need the length adjusted until you find the sweet spot, though, or output quality will eventually suffer once too many tokens are being penalized.
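For anyone curious what presence penalty actually does mechanically, here's a minimal numpy sketch (not any particular backend's exact implementation, and DRY is more involved since it penalizes continuations of recently repeated token sequences):
[code]
import numpy as np

def apply_presence_penalty(logits, generated_ids, penalty=1.0):
    # Subtract a flat penalty from every token id that has already appeared
    # in the output, no matter how often. That's "presence" as opposed to
    # "frequency" penalty, which scales with the repeat count.
    out = logits.copy()
    for tok in set(generated_ids):
        out[tok] -= penalty
    return out

# toy example: token 42 already appeared, so it gets nudged down
logits = np.zeros(100)
logits[42] = 5.0
print(apply_presence_penalty(logits, [42, 7])[42])  # 4.0
[/code]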
>>
bros... how do I speed up my vector stores... computing cosine similarities is so fucking slow that it's better on CPU...
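For reference, the usual fix is to normalize the embeddings once up front so scoring the whole store becomes a single matrix multiply instead of a per-row loop. Rough numpy sketch (array names made up):
[code]
import numpy as np

# store: (N, d) matrix of document embeddings, query: (d,) vector
def build_index(store):
    # normalize each row once so cosine similarity reduces to a plain dot product
    return store / np.linalg.norm(store, axis=1, keepdims=True)

def top_k(index, query, k=5):
    q = query / np.linalg.norm(query)
    sims = index @ q                      # one matvec scores all N documents
    best = np.argpartition(-sims, k)[:k]  # partial selection, avoids a full sort
    order = np.argsort(-sims[best])
    return best[order], sims[best[order]]
[/code]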
>>
is it just me or is mistral small 22B a complete nothingburger?
>>
>>102434973
It's a nothingburger for non-VRAMlets. But I enjoy tuning models for VRAMlet enjoyment. So it's a something burger for me.
>>
>>102434973
It's a collective vramlet delusion.
>>
I ran TabbyAPI four times and the probabilities of the top token were different each time. Is there nondeterminism in exllamav2 or is my graphics card dying?

The whole reason I was doing this was tracking down an unexpectedly large discrepancy between Mistral Small 8.0 bpw exl2 and Q8_0 gguf. On llama.cpp the top token probability was 0.6850 on the first run and then stable at 0.6843 on multiple subsequent regenerations (I presume something got cached), while on TabbyAPI it was 0.5913, then 0.5951, then 0.5762, then 0.5738. The (IMO rather large) difference ended up mattering in my case because the absurdity I was tracking down was right on the border of being excluded by a min-p filter. After making sure samplers were neutral on both, the next thing I did was check the SHA256 checksums of the files, since I couldn't believe the difference was that great. If the problem isn't that my hardware is dying, then something is very wrong with at least one of those quants. Both were from LoneStriker. For that matter, even if my hardware is dying, that could still be true.
>>
>>102434858
kino
>>
>>102435003
>the absurdity I was tracking down was right on the border of being excluded by a min p filter

As an aside, the token to exclude could actually be the start of numerous reasonable clauses, but if that token is picked then the model with overwhelming likelihood (90% confidence in the top token) predicts the next token to be something that contradicts previous information. The gravitational pull of using certain phrases seems too great.
>>
sometimes I think maybe feeding terabytes of scraped web text into a statistics engine will not possibly lead to agi
but I might just be retarded
>>
sometimes I think anons don't realize what bad or good writing is because they don't read good books and have slop brains
>>
>>102435220
female detected
males can easily detect good writing because their dicks will give it a standing ovation automatically
>>
There's unironically nothing wrong with a singular instance of shivers running down a spine. The issue is when it appears in adjacent paragraphs and even adjacent sentences. At that point it becomes slop.
>>
>>102435260
saying this when erotic literature is mainly written for and purchased by women
most men get off to coomer fan fiction
>>
>>102435271
>saying this when erotic literature is mainly written for and purchased by women
hence why most of it is so terribly written and full of slop
>>
>>102435279
do you think models are being trained off of erotic novels? because the slop you see is coming from coomer rp and fanfiction written by horny men.
>>
>>102435220
A lot of people think Asimov was good.
>>
>>102435292
NTA but you've clearly never read an actual book.
Spine shivers come from literally all human writing. Because feeling some sensation in one's spine is a literal actual natural reaction to visceral stimulation. And the model just generalizes it all into shivers down the spine, even though in actual writing it's somewhat varied. But the exact average of it is shivers in the downward direction along the spine.
This is like... 6+ month old discussion around here.
>>
>>102435314
>NTA but you've clearly never read an actual book.
stopped reading there. I'm more well read than 99% of the thread. You probably read Ender's Game and think it was a classic.
>>
>>102435267
Same with eye slop (widening, narrowing, rolling). Once every 2 pages is okay; every fucking paragraph until DRY eliminates them all is slop. I even had a character with no eyes roll imaginary eyes.
>>
>>102435314
I've never had shivers run down my spine IRL therefore SLOP.
>>
Did some tests. Setup: 2x3090 ti.
Too lazy to make graphs, so read the numbers (c/t is cost per token, measured as watts divided by tokens/second, i.e. joules per token; lower is better):
400 w: 17 tok/s = 23.5 c/t
300 w: 17 tok/s = 17.65 c/t
250 w: 14.44 tok/s = 17.31 c/t
200 w: 10.80 tok/s = 18.52 c/t
Conclusion: 300 w optimal (in this setup).
Caveat: used nvidia-smi -pl and trusted software cap. No hw measurements.
>>
>>102435340
>stop reading there. I'm more well read than 99% of the thread.
I read that. I read it in what I imagine this guy's voice sounds like.
>>
>>102435003
It is not entirely deterministic: atomic additions accumulate partial sums in whatever order the threads happen to run, and floating-point addition isn't associative (FP16 precision doesn't help either), so (a+b)+c != a+(b+c)
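Quick demo of the order-dependence with plain Python floats (float64 here; it's even easier to hit in FP16):
[code]
a, b, c = 1e20, -1e20, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0 -- same three numbers, different summation order
[/code]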
>>
File: Untitled.png (865 KB, 1080x2033)
SOAP: Improving and Stabilizing Shampoo using Adam
https://arxiv.org/abs/2409.11321
>There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks include additional hyperparameters and computational overhead when compared to Adam, which only updates running averages of first- and second-moment quantities. This work establishes a formal connection between Shampoo (implemented with the 1/2 power) and Adafactor -- a memory-efficient approximation of Adam -- showing that Shampoo is equivalent to running Adafactor in the eigenbasis of Shampoo's preconditioner. This insight leads to the design of a simpler and computationally efficient algorithm: ShampoO with Adam in the Preconditioner's eigenbasis (SOAP). With regards to improving Shampoo's computational efficiency, the most straightforward approach would be to simply compute Shampoo's eigendecomposition less frequently. Unfortunately, as our empirical results show, this leads to performance degradation that worsens with this frequency. SOAP mitigates this degradation by continually updating the running average of the second moment, just as Adam does, but in the current (slowly changing) coordinate basis. Furthermore, since SOAP is equivalent to running Adam in a rotated space, it introduces only one additional hyperparameter (the preconditioning frequency) compared to Adam. We empirically evaluate SOAP on language model pre-training with 360m and 660m sized models. In the large batch regime, SOAP reduces the number of iterations by over 40% and wall clock time by over 35% compared to AdamW, with approximately 20% improvements in both metrics compared to Shampoo.
https://github.com/nikhilvyas/SOAP
neat
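If it helps intuition, here's a toy numpy sketch of the idea as I read the abstract (single 2D weight matrix, no bias correction; this is NOT the paper's implementation, which is in the repo above): keep Shampoo's two Kronecker factors, refresh their eigenbases only every precond_freq steps, and run plain Adam on the gradient rotated into that basis.
[code]
import numpy as np

def soap_step(W, grad, state, lr=3e-4, beta1=0.95, beta2=0.95,
              eps=1e-8, precond_freq=10):
    m, n = W.shape
    if not state:  # lazy init of optimizer state
        state.update(L=np.zeros((m, m)), R=np.zeros((n, n)),
                     QL=np.eye(m), QR=np.eye(n),
                     mom=np.zeros((m, n)), var=np.zeros((m, n)), t=0)
    state["t"] += 1
    # Shampoo-style left/right preconditioner factors
    state["L"] = beta2 * state["L"] + (1 - beta2) * grad @ grad.T
    state["R"] = beta2 * state["R"] + (1 - beta2) * grad.T @ grad
    # refresh the eigenbasis only occasionally -- the one extra hyperparameter
    if state["t"] % precond_freq == 1:
        _, state["QL"] = np.linalg.eigh(state["L"])
        _, state["QR"] = np.linalg.eigh(state["R"])
    # rotate the gradient into that (slowly changing) basis and do Adam there
    g = state["QL"].T @ grad @ state["QR"]
    state["mom"] = beta1 * state["mom"] + (1 - beta1) * g
    state["var"] = beta2 * state["var"] + (1 - beta2) * g**2
    step = state["mom"] / (np.sqrt(state["var"]) + eps)
    # rotate the update back to the original coordinates and apply it
    return W - lr * state["QL"] @ step @ state["QR"].T
[/code]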
>>
>>102434973
It's okay and the best for what it is (20B~30B). The vocabulary also isn't terribly slopped like CR's earlier release was. That shit was horrible.
It has okay spatial reasoning, but if you can run something larger, go with that.
>>
>>102435374 (me)
>Conclusion: 300 w optimal (in this setup).
Because speed/cost is the efficiency (bang/buck) and
17/23.5 = 0.723
17/17.65 = 0.963
14.44/17.31 = 0.834
10.8/18.52 = 0.583
(in case it wasn't obv)
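Same arithmetic as a script, in case anyone wants to redo it with their own measurements (note the bang/buck ratio works out to speed²/watts, so it rewards raw speed on top of energy efficiency; by joules per token alone, 250 w is marginally the lowest):
[code]
# power-limit sweep from the post: watts -> measured tokens/s
runs = {400: 17.0, 300: 17.0, 250: 14.44, 200: 10.80}

for watts, tps in runs.items():
    jpt = watts / tps   # "c/t": energy per token (joules/token)
    ratio = tps / jpt   # the bang/buck figure above (= tps**2 / watts)
    print(f"{watts} w: {tps:5.2f} tok/s  {jpt:5.2f} J/tok  ratio {ratio:.3f}")
[/code]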
>>
>>102435374
Cool dudes are now capping frequencies, you're not up to speed, dork
>>
>>102435311
Asimov was good, but not for his writing style.
>>
>>102435425
Ah, Pratyush Patel and Chaojie Zhang. This must be good.
>>
>>102435271
Have you actually read women's crap? I got one book, I thought, oh, well, I'm sure the smut parts will be sort of short or whatever. It had 12 pages of a torture scene. A woman's idea of sexy is strange.
>>
>>102434973
It's smarter than Nemo but only like 25% smarter. Not a nothingburger but I don't think it's as much of a capabilities leap for its size class as Nemo was.
>>
>>102435501
>only like 25% smarter.
Which methodology are you using to come up with that number?
>>
I'm using a 3070 and koboldcpp. Is there anything I need to do to make the AI faster beyond using CUBLAS?
>>
>>102435510
Bellyfeel. Pretend I said "only somewhat" if putting a number on it is offensive to you.
>>
>>102435523
I'm not offended. I was just walking you towards admitting that your statement was a bunch of made up bullshit.
>>
File: fedora.png (993 KB, 1180x630)
>>102435540
you dropped this
>>
>>102435540
>The only valid opinions are those that can be measured and quantified
redditbrained take
>>
>>102435552
>>102435562
>if I cut my dick off I'm a woman
>>
>>102435425
How do you do frequency capping on nvidia? I only see -pl flag.
>>
>>102435576
nvidia-smi --lock-gpu-clocks 0,1600 --mode 1
>>
>>102434973
It seems good for what it is, meaning 22b models. Obviously it's not going to beat recent models that are bigger
>>102435414
>like CR's earlier release
Do you mean CR pre-plus? Used to use that a lot until recently; that shit can't follow rules if its life depended on it. Even basic shit like using "speech" or *actions* is beyond it without you pointing it out or editing its messages.
>>
>>102435383
go read green eggs and ham kid, scram
pathetic
>>
>Use Qwen 2.5 72B
>It's utter fucking garbage
Where'd Qwen go so wrong, bros?
>>
>>102435574
>"gender and sex are completely different thing"
>"to be a real woman I need to cut my dick off!"
How can we take them seriously lol
>>
>>102435657
Multilanguage
>>
>>102435657
Qwen was never good desu, every time I tried one of them it output random chinese tokens
>>
>>102435662
not true, the Mistral models are really focused on multilingual support and they are good models overall; I'd even argue that the SOTA models (gpt4, Claude 3.5) are multilingual as well
>>
here's a tip, unslopnemo is the best 12b tune
>>
>>102435501
25% would be worth it if it works past 16k context too.
>>
Reasons to not use ollama
>>
btw slop is only a problem for ESLs and retards who don't know how to prompt
>>
>>102435713
It is extremely annoying how it takes a long ass time to allocate space before downloading a model.
It offloads the model from memory if you don't use it for a few minutes.
The quants are hidden on the web page.
You can't use models directly from hf, so you need to download them even if you already have them on your drive.
>>
File: 1722400106196472.gif (677 KB, 1280x720)
If you care about good prose in smut, you're gay as fuck.
>>
File: TWQDb-dhtguja7InokhzTg.gif (1.46 MB, 220x220)
>grr I don't know how to read
>I just need to coom!
anon just admitted to only having LLMs as a means for sexual gratification and thinks everyone else is gay
>>
Anyone ever used a custom ChatML context/instruct set of instructions that actually improves roleplay? I dunno if it's snake oil or actually good.
>>
>>102435753
When the prose is the only part of the smut, yeah, I do.
>>
>>102435783
No, just you.
>>
>>102435374
>>102435415
>simply limiting the max wattage instead of optimizing the voltage by undervolting (maximizing the clocks) wasted performance
>>
>>102435657
>>Use Qwen 2.5 72B
Early access? I've heard they went the Meta route and filtered the training data for bad words.
>>
qwen was never good
>>
https://nvlm-project.github.io/
vlm from nvidia (not up yet)
>>
>>102435912
>I've heard they went the Meta route and filtered by bad words.
I mean, what else do you expect from chinks? They are as retarded as the west when it comes to censorship
>>
>>102435619
Thanks.
>>102435825
How much of a performance hit are we talking though? A single nvidia-smi call is trivial effort.
>>
>>102435931
The west censors shit to make themselves (white privileged college students) feel better
China censors shit en masse to protect their government's power and to restrict the masses' free speech... wait a second, that's also why the west does it!
>>
What will locusts do when OpenAI enforces CoT for all prompts, making it almost unjailbreakable?
>>
>>102435976
annoying libertarian techbro hands wrote this post
>>
>>102435990
I'm still surprised nobody has found a jailbreak to make gpt4 repeat the CoT prompt so we can see it
>>
File: InstructModeSequences.png (358 KB, 1299x725)
I'm trying to set up the optimal settings for mistral nemo seen here:
https://rentry.org/freellamas
I'm a little confused on the
>Sequences for this model
section. This menu has completely changed nowadays.
Is input and output now: System Prompt Prefix and Suffix?
Which one is now Last Output Sequence? And which is Separator?
I tried to RTFM but didn't really find anything explaining the old settings.
I posted this on /aicg/ initially and they sent me here kek
>>
File: 1726612890211848.png (35 KB, 532x949)
Found this in the last thread. Can someone else confirm whether the gap between IQ4_XS and Q3_K_L really is that significant?
>>
>>102435753
Fuck you smugposter, the plot makes the porn better.
>>
>>102436007
I just changed it per https://hf.co/bartowski/Mistral-Small-Instruct-2409-GGUF/discussions/1 and the writing of Mistral Small instantly changed for the better. That models still work, just with degraded quality, when you use the wrong template is kind of maddening.
>>
>>102435619
It's hovering around 216 watts for each card (capped at 400w). 14-15 tokens/s. 300w seems a tiny bit more efficient, but not bad. (Efficient here means "how much would I have to pay in electricity for it to write me some long blurb".)
>>
>>102435991
nta but literally kill yourself. no, really. stop what you're doing and go swing from a rope
>>
>>102434739
Let us know your findings, I'm currently trying out mistral large IQ2M (~2.7 bpw) because it's much faster than IQ4XS (1.3T/s vs 1T/s)
IQ3M seems to be unoptimized because it's slower than IQ4XS (0.85T/s)
>>
File: 555.jpg (36 KB, 500x499)
>>102436319
>muh goberment censorship to something something the masses!
idk why you said "nta" when you're clearly the same annoying retard
>>
>>102435412
LLM engineers have to re-invent shampoo and soap because they've never used them IRL
>>
How can I make Q6_K_L quants? Llama.cpp quantize doesn't support it and google gives no results.
>>
File: 1708443456817321.png (44 KB, 788x292)
>>102436439
I think it's done like this.
>>
>>102436467
Oh, so just --output-tensor-type Q8_0 and --token-embedding-type Q8_0?
>>
Will Mistral really make any more MoE models? The primary reason they did it was because it was quick and cheap, as they could initialize from a smaller model they already trained, but now they have the compute and can train models like 123B fine.
>>
>>102436544
We MIGHT get a revamp of 8x22 or 8x7 but they're not going to train a new one from scratch. MoEs were a MeMe.
>>
>>102435752
>You can't use models directly from hf
apparently you can import a local gguf, but you have to write a one-line Modelfile for it.
>>
>>102436493
I believe so, yes. When I saw discussion on the matter it mentioned that output tensors and embeddings at Q8_0 were superior, so I think that's all it is. Whether it's truly better or not I can't say.
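For reference, it should just be something like this (flag names are from memory, check llama-quantize --help on your build):
./llama-quantize --output-tensor-type Q8_0 --token-embedding-type Q8_0 model-f16.gguf model-Q6_K_L.gguf Q6_K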
>>
>>102435990
They use Opus
>>
File: firefox_xDDUSc6GyQ.png (225 KB, 1154x327)
>>102435003
exl2 is hugely nondeterministic. Has always been. I see it as the price you pay for speed.
>>
>>102435479
I've read one, within about 50 pages i'd actually run into 8 spine shivers, and that was not even the worst of it. It also had "that" sentence structure almost exclusively, the same one the LLMs spout continuously. Just one of these books has such a high slop density that it can poz an entire dataset, i wager. If it hadn't been published in 2021, i would have sworn that it was written by an LLM beginning to end with no editing. But if i'm being honest, the LLM would probably do better.
>>
training with my gpus at 200 watts is pretty comfy
>>
File: firefox_aTVZd2RqqJ.png (281 KB, 750x1100)
Mistral Small almost seems to be capable of playing my dots game.
>>
File: firefox_Bw5YZM04gZ.png (333 KB, 761x585)
Bros? Are we back?
>>
File: firefox_nA561ujlnT.png (349 KB, 767x625)
>>102436761
>>
File: firefox_DCuiPLgDDH.png (70 KB, 757x161)
>>
>>102436790
>Newsflash
it was over before it began
>>
>>102436836
Simply ban that word, and live happily ever after.
>>
> Anything interesting being discussed in this thread? Use bullet points and add >>NNNNNNNN links.

---

Determinism in LLM Generation (>>102435003)
There seems to be nondeterminism in LLMs, with varying token probabilities during regeneration.
Some users suggest this is due to atomic addition and FP16 precision (>>102435383).

Efficiency and Power Consumption (>>102435374)
User experiments with different power settings for running LLMs.
User concludes that around 300 watts is the optimal setting for their setup (>>102435414, >>102435931).

Comparison of Mistral Models (>>102434973, >>102435510, >>102435520)
Discussion on the performance and capabilities of Mistral Small and other models.
Some users praise Mistral Small, while others find it lacking compared to larger models (>>102434943, >>102435825).

Sexual Preferences and Writing Quality (>>102435201, >>102435220, >>102435660)
Debate over the importance of prose quality in erotic literature and the gender dynamics related to writing and consuming such literature.

Jailbreaking AI Restrictions (>>102435990, >>102436005)
Speculation on the impact of potential OpenAI changes to enforce Chain of Thought (CoT) for prompts and its influence on ""jailbreaking"" AI restrictions.

Vector Store Optimization (>>102434943)
A user seeks advice on speeding up computational tasks related to vector stores and cosine similarity computations.

Quantizing Model Weights (>>102436432, >>102436467)
Users discuss the methods for creating Q6_K_L quantized model weights.

Model Training and Dataset Bias (>>102436701)
Conversation around the quality and bias found in training datasets for LLMs, with particular focus on erotic content.
>>
>>102437006
>one of the quotes leads to something completely different
>>
>>102437029
Just showing you how it does. Mistral Small isn't perfect.
>>
>st
>[DEPRECATION NOTICE] Model scopes for Vectore Storage are disabled, but will soon be required.
what is this
>>
>>102435668
>random chinese tokens
Based chinks making burgers feel the same way we ESLs do with their shitty models whenever the prompt and chat aren't in English. Even Claude does this shit and turns French into English tokens.
>>
File: 1707310016651155.jpg (172 KB, 947x583)
>>102437105
well this sounds fun for my 15 rag databases
>>
I guess 2 bit large is much better than 8 bit small?
>>
>>102437307
Is 2 bit smol better than nemo?
>>
>>102437251
Use case?
>>
File: 1726006554401578.png (128 KB, 770x549)
>>102435520
pls resbond
>>
>>102437429
rp
>>
File: 1710043687041916.jpg (43 KB, 720x960)
>>102437431
Buy a new gpu
>>
>>102437431
Exllamav2 is faster than koboldcpp; however, it has some non-deterministic behavior and requires your model + context to fit entirely in VRAM. Aside from that, nothing. Maybe practice chess or watch some shows while genning (coming from a 0.5t/s guy).
>>
What's the smallest model that's effective for writing simple scripts or config files?
>>
File: 900640_00001.webm (822 KB, 720x1280)
>>102434744
it's still tuesday somewhere
>>
>>102437553
Nemo
>>
I have 16gb of vram. Glad mistral released the small model. Perfect for people like me.
In b4 people shit on it again for not being as good as models that need 2x24gb+ vram.

With Nemo the CoT prompt did not really improve the writing upon further testing.
That might be different with mistral 22b. I actually had an "oh, also I should do X" moment (X being something in the context).
But need to test more.
>>
>>102437666
Really? When I looked at programming benchmarks it scored worse than llama 3.1 8b even.
>>
>>102437692
(In my experience) it performs better with follow-up prompts when you need to correct or improve a solution
>>
Reporting. Genning a lot of replies in parallel does not seem to reduce the quality.

I made a script to do sentiment classification on a hotel reviews dataset (Russian language), and genning one by one achieves the same result as genning tens in parallel.

Incidentally, Nemo achieves an accuracy of 90%, Mistral-Small 92%, and 2.75bpw Mistral-Large 89%.

Tabbyapi.
>>
File: 1725697593000880.png (92 KB, 717x352)
>>102435797
Anyone?
>>
>>102437903
ChatML sucks.assistant
>>
File: nala small lora.png (217 KB, 922x610)
guys I think I might have just achieved AGI.
>>
>>102437889
2.75bpwos...
>>
Just tried out Mistral Large 2 and I legitimately like it more than Opus.

So far I only used proprietary models for ERP. Why did no one tell me straight up that open weight models already caught up in ERP?
>>
>>102438047
Because most people using the api models can't run large at a worthwhile quant with enough speed.
>>
It seems like even mistral can't agree on what its prompt template should look like.
Opened up tokenizer_config.json of large and small side by side. Small has spaces before [INST] and after, also a space before [/INST].
Large only has a space after [INST].
>>
>>102437975
that's mistral small?
>>
>>102438083
Small also seems to start every reply with a space, which is somewhat infuriating because I often want a single token response for classification, and I always get the space.
>>
>>102438084
Well I did a LoRA on it, but it ended up overcooked (loss drops fast after the first epoch it seems) so I SLERP merged it back into the original model. And it's honestly pretty decent. Although it's one of those models where if you mention "consent" or "NSFW" anywhere in the prompt it will just go full on porn mode.
>>
>>102438081
It doesn't help that the samplers for mistral-large on proxies are completely inadequate. This leads to the model being repetitive and drier than Popeyes biscuits. It also makes me wonder if some of the other corpo models could be saved if they had a more robust sampler setup.
Watch this become openai's next "breakthrough".
>>
Mistral Small is actually great
>>
>>102438126
what samplers do you use?
>>
>>102438143
Aside from temp. Rep pen at 1.1, min-p and smoothing, very rarely dry at 0.45.
>>
>>102438107
Cap at 2 tokens and trim...
>>
>>102438153
I don't think rep pen and dry work very well together.
>>
>>102438177
I look at probabilities, and for everything else I just had to look at probabilities of the first token.
>>
n-nooo. the positivity bias in mistral small is so bad.
can finetunes even get rid of that? gemma2/llama3 is dead because of that.
nemo was good because of the convincing characters. even with all the retardation sometimes.
>>
>>102438262
>can finetunes even get rid of that?
They got rid of it from Largestral so I don't see why not.
>>
>>102435619
Compared to limiting power, this makes idle power consumption get worse for some reason, though.
>>
>>102436467
I wish we could use different quantization levels (e.g. none) for the attention layers, or exclude specific layer numbers (e.g. the first and last one); that's possibly where most of the damage actually comes from (according to Meta).
>>
Why is Euryale 2.2 soulless compared to 2.1?
>>
>>102438047
What prompt are you using anon?
>>
>>102438047
It’s nowhere near Opus level. I really wish it was, but it’s not.
>Skill issue
Yeah, but not on my end.
>>
So does Mistral Small have the same prompt format as Mistral Large, or is there a space before the [/INST]?
>>
>>102436022
It depends on your model and particularly its size; you can even go as low as 2 for a 70B. It'll be retarded for a 70B but still better than a 12B.
You cannot just ask "what's the best quant" or "is there always this drop" and get an answer that applies to everything.
>>
Any LLM-based games anon can recommend? If I can hook a local model up to it, even better.
Only played that yandere thing a year back.
>>
>>102438329
Eh? Which finetunes are you talking about? Been using base mistral large instruct 2.75bpw and it has the same positivity bias as almost every other model under the sun.
PLEASE man I'm begging
>>
>>102438927
>>102438329
Would like to know too. I tried Magnum and it lost way too much intelligence.
>>
>>102438768
There are spaces before both [INST] and [/INST], and the system message is placed before the last message from the user rather than before the first one.
>>
>2 years of shitposting and nobody has made a single private quantifiable benchmark to test a new model's vibes
I'm making one tonight. It's pretty close to ayumi's naughty words bench, among other things
>>
>>102439070
>It's pretty close to ayumi's naughty words benc
straight to the trash
>>
>>102438943
try one of the 123b merges then
>>
Why is it that the larger the models the more assistant slop they become?
Is it because for the size you need synthetic gpt output training data?
>>
new model when?
>>
>>102439134
yeah pretty much
>>
>>102439117
I bet you don't realize how dumb merges are since you're using already dumb low bpw quants.
>>
>>102439134
Training larger models is expensive, so they tend to overcompensate to mitigate risk. Large models are also primarily targeted towards corporate clients.
>>
File: jrf8e4xla9241.png (1.24 MB, 1920x1080)
Hi all, Drummer here...

https://huggingface.co/BeaverAI/Cydonia-22B-v1a-GGUF/tree/main
Formats: Metharme, Mistral, Text Completion

https://fridge-checked-interpretation-hash.trycloudflare.com/
>>
>>102439378
>Metharme
That's a word I didn't think I'd see any time soon. Ancient pyg prompting format.
>>
>>102439484
I see a template that doesn't require added tokens & assistant tag, I take it.
>>
>>102439509
Should've taken alpaca then.
>>
>>102437975
>shreads your boxers
>wet pussy grinding against your bulging cock, separated only by your underwear
>now, take off your fucking clothes, or I'll spread them off you

Yes, truly AGI, can't even remember the status of clothes in the same reply, definitely not a <70b model
>>
>>102439811
No, concedo. Alpaca's taken and it takes more than 1 token to form "Instruction" and "Response". And there's no system tag for one of the many versions of Alpaca.
>>
>>102439964
<|system|> takes 5 tokens.
>Alpaca's taken
lol what
>>
>>102439964
He wrote this reply using his garbage ESL model.
>>
As a beginner I have only tried gemma.cpp with their model because it's very fast and seems light on resources.

I want to embed one into my website via simple stdout from the command line, but if you tell gemma something it's useless:
>You are an Alien from Planet lol
Hi, I'm gemma, developed by google... I can do...
>Who are you?
Hi, alien from Planet lol, I'm gemma, developed by google...

So, my questions are:
(1) What better inference engine (is that what they're called?) is there? Preferably cpp for something lightweight.
(2) What is a good model, since you can't download gpt4 normally (there are probably some leaks online?)

Regards.
>>
>>102439378
What's with these random ass names? Why not use base model name + dataset name? The magnum guys almost got it right but they decided to remove the base model names. You're all fucking dumbasses.
>>
>>102440041
There's many models in the gemma family, so i have no idea what you're running on or what you consider fast.
>(1)What better inference engine(that's their name?) there is? Preferably cpp for lightweight.
llama.cpp or kobold.cpp
>(2)What is a good model since you can't download gpt4 normally(there are probably some leaks online?)
Mistral Nemo 12B or Llama 3.1 8B. Move up if you have spare resources, quantize more heavily if you don't. Plenty of .gguf models on huggingface (.gguf being the format llama.cpp and kobold.cpp use). https://huggingface.co/bartowski has a bunch for you to try. Roughly, any gguf file smaller than your vram/ram capacity will do: for 16gb vram, download any under 16gb. Leave some space for context.
Read llama.cpp's docs and the README.md.
>>
File: Untitled.png (165 KB, 1239x730)
>>102440041
a bit outside my wheelhouse, but koboldcpp has this thing that looks pertinent
https://github.com/LostRuins/koboldcpp/releases
>>
>>102440258
If you want to embed it into your site you'll be better served by a json API (which is how most chat uis work), not cli. Both llama.cpp and kobold.cpp have it. Probably every other engine as well, but i've only ever used llama.cpp. Both have an embedded web ui (llama-server) for you to play around and learn how to use them.
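For the website use case, here's a minimal sketch of querying llama-server's completion endpoint from Python (assumes it's running locally on the default port; parameter names follow llama.cpp's server README, double-check against your version):
[code]
import json
import urllib.request

# assumes something like: ./llama-server -m model.gguf --port 8080 is already running
req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps({
        "prompt": "You are an alien from Planet lol.\nUser: Who are you?\nAlien:",
        "n_predict": 64,
        "temperature": 0.7,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # the server returns a JSON object whose "content" field is the generated text
    print(json.loads(resp.read())["content"])
[/code]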
>>
>>102440183
>visit the gemma.cpp models on the Hugging Face Hub
Used this (https://huggingface.co/google/recurrentgemma-2b-it-sfp-cpp) because was the last updated.

(1) What does quantize mean?
My setup is an rx570 4gb and 16gb ram.
(2) Are there any free services/Apis that can be used to receive results as follows:
>Input: You are a nigger
>Input: What Color are you?
>Black
>>
>>102440629
>(1) What does quantize mean?
Basically, compressing the model. The more you quantize it, the smaller the model gets, (so you can fit bigger models in your gpu), but it comes at a loss in accuracy with respect to the original model. You can run 2B models in Q8_0 without any issues. Bigger models will need more aggressive quantization. You'll have to play around with what's the best for you (big model+aggressive quant or small model+highbit quant).
>(2) Are there any free services/Apis that can be used to receive results as follows:
Probably, but i don't use any. You can run llama.cpp (their llama-server) example and make queries to it.
Read https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
You can use this model for a close equivalent: https://huggingface.co/bartowski/gemma-2-2b-it-GGUF/blob/main/gemma-2-2b-it-Q8_0.gguf
There's plenty of models to try.
>>
Is there any hope for people like Yann?
>>
>>102440797
In what aspect? He'll keep receiving grants for doing useless work thanks to his existing connections.
>>
Abandon all hope, ye who enter here.
>>
For me crazy thursday is just a regular thursday
>>
For me, today is like any other day.
>>
>>102439929
Did it work? Are you a woman now?
>>
>>102440867
Qwen 2.5 may be quite possibly the biggest nothing burger I've ever seen.
>>
>>102440914
But you haven't seen it yet.
>>
>>102439378
transformers fp16 weights, when?
>>
It seems the Mistral shills are afraid of Qwen...
>>
>>102440807
Look at his "cat-like" AI thing. Is there any hope for people who think like that?
>>
Abandon all hope, ye who enter here.
>>
If I only care about context length, so the model can maintain a story line and stay coherent about past events, is mistral large or any other 70b or 405b model worth it, or is it a meme?
For reference, nemo is able to do 64k context for a while, it remembers characters and whatnot, but the closer you get to that number of tokens the harder it gets for it to stay coherent.
Or let's say I told the model to write a 100k-word story.
Is it worth it to invest in a multi-gpu rig for bigger models just for that, or will I face the same problem?
>>
>>102440962
>he doesn't know
>>
File: 1718682325840.png (11 KB, 605x152)
how do you answer without sounding mad


